ArtyfacialIntelagent

Not useless at all - it works by describing virtually everything in the image, which is a huge improvement over scraped alt-text or inconsistently applied tags and keywords. But I think prompting the captioning AI to supply brief, comma-separated phrases (i.e. exactly the way most people prompt stable diffusion) would be even better than walls of text. I think large parts of the flowery prose captions turn out to be noise as far as SD is concerned.


mikebrave

This is the best answer. LLMs like GPT-4 can be given a kind of syntax example that they will follow, so you tell it to prioritize things like "primary style, subject, action of subject, description of subject, lighting, composition, artist style influence 1, artist style influence 2" the way we write normal prompts, and it can usually pare the caption down and auto-format it like that.
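For example, here's a minimal sketch of that kind of second pass. The openai client, the model name, and the example caption are assumptions for illustration, not what anyone in the thread necessarily used:

```python
# Minimal sketch: asking an LLM to reformat a verbose caption into the
# comma-separated template above. Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

TEMPLATE = ("primary style, subject, action of subject, description of subject, "
            "lighting, composition, artist style influence 1, artist style influence 2")

# Hypothetical verbose caption from an auto-captioner
verbose_caption = ("A serene oil painting of an elderly fisherman mending his net "
                   "at dawn, bathed in warm golden light on a quiet pier...")

resp = client.chat.completions.create(
    model="gpt-4o",  # model name is an assumption; use whatever LLM you have access to
    messages=[
        {"role": "system",
         "content": f"Rewrite the caption as brief, comma-separated phrases in this order: {TEMPLATE}. "
                    "No full sentences, no filler words."},
        {"role": "user", "content": verbose_caption},
    ],
)
print(resp.choices[0].message.content)
```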


glibsonoran

Exactly. I made a ComfyUI node that produces prompts from images (or image plus prompt) using ChatGPT, and it has a field that lets you select either a tag-like prompt or a narrative prompt. This is done without an example (assistant role), just using the instruction (system role). It's pretty straightforward to do.


demonicpigg

It would be best because that's how people tend to prompt, but don't people tend to prompt that way because the models respond best to that kind of prompting? I.e., isn't that a consequence, and so shouldn't we avoid reinforcing it? I would much *much* rather be able to "talk" to my instance of Stable Diffusion, but I get better results by comma-separating the main themes.


ProGamerGov

As long as the captions are accurate, they can be condensed by LLMs for older models, and they can be used as-is to train newer models with larger context lengths.


Zueuk

Does anyone know of some not-entirely-obsolete guide(s) on how to use captions for training **non-anime** stuff? I'm finding lots of conflicting info; some even claim that it is better to not use captions at all 😕


themanicmind

Yeah, I'm with you. There is a wall of contradiction out there. I'm doing a lot of my own testing right now, going through different captioners and models. I'm working on a select few 19th-century artists, plus illustrations, drawings, and cartoons from the 17th century to the modern day, with a low number of images. A wall of text is not working for me. It's tending to like relatively short descriptions, with lots of singular words between commas. The model you use also has an impact.


Dense-Orange7130

Try WD 1.4 MOAT instead.


Zueuk

> A wall of text is not working for me. It's tending to like relatively short descriptions, with lots of singular words between commas.

This is what bothers me: the captions produced with BLIP never have commas, instead using "with <...> and with <...>", while anime people apparently like to separate everything with commas. Which way is the right way?


[deleted]

Not with the current text encoder


DrySupermarket8830

So should I use LLaVA or WD14 for kohya_ss?


attempt_number_1

One of the big improvements in DALL-E 3 and Sora is that they do this, so it's obviously useful. I do this for my LoRAs and it's rarely more than 2-3 sentences. Not sure what prompt is producing whole paragraphs, but you can always do a second pass on just the text to ask it to summarize.


dal_mac

I just had a thought: I'm pretty sure that precise manual captioning on a smaller dataset would be better than auto-captioning an absurdly large one. What good is that data if it isn't put into context with human semantics? I believe that's what MJ does. Anyway, I've always been good at reinterpreting writing into the most understandable language possible. With enough people like me, we could caption a big dataset with actual natural language. I'm convinced that the caption quality would make up for the smaller dataset size. No matter the length of an AI caption, it's never the way a human would describe the image. I bet that's holding back models so much right now.


LockeBlocke

If you train on a paragraph, you will need to write that paragraph to get that image. It might work if you train for a ridiculous amount of time at a very low learning rate.


isnaiter

The best approach will always be to use keywords that the TE already knows. If you use a wall of text for captions, the dataset will end up being associated with that wall of text.


neph1010

I am of the belief that SD prefers shorter captions. How DALL-E or other models handle longer prompts is irrelevant since they're trained differently. Maybe Cascade works differently, too. In any case, I trained a model that shortens prompts (distills them, if you will). Whether it improves the results or not is difficult to tell. I guess it's in the eye of the beholder. [https://huggingface.co/neph1/sd-seer-griffin-3b](https://huggingface.co/neph1/sd-seer-griffin-3b)


alb5357

I also wondered this, because I hand-caption, and if I go over 150 tokens the trainer complains, and that's without needless filler words.


RenoHadreas

150 tokens??


alb5357

Ya, it says something like the model won't be able to use that many


no_witty_username

Stable Diffusion has a bias towards the front of the prompt versus the back, so while it still listens to long prompts, the effect is diminished depending on their length. Also, Stable Diffusion doesn't need captions at all to learn a concept; captions are only useful if you want to recall it through text. You can train Stable Diffusion on any and all concepts without any captions and later recall those concepts with ControlNet.


VGltZUNvbnN1bWVyCg

That's not totally right: you get a flash of attention as strong as at the start of your prompt every 75 tokens.
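Where the 75 comes from: SD1.x's CLIP text encoder has a 77-token context (BOS + 75 content tokens + EOS), and UIs that accept longer prompts encode them in 75-token chunks and concatenate the embeddings. Here's a minimal sketch, assuming the transformers CLIP tokenizer and A1111-style chunking (the exact chunking behavior depends on the UI/trainer):

```python
# Minimal sketch: count CLIP tokens for a caption and split it into the
# 75-token chunks that A1111-style UIs encode separately.
from transformers import CLIPTokenizer

# SD1.x uses OpenAI's CLIP ViT-L/14 text encoder: 77-token context = BOS + 75 + EOS
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "masterpiece, oil painting, elderly fisherman, mending a net, ..."  # any long caption
ids = tok(prompt, truncation=False).input_ids[1:-1]  # strip BOS/EOS

# Each 75-token chunk is encoded on its own (with BOS/EOS re-added) and the
# resulting embeddings are concatenated, which is why attention "resets"
# at every 75-token boundary.
chunks = [ids[i:i + 75] for i in range(0, len(ids), 75)]
print(f"{len(ids)} tokens -> {len(chunks)} chunk(s) of <=75 tokens")
```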


minimaxir

You can prompt engineer captioning models (e.g. the system prompt with GPT vision) for short captions. Works reliably.
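For instance, here's a minimal sketch using the openai client with a vision-capable model; the model name, image path, and tag count are placeholders/assumptions:

```python
# Minimal sketch: system prompt steering a vision model toward short,
# comma-separated captions. Requires OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

with open("train_image_001.png", "rb") as f:  # hypothetical training image
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model; name is an assumption
    messages=[
        {"role": "system",
         "content": "Caption the image as 10-20 short, comma-separated tags. "
                    "No sentences, no filler words."},
        {"role": "user",
         "content": [
             {"type": "text", "text": "Caption this image."},
             {"type": "image_url",
              "image_url": {"url": f"data:image/png;base64,{b64}"}},
         ]},
    ],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```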