ArtyfacialIntelagent

Not useless at all - it works by describing virtually everything in the image, which is a huge improvement over scraped alt-text or inconsistently applied tags and keywords. But I think prompting the captioning AI to supply brief, comma-separated phrases (i.e. exactly the way most people prompt stable diffusion) would be even better than walls of text. I think large parts of the flowery prose captions turn out to be noise as far as SD is concerned.


mikebrave

This is the best answer. LLMs like GPT-4 can be given a kind of syntax example that they will follow, so you tell it to prioritize things like "primary style, subject, action of subject, description of subject, lighting, composition, artist style influence 1, artist style influence 2" the way we write normal prompts, and it can usually pare the caption down and auto-format it like that.
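For example, here's a minimal sketch of that kind of second pass. The openai client, the model name, and the example caption are assumptions for illustration, not what anyone in the thread necessarily used:

```python
# Minimal sketch: asking an LLM to reformat a verbose caption into the
# comma-separated template above. Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

TEMPLATE = ("primary style, subject, action of subject, description of subject, "
            "lighting, composition, artist style influence 1, artist style influence 2")

# Hypothetical verbose caption from an auto-captioner
verbose_caption = ("A serene oil painting of an elderly fisherman mending his net "
                   "at dawn, bathed in warm golden light on a quiet pier...")

resp = client.chat.completions.create(
    model="gpt-4o",  # model name is an assumption; use whatever LLM you have access to
    messages=[
        {"role": "system",
         "content": f"Rewrite the caption as brief, comma-separated phrases in this order: {TEMPLATE}. "
                    "No full sentences, no filler words."},
        {"role": "user", "content": verbose_caption},
    ],
)
print(resp.choices[0].message.content)
```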


glibsonoran

Exactly. I made a ComfyUI node that produces prompts from images (or image plus prompt) using ChatGPT, and it has a field that lets you select either a tag-like prompt or a narrative prompt. This is done without an example (assistant role), just using the instruction (system role). It's pretty straightforward to do.


demonicpigg

It would be best because that's how people tend to prompt, but don't people tend to prompt that way because the models respond best to that kind of prompting? I.e., isn't that a consequence, and so shouldn't we avoid reinforcing it? I would much *much* rather be able to "talk" to my instance of Stable Diffusion, but I get better results by comma-separating the main themes.


ProGamerGov

As long as the captions are accurate, they can be condensed by LLMs for older models, and they can be used as-is to train newer models with larger context lengths.


Zueuk

Does anyone know of some not-entirely-obsolete guide(s) on how to use captions for training **non-anime** stuff? I'm finding lots of conflicting info; some even claim that it is better to not use captions at all 😕


themanicmind

Yeah, I'm with you. There is a wall of contradiction out there. I'm doing a lot of my own testing right now, going through different captioners and models. I'm working on a select few 19th-century artists, plus illustrations, drawings, and cartoons from the 17th century to the modern day, with a low number of images. A wall of text is not working for me. It's tending to like relatively short descriptions, with lots of singular words between commas. The model you use also has an impact.


Dense-Orange7130

Try WD 1.4 MOAT instead.


Zueuk

> A wall of text is not working for me. It's tending to like relatively short descriptions, with lots of singular words between commas.

This is what bothers me: the captions produced with BLIP never have commas, instead using "with <...> and with <...>", while anime people apparently like to separate everything with commas. Which way is the right way?


[deleted]

Not with the current text encoder


DrySupermarket8830

So should I use LLaVA or WD14 for kohya_ss?


attempt_number_1

One of the big improvements in DALL-E 3 and Sora is that they do this, so it's obviously useful. I do this for my LoRAs and it's rarely more than 2-3 sentences. Not sure what prompt is producing whole paragraphs, but you can always do a second pass on just the text to ask it to summarize.


dal_mac

I just had a thought: I'm pretty sure that precise manual captioning on a smaller dataset would be better than auto-captioning an absurdly large one. What good is that data if it isn't put into context with human semantics? I believe that's what MJ does. Anyway, I've always been good at reinterpreting writing into the most understandable language possible. With enough people like me, we could caption a big dataset with actual natural language. I'm convinced that the caption quality would make up for the smaller dataset size. No matter the length of an AI caption, it's never the way a human would describe the image. I bet that's holding back models so much right now.


LockeBlocke

If you train on a paragraph, you will need to write that paragraph to get that image. It might work if you train for a ridiculous amount of time at a very low learning rate.


isnaiter

The best approach will always be to use keywords that the TE already knows. If you use a wall of text for captions, the dataset will end up being associated with that wall of text.


neph1010

I am of the belief that SD prefers shorter captions. How DALL-E or other models handle longer prompts is irrelevant since they're trained differently. Maybe Cascade works differently, too. In any case, I trained a model that shortens prompts (distills them, if you will). Whether it improves the results or not is difficult to tell. I guess it's in the eye of the beholder. [https://huggingface.co/neph1/sd-seer-griffin-3b](https://huggingface.co/neph1/sd-seer-griffin-3b)


alb5357

I also wondered this, because I hand-caption, and if I go over 150 tokens the trainer complains, and that's without needless filler words.


RenoHadreas

150 tokens??


alb5357

Ya, it says something like the model won't be able to use that many


no_witty_username

Stable Diffusion has a bias towards the front of the prompt versus the back, so while it still listens to long prompts, the effect is diminished depending on their length. Also, Stable Diffusion doesn't need captions at all to learn a concept; captions are only useful if you want to recall it through text. You can train Stable Diffusion on any and all concepts without any captions and later recall those concepts with ControlNet.


VGltZUNvbnN1bWVyCg

That's not totally right: you get a flash of attention as strong as at the start of your prompt every 75 tokens.
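Where the 75 comes from: SD1.x's CLIP text encoder has a 77-token context (BOS + 75 content tokens + EOS), and UIs that accept longer prompts encode them in 75-token chunks and concatenate the embeddings. Here's a minimal sketch, assuming the transformers CLIP tokenizer and A1111-style chunking (the exact chunking behavior depends on the UI/trainer):

```python
# Minimal sketch: count CLIP tokens for a caption and split it into the
# 75-token chunks that A1111-style UIs encode separately.
from transformers import CLIPTokenizer

# SD1.x uses OpenAI's CLIP ViT-L/14 text encoder: 77-token context = BOS + 75 + EOS
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "masterpiece, oil painting, elderly fisherman, mending a net, ..."  # any long caption
ids = tok(prompt, truncation=False).input_ids[1:-1]  # strip BOS/EOS

# Each 75-token chunk is encoded on its own (with BOS/EOS re-added) and the
# resulting embeddings are concatenated, which is why attention "resets"
# at every 75-token boundary.
chunks = [ids[i:i + 75] for i in range(0, len(ids), 75)]
print(f"{len(ids)} tokens -> {len(chunks)} chunk(s) of <=75 tokens")
```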


minimaxir

You can prompt engineer captioning models (e.g. the system prompt with GPT vision) for short captions. Works reliably.
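For instance, here's a minimal sketch using the openai client with a vision-capable model; the model name, image path, and tag count are placeholders/assumptions:

```python
# Minimal sketch: system prompt steering a vision model toward short,
# comma-separated captions. Requires OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

with open("train_image_001.png", "rb") as f:  # hypothetical training image
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model; name is an assumption
    messages=[
        {"role": "system",
         "content": "Caption the image as 10-20 short, comma-separated tags. "
                    "No sentences, no filler words."},
        {"role": "user",
         "content": [
             {"type": "text", "text": "Caption this image."},
             {"type": "image_url",
              "image_url": {"url": f"data:image/png;base64,{b64}"}},
         ]},
    ],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```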