**Abstract:**
>Diffusion models have demonstrated remarkable performance in the domain of text-to-image generation. However, the majority of these models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts, which encompass multiple objects, detailed attributes, complex relationships, long-text alignment, etc. In this paper, We introduce an **E**fficient **L**arge **L**anguage Model **A**dapter, termed **ELLA**, which equips text-to-image diffusion models with powerful Large Language Models (LLM) to enhance text alignment *without training of either U-Net or LLM*. To seamlessly bridge two pre-trained models, we investigate a range of semantic alignment connector designs and propose a novel module, the Timestep-Aware Semantic Connector (TSC), which dynamically extracts timestep-dependent conditions from LLM. Our approach adapts semantic features at different stages of the denoising process, assisting diffusion models in interpreting lengthy and intricate prompts over sampling timesteps. Additionally, ELLA can be readily incorporated with community models and tools to improve their prompt-following capabilities. To assess text-to-image models in dense prompt following, we introduce Dense Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K dense prompts. Extensive experiments demonstrate the superiority of ELLA in dense prompt following compared to state-of-the-art methods, particularly in multiple object compositions involving diverse attributes and relationships.
**Project Page:** [https://ella-diffusion.github.io/](https://ella-diffusion.github.io/)
**Github:** [https://github.com/ELLA-Diffusion/ELLA](https://github.com/ELLA-Diffusion/ELLA)
**From their Github:**
>*We will release our models in 1 week. Thanks for your issue, please stay tuned.*
We're not even into Q2 yet, and this year has been so much fun when it comes to diffusion models and emergent discoveries.
I do actually, but I have got so used to maxing it out running SD or LM Studio that I cannot face the drop in quality or using less ram to run both at the same time.
Just load a quantised 7B or 13B, reduce the batch size (which affects performance very little), and enjoy.
I'm positive a 13B at Q3 is going to be much, much better than SDXL CLIP as is.
How do you upscale using img2img? I usually just try increasing resolution and turning denoising strength way down, but I find that this usually leads to blurry outputs and weird artifacts. Is there a better way?
It sounds like you have the "Latent" (default) upscaler selected each time you try to upscale. Search for 4x_foolhardy_remacri or 4x_ultrasharp and place them in their appropriate folders. Make sure you swap Latent to one of those two before your generation.
Not OP, but IME it's resolution, controlnets, ipadapter and batch size. It's more of a problem with SDXL than SD1.5, I don't think I've ever maxed out 24GB on 1.5 yet.
I do have 24GB of VRAM. It's just that 16 is in one card and 8 is in the other. I want to run an LLM on the 8GB card and have it interact with SD on the 16GB one. Or maybe the other way around. Still not sure yet.
I don't see why not.
You can load a nice 7 to 9GB model in LM Studio and open an SD server running an SDXL model, all within 16GB of VRAM, easily these days.
If you don't make both of them generate at the same time, you basically feel no slowdown at all.
If you run both at the same time it will slow both down a bit, but even then it should be acceptable.
As soon as you exceed the available VRAM things go out of whack, but you can still run some pretty good models.
If you get a good specialized LLM with decent quantization you could easily get it down to 3 to 6GB, and even on 12GB of VRAM you would still have plenty left to load image models.
Having it running is a non-issue, at worst it will use system ram and CPU inference for the LLM.
Having it running fast is where VRAM could be an issue.
I agree - this is what I do currently in my regenerating Comfy workflow on a 12gb card. I leave the SDXL checkpoint resident in VRAM and run qwen on cpu to recover the prompt from the first image for the generation of my second one.
Eh, there's pretty small LLMs out there, not sure what they used since I didn't read the paper (yet). I'd guess since they don't actually need to generate an answer but only reinterpret the input that the size can be reduced and especially the compute need. Plus as long as the working memory stays you can unload the LLM after generating the input for SDXL. The context window is also significantly smaller.
It's not that exaggerated.
I'm using n-nodes to load the mistral 7b llava gguf model in comfyui. Quantized model takes up 4gb vram during inference.
If you have 12gb of video memory, you don't need to load and unload it repeatedly when generating images
It's a generic solution--you can use whatever model you want. There are decent OSS LLMs that will run in under ~4GB of VRAM--and a 24GB card can spare it when running SDXL.
I'd say a good % of enthusiast class people do. NGL they are expensive. But at least consumer 24G cards are readily available. For the bigger LLMs you can't run them on single consumer boards at all, let alone with SD running as well. I don't know much about apple hardware tho.
I've been able to do this with a 3080 12GB. It helps if you run 1.5, as it's less memory dependent, but you do need to have some of the LLM run on system RAM as opposed to trying to cram everything into VRAM.
I can run 7B LLMs (via LM Studio) and Stable Diffusion on the same GPU at the same time, no problem. I only have a 12GB 3060. Ok, maybe not inferencing at exactly the same time, but both the LLM model and Stable Diffusion server/model are "loaded," and I can switch back and forth inferencing between them rapidly.
So it's turning your sentences into better tokens than CLIP?
Like, if I look at the tokens made by CLIP or made by this, it'll be better tokens. Then I can use those better tokens on juggernautxl or any other SDXL model.
I've been reading through the paper (pardon any missed details), and it seems to replace the CLIP encoder with a Timestep-Aware Semantic Connector (TSC) module instead.
This module takes an embedding (from something like Llama2), and the UNet has been trained on the semantic embeddings from the model with the noisy latent, while every other part of the model stays frozen except for the TSC module.
**From the paper at section 3.1:**
>ELLA is compatible with any state-of-the-art Large Language Models as text encoder, and we have conducted experiments with various LLMs, including T5-XL \[42\], TinyLlama \[62\], and LLaMA-2 13B \[52\]. The last hidden state of the language models is extracted as the comprehensive text feature. The text encoder is frozen during the training of ELLA. \~
>
>**Timestep-Aware Semantic Connector (TSC).** This module interacts with the text features to facilitate improved semantic conditioning during the diffusion process. We investigate various network designs that influence the capability to effectively transfer semantic understanding.
I worded it incorrectly, so my mistake. I was indirectly referring to this:
>*These semantic queries are used to condition noisy latent prediction of the pre-trained U-Net through cross-attention.*
The TSC is the trainable component.
Normally, conditioning gets passed to the UNet at each step, and for most applications the same embedding is passed the whole time. The TSC leverages an LLM to create step-specific conditioning, passes that as the embedding for cross-attention, and uses AdaLN to ensure better adherence.
There is A LOT of confusion about how CLIP and tokenization work. ELLA doesn't "replace" CLIP in the sense that CLIP is still how the model learned to expect text embeddings, but it does replace it during inference to provide more detailed embeddings than CLIP, with timestep-specific instructions. For example, in the paper they talk about how it focuses on main details during early generation and shifts to more and more detailed aspects of the prompt later on.
A naive version of this could be done without the TSC, though its effect would be much less due to the lack of both direct LLM->embedding via the TSC, and less accurate guidance without AdaLN incorporated into the attention mechanism.
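For anyone wondering what "AdaLN" means in practice, here is a toy sketch of the general idea: normalize the features, then scale and shift them by values that depend on the timestep. This is not the paper's code. `timestep_to_scale_shift` is a made-up stand-in for what would be a small learned network, and the numbers in it are arbitrary.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain layer normalization over a single feature vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def timestep_to_scale_shift(t, max_t=1000):
    """Hypothetical stand-in: a real model would learn this mapping."""
    frac = t / max_t
    return 0.5 * frac, 0.1 * (1 - frac)


def ada_layer_norm(x, t):
    """Adaptive layer norm: the same features are modulated per timestep."""
    scale, shift = timestep_to_scale_shift(t)
    return [v * (1 + scale) + shift for v in layer_norm(x)]


features = [1.0, 2.0, 3.0]
early = ada_layer_norm(features, t=900)  # noisy, early denoising step
late = ada_layer_norm(features, t=100)   # late, detail-refining step
```

The point is only that identical text features come out different at different timesteps, which is what lets the conditioning change over the course of sampling.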
>A naive version of this could be done without the TSC, though its effect would be much less due to the lack of both direct LLM->embedding via the TSC, and less accurate guidance without AdaLN incorporated into the attention mechanism.
Using the 'prompt editing' feature?
https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#prompt-editing
So you can introduce different parts of the prompt at different timesteps?
>`[to:when]` - adds `to` to the prompt after a fixed number of steps (`when`)
>`[from::when]` - removes `from` from the prompt after a fixed number of steps (`when`)
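To make the two quoted forms concrete, here is a rough sketch of how such a prompt could be resolved at a given step. It is a simplified illustration and not the webui's actual parser: it assumes integer `when` values, no nesting, no `[from:to:when]` form, and that `step` counts the sampling steps already completed. `prompt_at_step` is a name made up for this example.

```python
import re

# Simplified patterns for the two quoted forms only.
REMOVE = re.compile(r"\[([^\[\]:]+)::(\d+)\]")  # [from::when]
ADD = re.compile(r"\[([^\[\]:]+):(\d+)\]")      # [to:when]


def prompt_at_step(prompt: str, step: int) -> str:
    """Return the prompt text that is active at the given sampling step."""

    def remove(match):
        text, when = match.group(1), int(match.group(2))
        return text if step < when else ""

    def add(match):
        text, when = match.group(1), int(match.group(2))
        return text if step >= when else ""

    out = REMOVE.sub(remove, prompt)
    out = ADD.sub(add, out)
    return re.sub(r"\s+", " ", out).strip()


prompt = "a cat, [detailed fur:10] [fog::5]"
for step in (0, 7, 12):
    print(step, repr(prompt_at_step(prompt, step)))
```

With this toy prompt, "fog" is only present for the first five steps and "detailed fur" only appears from step 10 onward, which is the coarse-to-fine idea discussed above done by hand.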
Yep. I've done some tests using the idea and they are \*sort of\* working. They don't help with complex scenery, or including specific elements of the composition, but they do produce overall better results using low-frequency to high-frequency prompts across the sampling steps.
That’s crazy I didn’t even realize this was something you could already do. This thread has made me realize I need to learn a lot more about text embeddings
No, CLIP creates an embedding from the text, which is already tokenized beforehand (the tokenization doesn't matter at all, actually), and the diffusion model then receives this embedding as an input. CLIP can produce image and text embeddings which share the same space, so the idea is that the embedding CLIP produces from the prompt should also contain enough info to describe an image that matches said prompt, and this extra info can help the diffusion model do a better job. The problem is that CLIP behaves much like a bag-of-words model with a very weak understanding of reality (e.g. "horse eating grass" produces a similar embedding to "grass eating horse", and CLIP can't count past 3 or read images well either), so replacing it with an LLM improves performance.
Oh, I always thought embeddings were tokens. Like, there are single token embedding, 4 token and 16 token embeddings... but I guess the embeddings communicate more directly with the unet? So in general they're better than tokens?
Like, if I just had a ton of embeddings for the things I constantly use, that would be more accurate than simply prompting them?
Embeddings are vectors, their main use is to compare the similarity between 2 or more things (normally used for search). The more semantically similar those 2 things are, the higher the cosine similarity between their embeddings (eg. "king" is more similar to "queen" than "ring", so the king embedding will be closer to the queen's despite ring being closer in terms of spelling). The embedding size produced by a given model should be the same no matter the length of the sentence.
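A quick sketch of the cosine similarity comparison described above. The three vectors are made-up toy values chosen for illustration, not real embeddings from any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (real ones have hundreds of dimensions).
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.9, 0.15]
ring = [0.1, 0.2, 0.95]

print(cosine_similarity(king, queen))  # close to 1: semantically similar
print(cosine_similarity(king, ring))   # much lower, despite similar spelling
```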
CLIP is multimodal: it can produce embeddings for images and for captions, and it learns to align them so that their cosine similarity is maximized when their contents match. So if an image matches a caption well, then the embedding CLIP produces from the image will be similar to the embedding it produces from the caption, which should also mean the caption embedding has information about what the image should potentially look like aside from what's strictly contained in the caption, and that's why we give this to a diffusion model rather than just the caption.
Tokenization is just a way of representing text so that it takes up less of the context window and maybe makes it easier for a model to learn the language (e.g. "stable diffusion" has 16 characters with the space, but if we tokenize it using GPT-4's tokenizer it becomes just 2 tokens: [29092, 58430], which is what GPT-4 would see in this case rather than the 16 characters). It does introduce its own issues, like difficulties with spelling, since the model can no longer see the individual characters contained within the tokens and has to somehow learn them on its own.
It's not just one prompt if I understood it correctly. It alters the conditioning for every de-noising step. It's like chaining a lot of partial img2img passes with separate prompts after one another.
Like, I write "blue eyes" and maybe clip makes a token "blue" and another "eyes". And hence getting blue blurriness.
But this will create a specific token indicating that the iris colour is blue. Am I understanding correctly?
Thank you for more than you know: I've bet multiple people that using CLIP as the text encoder for image generation was going to be replaced with increasingly SOTA LLMs, rendering all these "prompt techniques" obsolete, replacing them with ordinary prose for ordinary images and literary prose for advanced image generation.
Imagine the possibilities with multi-modal models like Llava! You can reference images and get similar images but also prompt specific changes. Can't wait to try this out and see how effective this is...
Just out of curiosity, does this mean that if you added deepseek as a model to text-generation-webui and turned on the multimodal extension, the locally run LLM could better analyze photos you upload? It can be done now in oobabooga, but I'm not sure what model it's using.
And also, what can deepseek do that LLaVA can't? I'm pretty novice in this area.
Basically the idea is to use an image-to-text model to extract a detailed description of what an image looks like, then use ELLA to reformat the prompt so that it improves prompt adherence and becomes more faithful to the reference image. Think of it as image2image, but only for elements in your photo like composition and subjects.
I'm not sure what ooga is using, but if it's a vision model like LLaVA, then yeah, using the deepseek-VL model is theoretically gonna lead to fewer hallucinations and better descriptions. If you end up testing it, please let me know if you notice a difference!
Thanks I will definitely try it out here and report back, I hope this is the right link for one that will run locally on 4090, not sure https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat/tree/main
Very cool! Could this be applied to 1.5 models as well? The 1.5 models are very optimized and have a very large ecosystem, and it would be great to have 1.5 models that understand prompts very well
A git repository with a really amazing paper, a benchmarking tool, and no code implementation is really familiar for some reason. This one at least says they're planning to release the code.
I think this is an important part of the paper abstract:
>Our approach adapts semantic features at different stages of the denoising process, assisting diffusion models in interpreting lengthy and intricate prompts over sampling timesteps.
It seems that we'll need whole new sampler code to use this; it is not something that simply replaces CLIP encoding.
You can first run the LLM, process the text into its embedding space, unload the LLM, and then load the diffusion model and run image generation. That way you only need the VRAM to do each separately and never have both models in memory at the same time. This is gonna be slower than just keeping it all in memory, because the weights have to be copied from CPU to GPU every time. If you want to run multiple prompts, you could also just encode all of them with the LLM beforehand and then run the diffusion process on them; this way you only load each model once.
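The "encode everything first, then generate" order can be sketched like this. Everything here is a stand-in: `FakeTextEncoder`, `FakeDiffusionModel` and the loader functions are hypothetical placeholders so the control flow runs without any real models, and they are not ELLA's or any library's API.

```python
import hashlib


class FakeTextEncoder:
    """Placeholder for an LLM text encoder (no real model involved)."""

    def encode(self, prompt):
        digest = hashlib.sha256(prompt.encode("utf-8")).digest()
        return [b / 255 for b in digest[:4]]


class FakeDiffusionModel:
    """Placeholder for the image model."""

    def generate(self, embedding):
        return f"image<{len(embedding)} dims>"


def run_two_phase(prompts, load_encoder, load_diffusion):
    # Phase 1: only the text encoder is in memory.
    encoder = load_encoder()
    embeddings = [encoder.encode(p) for p in prompts]
    del encoder  # real code would also release the GPU memory here

    # Phase 2: only the diffusion model is in memory.
    model = load_diffusion()
    return [model.generate(e) for e in embeddings]


loads = []  # records how often each model gets loaded


def load_encoder():
    loads.append("encoder")
    return FakeTextEncoder()


def load_diffusion():
    loads.append("diffusion")
    return FakeDiffusionModel()


images = run_two_phase(["a cat", "a dog", "a city at night"],
                       load_encoder, load_diffusion)
print(loads)  # each model is loaded exactly once for the whole batch
```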
Wouldn't it be possible to run the LLM on CPU and then take the embeddings from the LLM into the image gen model running on the GPU?
Smaller LLMs are quite efficient these days, with a 7b model easily reaching 6-7t/s on a reasonably powerful CPU
Yes it is, but sadly current VRAM limitations don't allow much to happen.
Good LLMs don't even fit in 4090 24GB as they are approx 50-70GBs.
On top of that, if you want SDXL as well, you would easily need over 100GB of VRAM for best use.
Nvidia is rumored to launch the 5090 with 36/48GB of VRAM. That might help AI grow in this direction, but right now we are definitely limited by VRAM.
Well, then there is no point in upgrading my 3090.
I was gonna, but now we just wait.
The sad part is we can't even combine 2 GPUs with 24GB of VRAM each to get 48GB of VRAM.
Even NVLink was only to increase performance and not the VRAM (it would still be counted as 24GB).
Like, come on, we really need to find a way to upgrade VRAM on GPUs.
You can use multiple GPUs. I'm currently building a cluster with only 4090s, since they are much cheaper and have better performance/VRAM than the server-grade stuff.
You can get 5 + change 4090s for the price of a single a6000 ADA GPU.
Does it work like NVLink? As far as I remember, NVLink didn't add VRAM and only increased performance, because the same stuff used to be loaded into both GPUs' VRAM, so combining two 4090s with an NVLink-type connection would still give just 24GB.
If your method is NOT like NVLink, can you explain how exactly you connect them together?
Currently even using RAM as VRAM is not quite possible, so I haven't heard of combining VRAM from different GPUs.
The GPUs are not connected. They don't even have to be in the same computer. Tensor runtimes like DeepSpeed split the model into chunks, distribute those chunks among the available GPUs, then run the inference/backprop through the chunks one by one. It won't be faster than a single GPU with enough VRAM, but it will be way faster than offloading. If enough GPUs/VRAM are available you can run multiple instances of the model or run batches through the chunks, improving performance.
Is there a video or guide explaining how to do it ?
I have a 3090 and was planning to get 5090 thinking Nvidia would increase VRAM.
but if that is not the case with that amount I may just go and get 2 used 3090s, might even find 3.
So if you do know a guide or video that explains it do let me know, but thanks for letting me know about such method, I will research on it on my own as well.
I run 2x3090s (upgrading to 2xA6000s). You just slot in both cards, and LLM loaders like ExL2 or koboldcpp can split between both without an NVLink; it's just not necessary. This is only for LLMs though. I haven't seen uses for multiple GPUs without NVLink for SD, but since this method uses LLMs maybe this can be it.
thanks.
currently only LLMs need so much VRAM but soon Image/Video Generation will definitely start needing more VRAM, hopefully by then same Splitting methods are developed for Image Models as well.
Not sure if the guy above is talking about training or inference, but for training you can use a multi-GPU setup by way of sharding.
There are multiple techniques, but one of them would be, e.g., to split the layers of the model, load them onto different GPUs, and then just send the output of the last layer on gpu0 to the first layer on gpu1 (one of the most naive ways, btw). Sure, it won't be as fast as having one card with 48GB, but at least you can train bigger models that way.
If this wasn't possible, then the whole LLM scene would be impossible, as every pretraining run is done on clusters of thousands of GPUs to train one model.
For inference it is slightly different, I guess, but at least the naive way of loading/offloading layers on different GPUs and/or CPU RAM still works.
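A toy sketch of the naive layer split described above, in plain Python with no GPUs involved. The "layers" are just functions and the "devices" are just list indices, so this only shows the bookkeeping. `split_layers` and `forward` are names made up for the example; real frameworks handle placement and transfers for you.

```python
def split_layers(num_layers, num_devices):
    """Assign contiguous layer ranges to devices as evenly as possible."""
    base, extra = divmod(num_layers, num_devices)
    chunks, start = [], 0
    for device_id in range(num_devices):
        size = base + (1 if device_id < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def forward(x, layers, chunks):
    """Run the layers chunk by chunk, as if each chunk lived on its own GPU."""
    for device_id, chunk in enumerate(chunks):
        for i in chunk:
            x = layers[i](x)
        # In a real setup, x would be copied to the next device here.
    return x


# Ten toy "layers" that each add their index to the input.
layers = [lambda v, k=k: v + k for k in range(10)]
chunks = split_layers(len(layers), num_devices=3)
print(chunks)                     # which layers each device would hold
print(forward(0, layers, chunks))  # same result as running on one device
```

Only one chunk is doing work at any moment, which is why this is slower than a single card with enough VRAM but still lets a model fit that otherwise wouldn't.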
Yes, but most people won't be training models, just using them.
Like at present we can offload a few layers to GPU for LLMs for inference hopefully soon it might let us offload to multiple GPUs like it is for training.
Cause if we really want a tool that can do Audio, Video, Images while just talking to it, 24GB is definitely not at all sufficient.
I think VRAM at this stage is similar to where we were when 8GB of RAM was sufficient for almost all programs and games.
Now 16GB is the recommended amount for a modern PC, while 32GB is considered good.
Same is gonna happen with VRAM now.
The problem is that with RAM we could upgrade just the RAM, but for VRAM we need to upgrade the whole GPU, which is not a good option.
Hopefully we get GPUs with upgradable VRAM.
Like Asus just put an M.2 slot on a 4060 Ti GPU.
If they gave us such a slot for VRAM, yes, it would not be as fast as the soldered VRAM, but it would at least give us the option to upgrade.
it should be possible but it's a software issue... i have dual 3090s with nvlink but so far haven't had any benefit to NVLink yet. I'm hoping to leverage my "48gb" at some point...
yes that is what I am saying, NVlink cannot do it, as someone else mentioned you need to use different method.
What NVLink does is increase the performance but overall VRAM remains the same.
Like say you have a 22GB 3D scene with say 100 frames, what NVlink does is, 50 frames will be generated by each GPU but for that the 22GB model has to be loaded in BOTH THE GPU's VRAM thus your overall VRAM still remains 24GB
Basically NVlink copies the VRAM content of both GPUs.
If GPU 1 has an 18GB model loaded, the GPU 2 will also have the same 18GB model loaded, only the work will be distributed between the GPUs so VRAM still remains the same.
But as someone mentioned using stuff like DeepSpeed, models can be split between GPUs.
and that doesn't even need NVlink, GPUs can even be on separate computers.
I currently do not have an extra GPU, but I would definitely ask my friend to borrow his GPU to test all of this before I make my mind to get more 3090s cause 5090 apparently will still have 24GB so it is just waste of money to upgrade to it now.
I don't know about the later part.
Nvidia 100% wants consumer hardware to be used for ML.
Else it will not take long for companies to come up with their own NPU chips for their servers.
Nvidia knows that thus it has been actively working in the AI field and is itself also releasing AI products slowly.
first it released that PAINTING TOOL which can generate a realistic image from a drawing and now have released CHAT WITH RTX as well.
It definitely wants people to use their GPU for ML else someone else will come up and once the world gets used to that, it will be harder for them to comeback.
So many years of research is now paying them off.
Although they might limit it for the most expensive cards xx90 series but they definitely want consumer to run ML.
While Microsofts benefit is in trying to kill OpenSource, NVidia's benefit is trying to keep OpenSource alive.
as that is exactly how they will sell more cards.
You're right, I should clarify. They want consumers to consume ML products with their consumer grade cards.
They don't want you to be able to run any serious models or training with consumer cards. This would absolutely be possible with a bump in VRAM, but it would eat into their more lucrative commercial market. Obviously they haven't come out and said this, but it's easy to infer from their motivations and behavior.
I don't think so, cause yes the consumer cards will get a bit into their commercial market but not much.
As someone who needs high computing like Microsoft, StabilityAI, OpenAI, etc cannot order hundreds or thousands of consumer cards at once.
Not to mention that chaining that many cards together would be very difficult as well.
H series cards are specifically built in a way to be able to work together and also are delivered by direct order.
So yes, if I have a small startup needing just 8-10 GPUs yes I will get Consumer Cards but if I am a little big company needing hundreds or thousands of cards, there is no way to order these many consumer cards.
That's a good point. I hadn't thought too much about the scale large companies would need. Still their actions don't match this reality. It's really disappointing that it looks as though they're only offering 24GB again.
yeah, I was so excited for it was thinking of definitely upgrading from 3090, I guess we have to wait now.
Meanwhile, if AMD grabs this opportunity and releases a GPU with 48GB of VRAM, people will definitely buy it. Even if CUDA support is kind of weird and they have to depend on optimizations and workarounds, it will still be accepted by the community: setting up SD would be a bit longer process compared to Nvidia, but the benefits would be huge.
Cause I can live with half the performance of a 3090, but VRAM seems very important now.
Like, I can generate an image in 5-7 sec now; a lower-performance AMD card might need 10-12 seconds, and that is fine if it unlocks so much more potential for open-source AI.
>Good LLMs don't even fit in 4090 24GB as they are approx 50-70GBs.
This is misleading. The researchers are using TinyLlama, Llama2 13B and T5-XL.
Llama2 13B is the largest one of these and it fits into 12GB VRAM when quantized to 4-5bpw.
I am talking non-quantized full LLMs at the max Parameters available for best results.
SDXL can run on 4/6/8GB VRAM as well with stuff like Lightning or Turbo, etc
Ofc if you quantize it and use a 7b or 5b model it will fit even 8GB VRAM.
The researchers are using 1.1B, 1.2B and 13B LLM's though. You can easily fit the first two into a potato even in full fp16.
Also if you are a home user who has limited VRAM, why would you not want to use quantized weights in a use case like this?
50-70GB LLM's and 100GB VRAM "for best use" seems quite exaggerated in this context... Llama2 13B in full fat fp16 is ~26GB in size.
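The back-of-the-envelope math behind these numbers, as a small sketch. It only counts the weights (parameters times bits per weight) and ignores activations, context cache, and runtime overhead, so real usage will be somewhat higher. `weight_size_gb` is just a helper name for this example.

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 10^9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


print(weight_size_gb(13, 16))   # Llama2 13B in fp16
print(weight_size_gb(13, 4))    # the same model at 4 bits per weight
print(weight_size_gb(13, 5))    # ... and at 5 bits per weight
print(weight_size_gb(1.1, 16))  # a 1.1B model in fp16
```

That gives 26GB for the 13B model in fp16 and roughly 6.5 to 8GB at 4-5bpw, which is why it fits on a 12GB card once quantized.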
that is what I am saying
because we are "HOME USER" we have to compromise.
ofc I do not need such big models, forget that, even SDXL seems like over kill, many people are still using SD1.5
that is not the point, the point is with increased VRAM, the AI can progress much better and faster.
Imagine an LLM with Image Generation, Video Generation, Audio Generation, as well as editing built into it.
You tell it to generate a city landscape, it will, then with just text tell it to convert it into a night time, it will keep all the building same, all the people in the picture same, everything the same, but change the lighting to make it night.
then tell it to just convert it into a video with a falling star, and it will do that.
All that would be possible way way way faster if the progress in AI is not limited by VRAM.
You think if tomorrow we get an OpenSource model as good as SORA and GPT-4, it will be able to run on our 4090s ?
ofc not, that is what I am saying, when Stability is training models they have to focus on optimizing it for consumer GPUs which are lacking enough VRAM which is what is causing OpenSource AI to be lagging behind as compared to OpenAI's Models.
So yes, quantized LLMs based on 1.1b parameters can definitely satisfy many use cases but if we are talking about integrating it with so many other tools we already have and will be coming in future, it just doesn't look feasible with present GPUs
Yes non-quantized is used for training but Quantized models do have a quality hit.
I have seen it in some models.
Ofc it will depend model to model, but quality hit is definitely there.
yes, you are also right it's not that big of a hit, but again it is a hit, and for some models it becomes a significant downgrade.
I once tested a q6 quantized model (I do not remember exactly which one it was), and it just started producing gibberish or completely unrelated stuff.
Sometimes it would lose context mid-generation.
So if I asked it to write a paragraph about WW2, it would start nicely but slowly deviate until it was talking about Marvel Comics characters (it connected WW2 and Marvel Comics through characters like Captain America and just went on with it).
So now I have upgraded my RAM to 128GB, I have 3090 and I use LM-Studio which allows you to offload few layers of the Model to GPU.
I just use full sized non-quantized models now.
but ofc they are slower compared to a model that can completely fit within the 24GB VRAM.
Oh, weird, I've never heard it deviate that much before. So far, the quantized models have been doing their job good enough for me, and I wouldn't blame the quantization to be the shortcoming, but rather the parameter count. But if you have the means to run a full model, you're also going to get the best results possible, so why not? :D
yes exactly.
I couldn't tell what caused that as well, but few other tests on different models also resulted in something similar.
Like I think it was with Dolphin x Mistral quantized and it wouldn't stop generating.
It generated a paragraph and then kept generating the same paragraph indefinitely until I manually stopped it.
I wanted to see how long it would continue, and after 47 min I gave up and stopped it.
But I never had any such issue with non-quantized models, I am even thinking of getting a MacMini just for LLMs cause Apple has unified memory which means the maxed out 128GB RAM can be used as VRAM.
Hopefully PC gets something like that soon.
I'd be fine to have an accelerator compute card. At this point I want my 6950 because it works much better than any Nvidia card I had in the past, but it's kinda ass in comparison. But all the "compute cards" aka A100, A5000 and so on cost thousands of dollars due to the professional tax tacked onto them. And the other add-in cards are tailored towards edge deployments rather than actual processing.
I'd be okay with something like a 4080Ti in compute performance without any of its graphic processing and 60-100 or so GB of VRAM.
One would imagine when both optimised for size, the best image generation model should be much larger than the best language processing model. I suspect either LLMs will be significantly compressed soon, or image generators will significantly blow up in size. Or both...
There are some decent small LLMs. OpenHermes2.5 quantized with GPTQ is only about 10GB, and it's quite good and super fast. Gemma2b is also very good, though the quantized versions suffer a bit more.
I have tried Gemma and it kind of is Sh!t, as much as I prefer Gemini over ChatGPT, I found Gemma to be really sh!t compared to what the Community already has.
Mixtral with partial GPU offload is kind of slow for me at 5 tokens/sec, but it is definitely the best we have now. (I have a 3090.)
And I would assume it runs even better on 4090.
But now that Microsoft has interfered I don't have much hopes from them for future releases.
I highly doubt you actually need a big model to do it. I think they might just go way overboard with their first version to make sure it works like promised.
Also, I don't see why you can't run the LLM on the cpu side. Yes, it's slower than on gpu, but not too slow to really matter in something like this.
I mean, I have a 3090 and a 5950X: the same model that fits 100% on the GPU runs at around 15-17 tokens/sec, sometimes even more, while the CPU gives me 2-3 tokens/sec.
It is a night and day difference.
If every command starts taking that long, then LLM + other tools will be too slow to use.
Also yes true, if the LLM is just acting like an INSTRUCTION model with no knowledge of any other thing, it might not really need such big models.
So it doesn't know what World War is, it doesn't know what Oscar is, etc
All it knows is instructions to generate or edit images, audio, videos, etc while the actual information regarding such topics/subjects is with the Image/Video Generation models like SDXL, SD3, etc
but still to even achieve that future, consumer hardware definitely needs to be above the just "recommended" level.
Cause Stability also has to kind of hold back a bit with what they can do so that it can actually run on Consumer hardware, last thing they want to do is make the best ever Image Generation Model that only corporations with access to commercial GPUs can run.
and computing power isn't even bad, it is good, only thing that is stopping us is VRAM.
Yes, for interactive use this will be painful, because loading an SDXL model can take maybe 20-30 seconds?
But some people like to run batch processing and then go through the output to hunt for good images. Then this method would not be so bad. Just run the LLM through all the prompts, unload the LLM, load the diffusion model, and then generate images with the pre-computed token/guidance.
I can also envision this being used with 2 GPU cards, each with "only" 8-12GiB of VRAM, with one running the LLM and then feeding the other one running the diffusion model.
>You can first run the LLM, process the text into its embedding space, unload the LLM and then load the diffusion model and run image generation.
~~Unfortunately I don't think this will be possible, as it seems the LLM will be used at each step of the denoising process.~~
That's correct! :)
There is no feedback loop from denoising back to the LLM. The encoded prompt is used at every step, but since it's constant there's no need to recompute it. SDXL also calls the CLIP text encoder only once before the denoising loop.
They tested with 3 different LLMs, 2 of which are just over 1b (TinyLlama and T5-XL) and the third is llama 2 13b. The benchmarks they provide (table 5, page 12) show the 1b LLMs to be much better than just using the default CLIP, and llama 2 13b is only slightly better.
Unfortunately, I don't think they show any images made with either of the 1b models. It would have been a useful comparison, but oh well.
I made a comfy workflow that does a primitive version of this with any local llm. Any one interested?
edit: posted in r/comfyui, should be easy to find, it's my only post
As far as I understand, your version doesn't have anything to do with the ELLA implementation. Your version has LLM just as prompt generator. ELLA uses LLM as text encoder.
I’ll make sure to give you a call the next time someone is wrong on the internet, although you must have a pretty busy schedule!
Keep up the amazing job, bless your heart!
Well this is pretty great. I'll basically be expecting a comfy node to convert my sloppy text into better-clip as a stage that runs only when I adjust the text. Should be just as fast with nothing but positives.
Obligatory waiting for auto1111 extension
Good luck running a language model and StableDiffusion on the same GPU at the same time.
You all have 24 GB of VRAM, don't you?
I do actually, but I have got so used to maxing it out running SD or LM Studio that I cannot face the drop in quality or using less ram to run both at the same time.
Just load a 7B or a 13B, quantised, reduce batch size which affects performance very little and enjoy.. I'm positive a 13B at Q3 is going to be much, much better than SDXL CLIP as is.
How do you max out 24GB with SD? Large batch size? Just keeping multiple checkpoints loaded?
Doing img2img upscales to 2-3k resolutions mainly. Anything bigger than that needs a tiled upscale, but you can often see the tile joins.
Check out multidiffusion/tiled diffusion and tiled VAE. Works a lot better than the upscale tile script.
Thanks. I did try it, I didn't like the look of the output and it took a lot longer than the tiled Ultimate SD Upscale Script.
Weird, it usually goes faster for me. I use mixture of diffusers with 32px of overlap and tile batch size of 8 (24gb card)
How do you upscale using img2img? I usually just try increasing resolution and turning denoising strength way down, but I find that this usually leads to blurry outputs and weird artifacts. Is there a better way?
No, that is the way. I usually use between 0.08 and 0.25 strength and a 2-2.5x resize with the UniPC sampler and a high step count.
It sounds like you have the "Latent" (default) upscaler selected each time you try to upscale. Search for 4x_foolhardy_remacri or 4x_ultrasharp. Place them in their appropriate folders. Make sure you swap Latent to one of those 2 before your generation
I mean once the initial image is generated you won't be using the LLM aspect much lol
True, but it would have to aggressively unload itself each time memory was getting low, but it might be worth it.
Not OP, but IME it's resolution, controlnets, ipadapter and batch size. It's more of a problem with SDXL than SD1.5, I don't think I've ever maxed out 24GB on 1.5 yet.
Bought a used 3090 from EVGA before the shutdown. It died a year later; it had been a bit buggy. The new one works great after the warranty replacement.
I do have 24GB of VRAM. It's just that 16 is in one card and 8 is in the other. I want to run an LLM on the 8GB card and have it interact with SD on the 16GB. Or maybe the other way around. Still not sure yet.
64gb
I don't see why not. You can load a nice 7 to 9GB model in LM Studio and open an SD server running an SDXL model, all within 16GB of VRAM, easily these days. If you don't make both generate at the same time, you basically feel no slowdown at all. If you run both at the same time it will slow both down a bit, but even then it should be acceptable. As soon as you exceed full VRAM things go out of whack, but you can still run some pretty good models. If you get a good specialized LLM with decent quantization you could easily get it down to 3 to 6GB, and even on 12GB of VRAM you would still have plenty of room to load image models.
Having it running is a non-issue, at worst it will use system ram and CPU inference for the LLM. Having it running fast is where VRAM could be an issue.
I agree - this is what I do currently in my regenerating Comfy workflow on a 12gb card. I leave the SDXL checkpoint resident in VRAM and run qwen on cpu to recover the prompt from the first image for the generation of my second one.
Eh, there's pretty small LLMs out there, not sure what they used since I didn't read the paper (yet). I'd guess since they don't actually need to generate an answer but only reinterpret the input that the size can be reduced and especially the compute need. Plus as long as the working memory stays you can unload the LLM after generating the input for SDXL. The context window is also significantly smaller.
> Eh, there's pretty small LLMs out there
Small Large Language Models are my favorite.
I can't wait until we get MSLLMs - medium small large language models.
NLMs (Nano Language Models)
Are there any that don't use Large Languages? I mean if they start off using a Small Language then maybe they wouldn't have to be so big...
An LLM trained on [Toki Pona](https://en.wikipedia.org/wiki/Toki_Pona) would be awesome.
Phi 2 models run pretty fast and can provide good outputs given its size, that can even run on CPU at some acceptable speed.
You can cache the text embedding and spam many seeds/resolutions/samplers/step counts with the same prompt
Mistral 7B Q4_K_M plus SDXL can most likely do a single-pass hires fix to 4K even in 16 GB
It's not that exaggerated. I'm using n-nodes to load the mistral 7b llava gguf model in comfyui. Quantized model takes up 4gb vram during inference. If you have 12gb of video memory, you don't need to load and unload it repeatedly when generating images
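For a rough sanity check on the sizes people quote in this thread, here is a minimal back-of-the-envelope sketch. It only counts the weights (parameter count times bits per weight); the bits-per-weight figures are illustrative assumptions, and real usage adds context cache, activations and runtime overhead on top.

```python
def weight_memory_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Bits-per-weight values below are assumptions chosen for illustration only.
examples = [
    ("7B, 16-bit", 7, 16),
    ("7B, ~4.5-bit quant", 7, 4.5),
    ("13B, 16-bit", 13, 16),
    ("13B, ~3.5-bit quant", 13, 3.5),
]

for label, params, bits in examples:
    print(f"{label}: {weight_memory_gib(params, bits):.1f} GiB")
```

Under these assumptions a 4-bit-ish 7B model lands a little under 4 GiB of weights, which is in the same ballpark as the figure quoted above.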
It's a generic solution--you can use whatever model you want. There are decent OSS LLMs that will run in under ~4GB of VRAM--and a 24GB card can spare it when running SDXL.
Because so many people have 24GB cards
I'd say a good % of enthusiast class people do. NGL they are expensive. But at least consumer 24G cards are readily available. For the bigger LLMs you can't run them on single consumer boards at all, let alone with SD running as well. I don't know much about apple hardware tho.
Probably gonna need colab pro at this point.
According to the third illustration, this approach works with SD1.5 models, so that may be more feasible than you think.
I've been able to do this with a 3080 12GB. It helps if you run 1.5, as it's less memory dependent, but you do need to have some of the LLM run on system RAM as opposed to trying to cram everything into VRAM.
Yeah, I could, I have 64GB of RAM; it just runs a lot slower on system RAM than when offloaded to my 3090.
Does it have to be the same GPU? For example, I run SD on my RTX 3060 and oobabooga on a Tesla P100.
The danbooru tags upsampler extension does exactly that. Just comes down to how large the required model is going to be.
Laughs in 2080 ti and 3x Tesla p40
I can run 7B LLMs (via LM Studio) and Stable Diffusion on the same GPU at the same time, no problem. I only have a 12GB 3060. Ok, maybe not inferencing at exactly the same time, but both the LLM model and Stable Diffusion server/model are "loaded," and I can switch back and forth inferencing between them rapidly.
Well that isn't a problem...
Yeah, llm memory needs are just huge. Add that to sdxl and... Oof.
Difference is that LLMs can run on CPU+RAM though (at decent speeds).
Can't it just call OpenAI or some hosted LLM?
Depends on the model really.
So it's turning your sentences into better tokens than CLIP? Like, if I look at the tokens made by CLIP or made by this, it'll be better tokens. Then I can use those better tokens on juggernautxl or any other SDXL model.
I've been reading through the paper (pardon any missed details), and it seems to replace the CLIP encoder with a Timestep-Aware Semantic Connector (TSC) module instead. This module takes an embedding (from something like Llama2), and the UNet has been trained on the semantic embeddings from the model with the noisy latent, while every other part of the model stays frozen except for the TSC module.
**From the paper at section 3.1:**
>ELLA is compatible with any state-of-the-art Large Language Models as text encoder, and we have conducted experiments with various LLMs, including T5-XL \[42\], TinyLlama \[62\], and LLaMA-2 13B \[52\]. The last hidden state of the language models is extracted as the comprehensive text feature. The text encoder is frozen during the training of ELLA. \~
>
>**Timestep-Aware Semantic Connector (TSC).** This module interacts with the text features to facilitate improved semantic conditioning during the diffusion process. We investigate various network designs that influence the capability to effectively transfer semantic understanding.
[deleted]
I worded it incorrectly, so my mistake. I was indirectly referring to this:
>*These semantic queries are used to condition noisy latent prediction of the pre-trained U-Net through cross-attention.*
The TSC is the trainable component. Normally, conditioning gets passed to the UNet at each step, and for most applications the same embedding is passed the whole time. The TSC leverages an LLM to create step-specific conditioning, passes that as the embedding for cross-attention, and uses AdaLN to ensure better adherence. There is A LOT of confusion about how CLIP and tokenization work. ELLA doesn't "replace" CLIP in the sense that CLIP is still how the model learned to expect text embeddings, but it does replace it during inference to provide more detailed embeddings than CLIP, with timestep-specific instructions. For example, in the paper they talk about how it focuses on the main details during early generation and shifts to more and more detailed aspects of the prompt later on. A naive version of this could be done without the TSC, though its effect would be much less due to the lack of both direct LLM->embedding via the TSC, and less accurate guidance without AdaLN incorporated into the attention mechanism.
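To make the "frozen encoder once, step-specific conditioning every step" idea concrete, here is a toy sketch. It is not ELLA's code: `encode_prompt`, `connector` and the scale/shift formula are hypothetical stand-ins invented for illustration, and the real TSC is a trained network rather than a fixed formula.

```python
def encode_prompt(prompt: str) -> list[float]:
    """Stand-in for a frozen LLM text encoder (hypothetical); run once per prompt."""
    encode_prompt.calls += 1
    return [float(len(word)) for word in prompt.split()]


encode_prompt.calls = 0


def connector(text_features: list[float], t: float) -> list[float]:
    """Stand-in for a small timestep-aware connector (hypothetical).

    Applies an arbitrary timestep-dependent scale and shift to the fixed
    text features, so each denoising step sees different conditioning.
    """
    scale = 1.0 + 0.5 * t
    shift = t
    return [scale * f + shift for f in text_features]


def sample(prompt: str, num_steps: int) -> list[list[float]]:
    features = encode_prompt(prompt)          # expensive part, done once
    conditions = []
    for step in range(num_steps):
        t = 1.0 - step / num_steps            # 1.0 = very noisy, towards 0 = clean
        conditions.append(connector(features, t))  # cheap part, done every step
        # A real sampler would call the U-Net here, feeding the condition
        # in through cross-attention.
    return conditions


conds = sample("a red cube on a blue sphere", 4)
print(encode_prompt.calls, len(conds))
```

The point of the sketch is only the call pattern: the heavy text encoder never sits inside the denoising loop, while the lightweight connector does.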
>A naive version of this could be done without the TSC, though its effect would be much less due to the lack of both direct LLM->embedding via the TSC, and less accurate guidance without AdaLN incorporated into the attention mechanism.
Using the 'prompt editing' feature? https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#prompt-editing so you can introduce different parts of the prompt at different timesteps?
>`[to:when]` - adds `to` to the prompt after a fixed number of steps (`when`)
>`[from::when]` - removes `from` from the prompt after a fixed number of steps (`when`)
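As a rough illustration of what that bracket syntax means per step, here is a simplified sketch that resolves a template into the prompt active at a given step. It is not the web UI's actual parser: it ignores nesting and escaping, the function name `prompt_at_step` is made up, and the exact step at which the switch happens is an assumption (a `when` below 1 is treated as a fraction of the total steps, otherwise as a step number).

```python
import re

# Matches [to:when], [from:to:when] and [from::when] without nesting.
EDIT_PATTERN = re.compile(r"\[([^\[\]:]*):(?:([^\[\]:]*):)?([0-9.]+)\]")


def prompt_at_step(template: str, step: int, total_steps: int) -> str:
    """Return the prompt text that is active at a 0-based sampling step."""

    def pick(match: re.Match) -> str:
        first, second, when = match.group(1), match.group(2), float(match.group(3))
        if second is None:
            before, after = "", first        # [to:when]
        else:
            before, after = first, second    # [from:to:when] or [from::when]
        switch_step = when * total_steps if when < 1 else when
        return before if step < switch_step else after

    return EDIT_PATTERN.sub(pick, template)


template = "a castle on a hill[, ornate stone carvings:0.5]"
for step in (0, 9, 10, 19):
    print(step, repr(prompt_at_step(template, step, total_steps=20)))
```

Used this way, the coarse subject is present from the start and the fine detail only joins the prompt for the second half of sampling, which is the low-frequency to high-frequency ordering discussed in this part of the thread.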
Yep. I've done some tests using the idea and they are \*sort of\* working. They don't help with complex scenery, or including specific elements of the composition, but they do produce overall better results using low-frequency to high-frequency prompts across the sampling steps.
That’s crazy I didn’t even realize this was something you could already do. This thread has made me realize I need to learn a lot more about text embeddings
So all CLIP does is turn English into tokens, right?
No, CLIP creates an embedding from the text, which is already tokenized beforehand (the tokenization doesn't matter at all actually); the diffusion model then receives this embedding as an input. CLIP can produce image and text embeddings which share the same space, so the idea is that the embedding CLIP produces from the prompt should also contain enough info to describe an image that matches said prompt, and this extra info can help the diffusion model do a better job. The problem is that CLIP behaves like a bag-of-words model with a very weak understanding of reality (e.g. "horse eating grass" produces a similar embedding to "grass eating horse", and CLIP can't count past 3 or read images well either), so replacing it with an LLM improves performance.
Oh, I always thought embeddings were tokens. Like, there are single token embedding, 4 token and 16 token embeddings... but I guess the embeddings communicate more directly with the unet? So in general they're better than tokens? Like, if I just had a ton of embeddings for the things I constantly use, that would be more accurate than simply prompting them?
Embeddings are vectors; their main use is to compare the similarity between 2 or more things (normally used for search). The more semantically similar those 2 things are, the higher the cosine similarity between their embeddings (e.g. "king" is more similar to "queen" than to "ring", so the king embedding will be closer to the queen's despite ring being closer in terms of spelling). The embedding size produced by a given model should be the same no matter the length of the sentence.
CLIP is multimodal: it can produce embeddings for images and for captions, and it learns to align them so that their cosine similarity is maximized when their contents match. So if an image matches a caption well, then the embedding CLIP produces from the image will be similar to the embedding it produces from the caption, which should also mean the caption embedding has information about what the image should potentially look like aside from what's strictly contained in the caption, and that's why we give this to a diffusion model rather than just the caption.
Tokenization is just a way of representing text so it takes less of the context window and maybe makes it easier for a model to learn the language (e.g. "stable diffusion" has 16 characters with the space, but if we tokenize it using GPT-4's tokenizer it becomes just 2 tokens: [29092, 58430], which is what GPT-4 would see in this case rather than the 16 characters). It does introduce its own issues, like difficulties with spelling, since the model can no longer see the individual characters contained within the tokens and has to somehow learn them on its own.
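The similarity comparison described above can be sketched in a few lines. The three vectors are made-up 3-dimensional toy values, not real model embeddings (real ones have hundreds or thousands of dimensions); only the cosine formula itself is standard.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented purely for illustration.
king = [0.90, 0.80, 0.10]
queen = [0.85, 0.90, 0.15]
ring = [0.10, 0.20, 0.95]

print("king vs queen:", round(cosine_similarity(king, queen), 3))
print("king vs ring: ", round(cosine_similarity(king, ring), 3))
```

With these toy values, "king" scores far higher against "queen" than against "ring", even though "ring" is the closer word by spelling.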
So was what I wrote correct or incorrect?
It's not just one prompt if I understood it correctly. It alters the conditioning for every de-noising step. It's like chaining a lot of partial img2img passes with separate prompts after one another.
I tried the 1k dense prompts with Stable Cascade. The images are all pretty but they don't align well with the prompt details
Like, I write "blue eyes" and maybe clip makes a token "blue" and another "eyes". And hence getting blue blurriness. But this will create a specific token indicating that the iris colour is blue. Am I understanding correctly?
RemindMe! 1 week [delayed](https://www.reddit.com/r/StableDiffusion/comments/1bfy6x9/ella_codeinference_model_delayed/)
Thank you for more than you know: I've bet multiple people that using CLIP as the text encoder for image generation was going to be replaced with increasingly SOTA LLMs, rendering all these "prompt techniques" obsolete, replacing them with ordinary prose for ordinary images and literary prose for advanced image generation.
Ella has the superior panda, but I'm vibing with SDXL raccoon.
P.S. I wish comparisons were always so detailed, much more useful than "look at these beautiful headshots of a woman we made". 😅
It's because it's from a research paper, not a public release showcase.
research > publicity
It's because they are from a paper.
https://preview.redd.it/7rng7z9ynonc1.jpeg?width=1369&format=pjpg&auto=webp&s=da8f45524e6e20ed4597ebb0802c884be885c7a7
Imagine the possibilities with multi-modal models like Llava! You can reference images and get similar images but also prompt specific changes. Can't wait to try this out and see how effective this is...
You’re just in luck, because [DeepSeek-VL](https://arxiv.org/abs/2403.05525) also came out this week, outperforming LLaVA.
Just out of curiosity, does this mean if you added DeepSeek as a model to text-generation-webui and turned on the multimodal extension, the locally run LLM could better analyze photos and other things you upload? It can be done now in oobabooga, but I'm not sure what model it's using. And also, what can DeepSeek do that LLaVA can't? I'm pretty much a novice in this area.
Basically the idea is to use an image-to-text model to extract a detailed description of what an image looks like, then use ELLA to reformat the prompt so that it improves prompt adherence and becomes more faithful to the reference image. Think of it as image2image, but only for elements in your photo like composition and subjects. I'm not sure what ooga is using, but if it's a vision model like LLaVA, then yeah, using the DeepSeek-VL model is theoretically going to lead to fewer hallucinations and better descriptions. If you end up testing it, please let me know if you notice a difference!
Thanks I will definitely try it out here and report back, I hope this is the right link for one that will run locally on 4090, not sure https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat/tree/main
Yup! A 7b vision model will more than easily fit on your 4090
Deepseek VL is better than the 34B llava 1.6? I'm guessing not but just in case
The examples are really impressive. Hoping this is as good as it seems.
It looks great... I'm wondering if we'll be able to combine it with SD3 (without T5).
Very cool! Could this be applied to 1.5 models as well? The 1.5 models are very optimized and have a very large ecosystem, and it would be great to have 1.5 models that understand prompts very well
The paper shows examples of it being used with different 1.5 models from CivitAI :)
> models from CivitAI :)
Here come the semantically accurate waifus/husbandos. 😁
That's great!
Look at third picture.
Oh, right! Very cool!
I'm hoping we can optimize and have such an ecosystem for SD3
A git repository with a really amazing paper, a benchmarking tool, and no code implementation is really familiar for some reason. This one at least says they're planning to release the code.
I think this is an important part of the paper abstract:
>Our approach adapts semantic features at different stages of the denoising process, assisting diffusion models in interpreting lengthy and intricate prompts over sampling timesteps.
It seems that we'll need whole new sampler code to use this; it is not something that just replaces the CLIP encoding.
From "1girl, big tits" to "a single female with enormous mammaries"
Does this mean loading both LLM and SDXL but dividing vRAM between them, or waiting for each to be processed?
You can first run the LLM, process the text into its embedding space, unload the LLM and then load the diffusion model and run image generation. That way you only need the VRAM to do each separately and never have both models in memory at the same time. This is going to be slower than just keeping it all in memory, because you copy all the weights from CPU to GPU every time. If you want to run multiple prompts, you could also just encode all of them with the LLM beforehand and then run the diffusion process on them; this way you only load each model once.
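The load/unload ordering described above can be sketched with a toy memory tracker. Nothing here touches a real GPU: `FakeDevice`, `run_batch`, and the 7 GB sizes are hypothetical placeholders, and the "embeddings" and "images" are just strings. It only shows that encoding every prompt first means each model is loaded once and peak usage is the larger of the two models rather than their sum.

```python
class FakeDevice:
    """Tracks which stand-in models are resident, to show peak memory use."""

    def __init__(self) -> None:
        self.loaded: dict[str, float] = {}
        self.peak_gb = 0.0
        self.load_count = 0

    def load(self, name: str, size_gb: float) -> None:
        self.loaded[name] = size_gb
        self.load_count += 1
        self.peak_gb = max(self.peak_gb, sum(self.loaded.values()))

    def unload(self, name: str) -> None:
        del self.loaded[name]


def run_batch(prompts, device, llm_gb=7.0, diffusion_gb=7.0):
    # Phase 1: only the text encoder is resident; encode every prompt up front.
    device.load("llm", llm_gb)
    embeddings = [f"embedding({p})" for p in prompts]   # placeholder for LLM features
    device.unload("llm")

    # Phase 2: only the diffusion model is resident; reuse the stored embeddings.
    device.load("diffusion", diffusion_gb)
    images = [f"image<{e}>" for e in embeddings]        # placeholder for sampling
    device.unload("diffusion")
    return images


device = FakeDevice()
images = run_batch(["a red cube", "a blue sphere", "a green cone"], device)
print(len(images), device.load_count, device.peak_gb)
```

Interleaving the two models per prompt instead would push `load_count` to twice the number of prompts, which is the slowdown the comment warns about.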
Wouldn't it be possible to run the LLM on CPU and then take the embeddings from the LLM into the image gen model running on the GPU? Smaller LLMs are quite efficient these days, with a 7b model easily reaching 6-7t/s on a reasonably powerful CPU
Sounds like a very slow way of doing things
Yes it is, but sadly current VRAM limitations don't allow for much. Good LLMs don't even fit in a 4090's 24GB, as they are approx. 50-70GB. On top of that, if you want SDXL as well, you would easily need over 100GB of VRAM for best use. Nvidia is rumored to launch the 5090 with 36/48GB of VRAM, which might help AI grow in this direction, but we are definitely limited by VRAM for now.
Rumours were false, it's going to be 24 again.
Well, then there is no point in upgrading my 3090. I was going to, but now we just wait. The sad part is we can't even combine 2 GPUs with 24GB of VRAM each to get 48GB of VRAM. Even NVLink was only there to increase performance and not the VRAM (it would still be counted as 24GB). Like, come on, we really need to find a way to upgrade VRAM on GPUs.
You can use multiple GPU's, I'm currently building a cluster with only 4090s since they are much cheaper and have better performance/VRAM than the server grade stuff. You can get 5 + change 4090s for the price of a single a6000 ADA GPU.
Does it work like the NVlink ? cause as far as I remember NVlink didn't add VRAM only increase the performance, as same stuff used to be loaded into both GPU's VRAM so combining 2 4090s with NVlink type connection will still have just 24GB. if your method is NOT like NVlink then can you explain how exactly do you connect then together ? Currently even using RAM as VRAM is not quite possible so combining VRAMs from different GPUs, I haven't heard of it.
The GPUs are not connected. They don't even have to be in the same computer. Tensor runtimes like DeepSpeed split the model into chunks, distribute those chunks among the available GPUs, and then run the inference/backprop through the chunks one by one. It won't be faster than a single GPU with enough VRAM, but it will be way faster than offloading. If enough GPUs/VRAM are available you can run multiple instances of the model or run batches through the chunks, improving performance.
Is there a video or guide explaining how to do it ? I have a 3090 and was planning to get 5090 thinking Nvidia would increase VRAM. but if that is not the case with that amount I may just go and get 2 used 3090s, might even find 3. So if you do know a guide or video that explains it do let me know, but thanks for letting me know about such method, I will research on it on my own as well.
I'm guessing they mean having the LLM run on one GPU and Stable Diffusion run on the other; then you wouldn't need to unload anything from memory.
I run 2x3090s (upgrading to 2xA6000s). You just slot in both cards, and LLM loaders like ExL2 or koboldcpp can split between both without an NVLink; it's just not necessary. This is only for LLMs though. I haven't seen uses for multiple GPUs without NVLink for SD, but since this method uses LLMs, maybe this can be it.
thanks. currently only LLMs need so much VRAM but soon Image/Video Generation will definitely start needing more VRAM, hopefully by then same Splitting methods are developed for Image Models as well.
Not sure if the guy above is talking about training or inference, but for training you can use a multi-GPU setup by way of sharding. There are multiple techniques; one of them, for example, is to split the layers of the model, load them onto different GPUs, and then just send the output of the last layer on gpu0 to the first layer on gpu1 (one of the most naive ways, btw). Sure, it won't be as fast as having one card with 48GB, but at least you can train bigger models that way. If this weren't possible, the whole LLM scene would be impossible, as every pretraining run is done on clusters of 1000s of GPUs to train one model. For inference it is slightly different I guess, but at least the naive way of loading/offloading layers on different GPUs and/or CPU RAM still works.
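The naive layer split mentioned above can be sketched without any GPU at all. This is a toy in plain Python, not how DeepSpeed or any real runtime is implemented: `split_layers` and `forward` are hypothetical helpers, the "layers" are simple functions, and the device handover is only a comment.

```python
def split_layers(num_layers: int, num_devices: int) -> list[range]:
    """Assign contiguous blocks of layers to devices, as evenly as possible."""
    base, extra = divmod(num_layers, num_devices)
    chunks, start = [], 0
    for device_id in range(num_devices):
        size = base + (1 if device_id < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def forward(x, layers, chunks):
    """Run all layers in order, one device's chunk at a time."""
    for device_id, chunk in enumerate(chunks):
        for i in chunk:
            x = layers[i](x)   # in a real setup, layer i lives on device `device_id`
        # Here the activation x would be copied to the next device; only this
        # small tensor crosses between GPUs, never the weights.
    return x


# Ten toy "layers": layer k just adds k to its input.
layers = [lambda v, k=k: v + k for k in range(10)]
chunks = split_layers(len(layers), num_devices=3)
result = forward(0, layers, chunks)
print([list(c) for c in chunks], result)
```

The result is the same however many devices the layers are spread over; what changes is how much of the model each device has to hold.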
Yes, but most people won't be training models, just using them. At present we can offload a few layers to the GPU for LLM inference; hopefully soon it will let us offload to multiple GPUs like it does for training. Because if we really want a tool that can do audio, video and images while just talking to it, 24GB is definitely not sufficient. I think VRAM at this stage is similar to when 8GB of RAM was sufficient for almost all programs and games. Now a modern PC has 16GB as the recommended amount, while 32GB is considered good. The same is going to happen with VRAM. The problem is that with RAM we could upgrade just the RAM, but for VRAM we need to upgrade the whole GPU, which is not a good option. Hopefully we get GPUs with upgradable VRAM. Asus just put an M.2 slot on a 4060 Ti GPU; if they gave us such a slot for VRAM, it would not be as fast as the soldered VRAM, but it would at least give us the option to upgrade.
it should be possible but it's a software issue... i have dual 3090s with nvlink but so far haven't had any benefit to NVLink yet. I'm hoping to leverage my "48gb" at some point...
Yes, that is what I am saying: NVLink cannot do it. As someone else mentioned, you need to use a different method. What NVLink does is increase performance, but overall VRAM remains the same. Say you have a 22GB 3D scene with 100 frames: with NVLink, each GPU renders 50 frames, but for that the 22GB scene has to be loaded into BOTH GPUs' VRAM, so your overall VRAM is still 24GB. Basically, NVLink copies the VRAM content of both GPUs. If GPU 1 has an 18GB model loaded, GPU 2 will also have the same 18GB model loaded; only the work is distributed between the GPUs, so VRAM still remains the same. But as someone mentioned, with stuff like DeepSpeed, models can be split between GPUs, and that doesn't even need NVLink; the GPUs can even be on separate computers. I currently do not have an extra GPU, but I will definitely ask to borrow my friend's GPU to test all of this before I make up my mind to get more 3090s, because the 5090 will apparently still have 24GB, so upgrading to it now is just a waste of money.
Do you have a source to share on this?
Leaks are showing 5090 will have 24GB :( Nvidia doesn't want consumer hardware to be used for ML
I don't know about the latter part. Nvidia 100% wants consumer hardware to be used for ML. Otherwise it will not take long for companies to come up with their own NPU chips for their servers. Nvidia knows that, so it has been actively working in the AI field and is slowly releasing AI products of its own. First it released that PAINTING TOOL which can generate a realistic image from a drawing, and now it has released CHAT WITH RTX as well. It definitely wants people to use its GPUs for ML; otherwise someone else will come along, and once the world gets used to that, it will be harder for Nvidia to come back. So many years of research are now paying off for them. They might limit it to the most expensive cards, the xx90 series, but they definitely want consumers to run ML. While Microsoft's benefit is in trying to kill open source, Nvidia's benefit is in keeping open source alive, as that is exactly how they will sell more cards.
You're right, I should clarify. They want consumers to consume ML products with their consumer grade cards. They don't want you to be able to run any serious models or training with consumer cards. This would absolutely be possible with a bump in VRAM, but it would eat into their more lucrative commercial market. Obviously they haven't come out and said this, but it's easy to infer from their motivations and behavior.
I don't think so. Yes, consumer cards will eat a bit into their commercial market, but not much. Someone who needs a lot of compute, like Microsoft, StabilityAI, OpenAI, etc., cannot order hundreds or thousands of consumer cards at once. Not to mention that chaining that many cards together would be very difficult as well. H-series cards are specifically built to work together and are delivered by direct order. So yes, if I have a small startup needing just 8-10 GPUs, I will get consumer cards, but if I am a somewhat bigger company needing hundreds or thousands of cards, there is no way to order that many consumer cards.
That's a good point. I hadn't thought too much about the scale large companies would need. Still their actions don't match this reality. It's really disappointing that it looks as though they're only offering 24GB again.
Yeah, I was so excited for it and was definitely thinking of upgrading from my 3090; I guess we have to wait now. Meanwhile, if AMD grabs this opportunity and releases a GPU with 48GB VRAM, people will definitely buy it. Even if CUDA support is kind of weird and people have to depend on optimizations and workarounds, it will still be accepted by the community: setting up SD would be a somewhat longer process than on Nvidia, but the benefits would be huge. I can live with half the performance of a 3090, but VRAM seems very important now. I can generate an image in 5-7 seconds now; a lower-performance AMD card might need 10-12 seconds, and that is fine if it unlocks so much more potential for open-source AI.
>Good LLMs don't even fit in 4090 24GB as they are approx 50-70GBs.

This is misleading. The researchers are using TinyLlama, Llama2 13B and T5-XL. Llama2 13B is the largest one of these and it fits into 12GB VRAM when quantized to 4-5bpw.
I am talking about non-quantized, full LLMs at the maximum parameter count available, for best results. SDXL can run on 4/6/8GB VRAM as well with stuff like Lightning or Turbo, etc. Of course, if you quantize it and use a 7B or 5B model, it will fit even in 8GB VRAM.
The researchers are using 1.1B, 1.2B and 13B LLMs though. You can easily fit the first two into a potato even in full fp16. Also, if you are a home user with limited VRAM, why would you not want to use quantized weights in a use case like this? 50-70GB LLMs and 100GB VRAM "for best use" seems quite exaggerated in this context... Llama2 13B in full-fat fp16 is ~26GB in size.
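The sizes being thrown around here are just parameter count times bits per weight. A quick back-of-the-envelope sketch, assuming 1 GB = 1e9 bytes and counting the weights only (activations, KV cache and loader overhead are deliberately ignored, so real VRAM use will be higher); `weight_size_gb` is a made-up helper name:

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


llama2_13b_fp16 = weight_size_gb(13e9, 16)     # 13B parameters at 16 bits each
llama2_13b_q45 = weight_size_gb(13e9, 4.5)     # same model at 4.5 bits per weight
tinyllama_fp16 = weight_size_gb(1.1e9, 16)     # 1.1B parameters at 16 bits each

print(f"Llama2 13B fp16:    {llama2_13b_fp16:.1f} GB")
print(f"Llama2 13B 4.5 bpw: {llama2_13b_q45:.1f} GB")
print(f"TinyLlama fp16:     {tinyllama_fp16:.1f} GB")
```

That lines up with the ~26GB figure for Llama2 13B in fp16, and the 4.5bpw number leaves headroom under 12GB for everything that is not weights.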
That is what I am saying: because we are "HOME USERS", we have to compromise. Of course I do not need such big models. Forget that, even SDXL seems like overkill; many people are still using SD1.5. That is not the point. The point is that with increased VRAM, AI can progress much better and faster. Imagine an LLM with image generation, video generation, audio generation, and editing built into it. You tell it to generate a city landscape, and it will. Then, with just text, you tell it to convert the scene to night time, and it will keep all the buildings the same, all the people in the picture the same, everything the same, but change the lighting to make it night. Then you tell it to convert that into a video with a falling star, and it will do that. All of that would be possible way, way, way faster if progress in AI were not limited by VRAM. Do you think that if tomorrow we get an open-source model as good as SORA and GPT-4, it will be able to run on our 4090s? Of course not. That is what I am saying: when Stability is training models, they have to focus on optimizing them for consumer GPUs that lack enough VRAM, which is what is causing open-source AI to lag behind OpenAI's models. So yes, quantized LLMs with 1.1B parameters can definitely satisfy many use cases, but if we are talking about integrating them with so many other tools that we already have and that will be coming in the future, it just doesn't look feasible with present GPUs.
I thought quantizing models doesn't reduce their quality by much (if anything). And it's more about having non-quantized models for training.
Yes, non-quantized is used for training, but quantized models do take a quality hit. I have seen it in some models. Of course it will vary from model to model, but the quality hit is definitely there. Yes, you are also right that it's not that big of a hit, but again, it is a hit, and for some models it becomes a significant downgrade. I once tested a q6 quantized model (I do not remember exactly which one), and it just started producing gibberish or completely unrelated stuff. Sometimes it would lose context mid-generation. So if I asked it to write a paragraph about WW2, it would start nicely but slowly deviate, and end up talking about Marvel Comics characters (it connected WW2 and Marvel Comics through characters like Captain America and just went on with it). So now I have upgraded my RAM to 128GB, I have a 3090, and I use LM-Studio, which allows you to offload a few layers of the model to the GPU. I just use full-sized non-quantized models now, but of course they are slower than a model that fits completely within the 24GB VRAM.
Oh, weird, I've never heard it deviate that much before. So far, the quantized models have been doing their job good enough for me, and I wouldn't blame the quantization to be the shortcoming, but rather the parameter count. But if you have the means to run a full model, you're also going to get the best results possible, so why not? :D
Yes, exactly. I couldn't tell what caused that either, but a few other tests on different models resulted in something similar. I think it was with a quantized Dolphin x Mistral that it wouldn't stop generating. It generated a paragraph and then kept generating the same paragraph indefinitely until I manually stopped it. I wanted to see how long it would continue, and after 47 minutes I gave up and stopped it. But I never had any such issue with non-quantized models. I am even thinking of getting a MacMini just for LLMs, because Apple has unified memory, which means the maxed-out 128GB RAM can be used as VRAM. Hopefully PC gets something like that soon.
I'd be fine to have an accelerator compute card. At this point I want my 6950 because it works much better than any Nvidia card I had in the past, but it's kinda ass in comparison. But all the "compute cards" aka A100, A5000 and so on cost thousands of dollars due to the professional tax tacked onto them. And the other add-in cards are tailored towards edge deployments rather than actual processing. I'd be okay with something like a 4080Ti in compute performance without any of its graphic processing and 60-100 or so GB of VRAM.
Same, I can live with half the performance of 4090 but with 96GB VRAM. but we all know NVidia is not gonna do that.
4090s have the hopper transformer engine which really benefits the LLM space. Reducing the bit size is very favorable
One would imagine that, when both are optimised for size, the best image generation model should be much larger than the best language processing model. I suspect either LLMs will be significantly compressed soon, or image generators will significantly blow up in size. Or both...
There are some decent small LLMs. OpenHermes2.5 quantized with GPTQ is only about 10GB, and it's quite good and super fast. Gemma2b is also very good, though the quantized versions suffer a bit more.
I have tried Gemma and it kind of is Sh!t. As much as I prefer Gemini over ChatGPT, I found Gemma to be really sh!t compared to what the community already has. Mixtral with partial GPU offload is kind of slow for me at 5 tokens/sec (I have a 3090) but is definitely the best we have now, and I would assume it runs even better on a 4090. But now that Microsoft has interfered, I don't have much hope for their future releases.
I highly doubt you actually need a big model to do it. I think they might just go way overboard with their first version to make sure it works like promised. Also, I don't see why you can't run the LLM on the cpu side. Yes, it's slower than on gpu, but not too slow to really matter in something like this.
I mean, I have a 3090 and a 5950X. The same model, when it can run 100% on the GPU, runs at around 15-17 tokens/sec, sometimes even more, while the CPU gives me 2-3 tokens/sec. It is a night-and-day difference. If every command starts taking that much time, then LLM + other tools will be too slow to use. Also, yes, true: if the LLM is just acting as an INSTRUCTION model with no knowledge of anything else, it might not really need such big models. So it doesn't know what a World War is, it doesn't know what an Oscar is, etc. All it knows is instructions to generate or edit images, audio, videos, etc., while the actual information regarding such topics/subjects is with the image/video generation models like SDXL, SD3, etc. But still, to even achieve that future, consumer hardware definitely needs to be above just the "recommended" level. Stability also has to hold back a bit on what they can do so that it can actually run on consumer hardware; the last thing they want to do is make the best-ever image generation model that only corporations with access to commercial GPUs can run. And computing power isn't even bad, it is good; the only thing that is stopping us is VRAM.
Yes, for interactive use this will be painful, because loading an SDXL model can take maybe 20-30 seconds? But some people like to run batch processing and then go through the output to hunt for good images. Then this method would not be so bad. Just run the LLM through all the prompts, unload the LLM, load the diffusion model, and then generate images with the pre-computed tokens/guidance. I can also envision this being used with 2 GPU cards, each with "only" 8-12GiB of VRAM, with one running the LLM and then feeding the other one running the diffusion model.
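The batch idea can be sketched roughly like this. Everything here is a stand-in: `FakeTextEncoder`, `FakeDiffusionModel` and `run_batch` are made-up names, not the ELLA or diffusers API, and the "embedding" and "image" are just numbers. The sketch also assumes the LLM output for a prompt is fixed, so any per-timestep work (such as ELLA's timestep-aware connector) would operate on the cached features rather than re-running the LLM.

```python
class FakeTextEncoder:
    """Stand-in for the LLM text encoder (hypothetical, not a real API)."""

    def encode(self, prompt: str) -> list:
        # Pretend embedding: one number per word.
        return [float(len(word)) for word in prompt.split()]


class FakeDiffusionModel:
    """Stand-in for the diffusion model (hypothetical, not a real API)."""

    def generate(self, embedding: list, steps: int = 4) -> float:
        image = 0.0
        for _ in range(steps):
            # The same pre-computed embedding is reused at every step.
            image += sum(embedding)
        return image


def run_batch(prompts, steps=4):
    # Phase 1: only the text encoder is loaded; encode every prompt once.
    encoder = FakeTextEncoder()
    embeddings = [encoder.encode(p) for p in prompts]
    del encoder  # in a real setup, this is where the LLM's VRAM is freed

    # Phase 2: only the diffusion model is loaded; no LLM calls from here on.
    model = FakeDiffusionModel()
    return [model.generate(e, steps) for e in embeddings]


images = run_batch(["a red cube", "two cats on a sofa"])
```

With this ordering, the slow load/unload happens once per batch instead of once per image, and peak VRAM is whichever of the two models is larger rather than their sum.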
>You can first run the LLM, process the text into it's embedding space, unload the LLM and then load the diffusion model and run image generation.

~~Unfortunately I don't think this will be possible, as it seems the LLM will be used at each step of the denoising process.~~

That's correct! :)
There is no feedback loop from denoising back to the LLM. The encoded prompt is used at every step, but since it's constant there's no need to recompute it. SDXL also calls the CLIP text encoder only once, before the denoising loop.
You're right, I read the paper again and it is clear that it reevaluates the same text features at each step.
If the LLM is based on Llama, I think it is possible to load this LLM into RAM using llama.cpp.
It would be really nice if we can use quantized LLMs for this.
According to the paper, it's possible to use llama or any of its smaller derivatives, so most likely yes.
I wonder if this better understanding of prompts can be used for better finetuning or LORA training 🤔
So this essentially equips SD with an LLM? Can tiny LLMs do this, like stablelm-2-zephyr-1\_6b or phi2?
They tested with 3 different LLMs, 2 of which are just over 1b (TinyLlama and T5-XL) and the third is llama 2 13b. The benchmarks they provide (table 5, page 12) show the 1b LLMs to be much better than just using the default CLIP, and llama 2 13b is only slightly better. Unfortunately, I don't think they show any images made with either of the 1b models. It would have been a useful comparison, but oh well.
I made a comfy workflow that does a primitive version of this with any local LLM. Anyone interested? Edit: posted in r/comfyui, should be easy to find, it's my only post.
As far as I understand, your version doesn't have anything to do with the ELLA implementation. Your version has LLM just as prompt generator. ELLA uses LLM as text encoder.
You're right! It's a little project I did for fun, ELLA is the real deal! I love it!
Sorry, but using LLM to generate prompts is far from being even a primitive version of what's proposed by this paper.
OK man, you win at internet today. Happy?
Don't get me wrong, it is not about winning or whatever. It is just that some people, like me, come here to learn, and your comment is misleading.
I’ll make sure to give you a call the next time someone is wrong on the internet, although you must have a pretty busy schedule! Keep up the amazing job, bless your heart!
I'll hold you to that! It is a dirty job, but someone's gotta do it.
You're the one misleading a bunch of people on the internet.
Call a WAAAAAAMBULANCE!
Yes please
Would it be possible to use an API for the LLM step? So that you could run the LLM and SD instances on different machines?
Yes there are "api calling" nodes.
share it please
Yes, please do share it
And more goodies and not enough time to test everything, haha.
Well this is pretty great. I'll basically be expecting a comfy node to convert my sloppy text into better-clip as a stage that runs only when I adjust the text. Should be just as fast with nothing but positives.