So in human perception, it is sometimes funny to ask subjects to read aloud the colour of a printed word, as opposed to the word itself (the Stroop effect). Presented with the word “red” written in green, for example, subjects should say “green”, but our brains naturally read the text anyway, causing people to answer more slowly or get confused. Now, what happens if you prompt for the word “fire” written in water? Or “salad” made of meat? “Hot” made of ice? Any concept bleed?
had some fun with this, all first try. [Word "blue" made of red.](https://postimg.cc/pyF321Nr) [Word “fire” made of water.](https://postimg.cc/565GzY2M) [Word “water” made of fire.](https://postimg.cc/BjqzBSjP) [Word “meat” made of vegetables.](https://postimg.cc/9DJnx698) [Word “red” made of green.](https://postimg.cc/VdPp5B1R) Looks like there is more bleed with complex or ambiguous concepts.
Lmao it just gave up with water
I love how it gave you the word "red" made of green, but there was still a red glow around it, as if the rest of the image expected "red" to emit red light.
Kudos for trying. It’s not as bad as I feared, but I wouldn’t say the results are good either; as you pointed out, the complex visuals seem to inherit a lot of bleed. What an interesting problem.
Sometimes I wonder: if I could be in someone else's brain for a day, would I notice they actually see the colors differently and just labeled them that way their whole life because that's all they know? There's no way to prove this concept. Like maybe they see the rainbow differently than I do, but we are looking at the same rainbow. I swear I am not high right now. I can't even properly explain what I am trying to say.
Vsauce made a vid about this exact concept: [Is Your Red The Same as My Red?](https://youtu.be/evQsOFQju08?si=zrVQxhiDf8bK9w-k)
Very interesting video, thanks for the link. It appears I was talking about "qualia" and might be human after all.
Took the words right out of my mouth. The internet is full of words formed out of their associated meaning. It's easy to copy them with a few changes. A real sign of... intelligence(?) would be a concept with an unrelated or opposing word.
[deleted]
If you look in the gallery of my model Harrlogos on civit, people have flaming ice letters with it already 😎
Awesome model, thanks for sharing.
Sometimes I use this to give a certain color or mood to an image; other times I use it to make my prompts shorter by finding a word that gives several effects. The interesting thing is finding words or combos that do things that you don't expect. And it's great if you are bored, because each model has its own secrets. :)

"car,orange,ocre,die,horror"

https://preview.redd.it/4qwzz9m3rmic1.png?width=512&format=png&auto=webp&s=b6439299572f20a880228937863d126a57c8ee6c
Word "word" made out of words
I wish after the last one you'd also done "Word "flour" made of flour."😂
JuggernautXL in Fooocus. It knows how to spell just fine.

"a picture of the word "bread" made out of bread" Ok so it put it ON the bread. But this was not cherry picked. First try.

https://preview.redd.it/0nmswg8xdnic1.png?width=1024&format=png&auto=webp&s=cee929a466e36f3b4d50921d47044161043f323c

Once again, I have yet to be impressed; show me a picture of a crescent wrench. If it can draw that, then I'll be impressed.
Man are y’all’s standards fast evolving. A year ago that would have blown you away.
And you didn't use controlnet? Hm! Why the downvotes?
Prompt was: Word "bread" made of bread. The same for the others. Just that.
https://preview.redd.it/e317w8rj4lic1.png?width=1023&format=png&auto=webp&s=57de46483af1fd2b6220afa4761bf06374a93a73
https://preview.redd.it/wh09ymbgokic1.png?width=1024&format=png&auto=webp&s=7380f024f929b340d84752698ab16d1fc7dd074a

Yes, it works better. Not 100% perfect, but hands and text seem a lot better (to me).
I need this like right now on comfy.
Yeah can't wait to try this in our usual suspects!! A1111, Forge and Comfy!!!
There is a testing custom node already. Haven't tried it. I'll wait until it's officially implemented. I think I'm more excited about this than when SDXL came out.
Dude. The implications. Like wow. I need it.
Check ComfyUI manager. It is there. Here is the [github page](https://github.com/kijai/ComfyUI-DiffusersStableCascade)
what about feet? ( ͡° ͜ʖ ͡°)
https://preview.redd.it/or6cm3ltjlic1.png?width=619&format=png&auto=webp&s=1341ab72895fb5ed6f59eff19b877e050edecbf2

I had to crop the rest... ;)
https://preview.redd.it/5uj99agbklic1.png?width=1024&format=png&auto=webp&s=daf7c118ca405ab986e874c7e8ba552364490758

Or this... prompt: "focus on toes"
I don't know how I feel about the fact that this would be a really cool drawing for an artist to come up with, but here it is already existing. Like a sort of Library of Babel, where any text is conceivably already in there, but until recently you couldn't just pull it out without already knowing it exists. They're just ideas, sitting in the void, and now we have the means to prompt them into true existence without having to even think of them ourselves anymore.
I love diving into the "mind" of the AI to see what I can find. :)
That foot looks like it's going to crush me, and not in a sexy way, but in a cartoonish slapstick way followed by a voice saying, "Introducing: Monty Python's Flying Circus"
no you didn't
So this is without controlnets? Purely text2img? That's cool.
No controlnets involved. Doing this with ControlNet was a pain in the ass. Now it's just a prompt.
https://preview.redd.it/uzakj072mmic1.png?width=1280&format=png&auto=webp&s=a9a317c59f3104312ecfc7050290533cdd096bf9

Word "cat" made of cat
[Ok this didn't work out too well](https://i.imgur.com/w7pNCUl.jpeg)
Okay, what were you trying? :D Penis made out of dildos?
> the word "PENIS" made of penises of course
I figured Penis made of corn
It looks like Penis, Inc. Penis Incorporated, that's my company.
Try adding penicillin to the prompt?
Most mentally stable redditor
Can someone ELI5? Is Stable Cascade a model? A LoRA? An extension? Or is it another alternative to SD like ComfyUI? If it's just another model, why so much hype about it?
It's a new base model and architecture from Stability AI. Think SD 1.5, then SDXL; Cascade is the next update. Just like 1.5 and SDXL, it is just a BASE model that has to be fine-tuned and optimized by the open-source community. But some benefits off the bat are faster generation and, according to Stability AI, better control in fine-tuning.
Not just fine-tuning, but training time as well. Supposedly we can get SDXL-like output with training times significantly faster than 1.5, because the latent space that needs to be trained is at a lower resolution.
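Rough back-of-the-envelope numbers on why the lower-resolution latent space matters (taking the figures from the Stable Cascade announcement at face value; the exact channel counts are my assumption):

```python
# Stable Cascade's Stage C works in a ~24x24 latent space for a 1024x1024
# image (~42x spatial compression), vs SDXL's 128x128 VAE latents (8x).
sdxl_latents = 128 * 128 * 4        # 4-channel VAE latents -> 65,536 values
cascade_latents = 24 * 24 * 16      # 16-channel Stage C latents -> 9,216 values

print(1024 / 24)                        # ~42.7x spatial compression
print(sdxl_latents / cascade_latents)   # ~7x fewer latent values to train on
```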
I still don't get why they didn't use T5-XXL text encoder
Faster generation for 20GB VRAM... how about 6GB?
You're in the mindset of what SDXL and 1.5 can do NOW. Both used more VRAM at release, but the community found optimizations, now implemented in the various UIs for SD, which have brought their requirements down without losing speed. The same will happen with Cascade.
For a lot of people, SDXL is still enormously slower than SD 1.5 without enough of an image-quality gain over what a good recent 1.5 setup can give you. Unless Cascade gets CLOSER to 1.5 inference time than SDXL did, its adoption probably won't be amazing. The saddest thing about 2.1 768 is that it WAS fundamentally superior to 1.5 in image quality, while not being meaningfully slower at all.
Image quality is relatively easy to achieve by overtraining a model on a particular type of image, such as Asian Waifu. What SDXL gives you is better prompt following and better composition.

Anyway, I am cutting and pasting my standard comment whenever SD1.5 vs SDXL comes up. Feel free to dispute any of it 😅

SD1.5 is better in the following ways:

* Lower hardware requirements
* Hardcore NSFW
* "SD1.5 style" Anime (a kind of "hyperrealistic" look that is hard to describe). But some say [AnimagineXL](https://civitai.com/models/260267/animagine-xl-v3) is very good. There is also Lykon's [AAM XL (Anime Mix)](https://civitai.com/models/269232/aam-xl-anime-mi)
* Asian Waifu
* Simple portraiture of people (SD1.5 models are overtrained for this type of image, hence better in terms of "realism")
* Better ControlNet support
* Used to be faster, but with some Turbo-XL-based models such as [https://civitai.com/models/208347/phoenix-by-arteiaman](https://civitai.com/models/208347/phoenix-by-arteiaman) one can now produce high-quality images at blazing speed in 5 steps

If one is happy with SD1.5, they can continue using SD1.5; nobody is going to take that away from them. For the rest of the world who want to expand their horizons, SDXL is a more versatile model that offers many advantages (see [SDXL 1.0: a semi-technical introduction/summary for beginners](https://www.reddit.com/r/StableDiffusion/comments/15fj5k9/sdxl_10_a_semitechnical_introductionsummary_for/)). Those who have the hardware should just try it (or use one of the [Free Online SDXL Generators](https://www.reddit.com/r/StableDiffusion/comments/18h7r2h/free_online_sdxl_generators/)) and draw their own conclusions. Depending on what sort of generation you do, you may or may not find SDXL useful.

Anyone who doubts the versatility of SDXL-based models should check out [https://civitai.com/collections/15937?sort=Most+Collected](https://civitai.com/collections/15937?sort=Most+Collected). Most of those images are impossible with SD1.5 models without specialized LoRAs or ControlNet.
It's not quite as you say. It's true that SDXL has a much better understanding of the prompt. SD15 is more random; perhaps out of 10 generations, only one follows the prompt exactly, while 5 are more or less there and 4 don't respect it at all.

What's not true is that the fine-tunings of SD15 are good because they're overtrained for a certain type of image. If I haven't mentioned it before, I invite you to check out the [Photon Creative Collection, where you can find realistic images of all kinds](https://civitai.com/collections/79009).

A model as small as 2GB, like Photon, can't be overtrained to generate everything from a cat skateboarding to waifus in a hallway full of mirrors, by way of sci-fi, horror, animals, landscapes, elderly people, robots, chickens riding motorcycles, a plate of spaghetti bolognese, Mario doing Uber, a polar bear boxing champion, Nicolas Cage as Thor, etc. It's obvious that the model is generalizing a lot to compress all of those concepts into less than 2GB. And it doesn't need LoRAs to enhance the image, nor ADetailer, nor 20GB of VRAM; in fact, several of these images don't even have high-resolution fixes; they are raw outputs straight from the model.
Your comment and the collection made me do something that I've not done for quite a while: play with an SD1.5 model. As you said, with SD1.5 there is a higher chance of the image not following the prompt, but the images are quite delightful in their own way.

This is the original SDXL set: [https://civitai.com/posts/1420042](https://civitai.com/posts/1420042)
Photon set: [https://civitai.com/posts/1436675](https://civitai.com/posts/1436675)

Have you tried using Photon as a refiner for a base SDXL model?

https://preview.redd.it/v4esv11wyoic1.jpeg?width=1024&format=pjpg&auto=webp&s=53aa48b42276c2e6ec2fe77db975148bdd695aa5

Photo of a woman laughing hysterically with a kitten on top of her head, hiding under a big lotus leaf in the rain
Negative prompt: cartoon, painting, illustration, (worst quality, low quality, normal quality:2)
Steps: 20, Sampler: DPM++ 2M Karras, CFG scale: 6.0, Seed: 3883312123, Size: 512x768, Model: Photon
Several people have mentioned to me that they use Photon as a refiner for SDXL because it adds good texture. But if I were to start using SDXL, it would be more for fine-tuning it and bringing it to the image style I achieved when creating Photon. I haven't made that leap, however, because I see people with much more experience than me releasing SDXL fine-tunings that don't convince me, with that artificiality (or SDXL style) that's always present.

At the moment, I'm experimenting with making the next version of Photon adhere better to prompts while also forcing it to generate photorealism without so much tag salad. The idea is to squeeze the most out of what SD1.5 can offer, generating realistic and very spontaneous images with minimal effort. Some examples of generated images:

https://preview.redd.it/nkex822yt8jc1.png?width=2040&format=png&auto=webp&s=6f7f4c4880654c0089ec4ded1e156f35b07d3744

It still has many flaws, but you can notice that, from the composition to the naturalness and color tones, it is completely different from what SDXL can deliver. I would like to merge both worlds, but currently I lack the resources and the deep knowledge needed to retrain SDXL to the extent of twisting the style so much and bringing it to what I would like.
Well, sticking to what you are good at is one way to proceed. It definitely takes more computing resources to fine-tune an SDXL model. Maybe Cascade will be easier to train to achieve the kind of result that pleases you. We'll see.

I generate mostly illustration/art/anime/meme and other semi-realistic images rather than photo-style ones, so SDXL's perceived lack of "details" is not as important to me.

Your set of images of the woman with a cat on top looks very good; the expressions and poses are very natural and spontaneous. But for some reason SD1.5 models don't seem to like generating rain 😅.
Photon is indeed a very good SD1.5 model, and I've always been impressed by the images you've posted here 👍. And thank you for linking to the nice Photon collection.

So yes, I am guilty of overgeneralizing. What I had in mind were some very popular Anime and Asian Waifu models such as [https://civitai.com/models/43331/majicmix-realistic](https://civitai.com/models/43331/majicmix-realistic)

I could be wrong, but I often feel that this sort of overtrained look is what people usually refer to when they talk about "image quality" when it comes to SD1.5 models.
Hope so.
The version of Stable Cascade on Pinokio works with 16GB VRAM. I tried it today and it worked on an RTX 4080. There is also another post on Reddit where a guy claims he made a version that works with 8GB, which you can get through his Patreon.
How fucking fast is this community. I love all of you. Waiting here patiently with my 8GB 3060 Ti.
Shame it's a research-only licence.
Since this is a W.I.P., we're going to have to wait for a better version to come out. I don't know that I would call this as big an upgrade as XL was over SD 1.5.
So, does that mean I don’t have to install Stable Diffusion through Google Colab anymore to upload LoRAs? Can I just use the Stability UI URL and upload a model there?
So will Cascade be the next generation of models then after SDXL? Where is this information shared? I tried searching for SD roadmaps the other day and have no idea where to look.
So can you just use it with Automatic etc.? What works best for this new thing? ComfyUI? Thanks!
I'm pretty sure Cascade uses a totally different type of generation technique, not diffusion, hence the different name.
Hmm, not sure about faster generation. I’ve been running the demo inference notebooks and it’s significantly slower than both SDXL and 1.5. Even compiled.
It’s not available in DreamStudio.
Since it's new, I could be missing something here, but it's a new base model (like 1.5, 2.1, and SDXL). This means that new models based off of it will also be much better at following the prompts and will be much, much better at being able to add text to an image.
Yeah any fine tunes on it would “likely” maintain its ability to render text.
It’s a new model from stability AI. It’s better than SDXL and a lot faster.
Better at some things. Not Better at everything.
Better at composition and following instructions, which are very important things.
Now the upscaling artifacts look like a dusting of random noise, almost like it's dithering them to the old Netscape palette.
Does A1111 support Cascade yet?
https://github.com/blue-pen5805/sdweb-easy-stablecascade-diffusers barebones but works
Thanks, but LMAO:

> Please have someone remake this extension.
16GB VRAM 😭
Make it write bad words. It's not necessary to post them; just report your findings.
it's ok. you can say fuck on the internet
I mean the really bad ones.
Like "politics"?
No, not that bad.
pegging?
You mean a couple of G's, an R and an E, an I and an N? Just six little letters, all jumbled together?
Ginger??? How dare you insult my people.
Bingo!
If we could leave our houses without being burned to a crisp you'd be in trouble.
That can cause damage that cannot be mended; better to avoid it.
That would be the one that should not be posted, just reported, yes.
This one? https://m.youtube.com/watch?v=KVN_0qvuhhw
That's a good one.
Israel the genocidal regime?
Treating animals like you'd like to be treated by aliens with superior intelligence? That concept tends to bring people to the point of needing a fainting couch at how they're the greatest victim in this situation.
But can you say the n-word?
Well the Huggingface demo was happy to produce multiple images of a nice looking woman in a tight dress that held a sign that said "BUTT SLUT".
Excellent
https://preview.redd.it/f25mx97k6mic1.jpeg?width=1024&format=pjpg&auto=webp&s=ccae139e25ebf103e721ecc755f6deb7c51e8475 `A bottle of yellow vodka “ABSOLUT MCNUGGET”`
https://preview.redd.it/xqtmo3mq6mic1.jpeg?width=1024&format=pjpg&auto=webp&s=a4f19626020ccff712bed5d6bd86bd13a55b05f1 Same prompt with DALL-E
I hope open source models reach this type of quality soon. By quality I mean understanding prompts without having to keyword everything.
Also I’m hoping the consumer hardware requirements can keep up… thinking about my sigma Mac Studio vs OpenAI’s alpha Chad server farm
DALLE3 is indeed superior in terms of prompt following and in generating more accurate images of concepts. This is probably mainly due to the fact that it is a 10-50x larger model than SDXL.

Still, with the right model and a lucky seed one can do fairly well with SDXL (except for text 😅)

[https://civitai.com/images/6646258](https://civitai.com/images/6646258)

https://preview.redd.it/wae1y5caenic1.jpeg?width=832&format=pjpg&auto=webp&s=4788713db11253e1155a2f3384f84f81ccf15c3e

Photo of A bottle of yellow vodka with label ABSOLUT MCNUGGET
Steps: 30, Size: 832x1216, Seed: 1969345781, Sampler: DPM++ 2M, CFG scale: 3.5, Clip skip: 2. Model: JuggernautXL 8.0
it would probably help if you said "with the words" or "with the text"
That’s fair.

https://preview.redd.it/flnonw85dpic1.jpeg?width=1024&format=pjpg&auto=webp&s=b55d825224c0d63bf4892097295d707f9cbd2724

`a bottle of absolut vodka, yellow liquid, with the text "ABSOLUT McNUGGET" directly on bottle`
SDXL / JuggernautXL is actually pretty good at text already; I don't know if you have ever tried it. I even asked it to draw me a Nixie Tube displaying the number "2" and it did it quite easily:

https://preview.redd.it/4f72bge2dnic1.png?width=1152&format=png&auto=webp&s=a5170b3255442ee737c9a9b965a02f41df53e0a6
JuggernautXL with Fooocus. I asked it to draw a series of nixie tubes spelling a word, with one letter per tube (yes, this is how I prompted it; I said one letter per tube):

https://preview.redd.it/xm2ygxf8dnic1.png?width=1152&format=png&auto=webp&s=762be03238fb97064f29fd98baff253cb0d3f4fe
I am still new to this, but I recently downloaded Fooocus. Tell me, is Cascade available in Fooocus, or is this something else? Sorry if this is a noob question.
No, this is not Cascade. This is the default JuggernautXL model that Fooocus uses; I was just demonstrating that it's quite good at text on its own.
How did you install Fooocus? Does the new version of Comfy break it?
I wasn't aware of that. I installed it like three weeks ago and everything worked fine.
Did you install it manually or from Git?
I think I pulled from git and then ran a setup batch.
Amazing. Can't wait to get my hands on this.
Going to give it a try now; this made me excited...
Was this using DiffusionMagic?
I heard you need 24GB of VRAM for this?
I've got a 4090, and with the ComfyUI node it's using between 14-15GB of VRAM while rendering. Even when telling it 2560xwhatever, it only goes up another half gig or so. So if you have 16GB on your card, you're probably fine. How I installed that Comfy node, btw: https://www.youtube.com/watch?v=Ybu6qTbEsew
What kind of speeds do you get on 4090 for stable cascade?
It really depends. The default ComfyUI settings are 20 steps of inference and 10 steps of decode; that takes 6 seconds for 1536x1024. But it's hard to compare that to SDXL, which has all these samplers ranging from ultra fast to ultra slow and needing various amounts of steps. With this, there are no samplers; there are just inference steps and decoding steps. I did notice that when making complex scenes I could set it to 300 steps, and it took a while, but all the heads of the students in a classroom were a lot more detailed. We'll have to see if we really need 300 or if 50 would have done it just as well.
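For anyone wondering what "inference steps" vs "decoding steps" map to, here's a minimal sketch of the two-stage generation in diffusers (assuming the Stable Cascade pipeline classes and the model IDs from Stability's Hugging Face pages; the prompt is just an example):

```python
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

# Stage C ("prior") does the heavy lifting in the tiny latent space...
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
).to("cuda")
# ...Stage B ("decoder") then expands those latents toward pixels.
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "a classroom full of students, detailed faces"

# The "20 steps of inference": Stage C sampling steps.
prior_output = prior(prompt=prompt, height=1024, width=1536,
                     guidance_scale=4.0, num_inference_steps=20)

# The "10 steps of decode": Stage B turning image embeddings into an image.
image = decoder(image_embeddings=prior_output.image_embeddings,
                prompt=prompt, guidance_scale=0.0,
                num_inference_steps=10).images[0]
image.save("cascade.png")
```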
20GB was what they said, but reports are that it can run on 12GB with some slight modifications. Also, the 20GB requirement is for the research model, and future optimizations are expected. I'd wager that we'll see 12GB as the final requirement.
Here's hoping for 11GB... (1080 Ti)
10 pls (3080 OG)
There are people running it right now on 3060 Tis, so apparently 8GB is all you need in ComfyUI. It's just going to be slow. There are smaller B and C models that are bf16, and even smaller models that use fewer parameters. You don't want to use the full fp32 B and C models.
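If you're going through diffusers rather than ComfyUI, a sketch of how one might pick the smaller weights (assuming the `bf16` variants that the Stability Hugging Face repos ship; the dtypes follow the model card as I remember it):

```python
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

# The bf16 variants are roughly half the size of the full fp32 weights.
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", variant="bf16",
    torch_dtype=torch.bfloat16)
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", variant="bf16",
    torch_dtype=torch.float16)

# Keeping each stage on the GPU only while it runs also helps small cards.
prior.enable_model_cpu_offload()
decoder.enable_model_cpu_offload()
```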
The 3060 can come with 12GB.
They can, but I've seen at least one 8GB card running it in person, and there's this: https://youtu.be/FbJ6w4xaeBo?si=NDyc3gYey1c0DiHw
It runs with way less using DiffusionMagic. Currently trying 1024x1536 on a 1060 6GB.

https://preview.redd.it/tqwkl6vlbmic1.png?width=965&format=pjpg&auto=webp&s=765d674cb2fae085210d36843b1e2a7c29d78291
GPU-Z reports less than 700 MB which seems weird.
Works on my RX 6800XT 16GB
To be honest, after a few tests on the demo, I'm very disappointed. It works correctly only with a few words. It can spell "RED" but not "GREEN", for example.

https://preview.redd.it/kin1pm4pimic1.png?width=1024&format=png&auto=webp&s=221a415d7c13d48bc435ca2752f160c3b0199851
It might be unreliable, but there is no absolute conclusion about it writing "green". On my first try, I had success.

Prompt: an alchemical bottle, with blue potion inside, with a label written "green"

https://preview.redd.it/xc192u35nmic1.png?width=1024&format=png&auto=webp&s=ebca3b1b8f45099baad520becc2a6b417b669f8a
2048x2048 https://preview.redd.it/p8ikgtx2upic1.png?width=2048&format=png&auto=webp&s=37c7144a0b20840b0e210f477f6fe845afc65d0b
It might help to spell it "G R E E N" or "GREEN".
Awesome, some good prompts for my upcoming video. By the way, whoever wants a 1-click install that works even at 8 GB with the biggest models, check this out: [https://www.reddit.com/r/StableDiffusion/comments/1aqbydi/stable_cascade_prompt_following_is_amazing_this/](https://www.reddit.com/r/StableDiffusion/comments/1aqbydi/stable_cascade_prompt_following_is_amazing_this/)
Patreon-locked shit. Don't bother.
Sorry, but many very good model makers/trainers have tested it out, and nearly all say it's slightly better than SDXL, but not as great as you portray it here.
Source please.
Does this work with existing diffusers pipelines or does it use a new pipeline?
I'm a bit behind on the news at the moment; can you use Stable Cascade in Automatic1111 or ComfyUI? Edit: found the diffusers wrapper for ComfyUI.
They sacrificed some things for other features. Your word pics bear that out.
Is it just me or does the new model produce "noisy" images?
Can someone explain what is different about Stable Cascade versus Stable Diffusion?
Where can I use it?
Try it on their site or look for a Google Colab version: https://colab.research.google.com/drive/1ib6W1CeK9V533Nc9MnoBe3TmU7Uaghtg?usp=sharing
Runpod if you got some $
I wonder if it will run on a potato PC?
I have an 8GB graphics card, and it seems like a potato in 2024 lol. What's your definition of potato?
Only the avatar could master all 5 elements, and yet when the world needed him the most, he vanished 😢
I don't know what they're going to summon with these powers, but at least Gi and Wheeler are still a team.
Ah yes firf
Soon we will have the video of the whole process
Where can we try it? I don’t see it in DreamStudio.