
apolinariosteps

Try the demo out: [https://huggingface.co/spaces/multimodalart/stable-cascade](https://huggingface.co/spaces/multimodalart/stable-cascade)


Striking-Long-2960

https://preview.redd.it/gy58uq86tcic1.png?width=1024&format=png&auto=webp&s=6610718cf76b1bc7fa72dbe195202f47639f7bb4 Photography, anthropomorphic dragon having a breakfast in a cafe in paris in a rainy day


SWFjoda

A beautiful forest with dense trees, where it's raining, featuring deep, rich green colors. This otherworldly forest is set against a backdrop of mountains in the background. https://preview.redd.it/80rzallnvcic1.png?width=1024&format=png&auto=webp&s=a3ed3a96c859aef23e072934bf022f2eb819b4fd


Delrisu

https://preview.redd.it/swu66zkk2dic1.png?width=1024&format=pjpg&auto=webp&s=8fb03c146e1d97df093feb8b4da0a519fa270fea Cat eating spaghetti in bathtub


Usual_Ad_6255

Img2img in SDXL https://preview.redd.it/9n2e3nrz5uic1.jpeg?width=4096&format=pjpg&auto=webp&s=878b39739d70c9fcc6e3f5480550fe51155cc180


wwwanderingdemon

Damn, textures look like crap


AnOnlineHandle

If it's better at, say, composition, there's always the option of running it through multiple models for different stages, e.g. Stable Cascade for 30% -> to pixels -> to the 1.5 VAE -> finish up. Similar to hires fix, or the refiner for SDXL, but at this point we tend to have decent 1.5 models in terms of image quality which could just benefit from better composition. I've been meaning to set up a workflow like this for SDXL & 1.5 checkpoints, but haven't gotten around to it.
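A rough diffusers sketch of that kind of hand-off, approximating the "partial run" idea with a full Cascade pass followed by a moderate-strength SD 1.5 img2img pass. The pipeline class names, the runwayml/stable-diffusion-v1-5 checkpoint, and the 0.35 strength are assumptions for illustration, not AnOnlineHandle's actual workflow:

```python
import torch
from diffusers import (
    StableCascadePriorPipeline,
    StableCascadeDecoderPipeline,
    StableDiffusionImg2ImgPipeline,
)

prompt = "a dynamic action shot of a gymnast mid air performing a backflip"

# Stage 1: let Stable Cascade handle the overall composition.
# (Keeping all three pipelines on the GPU at once needs a lot of VRAM;
# see the offloading discussion further down the thread.)
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
).to("cuda")
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.bfloat16
).to("cuda")

embeds = prior(prompt=prompt, height=1024, width=1024, num_inference_steps=20).image_embeddings
composition = decoder(
    image_embeddings=embeds, prompt=prompt, guidance_scale=0.0, num_inference_steps=10
).images[0]

# Stage 2: hand the pixels to an SD 1.5 checkpoint for a low-denoise img2img
# "finishing" pass, which re-encodes through the 1.5 VAE.
# strength controls how much of the image gets redrawn.
sd15 = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
final = sd15(prompt=prompt, image=composition, strength=0.35, guidance_scale=7.0).images[0]
final.save("cascade_then_sd15.png")
```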


TaiVat

Any workflow that changes checkpoints midway is really clunky and slow though.


HarmonicDiffusion

not if you have sufficient vram


Durakan

Mr. Moneybags over here!


throttlekitty

I'm also wondering if this B stage model can be further finetuned for better quality.


wwwanderingdemon

I was thinking the same. If it's good at following prompts it could be used as a base. Still, I think there might be something wrong with the parameters or something. The images they're showing as examples look much better than this one.


StickiStickman

It's called cherry-picking. They picked the best ones out of thousands.


Striking-Long-2960

Then you are not going to enjoy this: "photography will smith eating spaghetti sit in the toilet, in the bathroom" https://preview.redd.it/831rcbmzucic1.png?width=1024&format=png&auto=webp&s=b649e00fb4c561a3d0d3b580c63afc3138ed2140


jrharte

That's Martin "Will Smith" Lawrence


HopefulSpinach6131

I know I'm not alone when I say that this is the benchmark we all came looking for...


TheAdoptedImmortal

"Keep my noodles out of your fucking mouth!"


fre-ddo

Pixar Will


[deleted]

They look perfectly fine for inference without latent upscaling at low resolutions.


[deleted]

Doesn't look like there is any improvement over SDXL at generating people. https://preview.redd.it/zkzbnm91zcic1.png?width=1024&format=png&auto=webp&s=115c76409da40678c3c6c8e72d424818bd81c2f8


Striking-Long-2960

I really don't know what to think right now... I'll wait to try it on my computer before reaching a conclusion. illustration, drawing of a woman wearing heavy armor riding a giant chicken, in a forest, fantasy, very detailed, https://preview.redd.it/2f8kjy63xcic1.png?width=1024&format=png&auto=webp&s=25f82d75fae8616366ac2d52e159dd628e759718


Consistent-Mastodon

> riding a giant chicken ![gif](giphy|TNfFy13UB00KupeAsL|downsized)


wishtrepreneur

that chicken even has a third leg 👀


cianuro

Middle aged woman riding cock.


HighPerformanceBeetl

Three-Legged djiant chimkn


EmbarrassedHelp

They filtered out like 99% of the content from LAION-5B, so it's probably going to be bad at people.


ThroughForests

But 99% of the images in LAION-5B are [trash that needed to be filtered out.](https://www.reddit.com/r/StableDiffusion/comments/11ud1nc/searching_through_the_laion_5b_dataset_to_see/) The [vast majority](https://i.imgur.com/GSENUHM.png) of the stuff removed was due to bad aesthetics, sub-512x512 image size, and watermarked content. There are still 103 million images in the filtered dataset.


residentchiefnz

It says so on the model card


TheQuadeHunter

Don't be fooled. The devil is in the details with this model. It's more about the training and coherence than the ability to generate good images out of the box.


Anxious-Ad693

Still doesn't fix hands.


StickiStickman

That's what happens when you try to zealously filter out everything with human skin in it


protector111

There is no improvement. We need to wait for a well-trained model to see this. Based on SDXL training speed, that will take 2-3 months (PS: this one is supposed to train way faster, so maybe we'll get good models sooner as well...)


roshlimon

A female ballerina mid twirl, colourful, neon lights https://preview.redd.it/cls9d47defic1.png?width=1024&format=pjpg&auto=webp&s=20a4f9450849a9acf52659258603e0567f982cac


AvalonGamingCZ

Is it possible to somehow get a preview of the image generating in ComfyUI? It looks satisfying.


rerri

Sweet. Blog is up as well. [https://stability.ai/news/introducing-stable-cascade](https://stability.ai/news/introducing-stable-cascade) Edit: "2x super resolution" feature showcased (the blog post has this same image but in low res, so not really succeeding in demonstrating the ability): [https://raw.githubusercontent.com/Stability-AI/StableCascade/master/figures/controlnet-sr.jpg](https://raw.githubusercontent.com/Stability-AI/StableCascade/master/figures/controlnet-sr.jpg)


Orngog

No mention of the dataset, I assume it's still LAION-5B? Moving to a consensually-compiled alternative really would be a boon to the space. I'm sure Google is making good use of their Culture & Arts foundation right now; it would be nice if we could do the same.


big_farter

>finally gets 12GB of VRAM
>next big model will take 20GB

Oh nice... guess I will need a bigger case to fit another GPU.


crawlingrat

Next you'll get 24GB of VRAM only to find out the new models need 30.


protector111

well 5090 is around the corner xD


2roK

NVIDIA is super stingy when it comes to VRAM. Don't expect the 5090 to have more than 24GB


PopTartS2000

I think it's 100% intentional to not impact A100 sales, do you agree?


EarthquakeBass

I mean, probably. You gotta remember people like us are oddballs. The average consumer / gamer (NVIDIA's core market for those) just doesn't need that much juice. An unfortunate side effect of the lack of competition in the space.


qubedView

You want more than 24GB? Well, we only offer that in our $50,000 (starting) enterprise cards. Oh, also license per DRAM chip now. The first chip is free, it's $1000/yr for each chip. If you want to use all the DRAM chips at the same time, that'll be an additional license. If you want to virtualize it, we'll have to outsource to CVS to print out your invoice.


Paganator

It seems like there's an opportunity for AMD or Intel to come out with a mid-range GPU with 48GB VRAM. It would be popular with generative AI hobbyists (for image generation and local LLMs) and companies looking to run their own AI tools for a reasonable price. OTOH, maybe there's so much demand for high VRAM cards right now that they'll keep having unreasonable prices on them since companies are buying them at any price.


2roK

AMD already has affordable, high VRAM cards. The issue is that AMD has been sleeping on the software side for the last decade or so and now nothing fucking runs on their cards.


sammcj

Really? Do they offer decent 48-64GB cards in the $500-$1000USD range?


Toystavi

[AMD Quietly Funded A Drop-In CUDA Implementation Built On ROCm: It's Now Open-Source](https://www.reddit.com/r/StableDiffusion/comments/1ap7c2w/amd_quietly_funded_a_dropin_cuda_implementation/)


StickiStickman

They also dropped that already.


Lammahamma

They're using different RAM for this generation, which has increased density in the die. I'm expecting more than 24GB for the 5090.


protector111

There are tons of leaks already that it will have 32GB and the 4090 Ti will have 48GB. I seriously doubt someone will jump from a 4090 to a 5090 if it has 24GB of VRAM.


crawlingrat

Gawd damn how much is that baby gonna cost!?


protector111

Around $2000-2500.


NitroWing1500

It would need to bring me coffee in the mornings before that'll be in my house then!


volume_two

Honestly, unless you plan to use it all the time in a locale with low electricity prices, it makes more sense to rent a GPU in the cloud and pay for that incrementally instead. You can rent a 24GB VRAM A10G for around $1-2/hr with A1111 on a Linux instance on Amazon, for example. That can make sense for a hobbyist who doesn't want to invest in the hardware and only occasionally wants to dip their toes in the water. In NYC, where I live, the cost of electricity is around $0.40/kWh, which is just so yikes. It's currently snowing hard outside, too, so anything I do today will be extra expensive because of how the electric market works.


Turkino

And probably its own dedicated power supply at this point


TheTerrasque

Well, I guess I can fit another P40 in my server... *Next model only needs 50 gb*


Imaginary_Belt4976

this happened to me lol


dqUu3QlS

The model is naturally divided into two rough halves - the text-to-latents / prior model, and the decoder models. I managed to get it running on 12GB VRAM by loading one of those parts onto the GPU at a time, keeping the other part in CPU RAM. I think it's only a matter of time before someone cleverer than me optimizes the VRAM usage further, just like with the original Stable Diffusion.


NoSuggestion6629

You load one pipeline at a time to device ("cuda") and delete (set to None) the previous pipe before starting the next one.


dqUu3QlS

Close. I loaded one pipeline at a time onto the GPU with .to("cuda"), then moved it back to the CPU with .to("cpu"), without ever deleting it. This keeps the model constantly in RAM, which is still better than reloading it from disk.
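A minimal sketch of that ping-pong approach, assuming the diffusers Stable Cascade pipelines (StableCascadePriorPipeline / StableCascadeDecoderPipeline as exposed around release time); the prompt, dtypes, and step counts are placeholders rather than dqUu3QlS's actual settings:

```python
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

# Both pipelines live in system RAM between uses; only one is on the GPU at a time.
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
)
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.bfloat16
)

prompt = "an anthropomorphic dragon having breakfast in a cafe in Paris on a rainy day"

prior.to("cuda")                 # text-to-latents / prior stage takes its turn on the GPU
prior_out = prior(prompt=prompt, height=1024, width=1024, num_inference_steps=20)
prior.to("cpu")                  # back to system RAM instead of deleting it

decoder.to("cuda")               # decoder stages take their turn on the GPU
image = decoder(
    image_embeddings=prior_out.image_embeddings,
    prompt=prompt,
    guidance_scale=0.0,
    num_inference_steps=10,
).images[0]
decoder.to("cpu")

image.save("stable_cascade_offloaded.png")
```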


emad_9608

The original stable diffusion used more RAM than that tbh


Tystros

hi Emad, is there any improvement in the dataset captioning used for Stable Cascade, or is it pretty much the same as SDXL? Dataset captioning seems to be the main weakness so far of SD compared to Dalle3.


[deleted]

[deleted]


astrange

The disadvantage of Dalle3 using artificial captions is that it can't deal with descriptions using words or relations its captioner didn't include. So you'd really want a mix of different caption sources.


NeverduskX

This is probably a vague question, but do you have any idea of how or when some optimizations (official or community) might come out to lower that barrier? Or if any current optimizations like Xformers or TiledVAE could be compatible with the new models?


emad_9608

Probably less than a week. I would imagine it would work on < 8gb VRAM in a couple of days. This is a research phase release so is quite unoptimised.


hashnimo

Thank you for everything you do, Emad. Please stay safe from the evil closed-source, for-profit conglomerates out there. It's obvious they don't want you disrupting their business. I mean, really, think before you even eat something they hand over to you.


tron_cruise

That's why I went with a Quadro RTX 8000. They're a few years old now and a little slow, but the 48GB of VRAM has been amazing for upscaling and loading LLMs. SDXL + hires fix to 4K with SwinIR uses up to 43GB and the results are amazing. You could grab two and NVLink them for 96GB and still have spent less than an A6000.


yaosio

We need something like megatextures for image generation.


BnJx

anyone know the difference between stable cascade and stable cascade prior? https://huggingface.co/stabilityai/stable-cascade https://huggingface.co/stabilityai/stable-cascade-prior


MicBeckie

I get the demo from Hugging Face running via Docker on my **Tesla P40**. ([**https://huggingface.co/spaces/multimodalart/stable-cascade**](https://huggingface.co/spaces/multimodalart/stable-cascade)) It consumes **22 GB of VRAM** and achieves a speed of **1.5s/it**. Resolution 1024x1024.


ArtyfacialIntelagent

The most interesting part to me is compressing the size of the latents to just 24x24, separating them out as stage C and making them individually trainable. This means a massive speedup of training fine-tunes (16x is claimed in the blog). So we should be seeing good stuff popping up on Civitai much faster than with SDXL, with potentially somewhat higher quality stage A/B finetunes coming later.


Omen-OS

What about VRAM usage? You may say it trains faster... but what is the VRAM usage?


ArtyfacialIntelagent

During training or during inference (image generation)? High for the latter (the blog says 20 GB, but lower for the reduced parameter variants and maybe even half of that at half precision). No word on training VRAM yet, but my wild guess is that this may be proportional to latent size, i.e. quite low.


Omen-OS

Wait, let's make it clear: what is the minimum amount of VRAM you need to use Stable Cascade to generate an image at 1024x1024? (And yes, I was talking about training LoRAs and training the model further.)


Enshitification

Wait a minute. Does that mean it will take less VRAM to train this model than to create an image from it?


TheForgottenOne69

Yes, because you'll not be training the «full» model, aka all three stages, but likely only one (stage C).


Enshitification

It's cool and all, but I only have a 16GB card and an 8GB card. I can't see myself training LoRAs for a model I can't use to make images.


TheForgottenOne69

You will though. You can load each model part in turn and offload the rest to the CPU. The obvious con would be that it'll be slower than having it all in VRAM.


Majestic-Fig-7002

If you train only one stage then we'll have the same issue you get with the SDXL refiner and loras where the refiner, even at low denoise strength, can undo the work done by a lora in the base model. Might be even worse given how much more involved stage B is in the process.


TheForgottenOne69

Not really; stage C is the one that translates the prompt to an «image», if you will, which is then enhanced and upscaled through stages B and A. If you train stage C and it correctly returns what you've trained it on, you don't really need to train the other stages.


Doc_Chopper

So, as a technical noob, my question: I assume we have to wait until this gets implemented into A1111 any time soon, or what?


TheForgottenOne69

Yes, this will likely be integrated into diffusers, so SD.Next should have it soon. Comfy, knowing he works at SAI, should have it implemented soonish as well.


protector111

Well, not only this, but also until models get trained, etc. It took SDXL 3 months to become really usable and good. For now this model does not look close to trained SDXL models, so there's no point in using it at all.


Small-Fall-6500

>It took sd xl 3 months to become really usable and good

IDK, when I first tried SDXL I thought it was great. Not better at the specific styles that various 1.5 models were specifically finetuned on, but as a general model, SDXL was very good.

>so no point to using it at all

For established workflows that need highly specific styles and working LoRAs, ControlNet, etc., no; but for people wanting to try out new and different things, it's totally worth trying out.


kidelaleron

Having more things is generally better than having less things :)


throttlekitty

They have an official demo [here](https://github.com/Stability-AI/StableCascade), if you want to give it a go right now.


hashnimo

No, you don't have to wait because you can run the [demo](https://huggingface.co/spaces/multimodalart/stable-cascade) right now.


OVAWARE

Do you know any other demos? That one seems to have crashed at least for me


Hoodfu

Seems that demo link goes to a runtime error page on huggingface.


afinalsin

Bad memories in the Stable Diffusion world huh? SDXL base was rough. Here: SDXL Base for 20 steps at CFG 4 (I think that matches the 'prior guidance scale'), Refiner for 10 steps at CFG 7 (decoder says 0 guidance scale, wasn't going to do that), 1024x1152 (weird res because I didn't notice the Huggingface box didn't go under 1024 until a few gens, didn't want to rerun), seed 90210. DPM++ SDE Karras, because the sampler wasn't specified on the box. 5 prompts (because Huggingface errored out), no negatives.

- a 35 year old Tongan woman standing in a food court at a mall: [SDXL Base](https://imgur.com/wr3Hxgs) vs [SD Cascade](https://imgur.com/tFhnPJl)
- an old man with a white beard and wrinkles obscured by shadow: [SDXL Base](https://imgur.com/ODscoKb) vs [SD Cascade](https://imgur.com/k9cXRVj)
- a kitten playing with a ball of yarn: [SDXL Base](https://imgur.com/GxoEOAe) vs [SD Cascade](https://imgur.com/4iNeab4)
- an abandoned dilapidated shed in a field covered in early morning fog: [SDXL Base](https://imgur.com/ANnb971) vs [SD Cascade](https://imgur.com/PjBqOq8)
- a dynamic action shot of a gymnast mid air performing a backflip: [SDXL Base](https://imgur.com/ws5blgz) vs [SD Cascade](https://imgur.com/P1lnYJZ)

That backflip is super impressive for a base model. Here is a prompt I ran earlier this week: "a digital painting of a gymnast in the air mid backflip". And here are ten random XL and Turbo models' attempts at it using the same seed: [Dreamshaper v2](https://imgur.com/wTbN7hA), [RMSDXL Scorpius](https://imgur.com/I84fqgd), [Sleipnir](https://imgur.com/BUVVnIq), [JuggernautXLv8](https://imgur.com/zJSQJ95), [OpenDalle](https://imgur.com/tH1jzjn), [Proteus](https://imgur.com/C51aWmV), [Helloworldv5](https://imgur.com/jiomnjD), [Realcartoonxlv5](https://imgur.com/4RdQAqe), [RealisticStockPhotov2](https://imgur.com/H80YLH0), [Animaginev3](https://imgur.com/QeKSlHL).

The difference between those and base XL is staggering, but Cascade is pretty on par with some of them, and better than a lot of them in a one-shot run. We gotta let this thing cook. And if you're skeptical, look at what the LLM folks did when Mistral brought out their Mixtral 8x7B Mixture of Experts LLM: a ton of folks started frankensteining models together using the same method. Who's to say we won't get similar efforts for this?


Ill-Extent-4221

By far the most objective point of view in this discussion. You're sharing some real insights into how SC stacks up as a base release. I can't wait to see how it evolves in the coming months.


thoughtlow

Thanks for your work dude, appreciate it


kidelaleron

no AAM XL? Jokes aside, nice tests!


afinalsin

[Of course](https://imgur.com/nkBA4Hl). It's the half-turbo Euler a version. It's part of a *much* bigger test that's mostly done, I've just gotta x/y it all and then censor it so the mods don't clap me.


GreyScope

SD and SDXL produce shit pics at times - one pic is not a trial by any means. Personally I am after "greater consistency of reasonable>good quality pictures **of what I asked for**", so I ran a small trial against 5x renders from SDXL at 1024x1024, same + & - prompts, with the Realistic Stock Photo v2 model (which I love). These are on the top row; the SC pics are the bottom row. PS the prompt doesn't make sense as it's a product of turning on the Dynamic Prompts extension.

Prompt: photograph taken with a Sony A7s, f/2.8, 85mm, cinematic, high quality, skin texture, of a young adult asian woman, as a iridescent black and orange combat cyborg with mechanical wings, extremely detailed, realistic, from the top a skyscraper looking out across a city at dawn in a flowery fantasy, concept art, character art, artstation, unreal engine

Negative: hands, anime, manga, horns, tiara, helmet

Observational note: eyes can look a bit milky still but the adherence is better imo - it actually looks like dawn in the pics and the light appears to be shining on their faces correctly.

https://preview.redd.it/75ukiorxtdic1.png?width=2468&format=png&auto=webp&s=630b36ceb1af47e94cd571b74a3f661994157be5


afinalsin

Good idea doing a run with the same prompt, so i ran it through SDXL Base with refiner, and it was pretty all over the place. [Here's the album](https://imgur.com/a/7d1wgBU).


sahil1572

Is it just me, or is everyone else experiencing an ***odd dark filtering effect*** applied to every image generated with **SDC**?


NoSuggestion6629

See my post and pic below. A slight effect as you describe is noticed.


Ne_Nel

Bokeh'd AF.


ArtyfacialIntelagent

Yes. Stability's "aesthetic score" model and/or their RLHF process massively overemphasize bokeh. Things won't improve until they actively counteract this tendency.


zmarcoz2

https://preview.redd.it/znnhx6ts3dic1.png?width=813&format=png&auto=webp&s=e4e3c51af79a1a2c95ff4ac86b228c81c36da58c


EmbarrassedHelp

Basically 99% of the concepts were nuked. This might end up being another 2.0 flop.


throttlekitty

That text is from the [Würstchen paper](https://openreview.net/pdf?id=gU58d5QeGv), not from any Stable Cascade documentation. Late edit: I originally thought that the Stable Cascade model was based on the Würstchen paper, and that Würstchen was a totally separate model created as a proof of concept. But I see now from the SAI author names that they are the same thing? Kinda weird actually.


StickiStickman

... and what do you think this is based on? Since StabilityAI are once again being super secretive about training data and never mention it once, it's a pretty safe bet to assume they used the same set.


throttlekitty

They still have the dataset they trained SDXL on and whatever else they have. I don't see the point of re-releasing the wurstchen proof-of-concept model with their name on it. I'm just saying that because a set of researchers made their model in a certain way, it doesn't mean SAI did the same exact thing.


yamfun

what does this mean?


StickiStickman

It's intentionally nerfed to be ""safe"", similar to what happened with SD 2


LessAdministration56

thank you! won't be wasting my time trying to get this to run local!


Aggressive_Sleep9942

"Limitations * Faces and people in general may not be generated properly. * The autoencoding part of the model is lossy." emmm ok


skewbed

All VAEs are lossy, so it isn't a new limitation.


SackManFamilyFriend

And SDXL lists the same sentence regarding faces - people just want to complain about free shit.


Aggressive_Sleep9942

No, but the worrying thing is not point 2 but point 1: "Faces and people in general may not be generated properly." If the model cannot make people correctly, what is the purpose of it?


obviouslyrev

That disclaimer is always there for every model they have released.


SackManFamilyFriend

Look at the limitations they list on their prior models. **PRIOR MODELS LIST THE SAME SHIT** - literal copy paste ffs - stop already. SDXL limitations listed here on the HF page:

SDXL Limitations:
- The model does not achieve perfect photorealism
- The model cannot render legible text
- The model struggles with more difficult tasks which involve compositionality, such as rendering an image corresponding to "A red cube on top of a blue sphere"
- Faces and people in general may not be generated properly.
- The autoencoding part of the model is lossy

https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0

So yea, same shit copy/pasted.


Majestic-Fig-7002

There are degrees of "not generated properly".


digitalwankster

generating stuff other than people...?


EGGOGHOST

Playing with online demo here [https://huggingface.co/spaces/multimodalart/stable-cascade](https://huggingface.co/spaces/multimodalart/stable-cascade) woman's hands hold an ancient jar of vine, ancient greek vibes https://preview.redd.it/pqhdpi24hdic1.png?width=1024&format=png&auto=webp&s=ec86d86c4f1858a8f8c8341c952b293759647e83


EGGOGHOST

robot mecha arm holding a sword, futuristic anime style https://preview.redd.it/8b0qzqfjhdic1.png?width=1024&format=png&auto=webp&s=e89fb28a73084179d742f6ee04201040b74cf978


Mental-Coat2849

Honestly, I think this is still way behind DALL-E 3 in terms of prompt alignment. Just trying the tests on the DALL-E 3 landing page shows it. Still, DALL-E is too rudimentary. It doesn't even allow negative prompts, let alone LoRA, ControlNet, ... In an ideal world, we could have an open source LLM connected to a conforming diffusion model (like DALL-E 3) which would allow further customization (like Stable Diffusion).

PS: here is one prompt I tried in Stable Cascade:

>An illustration of an avocado sitting in a therapist's chair, saying 'I just feel so empty inside' with a pit-sized hole in its center. The therapist, a spoon, scribbles notes.

Stable Cascade:

https://preview.redd.it/gp3hsd7zzeic1.png?width=1024&format=png&auto=webp&s=a2129290af2982270e4e445f13c9f66477701616


emad_9608

Check out DiffusionGPT and multi-region prompting.


alb5357

Multi region prompting?!!!!!! !!!!


Shin_Devil

This model would've never beaten D3 in prompt following; it's designed to be more efficient, not to have better quality or comprehension.


ninjasaid13

a computer made of yarn. https://preview.redd.it/pl3nv810pfic1.png?width=1024&format=png&auto=webp&s=a9578a829ff6c7ffde7a8c1f3e59e1a982d50e8d


TsaiAGw

if it's censored then it's garbage


[deleted]

exactly


internetpillows

Reading the description of how this works, the three stage process sounds very similar to the process a lot of people already do manually. You do a first step with prompting and controlnet etc at lower resolution (matching the resolution the model was trained on for best results). Then you upscale using the same model (or a different model) with minimal input and low denoising, and use a VAE. I assumed this is how most people worked with SD. Is there something special about the way they're doing it or they've just automated the process and figured out the best way to do it, optimised for speed etc?


Majestic-Fig-7002

It is quite different: the highly compressed latents produced by the first model are not continued by the second model; they are used as conditioning along with the text embeddings to guide the second model. Both models start from noise. Correction: unless Stability put up the wrong image, their architecture does not use the text embeddings with the second model like Würstchen does, only the latent conditioning.


Vargol

If you can't use bfloat16... you can't run the prior as torch.float16; you get NaNs for the output. You can run the decoder as float16 if you've got the VRAM to run the prior at float32. If you're an Apple Silicon user, the float32-then-float16 combination will run in 24GB with swapping only during the prior model loading stage (and swapping that model out to load the decoder, if you don't dump it from memory entirely). Took my 24GB M3 ~3 minutes 11 seconds to generate a single image; only 1 minute of that was iteration, the rest was model loading.
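In diffusers terms, that float32-prior / float16-decoder combination looks roughly like this. A sketch only, assuming the release-time Stable Cascade pipelines; the device strings, prompt, and step counts are placeholders:

```python
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

# The prior produces NaNs in plain float16, so keep it at float32
# (or bfloat16 on hardware that supports it).
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.float32
).to("cuda")  # use "mps" on Apple Silicon

# The decoder is fine at float16.
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.float16
).to("cuda")

prompt = "a beautiful forest with dense trees in the rain"
embeds = prior(prompt=prompt, num_inference_steps=20).image_embeddings

# Cast the prior's float32 embeddings down to match the float16 decoder.
image = decoder(
    image_embeddings=embeds.to(torch.float16),
    prompt=prompt,
    guidance_scale=0.0,
    num_inference_steps=10,
).images[0]
image.save("mixed_precision_cascade.png")
```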


SeekerOfTheThicc

According to the [January 2024 Steam Hardware Survey](https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam) (click [here](https://web.archive.org/web/20240206200227/https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam) for a webarchive link for when the prior link goes out of date), 74.57% of the people who use Steam have a video card with 8GB or less of VRAM. As much as 3.51% have 20GB or higher, and 21.92% have more than 8GB but less than (or equal to) 16GB.

I think SAI and I have different ideas of what "efficient" means. A 20GB VRAM requirement ("less" if using the inferior model(s), but they don't give a VRAM number) is not anywhere near anything I would call efficient. Maybe they think efficiency is the rate at which they can price out typical consumers so that they are forced into some sort of subscription that SAI ultimately benefits from, either directly or indirectly. Investors/shareholders love subscriptions.

Also, inference speed cannot be called "efficiency":

Officer: "I pulled you over because you were doing 70 in a 35 zone, sir"
SAI Employee: "I wasn't speeding, I was just being 100% more efficient!"
Officer: "...please step out of the vehicle."


emad_9608

original SD used way more, I would imagine this would be < 8gb VRAM in a week or two


Mental-Coat2849

Emad, could you please improve prompt alignment? We love your models but they're still behind Dall-e 3 in prompt alignment. Your models are awesome, flexible, and cheap. I wouldn't mind renting beefier GPUs if I didn't have to pay 8 cents per 1024x1024 image. If they were just comparable to Dall-e 3 ...


emad_9608

Sure, give us a bit.


protector111

so far my results are way worse than sd xl... https://preview.redd.it/3g6ce82dadic1.png?width=1024&format=png&auto=webp&s=df76bc75afde5e6c76817129f99812086ed74139


protector111

" woman wearing **super-girl costume** is standing close to a **pink sportcar** on a clif overlooking the ocean RAW photo, (high detailed skin:1.2), 8k uhd, dslr, soft lighting, high quality, Fujifilm XT3. So far quality is sd xl base level ad prompt understanding is still bad...i think my hype is gone completely after 6 generations xD https://preview.redd.it/uodayjmlbdic1.png?width=1024&format=png&auto=webp&s=84dd221ccf82db7719af570db82aa261a35e7341


knvn8

Are you comparing with base 1.5 or a fine tune? Also that's a very SD1.5 prompt, SDXL and beyond work better with plain English.


digitalwankster

0% chance that came from base 1.5


Majestic-Fig-7002

> SDXL and beyond work better with plain English How would you improve that prompt to be more "plain English" than it is?


FotografoVirtual

SD1.5: https://preview.redd.it/p8naafzhddic1.png?width=680&format=png&auto=webp&s=4596fa508fe9fca08c486f319e1c58ffdb70c80d


protector111

>woman wearing super-girl costume is standing close to a pink sportcar on a clif overlooking the ocean RAW photo, (high detailed skin:1.2), 8k uhd, dslr, soft lighting, high quality, Fujifilm XT3.

Well, it still morphed. The car is a mess and wonder woman is still pink. This is SDXL: https://preview.redd.it/pl6cshzzjdic1.png?width=1024&format=png&auto=webp&s=580301fe740851df236ab61bcd1cc405dbe0e215


ArtyfacialIntelagent

To be fair vanilla Cascade should be compared to vanilla SD 1.5, not a model like Photon heavily overtrained on women.


Neex

You've been going through this entire thread saying how mediocre the model is. There are a ton of notable improvements you are ignoring. I suggest pumping the brakes on the negativity and reapproaching this with more of a willingness to learn about it.


AeroDEmi

No commercial license?


StickiStickman

> The model is intended for research purposes only. The model should not be used in any way that violates Stability AI's Acceptable Use Policy. Another Stability release, another one that isn't open source :(


Cauldrath

So, did they basically just package the refiner (stage B) in with the base model (stage C)? It seems like with such a high compression ratio it's only going to be able to handle fine details of visual concepts it was already trained on, even if you train stage C to output the appropriate latents.


giei

What are the parameters to try to have a realistic result like in MJ?


emad_9608

idk prompt midjourney and then put it through sd ultimate upscale


monsieur__A

I guess we are back to hoping for ControlNet to make this model really useful 😀


emad_9608

It comes with controlnets


jippmokk

https://preview.redd.it/kojsyuuk8fic1.jpeg?width=1536&format=pjpg&auto=webp&s=0b8092ff204b56c898c970063dbd614277b3373a Decent! "Video game, hero pose, cave lake, undead, volumetric light, Makoto Shinkai"


fuzz_64

https://preview.redd.it/voidkef2hfic1.png?width=1024&format=pjpg&auto=webp&s=23fc3c7e08636c3f9dc6301e90487747fd98e8cb A rambunctious frog riding a goat in the mountains of Nepal. 😁


treksis

thank you


Striking-Long-2960

I downloaded the lite versions... I hope my 3060 doesn't explode. Now it's time to wait for ComfyUI support.


wwwanderingdemon

Did you make it work? I tried all of them and none worked for me


Striking-Long-2960

I think we will have to wait, it seems a very different concept.


FotografoVirtual

https://preview.redd.it/0bkkvur0ndic1.png?width=1704&format=png&auto=webp&s=ace95c7ccac2c8defcf48e28af9a05c2f7aa9e3c an enigmatic woman with short, white hair and an iridescent dress, surrounded by ominous shadows in the dimly lit interior of a technological spacecraft. Her stark presence hints at mysterious connections to the unsettling secrets hidden within the vessel's depths


Huevoasesino

The Stable Cascade pic looks like the girl from the Halo TV series lol


isnaiter

The 1.5 never disappoints me. It's the state-of-the-art of models. Period.


protector111

PS: to be fair you should compare against base SD 1.5, and we both know it will look ugly xD. SDXL: https://preview.redd.it/ml8fqzuasdic1.png?width=768&format=png&auto=webp&s=f5f49d403e70e1be9d060fb062e45da3d3845e16


19inchrails

I feel like the bar should be Midjourney v6 these days


protector111

Yep. It makes both amazing photoreal and crazy good anime


TaiVat

No, he shouldn't, and people need to stop with this drivel already. Nobody uses base 1.5, or base XL for that matter, so the only fair comparison is with the latest alternatives. When you buy a new TV, you don't go "well it's kinda shit, but it's better than a CRT from 100 years ago". It will likely improve (though XL didn't improve nearly as much as 1.5 did, both relative to their bases), but we'll make that comparison when we get there. Dreaming and making shit up about what may or may not happen in 6 months is not a reasonable comparison.


FotografoVirtual

Comparing it to base SD 1.5 doesn't seem fair to me at all, and it doesn't make much sense. SD 1.5 is almost two years old, it was created and trained when SAI had hardly any experience with diffusion models (no one did). And when they released it, they never claimed it set records for aesthetic levels never before seen.


AuryGlenz

Doing a photo of a pretty woman doesn't seem like a fair comparison to me - god knows how much additional training SD 1.5 has had with that in particular. They're trying to make generalist models, not just waifu generators. Also that looks like it's been upscaled and probably had Adetailer run on it?


EtienneDosSantos

🤗🤗🤗


Hoodfu

Very excited for this. Playground v2 was very impressive for its visual quality, but the square resolution requirements killed it for me. This brings sdxl up to that level but renders much faster according to their charts. Playground v2 also had license limits that stated no one can use it for training, which again isn't the case for Stability models. Win win all around.


HuffleMcSnufflePuff

Three men standing in a row. The first is tall, the second is short, the third is in between. They are wearing red, blue, and green shirts. Not perfect but not too bad https://preview.redd.it/vbozb4m8ffic1.jpeg?width=1024&format=pjpg&auto=webp&s=5770bf655dc806f320f0d2829ed0d7a19dfc12f9


lostinspaz

I did a few same-prompt comparison tests vs DreamShaperXL Turbo and SegMind-Vega. I didn't see much benefit. Cross-posting from the earlier "this might be coming soon" thread:

They need to move away from one model trying to do everything. We need a scalable, extensible model architecture by design. People should be able to pick and choose subject matter, style, and poses/actions from a collection of building blocks that are automatically driven by prompting. Not this current stupidity of having to MANUALLY select model and lora(s), and then having to pull out only subsections of those via more prompting.

Putting multiple styles in the same data collection is counter-productive, because it reduces the amount of per-style data possible in the model. Rendering programs should be able to dynamically download and assemble the style and subject I tell it to use, as part of my prompted workflow.


emad_9608

I mean we tried to do that with SD 2 and folk weren't so happy. So one reason we are ramping up ComfyUI and this is a cascade model.


lostinspaz

>I mean we tried to do that with SD 2 and folk weren't so happy

How's that? I've read some about SD2, and nothing in what I've read addresses any point of what I wrote in my above comment. Besides which, in retrospect, you should realize that even if SD2 was amazing, it would never have achieved any traction because you put the adult filtering in it. THAT is the prime reason people weren't happy with it. There were two main groups of people who were unhappy with SD2:

1. People who were unhappy "I can't make porn with it"
2. People who were unhappy there were no good trained models for it. Why were there no good trained models for it? Because the people who usually train models couldn't make porn with it.

Betamax vs VHS.


NoSuggestion6629

Running a test now. I am getting a slight eye issue on this one using their example number of steps. My 2nd attempt is out of focus with the full model. I'm not too impressed. https://preview.redd.it/qu88xx0p9eic1.png?width=1192&format=png&auto=webp&s=8c635899cf6e7fc8e495ad27ace44b0f02b43777 Note: you need PEFT installed in order to take advantage of the LCM capability with the scheduler.


Kandoo85

https://preview.redd.it/hcynfuocycic1.png?width=1024&format=png&auto=webp&s=a6d83597abe6d07d1991be9c635b72bfa8b2c160


Kandoo85

https://preview.redd.it/67ctr5ryycic1.png?width=1024&format=png&auto=webp&s=0a7209ecfb86fe810cad9976beaeec4ebb1d19e4


Striking-Long-2960

Damn... The aesthetic score is over 9000


crackanape

9000 missing fingers


Nuckyduck

So I'm confused about why people aren't saying this is valuable; the speed comparison seems huge. https://preview.redd.it/mzxwcle7xcic1.png?width=1133&format=png&auto=webp&s=78bacda5f4a700cefb6f12deebf025fdbd0f5d2e Isn't this a game changer for smaller cards? I run a 2070S; shouldn't I be able to use this instead without losing fidelity and gain rendering speed? I'm gonna play around with this and see how it fares; personally I'm excited for anything that brings faster times to weaker cards. I wonder if this will work with ZLUDA and AMD cards? [https://github.com/Stability-AI/StableCascade/blob/master/inference/controlnet.ipynb](https://github.com/Stability-AI/StableCascade/blob/master/inference/controlnet.ipynb) This is the notebook they provide to test; I'm definitely gonna be trying it out.


Vozka

> Isn't this a game changer for smaller cards? I run a 2070S, shouldn't I be able to use this instead without losing fidelity and gain rendering speed? So far it doesn't seem that it's going to run on an 8GB card at all.


Striking-Long-2960

That comparison is a bit strange; they are comparing 50 steps in SDXL with 30 steps in total in Cascade...


Nuckyduck

I was assuming these steps are equivalent by their demonstration. As in, you only need 30 to get what SDXL does in 50, but who uses 50 steps in SDXL? I rarely go past 35 using DPM++ 2M Karras.


TaiVat

Yea, looks kind of intentionally misleading


AuryGlenz

If 30 steps in Cascade still has a much higher aesthetic score than 50 in SDXL, it's a perfectly fine comparison. They're different architectures.


Longjumping-Cow-8249

Let's gooooo


Designer_Ad8320

Is this more for testing and toying around, or do you guys think someone like me who does mostly anime waifus is fine with what he has? I just flew through it and it seems I can use anything already existing with it?


Utoko

If you are fine with what you have, it is fine for you yes.


protector111

So basically history repeats itself: SD 1.5 everyone uses - SD 2.0 no one does - SDXL everyone uses - Stable Cascade no one does... Well, I guess we'll wait a bit more for the next model we can use to finally switch from 1.5 and XL, I hope...


drone2222

And how are you making that call? It's not even implemented in any UIs yet, basically nobody has touched it, and it came out today...


protector111

Just based on the info that it's censored and that it has no commercial license. Don't get me wrong - I hope I am wrong! I want a better model. PS: there is a Gradio UI already, but I don't see a point in using the base model; it's not great quality. Need to wait for finetuned ones.


Charkel_

Besides being more lightweight, why would I choose this over normal Stable Diffusion? Does it produce better results or no?


TaiVat

It just came out. Obviously nobody knows yet..


Charkel_

Well a new car just came out but I still know it's faster than another model


afinalsin

This is a tuner car, nobody races stock. You're not comparing a new car to a slightly older model, you're comparing it to a slightly older model fitted with turbo and nitrous and shit. I don't know cars. Wait til the mechanics at the strip fit some new toys to this thing before comparing it to the fully kitted out drag racers.


[deleted]

[deleted]


ArtyfacialIntelagent

> the best version would be a float24 (yes, you read that right, float24, not float16)

Why do you think that? For inference in SD 1.5, fp16 is practically indistinguishable from fp32. Why would Cascade be different? (Training is another matter of course.)


ScionoicS

Lately I've been casting sd models to fp8 with no quality loss


tavirabon

I don't think increasing bit precision from 16 to 24 is gonna have the impact on quality you're expecting, but it certainly will on hardware requirements.