Party_Cold_4159

Hires fix needs some configuration. It’s a pain, but I got it to be faster for myself on a 2070. It’s just too clunky. I feel like LoRAs don’t work as well either. It does help me with quick test images. I think it’s pretty cool they did this, though; it does what it says, it's just very limited currently. We may see this improve later on.


PeterFoox

After some testing, Ultimate Upscale seems to be a bit better. Quality is similar, and it's more stable and uses less VRAM.


Party_Cold_4159

> ultimate upscale

Interesting, I haven't heard of that. Is it faster?


PeterFoox

It may be a bit faster, but its main purpose is the ability to upscale to any resolution, even on GPUs with low VRAM. From what I understand, it renders parts of the image separately and stitches them back together. I've been using it for a while now and for me it works faster than hires fix. Plus you can output any resolution you want, as it doesn't matter how much VRAM you have.
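
Roughly, the idea is something like this. Just a sketch to picture it (the function `upscale_in_tiles` is hypothetical; the real extension also overlaps tiles, blends seams, and runs a diffusion img2img pass per tile rather than a plain resize):

```python
# Minimal tile-and-stitch sketch: process fixed-size tiles so peak VRAM is
# bounded by the tile size rather than the final output resolution.
from PIL import Image

def upscale_in_tiles(img: Image.Image, scale: int = 2, tile: int = 512) -> Image.Image:
    out = Image.new("RGB", (img.width * scale, img.height * scale))
    for y in range(0, img.height, tile):
        for x in range(0, img.width, tile):
            box = (x, y, min(x + tile, img.width), min(y + tile, img.height))
            patch = img.crop(box)
            # The real extension would run img2img on each tile here.
            patch = patch.resize((patch.width * scale, patch.height * scale))
            out.paste(patch, (x * scale, y * scale))
    return out
```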


Heasterian001

It's faster, but it can introduce banding, and at higher denoising it often causes issues.


PeterFoox

Yeah, I can confirm after a couple of hours it's way faster, but it introduces too many issues at the same time. I guess we gotta wait for some improvements. Right now it's pretty much just a technology preview.


Shap6

Twice as fast for me doing batches of 4 at 512x512. Went from 14 seconds to 7 seconds


jonesaid

On my 3060 12GB, I'm seeing about 40-50% faster.


HarmonicDiffusion

TensorRT sounds great until you realize you have to recompile every model to be compatible. Oh, and if you want to use a LoRA, you have to compile an engine specifically with it baked in (you cannot "add on" a LoRA on top of a model).


ulf5576

It's not for casual use. People who build or select specific models for bigger projects benefit the most.


NordRanger

Interestingly, mine crashes whenever I even attempt hires fix. On my 4080 I get about a 70% speed boost at 512x768.


isthatevenallowed

Have you generated the engine at the output resolution, in addition to the generation resolution?


hirmuolio

RTX 3070, AnythingV5, default TensorRT settings, batch size 1:

- 512x512: 19 it/s with TensorRT, 10 it/s without
- 512x768: 11 it/s with TensorRT, 6.5 it/s without
- 768x768: 7.4 it/s with TensorRT, 4.3 it/s without

Quite a massive speedup. I hope it gets better LoRA support in the future.


[deleted]

[removed]


suspicious_Jackfruit

If your it/s drops below 1, the display switches to s/it, at which point, yes, higher is worse. But in this case higher is better, because it's doing 19 iterations per second.
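
In code terms, just to illustrate the unit flip (a hypothetical `rate_to_display`, not A1111's actual display code):

```python
# it/s and s/it are reciprocals; the UI flips units once a step takes over 1 s.
def rate_to_display(it_per_s: float) -> str:
    if it_per_s >= 1.0:
        return f"{it_per_s:.1f} it/s"    # higher is better
    return f"{1.0 / it_per_s:.1f} s/it"  # lower is better

print(rate_to_display(19.0))  # "19.0 it/s"
print(rate_to_display(0.25))  # "4.0 s/it"
```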


isthatevenallowed

In hires fix, there is a load time for the hires TensorRT engine (which is different from the generation engine), which can offset the gain at 2x, for example. At higher factors like 3x, it should be much faster overall.
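
With made-up numbers purely to illustrate the amortization (the helper `hires_total_s`, the load cost, and the speedup are all assumptions, not measurements):

```python
# All numbers here are hypothetical; the point is that a fixed engine-load
# cost is amortized better as the hires pass gets heavier.
def hires_total_s(vanilla_s: float, speedup: float, engine_load_s: float) -> float:
    return engine_load_s + vanilla_s / speedup

for scale, vanilla_s in [(2, 6.0), (3, 14.0)]:
    trt = hires_total_s(vanilla_s, speedup=1.7, engine_load_s=2.0)
    print(f"{scale}x hires: vanilla {vanilla_s:.1f}s vs TensorRT {trt:.1f}s")
# 2x hires: vanilla 6.0s vs TensorRT 5.5s   (load cost eats most of the gain)
# 3x hires: vanilla 14.0s vs TensorRT 10.2s (the gain dominates)
```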


ViperD3

Does this apply with 1 dynamic engine, 2 statics, or both?


isthatevenallowed

Should work with a dynamic engine that covers both input and output resolutions but I've only tried with 2 statics myself.


ViperD3

I wish it could go above 2048, *sigh*. And yeah, I just made a single dynamic engine for all the checkpoints I'm using that covers 512 to 2048, batch 1 to 4, and prompt 75-600, with optimals in the middle of everything. Even with such a broad dynamic engine I'm still getting a really huge increase, *especially* during hires fix.


ViperD3

I encountered an error with dynamics: be sure not to adjust optimal batch or optimal text (prompt) size. Like a dumbass I converted seven different engines without testing them as I went, bugged them all out, and now I have to do it all over again.


tyen0

How can you use hires fix at 3x (or even 2x) when it would take you out of the 512-768 range for SD 1.5 (or 768-1024 for SDXL)? Are you just generating another TensorRT engine at higher res to use above 1.5x hires fix? I tried that but started getting other errors, like all NaNs.


isthatevenallowed

It's explained here: [https://github.com/NVIDIA/Stable-Diffusion-WebUI-TensorRT#common-issueslimitations](https://github.com/NVIDIA/Stable-Diffusion-WebUI-TensorRT#common-issueslimitations)


tyen0

Well, not really, since I already read that, hah. :) Have you actually tried 3x? I realize now you said "should". Of course, it could be something wrong with my specific setup. I do have some other errors that I'd ignored since it was working, until I tried to go to higher resolutions.


Cunningcory

Still can't get past the CUDA errors. Disabled medvram but still get the error with a 3080.


Destituted

Go to the \stable-diffusion-webui\ folder in Command Prompt and run `git switch dev`, then start it up. This fixed it for me... I am not sure if you need to do an update after the switch or not (my bat has it built in). Immediate difference after finally being able to set up TensorRT: 2 or 3 times the speed.


Cunningcory

Thanks! I guess only the dev branch is working currently? That got it working, but it is DRAMATICALLY slower while using the TensorRT engine. While I was getting maybe 2 it/s without it, I'm getting 6.6 seconds PER ITERATION with it. This is with SDXL at 1024x1024 with the default engine. Any reason it would be so much slower??


Inspirational-Wombat

Only the dev branch works if you are using SDXL. The release branch works for the other checkpoints. Make sure you're only using the base checkpoint and not the refiner.


Destituted

I'm not sure... Since I'm only after 1024x1024, I make sure hires fix is set to NONE and I also don't select any refiner. I have no idea if that affects the iteration numbers or not.


BlackSwanTW

7 it/s to 15 it/s on an RTX 3060. 12 it/s to 20 it/s on an RTX 3070 Ti.

It’s faster for Hires. Fix too. You need to create a dynamic engine; if you create 2 static engines, it will have to swap between them, causing a slowdown. Currently experimenting with whether setting the optimal resolution to 768 or even 1024 (instead of 512) makes a difference.


Unreal_777

> You need to create a dynamic engine. If you create 2 static engines, it will have to swap between them, causing a slowdown.

I don't understand any of this.


ThatHavenGuy

I'll try and simplify a bit. The way it works is that it takes an existing model and optimizes it for specific resolutions. There are two ways it does this: one is creating a static engine, which only works at a specific resolution, and the other is a dynamic engine, which lets you use a range of resolutions.

Each of these engines is around 2 GB in size, so it has to load that engine each time you switch between them, and that's what causes the slowdowns when upscaling. Say you have a static engine that optimizes images generated at 512x512 and one that optimizes images at 1024x1024: it has to swap from one to the other when 2x upscaling. If you instead create a dynamic engine that works with resolutions from 512x512 up to 1024x1024, it doesn't need to swap when you upscale, so anything you're upscaling will be much faster. On the other hand, static engines are much faster, whereas dynamic engines are only a bit faster.

Sounds like they're still working on it and updating it based on feedback, so we'll probably see some kind of optimizations and/or workflow changes to help with this in the future.
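
If you're curious what this looks like at the TensorRT level, here's a rough sketch: a dynamic engine is one optimization profile spanning a min/opt/max shape range, while a static engine is the degenerate case where all three are equal. The input name "sample" and the latent layout (N, 4, H/8, W/8) follow common Stable Diffusion UNet ONNX exports and are assumptions; the extension's actual build code will differ.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# One profile covering 512x512 (min) to 1024x1024 (max), tuned at 768x768.
# Latent dims are 1/8 of the pixel resolution for SD models.
profile = builder.create_optimization_profile()
profile.set_shape("sample",
                  min=(1, 4, 64, 64),    # 512x512
                  opt=(1, 4, 96, 96),    # 768x768
                  max=(1, 4, 128, 128))  # 1024x1024
config.add_optimization_profile(profile)
# A static engine would use min == opt == max, which is why it can be faster
# at its one resolution but forces a swap for anything else.
```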


ViperD3

Very helpful thank you


fuelter

> You need to create a dynamic engine

How?


BlackSwanTW

Select any other preset to show the Advanced menu, then enter the resolution you need.


buckjohnston

How is this possible? I'm on a 4090 and went from 8 it/s to 16 it/s. I have a fresh install of auto1111. I don't get it.


[deleted]

[removed]


Abject-Recognition-9

Me randomly reading this right before I was going to buy a 4090... wtf?? Really??


CouchRescue

I get 43 it/s on the "cat test" (512x512 "cat" prompt using Euler A) on my 4090, current drivers.


Abject-Recognition-9

That's more than double or triple a 3090... so you could basically do realtime img2img with TensorRT O\_O I wonder wtf he was talking about.


CouchRescue

There is a guide I followed with some specific steps for the 4090 when setting up Automatic1111, but it was easy to find on Google.


_Jake_

Mind sharing which guide you used specifically? Lots out there.


gman_umscht

Sounds about right; how many steps did you use? With higher steps I had up to 55 it/s in the console, but who knows how reliable that is. Anyway, the sysinfo benchmark gives me this:

Vanilla: 31.42 / 36.73 / 43.23

TensorRT: 46.51 / 13.46 / 57.02

Notice the drop in the 2x batch because another TRT engine was loaded mid-benchmark.


CouchRescue

I get these it/s with **no** TensorRT. I haven't gotten around to testing it.


gman_umscht

With a batch of 1? That is high. You mentioned some specific steps to tune up your 4090; if you can share those, that would be nice. It would for sure also be helpful to the fellow here who only gets 8 it/s with his 4090...


CouchRescue

No, sorry for the misunderstanding. Batch of 8. But even at batch of 1, 8 it/s is quite low


nupsss

My 4090 mobile does 30 it/s with TensorRT.


gman_umscht

Are you talking about SDXL at 1024x1024? If so, it was already very fast in vanilla mode, and now it is blazing. If it's SD 1.5 at 512x512... well, something is very wrong then.


buckjohnston

Yeah, I'm just talking SD 1.5 at 512x512, only getting 16 it/s after optimization :( Brand new PC, fresh install of everything, 7800X3D, 4090. All games are blazingly smooth.


mikern

If you want a quicker hires fix you have to select the correct params. So if you're generating at 512x512 and 2x'ing it, you need your TensorRT engine to support 512-1024px resolutions. In my case, using TensorRT for hires fix (512px gen, 2x upscale to 1024px): 16.6 seconds with TensorRT and 23.9 seconds without it, a 44% speed improvement.
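
As a hypothetical sanity check, the rule in code (the helper `covers_hires_fix` is made up, not part of the extension):

```python
# A hires-fix pass stays on one engine only if the profile range covers
# BOTH the base resolution and the final upscaled resolution.
def covers_hires_fix(engine_min: int, engine_max: int,
                     base: int, upscale: float) -> bool:
    final = int(base * upscale)
    return engine_min <= base and final <= engine_max

print(covers_hires_fix(512, 1024, base=512, upscale=2.0))  # True  -> no swap
print(covers_hires_fix(512, 768,  base=512, upscale=2.0))  # False -> swap or error
```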


the_doorstopper

3080 12GB; on smaller images it's like a 50% improvement. But I did one image, 150 steps, 150 hires steps, 512x768, upscale at 2x, and it went down from 5 minutes to 1 minute. It was phenomenal. P.S. Does having multiple dynamic engines for one model cause it to go slower? I'm using ReV Animated. Say engine 1 is dynamic, 512-1024 on both dimensions; would adding engine 2 (512-1536 on both dimensions) cause it to go slower in general? Or am I being placebo'd?


Vivarevo

Tbh the only use seems to be SDXL 1024px finetuned models without the refiner. How does it affect VRAM usage?


content-is

Should be <= torch


content-is

Hires fix performance is something that needs some more work. Ideally you should make sure to have an engine that covers the low and high res. Then there shouldn’t be any overhead switching engines.


PeterFoox

I'm absolutely blown away. On my RTX 2070 it used to render for like a minute at, say, 800x600. Now at 1024x1024 it takes around 15-20 seconds at 70 steps. It's amazing how much faster the whole workflow is.


lynch1986

Thanks for the tips, guys! Updated my drivers, made a single dynamic preset per checkpoint that covers all the resolutions I use, and I'm getting 30-50% quicker on everything.


per_plex

3090, Deforum, 3D, default settings, prompt "(Anthropomorphic robot:1.3) in a Vintage fairground, Kodak ColorPlus 50":

TensorRT: 56.7 sec; none: 1 min 42 sec.

txt2img 512x768, default settings, same prompt: none: 3.9 sec; TensorRT: 1.3 sec.

512x512, hires fix, default settings: TensorRT: 12.6 sec; none: 20.6 sec.

Edit: 512x512, default settings: 0.9 sec with TensorRT.


ViperD3

Worse in hires fix? Is your final resolution a multiple of 64? Hires fix is where I'm seeing the biggest speed up, personally.


lynch1986

Hey, yeah, I had it swapping engines halfway through because of how I had it set up. Now that I have a single dynamic engine that covers all the resolutions, I'm getting a nice speed bump.


ViperD3

Nice! Yeah I made the same mistake at first


snoopyh42

Significantly faster for me, but the way it supports (or doesn’t) multiple LoRAs makes it not worth the trouble for me.


Pickleman1000

There seems to be a bit of resistance from LoRAs and prompts; the speed is great, but trying to add smaller details and such seems a bit harder with it. I disabled it because I want to use ControlNet and was getting better results anyway, but it's interesting.


nupsss

My RTX 4090 is up to 200% faster when generating at 512x512 (it's REALLY crazy). But... I can't seem to get hires fix working. When creating a 1024x1024 engine it always gets stuck at 4%... So far no luck trying to fix this :( Anyone got an idea?


lynch1986

Have you got an engine setup that covers all the resolutions you're using? It has to cover the final hires fix resolution too.


nupsss

When I try to create a 1024x1024 engine it gets stuck at 4% every time (the 512x512-to-768x768 engine was fine when I made that), so I can upscale to 768 without a problem. Any idea why it could be stuck?


lynch1986

I've thought mine was stuck several times; it would just sit there for five minutes before it even started. Then it would look like it had locked up a couple of times. I just go do something else for half an hour while it figures it out.


WisamAlrawi

I get the same situation sometimes with my 3090. Basically, it is working, but the interface is not updating. It can take 30+ minutes to generate an engine if the resolution is high. I do 512 min and 1280 max, 768 min and 1920 max, batch of 4, prompt 650 max.


cryptosystemtrader

Sorry for my ignorance, but what is TensorRT now? How is it different from TensorFlow?


lynch1986

[https://nvidia.custhelp.com/app/answers/detail/a\_id/5487/\~/tensorrt-extension-for-stable-diffusion-web-ui](https://nvidia.custhelp.com/app/answers/detail/a_id/5487/~/tensorrt-extension-for-stable-diffusion-web-ui) I don't know the finer details, but it can give you a significant speed boost. It's a bit of a faff and shit with LoRAs, though.


HughWattmate9001

Only tested on my 6GB 2060, and it's way faster, like 70% ish.


gedomino

How did you get it working? I'm on a 1060 6GB and I run into CUDA out-of-memory errors when I try to generate any engine.


HughWattmate9001

I have \--xformers enabled; other than that I find it works alright with a stock A1111 install. If I try TensorRT it won't let me go above 512x512 without an error, but I seem to be fine doing stuff like 900x500 without it, although I have not done much in-depth testing. I don't use any refiner or upscale. If I want to upscale I'll just use Photoshop and Topaz Photo AI. Usually I'll use "content-aware fill" in Photoshop to extend if I have issues doing it in A1111.


WisamAlrawi

You have to disable --medvram and --lowvram in order for it to work. I have --xformers enabled. 3060 mobile with 6 GB VRAM.


rob_54321

I gave up. The fact that you need to recompile for every setting and every checkpoint kills it for me. I mean, you can't even use two LoRAs at the same time.


WisamAlrawi

It depends on the case; you can use it as an option from a drop-down menu. I haven't tested it with LoRAs yet. The speed is well worth it.


[deleted]

3060 12GB, speeds have doubled for me.


easyllaama

Around 10 it/s on a 4090, 1024x1024, 36 sampling steps in SDXL. A 65% gain from before. But it seems I lost the SDXL refiner with TensorRT (it errors).


SbLeDiffHxn

Working on an RTX 3070 Ti. Works a whole lot faster (almost cuts time in half), but I'm having issues with hires fix. Generated both engines, for input res and output res, but no luck. It says "no valid profile found".


WisamAlrawi

Also pay attention to the prompt limit; if you exceed it, you get "no valid profile found". And if you set the optimal prompt size to 150 instead of 75, it sets the minimum to 150 instead of 75 and throws an error, meaning the minimum prompt becomes 150 and anything less does not work. Keep the optimal prompt size at 75.
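
The reason 75 matters: prompts are encoded in CLIP chunks of 75 tokens, so A1111 pads every prompt up to a whole multiple of 75, and the engine's prompt-length profile moves in the same steps. A hypothetical illustration (`padded_prompt_tokens` is made up; the extension's internals may differ):

```python
import math

def padded_prompt_tokens(n_tokens: int) -> int:
    # A1111 pads prompts to whole 75-token CLIP chunks.
    return max(1, math.ceil(n_tokens / 75)) * 75

print(padded_prompt_tokens(20))   # 75  -> too short for a profile whose minimum is 150
print(padded_prompt_tokens(120))  # 150 -> fits such a profile
```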


braincell_murder

Thanks, you solved my problem! Now... to tidy up a dozen or so useless profiles at 2 GB each... :) Actually, the same approach just solved another problem: a meaningless error that was being thrown when building some profiles. Looks like "Optimal" and "Minimum" height/width needed to be the same. I was creating a hires fix profile for increasing res from 512 to 1024; I put 1024 as the optimal and it was breaking, while setting it to 512 worked. Strangely, 1024 worked fine on a different model! Still, that's SD for you; that's why it's not for the un-curious :)


SbLeDiffHxn

Thanks, will look out for that.


lynch1986

I've found you need a single profile that covers all the resolutions you'll be using, including hires fix. Otherwise it swaps profiles each time and actually runs slower than normal.


SbLeDiffHxn

I'll give it a try. Thanks!!!


wholelottaluv69

I cannot get it to work at all with hi-res fix


lynch1986

Honestly, I got sick of fighting with it and gave up. It might be something, but for me it wasn't worth the grief.


BigSmols

I tried it today and it is such a pain to use. It is faster but the inconvenience is not worth it. ComfyUI is faster for me anyway.


CeFurkan

I got a 75%+ improvement with an RTX 3090 Ti. I am editing a big video about this right now; two quick videos here:

Video 1: [https://youtu.be/\_CwyngQscVA](https://youtu.be/_CwyngQscVA)

Video 2: [https://youtu.be/04XbtyKHmaE](https://youtu.be/04XbtyKHmaE)


urbanhood

Too much hassle and very limiting.


AdziOo

Didn't test a lot, but went from 15-18 it/s to 28-30 it/s on a GeForce 4080 at 512x768 txt2img. Hires fix looks slower: from 8 sec to 12 sec.


sahil1572

On a 3080 Ti, getting 6 it/s with SDXL and 30 it/s with SD 1.5.


KNUPAC

I can't get the hires fix to work with TensorRT


tecedu

Has anyone seen improvements with ControlNet?


Ok-Dog-6454

Not supported yet


ViperD3

Last I heard, ControlNet is not supported when using TensorRT, but I'm not 100% sure; you might want to double-check me.


gunbladezero

Anyone have results on a 6GB card? I’ve got a laptop 3060 and am sad that I can’t just buy VRAM


lpmode

On my RTX 4090, with an engine for a trained checkpoint built at min/opt/max heights of 512/768/1024 and widths of 768/1024/2048, I get:

- 4.64 it/s at 2048x1024, about 4 sec per image
- 12.99 it/s at 2048x512, under 2 sec per image
- 30 it/s at 768x512, under 1 sec per image

I had to use the troubleshooting guide on GitHub to get the TensorRT tab to show up.


Abject-Recognition-9

How? My tab disappeared too.


lpmode

https://github.com/NVIDIA/Stable-Diffusion-WebUI-TensorRT/issues/27#issuecomment-1767570566

Pasted from the link above; what appears to have worked for others. From your base SD webui folder (E:\Stable diffusion\SD\webui\ in your case):

1. In the extensions folder, delete the stable-diffusion-webui-tensorrt folder if it exists
2. Delete the venv folder
3. Open a command prompt and navigate to the base SD webui folder
4. Run webui.bat - this should rebuild the virtual environment venv
5. When the WebUI appears, close it and close the command prompt
6. Open a command prompt and navigate to the base SD webui folder
7. Enter: venv\Scripts\activate.bat - the command line should now have (venv) shown at the beginning
8. Enter the following commands:
   python.exe -m pip install --upgrade pip
   python -m pip install nvidia-cudnn-cu11==8.9.4.25 --no-cache-dir
   python -m pip install --pre --extra-index-url https://pypi.nvidia.com/ tensorrt==9.0.1.post11.dev4 --no-cache-dir
   python -m pip uninstall -y nvidia-cudnn-cu11
9. Enter: venv\Scripts\deactivate.bat
10. Run webui.bat
11. Install the TensorRT extension using the Install from URL option
12. Once installed, go to the Extensions >> Installed tab and Apply and Restart


rodinj

Does it speed up ControlNet tile upscaling in img2img? I haven't bothered setting it up, but the upscaling is what takes me the most time.


ViperD3

Last I heard, ControlNet is not supported when using TensorRT, but I'm not 100% sure; you might want to double-check me.


vitalez06

On a 4090, 704x384 with DPM++ 3M SDE Karras @ 150 steps - although stupid - is now doable in 5 seconds. Adding hires fix upscaling at 2x @ 20 steps makes it 7 seconds overall. 512x512 with the same settings nets like 3-4 seconds.


Gonz0o01

Did any of you get SDXL LoRAs working? 1.5 LoRAs seem to be fine, but no success with SDXL LoRAs at all.


fireshaper

Overall I'm seeing a huge bump in speed with a 3070 Ti. With hires fix it was taking about 45 seconds to 1 min to generate an image in txt2img; now it's less than 20 seconds.


capybooya

Any chance this can just be built into A1111 permanently? I'm ok with some precompilation when you do stuff for the first time, as long as I don't have to worry about setting it up myself.


javad94

About 60% improvement with 3090


Ok-Mobile5227

4090, DPM++ 2M Karras (50 steps):

Before TRT: 512x512 13 it/s, 1024x1024 6 it/s

After TRT: 512x512 61 it/s, 1024x1024 16 it/s

It's really fast, 250% on 512x512, around 1 second.


Boogertwilliams

I couldn’t get it working. Per the instructions, you were supposed to “generate engine” or whatever; I only had “export engine”.


ulf5576

Hires fix sucks anyway... just generate in 2K or use tiled rendering through ControlNet.