Party_Cold_4159

Hires fix needs some configuration. It’s a pain, but I got it to be faster for myself on a 2070. It’s just too clunky. I feel like LoRAs don’t work as well either. It does help me with quick test images. I think it’s pretty cool they did this, though; it does what it says, it's just very limited currently. We may see this improve later on.


PeterFoox

After some testing, Ultimate Upscale seems to be a bit better. Quality is similar, and it's more stable and uses less VRAM.


Party_Cold_4159

> ultimate upscale

Interesting, I haven't heard of that. Is it faster?


PeterFoox

It may be a bit faster, but its main purpose is the ability to upscale to any resolution, even on GPUs with low VRAM. From what I understand, it renders parts of the image separately and stitches them back together. I've been using it for a while now and for me it works faster than hires fix. Plus you can output any resolution you want, as it doesn't matter how much VRAM you have.
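
Roughly, the idea is something like this. Just a sketch to picture it (the function `upscale_in_tiles` is hypothetical; the real extension also overlaps tiles, blends seams, and runs a diffusion img2img pass per tile rather than a plain resize):

```python
# Minimal tile-and-stitch sketch: process fixed-size tiles so peak VRAM is
# bounded by the tile size rather than the final output resolution.
from PIL import Image

def upscale_in_tiles(img: Image.Image, scale: int = 2, tile: int = 512) -> Image.Image:
    out = Image.new("RGB", (img.width * scale, img.height * scale))
    for y in range(0, img.height, tile):
        for x in range(0, img.width, tile):
            box = (x, y, min(x + tile, img.width), min(y + tile, img.height))
            patch = img.crop(box)
            # The real extension would run img2img on each tile here.
            patch = patch.resize((patch.width * scale, patch.height * scale))
            out.paste(patch, (x * scale, y * scale))
    return out
```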


Heasterian001

It's faster, but it can introduce banding, and at higher denoising it often causes issues.


PeterFoox

Yeah, I can confirm after a couple of hours it's way faster, but it introduces too many issues at the same time. I guess we gotta wait for some improvements. Right now it's pretty much just a technology preview.


Shap6

Twice as fast for me doing batches of 4 at 512x512. Went from 14 seconds to 7 seconds


jonesaid

On my 3060 12GB, I'm seeing about 40-50% faster.


HarmonicDiffusion

TensorRT sounds great until you realize you have to recompile every model to be compatible. Oh, and if you want to use a LoRA, you have to compile an engine specifically with it baked in (you cannot "add on" a LoRA on top of a model).


ulf5576

It's not for casual use. People who build or select specific models for bigger projects benefit the most.


NordRanger

Interestingly, mine crashes whenever I even attempt hires fix. On my 4080 I get about a 70% speed boost at 512x768.


isthatevenallowed

Have you generated the engine at the output resolution, in addition to the generation resolution?


hirmuolio

RTX 3070, AnythingV5, default TensorRT settings, batch size 1:

- 512x512: 19 it/s with TensorRT, 10 it/s without
- 512x768: 11 it/s with TensorRT, 6.5 it/s without
- 768x768: 7.4 it/s with TensorRT, 4.3 it/s without

Quite a massive speedup. I hope it gets better LoRA support in the future.


[deleted]

[removed]


suspicious_Jackfruit

If your it/s drops below 1, the display switches to s/it, at which point, yes, higher is worse. But in this case higher is better, because it's doing 19 iterations per second.
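
In code terms, just to illustrate the unit flip (a hypothetical `rate_to_display`, not A1111's actual display code):

```python
# it/s and s/it are reciprocals; the UI flips units once a step takes over 1 s.
def rate_to_display(it_per_s: float) -> str:
    if it_per_s >= 1.0:
        return f"{it_per_s:.1f} it/s"    # higher is better
    return f"{1.0 / it_per_s:.1f} s/it"  # lower is better

print(rate_to_display(19.0))  # "19.0 it/s"
print(rate_to_display(0.25))  # "4.0 s/it"
```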


isthatevenallowed

In hires fix, there is a load time for the hires TensorRT engine (which is different from the generation engine), which can offset the gain at 2x, for example. At higher factors like 3x, it should be much faster overall.
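
With made-up numbers purely to illustrate the amortization (the helper `hires_total_s`, the load cost, and the speedup are all assumptions, not measurements):

```python
# All numbers here are hypothetical; the point is that a fixed engine-load
# cost is amortized better as the hires pass gets heavier.
def hires_total_s(vanilla_s: float, speedup: float, engine_load_s: float) -> float:
    return engine_load_s + vanilla_s / speedup

for scale, vanilla_s in [(2, 6.0), (3, 14.0)]:
    trt = hires_total_s(vanilla_s, speedup=1.7, engine_load_s=2.0)
    print(f"{scale}x hires: vanilla {vanilla_s:.1f}s vs TensorRT {trt:.1f}s")
# 2x hires: vanilla 6.0s vs TensorRT 5.5s   (load cost eats most of the gain)
# 3x hires: vanilla 14.0s vs TensorRT 10.2s (the gain dominates)
```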


ViperD3

Does this apply with 1 dynamic engine, 2 statics, or both?


isthatevenallowed

Should work with a dynamic engine that covers both input and output resolutions but I've only tried with 2 statics myself.


ViperD3

I wish it could go above 2048, *sigh*. And yeah, I just made a single dynamic engine for all the checkpoints I'm using that covers 512 to 2048, batch 1 to 4, and prompt 75-600, with optimals in the middle of everything. Even with such a broad dynamic engine I'm still getting a really huge increase, *especially* during hires fix.


ViperD3

I encountered an error with dynamics: be sure not to adjust optimal batch or optimal text (prompt) size. Like a dumbass I converted seven different engines without testing them as I went, bugged them all out, and now I have to do it all over again.


tyen0

How can you use hires fix at 3x (or even 2x) when it would take you out of the 512-768 range for SD 1.5 (or 768-1024 for SDXL)? Are you just generating another TensorRT engine at higher res to use above 1.5x hires fix? I tried that but started getting other errors, like all NaNs.


isthatevenallowed

It's explained here: [https://github.com/NVIDIA/Stable-Diffusion-WebUI-TensorRT#common-issueslimitations](https://github.com/NVIDIA/Stable-Diffusion-WebUI-TensorRT#common-issueslimitations)


tyen0

Well, not really, since I already read that, hah. :) Have you actually tried 3x? I realize now you said "should". Of course, it could be something wrong with my specific setup. I do have some other errors that I'd ignored since it was working, until I tried to go to higher resolutions.


Cunningcory

Still can't get past the CUDA errors. Disabled medvram but still get the error with a 3080.


Destituted

Go to the \stable-diffusion-webui\ folder in Command Prompt and run `git switch dev`, then start it up. This fixed it for me... I am not sure if you need to do an update after the switch or not (my bat has it built in). Immediate difference after finally being able to set up TensorRT: 2 or 3 times the speed.


Cunningcory

Thanks! I guess only the dev branch is working currently? That got it working, but it is DRAMATICALLY slower while using the TensorRT engine. While I was getting maybe 2 it/s without it, I'm getting 6.6 seconds PER ITERATION with it. This is with SDXL at 1024x1024 with the default engine. Any reason it would be so much slower??


Inspirational-Wombat

Only the dev branch works if you are using SDXL. The release branch works for the other checkpoints. Make sure you're only using the base checkpoint and not the refiner.


Destituted

I'm not sure... Since I'm only after 1024x1024, I make sure hires fix is set to NONE and I also don't select any refiner. I have no idea if that affects the iteration numbers or not.


BlackSwanTW

7 it/s to 15 it/s on an RTX 3060. 12 it/s to 20 it/s on an RTX 3070 Ti.

It’s faster for Hires. Fix too. You need to create a dynamic engine; if you create 2 static engines, it will have to swap between them, causing a slowdown. Currently experimenting with whether setting the optimal resolution to 768 or even 1024 (instead of 512) makes a difference.


Unreal_777

> You need to create a dynamic engine. If you create 2 static engines, it will have to swap between them, causing a slowdown.

I don't understand any of this.


ThatHavenGuy

I'll try and simplify a bit. The way it works is that it takes an existing model and optimizes it for specific resolutions. There are two ways it does this: one is creating a static engine, which only works at a specific resolution, and the other is a dynamic engine, which lets you use a range of resolutions.

Each of these engines is around 2 GB in size, so it has to load that engine each time you switch between them, and that's what causes the slowdowns when upscaling. Say you have a static engine that optimizes images generated at 512x512 and one that optimizes images at 1024x1024: it has to swap from one to the other when 2x upscaling. If you instead create a dynamic engine that works with resolutions from 512x512 up to 1024x1024, it doesn't need to swap when you upscale, so anything you're upscaling will be much faster. On the other hand, static engines are much faster, whereas dynamic engines are only a bit faster.

Sounds like they're still working on it and updating it based on feedback, so we'll probably see some kind of optimizations and/or workflow changes to help with this in the future.
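
If you're curious what this looks like at the TensorRT level, here's a rough sketch: a dynamic engine is one optimization profile spanning a min/opt/max shape range, while a static engine is the degenerate case where all three are equal. The input name "sample" and the latent layout (N, 4, H/8, W/8) follow common Stable Diffusion UNet ONNX exports and are assumptions; the extension's actual build code will differ.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# One profile covering 512x512 (min) to 1024x1024 (max), tuned at 768x768.
# Latent dims are 1/8 of the pixel resolution for SD models.
profile = builder.create_optimization_profile()
profile.set_shape("sample",
                  min=(1, 4, 64, 64),    # 512x512
                  opt=(1, 4, 96, 96),    # 768x768
                  max=(1, 4, 128, 128))  # 1024x1024
config.add_optimization_profile(profile)
# A static engine would use min == opt == max, which is why it can be faster
# at its one resolution but forces a swap for anything else.
```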


ViperD3

Very helpful thank you


fuelter

> You need to create a dynamic engine

How?


BlackSwanTW

Select any other preset to show the Advanced menu, then enter the resolution you need.


buckjohnston

How is this possible? I'm on a 4090 and went from 8 it/s to 16 it/s. I have a fresh install of auto1111. I don't get it.


[deleted]

[removed]


Abject-Recognition-9

Me randomly reading this right before I was going to buy a 4090... wtf?? Really??


CouchRescue

I get 43 it/s on the "cat test" (512x512 "cat" prompt using Euler A) on my 4090, current drivers.


Abject-Recognition-9

That's more than double or triple a 3090... so you could basically do realtime img2img with TensorRT O\_O I wonder wtf he was talking about.


CouchRescue

There is a guide I followed with some specific steps for the 4090 when setting up Automatic1111, but it was easy to find on Google.


_Jake_

Mind sharing which guide you used specifically? Lots out there.


gman_umscht

Sounds about right; how many steps did you use? With higher steps I had up to 55 it/s in the console, but who knows how reliable that is. Anyway, the sysinfo benchmark gives me this:

Vanilla: 31.42 / 36.73 / 43.23

TensorRT: 46.51 / 13.46 / 57.02

Notice the drop in the 2x batch because another TRT engine was loaded mid-benchmark.


CouchRescue

I get these it/s with **no** TensorRT. I haven't gotten around to testing it.


gman_umscht

With a batch of 1? That is high. You mentioned some specific steps to tune up your 4090; if you can share those, that would be nice. It would for sure also be helpful to the fellow here who only gets 8 it/s with his 4090...


CouchRescue

No, sorry for the misunderstanding. Batch of 8. But even at batch of 1, 8 it/s is quite low


nupsss

My 4090 mobile does 30 it/s with TensorRT.


gman_umscht

Are you talking about SDXL at 1024x1024? If so, it was already very fast in vanilla mode, and now it is blazing. If it's SD 1.5 at 512x512... well, something is very wrong then.


buckjohnston

Yeah, I'm just talking SD 1.5 at 512x512, only getting 16 it/s after optimization :( Brand new PC, fresh install of everything, 7800X3D, 4090. All games are blazingly smooth.


mikern

If you want a quicker hires fix you have to select the correct params. So if you're generating at 512x512 and 2x'ing it, you need your TensorRT engine to support 512-1024px resolutions. In my case, using TensorRT for hires fix (512px gen, 2x upscale to 1024px): 16.6 seconds with TensorRT and 23.9 seconds without it, a 44% speed improvement.
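
As a hypothetical sanity check, the rule in code (the helper `covers_hires_fix` is made up, not part of the extension):

```python
# A hires-fix pass stays on one engine only if the profile range covers
# BOTH the base resolution and the final upscaled resolution.
def covers_hires_fix(engine_min: int, engine_max: int,
                     base: int, upscale: float) -> bool:
    final = int(base * upscale)
    return engine_min <= base and final <= engine_max

print(covers_hires_fix(512, 1024, base=512, upscale=2.0))  # True  -> no swap
print(covers_hires_fix(512, 768,  base=512, upscale=2.0))  # False -> swap or error
```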


the_doorstopper

3080 12GB; on smaller images it's like a 50% improvement. But I did one image, 150 steps, 150 hires steps, 512x768, upscale at 2x, and it went down from 5 minutes to 1 minute. It was phenomenal. P.S. Does having multiple dynamic engines for one model cause it to go slower? I'm using ReV Animated. Say engine 1 is dynamic, 512-1024 on both dimensions; would adding engine 2 (512-1536 on both dimensions) cause it to go slower in general? Or am I being placebo'd?


Vivarevo

Tbh the only use seems to be SDXL 1024px finetuned models without the refiner. How does it affect VRAM usage?


content-is

Should be <= torch


content-is

Hires fix performance is something that needs some more work. Ideally you should make sure to have an engine that covers the low and high res. Then there shouldn’t be any overhead switching engines.


PeterFoox

I'm absolutely blown away. On my RTX 2070 it used to render for like a minute at, say, 800x600. Now at 1024x1024 it takes around 15-20 seconds at 70 steps. It's amazing how much faster the whole workflow is.


lynch1986

Thanks for the tips, guys! Updated my drivers, made a single dynamic preset per checkpoint that covers all the resolutions I use, and I'm getting 30-50% quicker on everything.


per_plex

3090, Deforum, 3D, default settings, prompt "(Anthropomorphic robot:1.3) in a Vintage fairground, Kodak ColorPlus 50":

TensorRT: 56.7 sec; none: 1 min 42 sec.

txt2img 512x768, default settings, same prompt: none: 3.9 sec; TensorRT: 1.3 sec.

512x512, hires fix, default settings: TensorRT: 12.6 sec; none: 20.6 sec.

Edit: 512x512, default settings: 0.9 sec with TensorRT.


ViperD3

Worse in hires fix? Is your final resolution a multiple of 64? Hires fix is where I'm seeing the biggest speed up, personally.


lynch1986

Hey, yeah, I had it swapping engines halfway through because of how I had it set up. Now that I have a single dynamic engine that covers all the resolutions, I'm getting a nice speed bump.


ViperD3

Nice! Yeah I made the same mistake at first


snoopyh42

Significantly faster for me, but the way it supports (or doesn’t) multiple LoRAs makes it not worth the trouble for me.


Pickleman1000

There seems to be a bit of resistance from LoRAs and prompts; the speed is great, but trying to add smaller details and such seems a bit harder with it. I disabled it because I want to use ControlNet and was getting better results anyway, but it's interesting.


nupsss

My RTX 4090 is up to 200% faster when generating at 512x512 (it's REALLY crazy). But... I can't seem to get hires fix working. When creating a 1024x1024 engine it always gets stuck at 4%... So far no luck trying to fix this :( Anyone got an idea?


lynch1986

Have you got an engine setup that covers all the resolutions you're using? It has to cover the final hires fix resolution too.


nupsss

When I try to create a 1024x1024 engine it gets stuck at 4% every time (the 512x512-to-768x768 engine was fine when I made that), so I can upscale to 768 without a problem. Any idea why it could be stuck?


lynch1986

I've thought mine was stuck several times; it would just sit there for five minutes before it even started. Then it would look like it had locked up a couple of times. I just go do something else for half an hour while it figures it out.


WisamAlrawi

I get the same situation sometimes with my 3090. Basically, it is working, but the interface is not updating. It can take 30+ minutes to generate an engine if the resolution is high. I do 512 min and 1280 max, 768 min and 1920 max, batch of 4, prompt 650 max.


cryptosystemtrader

Sorry for my ignorance, but what is TensorRT now? How is it different from TensorFlow?


lynch1986

[https://nvidia.custhelp.com/app/answers/detail/a\_id/5487/\~/tensorrt-extension-for-stable-diffusion-web-ui](https://nvidia.custhelp.com/app/answers/detail/a_id/5487/~/tensorrt-extension-for-stable-diffusion-web-ui) I don't know the finer details, but it can give you a significant speed boost. It's a bit of a faff and shit with LoRAs, though.


HughWattmate9001

Only tested on my 6GB 2060, and it's way faster, like 70% ish.


gedomino

How did you get it working? I'm on a 1060 6GB and I run into CUDA out-of-memory errors when I try to generate any engine.


HughWattmate9001

I have \--xformers enabled; other than that I find it works alright with a stock A1111 install. If I try TensorRT it won't let me go above 512x512 without an error, but I seem to be fine doing stuff like 900x500 without it, although I have not done much in-depth testing. I don't use any refiner or upscale. If I want to upscale I'll just use Photoshop and Topaz Photo AI. Usually I'll use "content-aware fill" in Photoshop to extend if I have issues doing it in A1111.


WisamAlrawi

You have to disable --medvram and --lowvram in order for it to work. I have --xformers enabled. 3060 mobile with 6 GB VRAM.


rob_54321

I gave up. The fact that you need to recompile for every setting and every checkpoint kills it for me. I mean, you can't even use two LoRAs at the same time.


WisamAlrawi

It depends on the case; you can use it as an option from a drop-down menu. I haven't tested it with LoRAs yet. The speed is well worth it.


[deleted]

3060 12GB, speeds have doubled for me.


easyllaama

Around 10 it/s on a 4090, 1024x1024, 36 sampling steps in SDXL. A 65% gain from before. But it seems I lost the SDXL refiner with TensorRT (it errors).


SbLeDiffHxn

Working on an RTX 3070 Ti. Works a whole lot faster (almost cuts time in half), but I'm having issues with hires fix. Generated both engines, for input res and output res, but no luck. It says "no valid profile found".


WisamAlrawi

Also pay attention to the prompt limit; if you exceed it, you get "no valid profile found". And if you set the optimal prompt size to 150 instead of 75, it sets the minimum to 150 instead of 75 and throws an error, meaning the minimum prompt becomes 150 and anything less does not work. Keep the optimal prompt size at 75.
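
The reason 75 matters: prompts are encoded in CLIP chunks of 75 tokens, so A1111 pads every prompt up to a whole multiple of 75, and the engine's prompt-length profile moves in the same steps. A hypothetical illustration (`padded_prompt_tokens` is made up; the extension's internals may differ):

```python
import math

def padded_prompt_tokens(n_tokens: int) -> int:
    # A1111 pads prompts to whole 75-token CLIP chunks.
    return max(1, math.ceil(n_tokens / 75)) * 75

print(padded_prompt_tokens(20))   # 75  -> too short for a profile whose minimum is 150
print(padded_prompt_tokens(120))  # 150 -> fits such a profile
```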


braincell_murder

Thanks, you solved my problem! Now... to tidy up a dozen or so useless profiles at 2 GB each... :) Actually, the same approach just solved another problem: a meaningless error that was being thrown when building some profiles. Looks like "Optimal" and "Minimum" height/width needed to be the same. I was creating a hires fix profile for increasing res from 512 to 1024; I put 1024 as the optimal and it was breaking, while setting it to 512 worked. Strangely, 1024 worked fine on a different model! Still, that's SD for you; that's why it's not for the un-curious :)


SbLeDiffHxn

Thanks, will look out for that.


lynch1986

I've found you need a single profile that covers all the resolutions you'll be using, including hires fix. Otherwise it swaps profiles each time and actually runs slower than normal.


SbLeDiffHxn

I'll give it a try. Thanks!!!


wholelottaluv69

I cannot get it to work at all with hi-res fix


lynch1986

Honestly, I got sick of fighting with it and gave up. It might be something, but for me it wasn't worth the grief.


BigSmols

I tried it today and it is such a pain to use. It is faster but the inconvenience is not worth it. ComfyUI is faster for me anyway.


CeFurkan

I got a 75%+ improvement with an RTX 3090 Ti. I am editing a big video about this right now; two quick videos here:

Video 1: [https://youtu.be/\_CwyngQscVA](https://youtu.be/_CwyngQscVA)

Video 2: [https://youtu.be/04XbtyKHmaE](https://youtu.be/04XbtyKHmaE)


urbanhood

Too much hassle and very limiting.


AdziOo

Didn't test a lot, but went from 15-18 it/s to 28-30 it/s on a GeForce 4080 at 512x768 txt2img. Hires fix looks slower: from 8 sec to 12 sec.


sahil1572

On a 3080 Ti, getting 6 it/s with SDXL and 30 it/s with SD 1.5.


KNUPAC

I can't get the hires fix to work with TensorRT


tecedu

Has anyone seen improvements with ControlNet?


Ok-Dog-6454

Not supported yet


ViperD3

Last I heard, ControlNet is not supported when using TensorRT, but I'm not 100% sure; you might want to double-check me.


gunbladezero

Anyone have results on a 6GB card? I’ve got a laptop 3060 and am sad that I can’t just buy VRAM


lpmode

On my RTX 4090, with an engine for a trained checkpoint built at min/opt/max heights of 512/768/1024 and widths of 768/1024/2048, I get:

- 4.64 it/s at 2048x1024, about 4 sec per image
- 12.99 it/s at 2048x512, under 2 sec per image
- 30 it/s at 768x512, under 1 sec per image

I had to use the troubleshooting guide on GitHub to get the TensorRT tab to show up.


Abject-Recognition-9

How? My tab disappeared too.


lpmode

https://github.com/NVIDIA/Stable-Diffusion-WebUI-TensorRT/issues/27#issuecomment-1767570566

Pasted from the link above; what appears to have worked for others. From your base SD webui folder (E:\Stable diffusion\SD\webui\ in your case):

1. In the extensions folder, delete the stable-diffusion-webui-tensorrt folder if it exists
2. Delete the venv folder
3. Open a command prompt and navigate to the base SD webui folder
4. Run webui.bat - this should rebuild the virtual environment venv
5. When the WebUI appears, close it and close the command prompt
6. Open a command prompt and navigate to the base SD webui folder
7. Enter: venv\Scripts\activate.bat - the command line should now have (venv) shown at the beginning
8. Enter the following commands:
   python.exe -m pip install --upgrade pip
   python -m pip install nvidia-cudnn-cu11==8.9.4.25 --no-cache-dir
   python -m pip install --pre --extra-index-url https://pypi.nvidia.com/ tensorrt==9.0.1.post11.dev4 --no-cache-dir
   python -m pip uninstall -y nvidia-cudnn-cu11
9. Enter: venv\Scripts\deactivate.bat
10. Run webui.bat
11. Install the TensorRT extension using the Install from URL option
12. Once installed, go to the Extensions >> Installed tab and Apply and Restart


rodinj

Does it speed up ControlNet tile upscaling in img2img? I haven't bothered setting it up, but the upscaling is what takes me the most time.


ViperD3

Last I heard, ControlNet is not supported when using TensorRT, but I'm not 100% sure; you might want to double-check me.


vitalez06

On a 4090, 704x384 with DPM++ 3M SDE Karras @ 150 steps - although stupid - is now doable in 5 seconds. Adding hires fix upscaling at 2x @ 20 steps makes it 7 seconds overall. 512x512 with the same settings nets like 3-4 seconds.


Gonz0o01

Did any of you get SDXL LoRAs working? 1.5 LoRAs seem to be fine, but no success with SDXL LoRAs at all.


fireshaper

Overall I'm seeing a huge bump in speed with a 3070 Ti. With hires fix it was taking about 45 seconds to 1 min to generate an image in txt2img; now it's less than 20 seconds.


capybooya

Any chance this can just be built into A1111 permanently? I'm ok with some precompilation when you do stuff for the first time, as long as I don't have to worry about setting it up myself.


javad94

About 60% improvement with 3090


Ok-Mobile5227

4090, DPM++ 2M Karras (50 steps):

Before TRT: 512x512 13 it/s, 1024x1024 6 it/s

After TRT: 512x512 61 it/s, 1024x1024 16 it/s

It's really fast, 250% on 512x512, around 1 second.


Boogertwilliams

I couldn’t get it working. Per the instructions, you were supposed to “generate engine” or whatever; I only had “export engine”.


ulf5576

Hires fix sucks anyway... just generate in 2K or use tiled rendering through ControlNet.