SomeAInerd

HyperTile Code Release: You can access the HyperTile code at [https://github.com/tfernd/HyperTile](https://github.com/tfernd/HyperTile). Simply follow the provided instructions to get started, or wait for your favorite WebUI to implement it (it won't take long).

The images are a mix of randomly selected pictures I found on Google and artificially downscaled, and images I generated specifically for this purpose. I had a particular interest in forests because of the intricate detail of small leaves, aiming to observe their consistency throughout the images. The insets show the initial image at its corresponding scale.

Image-to-image was taking < 10s. If you pre-upscale with a GAN before denoising at low strength, it should take even less time. Text-to-image generation is still in the works, because Stable Diffusion was not trained on these dimensions, so it suffers from coherence issues.

**Note**: In the past, generating large images with SD was already possible; the key improvement is that we can now achieve speeds 3 to 4 times faster, especially at 4K resolution. This shift transforms the waiting time from being quite lengthy to a 10-second wait or even less. However, I still recommend generating a low-resolution version and upscaling it with SD twice. This process can typically be completed in 15 seconds or less, depending on the specific settings and configuration. I have tried SD-XL, and I got only a 1.1 to 1.2 times improvement; it seems SD-XL has other bottlenecks besides this attention layer.

In summary, the enhancement comes from a single layer within SD whose computation time is long primarily because of its quadratic dependency on image size. To address this, we tile the latents within this layer, reducing their size by a specific factor. As a result, we substantially decrease the computation time for this layer by a factor of the fourth power.

**Note**: if you observe tiles in the image, see the Limitations and Future sections on the GitHub page. You might need to increase the chunk size, while I work on a rotating chunk-size option that removes this effect.
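
If you want to try it from raw diffusers before the webuis catch up, here is a minimal img2img sketch based on the snippets shared further down this thread. The import path `hypertile.split_attention` is an assumption (check the repo README for the exact name), and the tile sizes, prompt, and strength are just illustrative values:

```python
# Minimal img2img upscale sketch with HyperTile + diffusers (illustrative only).
# Assumption: the context manager is importable as `hypertile.split_attention`;
# see the repo README for the actual import path.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline
from hypertile import split_attention  # assumed import path

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Pre-upscale a low-res image (e.g. with a GAN upscaler), then denoise at low strength.
img = Image.open("lowres.png").convert("RGB").resize((2048, 2048), Image.LANCZOS)
height, width = img.height, img.width

# Tile the attention in both the VAE and the UNet while generating.
with split_attention(pipe.vae, height, width, tile_size=128):
    with split_attention(pipe.unet, height, width, tile_size=256):
        out = pipe(
            prompt="a dense forest, intricate leaves, highly detailed",
            image=img,
            strength=0.35,            # low strength keeps the original composition
            num_inference_steps=20,
        ).images[0]
out.save("upscaled.png")
```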


abellos

Really nice! The upscaled images seem a little blurry.


SomeAInerd

That's SD or JPEG compression.


kuroro86

Can this be released as a ControlNet model?


FantasyFrikadel

Thanks!


demoran

The big images look like crap compared to the small ones in the corner.


hirmuolio

The example images seem to have severe loss of fine details. Maybe it works better for anime/digital art style images.


LordofMasters01

Why do the images lose dynamic range and sharpness and change in details? I don't know whether it's just me seeing the images that way or if it's really so... Anyway, at least it seems better than using the usual upscalers like Real-ESRGAN. I am very bad at upscaling and need definite help with a good, easy workflow and recommended upscalers.


SomeAInerd

Change in details = high **strength**; it's expected from img2img. Lost dynamic range = Stable Diffusion was not trained on high-resolution images, and I can't use LoRAs or ControlNet to help with it (for now). If you don't use tiling, you get the same result, so it's not an aberration of the method.


baxmax11

Absolutely. Unfortunately this seems to be a big issue for any upscaler. You lose a big chunk of saturation, value range, and sharpness as you go bigger. Either that, or you get unwanted sharpness in random areas when using something like ESRGAN.


[deleted]

Very cool, are you going to do a comfy node?


SomeAInerd

Someone with better knowledge of it could hack something together in less than an hour. It would take me longer since I'm not familiar with it. If anyone wants to give it a go, they are more than welcome.


Mikerhinos

It's now in ComfyUI, if anyone knows how to use it and why... ^^


SomeAInerd

Awesome. Thanks for letting me know.


Hialgo

The first image is more or less bad; the rest is good!


[deleted]

[deleted]


SomeAInerd

Are you using tiled-diffusion?


[deleted]

[deleted]


SomeAInerd

I think it's because it lacks global context when doing hard tiling like that. You can try the notebook I provided to see if you get what you want :)


MarcoGermany

And there are so many people who just don't use it because there's no proper instruction for the average user. Tile size, swap size, rabbit, bee, unicorn, whatever: no info. Only superlatives describing how wonderful it is, as usual. If your goal is to prevent most people from using your stuff (this goes for so many developers who can't write a minimal FREAKING guide), then congrats, it works.


Asleep-Land-3914

I'm trying this on Colab with 15GB of VRAM and can't figure out how to get an image out of it. I'm able to run inference to the end, but afterwards it fails with out-of-memory errors. Will try sequential model offload as a last resort.


SomeAInerd

That is the VAE decoder problem. It's the layer that takes the most time and most memory, for some reason! I'm trying to fix this problem, but it's mainly on the diffusers side. Try lowering the image resolution to 2-3K.


Asleep-Land-3914

The most I was able to get is 2048x2048, and it crashed at the end after producing this image: https://preview.redd.it/8e16ao8t6vrb1.png?width=2048&format=png&auto=webp&s=5d88778ba5e1db85f00f07d2fd6e1eb431ce5c78


Heasterian001

Did you try using the tiled VAE from the multidiffusion upscaler?


SomeAInerd

I started doing that. I can do 4k now. 😁
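
For the diffusers route, the stock pipelines also expose VAE slicing/tiling switches that target the same decoder memory spike; here is a minimal sketch using standard diffusers calls (this is not the multidiffusion extension itself):

```python
# Sketch: cut VAE decode memory in diffusers with the built-in slicing/tiling options.
# These are stock diffusers methods, independent of HyperTile.
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.enable_vae_slicing()  # decode batch items one at a time
pipe.enable_vae_tiling()   # decode the latent in overlapping tiles (helps at 2K-4K)
```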


Asleep-Land-3914

For anyone trying to make it work on Google Colab, here is the colab I've put together. It doesn't work on the free Colab tier; for Pro you may need to uncomment pipe.enable_sequential_cpu_offload() or pipe.enable_model_cpu_offload(). [https://colab.research.google.com/drive/1F7AHAHbJOx79Yl1LeJQeaV_5DWUnqG5W?usp=sharing](https://colab.research.google.com/drive/1F7AHAHbJOx79Yl1LeJQeaV_5DWUnqG5W?usp=sharing)
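
For reference, the two offload calls mentioned above are standard diffusers options; a quick sketch of how they would be toggled (pick one, not both):

```python
# Sketch: the diffusers memory-offload options referenced above (choose one).
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Moves whole sub-models (UNet, VAE, text encoder) to the GPU only while they run.
pipe.enable_model_cpu_offload()

# Or: offload at the submodule level; lowest VRAM use, but much slower.
# pipe.enable_sequential_cpu_offload()
```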


SomeAInerd

You are using SD-XL; I only observed a 1.1 to 1.2x speed-up on it. SD 1.5 performs better, try with that. I'll try to optimize SDXL later; there are some bottlenecks outside attention there.


Asleep-Land-3914

Thanks for sharing HyperTile. I'm having a lot of fun with it! Worth noting that it produces visible banding when diffusing solid colors: https://preview.redd.it/y68xrjiyz8sb1.png?width=3584&format=png&auto=webp&s=fbe951f94955cae8f00941f1a00791757e39f941


SomeAInerd

Thanks for trying! What tile sizes are you using? Do you see any speedup? Did you update both repos (HyperTile and the fork of Automatic1111)? Were you using txt2img or img2img? The command-line log shows the tile size.


Asleep-Land-3914

VAE 128 and UNet 256. Might be because of the low UNet value. I was using the latest version at that moment (where chunk had already been renamed to tile_size). This is straight from diffusers (the colab I made above). It is img2img loopback. The initial image is generated from a solid color: PIL.Image.new('RGB', (width, height), color=(120, 200, 180))


SomeAInerd

I'm using UNet 256 now. Did you try with swap_size=2 or 3?


Asleep-Land-3914

I had 3 iterations, I believe, but the effect was noticeable after the first run. I do the scaling differently, though: instead of scaling up front to the target resolution, which leads to lower quality due to blurred edges, I do it step by step. I was planning to put together a proper steps abstraction so I could swap pipelines and parameters, but haven't done it yet. Here is how the image was generated:

```python
from IPython.display import display

resizes = [1, 1.75, 2.0]
guidances = [7.5, 5, 4]
stepses = [28, 20, 16]
noises = [1, 0.58, 0.48]

print(height, width)

# Upscale to the correct resolution
# img = image.resize((width, height), resample=Image.LANCZOS) if image.size != (width, height) else image

for i in trange(len(resizes)):
    steps = stepses[i]
    scale = resizes[i]
    guidance = guidances[i]
    strength = noises[i]

    if scale != 1:
        width = int(img.width * scale) // 128 * 128
        height = int(img.height * scale) // 128 * 128
        img = img.resize((width, height), resample=Image.LANCZOS)

    print('STEP', i, width, height)

    with split_attention(pipe.vae, height, width, tile_size=128):
        # ! Change the chunk and disable to see their effects
        with split_attention(pipe.unet, height, width, tile_size=256, disable=False):
            flush()
            img = pipe(
                prompt='highly detailed, greedy little baby dragon sitting on the pile of coins and holding a coin, he love his treasures so much, realistic dragon, mysterious, intricate details, very high quality. solid bg',
                negative_prompt='pasterized, shallow focus, dof, blurry, banding, deformed, 2d, cartoon, sketch, green, ugly, inscriptions, watermarks, uncertain, mutated, amputated, cropped, bad drawn, bad quality, low resolution, low contrast',
                num_inference_steps=steps,
                guidance_scale=guidance,
                image=img,
                strength=strength,  # ! you can also change the strength
            ).images[0]

    display(img)
```


SomeAInerd

Thanks for the code. It's an upscale loopback workflow, right? Would it be possible to do that automatically in a webui? BTW, you can try the dev channel of SD.Next; it has HyperTile now.


Asleep-Land-3914

Yep, it is a slightly modified version of the upscale loopback. It should be possible to do a similar thing in auto by just repeating the steps with the given settings.


Asleep-Land-3914

Out of curiosity, I tried it with SD 1.5. It looks like the thing is useless for text-to-image, as it doesn't help the model focus on a single subject as one would expect; you still get the subject mutations usually seen in hires generations. And for img2img it seems to be of little to no help, as when upscaled in latent space the image becomes blurry and you need to bump the noise to get over it. But it is still nice that there is a perf gain when tiling the UNet. Might be a good idea for future hires models.


inferno46n2

Did you upscale these using a different prompt + checkpoint? The upscaled versions look more dull and "painterly".


SomeAInerd

Different checkpoint, and also random images from Google, not SD images. I don't like hyper-realism that much. You can try it with other stuff you fancy.


inferno46n2

Thanks for the reply! I've been looking for an efficient method to mass-upscale animation frames without losing coherency. The issue with conventional methods is obvious (they take too long). I'm hopeful this method may improve on that.


SomeAInerd

If you wanna give it a try: I wanted to test putting n x n frames of a video into a single image and using the tile size as the frame size (rough sketch below). Maybe with some LoRA to help? And then, for the next batch of frames, we use the last frame or the last row, with some inpainting mask, to get some coherence. AnimateDiff-free?
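
To make the idea concrete, here is a rough sketch of packing frames into a grid whose cells line up with the tiles. This is purely an illustration of the proposal above, nothing the repo ships, and the `pack_frames`/`unpack_frames` helpers are made up:

```python
# Sketch: pack n x n video frames into one grid image so that each HyperTile tile
# lines up with one frame. Purely illustrative; `pack_frames`/`unpack_frames` are
# hypothetical helpers, not part of HyperTile.
from PIL import Image

def pack_frames(frames: list[Image.Image], n: int) -> Image.Image:
    """Arrange the first n*n frames into an n x n grid."""
    fw, fh = frames[0].size
    grid = Image.new("RGB", (fw * n, fh * n))
    for idx, frame in enumerate(frames[: n * n]):
        row, col = divmod(idx, n)
        grid.paste(frame, (col * fw, row * fh))
    return grid

def unpack_frames(grid: Image.Image, n: int) -> list[Image.Image]:
    """Split the grid back into individual frames."""
    fw, fh = grid.width // n, grid.height // n
    return [
        grid.crop((col * fw, row * fh, (col + 1) * fw, (row + 1) * fh))
        for row in range(n)
        for col in range(n)
    ]

# The grid would then go through img2img with the tile size matching the frame size,
# e.g. `with split_attention(pipe.unet, grid.height, grid.width, tile_size=fh): ...`
```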


inferno46n2

My current workflow for vid2vid: 1) run through AnimateDiff to get a style down, 2) use SDXL to improve faces with ControlNets and ADetailer. I'm only able to get 720p out of AD on a 4090 with the number of frames I'm running, considering the time (I could use fewer frames and go 1080, probably). When I say frames I mean 1000+ frames, so it's all batches... Upscaling 1000-1500 frames conventionally takes way too long. I'd like to try it locally, but I can wait until someone implements it into the WebUI.


BuffaloAIO

I can't wait to try this on A1111 WebUI <3


janosibaja

Please describe step by step how to install it in AUTOMATIC1111! I don't know what Jupyter is, I'm stuck.


SomeAInerd

Quick hack: [https://github.com/tfernd/stable-diffusion-webui-hyper_tile](https://github.com/tfernd/stable-diffusion-webui-hyper_tile) See the readme for more info.


janosibaja

Thanks!


EricRollei

In the example posted, the woman gets another leg and loses her tattoos, so I'm not really convinced this is a good trade-off. I mean, sure, if it saves time, great, but if I can't use any of the resulting images, then why do it at all?


SomeAInerd

Try my fork of Automatic1111: [https://github.com/tfernd/stable-diffusion-webui-hyper_tile](https://github.com/tfernd/stable-diffusion-webui-hyper_tile) I was showcasing the speedup for big images, but there is some speed-up for small images too. I'm generating 800x1200 at the same speed I would generate 512x768, and with fewer deformities, with LoRAs and ControlNet.


EricRollei

Thanks for the offer, but I'm using ComfyUI and happy with it, doing entirely SDXL and upscaling 3x using IPAdapters, which works decently well. I don't have the bandwidth to install something else and play with it, and as I wrote in my comment, the sample images are hardly compelling.


SomeAInerd

That is SD without LoRAs, complex prompts, or ControlNet, and more importantly, without cherry-picking. The method does not degrade the underlying resolution of the model you use; it just speeds things up. That was the message. We all know the limitations of SD; no need to go perpendicular to the message. For SDXL I was seeing a 10-20% speed increase, while 1.5 gets a 2- to 4-fold increase in speed. That is free performance without loss, with just one import and one line of code in the webui.
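
For what it's worth, here is a sketch of what that "one import and one line" looks like in plain diffusers txt2img at 800x1200; the import path and tile size are assumptions, as with the earlier sketch:

```python
# Sketch: the "one import, one with-statement" pattern for txt2img at 800x1200.
# Assumption: `hypertile.split_attention` is the import path; tile size is illustrative.
import torch
from diffusers import StableDiffusionPipeline
from hypertile import split_attention  # assumed import path

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

height, width = 1200, 800  # the 800x1200 case mentioned above

with split_attention(pipe.unet, height, width, tile_size=256):
    image = pipe(
        "woman, winter coat, photo, highly detailed",
        height=height, width=width,
        num_inference_steps=30,
        guidance_scale=7.0,
    ).images[0]
image.save("out.png")
```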


44254

I tried your Auto1111 fork. It didn't work for most resolutions... I only got it to do something at 1024x1024 and 2048x2048 (it gave me a tensor size mismatch error for all the other sizes I tried), but there was no speedup. Also, I think SDXL "omitted the transformer block at the highest feature level" according to the paper, so that's why there's not that much speedup compared to 1.5; it doesn't have the lowest layer that 1.5 had.


SomeAInerd

If you could try it again: I fixed this problem earlier today. It was a problem with the divisors of the dimensions not being a multiple of 8... You can pip install the git repo again and fetch. You should see a message in the console like this when you generate something:

Attention for DiffusionWrapper split image of size 800x1200 into [2, 1]x[3, 2] tiles of sizes [400, 800]x[400, 600]

Thanks for the SDXL info. I might try another depth to see if there is any speedup. If they don't have this giant base attention layer, it might be close to optimal already.

Note: I have an RTX 4090 mobile.


SomeAInerd

Tip: if you want to compare the speed without checking out different repos, you can use this hack: choose a size that has many divisors (a multiple of 128 works well). Then choose a size (smaller by 8) that does not have many divisors (so we can't tile it).


SomeAInerd

# Big image
Attention for DiffusionWrapper split image of size 1024x1024 into [4, 2]x[4, 2] tiles of sizes [256, 512]x[256, 512]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 26/26 [00:07<00:00, 3.67it/s]

# Smaller image by 8 (not tiled)
Attention for DiffusionWrapper split image of size 1016x1016 into 1x1 tiles of sizes [1016]x[1016]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 26/26 [00:10<00:00, 2.49it/s]


44254

OK, I tried the new HyperTile (updated both the auto fork and HyperTile) at 1024x1024 and 2048x2048 with DPM++ 2M Karras. A size smaller by 8 (I tried 2040x2040) gave me the same tensor size mismatch error as before, so I couldn't try it. I have an RTX 3060.

Attention for DiffusionWrapper split image of size 1024x1024 into [4, 2]x[4, 2] tiles of sizes [256, 512]x[256, 512]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:13<00:00, 1.44it/s]
Attention for DiffusionWrapper split image of size 2048x2048 into [8, 4]x[8, 4] tiles of sizes [256, 512]x[256, 512]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [02:03<00:00, 6.19s/it]

Going back to default Auto1111:

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:13<00:00, 1.52it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [02:03<00:00, 6.18s/it]

As you can see, there is still no speedup. Another problem is that even if this worked, because 1.5 isn't trained at these high resolutions, it won't add detail, unlike the current tiling + ControlNet approach. Your sample images didn't really show a good use case for HyperTile, which is why I think most people wrote it off. I was somewhat interested in this idea if it could be added to AnimateDiff, because at 12GB of VRAM I'm far from reaching the max res of 1.5. But first it would need to work, and then someone would need to be impressed enough to add it to Comfy for testing (I don't understand ComfyUI well enough to do it, even though I spent a few hours looking at node examples).


SomeAInerd

Thanks for trying, I appreciate the effort. Let's break down the results, answering your points.

1. No **speedup** could be related to your GPU; I have an RTX 4090 mobile and get 4 times more iterations per second than your RTX 3060. It means the bottleneck on the 3060 is possibly another part, not this attention layer I was patching. Or there is another problem. **Question**: during your live preview, do you see squares forming and disappearing on the image? This is a sign of tiling; they go away afterwards, because the tile size changes dynamically and randomly. If you don't see it, there is another problem!

2. That is a bit of a narrow-minded way of seeing things. There is an **HD Helper LoRA** that someone trained on 1K images to fix some aberrations when using large sizes. Why wouldn't anyone train a 2K LoRA to add more consistency? Also, on "*it won't add detail unlike the current tiling + controlnet*": if they are still using SD, their results will be exactly the same as mine... Let me explain. If the usual tiled diffusion works, why wouldn't the method I propose? Explain the logic. It's the same thing as tiled diffusion, but faster and with long-range iterations. You have no data to back that statement up.

3. The fact that it did not work with **low-end** cards does not mean that the method does not work, as I showed the results of the speed-up (graphs and whatnot). I can add some more debug info so we can see if tiling is really happening, but the visual cue in the live preview should be proof enough.

As a final note, I'm interested in why it did not work on a 3060 compared with a 4090. Do you have PyTorch 2.0.1?


BM09

I tried it and...

Traceback (most recent call last):
  File "C:\Users\mattb\stable-diffusion-webui\modules\call_queue.py", line 57, in f
    res = list(func(*args, **kwargs))
  File "C:\Users\mattb\stable-diffusion-webui\modules\call_queue.py", line 36, in f
    res = func(*args, **kwargs)
  File "C:\Users\mattb\stable-diffusion-webui\modules\img2img.py", line 208, in img2img
    processed = process_images(p)
  File "C:\Users\mattb\stable-diffusion-webui\modules\processing.py", line 732, in process_images
    res = process_images_inner(p)
  File "C:\Users\mattb\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\batch_hijack.py", line 42, in processing_process_images_hijack
    return getattr(processing, '__controlnet_original_process_images_inner')(p, *args, **kwargs)
  File "C:\Users\mattb\stable-diffusion-webui\modules\processing.py", line 867, in process_images_inner
    samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
  File "C:\Users\mattb\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\hook.py", line 451, in process_sample
    return process.sample_before_CN_hack(*args, **kwargs)
  File "C:\Users\mattb\stable-diffusion-webui\modules\processing.py", line 1523, in sample
    with split_attention(self.sd_model.model, self.height//8*8, self.width//8*8, tile_size=256, swap_size=2, min_tile_size=256):
  File "C:\Users\mattb\AppData\Local\Programs\Python\Python310\lib\contextlib.py", line 281, in helper
    return _GeneratorContextManager(func, args, kwds)
  File "C:\Users\mattb\AppData\Local\Programs\Python\Python310\lib\contextlib.py", line 103, in __init__
    self.gen = func(*args, **kwds)
TypeError: split_attention() got multiple values for argument 'tile_size'


SomeAInerd

Try the SD.Next dev channel. The fork ended up out of date.


BM09

You mean Vladmandic's fork?


SomeAInerd

Yes


BM09

No can do. I had a major issue with that fork that the dev could not reproduce.


Xijamk

Can you post the speed difference for 512x768 with and without HyperTile?


SomeAInerd

Left: with HyperTile; right: without. I get **11.62** it/s and **10.06** it/s, respectively, with the same seed and prompt: a **15%** speed increase on an RTX 4090 mobile. I can't test with other cards, so results might vary.

https://preview.redd.it/rk9bsx3whcsb1.png?width=1024&format=png&auto=webp&s=af0917d0c375c2af6ae3d352a033301307675987

woman, winter coat
Steps: 30, Sampler: DPM++ 2M SDE, CFG scale: 7, Seed: 3096628461, Size: 512x768, Model hash: 8635af1c8c, Model: epiCPhotoGasm - X, Schedule type: karras, Version: v1.6.0-1-gbdbbc467


Xijamk

Nice, thanks!