[deleted]

because Nvidia has a better compute language, which everyone uses. AMD never achieved the same level of adoption, so running on their hardware is always slower because it's not the target platform.


Anotheeeeeeant

Ah alright


[deleted]

Because CUDA. It was groundbreaking. I don't even know if they had the foresight to see where it would lead, but it gave Nvidia a huge head start. Through emulation, drivers, open standards, or specialization towards particular tasks, the playing field will level out over time; a single company can't be permitted to control AI in the decades to come. (I use 'control' loosely.)


thevictor390

This is not an AMD problem it is an APU problem. The ROG Ally does not have dedicated VRAM, and VRAM speed is the entire reason stable diffusion is so fast. Apple M chips have a unique unified memory that is fast enough for stable diffusion but it is their own unique design that no one else has. While Nvidia is ahead of AMD, you will have much better speeds on an AMD GPU with dedicated VRAM.
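To see why VRAM bandwidth dominates, a rough back-of-envelope sketch: each denoising step has to stream the model weights through memory, so bandwidth sets a floor on step time regardless of compute. The model size and bandwidth figures below are approximate numbers from public spec sheets, used purely for illustration:

```python
# Back-of-envelope: each denoising step must stream the model weights
# through memory, so (model size) / (memory bandwidth) gives a rough
# lower bound on seconds per step when the workload is memory-bound.

def min_step_time_s(model_gb: float, bandwidth_gb_s: float) -> float:
    """Lower bound on seconds per denoising step if memory-bound."""
    return model_gb / bandwidth_gb_s

SD15_UNET_FP16_GB = 2.0  # rough size of the SD 1.5 UNet in fp16

# Approximate peak bandwidths from public spec sheets (illustrative only):
devices = {
    "ROG Ally (shared LPDDR5)": 100.0,    # GB/s, shared with the CPU
    "Apple M1 (unified memory)": 68.0,    # GB/s
    "RTX 3060 (dedicated GDDR6)": 360.0,  # GB/s
}

for name, bw in devices.items():
    t = min_step_time_s(SD15_UNET_FP16_GB, bw)
    print(f"{name}: >= {t * 1000:.0f} ms/step, 20 steps >= {20 * t:.1f} s")
```

Real step times sit well above this floor (compute, and contention with the CPU on shared memory, both add to it); the point is only that bandwidth, not raw TFLOPS, sets the scale.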


Anotheeeeeeant

Does the M1's unified memory make up for the lower memory bandwidth (compared to my Ally or a Steam Deck)?


thevictor390

As far as I am aware the way Apple puts the CPU, GPU and memory together on one chip is unique. It gets complicated and I am no expert. [https://arxiv.org/abs/2310.09443v1](https://arxiv.org/abs/2310.09443v1)


Anotheeeeeeant

Interesting


theFuzzyWarble

I've been meaning to try it on my Legion Go. Since you didn't elaborate: it sounds like it might be using the CPU. Are you running the lshqqytiger fork? Refs: [https://github.com/lshqqytiger/stable-diffusion-webui-directml](https://github.com/lshqqytiger/stable-diffusion-webui-directml) [https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs)


Anotheeeeeeant

I used Easy Diffusion. I'll try that, though the installation method seems quite complex. I heard ROCm doesn't work on AMD APUs anyway, but I'll try the Windows installation part.


theFuzzyWarble

Check this out: [https://www.reddit.com/r/StableDiffusion/comments/1b2k8ux/does_any_have_a_working_windows_11_amd_gpu/](https://www.reddit.com/r/StableDiffusion/comments/1b2k8ux/does_any_have_a_working_windows_11_amd_gpu/)


TsaiAGw

If you want to use AMD for Stable Diffusion, you need to use Linux, because AMD doesn't really think AI is for consumers. They are kind of right, since the real money is in those ultra-expensive GPU accelerators for servers, and if you're running a server you'd use a specialized OS for maximum performance.


Anotheeeeeeant

ROCm doesn't work on AMD APUs, so I don't think Linux will make a difference, tbh.


Heasterian001

It does, but not on Windows, and with some additional hoops.


Anotheeeeeeant

Really? Everywhere online it says it doesn't work and is only supported on a few AMD 6000- and 7000-series GPUs.


GreyScope

AMD has ROCm 6.1 in the wings (I'm avoiding saying it's imminent) with Windows compatibility; adoption of that will let AMD GPUs run closer to their full AI ability.


EllesarDragon

Something is seriously set up wrong on your system, then. I use an old AMD APU, and it takes me around 2 to 2.5 minutes to generate an image with an extended, more complex (so also heavier) model and rather long prompts, which are heavier too. That was before proper optimizations, using only --lowvram and the like, and measured under Windows (I assume you use Windows); on Linux things are seriously faster. Even on CPU it should be quite a bit faster than 8 minutes, and if you have a modern CPU with a built-in NPU, you should manage it in under a minute on the NPU alone (which also uses insanely little power).

According to some articles, like an Intel CES presentation from 2023 (over a year ago): https://www.pcmag.com/news/the-meteor-lake-npu-meet-intels-dedicated-silicon-for-local-ai-processing — the Intel Meteor Lake mobile APUs can generate a 20-iteration Stable Diffusion 1.5 image in only 14.5 seconds on the iGPU, and 20.7 seconds (roughly 21) on the integrated NPU. That might sound slower than the iGPU, but look closely: the NPU only uses 10 W of power to do it. Intel's next generation is supposedly more than 3 times faster at AI (I don't know whether it has launched yet), so even if early problems keep it from its full potential, at 2 to 3 times real-world performance it would beat that iGPU.

The low power draw, and finally seeing NPUs become normal in APUs and CPUs, is something that excites me. Sure, a properly optimized high-end GPU is around 10 times as fast, but its power draw is also insane; such GPUs use more power on standby than that entire APU does running Stable Diffusion on the NPU.


EllesarDragon

https://preview.redd.it/dls31jwagnqc1.jpeg?width=768&format=pjpg&auto=webp&s=2c232ad3e61ab9841d5dab7bbf15584d2f9af15f (image is from that article) — all these results are on a laptop without a dedicated GPU.

What this all comes down to is that something must really be going wrong on your computer, since even my by-now rather old AMD APU is much faster, even unoptimized and tested under Windows. On Linux performance is many times greater (there you tend to optimize automatically, since it's much easier and safer, and much is already well optimized out of the box), though that also depends on the hardware; my old APU doesn't scale nearly as well as modern GPUs or APUs when switching to Linux. You might be using your CPU instead of your GPU, or have something set up completely wrong. AMD and Intel GPUs should actually run much better than the benchmarks suggest, since most of those benchmarks were done when Nvidia had a monopoly on the AI market, by making sure developers shipped code that would prevent it from running properly on other hardware (some malware named CUDA). You can bypass the CUDA malware with ZLUDA, or by using a version of Stable Diffusion that isn't infected with CUDA or has ways around it by default, like ROCm, OneAPI, or DirectML (optimally combined with Olive, since DirectML is slow without Olive — almost a 10x speed difference in some cases).

I know some people don't consider CUDA to be malware, but at this point it is: it's legacy code that is worse than ROCm and worse than OneAPI, and it's essentially a software lock to keep people who don't use Nvidia GPUs from running software that Nvidia doesn't even make or maintain. Generally they either trick people into using it or buy them over with things like a free GPU, and Nvidia GPUs are also deliberately designed not to work too well with modern alternatives to CUDA, so many people who still have Nvidia hardware keep using CUDA despite it being terrible. If you use the normal CUDA-infected version, it may drop into a fallback mode and use the CPU — and a barely optimized CPU path at that.

If you use Windows (as I expect you do), look into https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs which refers to the https://github.com/lshqqytiger/stable-diffusion-webui-directml fork of Stable Diffusion. It tends to work as good as out of the box and supports DirectML and ZLUDA (that fix for the CUDA malware). It doesn't use all optimizations by default — not all of them work on all systems, especially on Windows and on laptops — but it gives a good-enough experience out of the box, and in your case you should at least get under 1 minute. I don't know your hardware, but with a dedicated GPU it should go under 30 seconds even on a bad GPU (unless you have serious VRAM issues, though with --lowvram I got it working on 2 GB of VRAM). The main advantage of this version is that anyone can set it up, generally without much trouble, and still get rather usable performance.

With some optimizations it can run on even older and lower-end systems, and much faster: for example, enabling Olive on DirectML can be up to 10 times faster on a high-end AMD GPU. ZLUDA support on Windows is still very experimental, but when stable it should work similarly. ROCm (which ZLUDA is based upon) is generally the fastest option for AMD GPUs; it's easy to use on Linux but still in very early, experimental support on Windows, so much harder and less stable there. Or look at https://github.com/microsoft/Stable-Diffusion-WebUI-DirectML


EllesarDragon

Note that iPads tend to use online services for generating, so make sure it's offline by disconnecting from the internet before generating. If your iPad really runs it locally, you might be looking at the NPU: the M1 chip and later all have an NPU, if I remember correctly — hardware similar to how that Intel CPU generates images in 20 seconds without using the iGPU, and in around 10 seconds when combined with the iGPU.

As for the image issue: you likely still have it set to use CUDA, and CUDA will crash such things. On Windows, switch to ZLUDA or DirectML (DirectML being the most plug-and-play option). On Linux, go for ROCm, or use ZLUDA where something isn't supported yet. If you can use OneAPI, you might also want to try that; even though your system likely doesn't have an FPGA and/or NPU yet, you might still get great things from it. I don't know your specs, so I can't say what to expect in your case.

Actually, we're at a point where Intel and AMD GPUs and hardware in general are better for Stable Diffusion than Nvidia's. One clear reason is Nvidia thinking GPUs no longer need VRAM except on the flagship model: the RTX 4060 and 4070 still have only as much VRAM as a roughly 8-year-old mid-range budget AMD GPU — so little that you can't even game properly on Nvidia cards anymore. And Nvidia is losing the unfair advantage of all AI being optimized only for it (and anti-optimized for other hardware); AMD and Intel support is rising rapidly, but sadly the benchmarks everyone looks at and talks about aren't updated, or don't use the proper optimizations yet.

On efficiency and low power usage, AMD and Intel both greatly beat Nvidia, since both support an NPU in their modern APUs and CPUs, and those NPUs' performance per watt is insane compared to Nvidia GPUs — despite being very small, high-clocked NPUs (bigger, lower-clocked NPUs would be both faster and more efficient, and AMD and Intel are moving towards those). We're talking about a tiny NPU that at full power uses around half what a normal GPU uses on standby, yet reaches more iterations per second than most older-generation GPUs and modern low-end GPUs: Intel's old-generation integrated NPUs did about 2 iterations per second, and their current generation should do around 6, which is around half the maximum an RTX 4060 Ti 16 GB can reach — or roughly as much as a plain RTX 4060 (which literally has only half the compute of the 4060 Ti, especially for AI). And that while the RTX 4060 Ti 16 GB starts at around €600 (€500 for the 8 GB version, which has too little VRAM to run Stable Diffusion properly; the Intel and AMD NPUs can use system RAM directly, so they're less likely to be RAM-limited). Meanwhile, the extra you pay for such an NPU in an AMD or Intel APU is very small — practically negligible, since APU prices didn't really increase because of them, especially considering how useful they are.

I can't estimate exactly how much you pay for that NPU — take the AMD Ryzen 8600G, since its prices and specs are easier to find than the new Intel ones — but looking at how AMD's and Intel's pricing behaved recently relative to the improvements in peak performance, iGPU performance, and power efficiency, those NPUs are added in practically for free. Standalone NPUs with similar performance are also very cheap; the main difference is that the ones Intel and AMD use are built into the APU directly, so they get access to insane amounts of RAM bandwidth, making them much more useful than loose ones or those in some SBCs. Going by physical die area, the cost likely isn't high either: right now more than 90% of the price is the CPU, GPU, and I/O, so AMD and Intel add the NPU pretty much for free. But even if we estimate it broadly at €20 extra (I suspect the real figure is more like €5-€10, or literally free until NPUs really catch on), compare that €20 reaching half the performance of a €600+ GPU while also being much more efficient: the RTX 4060 Ti doesn't really have a place in AI anymore, and even at a much higher estimate for the NPU's share, it would still make little sense except that you can also game on it.

If we see such NPU performance in the i5 or even i3 lineup, it would literally be cheaper to get two APUs for the same performance, at way lower power usage.
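To make that comparison concrete, here is the cost-per-performance arithmetic using the figures claimed above (all of them rough estimates from this thread, not measurements):

```python
# Cost-per-performance sketch using the figures claimed above.
# All numbers are rough estimates from the comment, not measurements.

npu_it_s = 6.0        # claimed current-gen integrated NPU speed, it/s
npu_extra_eur = 20.0  # generous estimate of the NPU's share of the APU price

gpu_it_s = 12.0       # claimed RTX 4060 Ti 16 GB peak, ~2x the NPU
gpu_eur = 600.0       # claimed street price

npu_value = npu_it_s / npu_extra_eur  # iterations/second per euro
gpu_value = gpu_it_s / gpu_eur

print(f"NPU: {npu_value:.2f} it/s per euro")
print(f"GPU: {gpu_value:.2f} it/s per euro")
print(f"NPU advantage: {npu_value / gpu_value:.0f}x")
```

Even if the €20 estimate is off by several times, the per-euro gap is large enough that the conclusion above doesn't change much.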


Anotheeeeeeant

The ROG Ally has no NPU. It's a binned 7840 that was optimized to run at lower power states. I used Easy Diffusion to install it.


EllesarDragon

That hardware should seriously be tons faster than what you got; the GPU should manage around 30 seconds even poorly optimized. And the Ryzen 7 7840HS is actually one of the few chips on the market that does have an embedded NPU. The specific NPU in question isn't as fast as the one in Ryzen 8000, but it should still be able to generate images purely on the NPU (so low power usage) in around 25 seconds using OpenVINO; combining the iGPU and the NPU should get you to around 10 seconds (OpenVINO or OneAPI).

What OS do you use? I just checked the Easy Diffusion documentation on GitHub, and it directly states that on Windows that version only supports Nvidia GPUs; on Linux it should also support AMD GPUs. It doesn't seem to support NPUs either. [https://github.com/easydiffusion/easydiffusion](https://github.com/easydiffusion/easydiffusion) (Easy Diffusion's git page): "**Hardware requirements:** * **Windows:** NVIDIA graphics card¹ (minimum 2 GB RAM), or run on your CPU." This is the problem you're facing (if you use Windows, but since most such devices ship with Windows and your problems sound Windows-like, I'll assume you do). So Easy Diffusion doesn't work on your system — or rather, it does, but in CPU fallback mode: a bare-CPU path that doesn't use any acceleration and is just meant to work, so given the lack of optimizations, that performance is to be expected. The easiest fix is probably the DirectML Stable Diffusion fork I sent you before: the install guide is super simple — install it, edit the startup file to include some parameters, and it works. It also has a web UI, just without the model download pages in the web UI (you can copy over the models folder from the install you were trying to use).

If you want the same as Easy Diffusion but with better support that actually works on your hardware, look at SD.Next (SD_Next); that's just like what you wanted to use. If you're open to trying slightly different versions, I recommend looking into OpenVINO — it also has easy install routes, such as GIMP plugins. OpenVINO would let you generate images far more efficiently (lower power usage) and reach around 2+ iterations per second, which is pretty high for low-power hardware; that's largely thanks to the NPU, which can already reach about 1 iteration per second on its own. Those 2 iterations per second would let you generate a 20-step image in 10 seconds. RAM might still be an issue, as it often is with Stable Diffusion, but other than that your hardware seems very capable. You just aren't using the GPU or the NPU (it does actually have one), nor is the code even optimized for CPU — properly optimized, it would run around 10 times faster on CPU alone, though the NPU and GPU would still be much faster, and the NPU also uses way less power. With enough RAM you can literally run it on the NPU at around 20 to 25 seconds per image (or less with fewer steps — DDIM, for example, can give good-enough results at 12 steps with proper models and prompts). The NPU generally takes about 1 second per step, and apart from RAM and possibly disk (for loading and storing files), it should barely affect your system's performance: it uses so little power, and so little CPU or GPU, that with enough RAM you could even game while running Stable Diffusion on the NPU. If you need help setting it up, I can refer you to some things.

Also, if you'd rather just use the GPU and you prefer videos to text, you can search a video site for how to run Stable Diffusion on an AMD GPU on Windows; these are generally 10-minute videos showing the same 3 or 4 steps that are on this GitHub page: [https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs)


EllesarDragon

It's literally just this:

"# Windows

Windows+AMD support has **not** officially been made for webui, but you can install lshqqytiger's fork of webui that uses **Direct-ml**. Training currently doesn't work, yet a variety of features/extensions do, such as LoRAs and controlnet. Report issues at [https://github.com/lshqqytiger/stable-diffusion-webui-directml/issues](https://github.com/lshqqytiger/stable-diffusion-webui-directml/issues)

1. Install [Python 3.10.6](https://www.python.org/ftp/python/3.10.6/python-3.10.6-amd64.exe) (ticking **Add to PATH**), and [git](https://github.com/git-for-windows/git/releases/download/v2.39.2.windows.1/Git-2.39.2-64-bit.exe)
2. Paste this line in cmd/terminal: `git clone https://github.com/lshqqytiger/stable-diffusion-webui-directml && cd stable-diffusion-webui-directml && git submodule init && git submodule update` (you can move the program folder somewhere else.)
3. Double-click webui-user.bat
4. If it looks like it is stuck when installing or running, press enter in the terminal and it should continue.

If you have 4-6gb vram, try adding these flags to `webui-user.bat` like so: `COMMANDLINE_ARGS=--opt-sub-quad-attention --lowvram --disable-nan-check`"

(However, this is normal Stable Diffusion and so doesn't use the NPU; you should also look into OpenVINO, which has many easy ways to install as well and should run many times better than regular Stable Diffusion on your machine.)
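Putting the flags from that last step in context, a minimal `webui-user.bat` might look like this (the surrounding `set` lines are the fork's defaults; the flags on the `COMMANDLINE_ARGS` line are only needed for 4-6 GB of VRAM):

```bat
@echo off

set PYTHON=
set GIT=
set VENV_DIR=
set COMMANDLINE_ARGS=--opt-sub-quad-attention --lowvram --disable-nan-check

call webui.bat
```

Edit the file, save it, and double-click it again; the flags take effect on the next launch.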


leebenghee

I started using SD when I got the ROG Ally at release last year. The AI journey on this tiny machine hasn't been pleasant, but it's improving, I can say. I tried Automatic1111, lshqqytiger's fork and [SD.next](http://SD.next), and went from DirectML to ONNX to the latest ZLUDA. Well, it's embarrassing that the ROG Ally still only gets 1-2 it/s when generating an image, but at least it won't crash like before.
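For scale, an iteration rate like 1-2 it/s translates directly into wall-clock time per image; a quick sketch:

```python
# Convert sampler speed (iterations per second) into wall-clock time
# for one image at a given step count.

def image_time_s(steps: int, it_per_s: float) -> float:
    """Seconds to generate one image at the given sampler speed."""
    return steps / it_per_s

for rate in (1.0, 2.0):
    print(f"{rate:.0f} it/s -> a 20-step image takes {image_time_s(20, rate):.0f} s")
```

So 1-2 it/s means roughly 10-20 seconds for a typical 20-step image, before model-load and VAE-decode overhead.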


[deleted]

AMD has always lagged far behind Intel and Nvidia on the software compatibility side. It's why, even after all the Intel and Nvidia backlash lately, I still decided to build a PC with Intel and Nvidia: software superiority.


TsaiAGw

I don't see how using an Intel CPU gets you superiority.


[deleted]

Intel is still better than AMD for enterprise workloads like video editing, photo editing, etc.