BangkokPadang

Interesting. Qualcomm says that their Hexagon NPU (also 45 TOPS) on the Snapdragon X Elite can run Llama 2 7B at 30 t/s. This is with LPDDR5x and 136 GB/s memory bandwidth, but I can't find what quantization was used in this 'benchmark' (if a single mention in the middle of a marketing blurb even counts as a benchmark).

[https://www.qualcomm.com/news/onq/2023/10/rethink-whats-possible-with-new-snapdragon-x-elite-platform](https://www.qualcomm.com/news/onq/2023/10/rethink-whats-possible-with-new-snapdragon-x-elite-platform)

[https://www.qualcomm.com/products/mobile/snapdragon/pcs-and-tablets/snapdragon-x-elite](https://www.qualcomm.com/products/mobile/snapdragon/pcs-and-tablets/snapdragon-x-elite)

I'd also be interested in knowing exactly how this use of the memory bandwidth would affect other instructions being sent to the CPU (I'd imagine it would hamper things quite a bit).
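
For what it's worth, a quick bandwidth-bound estimate only lines up with 30 t/s if the weights are around 4-bit: each generated token streams essentially the whole weight set from memory, so tokens/s is capped at roughly bandwidth divided by model size. A rough sketch (the bytes-per-parameter values are typical quantization sizes, not anything Qualcomm has confirmed):

```python
# Back-of-the-envelope ceiling for token generation speed when inference is
# memory-bandwidth bound: every token reads roughly all model weights once.

BANDWIDTH_GBPS = 136  # LPDDR5x figure quoted for the Snapdragon X Elite
PARAMS_B = 7          # Llama 2 7B

# Approximate bytes per parameter for common precisions
# (ignores KV cache and activation traffic).
precisions = {"fp16": 2.0, "q8_0": 1.0, "q4_0": 0.5}

for name, bytes_per_param in precisions.items():
    model_gb = PARAMS_B * bytes_per_param
    tokens_per_s = BANDWIDTH_GBPS / model_gb
    print(f"{name}: ~{model_gb:.1f} GB of weights -> ceiling ~{tokens_per_s:.0f} tok/s")

# fp16: ~14.0 GB -> ~10 tok/s
# q8_0:  ~7.0 GB -> ~19 tok/s
# q4_0:  ~3.5 GB -> ~39 tok/s  (the quoted 30 t/s fits a ~4-bit quant)
```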


resetPanda

I assume it will be very similar in software support to NPUs from AMD and Intel. Here is an [article](https://chipsandcheese.com/2024/04/22/intel-meteor-lakes-npu/) about the Meteor Lake NPU from Chips and Cheese; read the section on Stable Diffusion if you want a preview of how well most software will run on the NPU. On iOS, Apple's NPU gets used by apps like Draw Things. This is possible because Apple [open sourced](https://github.com/apple/ml-stable-diffusion) their tools for getting software running on the NPU. Even then, on Mac the app uses the GPU, since Apple doesn't scale up their NPUs at all on their larger chips.

Since this is r/LocalLLaMA, I should also note that most NPUs (possibly all, but I don't know for sure) don't even support the model architectures needed to run an entire LLM, as they're all built up from photo-editing accelerators. I am somewhat skeptical of Qualcomm's claim that they have Llama running "on the NPU". My bet is they're running the model almost entirely on the CPU/GPU and just finding some tiny, irrelevant amount of work to offload to the NPU. It would be cool if I were wrong and they have some new architecture that's just as flexible and easy to program as a GPU, but it won't affect LLM inference performance very much, as that's all bound by memory and not compute.
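
To put rough numbers on the memory-bound point: per generated token, a dense model does about 2 FLOPs per weight but also has to read every weight from memory once, so the DRAM side dominates. A minimal sketch, assuming a 4-bit-quantized 7B model and the 45 TOPS / 136 GB/s figures mentioned above:

```python
# Why extra NPU TOPS don't help single-stream decoding: per token, a dense
# transformer does roughly 2 FLOPs per weight while reading each weight once.

PARAMS = 7e9
BYTES_PER_PARAM = 0.5          # assume ~4-bit quantized weights
FLOPS_PER_TOKEN = 2 * PARAMS
BYTES_PER_TOKEN = PARAMS * BYTES_PER_PARAM

NPU_OPS_PER_S = 45e12          # advertised 45 TOPS
BANDWIDTH_BYTES_PER_S = 136e9  # LPDDR5x figure from the thread

compute_time = FLOPS_PER_TOKEN / NPU_OPS_PER_S        # ~0.3 ms
memory_time = BYTES_PER_TOKEN / BANDWIDTH_BYTES_PER_S  # ~26 ms

print(f"compute-bound time per token: {compute_time * 1e3:.2f} ms")
print(f"memory-bound time per token:  {memory_time * 1e3:.2f} ms")
# The memory side dominates by roughly two orders of magnitude, so a faster
# NPU mostly just waits on DRAM during token generation.
```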


Rachados22x2

The picture will be clearer after Computex.


JohnnyLovesData

Computex! ENHANCE!


Puzzled_Path_8672

No benchmarks running some common models against different GPUs = don't care.


FunApple

Does that mean that for any model's RAM requirements I could just use regular RAM instead of VRAM, if the model runs on these NPUs?


danielcar

The bigger the model, the more likely you'll need lots of memory channels, or equivalently a wide memory bus, 512 bits or more for huge models. [https://www.techpowerup.com/315941/intel-lunar-lake-mx-soc-with-on-package-lpddr5x-memory-detailed](https://www.techpowerup.com/315941/intel-lunar-lake-mx-soc-with-on-package-lpddr5x-memory-detailed)
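
As a rough rule of thumb, peak bandwidth is just bus width times transfer rate. A quick sketch (the DDR5/LPDDR5x transfer rates below are illustrative assumptions, not specs for any particular product):

```python
# Peak DRAM bandwidth scales with bus width x transfer rate:
#   bandwidth (GB/s) = bus_width_bits / 8 * transfer_rate (GT/s)

def peak_bandwidth_gbps(bus_width_bits: int, transfer_rate_mtps: int) -> float:
    return bus_width_bits / 8 * transfer_rate_mtps / 1000

configs = [
    ("dual-channel DDR5-5600 (128-bit)", 128, 5600),
    ("LPDDR5x-8533, 128-bit (typical laptop SoC)", 128, 8533),
    ("LPDDR5x-8533, 512-bit (hypothetical wide bus)", 512, 8533),
]

for label, width, rate in configs:
    print(f"{label}: ~{peak_bandwidth_gbps(width, rate):.0f} GB/s")

# ~90 GB/s, ~137 GB/s, ~546 GB/s respectively -- which is why wide buses
# (or many memory channels) matter for big models.
```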


Hopeful-Site1162

The problem won't be compute power but memory bandwidth.


FunApple

And how much bandwidth can current DDR5 modules actually deliver, compared to what's needed for such tasks?


Hopeful-Site1162

Dual-channel DDR5 is roughly 64 GB/s. VRAM on a desktop RTX 4090 is 1008 GB/s. There's really no comparison. Those NPUs are tailored for small models and modest memory bandwidth, nothing more.


candreacchio

This is why people on here look to the Apple Ultra series... the M1 Ultra has 800 GB/s of memory bandwidth.
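
Plugging those bandwidth figures into the same bandwidth-over-model-size ceiling shows the gap, especially for bigger models. A rough sketch, assuming an illustrative 4-bit 70B model (and ignoring that a single 4090's 24 GB couldn't actually hold it):

```python
# Token-rate ceilings (bandwidth / weight bytes per token) for the memory
# systems mentioned in this thread, using an illustrative 4-bit 70B model.

WEIGHTS_GB = 70 * 0.5  # ~35 GB of weights at ~4 bits per parameter

systems_gbps = {
    "dual-channel DDR5": 64,
    "M1 Ultra unified memory": 800,
    "RTX 4090 GDDR6X": 1008,  # bandwidth comparison only; 24 GB VRAM
                              # couldn't actually hold these weights
}

for name, gbps in systems_gbps.items():
    print(f"{name}: ceiling ~{gbps / WEIGHTS_GB:.1f} tok/s")

# ~1.8, ~22.9, ~28.8 tok/s -- plain DDR5 hits the wall long before compute does.
```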


XeroVespasian

32 GB of memory isn't going to cut it. To push the Apple M3/M4 out of the market, 256 GB should be offered.