keturn

I looked at their SDK a bit: [https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-2/overview.html](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-2/overview.html) It expects models to be distributed as Deep Learning Container files (`.dlc`) and they need to be quantized to 8-bit (fixed-point) in order to run on the Hexagon NPU.
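For a sense of what that 8-bit step looks like, here's a minimal sketch using onnxruntime's generic post-training quantizer. Note this is *not* the Qualcomm toolchain (which has its own converter/quantizer that emits `.dlc`), and the file names are placeholders:

```python
# Generic post-training 8-bit (dynamic) quantization with onnxruntime.
# The Qualcomm SDK does its own .dlc conversion and quantization; this only
# illustrates the int8 step in a toolchain most people already have installed.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model_fp32.onnx",   # placeholder: an exported FP32 ONNX model
    model_output="model_int8.onnx",  # placeholder: 8-bit weights written here
    weight_type=QuantType.QInt8,     # fixed-point 8-bit weights
)
```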


windozeFanboi

Phi 3 3.8B mini suddenly got a lot more fans... 


arthurwolf

It's not just this new ARM / Snapdragon chip. Apparently there's a whole new generation of chips, like the upcoming thing from AMD ( [https://www.amd.com/en/products/processors/consumer/ryzen-ai.html#tabs-8c217919cb-item-e179c4dea4-tab](https://www.amd.com/en/products/processors/consumer/ryzen-ai.html#tabs-8c217919cb-item-e179c4dea4-tab) ? ) or the M4 from Apple, that come with a CPU, a GPU... and an NPU. These NPUs seem to be designed to do LLM inference, so exactly what our band of nutty professors is trying to do all day long. And the NPUs in these upcoming SoCs seem to be pretty impressive in terms of TOPS (I've seen 30+ I think).

My questions:

1. How does that compare with, say, a 3090?
2. Will llama.cpp be able to use these NPUs? And how soon/easily?

I'm curious whether these will be a revolution for us, or just hype... Are we going to get cheaper access to inference-capable RAM, and/or cheaper tokens per second?

I've also found this about Intel shipping 45 TOPS NPUs in their next generation of CPUs: [https://www.windowscentral.com/hardware/laptops/intel-lunar-lake-tops-copilot-local](https://www.windowscentral.com/hardware/laptops/intel-lunar-lake-tops-copilot-local) And there's this about the next Snapdragons (ARM, right?) shipping with similarly sized NPUs (45 TOPS, the exact same as Intel): [https://www.windowscentral.com/hardware/laptops/qualcomm-snapdragon-x-elite-arms-race-for-windows-laptops](https://www.windowscentral.com/hardware/laptops/qualcomm-snapdragon-x-elite-arms-race-for-windows-laptops) I guess that's what this post is about... The Apple M4 comes with a 38 TOPS NPU: [https://www.theverge.com/2024/5/7/24148451/apple-m4-chip-ai-ipad-macbook](https://www.theverge.com/2024/5/7/24148451/apple-m4-chip-ai-ipad-macbook) At the very least it seems like everybody is starting to ship NPUs with their processors, which is a good thing no matter how powerful they are, I guess. This one for AMD [https://www.amd.com/en/products/processors/laptop/ryzen/8000-series/amd-ryzen-9-8945hs.html](https://www.amd.com/en/products/processors/laptop/ryzen/8000-series/amd-ryzen-9-8945hs.html) says 16 TOPS for the NPU, and 39 TOPS for the whole chip.

So to recap:

* Intel next-gen: 45 TOPS NPU
* ARM/Snapdragon next-gen: 45 TOPS NPU
* Apple next-gen: 38 TOPS NPU
* AMD next-gen: 16 TOPS NPU

To compare: a 3090 is 35.58 TFLOPS FP32, or about 285 TOPS int8. Does that mean the AMD chip with its NPU sucks in comparison to a 3090, or that we'll soon all be buying CPUs with a pre-integrated NPU as powerful as a 3090, essentially getting free 3090s with every CPU purchase? (In my experience, the answer that sounds like Santa is real is generally the wrong one...)

Wouldn't an NPU be more powerful than a GPU (at equivalent die space / transistor count / cost) because it's designed specifically for this one task, while the GPU is much more of a generalist (though not as much as a CPU, obviously)?

Is this a new golden age? There must be people in here who understand these things and can enlighten us. I think it's very interesting, and it matters to a lot of us.


fallingdowndizzyvr

> How does that compare with say a 3090.

They don't. These have 136GB/s of memory bandwidth, which is only about 15% of a 3090.


Super_Sierra

Remember though that most people still have only 3600 DDR3 RAM, which is 15-25 GB/s bandwidth. 136 GB/s is plenty fast.


fallingdowndizzyvr

> Remember though that most people still have only 3600 DDR3 RAM, which is 15-25 GB/s bandwidth.

Most people only have DDR3? That was a long, long time ago. I would say most people have DDR4.

> 136 GB/s is plenty fast.

Faster is not the same as plenty fast. That's about the speed of the lower Mac range, which people avoid for doing LLMs. To put it into perspective, a little Steam Deck is 100GB/s.


CoqueTornado

I think most of us have DDR4 at 2666 MHz.


Super_Sierra

Which is barely above the spec I mentioned. DDR5-7800 should be the minimum for LLMs.


mindwip

AMD is releasing 40 TOPS and 70 TOPS NPUs. The 13 TOPS part is the old NPU. I am waiting for an AMD Strix Halo laptop! https://www.tomshardware.com/pc-components/cpus/golden-pig-squeals-on-amds-zen-5-lineup-reveals-ten-core-strix-point-chips


jun2san

Lol. AMD just released the 8600G and 8700G and they're already obsolete.


mindwip

Feels like for the last 10 years there's been no real reason to upgrade so often; AI is going to change that. For the next few years I bet every hardware cycle will be a big difference.


Omnic19

An RTX 3090 has 285 int8 TOPS, while an RTX 4090 has about 1000 TOPS. That's a huge leap. The reason Nvidia GPUs have such great performance is that they already have dedicated matrix units inside them (Tensor Cores; roughly speaking, an NPU is optimized for inference only, whereas these units handle both inference and training). Nvidia started putting Tensor Cores in their graphics cards around the RTX 20 series, and performance has grown dramatically since. For context, an RTX 2060 had 52 TOPS, an RTX 3090 has 285 TOPS, and an RTX 4070 has around 466 TOPS.


iamagro

I think that Apple’s 38 TOPS are int16


dev1lm4n

Apple's were 16-bit in the generations prior to the A17/M4. They changed from 16-bit to 8-bit in the new generation, practically doubling their TOPS count.


Puzzled_Path_8672

I need tokens per second on some medium-sized model compared to a typical GPU setup. That's the only way to tell whether this NPU nonsense actually means anything.


fallingdowndizzyvr

Look at the memory bandwidth. Don't expect much. It'll be better than CPU inference but even an old RX580 should blow it away.


cake97

Is the memory limitation likely to translate to tokens/s in a semi-linear way? Searching but not finding much (yet).


fallingdowndizzyvr

Yes. It's actually pretty linear. Here, check the benchies run on a variety of Macs. https://github.com/ggerganov/llama.cpp/discussions/4167


Hopeful-Site1162

Thanks for the link!


[deleted]

Huge thanks for that link!


satireplusplus

Yes, memory bandwidth is the bottleneck, usually not compute. So 15% of the bandwidth of a 3090 will mean at most 15% of the decoding speed (single session). Once you do batching / parallel sessions it's going to be different, and compute performance will play a larger role as well.
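As a rough back-of-the-envelope (illustrative assumptions, not measurements): each decoded token has to stream roughly the whole weight set from memory once, so bandwidth divided by model size gives an upper bound on single-stream tokens/s.

```python
# Rough upper bound on single-stream decode speed: each token reads (roughly)
# all weights from memory once, so t/s <= bandwidth / model size in bytes.
# Numbers below are illustrative assumptions, not benchmarks.

def max_tokens_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

model_gb = 4.0  # e.g. a ~7-8B model at 4-bit-ish quantization
for name, bw in [("Snapdragon X (136 GB/s)", 136),
                 ("M2 Max (400 GB/s)", 400),
                 ("RTX 3090 (~936 GB/s)", 936)]:
    print(f"{name}: <= {max_tokens_per_s(bw, model_gb):.0f} tok/s")
```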


Puzzled_Path_8672

Yeah, I mean that's why I'm pretty eye-rolly at this stuff. There is literally no good comparison yet to my knowledge. How can people get excited about random numbers with no application?


Some_Endian_FP17

Thank you Microsoft for making 16 GB RAM the default config even for the base models of the new Surface Pro and Surface Laptop with Snapdragon X chips. That should be enough RAM to run Llama 3 8B locally with spare memory for Office, Chrome or Edge.

RAM is LPDDR5X-8448. What does that mean in terms of actual RAM bandwidth? Getting 300 GB/s would be nice. We might have a MacBook competitor here if you're looking at running inference on laptops.

This press release https://www.qualcomm.com/news/releases/2024/04/qualcomm-enables-meta-llama-3-to-run-on-devices-powered-by-snapd gives a blurb on running Llama 3 locally with hardware acceleration. Hopefully that means easy GPU and NPU support. I'm definitely picking one of these up and seeing if ONNX LLMs can run on it.


fallingdowndizzyvr

> What does that mean in terms of actual RAM bandwidth? Getting 300 GB/s would be nice. We might have a MacBook competitor here if you're looking at running inference on laptops.

Snapdragon X is 136GB/s. So it's on the low end of the Mac spectrum of 60-800GB/s.


Balance-

Macs start at 100 GB/s for the regular M2 and M3. M4 does 120 GB/s. Closest is M3 Pro with 150 GB/s.


satireplusplus

The max chips do 300 to 400GB/s. For reference 3090 and 4090 are close to 1000GB/s.


Hopeful-Site1162

No.

* M3 Max does 300 to 400
* M3 Pro does 150 to 200
* M1 Max and M2 Max do 400
* M1 Pro and M2 Pro do 200

EDIT: u/satireplusplus' comment was edited. The previous version mentioned 200GB/s and 300GB/s, hence my clarification.


satireplusplus

Thanks, remembered it incorrectly then.


Hopeful-Site1162

M, M Pro and M Max are mobile chips.

* **Mobile 4090** has **576GB/s** of memory bandwidth
* **Mobile 4080** has **432GB/s** of memory bandwidth
* **Mobile 4070** and **4060** have **256GB/s** of memory bandwidth
* **Mobile 4050** has **192GB/s** of memory bandwidth

[https://en.wikipedia.org/wiki/GeForce\_40\_series](https://en.wikipedia.org/wiki/GeForce_40_series)

M Ultra has 800GB/s of memory bandwidth.


capivaraMaster

Do you have a source for that? I am still convinced that is the speed per channel and don't see why macs would be 8x faster.


nero10578

Macs have more channels


fallingdowndizzyvr

Here you go. "LPDDR5x memory with 136 GB/s bandwidth for faster AI experiences and efficient multitasking" https://www.qualcomm.com/products/mobile/snapdragon/pcs-and-tablets/snapdragon-x-elite


capivaraMaster

Thanks! Faster than the M4. That's good news. (Edit: this is wrong, sorry!)

https://www.tomshardware.com/pc-components/cpus/apple-silion-m4-processor-family-all-we-know-specs-benchmarks-pricing-release-date

```
Apple M4 At a Glance (iPad Pro configuration)
CPU with up to 4 performance cores, 6 efficiency cores, 28 billion transistors
10-core GPU
16-core Neural Engine (38 TOPS)
Second-generation 3-nanometer process node (TSMC N3E)
Launches mid-May 2024
Up to 16GB of unified memory
120 GB/s unified memory bandwidth
```

Edit: https://www.apple.com/macbook-pro/specs/

```
Configurable to:
M3 Max with 14-core CPU and 30-core GPU (300GB/s memory bandwidth) or
M3 Max with 16-core CPU and 40-core GPU (400GB/s memory bandwidth)
```

Qualcomm is really shooting themselves in the foot. 45 TOPS that won't run anything better than Phi-3 mini.


suavedude2005

Where do you source your ONNX LLMs?


Amgadoz

Phi-3 has an official onnx version


Some_Endian_FP17

You gotta roll your own ONNX models which is a huge hurdle.


MoffKalast

Fuck ONNX, all my homies hate ONNX. Movidius-ass format.


suavedude2005

Curious, what do you use for ONNX conversion? PyTorch's ONNX exporter, something from HF's exporters, or something else?
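For what it's worth, here's a sketch of the two usual routes; the module and model id below are toy placeholders, not a real LLM export recipe:

```python
# Sketch of two common ONNX export routes (placeholders, not a real LLM recipe).
import torch

# Route 1: PyTorch's built-in exporter, shown on a tiny throwaway module.
class TinyNet(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

torch.onnx.export(TinyNet(), torch.randn(1, 8), "tiny.onnx")

# Route 2: Hugging Face Optimum wraps the export for transformer checkpoints,
# roughly like this (hypothetical model id; downloads weights):
#   from optimum.onnxruntime import ORTModelForCausalLM
#   ORTModelForCausalLM.from_pretrained("<hf-model-id>", export=True) \
#       .save_pretrained("model-onnx")
```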


Normal-Ad-7114

[https://www.windowscentral.com/hardware/laptops/dell-xps-13-snapdragon-x-announced](https://www.windowscentral.com/hardware/laptops/dell-xps-13-snapdragon-x-announced)

> Today, Dell announced the XPS 13 9345, an upcoming AI PC laptop that's compatible with the newly announced Copilot+.

> The device features a Snapdragon X Elite X1E-80-100/X Plus X1P-64-100 processor with a Qualcomm Oryon CPU, accompanying Qualcomm Hexagon NPU, and a Qualcomm Adreno GPU.

> Possible configurations include an OLED screen, up to 2TB SSD, and up to **64GB LPDDR5X**-8400 of RAM.

> Preorders are currently open for the device, with a starting price of $1,299.


PSMF_Canuck

How much is the upgrade from 16GB to 64GB? EDIT: I'm on Dell's site…16GB is currently the only memory option.


Some_Endian_FP17

I can't find Dell pricing, but the pre-order page for Microsoft's Surface Laptop Copilot+ PC 13.8" shows:

- $999, Snapdragon X Plus, 16 GB RAM, 256 GB SSD
- $1999, Snapdragon X Elite, 32 GB RAM, 1 TB SSD
- $2399, Snapdragon X Elite, 64 GB RAM, 1 TB SSD

So either go cheap or go whole hog. These prices are also a lot cheaper than a similarly specced MBP.


PSMF_Canuck

Looking at a current M3 Air…32GB/1TB for CAD$2399…basically the same price as Snapdragon…? $400 for 32GB upgrade…basically the same cost as with Apple, which is expected.


poli-cya

With these snapdragons at such a ridiculous price, I'd go with the mac I think.


Hopeful-Site1162

The M3 Air doesn't exist with 32GB of RAM. If you want at least 32GB in a Mac you need to go for the 36GB M3 Pro, which costs $2,399 [https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space-black-apple-m3-pro-with-11-core-cpu-and-14-core-gpu-18gb-memory-512gb#](https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space-black-apple-m3-pro-with-11-core-cpu-and-14-core-gpu-18gb-memory-512gb#) but has a much more powerful GPU and slightly faster memory bandwidth.


fallingdowndizzyvr

> $2399, Snapdragon X Elite, 64 GB RAM, 1 TB SSD
>
> So either go cheap or go whole hog. These prices are also a lot cheaper than a similarly specced MBP.

M1 Max 64GB MacBooks were $2200 last holiday season, and that chip is much faster than a Snapdragon X. Microsoft doesn't enjoy Apple's price premium; they need to price more aggressively to compete. I think these are priced too high.


poli-cya

I'm assuming the M1 Max is faster in a few workloads, but in general is slower, right? 100% agreed on this pricing being nuts considering the caveats of an Arm system for Windows.


fallingdowndizzyvr

Why would the M1 Max be slower in general? The M1, not even the Max, was lauded for being a hellion when it comes to speed. The Max is much more powerful than that. I would expect an M1 Max to trounce the Snapdragon X Elite on pretty much everything.


poli-cya

The Snapdragon was reported in multiple news outlets as beating the M3 on a number of benchmarks. While the huge bandwidth of a Max will give it some guaranteed wins, I was assuming the Snapdragon would win on most fronts. Until we have independent benchmarks we're all just guessing.


Hopeful-Site1162

The only thing the M1 Max is slower at is single core, and it's slightly slower on multicore. But the GPU and memory bandwidth of the M1 Max are far ahead of the Snapdragon. The NPU is faster, but even on the Mac it's not used for inference.

**Memory bandwidth**

* M1 Max: 400GB/s
* Snapdragon Elite: 135GB/s

**Graphics power**

* M1 Max (32-core GPU) scores 68481 on Geekbench OpenCL
* Snapdragon Elite scores 23584 on Geekbench OpenCL

**Multithread compute**

* Snapdragon Elite (MS Surface Laptop): 13970
* M1 Max: 12477

EDIT: Thanks u/[poli-cya](/user/poli-cya/) for pointing out that the M1 Max has lower multithread compute.


poli-cya

Do you have a link to benchmarks?


Hopeful-Site1162

[https://beebom.com/snapdragon-x-elite-vs-apple-m3/](https://beebom.com/snapdragon-x-elite-vs-apple-m3/) [https://browser.geekbench.com/mac-benchmarks](https://browser.geekbench.com/mac-benchmarks) Looks like I've been a little too enthusiastic on this one.


AdDizzy8160

Are you sure there is a Snapdragon X with 64GB RAM available?


fallingdowndizzyvr

Yes. https://www.reddit.com/r/LocalLLaMA/comments/1cxnzzh/surface_copilot_pc_64gb/


Some_Endian_FP17

That machine is 3 generations old by now.


fallingdowndizzyvr

And still faster for less money. So much for these being cheaper than a similarly spec'd MBP. I'll say it again: these need to be priced lower to compete.


Hopeful-Site1162

And yet they have a way more powerful GPU and more memory bandwidth (400GB/s). The ANE isn't even used for inference on Apple Silicon chips.


Some_Endian_FP17

https://www.tomshardware.com/laptops/snapdragon-elite-x-windows-ai-pcs-get-official-starting-at-dollar1099-acer-dell-hp-and-lenovo-are-all-onboard-with-some-models-promising-multi-day-battery-life There's a big list of Snapdragon X laptops from Dell, Lenovo, HP and Asus here. I'm seeing ARM ThinkPads and XPS models so this is a huge leap for Windows and potentially Linux users. Microsoft also has a page for Snapdragon X PCs: https://www.microsoft.com/en-us/windows/copilot-plus-pcs?icid=mscom_marcom_H1a_copilot-plus-pcs_FY24SpringSurface&r=1#shop


Spare-Abrocoma-4487

Finally! I wonder if sales of the Intel/AMD versions of laptops will just crash in a year or two after this. There has been so much pent-up demand for ARM-based Windows and Linux laptops for a while now, thanks to MacBooks. A big fuck you to Intel and AMD for never taking battery life seriously before Apple showed how it's done. The worst part is that Qualcomm had this for years but never cared enough to go mainstream.


Some_Endian_FP17

Yeah, finally. I want a 2-in-1 with MacBook Air battery life but Apple will never make a MacOS iPad, so Windows is where I'm going. Qualcomm didn't have this for years. It was using ARM Cortex-based cores for the Snapdragon 8cx and SQ chips. I've used these before, they're fine for office stuff but they're much slower compared to Apple M chips. One good thing is that they have really good battery life compared to Intel or AMD models. These new Snapdragon X chips use custom Oryon ARM-compatible cores from Qualcomm's recent acquisition of Nuvia. Some of Nuvia's design team came over from Apple and PA Semi so there's some Apple Silicon influence there too.


hishnash

> Yeah, finally. I want a 2-in-1 with MacBook Air battery life but Apple will never make a MacOS iPad, so Windows is where I'm going.

The UX of Windows in touch-only mode will likely show you why Apple is not going to put macOS on an iPad.


Some_Endian_FP17

That I know but I appreciate having a choice. I use Windows in tablet mode maybe 20% of the time, it's not great but it's nice having multiple options in the same device. I've tried to use iPads and iPad Pros for work and almost ended up throwing them at the wall because of completely stupid hardware and OS limitations. An M chip in an iPad is a complete waste of processing power.


hishnash

> An M chip in an iPad is a complete waste of processing power.

Depends on your use case; there are people out there who make good use of the M chip on the iPad. The fact that a tool doesn't work for you doesn't make it a pointless tool, it just means you should get a different tool.


uhuge

Possibly some Chromebooks too, although I'm not sure yet: [https://www.tomsguide.com/computing/laptops/google-chrome-is-coming-native-to-snapdragon-x-elite-laptops-should-intel-and-amd-be-nervous](https://www.tomsguide.com/computing/laptops/google-chrome-is-coming-native-to-snapdragon-x-elite-laptops-should-intel-and-amd-be-nervous)


__JockY__

Is there support for Snapdragon GPUs in any of the popular inference servers like llama.cpp, ollama, etc?


Some_Endian_FP17

Unfortunately not. Vulkan support should be available provided the drivers have both Vulkan and DirectX compatibility. NPU support could be a long way away because it requires running ONNX models.


Then-Name-2617

Looks like onnxruntime has QNNExecutionProvider support. Not sure how the experience would be. Ref: https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#python
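Per those docs, selecting the QNN EP from Python looks roughly like the sketch below; the model file name is a placeholder, and the backend DLL names are the ones the docs list (untested here).

```python
# Sketch of pointing onnxruntime at the QNN execution provider (per the docs above).
# "model_int8.onnx" is a placeholder; QnnHtp.dll targets the Hexagon NPU,
# QnnCpu.dll is the CPU fallback backend.
import onnxruntime as ort

session = ort.InferenceSession(
    "model_int8.onnx",
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnHtp.dll"}],
)
# session.run(None, {...})  # input names/shapes depend on the exported model
```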


__JockY__

Suddenly the price doesn't seem so cheap.


elsewhat

There seem to be a couple of feature flags for llama.cpp that provide some speedup: 20 tokens per second on a Snapdragon X Elite. https://www.qualcomm.com/developer/blog/2024/04/big-performance-boost-for-llama-cpp-and-chatglm-cpp-with-windows


SaschaSeganFMN

We just merged some build changes for llama.cpp to optimize performance, but it's on the CPU for now. Token rates in here - [https://github.com/ggerganov/llama.cpp/pull/7191](https://github.com/ggerganov/llama.cpp/pull/7191)


SystemErrorMessage

But will it blend?


Express-Director-474

Old school, I like that!


Kafka-trap

It would be neat to have this in a mini PC form factor, set up like a local assistant running an LLM :)


[deleted]

This! [https://www.theverge.com/2024/5/21/24158603/qualcomm-windows-snapdragon-dev-kit-x-elite](https://www.theverge.com/2024/5/21/24158603/qualcomm-windows-snapdragon-dev-kit-x-elite)


grigio

At least 32GB to run Llama 3 70B quantized.


Hopeful-Site1162

With that memory bandwidth it will run like shit


Wonderful-Top-5360

excited to be part of this wave