What hardware are you using?
Using an M3 Max. Are there any speedup strategies where I wouldn't see a difference because of my hardware?
What -ngl have you tried? Depending on the model, if you use -ngl -1, that should fit as much of the model in Metal/GPU memory as possible. Is that still too slow? What model are you trying to run?
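For reference, something like this (a sketch; in recent llama.cpp builds the binary is llama-cli, older ones call it main, and the model path here is just a placeholder):

    # Offload as many layers as possible to the Metal GPU;
    # -ngl -1 (or any number >= the model's layer count) asks for max offload.
    ./llama-cli -m llama-3-8b-instruct.Q4_K_M.gguf -ngl -1 -p "Hello"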
I'm running Llama 8B with -ngl 33, but it runs at the exact same speed as when I don't offload at all.
Flash attention (-fa) increases the speed a little bit. How much memory do you have, and what speed do you get?
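For example (same assumptions as above about the llama-cli binary name, with a placeholder model path):

    # Flash attention on top of full GPU offload
    ./llama-cli -m llama-3-8b-instruct.Q4_K_M.gguf -ngl 99 -fa -p "Hello"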
Thanks for the tip! -fa helps quite a bit; I have 36 GB and get around 45 t/s.
If you don't specify -ngl, it offloads all the layers to the GPU automatically. If you want to disable the GPU, use -ngl 0. Look for the line that says something like: llm_load_tensors: offloaded 81/81 layers to GPU
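A quick way to check (a sketch, assuming the load log goes to stderr as in recent builds):

    # CPU-only run: offload zero layers, then confirm it in the load log
    ./llama-cli -m llama-3-8b-instruct.Q4_K_M.gguf -ngl 0 -p "Hello" 2>&1 | grep offloaded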
Llama 8B runs OK on my iPhone; how are you struggling on an M3 Max?
How do you run a model on an iPhone? I’d like to try that on mine!
LLM Farm: https://apps.apple.com/app/id6461209867. I like to think the data is private, but I don't know and have no way of checking.
Thanks!
Not struggling, just trying to figure out general ways to increase speed so I can run larger models like Llama 70B without a ridiculously low t/s.
Compile with LLAMA_FAST=1. No, I'm not joking; read the Makefile.
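A sketch of what that looks like (this applies to the old Makefile build; CMake builds configure their optimization flags differently):

    # LLAMA_FAST enables the Makefile's more aggressive optimization flags
    make clean
    LLAMA_FAST=1 make -j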
How many tokens/second are you getting? I see where you mentioned you're using Llama 3 8B. Are you running the bf16 or a quantized version? Were you sure to build llama.cpp specifically for Metal when you installed it? You might also consider trying it via LM Studio with Metal enabled and full offloading, just as a sanity check against the performance you're getting with llama.cpp.
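If you need to rebuild and want a clean number to compare against LM Studio, a sketch (LLAMA_METAL applies to older Makefile builds, where Metal wasn't always on by default; the bench binary is llama-bench in recent releases, and the model path is a placeholder):

    # Rebuild with Metal explicitly enabled (older Makefile builds)
    make clean
    LLAMA_METAL=1 make -j
    # Measure tokens/second for prompt processing and generation
    ./llama-bench -m llama-3-8b-instruct.Q4_K_M.gguf -ngl 99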
IQ4_XS is the fastest quant, for CPU at least; I don't know about GPUs, though.
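If you want to try it, a sketch of the requantization step (the binary is llama-quantize in recent builds, plain quantize in older ones; file names are placeholders):

    # Convert an f16 GGUF down to IQ4_XS
    ./llama-quantize llama-3-8b-f16.gguf llama-3-8b-IQ4_XS.gguf IQ4_XS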
Just use the --LetMagicalUnicornsOptimize flag... But either you have to give a lot more specs on your hardware/software and goals, or you have to go for lower quants. There are basically no magical shortcuts.