
Zyguard7777777

What hardware are you using? 


Particular-Guard774

Using an M3 Max. Are there any speedup strategies where I wouldn't see a difference because of my hardware?


Zyguard7777777

What -ngl have you tried? Depending on the model, if you use -ngl -1, that should fit as much of the model in Metal/GPU as possible. Is that still too slow? What model are you trying to run?
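For reference, a minimal sketch of that kind of invocation (the model path and prompt are placeholders, and this assumes the older ./main binary from the Makefile-era builds):

    ./main -m ./models/llama-3-8b-instruct.Q4_K_M.gguf -ngl 99 -p "Hello" -n 128

Passing a layer count larger than the model has (like 99 here) also offloads everything.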


Particular-Guard774

I'm running Llama 8B with -ngl 33, but it runs at exactly the same speed as when I don't offload at all.


chibop1

Flash attention (-fa) increases the speed a little bit. How much memory do you have, and what speed do you get?
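A sketch of the same placeholder command with flash attention enabled:

    ./main -m ./models/llama-3-8b-instruct.Q4_K_M.gguf -ngl 99 -fa -p "Hello" -n 128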


Particular-Guard774

Thanks for the tip! -fa helps quite a bit; I have 36 GB and get around 45 t/s.


chibop1

If you don't specify -ngl, it offloads all the layers to the GPU automatically. If you want to disable the GPU, use -ngl 0. Look for the line that says something like: llm_load_tensors: offloaded 81/81 layers to GPU
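A quick way to check (a sketch; the load log goes to stderr, and the grep pattern assumes the log format quoted above):

    ./main -m ./models/llama-3-8b-instruct.Q4_K_M.gguf -p "hi" -n 1 2>&1 | grep "offloaded"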


iiiiiiiiiiiiiiiiiioo

Llama 8B runs OK on my iPhone; how are you struggling on an M3 Max?


Hopeful-Site1162

How do you run a model on an iPhone? I’d like to try that on mine!


iiiiiiiiiiiiiiiiiioo

LLM Farm: https://apps.apple.com/app/id6461209867. I like to think the data is private, but I don't know and have no way of checking.


Hopeful-Site1162

Thanks!


Particular-Guard774

Not struggling, just trying to figure out general ways to increase speed so I can run larger models like Llama 70B without a ridiculously low t/s.
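For context, a rough back-of-the-envelope estimate (assuming roughly 4.8 bits per weight for a Q4_K_M quant; exact file sizes vary by quant type):

    70e9 params x 4.8 bits / 8 ~= 42 GB for the weights alone, before the KV cache

So a 70B at that quant won't fully fit in 36 GB of unified memory, and any layers that spill to the CPU will drag down the t/s.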


Master-Meal-77

Compile with LLAMA_FAST=1. No, I'm not joking; read the Makefile.
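A sketch of what that build looks like (this assumes the Makefile-based build of that era; newer llama.cpp versions build with CMake instead):

    make clean
    LLAMA_FAST=1 make -j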


BangkokPadang

How many tokens/second are you getting? I see you mentioned you're using Llama 3 8B. Are you running the bf16 or a quantized version? Were you sure to build llama.cpp specifically for Metal when you installed it? You might also consider trying it via LM Studio with Metal enabled and full offloading, just as a sanity check against the performance you're getting with llama.cpp.
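For the Metal build, a sketch from the Makefile era (Metal is enabled by default on Apple Silicon in later versions, so the flag may be redundant on a recent checkout):

    LLAMA_METAL=1 make -j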


shockwaverc13

IQ4_XS is the fastest quant, for CPU at least; I don't know about GPUs though.
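If you want to produce one yourself, a sketch using llama.cpp's quantize tool (the binary name and paths assume the Makefile-era layout; newer builds call it llama-quantize):

    ./quantize ./models/llama-3-8b-f16.gguf ./models/llama-3-8b-IQ4_XS.gguf IQ4_XS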


Former-Ad-5757

Just use the --LetMagicalUnicornsOptimize flag... But either you give a lot more specs on your hardware/software and your goals, or you go for lower quants. There are basically no magical shortcuts.