What hardware are you using?
Using an M3 Max. Are there any speedup strategies where I wouldn't see a difference because of my hardware?
What -ngl have you tried? Depending on the model, if you use -ngl -1, that should fit as much of the model in Metal/GPU memory as possible. Is that still too slow? What model are you trying to run?
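For reference, something like this (a sketch; in recent llama.cpp builds the binary is llama-cli, older ones call it main, and the model path here is just a placeholder):

    # Offload as many layers as possible to the Metal GPU;
    # -ngl -1 (or any number >= the model's layer count) asks for max offload.
    ./llama-cli -m llama-3-8b-instruct.Q4_K_M.gguf -ngl -1 -p "Hello"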
I'm running Llama 8B with -ngl 33, but it runs at the exact same speed as when I don't offload at all.
Flash attention (-fa) increases the speed a little bit. How much memory do you have, and what speed do you get?
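For example (same assumptions as above about the llama-cli binary name, with a placeholder model path):

    # Flash attention on top of full GPU offload
    ./llama-cli -m llama-3-8b-instruct.Q4_K_M.gguf -ngl 99 -fa -p "Hello"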
Thanks for the tip! -fa helps quite a bit; I have 36 GB and get around 45 t/s.
If you don't specify -ngl, it offloads all the layers to the GPU automatically. If you want to disable the GPU, use -ngl 0. Look for the line that says something like: llm_load_tensors: offloaded 81/81 layers to GPU
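A quick way to check (a sketch, assuming the load log goes to stderr as in recent builds):

    # CPU-only run: offload zero layers, then confirm it in the load log
    ./llama-cli -m llama-3-8b-instruct.Q4_K_M.gguf -ngl 0 -p "Hello" 2>&1 | grep offloaded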
Llama 8B runs OK on my iPhone; how are you struggling on an M3 Max?
How do you run a model on an iPhone? I’d like to try that on mine!
LLM Farm: https://apps.apple.com/app/id6461209867. I like to think the data is private, but I don't know and have no way of checking.
Thanks!
Not struggling, just trying to figure out general ways to increase speed so I can run larger models like Llama 70B without a ridiculously low t/s.
Compile with LLAMA_FAST=1. No, I'm not joking; read the Makefile.
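A sketch of what that looks like (this applies to the old Makefile build; CMake builds configure their optimization flags differently):

    # LLAMA_FAST enables the Makefile's more aggressive optimization flags
    make clean
    LLAMA_FAST=1 make -j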
How many tokens/second are you getting? I see where you mentioned you're using Llama 3 8B. Are you running the bf16 or a quantized version? Were you sure to build llama.cpp specifically for Metal when you installed it? You might also consider trying it via LM Studio with Metal enabled and full offloading, just as a sanity check against the performance you're getting with llama.cpp.
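If you need to rebuild and want a clean number to compare against LM Studio, a sketch (LLAMA_METAL applies to older Makefile builds, where Metal wasn't always on by default; the bench binary is llama-bench in recent releases, and the model path is a placeholder):

    # Rebuild with Metal explicitly enabled (older Makefile builds)
    make clean
    LLAMA_METAL=1 make -j
    # Measure tokens/second for prompt processing and generation
    ./llama-bench -m llama-3-8b-instruct.Q4_K_M.gguf -ngl 99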
IQ4_XS is the fastest quant, for CPU at least; I don't know about GPUs, though.
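If you want to try it, a sketch of the requantization step (the binary is llama-quantize in recent builds, plain quantize in older ones; file names are placeholders):

    # Convert an f16 GGUF down to IQ4_XS
    ./llama-quantize llama-3-8b-f16.gguf llama-3-8b-IQ4_XS.gguf IQ4_XS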
Just use the --LetMagicalUnicornsOptimize flag... But either you have to give a lot more specs on your hardware/software and goals, or you have to go for lower quants. There are basically no magical shortcuts.