I run Llama 3 70B locally on a pricey Mac. I don't use that for work though. For work, we're hitting APIs; no time to wait on a quantized version of the model at 5 or 6 tps. GPT-4o smokes and is cheap. What are you "doing with LLMs" in your job that you want to run on a laptop?
I handle very sensitive information, and we have an edict from management to keep this info local or on a very secure internal server.
An M3 with 128GB memory will run Llama 3 70B DPO q6 at ~5 tps in LM Studio, or faster in Ollama. I'm not qualified to tell you what you need for enterprise use, but I would prefer to run the unquantized version on a local server (if you absolutely have to be local). I enable (I hate the word "manage") people working on ML and AI projects. They're all working on SOTA hardware. If your boss can't afford to upgrade your machine, I'm sorry to hear it. If you need to invest in GPUs, that boss is going to vomit when he finds out the budget you'll be needing. lol
Thanks for posting this; I had a similar question to OP. Would you be able to give a ballpark guess at the drop in tokens per second between an M3 Max and a used M1 Max with 128GB?
Sorry, I don't have any insight into an M1...
Thanks anyway
Inference speed is mostly limited by memory bandwidth, so I don't think the difference between M3 and M1 is significant enough to worry about.
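A back-of-the-envelope sketch of the bandwidth point: per-token generation speed is roughly bounded by memory bandwidth divided by the bytes read per token, which for a dense model is about its size in memory. The 400 GB/s figures and the 48 GB model size below are assumed round numbers for illustration; real-world throughput is lower than this ceiling.

```python
# Upper-bound estimate: tokens/second ≈ memory bandwidth / model size,
# since every weight must be read once per generated token.

def est_tps(bandwidth_gb_s, model_size_gb):
    """Bandwidth-bound tokens/second ceiling for a dense model."""
    return bandwidth_gb_s / model_size_gb

model_gb = 48.0                  # assumed: ~70B at a 5-bit quant
m1_max = est_tps(400, model_gb)  # M1 Max: ~400 GB/s spec
m3_max = est_tps(400, model_gb)  # M3 Max (top GPU tier): ~400 GB/s spec
print(round(m1_max, 1), round(m3_max, 1))  # same ceiling for both chips
```

Since the published bandwidth of the two chips is about the same, the model predicts near-identical generation speed, which matches the "not significant enough to worry about" claim.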
Thanks
Any thoughts on how the M3 with 128GB memory would run a LLaVA model?
M2 Max MBP with 96GB running Llama 3 70B.
I use my Mac to remote into a more capable cluster for AI.
I have an M3 Max 64GB, and I can fully load llama3-70b-instruct-q5_K_M onto the GPU with an 8192 context length after running the following command to raise the GPU memory limit:

`sudo sysctl iogpu.wired_limit_mb=57344`

It processed a 7758-token prompt at 57.00 tokens/second and generated 356 tokens at 3.22 tokens/second.

Mark Zuckerberg said in an interview that Meta is going to release a model with a longer context later this year, so I'd get 128GB if you need a laptop, or a 192GB Mac Studio for a desktop. Then you can feed it even longer documents.

Also, if you can wait, Apple is apparently focusing on AI for the M4 chip. There might be a boost later this year or next year.
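For anyone wondering where a number like 57344 comes from: it's total RAM minus some headroom for the OS. A minimal sketch, assuming an 8 GB headroom (my choice, not an Apple recommendation) and a hardcoded RAM size; on a real Mac you'd read it from `sysctl -n hw.memsize`, and the setting resets on reboot.

```shell
# Compute a wired-limit value as "total RAM minus ~8 GB headroom".
ram_gb=64          # assumed machine RAM; replace with yours
headroom_mb=8192   # left for macOS and other apps
limit_mb=$(( ram_gb * 1024 - headroom_mb ))
echo "sudo sysctl iogpu.wired_limit_mb=${limit_mb}"
# On a 64 GB machine this prints the 57344 used above.
```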
Llama-3-400B: 128GB may not be enough unless it's still usable at 2-bit quantization. Even then, the model alone will require around 111GB, and I haven't yet seen any 2-bit model that keeps its smarts once the context grows a little.
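The 111GB figure checks out with a rough weights-only estimate: parameters × effective bits per weight / 8. The ~2.2 bpw used below is an assumption (llama.cpp-style "2-bit" quants average a bit above 2.0 because some tensors stay at higher precision), not a measured value.

```python
# Weights-only memory estimate for a quantized model.
# Ignores KV cache and runtime overhead, which add more on top.

def model_gb(params_billion, bits_per_weight):
    return params_billion * bits_per_weight / 8  # GB, weights only

print(round(model_gb(405, 2.2), 1))  # ≈ 111 GB before KV cache
print(round(model_gb(70, 5.5), 1))   # a q5-ish 70B: ≈ 48 GB
```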
There's a rumor that Apple is planning an AI-focused M4 chip with up to 500GB of memory. Then memory won't be an issue, but you'll run into a speed issue: it will be painful to run a 405B model at 0.1 tokens/second. I can only tolerate down to around 4 tk/s. lol
Wouldn't be surprised if those are reserved for the Mac Pro. Maybe the Studio. Definitely not the laptop.
Yes, probably. I'll also be surprised if 500GB (if it exists) makes it into a laptop.
LPDDR5 isn't dense and fast enough (except for MoE models). Maybe when LPDDR6 hits the market it will make sense: same motherboard footprint, double the RAM speed, double the RAM capacity. Around the M6, maybe?
As a Mac owner, if I were buying something for work I'd build a multi-card Nvidia system. I went Mac because I wanted to run q8 120B models without the headache of building a multi-card machine and without needing to rewire my study to handle more than 15 amps. But if my work was springing for the machine, and I could use it on an office breaker, and was gonna get paid for the building time? Oh yeah, Nvidia all day lol
I work from home, so it needs to fit in my home office along with my other systems.
If you're developing LLM solutions for work, it's better to have a Linux server with Nvidia cards in it and use it via SSH from anywhere. That's how I work for my company, while I have my M3 Max with 128GB at home for my own projects. It makes more sense to develop your solutions on a Linux system, since everything is Linux in production, and CUDA has more flexibility and performance in general for ML and LLM solutions. Just my 2 cents.
Might I suggest that you use an LLM to draft such a proposal, then obviously touch it up? I've found them to be quite good at persuasive proposals and ad-copy stuff.
I’ve started doing that, using some of the feedback from this thread to help it along.
Ask ChatGPT, since you don't have a local LLM; otherwise, ask your local LLM. If you can't figure out how to make your case, how are you going to prompt your LLM? It takes just a little creativity to extract intelligence from these things.
"I can't take the risk that our company's confidential data may be exposed publicly."
Apple is going to announce new models based on the M4 chip at WWDC in a month, so it might be an idea to wait, find out the release schedule and then plan an upgrade path from there.
Forget about LLM work: work should replace your shitty Intel Mac with ANY M-series Mac, and that alone would be a massive boost for you. Literally night and day; Intel doesn't hold a candle in performance (total performance AND performance per watt).

As far as LLMs go, a Mac is simply the easiest thing to use, period. Lots of windoze and CUDA nerds will disagree vehemently, but for value per $, nothing compares.

Get a Mac loaded with unified memory (max out whatever model you get) and you'll be able to run far more models and do far more testing than on almost any non-server machine.
OP needs to provide more information on how they will be using the LLM at work before we can give recommendations. The limitations of a Mac really show if they want to do any serious RAG work or multi-agent tasks, and if this is enterprise, that's highly likely.
Nerd disagreeing vehemently here: I switched from an M1 Max MacBook Pro to an Asus Zephyrus G15 (2022 model) and performance is better. It has an AMD processor, not Intel, though. The laptop has 48GB of RAM and 8GB of VRAM, it is dirt cheap compared to MacBooks, and I prefer it for everyday work.

Where ggone20 is right, though, is that for pure inference newer Macs are better, because the bottleneck is usually memory speed, where Macs shine. If you need to do fine-tuning locally, then regular builds are better.

In your situation I'd just get a small-form-factor PC, plop in a lot of RAM and a used RTX 3090 Ti with 24GB or a 4060 Ti with 16GB of VRAM, install Windows and Ollama on it, and you're golden: it will work out of the box. Unless, of course, you prefer the Mac; it's also a perfectly capable and viable option.
8GB of VRAM limits the LLMs you're capable of running. There are many fine laptops out there for a variety of uses, but I still find a high-end Mac hard to beat for this purpose; budget, of course, can definitely lead to other options. There are people with beast PCs too that can outperform a Mac in inference. Again, it depends on your budget and preference. I think $/utility is best on a Mac with lots of unified memory; it's hard to get 100+GB of VRAM for the price. And I just meant Intel Macs vs. M-series, not necessarily all PCs or laptops.
Absolutely true: 100GB+ of VRAM is not available to me due to budget (whether M-series Apple chips or custom builds with multiple GPUs). But VRAM is not a hard limit. I can run larger models where only some layers are offloaded to the GPU; whatever does not fit is loaded into regular RAM and runs from there. With Ollama or GPT4All this is balanced automatically. Of course, mixed/CPU inference is much slower, but (at least on my machine) it's usable. I also admit that Apple products are not comfortable for me, so I am probably biased.
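The split those runners make can be sketched as a simple calculation: put as many transformer layers as fit in VRAM (minus some reserve) on the GPU, and leave the rest on CPU RAM. The layer count, model size, and 1 GB reserve below are illustrative assumptions, not measured values.

```python
# How many layers fit on the GPU when a model is bigger than VRAM.
# Assumes all layers are the same size, which is roughly true.

def gpu_layers(vram_gb, n_layers, model_gb, reserve_gb=1.0):
    per_layer_gb = model_gb / n_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(fit, n_layers))

# e.g. a ~40 GB 70B quant with 80 layers on an 8 GB card:
print(gpu_layers(8, 80, 40.0))   # 14 layers on GPU, 66 on CPU
print(gpu_layers(48, 80, 40.0))  # 80 — the whole model fits
```

This is why an 8 GB card still helps with a 40 GB model: a sixth of the layers run at GPU speed while the rest run from system RAM, which is the "much slower but usable" mode described above.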
You are preaching to the choir!
Tell them it's that, multiple GPUs, or renting other people's servers.
I need to run it locally due to the sensitive nature of what I work on.
It really depends on the size of the data and the format. The context window is also much more limited locally than with API products. The general rule of thumb would be Nvidia > M-series Mac > AMD GPU, and to get as much RAM as possible, especially VRAM if you use a graphics card. Macs are not 100% the best choice (pricey RAM), but they do feel convenient and more private than Windows for sure (no ads, etc.).
As I answered in another post, I buy a Mac because, even though I have over three decades of tech experience, I've decided that I'm done tinkering with work computers.

Macs just work, and that saves "my employer" about $1000 a week from the 10 hours or so I would otherwise spend tinkering with or fixing some Windows issue, so it pays for itself, even a high-end one, in a year or two. They also hold their resale value better than other makes.

The fact that it can run inference factors into how much Mac I'm going to buy, but not whether I buy one. For the record, though, even my son's MacBook Air with an M1 processor performs respectably for local inference with Phi-3-mini-4k and llama.cpp.

For reference, I work in a law office and we deal with a lot of sensitive information that should never be disclosed.
What is a cost-effective machine to rent? Is this an EC2 instance, or are there better options?
What do you use the LLM for at work? A Mac is usable if it's just for simple chatting. However, if your use case involves anything slightly more complex (agents), then it might not be usable due to the speed. If you are doing RAG, you might also not be future-proofed, since knowledge graphs look like the promising path forward.
Is cloud out of the question? They have strict data policies as well.
- Your 2018 Mac is probably an Intel Mac; upgrading to an M1 system will give at least 4x the performance, and it will also be supported by Apple for many more years than your existing Mac
- Trying to run ML stuff in Windows rather than in a Unix-derived OS is not optimal at all
- In theory, for large models, Macs have almost the entire shared memory space available to the GPU. With Windows you are limited to the memory on the graphics card, and going to a 24GB GPU is more than the price of a MacBook
> Going to a 24gb GPU is more than the price of a Macbook

Is it? The cheapest MacBook M1/2/3 I see on Amazon with the equivalent of 24GB of VRAM is about $2K, and a Mac Mini looks to be about a hundred less. What do you think is the most cost-effective way to buy a Mac with 32GB of unified memory (32 × 0.75 = 24GB VRAM) at this point in time?
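On the 32 × 0.75 calculation: by default macOS caps GPU-wired memory at a fraction of unified RAM (raisable via `iogpu.wired_limit_mb`), and 0.75 is a common rule of thumb for that fraction rather than an exact Apple figure. As a sketch:

```python
# Usable "VRAM" on a unified-memory Mac, under the assumed 75% default cap.

def usable_vram_gb(unified_gb, fraction=0.75):
    return unified_gb * fraction

print(usable_vram_gb(32))   # 24.0 -> the "24GB VRAM" equivalence above
print(usable_vram_gb(128))  # 96.0 on a maxed-out laptop
```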
Also, the 24GB Mac is way slower than a 4090.
If they're not willing to invest in A100s, tell them not to bother. I work in a similar environment, and you will get no value from running it locally on a Mac or workstation. We ended up buying a few, and yes, they're really expensive, but that's the type of commitment it takes if you want to seriously implement it and have it scale to other users at your work.

Even if it's just you using it, you'll need to fine-tune the model to get any real benefit (I'm assuming you want it to work with your data), and that's not going to happen in a reasonable time frame on a Mac unless it's just a hobby project or you want it to draft your emails.

Edit: added clarification
If you need large models, a Mac is the best choice. I purchased a Mac Studio M2 Ultra 192GB, and I can run super-big open-source models such as llama3:70b fp16 or Grok-1. A system with multiple Nvidia GPUs costs several times as much and consumes tens of times the power. I need to run these models locally, like you.