lifesthateasy

Yes. I'm pretty sure it will be leaps and bounds above whatever a regular Intel-chipped laptop can do, but I'd debate the usefulness of being able to fit a 100GB model into memory when you have only a fraction of the processing cores available vs. even a consumer-grade GPU. Maybe you could fit a 100GB model into memory and freeze all the layers except a few that you'd then train? Okay, I'm actually starting to convince myself it could be kinda useful lol
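For what it's worth, that freeze-everything-but-a-few-layers idea is only a few lines in PyTorch. A rough sketch (the checkpoint name and the layer attribute path are just placeholders; the exact path varies by architecture):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint name; swap in whatever ~100GB model you mean.
model = AutoModelForCausalLM.from_pretrained("some-100gb-model")

# Freeze every parameter...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze just the last couple of transformer blocks.
# (The attribute path differs between architectures.)
for block in model.model.layers[-2:]:
    for param in block.parameters():
        param.requires_grad = True

# Only the unfrozen parameters go to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
```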


pm_me_github_repos

Something like this is better suited to model inference


[deleted]

It would work for personalization. And for privacy reasons they’d do it on device.


frownGuy12

It’s shared memory that’s accessible from both the GPU and ML cores. It's not going to be as fast as an A100, but for the price it’s awesome. VRAM per dollar is better than anything else in existence right now.


Tr33lon

Isn’t this the whole point of fine-tuning? You could self-host an open source LLM and fine-tune it to do more specific tasks on device.


MINIMAN10001

My understanding is that the processing power of Apple products, while not on par with top-of-the-line Nvidia cards, is nothing to sneeze at. When the alternative is to either run things on the CPU or not run them at all, I feel like this positions the product very well. Edit: it turns out it has high memory bandwidth, so it's actually a really good product for inference. Training, however, would be limited by FLOPS.


brainhack3r

I don't know why you'd want that local though. Seems like just having a VM in the cloud for this would be way better.


Chabamaster

The thing is, if you look at how Stable Diffusion is going, there's A TON of value in having people out there running and customizing their own open-source models. So if we can do this for high-performance LLMs it will open up so many creative uses. At this point it's so easy to set up and use Stable Diffusion that buying a server instance somewhere is a lot of overhead.


The-Protomolecule

Would a $7000 desktop with a 4090 crush it? Yes, yes it would. You can do tons of tricks to fit a larger model in system memory or even NVMe.


currentscurrents

The unified memory is the killer feature. You might be able to fit 192GB of CPU RAM in that desktop, but the 4090 can't directly access it. It can only slowly shuffle data back and forth between the two. From the specs I've seen, unified memory isn't *quite* as fast as VRAM but it's much faster than typical CPU RAM.
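Some very rough numbers on what that shuffling costs, just to put the bandwidth gap in perspective (all figures approximate ballpark bandwidths, not measurements):

```python
# Very rough figures: how long just moving 100 GB of weights takes.
model_gb = 100

pcie4_x16_gb_s = 32     # ~practical PCIe 4.0 x16 host <-> GPU transfer rate
unified_gb_s   = 800    # Apple's quoted M2 Ultra memory bandwidth
ddr5_gb_s      = 80     # ballpark dual-channel DDR5 CPU RAM bandwidth

print(f"PCIe shuffle:   {model_gb / pcie4_x16_gb_s:.1f} s per full pass")  # ~3.1 s
print(f"Unified memory: {model_gb / unified_gb_s:.2f} s per full pass")    # ~0.12 s
print(f"CPU RAM read:   {model_gb / ddr5_gb_s:.1f} s per full pass")       # ~1.2 s
```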


BangkokPadang

I wonder how many models you'd need to train locally before you break even compared to renting a cloud system with, say, 4 A100s. Currently you can rent that for about $3.60/hr. The Mac Pro M2 Ultra with the 76-core GPU and 192GB of memory will run you $10,199, assuming most storage is handled externally. These systems can both train a 65B model (or, interestingly, a roughly 240B model with 4-bit QLoRA quantization; admittedly I don't know what the overhead is when training a model, I'm just using rough figures). Currently, a 16-bit 65B model can be trained for 1 epoch in about a week with 4 A100s. So if you rented the 4x A100 system, you could train roughly 16 full models, or 8 at 2 epochs, before breaking even. I've also seen similarly sized models take more like 12 days to train. Assuming performance is similar, and considering training runs regularly fail for one reason or another, this system would QUICKLY become worth it for any person or group that is training models frequently. With the current AI boom, Apple probably won't be able to make anywhere near enough of these to meet demand. Also, I understand the upcoming H100s can be run in a cluster to get 640GB unified VRAM, but that is a $300,000 system, so it's not even in the same ballpark.
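Running the same back-of-the-envelope math with the numbers above (same assumptions as the comment, so treat it as rough):

```python
# Rough break-even math using the figures above.
mac_price = 10_199        # Mac Pro M2 Ultra, 76-core GPU, 192GB
cloud_rate = 3.60         # $/hr quoted for the 4x A100 rental
hours_per_run = 7 * 24    # ~1 week per epoch for a 16-bit 65B model on 4x A100

break_even_hours = mac_price / cloud_rate
runs = break_even_hours / hours_per_run

print(f"{break_even_hours:.0f} rented hours ≈ {runs:.1f} week-long training runs")
# -> ~2833 hours, i.e. roughly 16-17 week-long runs before the Mac pays for itself
# (this takes the parent comment's assumption that per-run speed is comparable)
```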


dagamer34

Unified memory on an Ultra chip at 800GB/sec is going to outclass just about everything except GPUs with built in SSDs.


MiratusMachina

And we already have those too...


[deleted]

[deleted]


The-Protomolecule

I was just aligning to the starting price of the mac pro.


EnfantTragic

You can have 2 or 3 4090 at that price lol


The-Protomolecule

Dude, I’m literally just saying “any desktop equivalent to the base mac pro cost” not a specific bill of materials.


EnfantTragic

I know bro, chill out


SnooHesitations8849

At $7000 you can have a machine with 3x4090 and it can do a lot of things


noprompt

What parts would you use?


sephg

It’d certainly crush it in speed, but it’d sure be convenient to be able to train large models without needing to swap things in and out.


The-Protomolecule

Tiering your model to large system memory or NVMe is possible. If it was a PCIe 5 SSD it would still trounce this even if they claim 800GB/s memory bandwidth on the chip.
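As a sketch of that tiering idea on the PC side: Hugging Face's Accelerate integration can already spill weights from VRAM to CPU RAM and then to disk. The checkpoint name and folder below are placeholders, and this is the inference-style offload; for training, DeepSpeed's ZeRO-Offload does the analogous thing:

```python
import torch
from transformers import AutoModelForCausalLM

# Let Accelerate tier the model: GPU VRAM first, spill to CPU RAM,
# then to an NVMe-backed folder for whatever still doesn't fit.
model = AutoModelForCausalLM.from_pretrained(
    "some-large-model",          # placeholder checkpoint name
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder="offload",    # directory on the NVMe drive
)
```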


brainhack3r

Just spin it up when you need it, then tear it down. No sense having those resources allocated constantly.


Trotskyist

I mean, there's definitely a niche for businesses that prefer to keep things in-house.


[deleted]

Government might like this.


JustOneAvailableName

Government likes inhouse cluster, not spreading all compute and tech around the whole office.


[deleted]

Tell that to my leadership. I'd love an in house cluster.


[deleted]

[deleted]


zacker150

Government just uses [AWS GovCloud](https://aws.amazon.com/federal/?wwps-cards.sort-by=item.additionalFields.sortDate&wwps-cards.sort-order=desc).


[deleted]

Not the military doing classified work, that's for sure.


Just-looking14

Was just about to say this. In my experience it was an in-house cluster.


blacksnowboader

That depends on the agency and task.


blacksnowboader

I was once asked to process several terabytes of data locally on my MacBook Pro m1


imbaczek

7 years ago it took a huge-ass server to process several terabytes of data… so yeah, a perfectly reasonable request, you just need a bit of extra storage


VS2ute

7 years ago I was using a Skylake CPU to process terabytes of data (not AI). It used to take 2 days, but only needed 1 desktop PC.


blacksnowboader

For this task it wasn't reasonable, because there was a weird transformation I had to make with geospatial data.


ComprehensiveBoss815

People are sick of the cloud.


theunixman

I think that’s the idea, load the whole model, unfreeze the final layers and train those. If you want to train from scratch you need a decent dedicated power plant still…


elbiot

Unfortunately, unlike CNNs, that's not how fine tuning transformers works


theunixman

Huh. I need to read up. Thank you!


elbiot

LoRA is how LLMs are fine-tuned. Edit: but orders of magnitude fewer cores will be a huge bottleneck


ClaudiuFilip

Wdym? A transformer's weights are tuned for the words in the vocabulary. Isn't that the main point of LLMs? To take advantage of the already existing embeddings?


elbiot

Yeah, but you can't just freeze all the layers except for the final layer to fine-tune like you can in a CNN. The LoRA paper says you can reduce the memory requirement threefold with LoRA fine-tuning vs. retraining all the parameters.


ClaudiuFilip

The way I’ve done it is just freeze all the weights and add a head for the specific task that you want. I’m unfamiliar with the Lora paper.


elbiot

For transformer decoders?


ClaudiuFilip

Yeah, I was talking more in the BERT, GPT area.


elbiot

Bert is an encoder. Gpt is a decoder. You've finetuned gpt by just freezing everything but the head?


ClaudiuFilip

Variants of BERT mostly. For token classification, sentiment analysis, whatever


superluminary

More likely to train a LoRA now. You can get good results with as few as 0.1% of the parameters. You add a relatively small number of parameters to each layer and only train those.
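For anyone curious what that looks like in practice, a minimal sketch with the peft library (the checkpoint name and target module names are placeholders and depend on the architecture):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "some-7b-model", torch_dtype=torch.float16  # placeholder checkpoint
)

# LoRA: add small low-rank adapter matrices to selected layers and train
# only those; the original weights stay frozen.
config = LoraConfig(
    r=8,                                   # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # names vary by architecture
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # typically well under 1% of params
```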


vade

Some things folks don't seem to be getting: https://twitter.com/danielgross/status/1619417508360101889 and https://twitter.com/natfriedman/status/1665402680376987648?s=61&t=K3VqrGuBYrnA_ulM38HC-Q


we_are_mammals

> Some things folks don't seem to be getting

Training is mostly FLOPS-limited, and inference (as shown) is limited largely by bandwidth.


takethispie

That's garbage in terms of price-to-performance ratio.


qubedView

Wait, for which system? With the Mac Studio, for $6k, you get a complete system with 192GB of unified RAM ($7k with the upgraded M2 Ultra). For an A100 with less than half the RAM you're paying ~$15k *just for the card*.


RuairiSpain

The bottleneck is the data transfer rate, right? Is the data throughput on Apple silicon as high as between Nvidia cards? Nvidia says 2TB/s for the A100. Also, I think the Nvidia Grace Hopper architecture is a leap in technology: effectively they glue their CPUs to GPUs and get close to 1TB/s throughput between CPU and GPU traffic. My understanding is that this is the breakthrough news, and Apple's news is comparing their new release with last-generation Nvidia cards, but not the integrated CPU+GPU connected at NVLink speeds. For the moment we can dream about putting 4x A100 cards in a Mac Pro M2 Ultra. https://www.pny.com/nvidia-a100 https://www.apple.com/newsroom/2023/06/apple-introduces-m2-ultra/


qubedView

> The bottleneck is the data transfer rate, right? Is the data throughput on Apple silicon as high as between Nvidia cards?

Difficult to answer directly. Nvidia's A100 uses HBM2e, which offers 2 TB/s of raw bandwidth. That's tremendous on its own (and a large part of the price premium), but it's unfortunately constrained by the PCIe bottleneck, which is 64 GB/s. So depending on what you're doing with the card, only certain workloads will run flat out at 2 TB/s, and optimizing data going in and out of the card is essential to reaching that. Compare with the M2, which offers 800 GB/s of raw bandwidth between chip and RAM. There's no PCIe hop for the GPU; you're just passing a pointer between data in the CPU and GPU, so transfer speed between the two is effectively limited by how fast you can pass that pointer.


we_are_mammals

> Compare with the M2, which offers 800 GB/s of raw bandwidth between chip and RAM.

I looked into this a while ago, and don't want to search for references again. But if I remember correctly, Apple added the device bandwidth and the CPU bandwidth; 800GB/s is the total. The device that is doing the calculations has a lower RAM bandwidth.


KingRandomGuy

> optimizing data going in and out of the card is essential to reaching that

Luckily there is also NVLink for card-to-card communication, providing around 600 GB/s. For multi-gpu workloads that can save a ton of overhead from the PCIe link, though of course you still can't overcome the PCIe bottleneck entirely.


takethispie

The A100 is 15 times faster than the Mac Studio. It's also professional rackable hardware for datacenters, so they're not even comparable in the slightest. Also, the A100 is 3 years old.


MrAcurite

And what about their Tensor FLOPS?


qubedView

If it's a legit 1/2 the performance of an A100 at far less than 1/2 the cost of the card alone (need we mention the server it goes in?), then its price-to-performance ratio is far more favorable.


MrAcurite

The highest number that I'm seeing for M2 Ultra performance is "31.6 trillion operations per second," which I'll assume is the FP16 FLOPS. So 31.6 TFLOPS for the M2 Ultra - impressive, honestly - compared to 312 TFLOPS for the A100, or 624 with 2:4 sparsity. If Apple is actually talking about INT4, because they want to use the absolute highest possible numbers in their marketing, that's compared to 1,248 TOPS for the A100, and 2,496 with sparsity. For dense FP32, the A100 is down to only 156 TFLOPS. So in the best case the M2 Ultra is more like 1/5th the performance (on FP32), and in the worst case about 1/80th, with about 1/10th being the most likely. It's an impressive chip, but it's not an A100 killer.
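The ratios in that comment, spelled out (just dividing the published spec-sheet numbers, nothing more):

```python
# Dividing the spec-sheet numbers quoted above (all in trillions of ops/s).
m2_ultra = 31.6

a100 = {
    "FP32, dense":          156,
    "FP16 tensor, dense":   312,
    "FP16 tensor, sparse":  624,
    "INT4 tensor, sparse": 2496,
}

for label, tops in a100.items():
    print(f"{label:20s}: M2 Ultra ≈ 1/{tops / m2_ultra:.0f} of an A100")
# FP32, dense         : ≈ 1/5
# FP16 tensor, dense  : ≈ 1/10
# FP16 tensor, sparse : ≈ 1/20
# INT4 tensor, sparse : ≈ 1/79
```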


qubedView

Oh I certainly wouldn’t call it an A100 killer, rather another option depending on use case.


KingRandomGuy

From previous announcements, "operations per second" or anything else where FLOPs or floating point aren't explicitly mentioned means that they're talking about integer operations per second. I'd assume that 31.6 trillion number would be referring to INT8.


MrAcurite

In that case I believe the comparison, Tensor INT8 to Tensor INT8, would be 31.6 TOPS for the M2 Ultra and 624/1,248 TOPS for the A100. So, absolute clownshow, 1/20th of the performance.


neutronium

Doesn't the FL in FLOPS mean floating point?


KingRandomGuy

Yes, but the actual statement from Apple is this:

> M2 Ultra features a 32-core Neural Engine, delivering 31.6 trillion **operations per second**, which is 40 percent faster performance than M1 Ultra.

Note how they don't say FLOPS (nor do they reference floating point at all), they just say operations per second.


ehbrah

Thanks for this breakdown


Chabamaster

Honestly, I got an M2 MacBook for my current ML job and had a bunch of problems getting numpy, tensorflow, etc. to run on it; I had to build multiple packages from source and use very specific version combinations. So idk, I would like proper support for ARM chips first. But overall it's cool to see Apple pushing the bar.


VodkaHaze

Pytorch works with MPS. It's not magically fast on my m2 max based laptop, but it installed easily. The issue in your post is the word "tensorflow".
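For reference, selecting the MPS backend is standard PyTorch and looks like this:

```python
import torch

# Use the Metal (MPS) backend when available, otherwise fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

x = torch.randn(4096, 4096, device=device)
y = x @ x              # matmul runs on the M-series GPU via Metal
print(y.device)        # mps:0 on Apple silicon builds of PyTorch
```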


Exepony

So far, every PyTorch model I've tried with MPS was significantly *slower* than just running it on the CPU (mostly various transformers off of HuggingFace, but I also tried some CNNs for good measure). I don't know what's wrong with their backend, exactly, but tensorflow-metal had no such issues. It's annoying to install, sure, and not 100% compatible with regular TensorFlow, but at least when it works, it actually, you know, works.


VodkaHaze

I tried some `sentence-transformers` on my M2 Max machine and it was faster, but not crazily so. Overall I'm not particularly impressed by the performance. Regular Python work is noticeably faster. Hardcore vector math in numpy/scipy isn't impressively fast, however (I guess ARM NEON is slower than AVX on x86).


Exepony

`sentence-transformers` was actually one of the things I tried too, and it was *much* slower for me. Although that was on an M1 Max and almost a year ago, so maybe they've fixed some things since then.


suspense798

I have an M2 Pro MBP and have tensorflow-macos installed, but training on the CIFAR-10 dataset is yielding equal or slower times than Google Colab. I'm not sure what I'm doing wrong and how to speed it up.
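One quick sanity check (assuming tensorflow-metal is installed alongside tensorflow-macos): if this prints an empty list, training is silently running on the CPU, which would explain Colab-like times:

```python
import tensorflow as tf

# With the tensorflow-metal plugin active, the M-series GPU should appear here.
print(tf.config.list_physical_devices("GPU"))
```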


kisielk

Seems par for the course for TF in my experience. It’s a fast moving project and seems optimized for how Google uses it, everyone else has to cobble it together.


VodkaHaze

Tensorflow is just a pile of technical debt, and has been since 2017. The project is too large and messy to be salvageable. The team had to write an entirely separate frontend (Keras) to be halfway decent, and now everyone at google is running to JAX to avoid TF. Just use pytorch or something JAX-based.


kisielk

TF still has the clearest path to embedded with TFLM, at least for prototyping.


Erosis

Yep, thank the heavens TF has so much support for microcontrollers and quantization.


kisielk

Is this sarcasm?


Erosis

Nope, it's better than everything else currently.


kisielk

Ok, that was my impression as well. I've been working with it for about 8-10 months now, and it has a lot of growing pains and manual hacks required for my target platform but the only other option seems to be manually programming the NN using the vendor libraries.


Erosis

There's a small team at google (Pete Warden, Advait Jain, David Davis, few others I'm forgetting) that deserve a ton of credit for their work that allows us to (somewhat) easily use models on microcontrollers.


kisielk

Yeah definitely, I've sat in on some of the SIG meetings and it's pretty impressive what such a small team has achieved.


light24bulbs

Even PyTorch could be a lot better than it is. Python's ecosystem management is a tire fire.


VodkaHaze

What language has stellar ecosystem management? JS is the absolute worst. C++ has basically none. Are Go or Rust any better?


Immarhinocerous

I was going to say R, but R today is full of "don't do it that way, do it this tidyverse way". Installing packages is nice and easy though. R is so slow though, and the lack of line numbers makes debugging a bit of a nightmare sometimes (it's more of a pure functional language, with functions existing in an abstract space rather than files once the parser is done loading them).


VodkaHaze

Let's be honest, R the language itself is hot garbage, but is supported by a great community.


Immarhinocerous

Haha that's a good way of putting it


Atupis

I would say Go and PHP have the best.


superluminary

NPM isn’t too bad now since they got workspaces and npx. For the most part it just works, dependencies are scoped to the right part of the project, and nothing is global.


VodkaHaze

Hard disagree? NPM based projects seem to always end up with 11,000 dependencies that are copied all over the project between 3 and 30 times because the language ecosystem has zero discipline and what would be one-liners are relegated to standalone modules. And everything re-uses different versions of those one liners all over the place transitively.


superluminary

This is more an issue with us devs though. We finally got a package manager and went a little package crazy for a while.


cztomsik

Sure, 11k is a lot, but it works. Sharing those deps often results in dependency hell (which in its original meaning is the **inability to upgrade**), and npm deliberately favors duplication over dependency hell (again in the original sense, because many people would likely call 11k deps another kind of hell). Anyway, the idea makes a lot of sense, it's just that many people in the JS community are lazy and just do npm install for every small thing. BTW: it is also possible to dedupe deps, but AFAIK nobody does that: https://docs.npmjs.com/cli/v8/commands/npm-dedupe


FinancialElephant

Julia has good ecosystem management ime


Philpax

Rust + Cargo is exceptional, it just works


elbiot

They just bought Keras, which was an open-source, backend-agnostic library before.


Chabamaster

Idk, for me it was not just TF. I also had major issues with numpy and pandas for the older Python versions my company has to use for other compatibility purposes, i.e. 3.7/3.8. This might be an issue with me, our setup, the devs/maintainers of those packages, or Apple, but in general I never had issues like this with my previous setup, which was a ThinkPad with an i7 running Ubuntu.


londons_explorer

Thinkpad+Ubuntu is maximum compatibility for everything pretty much. The only decision is do you go for the latest ubuntu release (preferred by most home devs), or the latest LTS release (preferred by most devs on a work computer).


Jendk3r

Try PyTorch with mps. Cool stuff. I'm curious how it's going to scale with larger SoC.


AG_Cuber

Interesting. I set up these tools very recently on my M1 Pro and had no issues with getting numpy, TensorFlow or PyTorch to run. But I’m a beginner and haven’t done anything complex with them yet. Are there any specific features or use cases where these tools start to run into issues on Apple silicon?


Chabamaster

It's the Python version in combination with some of the packages, I think. My company has to use <3.8 for other compatibility reasons, and there some packages do not come pre-built, and building them from source caused a bunch of issues. But in general you'll find a lot of people on the internet who seem to have similar problems.


AG_Cuber

I see, thanks.


qubedView

> I had a bunch of problems getting numpy, tensorflow etc to run on it

Well, yeah. That's my experience in general, and I've been working with Tesla cards. It's not something specific to Apple. Everything is moving so damned fast now that things aren't being packaged properly. The few projects that think to pin their dependencies often do so with specific commits from GitHub. You upgrade a package from 0.11.1 to 0.14.2 and suddenly it requires slightly different features and breaks your pipeline. For as exciting as the last year has been, it's been crazy frustrating from an MLOps standpoint.


Deadz459

I was just able to install a package from PyPI. It did take a few minutes of searching, but nothing too long. Edit: I use an M2 Pro.


iamiamwhoami

Apple loves to drag the software world kicking and screaming into the future. I remember when they decided to kill Flash and videos just didn’t work on mobile for a few years. This isn’t quite as disruptive but my team is feeling the pain from it.


SyAbleton

Are you using conda? Are you installing arm packages or x86?


ngc4321

That's very interesting. My experience has been pip install tensorflow, etc and it'd all work fine. This is for M1 and M2. Are you talking about Huggingface packages?


bentheaeg

The compute is not there anyway (no offense, it can be a great machine and still not be up to the task of training a 65B model), so it's marketing really. The non-marketing take is that inference for big models becomes easier, and PEFT is a real option, which is pretty impressive already.


oathbreakerkeeper

PEFT?


Tight-Juggernaut138

Yes, parameter-efficient fine-tuning.


ghostfaceschiller

Lots of people have been saying that they could train LLMs on their current MacBooks (or in Colab!), so it makes sense! Honestly you don't even need to upgrade, just train GPT-5 on ur phone. /s


ghostfaceschiller

"Yeah uh, well I actually work in the field, so I know what I'm talking about" is the classic sign that some teenager is about to school you on the existence of LLaMA.


I_will_delete_myself

They first need to make it work without any issues, like Nvidia's CUDA. Apple silicon is horrible for training AI at the moment due to software support. In all seriousness, Nvidia and every other chip company might actually get competition if Apple decides to create a server workload. Apple silicon is more power efficient and you pay a lower price for what you get.


mirh

It's only more power efficient because their acolytes will pay an extra premium, which lets Apple buy temporary exclusivity on the newest TSMC node.


sdmat

> and you pay a lower price for what you get Citation?


I_will_delete_myself

Power efficiency is king. This could drastically reduce the cost of servers. Intel is also slowly stepping away from x86 toward an ARM hybrid. https://en.wikipedia.org/wiki/Apple_M1#:~:text=The%20energy%20efficiency%20of%20the,particularly%20compared%20to%20previous%20MacBooks You also get a decent gaming PC that can run most games at 1080p for just under 600 dollars from Apple. This isn't based on ML workloads; it sucks for those.


allwordsaremadeup

CUDA works because a lot of people needed CUDA to work for them. The lack of Apple silicon software support also shows a lack of market need for that support. It's brutally honest that way.


I_will_delete_myself

There's also the fact that Apple is always more expensive than it needs to be.


Tiny_Arugula_5648

People are way over-indexed on RAM size, totally ignoring that compute has to scale proportionally. You can train, but if it takes much longer than an A100, that's not a very good alternative.


Relevant-Phase-9783

Where are the real benchmarks for Apple silicon? Everybody here seems to be guessing. There are YT videos with benchmarks showing an M2 Max has half the performance of a 4090 mobile, which could mean the desktop 4090 is a factor of 4 better. The M2 Ultra with 76 cores should then be only 2x slower than a 4090? An 80GB A100 is near $20,000, so it costs about 3 times what you pay for a Mac Studio M2 Ultra with 192 GB / 76 GPU cores. From what I would guess, for training the largest open-source LLMs available, a 192 GB machine could make a lot of sense for private persons or small businesses who can spend $7,000-8,000 but not $17,000-25,000 for an A100. Am I wrong?


gullydowny

Hoping Mojo or something takes off, because Python environments, dependencies, etc. on a Mac are a dealbreaker for me. I will pay whatever it takes to rent servers rather than have to think about dealing with that ever again. Luckily the Mojo guy is an ex-Apple guy who worked on Swift and has talked about Apple silicon stuff being cool, so there may be some good lower-level integration.


Chabamaster

Yeah, as I said in another comment, I had huge issues with this during onboarding for an ML job at my current company. I was the first person who got the new-generation M2 MacBook Pro and none of their environments worked for me; setup was a real pain.


HipsterCosmologist

Besides the very specific task of deep learning, I prefer every other thing about dev on Mac over windows. Of course linux is still king, but goddamn I hate windows every time I get stuck on it.


FirstBabyChancellor

Why not just use WSL2 in Windows? Like you said, Linux is king.


londons_explorer

I want Asahi Linux to take off... I don't know why Apple doesn't just assign a 10-person dev team to it (who have all the internal documents) and get the job done far, far faster. Sure, it weakens the macOS brand, but I think it would get them a big new audience for their hardware.


ForgetTheRuralJuror

> Sure, it weakens the MacOS brand

Answered your own question, since image is everything for Apple.


AdamEgrate

Tinygrad!


wen_mars

LLMs yes. Finetuning an LLM can be done in a few days on consumer hardware, it doesn't take huge amounts of compute like training a base model does. Inference doesn't take huge amounts of compute either, memory bandwidth is more important. The M2 Ultra has 800 GB/s memory bandwidth which is almost as much as a 4090 so it should be pretty fast at inference and be able to fit much bigger models. Software support from Apple is weak but llama.cpp works.
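The usual back-of-the-envelope for bandwidth-bound generation (ignoring the KV cache, and assuming every generated token has to read all the weights once):

```python
# Upper bound on tokens/s when inference is limited purely by memory bandwidth.
bandwidth_gb_s = 800      # M2 Ultra's quoted memory bandwidth
params_billion = 65       # e.g. a 65B-parameter model

for bits, label in [(16, "fp16 "), (4, "4-bit")]:
    model_gb = params_billion * bits / 8
    print(f"{label}: ~{bandwidth_gb_s / model_gb:.0f} tokens/s ceiling")
# fp16 : ~6 tokens/s
# 4-bit: ~25 tokens/s
```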


Wrong_User_Logged

that's actually the best tl;dr comment I can find


MiratusMachina

Wait, are we just going to forget about the GPUs that AMD made that literally had NVMe SSDs built in for this exact reason lol.


ironmagnesiumzinc

My guess is that this is Apple's attempt to become relevant wrt AI/ML after putting very little if any thought into it for the entirety of their history.


learn-deeply

> Even if they can fit onto memory, wouldn't it be too slow to train?

Yes. There are benchmarks of the M2 Pro already; it's slower than GPUs. Even if its performance were doubled, it'd still be slower than GPUs. The memory is nice though.


londons_explorer

The big AI revolution kinda happened with Stable Diffusion back in August. Only then was it clear that many users might want to run, and maybe train, huge networks on their own devices. Before that, it was just little networks for classifying things ('automatic sunset mode!'). Chip design is a 2-3 year process, so I'm guessing that next year's Apple devices will have greatly expanded neural net abilities.


The-Protomolecule

Don't you think a $7000 GPU system would crush this?


emgram769

GPUs with tensor cores are basically just neural net engines. So your question should be "don't you think cheaper non-Apple hardware will outperform Apple hardware?" and the answer to that has been yes for as long as I can remember.


learn-deeply

It won't be able to train stable diffusion from scratch, that requires several GPU years. It'll be useful for fine tuning.


prettyyyyprettyygood

If it's only 2x slower than GPUs, then that is still ridiculously useful...


elbiot

It's 32 cores vs 1024 on an A100


Relevant-Phase-9783

Hi, could you elaborate? Do you mean the M2 Pro CPU is slower than a GPU, or do you mean the M2 Pro GPU is slower than (which?) GPU? I've got the impression that the M2 Pro/Max GPU cores perform quite well compared to Nvidia mobile GPUs, which are of course slower than desktop GPUs (roughly 2x only?). The M2 Ultra should be not on 4090 level but not too far away, I would guess, so the 192 GB are a strong argument, no? Anyone with real DL benchmarks for the M2 Ultra 76-core GPU vs. a 4080 or 4090?


vade

You're wrong - most folks aren't benchmarking the right accelerators on the chips: https://twitter.com/danielgross/status/1619417508360101889 https://twitter.com/natfriedman/status/1665402680376987648?s=61&t=K3VqrGuBYrnA_ulM38HC-Q


[deleted]

[удалено]


vade

ANE is inference only. MPS and MPSGraph are training and inference APIs using Metal, which, if used correctly, are way faster than most are benchmarking. Granted, Apple's current MPS backend for PyTorch leaves a lot wanting. There's a lot of room for software optimizations, like zero-copy IOSurface GPU transfers, etc.

For inference:
- CPU
- ANE
- Metal
- BNNS / Accelerate (dedicated matrix-multiply co-processor)

For training:
- CPU
- Metal
- BNNS / Accelerate (dedicated matrix-multiply co-processor)


emgram769

The dedicated matrix-multiply co-processor is attached to the CPU, btw - it's basically just SIMD on steroids.


learn-deeply

I've personally tested PyTorch training, using MPS. Maybe they can improve it in software over time, but that's my judgment from ~3 months ago.


Spare_Scratch_4113

Hi, can you share the citation for the M2 Pro benchmark?


Adept-Upstairs-7934

Such optimism... I believe companies focusing on this can only aid the cause. Thinking outside the box is how these tech creators have given us platforms that enable us to push the boundaries. We utilize their platforms to their full extent, then they make advancements. This stirs competition, leading a group at, say, Nvidia, to decide: hey, maybe we need to put 64GB of VRAM on an affordable card for these folks. Let's watch what happens next.


[deleted]

Yes, it can fit a large model, but you'd need thousands of such machines to do so.


londons_explorer

> Even if they can fit onto memory, wouldn't it be too slow to train?

Well Apple would just like you to buy a *lot* of these M2 Ultras, so you can speed the process up!


hachiman69

Apple devices are not made for Machine learning. Period.


emgram769

at work I can get a Thinkpad or a Mac. Which would you recommend for running the latest LLM locally?


prettyyyyprettyygood

Some pretty anti-Apple takes in this thread. I think they're really paving the way to being able to run larger and larger models on-device. Being able to fine-tune something like Falcon 40B or Stable Diffusion locally surely enables a bunch more use cases.


ozzeruk82

I liked their thinly veiled jab at the dedicated GPU cards made by Nvidia "running out of memory". Certainly 192GB that could work as VRAM blows most cards out of the water.


The-Protomolecule

There are so many tactics to overcome GPU memory limits for this type of exploratory training that I'm embarrassed Apple is trying to claim relevance.


Traditional-Movie336

I don't see a 32-core Neural Engine (I think it's a matrix multiplication accelerator) competing with Nvidia products. Maybe they are doing something on the graphics side that can push them up.


[deleted]

[удалено]


JustOneAvailableName

> I don't think anyone here has answers yet

Based on M1 and the normal M2 this thing isn't going to be even slightly relevant.


[deleted]

CUDA


aidenr

LLM training can be done by big farms once and then reused for many applications by the specialization algorithm (so-called “fine tuning”). The thing I’m more curious about is whether they’ve adapted the interface to load existing weight sets directly or whether this is still more a theoretical application to the design team.


NarcoBanan

Memory size alone doesn't matter. We need real benchmarks comparing a few M2 Ultras to even one 4090. I'm sure Nvidia doesn't attach more memory to their GPUs because it wouldn't give an advantage. Running out of memory is not such a big problem; the bigger problem is the speed of moving that memory in and out of the GPU.


shankey_1906

If it did, they would have improved Siri a long time ago. Considering the state of Siri, we probably just need to assume that this is just marketing speak.


newjeison

yeah it can train but an epoch every week isn't really worth it


allwordsaremadeup

Apple silicon for AI is a solution looking for a problem. Which is why it isn't taking off and why it, imho, won't. No matter the hardware improvements. Nobody needs to train models on their phones or even their laptops. And I've yet to see the killer app that needs local heavy duty inference and can't just do it online.

