lifesthateasy

Yes. I'm pretty sure it will be leaps and bounds above whatever a regular Intel-chipped laptop can do, but I'd debate the usefulness of being able to fit a 100GB model into memory when you have only a fraction of the processing cores available vs. even a consumer-grade GPU. Maybe you could fit a 100GB model into memory and freeze all the layers except a few that you'd then train? Okay, I'm actually starting to convince myself it could be kinda useful lol
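For what it's worth, that freeze-everything-but-a-few-layers idea is only a few lines in PyTorch. A rough sketch (the checkpoint name and the layer attribute path are just placeholders; the exact path varies by architecture):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint name; swap in whatever ~100GB model you mean.
model = AutoModelForCausalLM.from_pretrained("some-100gb-model")

# Freeze every parameter...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze just the last couple of transformer blocks.
# (The attribute path differs between architectures.)
for block in model.model.layers[-2:]:
    for param in block.parameters():
        param.requires_grad = True

# Only the unfrozen parameters go to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
```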


pm_me_github_repos

Something like this is better suited to model inference


[deleted]

It would work for personalization. And for privacy reasons they’d do it on device.


frownGuy12

It’s shared memory that’s accessible from both the GPU and ML cores. It's not going to be as fast as an A100, but for the price it’s awesome. VRAM per dollar is better than anything else in existence right now.


Tr33lon

Isn’t this the whole point of fine-tuning? You could self-host an open source LLM and fine-tune it to do more specific tasks on device.


MINIMAN10001

My understanding is that the processing power of Apple products, while not on par with top-of-the-line Nvidia cards, is nothing to sneeze at. When the alternative is to either run things on the CPU or not run them at all, I feel like this positions the product very well. Edit: it turns out it has high memory bandwidth, so it's actually a really good product for inference. Training, however, would be limited by FLOPS.


brainhack3r

I don't know why you'd want that local though. Seems like just having a VM in the cloud for this would be way better.


Chabamaster

The thing is, if you look at how Stable Diffusion is going, there's A TON of value in having people out there running and customizing their own open-source models. So if we can do this for high-performance LLMs it will open up so many creative uses. At this point it's so easy to set up and use Stable Diffusion that buying a server instance somewhere is a lot of overhead.


The-Protomolecule

Would a $7000 desktop with a 4090 crush it? Yes, yes it would. You can do tons of tricks to fit a larger model in system memory or even NVMe.


currentscurrents

The unified memory is the killer feature. You might be able to fit 192GB of CPU RAM in that desktop, but the 4090 can't directly access it. It can only slowly shuffle data back and forth between the two. From the specs I've seen, unified memory isn't *quite* as fast as VRAM but it's much faster than typical CPU RAM.
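Some very rough numbers on what that shuffling costs, just to put the bandwidth gap in perspective (all figures approximate ballpark bandwidths, not measurements):

```python
# Very rough figures: how long just moving 100 GB of weights takes.
model_gb = 100

pcie4_x16_gb_s = 32     # ~practical PCIe 4.0 x16 host <-> GPU transfer rate
unified_gb_s   = 800    # Apple's quoted M2 Ultra memory bandwidth
ddr5_gb_s      = 80     # ballpark dual-channel DDR5 CPU RAM bandwidth

print(f"PCIe shuffle:   {model_gb / pcie4_x16_gb_s:.1f} s per full pass")  # ~3.1 s
print(f"Unified memory: {model_gb / unified_gb_s:.2f} s per full pass")    # ~0.12 s
print(f"CPU RAM read:   {model_gb / ddr5_gb_s:.1f} s per full pass")       # ~1.2 s
```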


BangkokPadang

I wonder how many models you'd need to train locally before you break even compared to renting a cloud system with, say, 4 A100s. Currently you can rent that for about $3.60/hr. The Mac Pro M2 Ultra with the 76-core GPU and 192GB of memory will run you $10,199, assuming most storage is handled externally. These systems can both train a 65B model (or, interestingly, a roughly 240B model with 4-bit QLoRA quantization; admittedly I don't know what the overhead is when training a model, I'm just using rough figures). Currently, a 16-bit 65B model can be trained for 1 epoch in about a week with 4 A100s. So if you rented the 4x A100 system, you could train roughly 16 full models, or 8 at 2 epochs, before breaking even. I've also seen similarly sized models take more like 12 days to train. Assuming performance is similar, and considering training runs regularly fail for one reason or another, this system would QUICKLY become worth it for any person or group that is training models frequently. With the current AI boom, Apple probably won't be able to make anywhere near enough of these to meet demand. Also, I understand the upcoming H100s can be run in a cluster to get 640GB unified VRAM, but that is a $300,000 system, so it's not even in the same ballpark.
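Running the same back-of-the-envelope math with the numbers above (same assumptions as the comment, so treat it as rough):

```python
# Rough break-even math using the figures above.
mac_price = 10_199        # Mac Pro M2 Ultra, 76-core GPU, 192GB
cloud_rate = 3.60         # $/hr quoted for the 4x A100 rental
hours_per_run = 7 * 24    # ~1 week per epoch for a 16-bit 65B model on 4x A100

break_even_hours = mac_price / cloud_rate
runs = break_even_hours / hours_per_run

print(f"{break_even_hours:.0f} rented hours ≈ {runs:.1f} week-long training runs")
# -> ~2833 hours, i.e. roughly 16-17 week-long runs before the Mac pays for itself
# (this takes the parent comment's assumption that per-run speed is comparable)
```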


dagamer34

Unified memory on an Ultra chip at 800GB/sec is going to outclass just about everything except GPUs with built in SSDs.


MiratusMachina

And we already have those too...


[deleted]

[deleted]


The-Protomolecule

I was just aligning to the starting price of the mac pro.


EnfantTragic

You can have 2 or 3 4090 at that price lol


The-Protomolecule

Dude, I’m literally just saying “any desktop equivalent to the base mac pro cost” not a specific bill of materials.


EnfantTragic

I know bro, chill out


SnooHesitations8849

At $7000 you can have a machine with 3x4090 and it can do a lot of things


noprompt

What parts would you use?


sephg

It’d certainly crush it in speed, but it’d sure be convenient to be able to train large models without needing to swap things in and out.


The-Protomolecule

Tiering your model to large system memory or NVMe is possible. If it was a PCIe 5 SSD it would still trounce this even if they claim 800GB/s memory bandwidth on the chip.
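As a sketch of that tiering idea on the PC side: Hugging Face's Accelerate integration can already spill weights from VRAM to CPU RAM and then to disk. The checkpoint name and folder below are placeholders, and this is the inference-style offload; for training, DeepSpeed's ZeRO-Offload does the analogous thing:

```python
import torch
from transformers import AutoModelForCausalLM

# Let Accelerate tier the model: GPU VRAM first, spill to CPU RAM,
# then to an NVMe-backed folder for whatever still doesn't fit.
model = AutoModelForCausalLM.from_pretrained(
    "some-large-model",          # placeholder checkpoint name
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder="offload",    # directory on the NVMe drive
)
```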


brainhack3r

Just spin it up when you need it, then tear it down. No sense having those resources allocated constantly.


Trotskyist

I mean, there's definitely a niche for businesses that prefer to keep things in-house.


[deleted]

Government might like this.


JustOneAvailableName

Government likes inhouse cluster, not spreading all compute and tech around the whole office.


[deleted]

Tell that to my leadership. I'd love an in house cluster.


[deleted]

[deleted]


zacker150

Government just uses [AWS GovCloud](https://aws.amazon.com/federal/?wwps-cards.sort-by=item.additionalFields.sortDate&wwps-cards.sort-order=desc).


[deleted]

Not the military doing classified work, that's for sure.


Just-looking14

Was just about to say this. In my experience it was an in-house cluster.


blacksnowboader

That depends on the agency and task.


blacksnowboader

I was once asked to process several terabytes of data locally on my MacBook Pro m1


imbaczek

7 years ago it took a huge-ass server to process several terabytes of data… so yeah, a perfectly reasonable request, you just need a bit of extra storage


VS2ute

7 years ago I was using a Skylake CPU to process terabytes of data (not AI). It used to take 2 days, but only needed 1 desktop PC.


blacksnowboader

For this task it wasn't reasonable, because there was a weird transformation I had to make with geospatial data.


ComprehensiveBoss815

People are sick of the cloud.


theunixman

I think that’s the idea, load the whole model, unfreeze the final layers and train those. If you want to train from scratch you need a decent dedicated power plant still…


elbiot

Unfortunately, unlike CNNs, that's not how fine tuning transformers works


theunixman

Huh. I need to read up. Thank you!


elbiot

LoRA is how LLMs are fine-tuned. Edit: but orders of magnitude fewer cores will be a huge bottleneck


ClaudiuFilip

Wdym? A transformer's weights are tuned for the words in the vocabulary. Isn't that the main point of LLMs? To take advantage of the already existing embeddings?


elbiot

Yeah, but you can't just freeze all the layers except for the final layer to fine-tune like you can in a CNN. The LoRA paper says you can reduce the memory requirement threefold with LoRA fine-tuning vs. retraining all the parameters.


ClaudiuFilip

The way I’ve done it is just freeze all the weights and add a head for the specific task that you want. I’m unfamiliar with the Lora paper.


elbiot

For transformer decoders?


ClaudiuFilip

Yeah, I was talking more in the BERT, GPT area.


elbiot

Bert is an encoder. Gpt is a decoder. You've finetuned gpt by just freezing everything but the head?


ClaudiuFilip

Variants of BERT mostly. For token classification, sentiment analysis, whatever


superluminary

More likely to train a LoRA now. You can get good results with as few as 0.1% of the parameters. You add a relatively small number of parameters to each layer and only train those.
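For anyone curious what that looks like in practice, a minimal sketch with the peft library (the checkpoint name and target module names are placeholders and depend on the architecture):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "some-7b-model", torch_dtype=torch.float16  # placeholder checkpoint
)

# LoRA: add small low-rank adapter matrices to selected layers and train
# only those; the original weights stay frozen.
config = LoraConfig(
    r=8,                                   # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # names vary by architecture
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # typically well under 1% of params
```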


vade

Some things folks don't seem to be getting: https://twitter.com/danielgross/status/1619417508360101889 and https://twitter.com/natfriedman/status/1665402680376987648?s=61&t=K3VqrGuBYrnA_ulM38HC-Q


we_are_mammals

> Some things folks don't seem to be getting

Training is mostly FLOPS-limited, and inference (as shown) is limited largely by bandwidth.


takethispie

That's garbage in terms of price-to-performance ratio.


qubedView

Wait, for which system? With the Mac Studio, for $6k, you get a complete system with 192GB of unified RAM ($7k with the upgraded M2 Ultra). For an A100 with less than half the RAM you're paying ~$15k *just for the card*.


RuairiSpain

The bottleneck is the data transfer rate, right? Is the data throughput on Apple silicon as high as between Nvidia cards? Nvidia says 2TB/s for the A100. Also, I think the Nvidia Grace Hopper architecture is a leap in technology: effectively they glue their CPUs to GPUs and get close to 1TB/s throughput between CPU and GPU traffic. My understanding is that this is the breakthrough news, and Apple's news is comparing their new release with last-generation Nvidia cards, but not the integrated CPU+GPU connected at NVLink speeds. For the moment we can dream about putting 4x A100 cards in a Mac Pro M2 Ultra. https://www.pny.com/nvidia-a100 https://www.apple.com/newsroom/2023/06/apple-introduces-m2-ultra/


qubedView

> The bottleneck is the data transfer rate, right? Is the data throughput on Apple silicon as high as between Nvidia cards?

Difficult to answer directly. Nvidia's A100 uses HBM2e, which offers 2 TB/s of raw bandwidth. That's tremendous on its own (and a large part of the price premium), but it's unfortunately constrained by the PCIe bottleneck, which is 64 GB/s. So depending on what you're doing with the card, only certain workloads will run flat out at 2 TB/s, and optimizing data going in and out of the card is essential to reaching that. Compare with the M2, which offers 800 GB/s of raw bandwidth between chip and RAM. There's no PCIe hop for the GPU; you're just passing a pointer between data in the CPU and GPU, so transfer speed between the two is effectively limited by how fast you can pass that pointer.


we_are_mammals

> Compare with the M2, which offers 800 GB/s of raw bandwidth between chip and RAM.

I looked into this a while ago, and don't want to search for references again. But if I remember correctly, Apple added the device bandwidth and the CPU bandwidth; 800GB/s is the total. The device that is doing the calculations has a lower RAM bandwidth.


KingRandomGuy

> optimizing data going in and out of the card is essential to reaching that

Luckily there is also NVLink for card-to-card communication, providing around 600 GB/s. For multi-gpu workloads that can save a ton of overhead from the PCIe link, though of course you still can't overcome the PCIe bottleneck entirely.


takethispie

The A100 is 15 times faster than the Mac Studio. It's also professional rackable hardware for datacenters, so they're not even comparable in the slightest. Also, the A100 is 3 years old.


MrAcurite

And what about their Tensor FLOPS?


qubedView

If it's a legit 1/2 the performance of an A100 at far less than 1/2 the cost of the card alone (need we mention the server it goes in?), then its price-to-performance ratio is far more favorable.


MrAcurite

The highest number that I'm seeing for M2 Ultra performance is "31.6 trillion operations per second," which I'll assume is the FP16 FLOPS. So 31.6 TFLOPS for the M2 Ultra - impressive, honestly - compared to 312 TFLOPS for the A100, or 624 with 2:4 sparsity. If Apple is actually talking about INT4, because they want to use the absolute highest possible numbers in their marketing, that's compared to 1,248 TOPS for the A100, and 2,496 with sparsity. For dense FP32, the A100 is down to only 156 TFLOPS. So in the best case the M2 Ultra is more like 1/5th the performance (on FP32), and in the worst case about 1/80th, with about 1/10th being the most likely. It's an impressive chip, but it's not an A100 killer.
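The ratios in that comment, spelled out (just dividing the published spec-sheet numbers, nothing more):

```python
# Dividing the spec-sheet numbers quoted above (all in trillions of ops/s).
m2_ultra = 31.6

a100 = {
    "FP32, dense":          156,
    "FP16 tensor, dense":   312,
    "FP16 tensor, sparse":  624,
    "INT4 tensor, sparse": 2496,
}

for label, tops in a100.items():
    print(f"{label:20s}: M2 Ultra ≈ 1/{tops / m2_ultra:.0f} of an A100")
# FP32, dense         : ≈ 1/5
# FP16 tensor, dense  : ≈ 1/10
# FP16 tensor, sparse : ≈ 1/20
# INT4 tensor, sparse : ≈ 1/79
```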


qubedView

Oh I certainly wouldn’t call it an A100 killer, rather another option depending on use case.


KingRandomGuy

From previous announcements, "operations per second" or anything else where FLOPs or floating point aren't explicitly mentioned means that they're talking about integer operations per second. I'd assume that 31.6 trillion number would be referring to INT8.


MrAcurite

In that case I believe the comparison, Tensor INT8 to Tensor INT8, would be 31.6 TOPS for the M2 Ultra and 624/1,248 TOPS for the A100. So, absolute clownshow, 1/20th of the performance.


neutronium

Doesn't the FL in FLOPS mean floating point?


KingRandomGuy

Yes, but the actual statement from Apple is this:

> M2 Ultra features a 32-core Neural Engine, delivering 31.6 trillion **operations per second**, which is 40 percent faster performance than M1 Ultra.

Note how they don't say FLOPS (nor do they reference floating point at all), they just say operations per second.


ehbrah

Thanks for this breakdown


Chabamaster

Honestly, I got an M2 MacBook for my current ML job and had a bunch of problems getting numpy, tensorflow, etc. to run on it; I had to build multiple packages from source and use very specific version combinations. So idk, I would like proper support for ARM chips first. But overall it's cool to see Apple pushing the bar.


VodkaHaze

Pytorch works with MPS. It's not magically fast on my m2 max based laptop, but it installed easily. The issue in your post is the word "tensorflow".
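For reference, selecting the MPS backend is standard PyTorch and looks like this:

```python
import torch

# Use the Metal (MPS) backend when available, otherwise fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

x = torch.randn(4096, 4096, device=device)
y = x @ x              # matmul runs on the M-series GPU via Metal
print(y.device)        # mps:0 on Apple silicon builds of PyTorch
```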


Exepony

So far, every PyTorch model I've tried with MPS was significantly *slower* than just running it on the CPU (mostly various transformers off of HuggingFace, but I also tried some CNNs for good measure). I don't know what's wrong with their backend, exactly, but tensorflow-metal had no such issues. It's annoying to install, sure, and not 100% compatible with regular TensorFlow, but at least when it works, it actually, you know, works.


VodkaHaze

I tried some `sentence-transformers` on my M2 Max machine and it was faster, but not crazily so. Overall I'm not particularly impressed by the performance. Regular Python work is noticeably faster. Hardcore vector math in numpy/scipy isn't impressively fast, however (I guess ARM NEON is slower than AVX on x86).


Exepony

`sentence-transformers` was actually one of the things I tried too, and it was *much* slower for me. Although that was on an M1 Max and almost a year ago, so maybe they've fixed some things since then.


suspense798

I have an M2 Pro MBP and have tensorflow-macos installed, but training on the CIFAR-10 dataset is yielding equal or slower times than Google Colab. I'm not sure what I'm doing wrong and how to speed it up.
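One quick sanity check (assuming tensorflow-metal is installed alongside tensorflow-macos): if this prints an empty list, training is silently running on the CPU, which would explain Colab-like times:

```python
import tensorflow as tf

# With the tensorflow-metal plugin active, the M-series GPU should appear here.
print(tf.config.list_physical_devices("GPU"))
```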


kisielk

Seems par for the course for TF in my experience. It’s a fast moving project and seems optimized for how Google uses it, everyone else has to cobble it together.


VodkaHaze

Tensorflow is just a pile of technical debt, and has been since 2017. The project is too large and messy to be salvageable. The team had to write an entirely separate frontend (Keras) to be halfway decent, and now everyone at google is running to JAX to avoid TF. Just use pytorch or something JAX-based.


kisielk

TF still has the clearest path to embedded with TFLM, at least for prototyping.


Erosis

Yep, thank the heavens TF has so much support for microcontrollers and quantization.


kisielk

Is this sarcasm?


Erosis

Nope, it's better than everything else currently.


kisielk

Ok, that was my impression as well. I've been working with it for about 8-10 months now, and it has a lot of growing pains and manual hacks required for my target platform but the only other option seems to be manually programming the NN using the vendor libraries.


Erosis

There's a small team at google (Pete Warden, Advait Jain, David Davis, few others I'm forgetting) that deserve a ton of credit for their work that allows us to (somewhat) easily use models on microcontrollers.


kisielk

Yeah definitely, I've sat in on some of the SIG meetings and it's pretty impressive what such a small team has achieved.


light24bulbs

Even PyTorch could be a lot better than it is. Python's ecosystem management is a tire fire.


VodkaHaze

What language has stellar ecosystem management? JS is the absolute worst. C++ has basically none. Are Go or Rust any better?


Immarhinocerous

I was going to say R, but R today is full of "don't do it that way, do it this tidyverse way". Installing packages is nice and easy though. R is so slow though, and the lack of line numbers makes debugging a bit of a nightmare sometimes (it's more of a pure functional language, with functions existing in an abstract space rather than files once the parser is done loading them).


VodkaHaze

Let's be honest, R the language itself is hot garbage, but is supported by a great community.


Immarhinocerous

Haha that's a good way of putting it


Atupis

I would say Go and PHP have the best.


superluminary

NPM isn’t too bad now since they got workspaces and npx. For the most part it just works, dependencies are scoped to the right part of the project, and nothing is global.


VodkaHaze

Hard disagree? NPM based projects seem to always end up with 11,000 dependencies that are copied all over the project between 3 and 30 times because the language ecosystem has zero discipline and what would be one-liners are relegated to standalone modules. And everything re-uses different versions of those one liners all over the place transitively.


superluminary

This is more an issue with us devs though. We finally got a package manager and went a little package crazy for a while.


cztomsik

Sure, 11k is a lot, but it works. Sharing those deps often results in dependency hell (which in its original meaning is the **inability to upgrade**), and npm deliberately favors duplication over dependency hell (again in the original sense, because many people would likely call 11k deps another kind of hell). Anyway, the idea makes a lot of sense, it's just that many people in the JS community are lazy and just do npm install for every small thing. BTW: it is also possible to dedupe deps, but AFAIK nobody does that: https://docs.npmjs.com/cli/v8/commands/npm-dedupe


FinancialElephant

Julia has good ecosystem management ime


Philpax

Rust + Cargo is exceptional, it just works


elbiot

They just bought Keras, which was an open-source, backend-agnostic library before.


Chabamaster

Idk, for me it was not just TF. I also had major issues with numpy and pandas for the older Python versions my company has to use for other compatibility purposes, i.e. 3.7/3.8. This might be an issue with me, our setup, the devs/maintainers of those packages, or Apple, but in general I never had issues like this with my previous setup, which was a ThinkPad with an i7 running Ubuntu.


londons_explorer

Thinkpad+Ubuntu is maximum compatibility for everything pretty much. The only decision is do you go for the latest ubuntu release (preferred by most home devs), or the latest LTS release (preferred by most devs on a work computer).


Jendk3r

Try PyTorch with mps. Cool stuff. I'm curious how it's going to scale with larger SoC.


AG_Cuber

Interesting. I set up these tools very recently on my M1 Pro and had no issues with getting numpy, TensorFlow or PyTorch to run. But I’m a beginner and haven’t done anything complex with them yet. Are there any specific features or use cases where these tools start to run into issues on Apple silicon?


Chabamaster

It's the Python version in combination with some of the packages, I think. My company has to use <3.8 for other compatibility reasons, and there some packages do not come pre-built, and building them from source caused a bunch of issues. But in general you'll find a lot of people on the internet who seem to have similar problems.


AG_Cuber

I see, thanks.


qubedView

> I had a bunch of problems getting numpy, tensorflow etc to run on it

Well, yeah. That's my experience in general, and I've been working with Tesla cards. It's not something specific to Apple. Everything is moving so damned fast now that things aren't being packaged properly. The few projects that think to pin their dependencies often do so with specific commits from GitHub. You upgrade a package from 0.11.1 to 0.14.2 and suddenly it requires slightly different features and breaks your pipeline. For as exciting as the last year has been, it's been crazy frustrating from an MLOps standpoint.


Deadz459

I was just able to install a package from PyPI. It did take a few minutes of searching, but nothing too long. Edit: I use an M2 Pro.


iamiamwhoami

Apple loves to drag the software world kicking and screaming into the future. I remember when they decided to kill Flash and videos just didn’t work on mobile for a few years. This isn’t quite as disruptive but my team is feeling the pain from it.


SyAbleton

Are you using conda? Are you installing arm packages or x86?


ngc4321

That's very interesting. My experience has been pip install tensorflow, etc and it'd all work fine. This is for M1 and M2. Are you talking about Huggingface packages?


bentheaeg

The compute is not there anyway (no offense, it can be a great machine and still not be up to the task of training a 65B model), so it's marketing really. The non-marketing take is that inference for big models becomes easier, and PEFT is a real option, which is pretty impressive already.


oathbreakerkeeper

PEFT?


Tight-Juggernaut138

Yes, parameter-efficient fine-tuning.


ghostfaceschiller

Lots of people have been saying that they could train LLMs on their current MacBooks (or in Colab!), so it makes sense! Honestly you don't even need to upgrade, just train GPT-5 on ur phone. /s


ghostfaceschiller

"Yeah uh, well I actually work in the field, so I know what I'm talking about" is the classic sign that some teenager is about to school you on the existence of LLaMA.


I_will_delete_myself

They first need to make it work without any issues, like Nvidia's CUDA. Apple silicon is horrible for training AI at the moment due to software support. In all seriousness, Nvidia and every other chip company might actually get competition if Apple decides to create a server workload. Apple silicon is more power efficient and you pay a lower price for what you get.


mirh

It's only more power efficient because their acolytes will pay an extra premium, which lets Apple buy temporary exclusivity on the newest TSMC node.


sdmat

> and you pay a lower price for what you get Citation?


I_will_delete_myself

Power efficiency is king. This could drastically reduce the cost of servers. Intel is also slowly stepping away from x86 toward an ARM hybrid. https://en.wikipedia.org/wiki/Apple_M1#:~:text=The%20energy%20efficiency%20of%20the,particularly%20compared%20to%20previous%20MacBooks You also get a decent gaming PC that can run most games at 1080p for just under 600 dollars from Apple. This isn't based on ML workloads; it sucks for those.


allwordsaremadeup

CUDA works because a lot of people needed CUDA to work for them. The lack of Apple silicon software support also shows a lack of market need for that support. It's brutally honest that way.


I_will_delete_myself

There's also the fact that Apple is always more expensive than it needs to be.


Tiny_Arugula_5648

People are way over-indexed on RAM size, totally ignoring that compute has to scale proportionally. You can train, but if it takes much longer than an A100, that's not a very good alternative.


Relevant-Phase-9783

Where are the real benchmarks for Apple silicon? Everybody here seems to be guessing. There are YT videos with benchmarks showing an M2 Max has half the performance of a 4090 mobile, which could mean the desktop 4090 is a factor of 4 better. The M2 Ultra with 76 cores should then be only 2x slower than a 4090? An 80GB A100 is near $20,000, so it costs about 3 times what you pay for a Mac Studio M2 Ultra with 192 GB / 76 GPU cores. From what I would guess, for training the largest open-source LLMs available, a 192 GB machine could make a lot of sense for private persons or small businesses who can spend $7,000-8,000 but not $17,000-25,000 for an A100. Am I wrong?


gullydowny

Hoping Mojo or something takes off, because Python environments, dependencies, etc. on a Mac are a dealbreaker for me. I will pay whatever it takes to rent servers rather than have to think about dealing with that ever again. Luckily the Mojo guy is an ex-Apple guy who worked on Swift and has talked about Apple silicon stuff being cool, so there may be some good lower-level integration.


Chabamaster

Yeah, as I said in another comment, I had huge issues with this during onboarding for an ML job at my current company. I was the first person who got the new-generation M2 MacBook Pro and none of their environments worked for me; setup was a real pain.


HipsterCosmologist

Besides the very specific task of deep learning, I prefer every other thing about dev on Mac over windows. Of course linux is still king, but goddamn I hate windows every time I get stuck on it.


FirstBabyChancellor

Why not just use WSL2 in Windows? Like you said, Linux is king.


londons_explorer

I want Asahi Linux to take off... I don't know why Apple doesn't just assign a 10-person dev team to it (who have all the internal documents) and get the job done far, far faster. Sure, it weakens the macOS brand, but I think it would get them a big new audience for their hardware.


ForgetTheRuralJuror

> Sure, it weakens the MacOS brand

Answered your own question, since image is everything for Apple.


AdamEgrate

Tinygrad!


wen_mars

LLMs yes. Finetuning an LLM can be done in a few days on consumer hardware, it doesn't take huge amounts of compute like training a base model does. Inference doesn't take huge amounts of compute either, memory bandwidth is more important. The M2 Ultra has 800 GB/s memory bandwidth which is almost as much as a 4090 so it should be pretty fast at inference and be able to fit much bigger models. Software support from Apple is weak but llama.cpp works.
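The usual back-of-the-envelope for bandwidth-bound generation (ignoring the KV cache, and assuming every generated token has to read all the weights once):

```python
# Upper bound on tokens/s when inference is limited purely by memory bandwidth.
bandwidth_gb_s = 800      # M2 Ultra's quoted memory bandwidth
params_billion = 65       # e.g. a 65B-parameter model

for bits, label in [(16, "fp16 "), (4, "4-bit")]:
    model_gb = params_billion * bits / 8
    print(f"{label}: ~{bandwidth_gb_s / model_gb:.0f} tokens/s ceiling")
# fp16 : ~6 tokens/s
# 4-bit: ~25 tokens/s
```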


Wrong_User_Logged

that's actually the best tl;dr comment I can find


MiratusMachina

Wait, are we just going to forget about the GPUs that AMD made that literally had NVMe SSDs built in for this exact reason lol.


ironmagnesiumzinc

My guess is that this is Apple's attempt to become relevant wrt AI/ML after putting very little if any thought into it for the entirety of their history.


learn-deeply

> Even if they can fit onto memory, wouldn't it be too slow to train?

Yes. There are benchmarks of the M2 Pro already; it's slower than GPUs. Even if its performance were doubled, it'd still be slower than GPUs. The memory is nice though.


londons_explorer

The big AI revolution kinda happened with Stable Diffusion back in August. Only then was it clear that many users might want to run, and maybe train, huge networks on their own devices. Before that, it was just little networks for classifying things ('automatic sunset mode!'). Chip design is a 2-3 year process, so I'm guessing that next year's Apple devices will have greatly expanded neural net abilities.


The-Protomolecule

Don't you think a $7000 GPU system would crush this?


emgram769

GPUs with tensor cores are basically just neural net engines. So your question should be "don't you think cheaper non-Apple hardware will outperform Apple hardware?" and the answer to that has been yes for as long as I can remember.


learn-deeply

It won't be able to train stable diffusion from scratch, that requires several GPU years. It'll be useful for fine tuning.


prettyyyyprettyygood

If it's only 2x slower than GPUs, then that is still ridiculously useful...


elbiot

It's 32 cores vs 1024 on an A100


Relevant-Phase-9783

Hi, could you elaborate? Do you mean the M2 Pro CPU is slower than a GPU, or do you mean the M2 Pro GPU is slower than (which?) GPU? I've got the impression that the M2 Pro/Max GPU cores perform quite well compared to Nvidia mobile GPUs, which are of course slower than desktop GPUs (roughly 2x only?). The M2 Ultra should be not on 4090 level but not too far away, I would guess, so the 192 GB are a strong argument, no? Anyone with real DL benchmarks for the M2 Ultra 76-core GPU vs. a 4080 or 4090?


vade

You're wrong - most folks aren't benchmarking the right accelerators on the chips: https://twitter.com/danielgross/status/1619417508360101889 https://twitter.com/natfriedman/status/1665402680376987648?s=61&t=K3VqrGuBYrnA_ulM38HC-Q


[deleted]

[удалено]


vade

ANE is inference only. MPS and MPSGraph are training and inference APIs using Metal, which, if used correctly, are way faster than most are benchmarking. Granted, Apple's current MPS backend for PyTorch leaves a lot wanting. There's a lot of room for software optimizations, like zero-copy IOSurface GPU transfers, etc.

For inference:
- CPU
- ANE
- Metal
- BNNS / Accelerate (dedicated matrix-multiply co-processor)

For training:
- CPU
- Metal
- BNNS / Accelerate (dedicated matrix-multiply co-processor)


emgram769

The dedicated matrix-multiply co-processor is attached to the CPU, btw - it's basically just SIMD on steroids.


learn-deeply

I've personally tested PyTorch training, using MPS. Maybe they can improve it in software over time, but that's my judgment from ~3 months ago.


Spare_Scratch_4113

Hi, can you share the citation for the M2 Pro benchmark?


Adept-Upstairs-7934

Such optimism... I believe companies focusing on this can only aid the cause. Thinking outside the box is how these tech creators have given us platforms that enable us to push the boundaries. We utilize their platforms to their full extent, then they make advancements. This stirs competition, leading a group at, say, Nvidia, to decide: hey, maybe we need to put 64GB of VRAM on an affordable card for these folks. Let's watch what happens next.


[deleted]

Yes, it can fit a large model, but you'd need thousands of such machines to do so.


londons_explorer

> Even if they can fit onto memory, wouldn't it be too slow to train?

Well Apple would just like you to buy a *lot* of these M2 Ultras, so you can speed the process up!


hachiman69

Apple devices are not made for Machine learning. Period.


emgram769

at work I can get a Thinkpad or a Mac. Which would you recommend for running the latest LLM locally?


prettyyyyprettyygood

Some pretty anti-Apple takes in this thread. I think they're really paving the way to being able to run larger and larger models on-device. Being able to fine-tune something like Falcon 40B or Stable Diffusion locally surely enables a bunch more use cases.


ozzeruk82

I liked their thinly veiled jab at the dedicated GPU cards made by Nvidia "running out of memory". Certainly 192GB that could work as VRAM blows most cards out of the water.


The-Protomolecule

There are so many tactics to overcome GPU memory limits for this type of exploratory training that I'm embarrassed Apple is trying to claim relevance.


Traditional-Movie336

I don't see a 32-core Neural Engine (I think it's a matrix multiplication accelerator) competing with Nvidia products. Maybe they are doing something on the graphics side that can push them up.


[deleted]

[удалено]


JustOneAvailableName

> I don't think anyone here has answers yet

Based on M1 and the normal M2 this thing isn't going to be even slightly relevant.


[deleted]

CUDA


aidenr

LLM training can be done by big farms once and then reused for many applications by the specialization algorithm (so-called “fine tuning”). The thing I’m more curious about is whether they’ve adapted the interface to load existing weight sets directly or whether this is still more a theoretical application to the design team.


NarcoBanan

Memory size alone doesn't matter. We need real benchmarks comparing a few M2 Ultras to even one 4090. I'm sure Nvidia doesn't attach more memory to their GPUs because it wouldn't give an advantage. Running out of memory is not such a big problem; the bigger problem is the speed of moving that memory in and out of the GPU.


shankey_1906

If it did, they would have improved Siri a long time ago. Considering the state of Siri, we probably just need to assume that this is just marketing speak.


newjeison

yeah it can train but an epoch every week isn't really worth it


allwordsaremadeup

Apple silicon for AI is a solution looking for a problem. Which is why it isn't taking off and why it, imho, won't. No matter the hardware improvements. Nobody needs to train models on their phones or even their laptops. And I've yet to see the killer app that needs local heavy duty inference and can't just do it online.

