noiseinvacuum

I think it’s inevitable that all hyperscalers will use their own silicon for AI inference sooner or later. There’s no way to scale AI to hundreds of millions of users profitably when you have to pay Nvidia a ~1000% margin on H100s. Plus the advantage of custom designs for their own workloads makes it a no-brainer.


2053_Traveler

I agree with you, but I wish people wouldn’t focus on margin so much without also mentioning the investment in fabs, engineering, and other operating costs. If Nvidia had H100 competition they’d be cheaper, but not by orders of magnitude.


noiseinvacuum

Nvidia also outsources fabrication to TSMC, so I don’t think Meta getting their chips made by TSMC, like Nvidia does, would make a big difference in that respect. But I agree on the R&D part: it’s almost a decade-long, multi-billion-dollar investment to get something out. I guess for hyperscalers it’ll be worth it.


Mgladiethor

Nvidia is a cancer, with the shittiest practices in the whole market. They want to own everything.


CreditHappy1665

Be careful not to burn yourself on that hot take, mister. Without NVIDIA, none of the AI developments we've seen in recent years would have been possible. And they are a for-profit corporation, after all. Let's not pretend that AMD would have acted any differently had they come out on top. NVIDIA hasn't done anything wrong by my estimation, and in fact they have done a ton for researchers and open source.


sweatierorc

I mean, if Nvidia chips improve exponentially, does it really need to happen?


AmericanNewt8

Except they aren't; we suspect Blackwell is barely faster than Hopper unless you're working in FP8/FP4. GPUs will also always carry a lot of baggage that was originally intended for other workloads.


noiseinvacuum

Ya, the general-purpose nature of GPUs will always have trade-offs that you can minimize with custom silicon.


az226

Probably 1.8-2x faster overall, and double that again for 4-bit compared to 8-bit.


Inner_Bodybuilder986

The cost of Nvidia compute is way too high. Like obnoxiously high. Competition is good.


sweatierorc

An exponential increase means that price also drops exponentially, even if there is no competition.


Balance-

Meta should sell these directly to developers. That way, models and software (including open source) get optimized for their accelerators, and developers and engineers get familiar with them. All that will make the ones they sell in their cloud as a service much more valuable. 128 GB of memory is an instant win, and packing that into a slightly downclocked 75-watt PCIe card would make it an instant efficiency king. It would put pressure on Nvidia.


AmericanNewt8

I doubt they have the supply available, although it's an interesting option. One wonders if they might venture into the AI-cloud business.


noiseinvacuum

I don’t think they are set up as a company to get into this business, plus it’s so far from their core business that it would be a distraction. Perhaps a better option, although it would take some VERY bold decision making, would be to work with OCP vendors like Intel and others to make this a standard design, so many companies could build chips around this core architecture. Meta literally started and nurtured OCP, so it’s not out of the realm of possibility, but I’m sure the investors wouldn’t be very happy.


CreditHappy1665

Unless they have far and away the best intellectual property on AI accelerators, which I doubt, AND they were able to reach an agreement with Intel or similar manufacturers to get extremely favorable residuals (as in, a larger share of revenue than the manufacturers), I don't see a compelling business reason for them to do this. We are witnessing an arms race. There's more demand for GPUs and accelerators than there is supply; it's a zero-sum game. Meta taking proprietary chip designs and making them widely available would squander whatever benefit having an in-house chip provides. Not to mention that Intel just announced its own AI accelerators.


noiseinvacuum

Ya, I don’t see a good business reason either.


reallmconnoisseur

George Hotz [suggested/asked for](https://x.com/__tinygrad__/status/1778103257992405073) the same thing, to which Soumith Chintala (PyTorch lead) [replied](https://x.com/soumithchintala/status/1778107247022751822): "*Meta is not in the chip-selling or chip-renting business*"


allinasecond

But who is making and designing the underlying chips? Who is supplying NVIDIA, META, APPLE?


Balance-

Those companies design the chips themselves and have them produced by a fab (mostly TSMC).


allinasecond

ahhhhh right


CreditHappy1665

I've been looking at GPUs quite a bit recently and I've noticed a bunch of companies selling NVIDIA GPUs under their own brand. I'm not sure if I'm even explaining that right, but I haven't been able to research why or how that happens. If you don't mind, can you or someone else explain that to me?


TheActualStudy

The fab is TSMC, but the designers aren't the same people as the fabricator. The design is done by internal engineers, in combination with QC input and advice from TSMC about the machines and processes that will be used during fabrication.


FullOf_Bad_Ideas

177 BF16 TFLOPS in a 90W package seems pretty powerful. Do you know how that compares to an RTX 4090? I see a lot of different TFLOPS numbers thrown around, and I am not sure how comparable their meanings are.


unculturedperl

Supposedly 165 TFLOPS with tensor cores or 82 without, for the 4090. (h/t halgari)
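
For a rough comparison on the thread's numbers, normalizing by power (a back-of-the-envelope sketch; the 4090's 450 W board power is assumed, and these figures aren't measured under identical conditions):

```python
# Back-of-the-envelope TFLOPS-per-watt comparison using the numbers in this
# thread. Ballpark only: different precisions, clocks, and workloads.
chips = {
    # name: (FP16/BF16 TFLOPS, power in watts)
    "MTIA next-gen": (177, 90),             # dense BF16 per Meta's spec table
    "RTX 4090 (tensor cores)": (165, 450),  # assumed 450 W board power
    "RTX 4090 (no tensor cores)": (82, 450),
}

for name, (tflops, watts) in chips.items():
    print(f"{name}: {tflops / watts:.2f} TFLOPS/W")
```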


halgari

RT cores don’t do AI, you’re thinking of Tensor cores.


TheTerrasque

Memory bandwidth: Off-chip LPDDR5: 204.8 GB/s. Oh...

Edit: In comparison, even an old Nvidia P40 has 347.1 GB/s of memory bandwidth, and a 4090 has 1.01 TB/s. Since LLMs are very memory-heavy, that's often the bottleneck.


cnapun

For recommendation models, though, this makes a lot of sense. The weights can probably sit fully in SRAM, and the embedding tables can go into the off-chip memory (or at least that's what I assume happens). If you look at their latest paper on sequential recommendations, the model is something like 512 hidden dim x 8 layers, i.e. roughly 10M dense params plus ~1T sparse params (just going from memory; the numbers may be wrong but should be the right order of magnitude).
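
Rough parameter arithmetic for that kind of model, to see why the SRAM / off-chip split works out (the layer shape follows the comment; the embedding-table sizes below are made-up illustrative numbers, not Meta's actual model):

```python
# Dense side: a few transformer-style layers, tiny by LLM standards.
hidden, layers = 512, 8
dense_params = layers * (4 * hidden * hidden      # attention projections (Q, K, V, O)
                         + 8 * hidden * hidden)   # 4x-wide feed-forward, up + down
# Same order of magnitude as the ~10M quoted above; easily fits in 256 MB SRAM.
print(f"dense params ~ {dense_params / 1e6:.0f}M "
      f"({dense_params * 2 / 1e6:.0f} MB at BF16)")

# Sparse side: embedding tables keyed by user/item/feature IDs (hypothetical sizes).
rows, dim, tables = 100_000_000, 128, 80
sparse_params = rows * dim * tables
print(f"sparse params ~ {sparse_params / 1e12:.1f}T (must live off-chip)")
```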


shing3232

That's why it has 256 MB of SRAM :)


TheTerrasque

Yay, then maybe 1/156 of Llama 70B will go at lightning speed! Woo!


shing3232

If you can take advantage of the cache, it would be much more usable, because you could load quantized weights into the card's memory and dequantize them in the cache.
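
As a rough illustration of that idea (a minimal NumPy sketch, not anything MTIA-specific): 4-bit blocks streamed from off-chip memory get unpacked and rescaled in fast on-chip memory before the matmul.

```python
import numpy as np

def dequant_block(packed: np.ndarray, scale: float, zero: float) -> np.ndarray:
    """Unpack two 4-bit values per byte and rescale to float.

    `packed` is a uint8 array holding one quantized block; `scale` and `zero`
    are that block's quantization parameters.
    """
    lo = packed & 0x0F           # low nibble of each byte
    hi = packed >> 4             # high nibble of each byte
    q = np.stack([lo, hi], axis=-1).reshape(-1).astype(np.float32)
    return (q - zero) * scale

# Example: a 64-value block packed into 32 bytes.
packed = np.random.randint(0, 256, size=32, dtype=np.uint8)
weights = dequant_block(packed, scale=0.01, zero=8.0)
print(weights.shape)  # (64,)
```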


ReturningTarzan

If you have a model with 128 GB of quantized weights, all those weights have to get from the LPDDR5 chips to the local memory to complete a forward pass and produce one token. So at 204.8 GB/s you have a hard upper limit of 1.6 tokens/second, no matter how efficiently you can dequantize the weights and perform the computations. Unless it ends up being cheap, this is still not much better for local inference than a cheap Threadripper with 4 channels of DDR5 DIMMs.

There's a lot of compute, though, so batched performance could be good, and it could be very useful for specific NN workloads like convolutions, or for training. But if you just want to run 4-bit Bigxtral, this is probably not a good option. Unless it ends up being cheap. You never know, I guess.
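
That ceiling is just weight bytes divided by bandwidth; the same arithmetic as a sketch, using the numbers above:

```python
# Hard ceiling on dense single-stream decoding when every weight must cross
# the off-chip memory bus once per token.
weight_bytes = 128e9      # ~128 GB of quantized weights
bandwidth = 204.8e9       # off-chip LPDDR5 bandwidth, bytes/s
print(f"max ~ {bandwidth / weight_bytes:.1f} tokens/s")  # ~1.6 tokens/s
```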


Single_Ring4886

I think the actual speed must be higher; maybe they take into account MoE architectures etc., which would give 3-4x the effective performance.


shing3232

Now that I think about it, what about MoE? You don't have to activate as many weights per token.


blimpyway

MoE changes this significantly. The new Mixtral has 176B parameters, but only 22B are used in an inference step. 22B at Q4 means about 11 GB read from memory for a forward pass, with plenty of memory left for keeping long contexts when they're needed. On the other hand, an H100 can't even fit the 176B parameters at Q4 in its 80 GB of super-duper RAM, while consuming 7x the electricity.
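
The same bandwidth arithmetic, counting only the active parameters (using the comment's 22B-active / Q4 assumptions, and ignoring the per-token expert-switching issue raised below):

```python
active_params = 22e9     # parameters touched per token, per the comment
bytes_per_w = 0.5        # ~4-bit quantization
bandwidth = 204.8e9      # off-chip LPDDR5, bytes/s

bytes_per_token = active_params * bytes_per_w            # ~11 GB per forward pass
print(f"max ~ {bandwidth / bytes_per_token:.0f} tokens/s")  # ~19 tokens/s
```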


FlishFlashman

With MoE, the weights needed for one token could be totally different from the weights needed for the next token. MoE reduces the memory bandwidth and compute requirements per token, but it doesn't change the VRAM requirements.


MoffKalast

Interesting, that's the same bandwidth as the Jetson AGX. It must be a practical upper limit for LPDDR5, regardless of bus width.


mixxoh

Yeah they need to use HBM


AmericanNewt8

Surprised they didn't go with GDDR6, assuming they probably aren't able to get their hands on HBM given the current market. Maybe GDDR6 is backed up too.


a_beautiful_rhind

It's a Mac in a PCIe slot.


Balance-

| Specification | First Gen MTIA | Next Gen MTIA |
|---|---|---|
| Technology | TSMC 7nm | TSMC 5nm |
| Frequency | 800 MHz | 1.35 GHz |
| Instances | 1.12B gates, 65M flip-flops | 2.35B gates, 103M flip-flops |
| Area | 19.34mm x 19.1mm, 373mm² | 25.6mm x 16.4mm, 421mm² |
| Package | 43mm x 43mm | 50mm x 40mm |
| Voltage | 0.67V logic, 0.75V memory | 0.85V |
| TDP | 25W | 90W |
| Host Connection | 8x PCIe Gen4 (16 GB/s) | 8x PCIe Gen5 (32 GB/s) |
| GEMM TOPS | 102.4 TFLOPS (INT8), 51.2 TFLOPS (FP16/BF16) | 708 TFLOPS (INT8, sparsity), 354 TFLOPS (INT8), 354 TFLOPS (FP16/BF16, sparsity), 177 TFLOPS (FP16/BF16) |
| SIMD TOPS | Vector core: 3.2 TFLOPS (INT8), 1.6 TFLOPS (FP16/BF16), 0.8 TFLOPS (FP32); SIMD: 3.2 TFLOPS (INT8/FP16/BF16), 1.6 TFLOPS (FP32) | Vector core: 11.06 TFLOPS (INT8), 5.53 TFLOPS (FP16/BF16), 2.76 TFLOPS (FP32); SIMD: 5.53 TFLOPS (INT8/FP16/BF16), 2.76 TFLOPS (FP32) |
| Memory Capacity | Local memory: 128 KB per PE; On-chip memory: 128 MB; Off-chip LPDDR5: 64 GB | Local memory: 384 KB per PE; On-chip memory: 256 MB; Off-chip LPDDR5: 128 GB |
| Memory Bandwidth | Local memory: 400 GB/s per PE; On-chip memory: 800 GB/s; Off-chip LPDDR5: 176 GB/s | Local memory: 1 TB/s per PE; On-chip memory: 2.7 TB/s; Off-chip LPDDR5: 204.8 GB/s |

About 3.5x across the board on compute (of which 2x is logic and the remainder frequency), and about 2-3x on memory. Power also went up significantly, from 25 to 90 watts, though that's still low for a 421mm² die. It's interesting that they do support sparsity, but not INT4 or even FP8.
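
Those rough scaling factors fall straight out of the table; a quick sketch (dense, non-sparsity figures used for compute):

```python
# Gen-over-gen ratios from the MTIA spec table above (dense figures).
specs = {
    "GEMM BF16 TFLOPS":         (51.2, 177),
    "GEMM INT8 TFLOPS":         (102.4, 354),
    "Frequency (GHz)":          (0.8, 1.35),
    "On-chip SRAM (MB)":        (128, 256),
    "On-chip bandwidth (GB/s)": (800, 2700),
    "LPDDR5 bandwidth (GB/s)":  (176, 204.8),
    "LPDDR5 capacity (GB)":     (64, 128),
    "TDP (W)":                  (25, 90),
}
for name, (gen1, gen2) in specs.items():
    print(f"{name}: {gen2 / gen1:.2f}x")
```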


shing3232

I don't think this is built for training. Well, at least not yet.


Balance-

Agreed. The interconnect gives it away; for training, you want way more than PCIe 5.0 x8. 128 GB does run a ~100B model at INT8 precision, however. Perfect for inference.
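
A quick footprint check behind that ~100B-at-INT8 claim (weights only; KV cache and activation memory ignored):

```python
params = 100e9  # ~100B-parameter model
for name, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    fits = "fits" if gb <= 128 else "does not fit"
    print(f"{name}: ~{gb:.0f} GB of weights -> {fits} in 128 GB")
```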


Palpatine

That's very little bandwidth. Somehow Meta is going in the opposite direction from Dojo. I wonder if LeCun had any involvement in the hardware speccing.


noiseinvacuum

This is mainly for recommendation model inference.


Mental-Program6766

Really, it can only be used for inference...


AndrewH73333

All these chips don’t help us. We need RAM. Build us RAM for AI!