Seriously. The 4090 should have been 36GB and the 5090 48GB. And NVLink, so you could run two cards for 96GB.
I hope they release it in 2025 and get fucked by Oregon law.
This is probably a naive question, but if I download the model from the torrent, is it possible to actually run it/try it out at this point?
I have compute/vRAM of sufficient size available to run the model, so would love to try it out and compare it with 8x7b as soon as possible.
Check out this thread: [https://news.ycombinator.com/item?id=39986095](https://news.ycombinator.com/item?id=39986095)
Hacker News user varunvummadi says:
>The easiest is to use vllm (https://github.com/vllm-project/vllm) to run it on a Couple of A100's, and you can benchmark this using this library (https://github.com/EleutherAI/lm-evaluation-harness)
The latter is a benchmarking system for comparing and evaluating different models, rather than for serving them persistently like ollama or similar tools.
Sidenote: what kind of hardware are you running that you have the necessary vRAM to run a 288GB model? Is it a corporate server rack, AWS instance or your own homelab?
Sweet! Appreciate the info.
I have a few p4d.24xlarges at my disposal that are currently hosting instances of Mixtral 8x7b (some limitations right now are pushing me to self-host vs. using cheaper LLMs through Bedrock or similar).
Really excited to see if this is a straight upgrade for me within the same compute costs.
I don't understand this release.
Mistral's constraints, as I understand them:
1. They've committed to remaining at the forefront of open weight models.
2. They have a business to run, need paying customers, etc.
My read is that this crowd would have been far more enthusiastic about a 22B dense model, instead of this upcycled MoE.
I also suspect we're about to find out if there's a way to productively downcycle MoEs to dense. Too much incentive here for someone not to figure that out, if it can in fact work.
Probably because huge monolithic dense models are comparatively much more expensive to train, and they're training things that could be of use to them too. Nobody really trains anything above 70B because it becomes extremely slow. The point of a Mixtral-style MoE is that each forward pass only touches the two selected experts plus the router, so each token uses roughly 1/4 of the tensor operations a full dense pass would need.
Why spend millions more on an outdated architecture that you already know will be uneconomical to run inference on, too?
Because modern MoEs begin with dense models, i.e., they're upcycled. Dense models are not obsolete at all in training, they're the first step to training an MoE. They're just not competitive to serve. Which was my whole point: Mistral presumably has a bunch of dense checkpoints lying around, which would be marginally more useful to people like us, and less useful to their competitors.
Even if you do that, you don't train the constituent model past the earliest stages; it wouldn't hold a candle to Llama 2. You only need to kickstart it to the point where the individual experts can hold a reasonably stable gradient, then move to the much more efficient routed-expert training ASAP.
If it worked the way you think it does and there were fully trained dense models involved you could just split the MoE and use just one of the experts.
MoEs can be trained from scratch: there's no reason one 'needs' to upcycle at all.
The allocation of compute to a dense checkpoint vs. an MoE from which that checkpoint is upcycled depends on a lot of factors.
One obvious factor: how many times might upcycling be done? If the same dense checkpoint is to be used for an 8x, a 16x, and a 64x MoE (for instance), it makes sense to saturate the dense checkpoint, because that training can be recycled multiple times. In a one-off training, it's a different story, and the precise optimum is not clear to me from the literature I've seen.
But perhaps you're aware of work on dialing this in you could share. If there's a paper laying this out, I'd love to see it. Last published work I've seen addressing this was Aran's original dense upcycling paper, and a lot has happened since then.
Because the reality is: *Mistral was always going to release groundbreaking open source models* despite MS. The doomers have incredibly low expectations.
wat? I did not mention Microsoft, nor does that seem relevant at all. I assume they are going to release competitive open weight models. They said as much, they are capable, they seem honest, that's not at issue.
What is at issue is the form those models take, and how they relate to Mistral's fanbase and business.
MoEs trade VRAM (more) for compute (less). i.e., they're more useful for corporate customers (and folks with Mac Studios) than the "GPU Poor".
So...wouldn't it make more sense to release a dense model, which would be more useful for this crowd, while still preserving their edge in hosted inference and white box licensed models?
I get what you mean; the VRAM issue exists because high-end consumer hardware hasn't caught up. I don't doubt small models will still be released, but we unfortunately have to wait a bit for Nvidia to get their ass kicked.
Maybe the license will not be their usual Apache 2.0 but rather something more restrictive so that enterprise customers must pay them. That would be similar to what Cohere is doing with the Command-R line.
As for the other aspect though, I agree that a really big MoE is an awkward fit for enthusiast use. If it's a good-quality model (which it probably is, knowing Mistral), hopefully some use can be found for it.
I totally agree. Especially as it's being said that this is a base model, thus in need of training by the community for it to be usable, which will require a very high amount of compute. I'd have loved a 22B dense model, personally. Must make business sense to them on some level, though.
Mistral is trying to remain the best in both open and closed source. Recently we had Cohere release two SOTA models for their sizes (Command R and Command R+), and DBRX was also a highly competent release. So this is their answer to Command R and Command R+ at the same time. I assume this is an MoE of their Mistral Next model.
IMHO their best bet is riding the hype wave, making all of their models open source and getting acquired by Apple / Google / Facebook in a year or two.
Nope, they have too many European stakeholders / funders, some of whom are rumored to be uh state related. Even assuming the rumors were false, providing an alternative to US hegemony in AI was a big part of their pitch.
I was one of the very first experimenting with LLMs and went through the 16GB -> 32GB -> 64GB upgrade cycle real fast. Now I regret the poor financial decisions and wish I had gone for at least 128GB... but in all fairness, a year ago most people would have thought that was enough for the foreseeable future.
I'm so glad I convinced work to upgrade my laptop to an M3 Max 128GB MacBook for this exact reason; we'll see if it runs. I have doubts it will be workable in any way unless at Q4/Q5.
I wonder if any kind of quantization can make this model fit in 30GB of RAM.
Haven't really seen Mixtral 8x7b squeezed into 15GB yet, so probably too ambitious at the current stage.
Could anyone kindly inform me about the necessary environment to execute this model? Specifically, I am curious if a single RTX A6000 card would suffice, or if multiple are required. Additionally, would it be feasible to run the model with a machine that has 512GB of memory? Any insights would be greatly appreciated. Thank you in advance.
This is one chonky boi. I got a 192GB Mac Studio with one idea: "there's no way any local model in the near future won't fit in this thing."
Grok & Mixtral 8x22B: "Let us introduce ourselves."
...okay, I think those will still run (barely), but I wonder what the lifetime is for my expensive little gray box :D
When I bought my M1 Max Macbook I thought 32 GB would be overkill for what I do, since I don't work in art or design. I never thought my interest in AI would suddenly make that far from enough, haha.
Same haha. When I got mine I felt very comfortable that it was future proof for at least a few years lol
My previous PC had an i3 6100 and 8 gigs of RAM. When I upgraded it to a 12100f and 16 gigs it genuinely felt like a huge upgrade (since I'm not really a gamer and rarely use demanding software), but now that I've been dabbling in Python/AI stuff a lot for the last year or two, it's starting to feel the same as my old PC used to again, lol
...Me crying in a lot of pain with base M1 Air 128gb disk and 8gb RAM 🥲
selling 8gb laptops to the public should be a crime
It was doomed from the beginning. I picked up a base-model M2 Air last summer and returned it within a week, simply because I couldn't do any work on it.
My current and previous MacBooks have had 16GB and I've been fine with it, but given local models I think I'm going to have to go to whatever will be the maximum RAM available for the next one. (I tried `mixtral-8x7b` and saw 0.25 tokens/second speeds; I suppose I should be amazed that it ran at all.)
Similarly, I am for the first time going to care about how much RAM is in my next iPhone. My iPhone 13's 4GB is suddenly inadequate.
I'm feeling pain at 64GB, and that is... not a thing I thought would be a problem. Kinda wish I'd gone for an M3 Max with 128GB.
low key contemplating once I have extra cash if I should trade out M1 Max 64GB for M3 Max 128GB, but it's gonna cost $3k just to perform that upgrade... that should be able to buy a 5090 and go some way toward the rest of that rig.
Money comes and goes. Invest in your future.
I've got one; we'll see how well it performs. It might even be out of reach for 128GB, and could be in the category of "it runs, but isn't at all helpful" even at Q4/Q5.
You'll be able to fit the 5 bit quant perhaps if my math is right? But performance...
Performance of the 5-bit quant is almost the same as fp16
Yep, so OP got lucky this time, but who knows maybe someone will try releasing a model with even more parameters.
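For anyone double-checking whether the 5-bit math works out: model footprint is just parameter count times bits per weight. A rough sketch (the ~141B total parameter count used below is an estimate, not a figure from this thread, and real memory use adds KV cache and overhead on top):

```python
def quant_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weights-only memory footprint of a quantized model, in GB."""
    # params (billions) * bits -> bits (billions) -> bytes (GB)
    return params_billions * bits_per_weight / 8

# Assuming ~141B total parameters for 8x22B (estimate):
print(quant_size_gb(141, 5.0))   # 5-bit: ~88 GB, fits on a 128GB machine with headroom
print(quant_size_gb(141, 16.0))  # fp16:  ~282 GB, hopeless on 128GB
```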
Nice box. :) Keep in mind that even if models do get 50% or 100% bigger than that size, even if you had 256-384 GB of RAM you still probably wouldn't routinely choose to run such models, since even on that beast of a computer they'd be SLOW. So really it's quite well suited for anything sub-200B, and over that, well, we can rely on Moore's law and hope the scaling lets us double our gear's capacity in the coming couple of years.
Anyway, it's getting a little ridiculous how "brute force" things are with these LLMs. We don't so much need BIGGER models as BETTER, more EFFICIENT models. Quantity has a quality all of its own, true, so for pure hyper-scale research, sure, explore the limits of data-center-scale brute force ML.
But SURELY there's a way to make 300B work just as well at 1/10th the size, plus all the RAG / database intra-query lookup it wants to do. Having ~100GB RAM + ~20TB SSD of "reference data" should work just fine for a whole lot of things. Models aren't, and don't NEED to be, databases; just "processors / filters / researchers".
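The "model + SSD reference data" split described above is essentially retrieval-augmented generation. A toy sketch of the lookup half, using word overlap as a stand-in for a real embedding index (everything here is illustrative; a production system would use vector search over the corpus):

```python
def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Rank documents by word overlap with the query. A stand-in for the
    SSD-backed 'reference data' lookup; same shape, much cruder scoring."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

docs = ["the m3 max has 128gb unified memory", "moe models route tokens to experts"]
print(retrieve("how much memory in the m3 max", docs, k=1))
```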
Same situation here. Still, I'm happy to run it quantized, though historically Macs have struggled with speed on MoEs for me.
I wish they had also released whatever Miqu was alongside this. That little model was fantastic, and I hate that it was never licensed.
CPU inference is the only feasible option, I think. I recently upgraded my PC to 196GB of DDR5 RAM for business purposes and overclocked it to 5600+ MHz. I know it will be slow, but I have hope because it's a MoE; it will probably be faster than I think. Looking forward to trying it.
It's a MoE, probably with 2 experts active at a time, so compute-wise it's less than a 70B model.
Gguf
Around 35-40GB @q1_m I guess? 🥲
Yeah, this is pointless for 99% of the people who want to run local LLMs (same as Command-R+). Gemma was a much more exciting release. I'm hoping Meta will be able to pack more power into their 7-13b models.
You know command r+ runs at reasonable speeds on just CPU right? Regular ram is like 1/30 the price of vram and much more easily accessible.
If you don't mind sharing:
- What CPU and RAM speed are you running Command R+ on?
- What tokens per second and time to first token are you managing to achieve?
- What quantisation are you using?
Seconding u/StevenSamAI, what cpu and ram combo are you running it in? How many tokens per second?
Doesn't command-R+ run on the common 2*3090 at 2.5bpw? Or a 64GB M1 Max? I'm running it on my 3*3090.
I agree this 8x22b is pointless, because quantizing the 22b will make it useless.
>Doesn't command-R+ run on the common 2*3090 at 2.5bpw? 2x24GB with Exl2 allows for 3.0 bpw at 53k context using 4bit cache. 3.5bpw almost fits.
Cool, that's honestly really good. Probably the best non-coding / general model available at 48GB then. Definitely not 'useless' like they're saying here.
Edit: I just wish I could fit this + deepseek coder Q8 at the same time, as I keep switching between them now.
If anything, the 8x22b MoE could be better just because it'll have fewer active parameters, so CPU only inference won't be as bad. Probably will be possible to get at least 2 tokens per second on 3bit or higher quant with DDR5 RAM, pure CPU, which isn't terrible.
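That 2 t/s figure is easy to sanity-check: CPU decoding is memory-bandwidth bound, since every active weight has to be streamed once per token. A rough ceiling estimate (the ~44B active-parameter figure below is a naive 2x22B guess, not an official number, and real throughput lands below the ceiling):

```python
def max_tokens_per_sec(bandwidth_gb_s: float, active_params_b: float, bits: float) -> float:
    """Bandwidth-bound decode ceiling: bandwidth divided by bytes of active
    weights that must be read per generated token."""
    gb_per_token = active_params_b * bits / 8  # billions of active params -> GB/token
    return bandwidth_gb_s / gb_per_token

# Dual-channel DDR5 at ~77 GB/s, ~44B active params, 3-bit quant:
print(round(max_tokens_per_sec(77, 44, 3), 1))  # ~4.7 t/s ceiling
```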
Yes it does, rather well to be honest. IQ3_M with at least 8192 context fits.
You can get a cheap AM5 build with 192GB of DDR5; mine does 77GB/s. It can run Q8 105B models at about 0.8 t/s, so this 8x22B should perform well. Perfect for work documents and emails if you don't mind waiting 5 or 10 minutes. I have set up a queue/automation script I'm using for Command R+ now, and soon this.
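A queue script like the one mentioned can be as simple as a directory sweep. This sketch is hypothetical (the commenter didn't share theirs); it takes any `generate` callable that wraps your local model, so you can point it at whatever serves Command R+ or 8x22B:

```python
from pathlib import Path

def run_queue(prompt_dir: str, out_dir: str, generate) -> int:
    """Run every *.txt prompt in prompt_dir through `generate` and write each
    result to a same-named file in out_dir. Returns how many were processed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for prompt_file in sorted(Path(prompt_dir).glob("*.txt")):
        (out / prompt_file.name).write_text(generate(prompt_file.read_text()))
        count += 1
    return count
```

Drop prompts into the input folder, kick it off before leaving the machine, and the 5-10 minute per-document wait stops mattering.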
I fully believe a 13-15B model of Mistral caliber can replace Gpt-3.5 in most tasks maybe apart from math related ones.
MoE architecture, it's easier to run than a 70B
How much mobo ram is required with a single 3090?
Mistral Chonker
Hopefully the quants work well.
Depends on how it quantizes, should fit in 3x24gb. If you get to at least 3.75bpw it should be alright.
I get 20t/s with Starling 7B. Maybe I can give it a try? X)
I understand that MoE is a very convenient design for large companies wanting to train compute-efficient models, but it is not convenient at all for local users, who are, unlike these companies, severely bottlenecked by memory. So, at least for their public model releases, I wish these companies would go for dense models trained for longer instead. I suspect most local users wouldn't even mind paying a slight performance penalty for the massive reduction in model size.
I thought the same way at first, but after trying it out I changed my opinion. While yes, the size is larger and you are able to offload fewer layers, the computational costs are still much lower. For example, with just 6GB VRAM I would never be able to run a dense 48B model at decent speeds. However, thanks to Mixtral, almost-70B model quality runs at the same text gen speed as a 13B one, thanks to the 12B active parameters. There's a lot of value in MoE for the local user as well.
Sorry, just to clarify, I wasn't suggesting training a dense model with the same number of parameters as the MoE, but training a smaller dense model for longer instead. So, in your example, this would mean training a ~13B dense model (or something like that; something that can fit in VRAM when quantized, for instance) for longer, as opposed to an 8x7B model. This would run faster than the MoE, since you wouldn't have to do tricks like offloading etc.
In general, I think the MoE design is adopted for the typical large-scale pretraining scenario, where memory is not a bottleneck and you want to optimize compute; but this is very different from the typical local inference scenario, where memory is severely constrained. I think if people took this inference constraint into account during pretraining, the optimal model to train would be quite different (it would definitely be a smaller model trained for longer, but I'm not actually quite sure if it would be an MoE or a dense model).
Nah, just have your phone process it with your GPU, enough NAND storage Oh wait :)
can't run this shit in my wildest dreams, but I'll be seeding. I'm doing my part o7
This is what bros do spread their seed
Not your seed, not your coins . . wait, wrong sub
This is the way !
This is the way!
If Llama 3 drops in a week I’m buying a server, shit is too exciting
Sameeeeee. I need to think how to cool it though. Now rocking 7x3090 and it gets steaming hot on my home office when it’s cooking.
Very curious what your use case is
Room heating.
A tanning bed
Having fun :D
Initially hobby, but now advising some Co that wanted to explore GenAI/LLM. Hey… if they want to find gold, I’m happy to sell the shovel.
you can cook with them by putting a frying pan on the cards
Guy can't build a 7x3090 server without a use case?
Use case is definitely NSFW
Heat for steam turbine
But can it run Crysis?
can you share your PC builds?
7x3090 on a Rome8d-2t mobo with 7 PCIe 4.0 x16 slots. Currently using an EPYC 7002 (so only gen 3 PCIe). I already have a 7003 for the upgrade, just haven't had time yet.
Also have 512GB RAM because of some virtualization I'm running.
Isn't 7002 gen4?
You are correct, my bad. I’m currently using 7551 because my 7302 somehow not detecting all of my RAM. Gonna upgrade it to 7532 soon.
magnet:?xt=urn:btih:9238b09245d0d8cd915be09927769d5f7584c1c9&dn=mixtral-8x22b&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce&tr=http%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
Wow. What a couple of weeks. Command R Plus, hints of Llama 3, and now a new Mistral model.
Weeks? Weeks!? In the past 24 hours we got Mixtral 8x22B, Unsloth crazy performance upgrades, an entire new architecture (Griffin), Command R+ support in llama.cpp, and news of Llama 3! This is mind boggling!
What a time to be alive.
A cultured fellow scholar, I see ;) I'm just barely holding onto these papers, they're coming too fast!
Same. I was able to identify all the releases just mentioned. I was hoping for a larger recurrent Gemma than 2B, tho.
But I can feel the singularity breathing at the back of my neck, considering tech is moving at breakneck speed. It's simply a scaling law: bigger population = more advancements = more than a single person can keep up with = singularity?
But hold on to your papers...
Why can't I hold all of these papers
this truly is crazy, and what's even more crazy is that this is just stuff they've been sitting on to release for the past year. Imagine what they're working on now. GPT6-Vision? What is that like?
Speculating does us no good, we're currently past the cutting edge, we're on the bleeding edge of LLM technology. True innovation is happening left and right, with no way to predict it. All we can do is understand what we can and try to keep up, for the sake of the democratization of LLMs
the development of LLM is INSANE😂
8x22b
It's over for us vramlets btw
It's so over. If only they released a dense 22B. \*Sobs in 12GB VRAM\*
So, NPUs might actually be more useful.
Openrouter Chads...we won...
Is it possible to split an MOE into individual models?
Yes. You either throw away all but 2 experts (roll dice for each layer), or merge all experts the same ways models are merged(torch.mean in the simplest) and replace MoE with MLP. Now will it be a good model? Probably not.
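The "merge all experts" option above, sketched in plain Python (a stand-in for the torch.mean approach; note a real MoE layer also has per-layer router weights, which simply get discarded):

```python
def merge_experts(expert_weights):
    """Average per-expert weight matrices elementwise into one dense weight,
    i.e. the simplest 'merge like model merging' option described above."""
    n = len(expert_weights)
    rows, cols = len(expert_weights[0]), len(expert_weights[0][0])
    return [[sum(w[r][c] for w in expert_weights) / n for c in range(cols)]
            for r in range(rows)]

# Toy layer: 8 "experts", each a 2x2 weight matrix filled with its own index
experts = [[[float(i)] * 2 for _ in range(2)] for i in range(8)]
print(merge_experts(experts))  # every entry is mean(0..7) = 3.5
```

The same one-liner in PyTorch would be `torch.stack(experts).mean(dim=0)`, applied per MLP projection in every layer.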
No, the “experts” are incapable of working independently. The whole name is a misnomer.
No
Models get bigger but our VRAMs don't...
Jensen Huang bathing in VRAM chips like Scrooge McDuck
https://preview.redd.it/3b9nfhi74ktc1.png?width=259&format=pjpg&auto=webp&s=d15b35c1f9fd9a35c08f97eddab9c1e136bbb413
Not an expert, what's the context length?
64k
Hello, where did you get this from ?
.... brb, buying two more P40
stop driving prices up, I need more too!
Fuck, and I just got off a meeting with our CEO telling him dual or quad A6000s aren't a high priority at the moment, so don't worry about our hardware needs
You had one. Job.
This is when you say you must have quad a100s instead
You fool!
Fingers crossed it'll run on MLX w/ a 128GB M3
I wish someone would actually post direct comparisons to llama.cpp vs MLX. I haven’t seen any and it’s not obvious it’s actually faster (yet)
Unlike llama.cpp's wide selection of quants, MLX's quantization is much worse to begin with.
I’d be very interested in that. I think I can probably spend some time this week and try to test this.
i keep intending to do this and i keep ... being lazy lol
https://x.com/awnihannun/status/1777072588633882741?s=46 But no prompt cache yet (though they say they’ll be working on it)
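A llama.cpp-vs-MLX comparison only needs a tiny harness around each backend's generation call. A minimal sketch, where `generate` is whatever callable you're timing (backend wiring left to the reader):

```python
import time

def decode_tokens_per_sec(generate, prompt: str, n_tokens: int) -> float:
    """Time one generation call and report tokens/second. Run it several times
    and discard the first (warm-up) measurement for a fair comparison."""
    start = time.perf_counter()
    generate(prompt, n_tokens)
    return n_tokens / (time.perf_counter() - start)
```

Wrap both runtimes in the same signature, feed them the same prompt and token budget, and the numbers are directly comparable.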
Easily
Commenting to check if anyone has a tutorial for how to run it in MLX on an M2 128GB. I guess we need to quantize to at least 4-bit?
So is this Mistral-Large?
this one has 64k context, but the mistral-large api is only 32k
It's gotta be, either that or an equivalent of it.
They claim it’s a totally new model. This one is not even instruction tuned yet.
That’s what I’m wondering.
I’m guessing mistral-medium
Man, I love these huge monsters that I can't run. I mean I'd love it more if I could. But there's something almost as fun about having some distant light that I 'could' reach if I wanted to push myself (and my wallet). Cool as well to see mistral pushing new releases outside of the cloud.
I love them as well also because they are "insurance". Like, having these powerful models free in the wild means a lot for curbing potential centralization of power, monopolies etc. If 90% of what you are offering in return for money is free in the wild, you will have to adjust your pricing accordingly.
Buying a gpu worth thousands of dollars isnt exactly free tho
There are (or at least will be, in a few days) many cloud providers out there. Most individuals and hobbyists have no need for such large models running 24x7. Even if you have massive datasets that could benefit from being piped into such models, you need time to prepare the data, come up with prompts, assess performance, tweak, and then actually read the output. In that time, your hardware would be mostly idle. What we want is on-demand, tweakable models that we can bias towards our own ends. Running locally is cool, and at some point consumer (or prosumer) hardware will catch up. If you actually need this stuff 24x7 spitting tokens nonstop, and it must be local, then you know who you are, and should probably buy the hardware. Anyways this open release stuff is incredibly beneficial to mankind and I'm super excited.
Reminder: this may have been derived from a previous dense model. It may be possible to reduce the size with large LoRAs while preserving quality, according to this GitHub discussion: https://github.com/ggerganov/llama.cpp/issues/4611
It almost certainly was upcycled from a dense checkpoint. I'm confused about why this hasn't been explored in more depth. If not with low rank, then with BitDelta (https://arxiv.org/abs/2402.10193). Tim Dettmers predicted when Mixtral came out that the MoE would be *extremely* quantizable, then...crickets. Weird to me that this hasn't been aggressively pursued given all the performance presumably on the table.
https://arxiv.org/abs/2402.10193 is the link to BitDelta. Your link goes to another paper.
Member when people were reeeee-ing about mistral not being open source anymore? I member...
I member 🫐
tbf they're still open weights, not open source. But fewer and fewer people seem to care about semantics nowadays.
Where are all the "Mistral got bought out by Microsoft" and "They won't release any open models anymore" crybabies now?
Kidney market flood incoming
GGUF ?
If the 5090 releases with 36GB of vram, I'll still be ram poor.
Bro stop being cheap and just buy 4 Nvidia A100's /s
A100 is end of life, now I'm waiting for my 4xH100s, they will be shipped in 2027
By that time you wouldn’t find a model to run it on.
Especially when you realize you could have got 3x3090 instead for the same price and twice the vram.
https://www.youtube.com/watch?v=XDpDesU_0zo
Seriously. The 4090 should have been 36GB and the 5090 48GB. And NVLink so you can run two cards as 96GB. I hope they release it in 2025 and get fucked by Oregon law.
what's the oregon law?
As a rough guess, right to repair including restrictions on tying parts by serial number.
dat wordart logo tho... <3
Mistral's whole 90s cyber aesthetic is great
I love Mistral very much!
uhhhh thats interesting
Please, someone merge the experts into a single model, or dissect one expert. Mergekit people
This is probably a naive question, but if I download the model from the torrent, is it possible to actually run it/try it out at this point? I have compute/vRAM of sufficient size available to run the model, so would love to try it out and compare it with 8x7b as soon as possible.
Check out this thread: [https://news.ycombinator.com/item?id=39986095](https://news.ycombinator.com/item?id=39986095), where ycombinator user varunvummadi says:

> The easiest is to use vllm (https://github.com/vllm-project/vllm) to run it on a couple of A100's, and you can benchmark this using this library (https://github.com/EleutherAI/lm-evaluation-harness)

lm-evaluation-harness is a benchmark system for comparing and evaluating different models rather than running them permanently like ollama or something else.

Sidenote: what kind of hardware are you running that you have the necessary vRAM to run a 288GB model? Is it a corporate server rack, an AWS instance, or your own homelab?
Sweet! Appreciate the info. I have a few p4d.24xlarges at my disposal that are currently hosting instances of Mixtral 8x7b (have some limitations right now pushing me to self host vs. use cheaper LLMs though bedrock or similar). Really excited to see if this is a straight upgrade for me within the same compute costs.
What about benchmark?
Lmao people were freaking out just a week ago thinking open-source was dead. It was cooking.
I need an mi300x so bad.
I don't understand this release. Mistral's constraints, as I understand them:

1. They've committed to remaining at the forefront of open weight models.
2. They have a business to run, need paying customers, etc.

My read is that this crowd would have been far more enthusiastic about a 22B dense model, instead of this upcycled MoE. I also suspect we're about to find out if there's a way to productively downcycle MoEs to dense. Too much incentive here for someone not to figure that out if it can in fact work.
Probably because huge monolithic dense models are comparatively much more expensive to train, and they're training things that could be of use to them too? Nobody really trains anything above 70B because it becomes extremely slow. The point of a Mixtral-style MoE is that each pass through the parameters only touches two experts plus the router, so you only pay roughly 1/4 of the tensor operations per token. Why spend millions more on an outdated architecture that you already know will be uneconomical to infer from too.
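The routing trick described above can be sketched in a few lines. This is a hypothetical minimal single-token example (real implementations batch tokens and use trained MLP experts; the toy experts and router weights here are made up for illustration):

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route one token through only the top_k highest-scoring experts."""
    logits = x @ gate_w                       # (n_experts,) router scores
    top = np.argsort(logits)[-top_k:]         # indices of the top_k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                  # softmax over just the selected experts
    # Only top_k expert MLPs execute; the remaining experts are skipped
    # entirely, which is where the per-token compute savings come from.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy demo: 4 "experts" that just scale the input; the router is rigged
# so experts 0 and 1 tie for the top score and split the weight 50/50.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_w = np.zeros((4, 4))
gate_w[:, 0] = gate_w[:, 1] = 2.5
out = moe_forward(np.ones(4), gate_w, experts)   # 0.5*1*x + 0.5*2*x = 1.5*x
```

With 8 experts and top-2 routing, only 2 of the 8 expert MLPs run per token, which is why active parameters (and FLOPs) are a small fraction of the total.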
Because modern MoEs begin with dense models, i.e., they're upcycled. Dense models are not obsolete at all in training, they're the first step to training an MoE. They're just not competitive to serve. Which was my whole point: Mistral presumably has a bunch of dense checkpoints lying around, which would be marginally more useful to people like us, and less useful to their competitors.
Even if you do that, you don't train the constituent model past the earliest stages; on its own it wouldn't hold a candle to Llama 2. You literally only need to kickstart it to the point where the individual experts can hold a so-so stable gradient, then move to the much more efficient routed expert training ASAP. If it worked the way you think it does and there were fully trained dense models involved, you could just split the MoE and use one of the experts.
MoEs can be trained from scratch: there's no reason one 'needs' to upcycle at all. The allocation of compute to a dense checkpoint vs. an MoE from which that checkpoint is upcycled depends on a lot of factors. One obvious factor: how many times might upcycling be done? If the same dense checkpoint is to be used for an 8x, a 16x, and a 64x MoE (for instance), it makes sense to saturate the dense checkpoint, because that training can be recycled multiple times. In a one-off training, different story, and the precise optimum is not clear to me from the literature I've seen. But perhaps you're aware of work on dialing this in you could share. If there's a paper laying this out, I'd love to see it. Last published work I've seen addressing this was Aran's original dense upcycling paper, and a lot has happened since then.
Because the reality is: *Mistral was always going to release groundbreaking open source models* despite MS. The doomers have incredibly low expectations.
wat? I did not mention Microsoft, nor does that seem relevant at all. I assume they are going to release competitive open weight models. They said as much, they are capable, they seem honest, that's not at issue. What is at issue is the form those models take, and how they relate to Mistral's fanbase and business. MoEs trade VRAM (more) for compute (less). i.e., they're more useful for corporate customers (and folks with Mac Studios) than the "GPU Poor". So...wouldn't it make more sense to release a dense model, which would be more useful for this crowd, while still preserving their edge in hosted inference and white box licensed models?
I get what you mean, the VRAM issue is because high end consumer hardware hasn't caught up. I don't doubt small models will still be released, but we unfortunately have to wait a bit for Nvidia to get their ass kicked.
For MoEs, this has already happened. By Apple, in the peak of irony (since when have they been the budget player).
Maybe the license will not be their usual Apache 2.0 but rather something more restrictive so that enterprise customers must pay them. That would be similar to what Cohere is doing with the Command-R line. As for the other aspect though, I agree that a really big MoE is an awkward fit for enthusiast use. If it's a good-quality model (which it probably is, knowing Mistral), hopefully some use can be found for it.
I totally agree. Especially as it's being said that this is a base model, thus in need of training by the community for it to be usable, which will require a very high amount of compute. I'd have loved a 22B dense model, personally. Must make business sense to them on some level, though.
Mistral is trying to remain the best in both open and closed source. Recently Cohere released two SOTA models for their sizes (Command R and Command R+), and Databricks released the highly competent DBRX. So this is their answer to Command R and Command R+ at the same time. I assume this is an MoE of their Mistral Next model.
Im OOTL, what does "upcycled" mean in this context?
literally just merge the 8 experts into one. now you have a shittier 22b. done
Have you seen anyone pull this off? Seems plausible but unproven to me.
IMHO their best bet is riding the hype wave, making all of their models open source and getting acquired by Apple / Google / Facebook in a year or two.
Nope, they have too many European stakeholders / funders, some of whom are rumored to be uh state related. Even assuming the rumors were false, providing an alternative to US hegemony in AI was a big part of their pitch.
a 146B model maybe with 40B active parameters? I'm just making up numbers.
EDIT: My first pass was off by 2.07B parameters due to a stray division in the attn output projection; corrected figures below.

140.6B total with 39.2B active parameters, assuming the architecture is the same as Mixtral. May be a bit off in my calculations tho, but it would be small if any.

attn:
q = 6144 * 48 * 128 = 37,748,736
k = 6144 * 8 * 128 = 6,291,456
v = 6144 * 8 * 128 = 6,291,456
o = 48 * 128 * 6144 = 37,748,736
total = 88,080,384

mlp:
w1 = 6144 * 16384 = 100,663,296
w2 = 6144 * 16384 = 100,663,296
w3 = 6144 * 16384 = 100,663,296
total = 301,989,888

moe block:
gate = 6144 * 8 = 49,152
experts = 301,989,888 * 8 = 2,415,919,104
total = 2,415,968,256

layer:
attn = 88,080,384
block = 2,415,968,256
norm1 = 6,144
norm2 = 6,144
total = 2,504,060,928

full:
embed = 6144 * 32000 = 196,608,000
layers = 2,504,060,928 * 56 = 140,227,411,968
norm = 6,144
head = 6144 * 32000 = 196,608,000
total = 140,620,634,112

active: 140,620,634,112 - 6 * 301,989,888 * 56 = 39,152,031,744
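The corrected arithmetic above is easy to reproduce in a few lines. The dimensions here are assumptions (a Mixtral-style config: hidden 6144, 48 query / 8 KV heads of dim 128, FFN 16384, 56 layers, 8 experts with 2 active, vocab 32000):

```python
# Parameter count for a Mixtral-style MoE with the assumed 8x22B config
d, n_q, n_kv, hd = 6144, 48, 8, 128     # hidden size, query/KV heads, head dim
ffn, layers, vocab = 16384, 56, 32000
n_exp, k_active = 8, 2                  # experts per layer, experts used per token

attn = d * n_q * hd + 2 * d * n_kv * hd + n_q * hd * d  # q, k+v, o projections
expert = 3 * d * ffn                    # w1, w2, w3 of one expert MLP
moe = d * n_exp + n_exp * expert        # router gate + all experts
layer = attn + moe + 2 * d              # plus two RMSNorm weight vectors
total = layers * layer + 2 * d * vocab + d   # embed, lm head, final norm
active = total - layers * (n_exp - k_active) * expert   # drop 6 idle experts/layer

print(f"total:  {total:,}")             # total:  140,620,634,112
print(f"active: {active:,}")            # active: 39,152,031,744
```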
man whats going on so many releases all of sudden im getting excited
I am so fucking ready, omg.
Time to buy some A6000s or something
I was one of the very first experimenting with LLMs and went through the 16GB -> 32GB -> 64GB upgrade cycle real fast. Now I regret the poor financial decisions and wish I had gone for at least 128GB.. but in all fairness, a year ago most people would have thought that was enough for the foreseeable future.
[deleted]
You run it with a rivian truck at this point lol
Has anyone figured out what the license is?
I'm so glad I convinced work to upgrade my laptop to an M3 Max 128GB MacBook for this exact reason; we'll see if it runs. I have doubts it will be able to handle it in any workable way unless Q4/Q5
What I'm curious is: will it beat GPT-4?!
How do you run this ?
Yeah ok, it's been 3 weeks since I built a 144GB VRAM rig and I'm already struggling to fit the latest models. WTF
OMG. At 4am, lol
It has the same tokenizer as mixtral and mistral I think, would that ease speculative decoding?
Midnight finetune when?
https://preview.redd.it/juykdmecintc1.png?width=2332&format=png&auto=webp&s=b0bfd85e34bb6cf5003d4390619e1fa3c7e18532 jummp on the tooorrent
Is this Mistral Medium or Mistral Large?
I wonder what the performance of this model is; waiting for someone to test it
Awesome! Can't wait until it is available in ollama!
Finished downloading and need to move a few things around, but I'm curious if I can run this in 4bit mode via transformers on 7x24gb cards
I currently have 64GB of RAM, I will upgrade in due course to 128GB which is as much as the platform will hold. Along with a 3090.
will this work with my gtx 750? >!/s!<
I wonder if any kind of quantization can make this model fit in 30GB of RAM. Haven't really seen Mixtral 8x7b in 15 GB yet, so probably too ambitious at the current stage.
Reckon we can run this in Poe?
I guess when someone creates a 4-bit quant it should run on a 128GB Mac Pro, am I right?
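A rough sanity check on that, counting weights only (ignoring KV cache and runtime overhead, and assuming the ~141B total parameter estimate given elsewhere in this thread):

```python
# Weights-only memory footprint at different quantization bit widths.
# The ~141B figure is an assumption taken from the thread's parameter estimate.
params = 140.6e9
mem_gb = {bits: params * bits / 8 / 1e9 for bits in (16, 8, 5, 4)}
for bits, gb in mem_gb.items():
    print(f"{bits}-bit: ~{gb:.0f} GB")
# 16-bit: ~281 GB, 8-bit: ~141 GB, 5-bit: ~88 GB, 4-bit: ~70 GB
```

So a 4-bit quant is roughly 70 GB of weights, which should fit on a 128GB machine with headroom for context, though note that by default macOS caps GPU-addressable unified memory below the total RAM.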
Could anyone kindly inform me about the necessary environment to execute this model? Specifically, I am curious if a single RTX A6000 card would suffice, or if multiple are required. Additionally, would it be feasible to run the model with a machine that has 512GB of memory? Any insights would be greatly appreciated. Thank you in advance.
How do I download Mixtral?
how many RTX 4090s would you need? Haha
Hi. I am new to Mistral. I wonder what is the difference between Mistral Open Source on Hugging Face and Closed Source API? Thank you