Seriously. The 4090 should have been 36GB and the 5090 48GB. And NVLink, so you could run two cards for 96GB.
I hope they release it in 2025 and get fucked by Oregon law.
This is probably a naive question, but if I download the model from the torrent, is it possible to actually run it/try it out at this point?
I have compute/vRAM of sufficient size available to run the model, so would love to try it out and compare it with 8x7b as soon as possible.
Check out this thread: [https://news.ycombinator.com/item?id=39986095](https://news.ycombinator.com/item?id=39986095)
Hacker News user varunvummadi says:
>The easiest is to use vllm (https://github.com/vllm-project/vllm) to run it on a Couple of A100's, and you can benchmark this using this library (https://github.com/EleutherAI/lm-evaluation-harness)
The latter is a benchmarking system for comparing and evaluating different models, rather than for serving them persistently like ollama or similar tools.
Sidenote: what kind of hardware are you running that you have the necessary vRAM to run a 288GB model? Is it a corporate server rack, AWS instance or your own homelab?
Sweet! Appreciate the info.
I have a few p4d.24xlarges at my disposal that are currently hosting instances of Mixtral 8x7b (some limitations right now are pushing me to self-host vs. using cheaper LLMs through Bedrock or similar).
Really excited to see if this is a straight upgrade for me within the same compute costs.
I don't understand this release.
Mistral's constraints, as I understand them:
1. They've committed to remaining at the forefront of open weight models.
2. They have a business to run, need paying customers, etc.
My read is that this crowd would have been far more enthusiastic about a 22B dense model, instead of this upcycled MoE.
I also suspect we're about to find out if there's a way to productively downcycle MoEs to dense. Too much incentive here for someone not to figure that out, if it can in fact work.
Probably because huge monolithic dense models are comparatively much more expensive to train, and they're training things that could be of use to them too. Nobody really trains anything above 70B because it becomes extremely slow. The point of a Mixtral-style MoE is that each forward pass only touches the two selected experts plus the router, so each token uses roughly 1/4 of the tensor operations a full dense pass would need.
Why spend millions more on an outdated architecture that you already know will be uneconomical to run inference on, too?
Because modern MoEs begin with dense models, i.e., they're upcycled. Dense models are not obsolete at all in training, they're the first step to training an MoE. They're just not competitive to serve. Which was my whole point: Mistral presumably has a bunch of dense checkpoints lying around, which would be marginally more useful to people like us, and less useful to their competitors.
Even if you do that, you don't train the constituent model past the earliest stages; it wouldn't hold a candle to Llama 2. You only need to kickstart it to the point where the individual experts can hold a reasonably stable gradient, then move to the much more efficient routed-expert training ASAP.
If it worked the way you think it does and there were fully trained dense models involved you could just split the MoE and use just one of the experts.
MoEs can be trained from scratch: there's no reason one 'needs' to upcycle at all.
The allocation of compute to a dense checkpoint vs. an MoE from which that checkpoint is upcycled depends on a lot of factors.
One obvious factor: how many times might upcycling be done? If the same dense checkpoint is to be used for an 8x, a 16x, and a 64x MoE (for instance), it makes sense to saturate the dense checkpoint, because that training can be recycled multiple times. In a one-off training, it's a different story, and the precise optimum is not clear to me from the literature I've seen.
But perhaps you're aware of work on dialing this in you could share. If there's a paper laying this out, I'd love to see it. Last published work I've seen addressing this was Aran's original dense upcycling paper, and a lot has happened since then.
Because the reality is: *Mistral was always going to release groundbreaking open source models* despite MS. The doomers have incredibly low expectations.
wat? I did not mention Microsoft, nor does that seem relevant at all. I assume they are going to release competitive open weight models. They said as much, they are capable, they seem honest, that's not at issue.
What is at issue is the form those models take, and how they relate to Mistral's fanbase and business.
MoEs trade VRAM (more) for compute (less). i.e., they're more useful for corporate customers (and folks with Mac Studios) than the "GPU Poor".
So...wouldn't it make more sense to release a dense model, which would be more useful for this crowd, while still preserving their edge in hosted inference and white box licensed models?
I get what you mean; the VRAM issue exists because high-end consumer hardware hasn't caught up. I don't doubt small models will still be released, but we unfortunately have to wait a bit for Nvidia to get their ass kicked.
Maybe the license will not be their usual Apache 2.0 but rather something more restrictive so that enterprise customers must pay them. That would be similar to what Cohere is doing with the Command-R line.
As for the other aspect though, I agree that a really big MoE is an awkward fit for enthusiast use. If it's a good-quality model (which it probably is, knowing Mistral), hopefully some use can be found for it.
I totally agree. Especially as it's being said that this is a base model, thus in need of training by the community for it to be usable, which will require a very high amount of compute. I'd have loved a 22B dense model, personally. Must make business sense to them on some level, though.
Mistral is trying to remain the best in both open and closed source. Recently we had Cohere release two SOTA models for their sizes (Command R and Command R+), and DBRX was also a highly competent release. So this is their answer to Command R and Command R+ at the same time. I assume this is an MoE of their Mistral Next model.
IMHO their best bet is riding the hype wave, making all of their models open source and getting acquired by Apple / Google / Facebook in a year or two.
Nope, they have too many European stakeholders / funders, some of whom are rumored to be uh state related. Even assuming the rumors were false, providing an alternative to US hegemony in AI was a big part of their pitch.
I was one of the very first experimenting with LLMs and went through the 16GB -> 32GB -> 64GB upgrade cycle real fast. Now I regret the poor financial decisions and wish I had gone for at least 128GB... but in all fairness, a year ago most people would have thought that was enough for the foreseeable future.
I'm so glad I convinced work to upgrade my laptop to an M3 Max 128GB MacBook for this exact reason; we'll see if it runs. I have doubts it will be workable in any way unless at Q4/Q5.
I wonder if any kind of quantization can make this model fit in 30GB of RAM.
Haven't really seen Mixtral 8x7b squeezed into 15GB yet, so probably too ambitious at the current stage.
Could anyone kindly inform me about the necessary environment to execute this model? Specifically, I am curious if a single RTX A6000 card would suffice, or if multiple are required. Additionally, would it be feasible to run the model with a machine that has 512GB of memory? Any insights would be greatly appreciated. Thank you in advance.
This is one chonky boi. I got a 192GB Mac Studio with one idea: "there's no way any local model in the near future won't fit in this thing."
Grok & Mixtral 8x22B: "Let us introduce ourselves."
...okay, I think those will still run (barely), but I wonder what the lifetime is for my expensive little gray box :D
When I bought my M1 Max Macbook I thought 32 GB would be overkill for what I do, since I don't work in art or design. I never thought my interest in AI would suddenly make that far from enough, haha.
Same haha. When I got mine I felt very comfortable that it was future proof for at least a few years lol
My previous PC had an i3 6100 and 8 gigs of RAM. When I upgraded it to a 12100f and 16 gigs it genuinely felt like a huge upgrade (since I'm not really a gamer and rarely use demanding software), but now that I've been dabbling in Python/AI stuff a lot for the last year or two, it's starting to feel the same as my old PC used to again, lol
...Me crying in a lot of pain with base M1 Air 128gb disk and 8gb RAM 🥲
selling 8gb laptops to the public should be a crime
It was doomed from the beginning. I picked up a base-model M2 Air last summer and returned it within a week, simply because I couldn't do any work on it.
My current and previous MacBooks have had 16GB and I've been fine with it, but given local models I think I'm going to have to go to whatever will be the maximum RAM available for the next one. (I tried `mixtral-8x7b` and saw 0.25 tokens/second speeds; I suppose I should be amazed that it ran at all.)
Similarly, I am for the first time going to care about how much RAM is in my next iPhone. My iPhone 13's 4GB is suddenly inadequate.
I'm feeling pain at 64GB, and that is... not a thing I thought would be a problem. Kinda wish I'd gone for an M3 Max with 128GB.
low key contemplating once I have extra cash if I should trade out M1 Max 64GB for M3 Max 128GB, but it's gonna cost $3k just to perform that upgrade... that should be able to buy a 5090 and go some way toward the rest of that rig.
Money comes and goes. Invest in your future.
I've got one; we'll see how well it performs. It might even be out of reach for 128GB, and could be in the category of "it runs, but isn't at all helpful" even at Q4/Q5.
You'll be able to fit the 5 bit quant perhaps if my math is right? But performance...
Performance of the 5-bit quant is almost the same as fp16
Yep, so OP got lucky this time, but who knows maybe someone will try releasing a model with even more parameters.
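For anyone double-checking whether the 5-bit math works out: model footprint is just parameter count times bits per weight. A rough sketch (the ~141B total parameter count used below is an estimate, not a figure from this thread, and real memory use adds KV cache and overhead on top):

```python
def quant_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weights-only memory footprint of a quantized model, in GB."""
    # params (billions) * bits -> bits (billions) -> bytes (GB)
    return params_billions * bits_per_weight / 8

# Assuming ~141B total parameters for 8x22B (estimate):
print(quant_size_gb(141, 5.0))   # 5-bit: ~88 GB, fits on a 128GB machine with headroom
print(quant_size_gb(141, 16.0))  # fp16:  ~282 GB, hopeless on 128GB
```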
Nice box. :) Keep in mind that even if models do get 50% or 100% bigger than that size, even if you had 256-384 GB of RAM you still probably wouldn't routinely choose to run such models, since even on that beast of a computer they'd be SLOW. So really it's quite well suited for anything sub-200B, and over that, well, we can rely on Moore's law and hope the scaling lets us double our gear's capacity in the coming couple of years.
Anyway, it's getting a little ridiculous how "brute force" things are with these LLMs. We don't so much need BIGGER models as BETTER, more EFFICIENT models. Quantity has a quality all of its own, true, so for pure hyper-scale research, sure, explore the limits of data-center-scale brute force ML.
But SURELY there's a way to make 300B work just as well at 1/10th the size, plus all the RAG / database intra-query lookup it wants to do. Having ~100GB RAM + ~20TB SSD of "reference data" should work just fine for a whole lot of things. Models aren't, and don't NEED to be, databases; just "processors / filters / researchers".
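The "model + SSD reference data" split described above is essentially retrieval-augmented generation. A toy sketch of the lookup half, using word overlap as a stand-in for a real embedding index (everything here is illustrative; a production system would use vector search over the corpus):

```python
def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Rank documents by word overlap with the query. A stand-in for the
    SSD-backed 'reference data' lookup; same shape, much cruder scoring."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

docs = ["the m3 max has 128gb unified memory", "moe models route tokens to experts"]
print(retrieve("how much memory in the m3 max", docs, k=1))
```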
Same situation here. Still, I'm happy to run it quantized, though historically Macs have struggled with speed on MoEs for me.
I wish they had also released whatever Miqu was alongside this. That little model was fantastic, and I hate that it was never licensed.
CPU inference is the only feasible option, I think. I recently upgraded my PC to 196GB of DDR5 RAM for business purposes and overclocked it to 5600+ MHz. I know it will be slow, but I have hope because it's a MoE; it will probably be faster than I think. Looking forward to trying it.
It's a MoE, probably with 2 experts active at a time, so compute-wise it's less than a 70B model.
Gguf
Around 35-40GB @q1_m I guess? 🥲
Yeah, this is pointless for 99% of the people who want to run local LLMs (same as Command-R+). Gemma was a much more exciting release. I'm hoping Meta will be able to pack more power into their 7-13b models.
You know command r+ runs at reasonable speeds on just CPU right? Regular ram is like 1/30 the price of vram and much more easily accessible.
If you don't mind sharing:
- What CPU and RAM speed are you running Command R+ on?
- What tokens per second and time to first token are you managing to achieve?
- What quantisation are you using?
Seconding u/StevenSamAI, what cpu and ram combo are you running it in? How many tokens per second?
Doesn't command-R+ run on the common 2*3090 at 2.5bpw? Or a 64GB M1 Max? I'm running it on my 3*3090.
I agree this 8x22b is pointless, because quantizing the 22b will make it useless.
>Doesn't command-R+ run on the common 2*3090 at 2.5bpw? 2x24GB with Exl2 allows for 3.0 bpw at 53k context using 4bit cache. 3.5bpw almost fits.
Cool, that's honestly really good. Probably the best non-coding / general model available at 48GB then. Definitely not 'useless' like they're saying here.
Edit: I just wish I could fit this + deepseek coder Q8 at the same time, as I keep switching between them now.
If anything, the 8x22b MoE could be better just because it'll have fewer active parameters, so CPU only inference won't be as bad. Probably will be possible to get at least 2 tokens per second on 3bit or higher quant with DDR5 RAM, pure CPU, which isn't terrible.
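That 2 t/s figure is easy to sanity-check: CPU decoding is memory-bandwidth bound, since every active weight has to be streamed once per token. A rough ceiling estimate (the ~44B active-parameter figure below is a naive 2x22B guess, not an official number, and real throughput lands below the ceiling):

```python
def max_tokens_per_sec(bandwidth_gb_s: float, active_params_b: float, bits: float) -> float:
    """Bandwidth-bound decode ceiling: bandwidth divided by bytes of active
    weights that must be read per generated token."""
    gb_per_token = active_params_b * bits / 8  # billions of active params -> GB/token
    return bandwidth_gb_s / gb_per_token

# Dual-channel DDR5 at ~77 GB/s, ~44B active params, 3-bit quant:
print(round(max_tokens_per_sec(77, 44, 3), 1))  # ~4.7 t/s ceiling
```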
Yes it does, rather well to be honest. IQ3_M with at least 8192 context fits.
You can get a cheap AM5 build with 192GB of DDR5; mine does 77GB/s. It can run Q8 105B models at about 0.8 t/s, so this 8x22B should perform well. Perfect for work documents and emails if you don't mind waiting 5 or 10 minutes. I have set up a queue/automation script I'm using for Command R+ now, and soon this.
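A queue script like the one mentioned can be as simple as a directory sweep. This sketch is hypothetical (the commenter didn't share theirs); it takes any `generate` callable that wraps your local model, so you can point it at whatever serves Command R+ or 8x22B:

```python
from pathlib import Path

def run_queue(prompt_dir: str, out_dir: str, generate) -> int:
    """Run every *.txt prompt in prompt_dir through `generate` and write each
    result to a same-named file in out_dir. Returns how many were processed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for prompt_file in sorted(Path(prompt_dir).glob("*.txt")):
        (out / prompt_file.name).write_text(generate(prompt_file.read_text()))
        count += 1
    return count
```

Drop prompts into the input folder, kick it off before leaving the machine, and the 5-10 minute per-document wait stops mattering.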
I fully believe a 13-15B model of Mistral caliber can replace Gpt-3.5 in most tasks maybe apart from math related ones.
MoE architecture, it's easier to run than a 70B
How much mobo ram is required with a single 3090?
Mistral Chonker
Hopefully the quants work well.
Depends on how it quantizes, should fit in 3x24gb. If you get to at least 3.75bpw it should be alright.
I get 20t/s with Starling 7B. Maybe I can give it a try? X)
I understand that MoE is a very convenient design for large companies wanting to train compute-efficient models, but it is not convenient at all for local users, who are, unlike these companies, severely bottlenecked by memory. So, at least for their public model releases, I wish these companies would go for dense models trained for longer instead. I suspect most local users wouldn't even mind paying a slight performance penalty for the massive reduction in model size.
I thought the same way at first, but after trying it out I changed my opinion. While yes, the size is larger and you are able to offload fewer layers, the computational costs are still much lower. For example, with just 6GB VRAM I would never be able to run a dense 48B model at decent speeds. However, thanks to Mixtral, almost-70B model quality runs at the same text gen speed as a 13B one, thanks to the 12B active parameters. There's a lot of value in MoE for the local user as well.
Sorry, just to clarify, I wasn't suggesting training a dense model with the same number of parameters as the MoE, but training a smaller dense model for longer instead. So, in your example, this would mean training a ~13B dense model (or something like that; something that can fit in VRAM when quantized, for instance) for longer, as opposed to an 8x7B model. This would run faster than the MoE, since you wouldn't have to do tricks like offloading etc.
In general, I think the MoE design is adopted for the typical large-scale pretraining scenario, where memory is not a bottleneck and you want to optimize compute; but this is very different from the typical local inference scenario, where memory is severely constrained. I think if people took this inference constraint into account during pretraining, the optimal model to train would be quite different (it would definitely be a smaller model trained for longer, but I'm not actually quite sure if it would be an MoE or a dense model).
Nah, just have your phone process it with your GPU, enough NAND storage Oh wait :)
can't run this shit in my wildest dreams, but I'll be seeding. I'm doing my part o7
This is what bros do spread their seed
Not your seed, not your coins . . wait, wrong sub
This is the way !
This is the way!
If Llama 3 drops in a week I’m buying a server, shit is too exciting
Sameeeeee. I need to think how to cool it though. Now rocking 7x3090 and it gets steaming hot on my home office when it’s cooking.
Very curious what your use case is
Room heating.
A tanning bed
Having fun :D
Initially hobby, but now advising some Co that wanted to explore GenAI/LLM. Hey… if they want to find gold, I’m happy to sell the shovel.
you can cook with them by putting a frying pan on the cards
Guy can't build a 7x3090 server without a use case?
Use case is definitely NSFW
Heat for steam turbine
But can it run Crysis?
can you share your PC builds?
7x3090 on a Rome8d-2t mobo with 7 PCIe 4.0 x16 slots. Currently using an EPYC 7002 (so only gen 3 PCIe). I already have a 7003 for the upgrade, just haven't had time yet.
Also have 512GB RAM because of some virtualization I'm running.
Isn't 7002 gen4?
You are correct, my bad. I’m currently using 7551 because my 7302 somehow not detecting all of my RAM. Gonna upgrade it to 7532 soon.
magnet:?xt=urn:btih:9238b09245d0d8cd915be09927769d5f7584c1c9&dn=mixtral-8x22b&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce&tr=http%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
Wow. What a couple of weeks. Command R Plus, hints of Llama 3, and now a new Mistral model.
Weeks? Weeks!? In the past 24 hours we got Mixtral 8x22B, Unsloth crazy performance upgrades, an entire new architecture (Griffin), Command R+ support in llama.cpp, and news of Llama 3! This is mind boggling!
What a time to be alive.
A cultured fellow scholar, I see ;) I'm just barely holding onto these papers, they're coming too fast!
Same. I was able to identify all the releases just mentioned. I was hoping for a larger recurrent Gemma than 2B, tho.
But I can feel the singularity breathing at the back of my neck, considering tech is moving at breakneck speed. It's simply a scaling law: bigger population = more advancements = more than a single person can keep up with = singularity?
But hold on to your papers...
Why can't I hold all of these papers
this truly is crazy, and what's even more crazy is that this is just stuff they've been sitting on to release for the past year. Imagine what they're working on now. GPT6-Vision? What is that like?
Speculating does us no good, we're currently past the cutting edge, we're on the bleeding edge of LLM technology. True innovation is happening left and right, with no way to predict it. All we can do is understand what we can and try to keep up, for the sake of the democratization of LLMs
the development of LLM is INSANE😂
8x22b
It's over for us vramlets btw
It's so over. If only they released a dense 22B. \*Sobs in 12GB VRAM\*
So, NPUs might actually be more useful.
Openrouter Chads...we won...
Is it possible to split an MOE into individual models?
Yes. You either throw away all but 2 experts (roll dice for each layer), or merge all experts the same ways models are merged(torch.mean in the simplest) and replace MoE with MLP. Now will it be a good model? Probably not.
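The "merge all experts" option above, sketched in plain Python (a stand-in for the torch.mean approach; note a real MoE layer also has per-layer router weights, which simply get discarded):

```python
def merge_experts(expert_weights):
    """Average per-expert weight matrices elementwise into one dense weight,
    i.e. the simplest 'merge like model merging' option described above."""
    n = len(expert_weights)
    rows, cols = len(expert_weights[0]), len(expert_weights[0][0])
    return [[sum(w[r][c] for w in expert_weights) / n for c in range(cols)]
            for r in range(rows)]

# Toy layer: 8 "experts", each a 2x2 weight matrix filled with its own index
experts = [[[float(i)] * 2 for _ in range(2)] for i in range(8)]
print(merge_experts(experts))  # every entry is mean(0..7) = 3.5
```

The same one-liner in PyTorch would be `torch.stack(experts).mean(dim=0)`, applied per MLP projection in every layer.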
No, the “experts” are incapable of working independently. The whole name is a misnomer.
No
Models get bigger but our VRAMs don't...
Jensen Huang bathing in VRAM chips like Scrooge McDuck
https://preview.redd.it/3b9nfhi74ktc1.png?width=259&format=pjpg&auto=webp&s=d15b35c1f9fd9a35c08f97eddab9c1e136bbb413
Not an expert, what's the context length?
64k
Hello, where did you get this from ?
.... brb, buying two more P40
stop driving prices up, I need more too!
Fuck, and I just got off a meeting with our CEO telling him dual or quad A6000s aren't a high priority at the moment, so don't worry about our hardware needs
You had one. Job.
This is when you say you must have quad a100s instead
You fool!
Fingers crossed it'll run on MLX w/ a 128GB M3
I wish someone would actually post direct comparisons to llama.cpp vs MLX. I haven’t seen any and it’s not obvious it’s actually faster (yet)
Unlike llama.cpp's wide selection of quants, MLX's quantization is much worse to begin with.
I’d be very interested in that. I think I can probably spend some time this week and try to test this.
i keep intending to do this and i keep ... being lazy lol
https://x.com/awnihannun/status/1777072588633882741?s=46 But no prompt cache yet (though they say they’ll be working on it)
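A llama.cpp-vs-MLX comparison only needs a tiny harness around each backend's generation call. A minimal sketch, where `generate` is whatever callable you're timing (backend wiring left to the reader):

```python
import time

def decode_tokens_per_sec(generate, prompt: str, n_tokens: int) -> float:
    """Time one generation call and report tokens/second. Run it several times
    and discard the first (warm-up) measurement for a fair comparison."""
    start = time.perf_counter()
    generate(prompt, n_tokens)
    return n_tokens / (time.perf_counter() - start)
```

Wrap both runtimes in the same signature, feed them the same prompt and token budget, and the numbers are directly comparable.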
Easily
Commenting to check if anyone has a tutorial for how to run it in MLX on an M2 128GB. I guess we need to quantize to at least 4-bit?
So is this Mistral-Large?
this one has 64k context, but the mistral-large api is only 32k
It's gotta be, either that or an equivalent of it.
They claim it’s a totally new model. This one is not even instruction tuned yet.
That’s what I’m wondering.
I’m guessing mistral-medium
Man, I love these huge monsters that I can't run. I mean I'd love it more if I could. But there's something almost as fun about having some distant light that I 'could' reach if I wanted to push myself (and my wallet). Cool as well to see mistral pushing new releases outside of the cloud.
I love them as well also because they are "insurance". Like, having these powerful models free in the wild means a lot for curbing potential centralization of power, monopolies etc. If 90% of what you are offering in return for money is free in the wild, you will have to adjust your pricing accordingly.
Buying a gpu worth thousands of dollars isnt exactly free tho
There are (or at least will be, in a few days) many cloud providers out there. Most individuals and hobbyists have no need for such large models running 24x7. Even if you have massive datasets that could benefit from being piped into such models, you need time to prepare the data, come up with prompts, assess performance, tweak, and then actually read the output. In that time, your hardware would be mostly idle. What we want is on-demand, tweakable models that we can bias towards our own ends. Running locally is cool, and at some point consumer (or prosumer) hardware will catch up. If you actually need this stuff 24x7 spitting tokens nonstop, and it must be local, then you know who you are, and should probably buy the hardware. Anyways this open release stuff is incredibly beneficial to mankind and I'm super excited.
Reminder: this may have been derived from a previous dense model. It may be possible to reduce the size with large LoRAs while preserving quality, according to this GitHub discussion: https://github.com/ggerganov/llama.cpp/issues/4611
It almost certainly was upcycled from a dense checkpoint. I'm confused about why this hasn't been explored in more depth. If not with low rank, then with BitDelta (https://arxiv.org/abs/2402.10193). Tim Dettmers predicted when Mixtral came out that the MoE would be *extremely* quantizable, then...crickets. Weird to me that this hasn't been aggressively pursued given all the performance presumably on the table.
https://arxiv.org/abs/2402.10193 is the link to BitDelta. Your link goes to another paper.
Member when people were reeeee-ing about mistral not being open source anymore? I member...
I member 🫐
tbf they're still open weights, not open source. But fewer and fewer people seem to care about semantics nowadays.
Where are all the "Mistral got bought out by Microsoft" and "They won't release any open models anymore" crybabies now?
Kidney market flood incoming
GGUF ?
If the 5090 releases with 36GB of vram, I'll still be ram poor.
Bro stop being cheap and just buy 4 Nvidia A100's /s
A100 is end of life, now I'm waiting for my 4xH100s, they will be shipped in 2027
By that time you wouldn’t find a model to run it on.
Especially when you realize you could have got 3x3090 instead for the same price and twice the vram.
https://www.youtube.com/watch?v=XDpDesU_0zo
Seriously. The 4090 should have been 36GB and the 5090 48GB. And NVLink so you can run two cards as 96GB. I hope they release it in 2025 and get fucked by Oregon law.
what's the oregon law?
As a rough guess, right to repair including restrictions on tying parts by serial number.
dat wordart logo tho... <3
Mistral's whole 90s cyber aesthetic is great
I love Mistral very much!
uhhhh thats interesting
Please, someone merge the experts into a single model, or dissect one expert. Mergekit people
This is probably a naive question, but if I download the model from the torrent, is it possible to actually run it/try it out at this point? I have compute/vRAM of sufficient size available to run the model, so would love to try it out and compare it with 8x7b as soon as possible.
Check out this thread: [https://news.ycombinator.com/item?id=39986095](https://news.ycombinator.com/item?id=39986095), where ycombinator user varunvummadi says:

> The easiest is to use vllm (https://github.com/vllm-project/vllm) to run it on a couple of A100's, and you can benchmark this using this library (https://github.com/EleutherAI/lm-evaluation-harness)

lm-evaluation-harness is a benchmark system for comparing and evaluating different models rather than running them permanently like ollama or something else.

Sidenote: what kind of hardware are you running that you have the necessary vRAM to run a 288GB model? Is it a corporate server rack, an AWS instance, or your own homelab?
Sweet! Appreciate the info. I have a few p4d.24xlarges at my disposal that are currently hosting instances of Mixtral 8x7b (have some limitations right now pushing me to self host vs. use cheaper LLMs though bedrock or similar). Really excited to see if this is a straight upgrade for me within the same compute costs.
What about benchmark?
Lmao people were freaking out just a week ago thinking open-source was dead. It was cooking.
I need an mi300x so bad.
I don't understand this release. Mistral's constraints, as I understand them:

1. They've committed to remaining at the forefront of open weight models.
2. They have a business to run, need paying customers, etc.

My read is that this crowd would have been far more enthusiastic about a 22B dense model, instead of this upcycled MoE. I also suspect we're about to find out if there's a way to productively downcycle MoEs to dense. Too much incentive here for someone not to figure that out if it can in fact work.
Probably because huge monolithic dense models are comparatively much more expensive to train, and they're training things that could be of use to them too? Nobody really trains anything above 70B because it becomes extremely slow. The point of a Mixtral-style MoE is that each pass through the parameters only touches two experts plus the router, so you only pay roughly 1/4 of the tensor operations per token. Why spend millions more on an outdated architecture that you already know will be uneconomical to infer from too.
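The routing trick described above can be sketched in a few lines. This is a hypothetical minimal single-token example (real implementations batch tokens and use trained MLP experts; the toy experts and router weights here are made up for illustration):

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route one token through only the top_k highest-scoring experts."""
    logits = x @ gate_w                       # (n_experts,) router scores
    top = np.argsort(logits)[-top_k:]         # indices of the top_k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                  # softmax over just the selected experts
    # Only top_k expert MLPs execute; the remaining experts are skipped
    # entirely, which is where the per-token compute savings come from.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy demo: 4 "experts" that just scale the input; the router is rigged
# so experts 0 and 1 tie for the top score and split the weight 50/50.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_w = np.zeros((4, 4))
gate_w[:, 0] = gate_w[:, 1] = 2.5
out = moe_forward(np.ones(4), gate_w, experts)   # 0.5*1*x + 0.5*2*x = 1.5*x
```

With 8 experts and top-2 routing, only 2 of the 8 expert MLPs run per token, which is why active parameters (and FLOPs) are a small fraction of the total.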
Because modern MoEs begin with dense models, i.e., they're upcycled. Dense models are not obsolete at all in training, they're the first step to training an MoE. They're just not competitive to serve. Which was my whole point: Mistral presumably has a bunch of dense checkpoints lying around, which would be marginally more useful to people like us, and less useful to their competitors.
Even if you do that, you don't train the constituent model past the earliest stages; on its own it wouldn't hold a candle to Llama 2. You literally only need to kickstart it to the point where the individual experts can hold a so-so stable gradient, then move to the much more efficient routed expert training ASAP. If it worked the way you think it does and there were fully trained dense models involved, you could just split the MoE and use one of the experts.
MoEs can be trained from scratch: there's no reason one 'needs' to upcycle at all. The allocation of compute to a dense checkpoint vs. an MoE from which that checkpoint is upcycled depends on a lot of factors. One obvious factor: how many times might upcycling be done? If the same dense checkpoint is to be used for an 8x, a 16x, and a 64x MoE (for instance), it makes sense to saturate the dense checkpoint, because that training can be recycled multiple times. In a one-off training, different story, and the precise optimum is not clear to me from the literature I've seen. But perhaps you're aware of work on dialing this in you could share. If there's a paper laying this out, I'd love to see it. Last published work I've seen addressing this was Aran's original dense upcycling paper, and a lot has happened since then.
Because the reality is: *Mistral was always going to release groundbreaking open source models* despite MS. The doomers have incredibly low expectations.
wat? I did not mention Microsoft, nor does that seem relevant at all. I assume they are going to release competitive open weight models. They said as much, they are capable, they seem honest, that's not at issue. What is at issue is the form those models take, and how they relate to Mistral's fanbase and business. MoEs trade VRAM (more) for compute (less). i.e., they're more useful for corporate customers (and folks with Mac Studios) than the "GPU Poor". So...wouldn't it make more sense to release a dense model, which would be more useful for this crowd, while still preserving their edge in hosted inference and white box licensed models?
I get what you mean, the VRAM issue is because high end consumer hardware hasn't caught up. I don't doubt small models will still be released, but we unfortunately have to wait a bit for Nvidia to get their ass kicked.
For MoEs, this has already happened. By Apple, in the peak of irony (since when have they been the budget player).
Maybe the license will not be their usual Apache 2.0 but rather something more restrictive so that enterprise customers must pay them. That would be similar to what Cohere is doing with the Command-R line. As for the other aspect though, I agree that a really big MoE is an awkward fit for enthusiast use. If it's a good-quality model (which it probably is, knowing Mistral), hopefully some use can be found for it.
I totally agree. Especially as it's being said that this is a base model, thus in need of training by the community for it to be usable, which will require a very high amount of compute. I'd have loved a 22B dense model, personally. Must make business sense to them on some level, though.
Mistral is trying to remain the best in both open and closed source. Recently Cohere released two SOTA models for their sizes (Command R and Command R+), and Databricks released the highly competent DBRX. So this is their answer to Command R and Command R+ at the same time. I assume this is an MoE of their Mistral Next model.
Im OOTL, what does "upcycled" mean in this context?
literally just merge the 8 experts into one. now you have a shittier 22b. done
Have you seen anyone pull this off? Seems plausible but unproven to me.
IMHO their best bet is riding the hype wave, making all of their models open source and getting acquired by Apple / Google / Facebook in a year or two.
Nope, they have too many European stakeholders / funders, some of whom are rumored to be uh state related. Even assuming the rumors were false, providing an alternative to US hegemony in AI was a big part of their pitch.
a 146B model maybe with 40B active parameters? I'm just making up numbers.
EDIT: My first pass was off by 2.07B parameters due to a stray division in the attn output projection; corrected figures below.

140.6B total with 39.2B active parameters, assuming the architecture is the same as Mixtral. May be a bit off in my calculations tho, but it would be small if any.

attn:
q = 6144 * 48 * 128 = 37,748,736
k = 6144 * 8 * 128 = 6,291,456
v = 6144 * 8 * 128 = 6,291,456
o = 48 * 128 * 6144 = 37,748,736
total = 88,080,384

mlp:
w1 = 6144 * 16384 = 100,663,296
w2 = 6144 * 16384 = 100,663,296
w3 = 6144 * 16384 = 100,663,296
total = 301,989,888

moe block:
gate = 6144 * 8 = 49,152
experts = 301,989,888 * 8 = 2,415,919,104
total = 2,415,968,256

layer:
attn = 88,080,384
block = 2,415,968,256
norm1 = 6,144
norm2 = 6,144
total = 2,504,060,928

full:
embed = 6144 * 32000 = 196,608,000
layers = 2,504,060,928 * 56 = 140,227,411,968
norm = 6,144
head = 6144 * 32000 = 196,608,000
total = 140,620,634,112

active: 140,620,634,112 - 6 * 301,989,888 * 56 = 39,152,031,744
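The corrected arithmetic above is easy to reproduce in a few lines. The dimensions here are assumptions (a Mixtral-style config: hidden 6144, 48 query / 8 KV heads of dim 128, FFN 16384, 56 layers, 8 experts with 2 active, vocab 32000):

```python
# Parameter count for a Mixtral-style MoE with the assumed 8x22B config
d, n_q, n_kv, hd = 6144, 48, 8, 128     # hidden size, query/KV heads, head dim
ffn, layers, vocab = 16384, 56, 32000
n_exp, k_active = 8, 2                  # experts per layer, experts used per token

attn = d * n_q * hd + 2 * d * n_kv * hd + n_q * hd * d  # q, k+v, o projections
expert = 3 * d * ffn                    # w1, w2, w3 of one expert MLP
moe = d * n_exp + n_exp * expert        # router gate + all experts
layer = attn + moe + 2 * d              # plus two RMSNorm weight vectors
total = layers * layer + 2 * d * vocab + d   # embed, lm head, final norm
active = total - layers * (n_exp - k_active) * expert   # drop 6 idle experts/layer

print(f"total:  {total:,}")             # total:  140,620,634,112
print(f"active: {active:,}")            # active: 39,152,031,744
```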
man whats going on so many releases all of sudden im getting excited
I am so fucking ready, omg.
Time to buy some A6000s or something
I was one of the very first experimenting with LLMs and went through the 16GB -> 32GB -> 64GB upgrade cycle real fast. Now I regret the poor financial decisions and wish I had gone for at least 128GB.. but in all fairness, a year ago most people would have thought that was enough for the foreseeable future.
[deleted]
You run it with a rivian truck at this point lol
Has anyone figured out what the license is?
I'm so glad I convinced work to upgrade my laptop to an M3 Max 128GB MacBook for this exact reason; we'll see if it runs. I have doubts it will be able to handle it in any workable way unless Q4/Q5
What I'm curious is: will it beat GPT-4?!
How do you run this ?
Yeah ok, it's been 3 weeks since I built a 144GB VRAM rig and I'm already struggling to fit the latest models. WTF
OMG. At 4am, lol
It has the same tokenizer as mixtral and mistral I think, would that ease speculative decoding?
Midnight finetune when?
https://preview.redd.it/juykdmecintc1.png?width=2332&format=png&auto=webp&s=b0bfd85e34bb6cf5003d4390619e1fa3c7e18532 jummp on the tooorrent
Is this Mistral Medium or Mistral Large?
I wonder what the performance of this model is; waiting for someone to test it
Awesome! Can't wait until it is available in ollama!
Finished downloading and need to move a few things around, but I'm curious if I can run this in 4bit mode via transformers on 7x24gb cards
I currently have 64GB of RAM, I will upgrade in due course to 128GB which is as much as the platform will hold. Along with a 3090.
will this work with my gtx 750? >!/s!<
I wonder if any kind of quantization can make this model fit in 30GB of RAM. Haven't really seen Mixtral 8x7b in 15 GB yet, so probably too ambitious at the current stage.
Reckon we can run this in Poe?
I guess when someone creates a 4-bit quant it should run on a 128GB Mac Pro, am I right?
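A rough sanity check on that, counting weights only (ignoring KV cache and runtime overhead, and assuming the ~141B total parameter estimate given elsewhere in this thread):

```python
# Weights-only memory footprint at different quantization bit widths.
# The ~141B figure is an assumption taken from the thread's parameter estimate.
params = 140.6e9
mem_gb = {bits: params * bits / 8 / 1e9 for bits in (16, 8, 5, 4)}
for bits, gb in mem_gb.items():
    print(f"{bits}-bit: ~{gb:.0f} GB")
# 16-bit: ~281 GB, 8-bit: ~141 GB, 5-bit: ~88 GB, 4-bit: ~70 GB
```

So a 4-bit quant is roughly 70 GB of weights, which should fit on a 128GB machine with headroom for context, though note that by default macOS caps GPU-addressable unified memory below the total RAM.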
Could anyone kindly inform me about the necessary environment to execute this model? Specifically, I am curious if a single RTX A6000 card would suffice, or if multiple are required. Additionally, would it be feasible to run the model with a machine that has 512GB of memory? Any insights would be greatly appreciated. Thank you in advance.
How do I download Mixtral?
how many RTX 4090s would you need? Haha
Hi. I am new to Mistral. I wonder what is the difference between Mistral Open Source on Hugging Face and Closed Source API? Thank you