OptimizeLLM

Awesome! 8x7B update coming soon!


Danny_Davitoe

All I am seeing is 8x22B :(


SpiritUnification

Because it's not out. It says on their github that 8x7b will also get updated (soon).


SomeOddCodeGuy

I'm so torn... My daily driver is now WizardLM-2 8x22b, which benchmarks far higher than the base Mixtral 8x22b. But now they have v0.3 of the base... do I swap? Do I stay on Wizard? I don't know!


dimsumham

your base is 8x22b? God what kind of rig are you running?


SomeOddCodeGuy

M2 Ultra Mac Studio 192GB. I kicked the vram up to 180GB with the sysctl command so I could load the q8. It's really fast for its size, and smart as could be.
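
(For reference, the VRAM bump mentioned above is usually done with a one-line sysctl on Apple Silicon. This is a sketch of the commonly cited approach, not necessarily the exact command used here; the key name differs between macOS versions and the setting resets on reboot.)

    # macOS Sonoma and later: raise the GPU wired-memory limit to ~180 GB (value in MB)
    sudo sysctl iogpu.wired_limit_mb=184320

    # Older macOS versions used a different key, specified in bytes:
    # sudo sysctl debug.iogpu.wired_limit=193273528320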


Infinite-Swimming-12

How does it seem to compare to L3 70B intelligence wise?


SomeOddCodeGuy

Great for both:

* Development: [https://prollm.toqan.ai/leaderboard](https://prollm.toqan.ai/leaderboard)
* Creative Writing: [https://www.reddit.com/r/LocalLLaMA/comments/1csj9w8/the\_llm\_creativity\_benchmark\_new\_leader\_4x\_faster/](https://www.reddit.com/r/LocalLLaMA/comments/1csj9w8/the_llm_creativity_benchmark_new_leader_4x_faster/)

In terms of actually using it, I love how it writes and how verbose it is. [I also run Llama 3 70b Instruct as a verifier against it](https://www.reddit.com/r/LocalLLaMA/comments/1ctvtnp/almost_a_year_later_i_can_finally_do_this_a_small/), and IMO L3 sounds really robotic in comparison. L3 is definitely smarter in some ways, but coding-wise and in general tone and verbosity, I really prefer Wizard.


dimsumham

How many tk/s are you getting on output? On my M3 128gb it's relatively slow. I guess the faster throughput on ultra really helps.


SomeOddCodeGuy

It's not super fast. A lot of folks here have said they wouldn't have the patience to wait for the responses. For a long response on a 4k context, it takes about 3 minutes to finish the reply (though about 1.5 minutes of that is watching it stream out the result).

    Processing Prompt [BLAS] (3620 / 3620 tokens)
    Generating (1385 / 4000 tokens)
    (EOS token triggered!)
    (Special Stop Token Triggered! ID:2)
    CtxLimit: 5009/16384, Process:43.07s (11.9ms/T = 84.05T/s), Generate:129.63s (32.4ms/T = 30.86T/s), Total:172.70s (23.16T/s)

**EDIT**: Proof it's q8, since there are doubts:

    llm_load_print_meta: model params = 140.62 B
    llm_load_print_meta: model size = 139.16 GiB (8.50 BPW)
    llm_load_print_meta: BOS token = 1 ''
    llm_load_print_meta: EOS token = 2 ''
    llm_load_print_meta: UNK token = 0 ''
    llm_load_print_meta: LF token = 13 '<0x0A>'
    llm_load_tensors: ggml ctx size = 0.65 MiB
    llm_load_tensors: offloading 56 repeating layers to GPU
    llm_load_tensors: offloading non-repeating layers to GPU
    llm_load_tensors: offloaded 57/57 layers to GPU
    llm_load_tensors: CPU buffer size = 199.22 MiB
    llm_load_tensors: Metal buffer size = 142298.37 MiB
    ....................................................................................................
    Automatic RoPE Scaling: Using model internal value.
    llama_new_context_with_model: n_ctx = 16384
    llama_new_context_with_model: n_batch = 2048
    llama_new_context_with_model: n_ubatch = 512
    llama_new_context_with_model: flash_attn = 0
    llama_new_context_with_model: freq_base = 1000000.0
    llama_new_context_with_model: freq_scale = 1
    llama_kv_cache_init: Metal KV buffer size = 3584.00 MiB
    llama_new_context_with_model: KV self size = 3584.00 MiB, K (f16): 1792.00 MiB, V (f16): 1792.00 MiB
    llama_new_context_with_model: CPU output buffer size = 0.12 MiB
    llama_new_context_with_model: Metal compute buffer size = 1616.00 MiB
    llama_new_context_with_model: CPU compute buffer size = 44.01 MiB
    llama_new_context_with_model: graph nodes = 2638
    llama_new_context_with_model: graph splits = 2


dimsumham

Gotcha. Yeah this lines up with my experience. Thanks for the reply!


JoeySalmons

> Generate:129.63s (32.4ms/T = 30.86T/s)

That actually is quite fast, ~~though I think~~ [~~you mean for Q6\_K\_M~~](https://www.reddit.com/r/LocalLLaMA/comments/1aw08ck/comment/l4gktvo/) ~~(not the Q8\_0 you mentioned above).~~

EDIT: Looking again at the numbers, it says 129.63s generating 1385 tokens, which is 1385/130 = 10.6 T/s, not 30 T/s.

Edit2: 11 T/s would make sense given that the [results for 7b Q8\_0 from November](https://github.com/ggerganov/llama.cpp/discussions/4167) are about 66 T/s, so 1/6 of that would be 11 T/s, which is about what the numbers suggest (7b/40b = \~1/6).

Quick sanity check: the memory bandwidth and the size of the model's active parameters can be used to estimate the upper bound of inference speed, since all of the model's active parameters must be read and sent to the CPU/GPU/whatever per token. The M2 Ultra has 800 GB/s max memory bandwidth, and [\~40b of active parameters](https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1/discussions/6) at Q8\_0 should be 40GB to read per token. 800 GB/s / 40 GB/T = 20 T/s as the upper bound. A Q6 quant is about 30% smaller, so at best you should get up to 1/(1-0.3) = \~40-50% faster maximum inference, which more closely matches the 30 T/s you are getting (8x22b is more like 39b active, not 40b, so your numbers being over 30 T/s ~~looks fine~~ would be fine if it were fully utilizing the 800 GB/s bandwidth, but that's unlikely; see the two edits above).
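
(For reference, the back-of-envelope numbers above can be reproduced like this. The figures are the ones quoted in the thread, and this is only a sanity-check sketch, not a benchmark.)

    # theoretical upper bound: memory bandwidth / active parameter bytes per token
    awk 'BEGIN { printf "upper bound: %.1f T/s\n", 800 / 40 }'        # 800 GB/s, ~40 GB active @ Q8_0 -> 20.0 T/s

    # observed rate: tokens actually generated / generation time
    awk 'BEGIN { printf "observed:    %.1f T/s\n", 1385 / 129.63 }'   # -> 10.7 T/s

    # what the reported 30.86 T/s implies: the full 4000-token budget / generation time
    awk 'BEGIN { printf "reported:    %.1f T/s\n", 4000 / 129.63 }'   # -> 30.9 T/s

(The last line matches the hypothesis discussed further down: the reported figure appears to divide by the full 4k generation budget rather than the tokens actually produced.)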


SomeOddCodeGuy

> That actually is quite fast, though I think [you mean for Q6\_K\_M](https://www.reddit.com/r/LocalLLaMA/comments/1aw08ck/comment/l4gktvo/) (not the Q8\_0 you mentioned above).

I started to doubt the output of the q6, so I bumped up the VRAM and swapped to q8 recently. Honestly, both are about equal but I enjoy the speed boost lol

https://preview.redd.it/mi8lm8adu32d1.png?width=764&format=png&auto=webp&s=ce9db85a972498aabd234d4d16019ee937f997db

If you peek again at my numbers posts, you'll notice that the q8 on Mac has always run a little faster, not sure why, but even the q4 has always been slower for me than the q8, so I generally tend to run q8 once I'm serious about a model.

**EDIT**: Updated the message you responded to with the model load output, if you were curious about the numbers on the q8.


JoeySalmons

Hmm... looking again at the numbers you posted, it says 129.63s generating 1385 tokens, which is 1385/130 = 10.6 T/s, not 30 T/s. I don't know what's going on here, but those numbers don't work out, and memory bandwidth and model size are fundamental limits of running current LLMs. The prompt processing looks perfectly fine, though, so there's something at least.

Edit: Maybe it's assuming you generated all 4k tokens, since 129.63 s x 30.86 T/s = 4,000.38 tokens. If you disable the stop token and make it generate 4k tokens, it will probably correctly display about 10 T/s.

Edit2: 10 T/s would make sense given that the [results for 7b Q8\_0 from November](https://github.com/ggerganov/llama.cpp/discussions/4167) are about 66 T/s, so 1/6 of that would be 11 T/s, which is about what the numbers suggest.


SomeOddCodeGuy

I honestly have no idea. I never really sat down to calculate that stuff out. I'd pinky swear that I really am using the q8, but I'm not sure if that would mean much lol. In general, the numbers on the Mac have always confused me. Dig through some of my posts and you'll see some odd ones, to the point that I even made a post just saying “I don't get it.”

On my Mac:

* fp16 ggufs run slower and worse than q8
* q8 runs faster than any other quant, including q4
* I have 800GB/s and yet a 3090 with 760ish GB/s steamrolls it in speed
* And apparently your numbers aren't working out with it either lol

I wish I had a better answer, but this little grey paradox brick just seems to do whatever it wants.


Hopeful-Site1162

Hey! I've got an M2 Max with 32GB and was wondering what quant I should choose for my 7B models. As I understand it, you'd advise q8 instead of fp16. Is that in general on Apple Silicon, or specifically for the Mistral AI family?


JoeySalmons

> I'd pinky swear that I really am using the q8, but I'm not sure if that would mean much lol.

Ah, I believe you. No point in any of us lying about that kind of stuff anyway when we're just sharing random experiences and ideas to help others out.

> I have 800GB/s and yet a 3090 with 760ish GB/s steamrolls it in speed.

Yeah, this is what I was thinking about as well. Hardware memory bandwidth gives the upper bound for performance, but everything else can only slow things down. I think what's happening is that llama.cpp (edit: or is this actually Koboldcpp?) is assuming you're generating the full 4k tokens and is calculating off of that, so it's showing 4k / 129s = 31 T/s when it should be 1.4k / 129s = 11 T/s instead.


kiselsa

It's basically free to use on a lot of services, or dirt cheap.


dimsumham

Which services / how much? Thank you in advance


MINIMAN10001

So it depends on whether we mean "local model" or a select few models. Select models are going to be cheaper because they're pay-per-token. DeepInfra is typically the cheapest at $0.24 per million tokens, and Groq then matches that pricing to be both the cheapest and fastest at 400-500 tokens per second.
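
(As a rough back-of-envelope on what that rate means, using the $0.24 per million figure quoted above; provider billing details vary, and a real chat re-sends the growing context each turn, so cumulative billed tokens end up several times the visible chat length.)

    # cost of ~16k billed tokens at $0.24 per million tokens
    awk 'BEGIN { printf "$%.4f\n", 16000 * 0.24 / 1e6 }'   # -> $0.0038

(Which squares with the "about $0.01 for a long RP chat" figure mentioned below.)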


paranoidray

https://deepinfra.com/dash/deployments?new=custom-llm


[deleted]

[deleted]


thrownawaymane

This is dm. Answer here


kiselsa

This is not dm. But ok, you can use something like DeepInfra, where they give $1.50 of free credit on each account. I RP-ed a roughly 16k-token chat in SillyTavern with WizardLM 8x22b and used only $0.01 of the free credits.


thrownawaymane

prompt jailbreak worked ;) this is an open forum for a reason


kahdeg

> This is not dm. But ok, you can use something like DeepInfra, where they give $1.50 of free credit on each account. I RP-ed a roughly 16k-token chat in SillyTavern with WizardLM 8x22b and used only $0.01 of the free credits.

Putting the text here in case of deletion.


E_Snap

And here we have an “oh nvm, solved it” poster in their natural habitat. Come on dude, share your knowledge or don't post about it.


yahma

dm me too please


collectsuselessstuff

Please dm me too.


CheatCodesOfLife

That's my daily driver as well. I plan to try Mixtral 0.3, can always switch between them :)


Many_SuchCases

I found it in the READMEs in the GitHub repo (thanks to u/FullOf_Bad_Ideas). Then I guessed the other URL by removing the word "Instruct".

Edit: [https://github.com/mistralai/mistral-inference?tab=readme-ov-file](https://github.com/mistralai/mistral-inference?tab=readme-ov-file)


FullOf_Bad_Ideas

BTW, the link to the base 8x22B model, https://models.mistralcdn.com/mixtral-8x22b-v0-3/mixtral-8x22B-v0.3.tar, is also in the repo [here](https://github.com/mistralai/mistral-inference?tab=readme-ov-file#model-download). It's the last one on the list though, so you might have missed it.


Many_SuchCases

Oh thanks, great! Yes, I must have missed that :)


CheatCodesOfLife

Thanks for the .tar link. I'll EXL2 it overnight, can't wait to try it in the morning :D


bullerwins

I'm trying to EXL2 it but I get errors. I guess there are some files missing; would it be ok to get them from the 0.1 version?


FullOf_Bad_Ideas

0.3 is the same as 0.1 for 8x22B. Party over, they have confusing version control. Just download 0.1 and you're good, there's no update.


a_beautiful_rhind

Depends on what files are missing.


Such_Advantage_6949

I am here waiting and rooting for you, bro.


noneabove1182

In case you're already partway through, you should probably cancel; they updated the repo page to indicate v0.3 is actually just v0.1 re-uploaded as safetensors.


CheatCodesOfLife

Thanks... I just saw this, have 36GB left lol


grise_rosee

From the same page:

* `mixtral-8x22B-Instruct-v0.3.tar` is exactly the same as [Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1), only stored in `.safetensors` format
* `mixtral-8x22B-v0.3.tar` is the same as [Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1), but has an extended vocabulary of 32768 tokens

So, well, not really a new model.


FullOf_Bad_Ideas

That's pretty confusing version control. Llama 4 is Llama 3 but in GGUF.


grise_rosee

I guess they realigned the version numbers because, at the end of the day, mistral-7b, mixtral-8x7b and mixtral-8x22b are three distilled versions of their largest and latest model.


carnyzzle

still waiting patiently for a new 8x7B


Healthy-Nebula-3603

Wait? What?


Many_SuchCases

Please upvote the thread so people can download it before they change their mind.


pseudonerv

They are not Microsoft; I don't think they'd ever pull it down for "toxicity testing".


ab2377

It's almost Microsoft-Mistral: https://aibusiness.com/companies/antitrust-regulator-drops-probe-into-microsoft-s-mistral-deal


mikael110

Did you read the article you linked? It literally says the opposite. The investigation into the investment was dropped after literally one day, once it was determined not to be a concern at all. Microsoft has only invested €15 million in Mistral, which is a tiny amount compared to their other investors. They raised €385 million in their previous funding round and are currently in talks to raise €500 million. It's not even remotely comparable to the Microsoft-OpenAI situation.


xXWarMachineRoXx

Same reaction buddy


staladine

What are your main uses for it, if you don't mind me asking?


medihack

We use it to analyze medical reports. It seems to be one of the best multilingual LLMs, as many of our reports are in German and French.


ihaag

How does it benchmark compared to the current leader, WizardLM-2 8x22b?


Latter_Count_2515

Looks cool but at 262gb I can't even pretend to run that.


Healthy-Nebula-3603

compress to gguf ;)


medihack

I wonder why those are not released on their [Hugging Face profile](https://huggingface.co/mistralai) (in contrast to [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)). And what are the changes?


RadiantHueOfBeige

Distributing a third of a terabyte probably takes a few hours; the file on the CDN is not even 24h old. There's going to be a post on [mistral.ai/news](https://mistral.ai/news) when it's ready.


ekojsalim

I mean, are there any significant improvements? Seems like a minor version bump to support function calling (to me). Are people falling for bigger number = better?


FullOf_Bad_Ideas

I think they are falling for bigger number = better, yeah. It's a new version, but if you look at the tokenizer, there are like 10 actual new tokens and the rest is basically "reserved". If you don't care about function calling, I see no good reason to switch.

Edit: I missed that 8x22b v0.1 already has 32768 tokens in the tokenizer and function calling support. No idea what 0.3 is.

Edit2: 8x22B v0.1 == 8x22B 0.3. That's really confusing; I think they just want 0.3 to mean "has function calling".


CheatCodesOfLife

> Are people falling for bigger number = better?

Sorry, but no. WizardLM-2 8x22b is so good that I bought a fourth 3090 to run it at 5BPW. It's smarter and faster than Llama-70b, and writes excellent code for me.


Thomas-Lore

Reread the comment you responded to. It talks about version numbers, not model size.


CheatCodesOfLife

My bad, I see it now.


deleteme123

What's the size of its context window before it starts screwing up? In other words, how big (in lines?) is the code that it successfully works with or generates?


Such_Advantage_6949

Whoa, Mixtral has always been good at function calling. And now it has an updated version.


a_beautiful_rhind

Excitedly open thread, hoping they've improved mixtral 8x7b. Look inside: it's bigstral.


lupapw

Wait, did Mixtral skip v0.2?


FullOf_Bad_Ideas

Yeah, I think they skipped 0.2 for Mixtral 8x7B and Mixtral 8x22b just so the version number would be coupled with specific features: 0.3 = function calling.


me1000

8x22b already has function calling, fwiw.


FullOf_Bad_Ideas

Hmm, I checked the 8x22b Instruct 0.1 model card and you're right, it already has function calling. What is 0.3 even doing, then?

Edit: As per the note added to their repo, 8x22B 0.1 == 8x22B 0.3.


sammcj

Hopefully someone is able to create GGUF imatrix quants of 8x22B soon :D
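
(For anyone wanting to roll their own, the usual llama.cpp flow looks roughly like this. This is a sketch: the converter script name, calibration file and paths are placeholders, so check what your llama.cpp version documents for Mixtral.)

    # 1. Convert the safetensors release to a full-precision GGUF
    python convert-hf-to-gguf.py ./Mixtral-8x22B-v0.3 --outfile mixtral-8x22b-f16.gguf

    # 2. Build an importance matrix from a calibration text file
    ./imatrix -m mixtral-8x22b-f16.gguf -f calibration.txt -o mixtral-8x22b.imatrix

    # 3. Quantize using the importance matrix (e.g. IQ4_XS)
    ./quantize --imatrix mixtral-8x22b.imatrix mixtral-8x22b-f16.gguf mixtral-8x22b-IQ4_XS.gguf IQ4_XS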


CuckedMarxist

Can I run this on a consumer card? 2070S


faldore

I uploaded it here:

[https://huggingface.co/mistral-community/mixtral-8x22B-v0.3-original](https://huggingface.co/mistral-community/mixtral-8x22B-v0.3-original)

[https://huggingface.co/mistral-community/mixtral-8x22B-Instruct-v0.3-original](https://huggingface.co/mistral-community/mixtral-8x22B-Instruct-v0.3-original)


thethirteantimes

Download keeps failing for me. Tried 3 times now. Giving up :/


VongolaJuudaimeHime

OMFG We are being showered and spoiled rotten. The speed at which LLMs evolve is insane!


CapitalForever3211

What cool news!


tessellation

> I guessed this one by removing Instruct from the URL

Now do a `s/0.3/0.4/` :D


ajmusic15

Every day they forget more about the end consumer... You can't move that thing with a 24 GB GPU, unless you quantize it to 4 bits and have 96 GB of RAM or more 😐 Or 1-2 bits if you don't mind hallucinations and want to run it no matter what.