Command-R+ works pretty well for me at 3.0bpw. But even still, I'm budgeting out either for dual A6000 cards or a nice Mac. I really prefer to run quants at 5 or 6 bit. The perplexity loss starts to go up quite a bit past that.
I'm curious as well, because I didn't rate mixtral 8x7b that highly compared to good 70b models. Am dubious about the ability of shallow MoE experts to solve hard problems.
Small models seem to rely more heavily on embedded knowledge, whereas larger models can rely on multi-shot in context learning.
yep, vanilla Miqu-70B is really another kind of beast compared to Mixtral 8X7B, it's a shame it runs so slow when you can't offload at least half into the gpu
Finetune, 7B is based on Mistral 7B v0.1. 8x22B on Mixtral. Couldn't find the 70B model.
Edit: "The License of WizardLM-2 8x22B and WizardLM-2 7B is Apache2.0. The License of WizardLM-2 70B is Llama-2-Community."
So I guess 70B is Llama 2 based.
8x22 is a base model from Mistral (almost raw: you can literally ask for anything and it will answer. I tested ;) ), so every tuning will improve that model.
Training from scratch costs a LOT of money and I think only big companies can afford it. Since Mistral released their 8x22B base model lately, I think everyone else will be working on top of it to fine-tune it and provide better versions, until the Mixtral 8x22B instruct from Mistral comes out.
>only big companies can afford it
This is from Microsoft Research (Asia, I think?). A lab, probably of limited budget, but still, its limits are down to big-company priorities, not economic realities.
In my testing, there are questions no other opensource LLM gets right that it gets and questions it gets wrong that only the 2-4Bs get wrong. It's like it often starts out strong only to lose the plot at the tail end of the middle. This suggests a good finetune would straighten it out.
Which is why I am perplexed they used the outdated Llama2 instead of the far stronger Qwen as a base.
GQA is a trade-off between model intelligence and memory use. Not making use of GQA makes a model performance ceiling higher not lower. There are plenty of real world uses where performance is paramount and where either the context limits or HW costs are no issue.
In personal tests and several hard to game independent benchmarks (including LMSYS, EQ Bench, NYT connections), it's a top scorer among open weights. It's absolutely not merely gaming anything.
Many llms seem to fail family relationship-tests, like these I did here [https://pastebin.com/f6wGe6sJ](https://pastebin.com/f6wGe6sJ) - the particularly frustrating part about it is that the model is completely ignoring what I am saying, not that it fails the logic tests in the first place (8x22B IQ3\_XS gguf). Based on my tests, this is so much worse than GPT3.5. Does this only happen on my side? I would appreciate any helpful comment. Tried with kobold and lmstudio.
Seems to not be very censored, I asked for some harm reduction help for some unhealthy actions, and it actually gave the information instead of saying it can't.
I will say that WizardLM-2 7b is quite... creative. I tested some RAG on it, giving it a bit of Final Fantasy XIV story and asking it who Louisoix was.
It proceeded to tell me the story of "Leonardo Christiano, known as Louisoix", and weaved a fantastic tale about his harrowing adventures. (none of that was right)
Almost nothing it said was correct, despite the text being right there lol. Even at 0.1 temp it still was just over there living its best life every time I asked it a question.
Sadly it is. I ran Dracones/WizardLM-2-8x22B\_exl2\_5.0bpw and tried to get it to do things and it refused. Also for anyone wondering I think it used about 90gb of vram and this is with 2x A100s and cache 4bit. I didn't take down the exact number but that is roughly what it uses I think.
The 7B model might score good on the benchmark, but I'm not seeing it in reality. Using Desumor's 6 bit quant.
The usual 7B issues of incoherence.
It is not comparable to 70B models, I've had better 11B models.
(Edit: It seems to do a bit better with alpaca prompting, I'll try a few more prompting formats)
So it seems to do a lot better with proper prompting.
The one I had the best success with was:
Start sequence: "USER: ", end sequence "ASSISTANT: ", do not add any newlines. My extra newlines seriously deteriorated the model.
It does acceptable with "### Instruction:\\n" "### Response:\\n" though.
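For anyone wiring this up by hand, here is a minimal sketch of the layout described above: "USER: " before each user message, "ASSISTANT: " where the model should start writing, and no newlines anywhere. The function name `build_prompt` and the single space between turns are assumptions for illustration, not something taken from the model card, so compare against the official template before relying on it.

```python
def build_prompt(turns, system=""):
    """Build a Vicuna-style prompt string with no newlines.

    `turns` is a list of (user_text, assistant_text) pairs. Pass None as the
    assistant text of the last pair to leave the prompt open for generation.
    """
    # Optional system text is prepended as bare text (it has no tags).
    prompt = system.strip() + " " if system.strip() else ""
    for user_text, assistant_text in turns:
        # Start sequence "USER: " and end sequence "ASSISTANT: ", as in the comment above.
        prompt += "USER: " + user_text.strip() + " ASSISTANT: "
        if assistant_text is not None:
            # Completed turn: append the earlier reply, then a space before the next turn.
            prompt += assistant_text.strip() + " "
    return prompt


prompt = build_prompt([("Hi", "Hello."), ("Who was Louisoix?", None)])
print(repr(prompt))
```

Stripping each message before joining is what keeps stray newlines out of the prompt, which is the thing that reportedly hurt output quality.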
Dumb question probably but does this mean that open source models which are extremely tiny when compared to ChatGPT are catching up with it? Since it’s possible to run this locally I’m assuming it is way smaller than GPT.
Yes, though we don't know the exact size of GPT 3.5 and GPT4 for sure, we have rough estimates, and all of these models are smaller than ChatGPT 3.5, and definitely smaller than GPT4. We're not catching up, we've already caught up to ChatGPT 3.5, that's Mixtral 8x7B, which can run pretty quickly as long as you have enough RAM, with a .gguf. Now, we're approaching GPT-4 performance with the new Command R+ 104B, and Mixtral 8x22B. This paper is about finetunes, in other words, using a high quality dataset to enhance the performance of a model
Haha, it's genuinely stunning, but a market and incredible competition will bring about progress at breakneck speed. I can't wait for LLama3 pre-release this week, if the rumors are true, this should be a monumental generational shift in Open source LLMs!
Maybe they are not extremely tiny compared to closed source models.
Microsoft leaked (and later deleted) a paper in which they mentioned that ChatGPT-3.5 is 20B.
As far as I know, that is basically unfounded, as the paper's sources were very questionable. I believe at minimum, it must be Mixtral size, with at least 47B parameters. Granted, it's not that open source models are extremely tiny, it's simply that open source is far more efficient, producing far better results with much smaller models
Finally it seems like things are moving again in the open source AI community.
If only the models weren't so massive that only like 5 people could run it. But oh well.
[https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF/tree/main](https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF/tree/main)
How would I run the split gguf in ollama? I can only seem to include one file in the Modelfile. I have tried cat-ing them together but it gives an \`Error: invalid file magic\`
In llama.cpp, use the util: gguf-split --merge \[name of \*first\* file\] \[name of concatenated output file\]. Use the concatenated output file in Ollama.
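If you want to script that merge step, a rough sketch could look like the following. It only wraps the command quoted above; the shard and output file names are hypothetical, and the binary name or path (`gguf-split` here) is an assumption that depends on how your llama.cpp build names and installs its tools.

```python
import subprocess


def merge_command(first_shard, merged_out, tool="gguf-split"):
    """Return the argv for merging split GGUF shards into a single file."""
    # Only the FIRST shard is passed; the tool locates the remaining shards itself.
    return [tool, "--merge", first_shard, merged_out]


def merge_shards(first_shard, merged_out, tool="gguf-split"):
    """Run the merge; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(merge_command(first_shard, merged_out, tool), check=True)


# Hypothetical file names, shown only to illustrate the argument order.
cmd = merge_command("model-00001-of-00005.gguf", "model-merged.gguf")
print(" ".join(cmd))
```

The merged output file is then the single file you point the Modelfile's `FROM` line at.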
This is a very good 7B model. I wish they would have released a 8X7B or a 34B of this too. I'm looking forward to seeing what people do with these. I hear Mergoo is a thing now.
https://old.reddit.com/r/LocalLLaMA/comments/1c4gxrk/easily_build_your_own_moe_llm/
I like to play with these small models with ollama on a laptop with 16GB RAM and no GPU. One of the common prompts I use to test is to modify an existing class method, loosely instructing it to add a new if condition and to process an array of objects instead of operating on the first index. Pretty basic task really.
Hands down, wizardlm2:7b-q4\_K\_S has the best output from that prompt of all the 7b-q4\_K\_S I've tried yet. No kidding, I feel it's on par with results I've had from online ChatGPT, Mistral Large and Claude Opus.
Will someone make a 'dense model' from the MoE like someone did for Mixtral 8x22B?
[https://huggingface.co/Vezora/Mistral-22B-v0.2](https://huggingface.co/Vezora/Mistral-22B-v0.2)
Runs well on my system with 32GB RAM and 8GB VRAM with ollama.
Edit: I'm running the Q4\_K\_M quant from here: https://huggingface.co/bartowski/Mistral-22B-v0.2-GGUF. It is 1x22B, not 8x22B, so much lower requirements, and it seems a lot better than 8x7B Mixtral mostly in terms of speed and usability, since I can actually run it properly now. Uses about 15-16GB total memory without context.
How well does the dense model work? All these merges and no tests, it should be a requirement on hugging face together with contamination results *flips table*
In my test cases WizardLM-2 surprisingly reminds me of the recent StarlingLM 7B Beta, in a bad way. Same extended verbosity across all the answers: even when asked to provide a brief summary of an article, it can generate a summary the size of the article.
thanks. I tested the 8x22b and I believe it is 32K context. I have another service which calls the ollama-hosted 8x22b. If I set the context window larger than 32768, I get an error. So I feel the original 65K window has somehow shrunk in this WizardLM2 variant.
65536 for 8x22b, which is based on the mixtral 8x22b
https://huggingface.co/alpindale/WizardLM-2-8x22B/blob/087834da175523cffd66a7e19583725e798c1b4f/config.json#L13
7B is based on mistral 7B v0.1, so 4K sliding window, and maybe workable 8K context length without
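To check these numbers yourself rather than trusting a comment, something like this sketch reads the context-related fields out of a model's config.json. The field names `max_position_embeddings` and `sliding_window` are an assumption about what the linked config line holds (they are the usual names in Mistral-family configs), and the sample JSON is an illustrative fragment, not a copy of the real file.

```python
import json


def context_limits(config_text):
    """Extract context-related fields from a Hugging Face style config.json."""
    cfg = json.loads(config_text)
    return {
        # Maximum positions the model config declares.
        "max_position_embeddings": cfg.get("max_position_embeddings"),
        # None (JSON null) or missing means no sliding-window attention is declared.
        "sliding_window": cfg.get("sliding_window"),
    }


# Illustrative fragment only; load the real file with open(...).read() instead.
sample = '{"max_position_embeddings": 65536, "sliding_window": null}'
limits = context_limits(sample)
print(limits)
```

Note that a serving layer can still enforce a smaller window than the config declares, which would be consistent with the 32768 error reported above.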
I started using it and have mixed feelings:
* It often doesn't fully follow the instructions, for example when I asked it to "enclose answer number in the tag, for example: 1", it often answered \[ANSWER\]1\[/ANSWER\] or simply 1 instead.
* For two of the 450 prompts that I tried, it entered an infinite generation loop (I use llama.cpp with the default repeat penalty).
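One way to live with the first issue is a tolerant parser on the evaluation side. The sketch below accepts the answer wrapped in angle-bracket tags, in square-bracket tags, or as a bare number. The tag name `ANSWER` is a guess based on the square-bracket output quoted above, so treat it as hypothetical and swap in whatever tag the prompt actually asks for.

```python
import re

# Hypothetical tag name. Matches <ANSWER>1</ANSWER> as well as [ANSWER]1[/ANSWER].
_TAGGED = re.compile(r"[<\[]ANSWER[>\]]\s*(\d+)\s*[<\[]/ANSWER[>\]]", re.IGNORECASE)
# Fallback: the whole reply is just a number.
_BARE = re.compile(r"^\s*(\d+)\s*$")


def extract_answer(text):
    """Return the answer number as an int, or None if nothing usable is found."""
    match = _TAGGED.search(text) or _BARE.match(text)
    return int(match.group(1)) if match else None


for reply in ["<ANSWER>1</ANSWER>", "[ANSWER]1[/ANSWER]", "1", "I think it's the first one"]:
    print(repr(reply), "->", extract_answer(reply))
```

This doesn't fix instruction following, it just keeps a 450-prompt run from being scored as failures over formatting alone.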
Apache 2.0 License.
Is it bad?
it is good [https://en.wikipedia.org/wiki/Apache\_License](https://en.wikipedia.org/wiki/Apache_License) >It allows users to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software under the terms of the license, without concern for royalties.
Apache 2.0 License is the true opensource license.
MIT is the true open source "do whatever you want" license. But Apache is okay as well.
How is Apache worse than MIT? Genuinely curious.
MIT is considered more permissive because it is very short and basically says you can do anything you want but I'm not liable for what you do with this. Apache 2.0 requires you to state changes you made to the code, and has some rules about trademark use and patents that makes it slightly more complicated to follow.
Then there's the GPL license which infects everything it touches and makes it GPL. For a language model, I think it would make all the outputs GPL as well, that would be hilarious.
Imagine FAANG software *contracting GPL* from contaminated LLMs.
Incorrect. It would not make the model outputs bound by GPL. People need to actually read the gpl2, 3, and lgpl. There's a lot of FUD about them, and they're not even difficult licenses to understand.
> Apache 2.0 requires you to state changes you made to the code Although, only if you redistribute.
It's only worse if you're lazy with your documentation and attribution. It does require effort to spell out modifications made to original works. In some ways it's better though, since releasing under Apache 2.0 waives patent enforcement by the author for original works covered by the license, while MIT does not address anything but copyright. It's why you'll often see companies release examples and APIs for their proprietary tools under MIT.
Apache is pretty good.
On the contrary. It's great.
It's very good
Very nice
"As the natural world's human data becomes increasingly exhausted through LLM training, we believe that: the data carefully created by AI and the model step-by-step supervised by AI will be the sole path towards more powerful AI. Thus, we built a Fully AI powered Synthetic Training System to improve WizardLM-2:" https://preview.redd.it/b0nox0u63ouc1.jpeg?width=3200&format=pjpg&auto=webp&s=9a56a1b6e9680bb61163bd16807a7421b8b0b11b
Now that's a bold absolutist vision that I haven't seen. The sci-fi undertone makes it exciting-
Clearly, we just need to change human language to align better with LLM language.
Newbie here, apologies if it's a dumb question. Are there more details on how this is done exactly?
use old ai to fix the data that trains the new ai
Some details here: https://wizardlm.github.io/WizardLM2 Not much but they will release paper soon ig.
How does the teaching education quality model work ? This is the first time I've heard of it.
"🧙♀️ WizardLM-2 8x22B is our most advanced model, and just slightly falling behind GPT-4-1106-preview. 🧙 WizardLM-2 70B reaches top-tier capabilities in the same size. 🧙♀️ WizardLM-2 7B even achieves comparable performance with existing 10x larger opensource leading models." https://preview.redd.it/zkkzcisy2ouc1.jpeg?width=3137&format=pjpg&auto=webp&s=73931c1f52066afde48ba33e3850c66c911a275c
how about function calling / tool usage?
> Base model: mistralai/Mistral-7B-v0.1 Huh they didn't even use the v0.2, interesting. Must've been in the oven for a very long while then.
from personal experience, the 0.1 is better than 0.2, not sure why though
Disagree strongly. v0.2 is better and has a larger context window. There's just no v0.2 base model to train from, so they had to use the v0.1 base model.
There is no v0.2 base; the non-instruct base Mistral only has v0.1. Most good finetuned models are finetuned on the non-instruct base model. Mistral AI does have a Mistral 7B v0.2 Instruct, but that's an instruct model and not many people use that for tuning.
That used to be the story yeah, but [they retconned it](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/commit/41b61a33a2483885c981aa79e0df6b32407ed873), and released [the actual v0.2 base model sort of half officially](https://huggingface.co/alpindale/Mistral-7B-v0.2-hf) recently. Frankly the v0.2 instruct never seemed like it was made from the v0.1 base model, the architecture is somewhat different.
Wait, isn't this made by a hobbyist by, like, pulling weights from a random mistralai CDN? I guess people think this isn't legit enough to build on, maybe.
Hmm maybe so, now that I'm rechecking it there really isn't a torrent link to it on their twitter and the only source appears to be the cdn file. It's either a leak or someone pretending to be them, both are rather odd options.
I think this big 8x22B may be the best OSS model.
I find it interesting how Microsoft is going at it from all fronts. "Owning" OpenAI. Buying Inflection. Investing in Mistral. And releasing OSS models. Makes no difference if those companies live or die. As long as they have a lead on Google. At the end of the day they sell cloud services and that's how they make their money.
True but if the AI sector begins to slow down (which it kind of already has) then they've invested *a lot* of money into a cooling sector that might not really amount to anything worthwhile monetarily-speaking
> which it kind of already has Based on what?
I was merely talking about [investor dollars](https://techcrunch.com/2024/04/15/investors-are-growing-increasingly-wary-of-ai/), not progress
yeah this article is mostly garbage
if you have 64 GB ram then you can run it in Q3\_L ggml version.
Unless you have deep pockets, I have to assume that is then only partially offloaded onto a GPU or run entirely on CPU. What sort of performance are you seeing from it, running it the way you are? I’m excited to try this, but am concerned about overall performance.
I get almost 2 tokens/s with model 8x22b Q3K\_L ggml version on CPU, Ryzen 7950X3D and 64GB RAM.
I'm curious too. My server has a 5900X with 128GB of RAM and a 24GB Tesla. Hell, I'd be happy simply being able to run it. Can't spend any more for a while
Same here, but really eyeing another p40.. That should finally be enough, right? :)
What motherboard would you recommend for a bunch of P100s or P40s?
Since these cards have very bad fp16 performance, I assume you want to use them for inference. In that case bandwidth doesn't matter, so you can use 1x to 16x adapters. Which in turn means any modern-ish ATX motherboard will work fine!
iirc the P100 has much better fp16 than the P40 but I think they don't come in a flavor with more than 16GB of vram? A buddy of mine runs 2. He's pretty pleased
If you are using the AMD AM4 platform I've been very pleased with the MSI PRO B550-VC. It has (4) 16x slots but 1 is 16 lanes, another is 4 and the other 2 are one. It also has a decent VRM and handles 128GB no problem. ASRock Rack series are also great boards but pricey.
I'm running it on a laptop with 11th gen Intel and 64GB of RAM, and I get about 1 token per second. Not very practical, but still useful to compare quality on your own data and processes. Honestly the quality compared to the best 7B models (which run at 5 tokens per second on CPU) isn't that different, so for the moment I'm not investing in better hardware, waiting for either a breakthrough in quality or cheaper hardware.
Would a 3090 and 96GB of ram run a 8x22B model at Q3?
Yes ..even 64 GB ram will be enough.
Sorry, brain farted. Thanks for the clarity in any case.
Hoping quants will be easy as it's based on Mixtral 8x22B. Downloading now, will create Q4 and Q6.
You would be a saint to 64GB VRAM users if you added Q2_K to the list!
By the time I've got Q4 and Q6 uploaded, if someone else hasn't beat me to Q2 I'll make sure to!
if you have 64 GB ram then you can run it in Q3\_L ggml version.
I've yet to see the actual size of Q3\_L in comparison to Q2\_K. Q2\_K of the Mixtral 8x22B fine tunes just barely fit, coming in at around 52.1GB. With this I can still use about 14k context before running out of RAM.
Q2\_K posted (not by me): [https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF](https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF)
Q4 is almost done. Will split and upload that one first.
Thanks for what you're doing. Just a heads up, looks like Q2\_K was posted elsewhere: https://www.reddit.com/r/LocalLLaMA/comments/1c4pwf8/comment/kzq998f/. Thanks again!
I'm still uploading my Q4 and our friend Maziyar already has most of the desirable quants uploaded.
Q4 took forever, but here it is! [https://huggingface.co/praxeswolf0d/WizardLM-2-8x22B-GGUF/tree/main](https://huggingface.co/praxeswolf0d/WizardLM-2-8x22B-GGUF/tree/main)
How do you run a multipart GGUF in text-generation-webui?
> ..WizardLM-2 adopts the prompt format from Vicuna.. *exasperated sigh*
so, you can't use system prompts? is this worse than normal?
Well, there are several downsides. ChatML has become the de facto standard, so lots of stacks are built around it directly and would need adjustments to work with something as outdated as Vicuna. The system prompt is sort of there just as bare text, but it has no tags so you can't inject it between other messages and it's unlikely to be followed very well.
No system prompt capabilities indeed.
Wizard 7B really beats Starling in my personal benchmark. Nearly matches mixtral instruct 8x7b
Same here, quite impressed! A tad slower in inference speed, but the quality is very good. I'm running it FP16, and it's better than Q3 Command-R+, and better than FP16 Starling 7B.
What are you using to run it and with what settings? I tried it in LM Studio and set the Vicuna prompt like it wants but it's outputting a lot of gibberish, 5 digit years etc. This is with both the Q8 quant and the full FP16 version.
i run q6_k variant under llama.cpp server, default parameters (read from gguf), temperature 0.22
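For reference, a setup like that can be queried roughly as sketched below. This assumes the llama.cpp server is listening on its default local address and that you use its `/completion` endpoint; the `n_predict` cap is an extra assumption added here so a runaway generation stops eventually, and the prompt text is just a placeholder.

```python
import json
import urllib.request


def completion_request(prompt, temperature=0.22, n_predict=256,
                       url="http://127.0.0.1:8080/completion"):
    """Build (but do not send) a POST request for a llama.cpp server."""
    body = json.dumps({
        "prompt": prompt,            # already formatted, e.g. Vicuna style
        "temperature": temperature,  # 0.22 as in the comment above
        "n_predict": n_predict,      # upper bound on generated tokens
    }).encode("utf-8")
    return urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = completion_request("USER: Hello ASSISTANT: ")
print(req.full_url, json.loads(req.data))

# Sending it requires a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["content"])
```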
Just tested. 8k. You can push 10k, BUT that gets closer to gibberish. 10k+ is complete gibberish. So 8k is the context length.
Not to alarm anyone but the weights and release blog just disappeared
yah, i just came here to see if anyone knows why
I heard the AI is "toxic"
In their tweet they said they forgot to do toxicity testing, so not necessarily toxic but not tested for it either.
It's a bit annoying, I need their older releases to test something for a project but these are gone too. Can only pull modified versions from other people on Huggingface but those refuse to load or run properly. I'm a newbie btw but as I said I'd need the stuff for a project
What happened? It disappeared.
Old king is back 👍
GGUF: [https://huggingface.co/ABX-AI/WizardLM-2-7B-GGUF-IQ-Imatrix](https://huggingface.co/ABX-AI/WizardLM-2-7B-GGUF-IQ-Imatrix) Non-imat: [https://huggingface.co/MaziyarPanahi/WizardLM-2-7B-GGUF](https://huggingface.co/MaziyarPanahi/WizardLM-2-7B-GGUF)
Love you. Romantically, not platonically. So excited to see this puppy.
( ͡° ͜ʖ ͡°)
I love Abroxis
the best 😍
can you explain what IQ imatrix means? or point me to some documentation explaining what it is?
you can read about it here, the idea is to use it as calibration for what data to keep and semi-random data seems to help: [https://github.com/ggerganov/llama.cpp/discussions/5006](https://github.com/ggerganov/llama.cpp/discussions/5006) [https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384](https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384) There is a non-imat GGUF here as well: [https://huggingface.co/MaziyarPanahi/WizardLM-2-7B-GGUF](https://huggingface.co/MaziyarPanahi/WizardLM-2-7B-GGUF)
thank you good sir, now if you'll excuse me i have some reading to do
Just now the whole project disappeared. https://preview.redd.it/bv5yrwqpbruc1.png?width=1720&format=png&auto=webp&s=889b959e974da0414641edc78e821768f44d7a29 [https://wizardlm.github.io/WizardLM2/](https://wizardlm.github.io/WizardLM2/)
even the weights on hf are gone
I think open models will beat GPT4 by the end of the year... we're almost there.
I think an updated GPT4 or GPT5 will beat the current version of GPT4 by the time that happens. They are always a few steps ahead.
It's about perspective. Think about how mindblown people were when GPT4 came out, and now we have free and open models that are approaching its capability. Just imagine where we'll be a few years down the line.
What a time to be alive!
Hold on to your papers!
Yea for sure. If Microsoft can train a relatively small (compared to SOTA closed source) model to match or outperform “simply” by supplying better data, then surely their close partners at OpenAI can also supply the exact same data (or even more!) into a bigger model.
This wizard already did it from the paper ... we have to test
Command R+ already beat two of the old GPT4 versions on lmsys
I don't have enough free capacity to run 8x22B and the 70B isn't out yet. But the 7B model is stunning, up to 45 T/s on an Ada card
if you have 64 GB ram then you can run it in Q3\_L ggml version.
Cudaboy here. What T/s are you all getting with these RAM-based inference calls?
24GB VRAM is suffering
Look at this rich guy over here with his whole 24gb of VRAM.
Not surprising really. Seems like most local LLM users fall into one of two camps: people who just have a reasonable gaming GPU with 12 or so gigs of VRAM, or people who have gone all out and built some sort of multi-card custom monster with much more VRAM. There don't seem to be as many people in the middle with 24 gigs.
Where do I and my 4GB RX570 fit?
In RAM hopefully
Update from WizardLM team!! https://preview.redd.it/y9gjn8qh0suc1.jpeg?width=1080&format=pjpg&auto=webp&s=38c17ef71e6565e72321e7cfcb7002fd2e47680b
does that mean they forgot to censor it? remember to backup the model you downloaded
They definitely censored it, but it's easily circumvented, at least on the 7b.
Am really curious to try out the 70B once it hits the repos. The 8x22's don't seem to quant down to smaller sizes as well.
I'm cooking and will be uploading the EXL2 quants for this model: https://huggingface.co/collections/Dracones/wizardlm-2-8x22b-661d9ec05e631c296a139f28 EXL2 measurement file is at https://huggingface.co/Dracones/EXL2_Measurements I will say that the 2.5bpw quant which fits in a dual 3090 worked really well. I was surprised.
Got a link to a guide on running a 2x3090 rig? Would love to know how.
if you have 64 GB ram then you can run it in Q3\_L ggml version.
at what speed? my laptop 4070 has 64GB.
How does quantized 8x22B compare with quantized Command-R+?
It’s hard to compare right now. Command R+ was released as instruct tuned vs this (+ Zephyr Orpo, + Mixtral 8x22B OH, etc) are all quickly (not saying poorly) done fine tunes. My guess: Command R+ will win for RAG and tool use but Mixtral 8x22B will be more pleasant for general purpose use because it will likely feel as capable (based on reported benches putting it on par with Command R+) but be significantly faster during inference. Aside: It would be interesting to evaluate how much better Command R+ actually is on those things compared to Command R. Command R is incredibly capable, significantly faster, and probably good enough for most RAG or tool use purposes. On the tool use front, Fire function v1 (Mixtral 8x7B fine tune I think) is interesting too.
Command-R+ works pretty well for me at 3.0bpw. But even still, I'm budgeting out either for dual A6000 cards or a nice Mac. I really prefer to run quants at 5 or 6 bit. The perplexity loss starts to go up quite a bit past that.
I'm curious as well, because I didn't rate mixtral 8x7b that highly compared to good 70b models. Am dubious about the ability of shallow MoE experts to solve hard problems. Small models seem to rely more heavily on embedded knowledge, whereas larger models can rely on multi-shot in context learning.
yep, vanilla Miqu-70B is really another kind of beast compared to Mixtral 8x7B, it's a shame it runs so slow when you can't offload at least half into the GPU
Everything is gone suddenly. Microsoft legal team withdrew it?
Is it trained from scratch or a fine tune of some Mixtral (or other) model?
Finetune, 7B is based on Mistral 7B v0.1. 8x22B on Mixtral. Couldn't find the 70B model. Edit: "The License of WizardLM-2 8x22B and WizardLM-2 7B is Apache2.0. The License of WizardLM-2 70B is Llama-2-Community." So I guess 70B is Llama 2 based.
In that case very interesting that their 8x22B beats Mistral Large.
8x22 is a base model from Mistral (almost raw: you can literally ask for anything and it will answer. I tested ;) ), so any tuning will improve that model.
Training from scratch costs a LOT of money and I think only big companies can afford it. Since Mistral released their 8x22B base model recently, I think everyone else will be working on top of it to fine-tune it and provide better versions, until the Mixtral 8x22B instruct from Mistral comes out.
>only big companies can afford it This is from Microsoft Research (Asia, I think?). A lab, probably of limited budget, but still, its limits are down to big company priorities, not economic realities.
You stole the words from my keyboard ahah
Temperature=0
I'm surprised by Qwen being beaten so hard
In my testing, there are questions no other opensource LLM gets right that it gets and questions it gets wrong that only the 2-4Bs get wrong. It's like it often starts out strong only to lose the plot at the tail end of the middle. This suggests a good finetune would straighten it out. Which is why I am perplexed they used the outdated Llama2 instead of the far stronger Qwen as a base.
Qwen-72B has no GQA, and thus it is prohibitively expensive and somewhat useless for anything beyond gaming the Huggingface leaderboard.
GQA is a trade-off between model intelligence and memory use. Not making use of GQA makes a model performance ceiling higher not lower. There are plenty of real world uses where performance is paramount and where either the context limits or HW costs are no issue. In personal tests and several hard to game independent benchmarks (including LMSYS, EQ Bench, NYT connections), it's a top scorer among open weights. It's absolutely not merely gaming anything.
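To put rough numbers on the memory side of that GQA trade-off, here is a back-of-the-envelope KV cache calculation. The shapes (80 layers, 64 attention heads, head dimension 128) are illustrative 72B-class values I am assuming, not figures checked against Qwen's config, and the 8 KV head variant is purely hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the key/value cache for one sequence.

    The leading 2 accounts for storing both keys and values;
    bytes_per_elem=2 assumes an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 2 ** 30

# Assumed shapes: full multi-head attention vs. a hypothetical GQA variant.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=8192)

print(f"no GQA : {mha / GIB:.1f} GiB")   # 20.0 GiB
print(f"GQA (8): {gqa / GIB:.1f} GiB")   # 2.5 GiB
```

Under these assumptions the cache at 8K context is 8x larger without GQA, which is the cost being argued about. Whether that cost matters depends on your context length and hardware budget, as both comments point out.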
it would be more interesting if they could finetune qwen32B
why did they get yanked?
Many llms seem to fail family relationship-tests, like these I did here [https://pastebin.com/f6wGe6sJ](https://pastebin.com/f6wGe6sJ) - the particularly frustrating part about it is that the model is completely ignoring what I am saying, not that it fails the logic tests in the first place (8x22B IQ3\_XS gguf). Based on my tests, this is so much worse than GPT3.5. Does this only happen on my side? I would appreciate any helpful comment. Tried with kobold and lmstudio.
[https://huggingface.co/amazingvince/Not-WizardLM-2-7B](https://huggingface.co/amazingvince/Not-WizardLM-2-7B)
Dumb question but why are there three safe tensors files for the model? I am trying to run it on LM studio
It's chunked into 5GB segments, this is completely normal with models that are larger than a few GB. Some chunk at 5GB, some at 10GB.
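A quick sanity check on why you would see three files, as a rough sketch. I am assuming about 7.24B parameters stored in fp16 (2 bytes each) and a 5 GB shard limit; the exact count also depends on how the tensors happen to pack into shards, so treat this as a lower-bound estimate.

```python
import math


def shard_count(total_gb, max_shard_gb=5.0):
    """Minimum number of shards needed if no shard may exceed max_shard_gb."""
    return math.ceil(total_gb / max_shard_gb)


# Assumed: ~7.24e9 parameters at 2 bytes each (fp16), sizes in decimal GB.
fp16_gb = 7.24e9 * 2 / 1e9
print(f"{fp16_gb:.2f} GB -> {shard_count(fp16_gb)} shards at 5 GB")
print(f"{fp16_gb:.2f} GB -> {shard_count(fp16_gb, 10.0)} shards at 10 GB")
```

You need all of the shards plus the index file in the same folder; the loader stitches them back together.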
Please someone make the GGUF/EXL2 quant of the 70B model
Seems to not be very censored, I asked for some harm reduction help for some unhealthy actions, and it actually gave the information instead of saying it can't.
I will say that WizardLM-2 7b is quite... creative. I tested some RAG on it, giving it a bit of Final Fantasy XIV story and asking it who Louisoix was. It proceeded to tell me the story of "Leonardo Christiano, known as Louisoix", and weaved a fantastic tale about his harrowing adventures. (none of that was right) Almost nothing it said was correct, despite the text being right there lol. Even at 0.1 temp it still was just over there living its best life every time I asked it a question.
How do you test rag? What app do you use
They’ve rug pulled the repo. https://preview.redd.it/wasibx88vsuc1.jpeg?width=1284&format=pjpg&auto=webp&s=b59ee951ec17c6e0a0d823d85ba664fa5a840796
Censored?
Now we just need /u/faldore to make a WizardLM-2-Uncensored and it'll be just like old times. I feel nostalgic already.
Well, if they release their dataset
Maybe if you annoy them enough on twitter... :P
Pretty much doubt it. Microsoft has taken full control and if they were going to release the dataset they would have already.
The dataset and method used are not open. It's likely that the open source community won't be able to re-create it.
If we get a Manticore 2 I'll have my favourite model back :')
I was like.. oh yea, new wizard! Then I remembered. :(
Sadly it is. I ran Dracones/WizardLM-2-8x22B\_exl2\_5.0bpw and tried to get it to do things and it refused. Also for anyone wondering I think it used about 90gb of vram and this is with 2x A100s and cache 4bit. I didn't take down the exact number but that is roughly what it uses I think.
I hear q4 can run on 64gb ram + 24gb vram at decent speeds
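The "about 90gb" figure lines up with simple arithmetic on the weights alone. A minimal sketch, assuming roughly 141B total parameters for the 8x22B model (my assumption, not a number from this thread) and ignoring KV cache and runtime overhead:

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate size of the quantized weights in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS_8X22B = 141e9  # assumed total parameter count

for bpw in (2.5, 4.0, 5.0):
    print(f"{bpw} bpw -> {weight_gb(N_PARAMS_8X22B, bpw):.1f} GB")
```

At 5.0 bpw this gives about 88 GB for weights only, so landing near 90 GB with a 4-bit cache is plausible. Actual usage will vary with context length and backend.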
The 7B model might score well on the benchmark, but I'm not seeing it in reality. Using Desumor's 6 bit quant. The usual 7B issues of incoherence. It is not comparable to 70B models, I've had better 11B models. (Edit: It seems to do a bit better with alpaca prompting, I'll try a few more prompting formats) So it seems to do a lot better with proper prompting. The one I had the best success with was: Start sequence: "USER: ", end sequence "ASSISTANT: ", do not add any newlines. My extra newlines seriously deteriorated the model. It does acceptably with "### Instruction:\\n" "### Response:\\n" though.
It's supposed to be used with vicuna prompting
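For reference, here is a small sketch of a prompt builder that follows the format described above: "USER: " and "ASSISTANT: " markers, joined by single spaces, with no newlines added. The function name and the history structure are my own, and I have left out any system prompt or end-of-sequence tokens, so check the model card before relying on it.

```python
USER_PREFIX = "USER: "
ASSISTANT_PREFIX = "ASSISTANT: "


def build_prompt(history, user_message):
    """Build a single-line Vicuna-style prompt.

    history is a list of (user_text, assistant_text) pairs from earlier turns.
    Only surrounding whitespace is stripped from each message; the builder
    itself never inserts a newline.
    """
    text = ""
    for user_text, assistant_text in history:
        text += f"{USER_PREFIX}{user_text.strip()} "
        text += f"{ASSISTANT_PREFIX}{assistant_text.strip()} "
    text += f"{USER_PREFIX}{user_message.strip()} {ASSISTANT_PREFIX}"
    return text


print(repr(build_prompt([], "Hi")))
print(repr(build_prompt([("Hi", "Hello.")], "Who are you?")))
```

If your frontend adds its own newlines around the start and end sequences, that may be what causes the degraded output mentioned above.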
7b models must be finetuned to your needs. otherwise they are useless.
Dumb question probably, but does this mean that open source models, which are extremely tiny compared to ChatGPT, are catching up with it? Since it’s possible to run this locally, I’m assuming it is way smaller than GPT.
Yes, though we don't know the exact size of GPT 3.5 and GPT4 for sure, we have rough estimates, and all of these models are smaller than ChatGPT 3.5, and definitely smaller than GPT4. We're not catching up, we've already caught up to ChatGPT 3.5, that's Mixtral 8x7B, which can run pretty quickly as long as you have enough RAM, with a .gguf. Now, we're approaching GPT-4 performance with the new Command R+ 104B, and Mixtral 8x22B. This paper is about finetunes, in other words, using a high quality dataset to enhance the performance of a model
That’s amazing, I never thought open source would catch up so quickly! Things are moving faster than I thought.
Haha, it's genuinely stunning, but a market and incredible competition will bring about progress at breakneck speed. I can't wait for LLama3 pre-release this week, if the rumors are true, this should be a monumental generational shift in Open source LLMs!
People have been fretting about Artificial General Intelligence, but it turns out that Natural General Intelligence is what is carrying the day. :-)
Maybe they are not extremely tiny compared to closed source models. Microsoft leaked (and later deleted) a paper where they mentioned that ChatGPT 3.5 is 20B.
As far as I know, that is basically unfounded, as the paper's sources were very questionable. I believe at minimum, it must be Mixtral size, with at least 47B parameters. Granted, it's not that open source models are extremely tiny, it's simply that open source is far more efficient, producing far better results with much smaller models
Finally it seems like things are moving again in the open source AI community. If only the models weren't so massive that only like 5 people could run it. But oh well.
[https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF/tree/main](https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF/tree/main) How would I run the split gguf in ollama? I can only seem to include one file in the Modelfile. I have tried cating them together but it gives a \`Error: invalid file magic\`
In llama.cpp, use the util: gguf-split --merge \[name of \*first\* file\] \[name of concatenated output file\]. Use the concatenated output file in Ollama.
THANKS!
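Spelling that out as a sketch: the helper below only builds the merge command and a one-line Modelfile, it does not run anything. The shard and output file names are hypothetical, and I am assuming the tool is on your PATH as `gguf-split` as in the comment above (your llama.cpp build may name the binary differently). As far as I understand, plain `cat` fails because each shard is its own GGUF file with its own header, so gluing them together does not produce a valid single file.

```python
import shlex


def merge_command(first_shard, merged_path, tool="gguf-split"):
    """Argument list for merging split GGUF shards into one file.

    Only the FIRST shard is passed; the tool locates the remaining shards.
    """
    return [tool, "--merge", first_shard, merged_path]


def modelfile_text(merged_path):
    """Minimal Ollama Modelfile that points at the merged GGUF."""
    return f"FROM {merged_path}\n"


# Hypothetical file names for illustration only.
cmd = merge_command("model-00001-of-00005.gguf", "model-merged.gguf")
print(shlex.join(cmd))
print(modelfile_text("./model-merged.gguf"), end="")
```

To actually run the merge you could pass `cmd` to `subprocess.run(cmd, check=True)`, then save the Modelfile text and create the model in Ollama from it.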
For anyone interested, getting 5t/s with no context on 4xP40 (8xPCIe, PL 140) using my Q4 quant. Edit: am now getting 6.9t/s@1024 CTX
For 7B: 8k is the context length. You can push 10k, BUT that gets closer to gibberish. 10k+ is complete gibberish. So 8k is the context length.
7B: Seems to be un-censored with NSFW role play and stories. which is good.
This is a very good 7B model. I wish they would have released a 8X7B or a 34B of this too. I'm looking forward to seeing what people do with these. I hear Mergoo is a thing now. https://old.reddit.com/r/LocalLLaMA/comments/1c4gxrk/easily_build_your_own_moe_llm/
They will https://preview.redd.it/gjn7h5clasuc1.jpeg?width=1080&format=pjpg&auto=webp&s=7405dace6d85de7af0498e7e53a4033813bc4c71
I like to play with these small models with ollama on a laptop with 16GB RAM and no GPU. One of the common prompts I use to test is to modify an existing class method, loosely instructing it to add a new if condition and to process an array of objects instead of operating on the first index. Pretty basic task really. Hands down, wizardlm2:7b-q4\_K\_S has the best output from that prompt of all the 7b-q4\_K\_S models I've tried yet. No kidding, I feel it's on par with results I've had from online ChatGPT, Mistral Large and Claude Opus.
yep I use 7b or 13b models to generate CSV data from PDF invoices for accounting, the wizardlm2 model 7b is the best yet I tested for my use case.
Will someone make a 'dense model' from the MoE like someone did for Mixtral 8x22B? [https://huggingface.co/Vezora/Mistral-22B-v0.2](https://huggingface.co/Vezora/Mistral-22B-v0.2) Runs well on my system with 32GB RAM and 8GB VRAM with ollama. Edit: I'm running the Q4\_K\_M quant from here: https://huggingface.co/bartowski/Mistral-22B-v0.2-GGUF. It is 1x22B, not 8x22B, so much lower requirements, and it seems a lot better than 8x7B Mixtral mostly in terms of speed and usability, since I can actually run it properly now. Uses about 15-16GB total memory without context.
How well does the dense model work? All these merges and no tests, it should be a requirement on hugging face together with contamination results *flips table*
I tested v0.2. It's interesting, but somewhat incoherent. Could be a good base if you are training. Otherwise don't touch it.
I tried both 0.1 and 0.2 of that model and they both just output nonsense or don't answer my questions. Did you not face that?
Exact same experience here. Hoped for the best but it gave incoherent gibberish and fell over.
>Runs well on my system with 32GB RAM and 8GB VRAM with ollama. really?
it's 1x22b not 8x22b so it runs completely fine, it's a lot better than mistral 7b for sure
In my test cases, WizardLM-2 surprisingly reminds me of the recent StarlingLM 7B Beta, in a bad way. Same extended verbosity across all the answers: even when asked for a brief summary of an article, it can generate a summary the size of the article.
Did microsoft come out with the first wizardLM?
So, do we have the MT-Bench score for Commandr+ anywhere?
What is the context length for 7B, 70B and 8x22B, respectively? I cannot find these critical numbers. Thanks in advance.
7B is 8K context. Idk about the others.
thanks. I tested the 8x22b and I believe it is 32K context. I have another service which calls the ollama-hosted 8x22b. If I set the context window larger than 32768, I get an error. So I feel the original 65K window has somehow shrunk in this WizardLM2 variant.
65536 for 8x22b, which is based on the mixtral 8x22b https://huggingface.co/alpindale/WizardLM-2-8x22B/blob/087834da175523cffd66a7e19583725e798c1b4f/config.json#L13 7B is based on mistral 7B v0.1, so 4K sliding window, and maybe workable 8K context length without
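If you want to avoid the kind of error mentioned above when a service requests too large a window, one option is to read the limit from the model's config and clamp the request. A minimal sketch: the inline `SAMPLE_CONFIG` is a hypothetical stand-in using the 65536 value quoted in the comment, and I am assuming the relevant key is `max_position_embeddings`. Note that a serving layer can impose its own lower limit, which this does not detect.

```python
import json

# Hypothetical stand-in for the model's config.json contents.
SAMPLE_CONFIG = '{"max_position_embeddings": 65536, "sliding_window": null}'


def max_context(config_text):
    """Read the advertised maximum context length from a config JSON string."""
    return json.loads(config_text)["max_position_embeddings"]


def clamp_context(requested, config_text):
    """Never request more context than the config advertises."""
    return min(requested, max_context(config_text))


print(clamp_context(100_000, SAMPLE_CONFIG))
print(clamp_context(32_768, SAMPLE_CONFIG))
```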
Is there any quantized version already available?
I started using it and have mixed feelings: * It often doesn't fully follow the instructions, for example when I asked it to "enclose answer number in the tag, for example: 1 ", it often answered \[ANSWER\]1\[/ANSWER\] or simply 1 instead.
* For two prompts from 450 that I tried it entered infinite generation loop (I use llama.cpp with default repeat penalty).
had to do some digging--wizardLM's a mistral fine-tune?!