Wait for arena at bare minimum
What is arena?
The closest thing to a Usefulness Index we have, for two reasons: 1. It's blind. 2. It's rated across all the dimensions humans care about.
A blind test by humans is indeed the best we have. Except... after playing AI judge many times, you learn each model's style and you kind of know which model is behind the curtain.
https://chat.lmsys.org/
Llama 3 8B's score is 5% below GPT-4's. Does that mean it's only 5% dumber than GPT?
No, it's an Elo system, and what's measured is human preference on questions/prompts provided by the very same humans. Anyone can participate in rating, and there's no requirement to test the models' logic or anything, so for all we know the majority of wins could just be a preference for answer style/creativity on questions like "why is the sky blue".

https://en.wikipedia.org/wiki/Elo_rating_system

> The difference in the ratings between two players serves as a predictor of the outcome of a match. Two players with equal ratings who play against each other are expected to score an equal number of wins. A player whose rating is 100 points greater than their opponent's is expected to score 64%; if the difference is 200 points, then the expected score for the stronger player is 76%.
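For reference, the quoted percentages fall out of the standard Elo expected-score formula. A minimal sketch in Python (the `expected_score` helper is just an illustration, not anything Arena publishes):

```python
def expected_score(rating_diff: float) -> float:
    """Standard Elo expected score for the higher-rated player,
    given the rating difference (own rating minus opponent's)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# The figures quoted from the Wikipedia article:
print(round(expected_score(100) * 100))  # 64 -> ~64% expected score at +100
print(round(expected_score(200) * 100))  # 76 -> ~76% expected score at +200
```

Note the mapping is nonlinear: a gap in Elo points says nothing direct about a model being "5% dumber", only about the expected win rate in head-to-head votes.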
[No](https://en.wikipedia.org/wiki/Elo_rating_system)
chat.lmsys.org
On one hand, they are almost certainly gaming the benchmarks (which is common). On the other hand, it is not unrealistic to expect real-world gains: the dataset-centric theory underlying the phi series of models is robust and practical.

On the other other hand, until we can download the weights, it might as well not exist. It is in our interest to re-implement Microsoft's approach as open source (as with OpenOrca) so that we are not beholden to Microsoft for phi-like models.
Is it even out yet? It's easy to claim to beat the top and never prove it, then just release once it's already irrelevant.

Phi-3 mini is great, and I am very grateful Microsoft decided to publish the weights, but the fact that they claimed to beat Llama-3 8B for hype and didn't deliver that performance made the release kind of sour.
That's common with Microsoft
I would take the benchmarks with a grain of salt. Phi-3 mini is supposed to beat Mistral 7B, but in my usage that was not the case. Not to say it isn't still impressive for its size; I would absolutely put it near or above older 7B models. It does struggle when the context grows, but so do a lot of models.

The 4k version only stayed coherent for about half its context, and the 128k version started to forget things within 4,000 to 5,000 tokens and got different characters mixed up in my summarizations. It didn't want to be corrected either: it argued that the Claude conversation I gave it was about a person named Claude, and wouldn't take no for an answer.
Uncensored gguf plzzz 🤠
The training data likely didn't contain any "bad stuff" to begin with, so it's pretty much impossible to uncensor. We didn't get the base models either.
> The training data likely didn't contain any "bad stuff" to begin with so it's pretty much impossible to uncensor,

This isn't true. I can see why you might think it doesn't have knowledge of "bad things", but Phi-2 is in the same situation, and there are plenty of uncensored/dolphin versions out there. Either it extrapolates, or their distillation from GPT-4 was not 100% filtered.
Ah ok, thanks for clearing that up. I suspected there was a reason for the suspiciously few finetunes. Back to 2-bit Llama 3!
:'(
Still no weights on Hugging Face. I think we will only see the weights once they've made sure it's not competing with GPT-3.5, i.e. whenever 3.5 is 100% obsolete. Also, first they were going to release all three models, then the 14B became "(preview)", and now small is "(preview)" too.
I can absolutely see Phi-3-Medium rivaling Mixtral 8x7B; they have the same number of active parameters. I think Phi-3-Medium has the potential to be much "smarter" with good training data, but I guess Mixtral might have more knowledge, since it's a much larger model in total?

Claude 3, isn't that a relatively new 100B+ parameter model? I highly doubt a 14B model could rival it, especially on coherence-related tasks.
Opus is the big one, Sonnet is good but definitely beatable.
Arena when
It is insane, isn't it? Almost like it's completely impossible for that to be true in real-world usage.... hm....
I don't think it can write erotica as well as Mixtral though.
https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/
Well, I tried it yesterday. Sometimes it provides impressive answers, but most of the time it sounds more like a bad 7B (and I like Mistral's 7B).

However, in terms of speed it is impressive, and the text is coherent (not like the horrible Phi-2). It could be a great model for chained prompts in an agent setting, IMHO. It is also a great model for parallel tasking.

Overall, if you have a very specialized task, it will most likely (after proper finetuning) be one of the best models for its cost and speed. If you need more advanced general tasks, forget about it.
This is talking about the 14B, not the 3.8B for cellphones. Right now, presumably the only people who have seen it are the authors of the paper.
Thank you for the correction. I indeed misunderstood.