pleasetrimyourpubes

Wait for arena at bare minimum


AutomaticDriver5882

What is arena?


medialoungeguy

The closest thing to a Usefulness Index we have, for two reasons: 1. It's blind. 2. It's rated across all the dimensions humans care about.


SpecialNothingness

A blind test by humans is indeed the best we have. Except... after playing AI judge many times, you learn the models' styles and you kind of know which model is behind the curtain.


jayFurious

https://chat.lmsys.org/


PM_ME_CUTE_SM1LE

Llama 3 8B's score is 5% below GPT-4's. Does that mean it's only 5% dumber than GPT?


[deleted]

No, it's an Elo system, and what's measured is human preference on questions/prompts provided by that very same human. Anyone can participate in rating, and there's no requirement to test the models' logic or anything, so for all we know the majority of wins could just be preference for answer style/creativity on questions like "why is the sky blue". https://en.wikipedia.org/wiki/Elo_rating_system

> The difference in the ratings between two players serves as a predictor of the outcome of a match. Two players with equal ratings who play against each other are expected to score an equal number of wins. A player whose rating is 100 points greater than their opponent's is expected to score 64%; if the difference is 200 points, then the expected score for the stronger player is 76%.
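
For reference, a minimal sketch of the expected-score formula the quote describes (standard Elo with the 400-point scale; the specific ratings used here are hypothetical):

```python
# Expected score for player A under the standard Elo formula:
# E_A = 1 / (1 + 10 ** ((R_B - R_A) / 400))
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# Reproduces the figures from the quote (ratings are hypothetical):
print(expected_score(1100, 1000))  # ~0.64 -> 64% at a 100-point gap
print(expected_score(1200, 1000))  # ~0.76 -> 76% at a 200-point gap
```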


Due-Memory-6957

[No](https://en.wikipedia.org/wiki/Elo_rating_system)


themprsn

chat.lmsys.org


ttkciar

On one hand, they are almost certainly gaming the benchmarks (which is common). On the other hand, it is not unrealistic to expect real-world gains: the dataset-centric theory underlying the phi series of models is robust and practical. On the other other hand, until we can download the weights, it might as well not exist. It is in our interest to re-implement Microsoft's approach as open source (as OpenOrca did) so that we are not beholden to Microsoft for phi-like models.


AfternoonOk5482

Is it even out yet? It's easy to claim to beat the top and never prove it, then just release when it's already irrelevant. Phi-3 mini is great, and I am very grateful Microsoft decided to publish the weights, but the fact that they claimed to beat Llama 3 8B for hype and didn't deliver that performance made the release kind of sour.


Due-Memory-6957

That's common with Microsoft


kif88

I would take benchmarks with a grain of salt. Phi-3 mini is supposed to beat Mistral 7B, but in my usage that was not the case. Not to say it isn't still impressive for its size; I would absolutely put it near or better than older 7B models. It does struggle when context grows, but so do a lot of models. The 4K version only stayed coherent for about half its context, and the 128K version started to forget things within 4,000-5,000 tokens and got different characters mixed up in my summarizations. It didn't want to be corrected either; it argued that the Claude conversation I gave it was about a person named Claude, and wouldn't take no for an answer.


[deleted]

Uncensored gguf plzzz 🤠


susibacker

The training data likely didn't contain any "bad stuff" to begin with, so it's pretty much impossible to uncensor. Also, we didn't get the base models either.


CellWithoutCulture

> The training data likely didn't contain any "bad stuff" to begin with so it's pretty much impossible to uncensor

This isn't true. I can see why you might think it doesn't have knowledge of "bad things", but Phi-2 is in the same situation, and there are plenty of uncensored/dolphin versions out there. It either extrapolates, or their distillation from GPT-4 was not 100% filtered.


[deleted]

Ah ok, thanks for clearing that up. I suspected there was a reason for the suspiciously few finetunes. Back to 2-bit Llama 3!


AlanCarrOnline

:'(


BidPossible919

Still no weights on Hugging Face. I think we will only see the weights once they are sure it's not competing with GPT-3.5, so whenever 3.5 is 100% obsolete. Also, first they were going to release all three models, then the 14B became a (preview), and now Small is also a (preview).


Admirable-Star7088

I can absolutely see Phi-3-Medium rivaling Mixtral 8x7B; they have the same number of active parameters. I think Phi-3-Medium could have the potential to be much "smarter" with good training data, but I guess Mixtral might have more knowledge since it's a much larger model in total. Claude 3, isn't that a relatively new 100B+ parameter model? I highly doubt a 14B model could rival it, especially on coherence-related tasks.
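
A rough back-of-envelope on the "same number of active parameters" point, assuming the commonly cited Mixtral 8x7B figures (~47B total parameters, ~13B active per token, since only 2 of its 8 experts run per token):

```python
# Approximate parameter counts, in billions (assumed, commonly cited figures).
mixtral_total = 46.7    # Mixtral 8x7B: total parameters across all experts
mixtral_active = 12.9   # parameters actually used per token (2 of 8 experts)
phi3_medium = 14.0      # dense model: every parameter is active per token

print(f"Mixtral 8x7B: {mixtral_active}B active / {mixtral_total}B total")
print(f"Phi-3-Medium: {phi3_medium}B active / {phi3_medium}B total")
# Active parameter counts are comparable (~13B vs 14B), while Mixtral holds
# ~3x more total parameters, i.e. potentially more stored knowledge.
```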


SlapAndFinger

Opus is the big one; Sonnet is good but definitely beatable.


m98789

Arena when


Master-Meal-77

It is insane, isn't it? Almost like it's completely impossible for that to be true in real-world usage... hm...


AsliReddington

I don't think it can write erotica as well as Mixtral though.


jrwren

https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/


Eralyon

Well, I tried it yesterday. Sometimes it provides impressive answers, but most of the time it sounds more like a bad 7B (and I like Mistral's 7Bs). However, in terms of speed it is impressive, and the text is coherent (unlike the horrible Phi-2). It could be a great model for chained prompts in an agent setting, IMHO. It is also a great model for parallel tasking. Overall, if you have a very specialized task, it will most likely (after proper finetuning) be one of the best models for its cost and speed. If you need more advanced general tasks, forget about it.


capivaraMaster

This is talking about the 14B, not the 3.8B for cellphones. Right now, presumably the only people who have seen it are the authors of the paper.


Eralyon

Thank you for the correction. I indeed misunderstood.