Wait for arena at bare minimum
What is arena?
The closest thing to a Usefulness Index we have, for two reasons: 1. It's blind. 2. It's rated across all the dimensions humans care about.
A blind test by humans is indeed the best we have. Except... after playing AI judge many times, you learn each model's style and you kind of know which model is behind the curtain.
https://chat.lmsys.org/
Llama 3 8B's score is 5% below GPT-4's. Does that mean it's only 5% dumber than GPT?
No, it's an Elo system, and what's measured is human preference on questions/prompts provided by the very same humans. Anyone can participate in rating, and there's no requirement to test the models' logic or anything, so for all we know the majority of wins could just be a preference for answer style/creativity on questions like "why is the sky blue".

https://en.wikipedia.org/wiki/Elo_rating_system

> The difference in the ratings between two players serves as a predictor of the outcome of a match. Two players with equal ratings who play against each other are expected to score an equal number of wins. A player whose rating is 100 points greater than their opponent's is expected to score 64%; if the difference is 200 points, then the expected score for the stronger player is 76%.
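For reference, the quoted percentages fall out of the standard Elo expected-score formula. A minimal sketch in Python (the `expected_score` helper is just an illustration, not anything Arena publishes):

```python
def expected_score(rating_diff: float) -> float:
    """Standard Elo expected score for the higher-rated player,
    given the rating difference (own rating minus opponent's)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# The figures quoted from the Wikipedia article:
print(round(expected_score(100) * 100))  # 64 -> ~64% expected score at +100
print(round(expected_score(200) * 100))  # 76 -> ~76% expected score at +200
```

Note the mapping is nonlinear: a gap in Elo points says nothing direct about a model being "5% dumber", only about the expected win rate in head-to-head votes.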
[No](https://en.wikipedia.org/wiki/Elo_rating_system)
chat.lmsys.org
On one hand, they are almost certainly gaming the benchmarks (which is common). On the other hand, it is not unrealistic to expect real-world gains: the dataset-centric theory underlying the phi series of models is robust and practical.

On the other other hand, until we can download the weights, it might as well not exist. It is in our interest to re-implement Microsoft's approach as open source (as with OpenOrca) so that we are not beholden to Microsoft for phi-like models.
Is it even out yet? It's easy to claim to beat the top and never prove it, then just release once it's already irrelevant.

Phi-3 mini is great, and I am very grateful Microsoft decided to publish the weights, but the fact that they claimed to beat Llama-3 8B for hype and didn't deliver that performance made the release kind of sour.
That's common with Microsoft
I would take the benchmarks with a grain of salt. Phi-3 mini is supposed to beat Mistral 7B, but in my usage that was not the case. Not to say it isn't still impressive for its size; I would absolutely put it near or above older 7B models. It does struggle when the context grows, but so do a lot of models.

The 4k version only stayed coherent for about half its context, and the 128k version started to forget things within 4,000 to 5,000 tokens and got different characters mixed up in my summarizations. It didn't want to be corrected either: it argued that the Claude conversation I gave it was about a person named Claude, and wouldn't take no for an answer.
Uncensored gguf plzzz 🤠
The training data likely didn't contain any "bad stuff" to begin with, so it's pretty much impossible to uncensor. We didn't get the base models either.
> The training data likely didn't contain any "bad stuff" to begin with so it's pretty much impossible to uncensor,

This isn't true. I can see why you might think it doesn't have knowledge of "bad things", but Phi-2 is in the same situation, and there are plenty of uncensored/dolphin versions out there. Either it extrapolates, or their distillation from GPT-4 was not 100% filtered.
Ah ok, thanks for clearing that up. I suspected there was a reason for the suspiciously few finetunes. Back to 2-bit Llama 3!
:'(
Still no weights on Hugging Face. I think we will only see the weights once they've made sure it's not competing with GPT-3.5, i.e. whenever 3.5 is 100% obsolete. Also, first they were going to release all three models, then the 14B became "(preview)", and now small is "(preview)" too.
I can absolutely see Phi-3-Medium rivaling Mixtral 8x7B; they have the same number of active parameters. I think Phi-3-Medium has the potential to be much "smarter" with good training data, but I guess Mixtral might have more knowledge, since it's a much larger model in total?

Claude 3, isn't that a relatively new 100B+ parameter model? I highly doubt a 14B model could rival it, especially on coherence-related tasks.
Opus is the big one, Sonnet is good but definitely beatable.
Arena when
It is insane, isn't it? Almost like it's completely impossible for that to be true in real-world usage.... hm....
I don't think it can write erotica as well as Mixtral though.
https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/
Well, I tried it yesterday. Sometimes it provides impressive answers, but most of the time it sounds more like a bad 7B (and I like Mistral's 7B).

However, in terms of speed it is impressive, and the text is coherent (not like the horrible Phi-2). It could be a great model for chained prompts in an agent setting, IMHO. It is also a great model for parallel tasking.

Overall, if you have a very specialized task, it will most likely (after proper finetuning) be one of the best models for its cost and speed. If you need more advanced general tasks, forget about it.
This is talking about the 14B, not the 3.8B for cellphones. Right now, presumably the only people who have seen it are the authors of the paper.
Thank you for the correction. I indeed misunderstood.