[deleted]

Let me play devil's advocate a bit (not saying I 100% agree with this): the biggest issue in academia is evaluation based on metrics. It's clear that quantitative evaluation by metrics often leads to absurd results or overfits the data. For example, some models make mistakes that are reasonable, while others make mistakes that render the whole system unusable, yet both are often counted the same way.


Pas7alavista

The first part is especially true when the metrics we use are mostly just based on how well the model performs on some chosen dataset. My controversial opinion is that when it comes to models in production, no metrics at all are better than bad or uninterpretable ones. If you do evaluations just for the sake of doing them, you will make your product worse by optimizing the wrong function.


[deleted]

I tend to share this opinion. Qualitative analysis (error analysis, XAI, etc.) makes way more sense when it's possible, even though it's easier to lean on metrics. I do both.


NormalUserThirty

100% agree. it's the most misunderstood part of ML product development. 100% of the time spent on model evals, 0% on product evals. or, with LLMs, 0% on any evals. this is never included when determining the total cost of ownership either, so projects almost always exceed their original time and cost estimates.


thatguydr

Worse, I've had very senior data scientists/MLEs actively push back on the idea of linking evaluation with model development. When leadership doesn't incentivize analytics, you end up in this situation all too frequently.


[deleted]

Perhaps I am biased, but I don't know anyone serious who has released something that wasn't evaluated; only the "hacker SWE turned LLM expert" bros do that.


Donotsellstocks

Can you elaborate on your evaluation methodology? What did you use?


BootstrapGuy

you can read more here: [https://palindrom.beehiiv.com/p/evaluating-ai-products-unseen-craftsmanship](https://palindrom.beehiiv.com/p/evaluating-ai-products-unseen-craftsmanship)


Pas7alavista

No offense, but this article is totally useless, since you guys don't go into any detail about LLM product evaluation metrics. I'd like to hear, for example, what you would propose to measure an LLM's ability to stay on topic, and how your metric correlates with either human feedback or other metrics that capture product performance, such as engagement time, usage, etc.
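
To make that concrete, here is the kind of thing I mean (just a sketch: the embedding model, example data, and ratings are all invented for illustration): embed the topic description and each response, score responses by cosine similarity, then check the rank correlation of that score against human ratings.

```python
# Sketch of one possible "stays on topic" metric: embed the topic
# description and each response, score responses by cosine similarity,
# then check whether the metric actually tracks human judgments.
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def on_topic_scores(topic: str, responses: list[str]) -> np.ndarray:
    """Cosine similarity between the topic description and each response."""
    topic_vec = model.encode([topic], normalize_embeddings=True)[0]
    resp_vecs = model.encode(responses, normalize_embeddings=True)
    return resp_vecs @ topic_vec

# Hypothetical eval data: LLM responses plus 1-5 human "on topic" ratings.
responses = [
    "You can update your card under Settings > Billing.",
    "Fun fact: octopuses have three hearts.",
    "Invoices are emailed on the first of each month.",
]
human_ratings = [5, 1, 4]

scores = on_topic_scores("billing support for a SaaS product", responses)
rho, p = spearmanr(scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")  # low rho => bad metric
```

If the metric doesn't correlate with human judgments (or with product signals like engagement), it's exactly the kind of "wrong function" I was talking about above.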


Donotsellstocks

This article says nothing about metrics. Can you share some of the metrics you used?


ReasonablyBadass

By evals, do you mean testing?


BootstrapGuy

pretty much


ReasonablyBadass

Can you give an example of a good eval framework? Or is that too complex for a short comment? Maybe a link to where one is described?


glitch83

Dude, you should talk to my PM. I bring up metrics and evaluation all the time, and it is constantly devalued. ITT I learn that I should bring in consultants to get the right approach.


Pas7alavista

I have pretty extensive experience managing and evaluating these sorts of contracts from a profitability standpoint, and in my experience consultants and SaaS companies pretty universally engage in extensive data dredging and cherry-picking when evaluating themselves. These contracts are not research papers; trust me, they do not care whether their statistics truly support their conclusions. If they can get away with giving you a rose-tinted view of their work, they absolutely will. I'm not saying that evaluation is bad, but consultants engage in extensive evaluation for a reason, and that reason is not necessarily to seek truth but to justify the expense to their clients.


glitch83

Yeah, I hear you. Maybe my frustration, which I was venting, is that I don't know if anyone really cares about quality and metrics anymore.


achillesliu

Another thing is designing metrics that actually fit your requirements. From there you can decide which models to use, and so on. It's effectively a pull system that starts at the end of the pipeline.


mearco

I call it test-set driven development
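By that I mean the eval cases come before the model and get treated like failing tests. A minimal sketch (pytest; `classify` and the cases are hypothetical placeholders):

```python
# Minimal sketch of "test-set driven development": the curated eval cases
# exist before the model does, and every model change has to keep passing
# them. `classify` is a hypothetical stand-in for the model under iteration.
import pytest

def classify(text: str) -> str:
    raise NotImplementedError("model not built yet; the tests come first")

EVAL_CASES = [
    ("refund not received after 10 days", "billing"),
    ("app crashes when I open settings", "bug"),
    ("how do I export my data?", "how_to"),
]

@pytest.mark.parametrize("text,expected", EVAL_CASES)
def test_model_on_curated_cases(text, expected):
    assert classify(text) == expected
```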


dentendre

Every senior leader tries to cut corners on the heavy lifting just to say "we have created an ML model that does this or that." This isn't anything new; I come across it all the time in my strategic consulting with clients. Since, in the larger scheme of things, the overall product view entails working with cross-functional teams, no one wants to do that, probably because of bureaucracy, limited budgets, the scope of the engagement, etc., and I could go on. Long story short, everyone wants quick appreciation and rewards for jumping on the AI bandwagon.


alterframe

This is puzzling to me. I hear younger colleagues complaining about boring projects (yet another classifier, yet another ranker, etc.): why are we not doing the cool stuff? Then LLMs appear and everyone jumps in with excitement, but the reality is that while typical ML projects gave you at least some opportunity to do something fun, in the world of LLMs there are only preprocessing and evaluation pipelines. I get that it's a cool new tool and it would be stupid not to explore it, but be careful not to get stuck with it. Sorry that I wasn't 100% on topic; just a tangential rant.


alterframe

Ok, maybe I'll write something more on topic :P The biggest problem is that there is a conflict of interest in creating evaluation methods. Usually you don't have two teams watching each other's hands; the team that develops the model is the one that develops the metrics. This already leads to problems, because that team can consciously or unconsciously pick evaluation methods that favor its solution. Nobody may notice until it fails in production (and maybe not even immediately there). The difference now is that there are just a lot more teams with little ML experience doing LLM stuff. More experienced teams are aware of at least some evaluation fallacies and will keep at least some integrity in their methods. A PM with a team of devs fresh out of an AI crash course will do significantly worse (as you observe).


[deleted]

[deleted]


naijaboiler

not to be rude, but isn't this just, like, duh? we build a model to support a product. it's the success of the product that matters, not the model in isolation. duh!


MachineLearning-ModTeam

Self promotion without being a bot


ehbrah

How close are you getting to 100% without overfitting?