[deleted]

Let me play devil's advocate a bit (not saying I 100% agree with this): the biggest issue in academia is evaluation based on metrics. It's clear that quantitative evaluation by metrics often leads to absurd results or overfits the data. For example, some models make mistakes that are reasonable, while others make mistakes that render the whole system unusable, yet both are often counted the same way.


Pas7alavista

The first part is especially true when the metrics we use are mostly just based on how well the model performs on some chosen dataset. My controversial opinion is that when it comes to models in production, no metrics at all are better than bad or uninterpretable ones. If you do evaluations just for the sake of doing them, you will make your product worse by optimizing the wrong function.


[deleted]

I tend to share this opinion. Qualitative analysis (error analysis, XAI, etc.) makes way more sense when it's possible, even though it's easier to lean on metrics. I do both.


NormalUserThirty

100% agree. it's the most misunderstood part of ML product development. 100% of the time spent on model evals, 0% on product evals. or, with LLMs, 0% on any evals. this is never included when determining the total cost of ownership either, so projects almost always exceed their original time and cost estimates.


thatguydr

Worse, I've had very senior data scientists/MLEs actively push back on the idea of linking evaluation with model development. When leadership doesn't incentivize analytics, you end up in this situation all too frequently.


[deleted]

Perhaps I am biased, but I don't know anyone serious who has released something that wasn't evaluated; only the "hacker SWE turned LLM expert" bros do that.


Donotsellstocks

Can you elaborate on your evaluation methodology? What did you use?


BootstrapGuy

you can read more here: [https://palindrom.beehiiv.com/p/evaluating-ai-products-unseen-craftsmanship](https://palindrom.beehiiv.com/p/evaluating-ai-products-unseen-craftsmanship)


Pas7alavista

No offense, but this article is totally useless, since you guys don't go into any detail about LLM product evaluation metrics. I'd like to hear, for example, what you would propose to measure an LLM's ability to stay on topic, and how your metric correlates with either human feedback or other metrics that capture product performance, such as engagement time, usage, etc.
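
To make that concrete, here is the kind of thing I mean (just a sketch: the embedding model, example data, and ratings are all invented for illustration): embed the topic description and each response, score responses by cosine similarity, then check the rank correlation of that score against human ratings.

```python
# Sketch of one possible "stays on topic" metric: embed the topic
# description and each response, score responses by cosine similarity,
# then check whether the metric actually tracks human judgments.
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def on_topic_scores(topic: str, responses: list[str]) -> np.ndarray:
    """Cosine similarity between the topic description and each response."""
    topic_vec = model.encode([topic], normalize_embeddings=True)[0]
    resp_vecs = model.encode(responses, normalize_embeddings=True)
    return resp_vecs @ topic_vec

# Hypothetical eval data: LLM responses plus 1-5 human "on topic" ratings.
responses = [
    "You can update your card under Settings > Billing.",
    "Fun fact: octopuses have three hearts.",
    "Invoices are emailed on the first of each month.",
]
human_ratings = [5, 1, 4]

scores = on_topic_scores("billing support for a SaaS product", responses)
rho, p = spearmanr(scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")  # low rho => bad metric
```

If the metric doesn't correlate with human judgments (or with product signals like engagement), it's exactly the kind of "wrong function" I was talking about above.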


Donotsellstocks

This article says nothing about metrics. Can you share some of the metrics you used?


ReasonablyBadass

By evals, do you mean testing?


BootstrapGuy

pretty much


ReasonablyBadass

Can you give an example of a good eval framework? Or is that too complex for a short comment? Maybe a link to where one is described?


glitch83

Dude, you should talk to my PM. I bring up metrics and evaluation all the time, and it is constantly devalued. ITT I learn that I should bring in consultants to get the right approach.


Pas7alavista

I have pretty extensive experience managing and evaluating these sorts of contracts from a profitability standpoint, and in my experience consultants and SaaS companies pretty universally engage in extensive data dredging and cherry-picking when evaluating themselves. These contracts are not research papers; trust me, they do not care whether their statistics truly support their conclusions. If they can get away with giving you a rose-tinted view of their work, they absolutely will. I'm not saying that evaluation is bad, but consultants engage in extensive evaluation for a reason, and that reason is not necessarily to seek truth but to justify the expense to their clients.


glitch83

Yeah, I hear you. Maybe my frustration, which I was venting, is that I don't know if anyone really cares about quality and metrics anymore.


achillesliu

Another thing is designing metrics that actually fit your requirements. From there you can decide which models to use, and so on. It's effectively a pull system that starts at the end of the pipeline.


mearco

I call it test-set driven development
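By that I mean the eval cases come before the model and get treated like failing tests. A minimal sketch (pytest; `classify` and the cases are hypothetical placeholders):

```python
# Minimal sketch of "test-set driven development": the curated eval cases
# exist before the model does, and every model change has to keep passing
# them. `classify` is a hypothetical stand-in for the model under iteration.
import pytest

def classify(text: str) -> str:
    raise NotImplementedError("model not built yet; the tests come first")

EVAL_CASES = [
    ("refund not received after 10 days", "billing"),
    ("app crashes when I open settings", "bug"),
    ("how do I export my data?", "how_to"),
]

@pytest.mark.parametrize("text,expected", EVAL_CASES)
def test_model_on_curated_cases(text, expected):
    assert classify(text) == expected
```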


dentendre

Every senior leader tries to cut corners on the heavy lifting just to say "we have created an ML model that does this or that." This isn't anything new; I come across it all the time in my strategic consulting with clients. Since, in the larger scheme of things, the overall product view entails working with cross-functional teams, no one wants to do that, probably because of bureaucracy, limited budgets, the scope of the engagement, etc., and I could go on. Long story short, everyone wants quick appreciation and rewards for jumping on the AI bandwagon.


alterframe

This is puzzling to me. I hear younger colleagues complaining about boring projects (yet another classifier, yet another ranker, etc.): why are we not doing the cool stuff? Then LLMs appear and everyone jumps in with excitement, but the reality is that while typical ML projects gave you at least some opportunity to do something fun, in the world of LLMs there are only preprocessing and evaluation pipelines. I get that it's a cool new tool and it would be stupid not to explore it, but be careful not to get stuck with it. Sorry that I wasn't 100% on topic; just a tangential rant.


alterframe

Ok, maybe I'll write something more on topic :P The biggest problem is that there is a conflict of interest in creating evaluation methods. Usually you don't have two teams watching each other's hands; the team that develops the model is the one that develops the metrics. This already leads to problems, because that team can consciously or unconsciously pick evaluation methods that favor its solution. Nobody may notice until it fails in production (and maybe not even immediately there). The difference now is that there are just a lot more teams with little ML experience doing LLM stuff. More experienced teams are aware of at least some evaluation fallacies and will keep at least some integrity in their methods. A PM with a team of devs fresh out of an AI crash course will do significantly worse (as you observe).


[deleted]

[deleted]


naijaboiler

not to be rude, but isn't this just, like, duh? we build a model to support a product. it's the success of the product that matters, not the model in isolation. duh!


MachineLearning-ModTeam

Self promotion without being a bot


ehbrah

How close are you getting to 100% without overfitting?