Something’s not right
A stakeholder pings you on Slack: “The model’s wrong.” You brush it off initially because metrics looked fine, probably just bad luck. Then you check, and it’s not bad luck. It’s bad performance.
So you do what teams generally do when faced with an underperforming machine learning system: you rush to try and fix the model itself.
Your first reflex is to tweak the model
In this scenario the model can be anything: a linear regression, an image classifier, an LLM accessed via some third-party API, or whatever else takes data in and spits data out via the interaction of learned parameters.
The fixes range from simple to pretty esoteric, but they usually involve changing the model's behavior in some way. The stuff I've seen most often includes:
- Training for more epochs.
- Changing the system prompt so instead of “Act as an expert…” it says “You are an expert…”.
- Changing the model backbone or freezing some layers.
- Playing with the temperature.
- Maybe if we switch from logistic regression to LASSO we’ll fix it?
- Running some automated feature engineering library that will most likely only introduce noise.
- Tuning the learning rate.
We’re all guilty of some or all of these. I think the reason is that these tweaks usually involve changing a few lines of code and waiting, whereas the real fix takes a lot longer and is a lot more tedious.
My experience is that these adjustments don’t generally yield significant improvements, and more often than not lead to worse performance.
Why better data matters
On the other hand, better data does generally lead to better performance. In fact, every single time I’ve been involved in one of these situations, working on the data led to significant improvements.
Better data can be many things.
- Training data: more samples, with higher diversity and with fewer errors.
- Eval/test data: more samples, more aligned with the data universe the model will operate in, and obviously with fewer errors.
- Problem definition: what should the model be trying to solve and where? Sometimes we expect too much of machine learning systems, and sometimes the data universe has drifted too much and we need to refocus.
It also means reading the reasoning traces from your LLM of choice to understand where in the model’s thought process things went south and how you can prevent that in the future.
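One cheap check in that spirit: identical inputs carrying different labels are almost always labeling errors. Here’s a minimal pandas sketch, assuming a training set with hypothetical “text” and “label” columns:

```python
import pandas as pd

# Hypothetical training set with "text" and "label" columns.
df = pd.read_csv("training_data.csv")

# Identical inputs with more than one distinct label are a reliable
# smell of labeling errors (or of an underspecified problem definition).
label_counts = df.groupby("text")["label"].nunique()
suspicious = label_counts[label_counts > 1].index

# Eyeball the conflicting rows side by side.
print(df[df["text"].isin(suspicious)].sort_values("text"))
```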
One time my team at Mercado Libre was asked to take over a multilabel text classifier because the other team didn’t have the bandwidth to maintain and improve it. Along with it came a 500k-row dataset that we discarded as soon as we realized how mislabeled it was. We started over, rebuilding it manually, and with little over 10k samples we were getting much better F1 scores across all labels.
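For the per-label comparison itself, scikit-learn’s classification_report prints F1 for every label in one call. A toy sketch of a multilabel setup, where the synthetic dataset just stands in for real data:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for a real multilabel text dataset.
X, y = make_multilabel_classification(
    n_samples=2000, n_classes=5, n_labels=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

# Per-label precision, recall and F1: this is where a mislabeled dataset
# shows up as a handful of labels dragging everything down.
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```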
The unglamorous fix
The problem is that data work is tedious and has absolutely no glory. It is often associated with entry-level grunt work that’s better suited to interns at boring companies. Junior data scientists yearn for the day they can score a promotion and stop going insane from looking at data for hours on end.
Yet I argue that staring at data manually and fixing errors is the most efficient way of spending time for someone responsible for a machine learning system. That is to say, a day spent looking at data in a spreadsheet will yield a larger model improvement than a day swapping learning rate schedulers or any other model tweak.
I also argue that it’s something that everyone should do regardless of seniority and skillset. In my team at Mercado Libre, engineering managers and product owners manually look at data and their insights are often pivotal in improving model performance.
Even a few hours of error analysis in which you focus on the model’s most obvious mistakes by comparing predictions with actual labels can do wonders¹.
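A minimal sketch of what that first pass can look like, assuming a predictions dump with hypothetical “text”, “label” and “prediction” columns:

```python
import pandas as pd

# Hypothetical dump of model outputs next to ground truth.
df = pd.read_csv("predictions.csv")
errors = df[df["label"] != df["prediction"]]

# Rank confusion pairs by frequency: the top ones are the
# "most obvious mistakes" worth staring at first.
confusions = (
    errors.groupby(["label", "prediction"])
    .size()
    .sort_values(ascending=False)
)
print(confusions.head(10))

# Read actual examples from the worst pair before touching the model.
worst_label, worst_pred = confusions.index[0]
worst = errors[
    (errors["label"] == worst_label) & (errors["prediction"] == worst_pred)
]
print(worst[["text", "label", "prediction"]].head(20))
```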
Even if there is some low-hanging fruit, the best way to improve your model is to improve your data².
Models are code; systems are data. If you’re debugging the former without understanding the latter, you’re debugging blind.
1. Check out an old and not particularly pretty implementation of error analysis that handles different types of targets.
2. In the API-fueled LLM world we’re all moving towards, you can often get significant improvements just by choosing a larger and more recent model, like going from GPT-4.1 to GPT-5. This often comes with cost and latency tradeoffs, plus the usual “my amazing curated prompt suddenly doesn’t work at all and I need to fix it”.