Evaluating ML Models: It’s About Choosing Your Mistakes
When we talk about evaluating ML models, we often jump straight to metrics.
Accuracy, precision, recall, RMSE…
But that skips the real question:
What kind of mistakes can your system afford to make?
Because every model will be wrong. The only control you have is how it is wrong.
Classification: Start with the Mistakes
In classification, every prediction falls into one of four buckets:
- True Positive (TP): correctly predicted positive
- True Negative (TN): correctly predicted negative
- False Positive (FP): predicted positive, actually negative
- False Negative (FN): predicted negative, actually positive
This is your confusion matrix. Everything else comes from here.
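To make the four buckets concrete, here is a minimal sketch in plain Python. The label lists are made-up illustration data and `confusion_counts` is a hypothetical helper name, not something from a library.

```python
# Hypothetical labels for illustration: 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = positive)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            counts["TP"] += 1
        elif predicted == 0 and actual == 0:
            counts["TN"] += 1
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1
        else:  # predicted 0, actual 1
            counts["FN"] += 1
    return counts


counts = confusion_counts(y_true, y_pred)
print(counts)
```

In practice you would likely reach for a library routine instead, but counting by hand once makes it obvious that every later metric is just arithmetic on these four numbers.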
Now the key intuition: FP and FN rarely cost the same.
Think in terms of real scenarios:
- Spam filter:
  - FP: a real email goes to spam, which is a bad user experience
  - FN: spam reaches the inbox, which is annoying but tolerable
  - Result: you care more about precision.
- Medical diagnosis:
  - FP: an unnecessary follow-up test
  - FN: a missed disease, potentially fatal
  - Result: you care more about recall.
So before metrics, decide: Which error is more expensive?
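One way to make that decision explicit is to put a price on each error type and compare candidate models by total cost. The sketch below is only an illustration: the error counts, the cost values, and the names `total_cost` and `cheapest_model` are all assumptions, not measurements.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's mistakes, given a price per error type."""
    return fp * cost_fp + fn * cost_fn


def cheapest_model(models, cost_fp, cost_fn):
    """Return the name of the model with the lowest total error cost."""
    return min(
        models,
        key=lambda name: total_cost(
            models[name]["FP"], models[name]["FN"], cost_fp, cost_fn
        ),
    )


# Hypothetical error counts for two models on the same test set.
models = {
    "cautious": {"FP": 5, "FN": 40},    # few false alarms, misses more
    "aggressive": {"FP": 40, "FN": 5},  # catches more, more false alarms
}

# Assumed costs: a spam filter where losing a real email hurts most...
spam_choice = cheapest_model(models, cost_fp=10.0, cost_fn=1.0)
# ...and a diagnostic setting where a missed case hurts most.
medical_choice = cheapest_model(models, cost_fp=1.0, cost_fn=100.0)

print("spam filter picks:", spam_choice)
print("medical test picks:", medical_choice)
```

The same two models swap places depending purely on the costs, which is the point: the ranking comes from the problem, not from the model.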
The Metrics (Built on That Intuition)
Accuracy
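The share of all predictions that were correct. Trustworthy only when classes are roughly balanced; with rare positives, always predicting negative already scores high.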
Precision
Of what we predicted as positive, how much was correct?
Recall
Of actual positives, how many did we catch?
F1 Score
The harmonic mean of precision and recall. It is high only when both are high, so it penalizes a model that trades one away for the other.
Important: F1 is not “better”. It is just a compromise for when you can’t clearly prioritise FP vs FN.
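All four metrics can be derived from the confusion-matrix counts. The sketch below uses the standard definitions; the function name, the made-up counts, and the choice to return 0.0 when a denominator is zero are assumptions for illustration.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 on a zero denominator (a convention, not a rule)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    precision = safe_div(tp, tp + fp)   # of predicted positives, how many were right
    recall = safe_div(tp, tp + fn)      # of actual positives, how many were caught
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical imbalanced test set: 1000 cases, only 10 positives.
# A model that always predicts "negative":
always_negative = classification_metrics(tp=0, tn=990, fp=0, fn=10)
# A model that catches 8 of the 10 positives but raises 20 false alarms:
useful = classification_metrics(tp=8, tn=970, fp=20, fn=2)

print(always_negative)
print(useful)
```

With these numbers the do-nothing model wins on accuracy (0.99 vs 0.978) while catching zero positives, which is why accuracy alone is a poor guide on imbalanced data.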
Regression: Now It’s About Distance
Regression is different. You’re not classifying—you’re estimating a value. So evaluation becomes: How far off are we?
Mean Absolute Error (MAE)
The average size of the errors, in the same unit as the output. Easy to interpret.
Mean Squared Error (MSE)
The average of the squared errors. Squaring penalizes large errors more heavily, but the result is in squared units.
Root Mean Squared Error (RMSE)
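The square root of MSE. It is back in the same unit as the output while keeping the extra penalty on large errors, so it is usually the most practical to report.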
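A small sketch of the three regression metrics, using only the standard library. The target and prediction values are invented for illustration; one prediction is deliberately far off to show how squaring reacts.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: average of (actual - predicted) squared."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of MSE, in the output's unit."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical values; the last prediction misses by 40.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 205.0, 210.0]

print("MAE :", mae(y_true, y_pred))
print("MSE :", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
```

Here MAE is 16.25 while RMSE comes out around 21.4: the single large miss pulls RMSE up much more than it pulls MAE.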
The Subtle but Critical Insight
A number by itself means nothing. The same error can be:
- Great for house prices
- Terrible for predicting a £2 item
So always anchor error to:
- Scale of the problem
- Baseline (e.g., mean prediction)
- Domain tolerance
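A rough sketch of that anchoring: compare the model's RMSE against a baseline that always predicts the mean, and against the typical size of the target. The data and the helper name `error_in_context` are hypothetical, and for brevity the baseline mean is taken from the same values being scored; in a real evaluation it would come from the training set.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    squared = [(a - p) ** 2 for a, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def error_in_context(y_true, y_pred):
    """Put the model's RMSE next to a mean-prediction baseline and the target scale."""
    mean_target = sum(y_true) / len(y_true)
    baseline_pred = [mean_target] * len(y_true)  # always predict the mean
    model_rmse = rmse(y_true, y_pred)
    baseline_rmse = rmse(y_true, baseline_pred)
    return {
        "model_rmse": model_rmse,
        "baseline_rmse": baseline_rmse,
        # Below 1.0 means the model beats the naive baseline.
        "ratio_to_baseline": model_rmse / baseline_rmse,
        # Error as a fraction of the typical target value.
        "relative_to_scale": model_rmse / abs(mean_target),
    }


# Hypothetical targets and predictions.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]

report = error_in_context(y_true, y_pred)
print(report)
```

Whether the resulting relative error is acceptable is the domain-tolerance question, and no formula answers that for you.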
What Most People Miss
Metrics are not objective truth. They are proxies for business cost.
- Precision = cost of false alarms
- Recall = cost of missing real events
- RMSE = sensitivity to large deviations
If you don’t define the cost, you will optimize the wrong thing—very efficiently. And that’s where most ML systems quietly fail.