Evaluating ML Models: It’s About Choosing Your Mistakes

When we talk about evaluating ML models, we often jump straight to metrics—accuracy, precision, recall, RMSE… But that skips the real question: what kind of mistakes can your system afford to make? Every model will be wrong; the only control you have is how it is wrong.

Classification: Start with the Mistakes

In classification, every prediction falls into one of four buckets:

- True Positive (TP): correctly predicted positive
- True Negative (TN): correctly predicted negative
- False Positive (FP): predicted positive, actually negative
- False Negative (FN): predicted negative, actually positive

These four counts make up your confusion matrix, and every classification metric below is derived from them. Now, the key intuition: FP and FN rarely cost the same.
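As a minimal sketch, the four buckets can be counted directly from paired labels. The helper name `confusion_counts`, the 1/0 label encoding, and the sample labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        else:  # predicted 0, actual 1
            fn += 1
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Made-up labels, purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
```

With these labels the model makes one false positive and one false negative; whether those two mistakes matter equally is a question about the problem, not the model.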

Think about real scenarios:

Spam filter:
- FP $\rightarrow$ real email goes to spam $\rightarrow$ bad user experience
- FN $\rightarrow$ spam reaches inbox $\rightarrow$ annoying but tolerable
- Result: You care more about precision.

Medical diagnosis:
- FP $\rightarrow$ unnecessary follow-up test
- FN $\rightarrow$ missed disease $\rightarrow$ potentially fatal
- Result: You care more about recall.

So before metrics, decide: Which error is more expensive?
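One way to make that decision concrete is to put a rough price on each error type and compare models by total cost rather than by error count. The sketch below assumes a hypothetical screening scenario in which a missed case costs 50 times as much as a false alarm; the cost values and the error counts for both models are invented.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's mistakes, given a price per error type."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical costs: a miss (FN) is 50x as expensive as a false alarm (FP).
COST_FP = 1
COST_FN = 50

# Invented error counts for two candidate models on the same test set.
model_a = {"FP": 40, "FN": 2}   # 42 errors in total
model_b = {"FP": 5, "FN": 10}   # 15 errors in total

cost_a = total_cost(model_a["FP"], model_a["FN"], COST_FP, COST_FN)
cost_b = total_cost(model_b["FP"], model_b["FN"], COST_FP, COST_FN)

print(f"Model A: {cost_a}, Model B: {cost_b}")
```

Under these assumed costs, model B makes far fewer mistakes overall yet is the more expensive choice, because its mistakes are the costly kind.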

The Metrics (Built on That Intuition)

Accuracy

$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$ $\rightarrow$ Reliable only when classes are roughly balanced; with a rare positive class, a model that always predicts negative still scores high.
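A quick sketch of why accuracy can mislead on imbalanced data: the dataset below is synthetic (10 positives out of 1,000 samples), and the "model" simply predicts negative every time.

```python
# Synthetic, heavily imbalanced labels: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990

# A degenerate "model" that always predicts the negative class.
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# How many of the actual positives did it find?
positives_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy = {accuracy:.2f}, positives caught = {positives_caught} of 10")
```

The accuracy is 99% even though not a single positive case was detected.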

Precision

$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$ $\rightarrow$ Of everything we predicted as positive, how many were actually positive?

Recall

$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$ $\rightarrow$ Of actual positives, how many did we catch?

F1 Score

$\text{F}_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ $\rightarrow$ The harmonic mean of precision and recall: it is high only when both are high, so a large gap between the two drags it down.
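The three formulas above can be sketched from raw counts as follows. The function name `precision_recall_f1` and the example counts are illustrative, and returning 0.0 when a denominator is zero is a convention chosen for this sketch, not a universal rule.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts.

    Returning 0.0 for an undefined ratio is an assumption of this sketch.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: a model that flags almost everything as positive.
precision, recall, f1 = precision_recall_f1(tp=10, fp=90, fn=0)
arithmetic_mean = (precision + recall) / 2

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"F1={f1:.3f} arithmetic mean={arithmetic_mean:.2f}")
```

Here recall is perfect but precision is 0.10. The arithmetic mean (0.55) would hide that, while F1 (about 0.18) stays close to the weaker of the two.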

Important

$\text{F}_1$ is not inherently “better” than precision or recall. It weights the two equally, which makes it a compromise for when you can’t clearly prioritise FP vs FN.

Regression: Now It’s About Distance

Regression is different. You’re not classifying—you’re estimating a value. So evaluation becomes: How far off are we?

Mean Absolute Error (MAE)

$\text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|$ $\rightarrow$ Average absolute error, in the same unit as the target; easy to interpret.

Mean Squared Error (MSE)

$\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$ $\rightarrow$ Penalizes large errors more heavily, but is expressed in squared units.

Root Mean Squared Error (RMSE)

$\text{RMSE} = \sqrt{\text{MSE}}$ $\rightarrow$ Back in the same unit as the target, so easier to read than MSE while keeping its sensitivity to large errors.
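A minimal sketch of the three regression metrics in plain Python; the function names and the sample values are made up for illustration.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


# Made-up values: three small errors (1, 0, 2) and one large one (4).
y_true = [3, 5, 2, 7]
y_pred = [2, 5, 4, 3]

mae_value = mae(y_true, y_pred)
mse_value = mse(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)

print(f"MAE={mae_value:.2f} MSE={mse_value:.2f} RMSE={rmse_value:.2f}")
```

On this sample the RMSE (about 2.29) sits above the MAE (1.75) because the single error of 4 dominates once errors are squared.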

The Subtle but Critical Insight

A number by itself means nothing. For example, an $\text{RMSE}$ of £10 would be excellent for house prices, but terrible for predicting the price of a £2 item.

Always anchor error to:

- The scale of the problem
- A baseline (e.g., mean prediction)
- Domain tolerance
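As a rough sketch of anchoring to a baseline, the model's RMSE can be compared with the RMSE of always predicting the mean. The prices below are invented (in thousands), and for brevity the mean is taken from the same values being scored; in practice it would normally come from the training data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Invented house prices (in thousands) and model predictions.
y_true = [200, 220, 250, 270, 310]
y_pred = [210, 215, 240, 280, 300]

# Baseline: predict the mean for every sample (simplification: the mean
# is computed from y_true itself rather than from separate training data).
mean_price = sum(y_true) / len(y_true)
baseline_pred = [mean_price] * len(y_true)

model_rmse = rmse(y_true, y_pred)
baseline_rmse = rmse(y_true, baseline_pred)
ratio = model_rmse / baseline_rmse

print(f"model RMSE={model_rmse:.2f} baseline RMSE={baseline_rmse:.2f} "
      f"ratio={ratio:.2f}")
```

A ratio well below 1 means the model is adding value over the naive guess; a ratio near 1 means an impressive-looking RMSE is mostly a reflection of the data's scale.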

What Most People Miss

Metrics are not objective truth. They are proxies for business cost:

- Precision tracks the cost of false alarms
- Recall tracks the cost of missing real events
- RMSE tracks how much large deviations hurt

If you don’t define the cost, you will optimize the wrong thing—very efficiently. And that’s where most ML systems quietly fail.