Evaluating ML Models: It’s About Choosing Your Mistakes

Model Evaluation Metrics Fundamentals

When we talk about evaluating ML models, we often jump straight to metrics.

Accuracy, precision, recall, RMSE…

But that skips the real question:

What kind of mistakes can your system afford to make?

Because every model will be wrong. The only control you have is how it is wrong.

Classification: Start with the Mistakes

In classification, every prediction falls into one of four buckets:

  • True Positive (TP): correctly predicted positive
  • True Negative (TN): correctly predicted negative
  • False Positive (FP): predicted positive, actually negative
  • False Negative (FN): predicted negative, actually positive

This is your confusion matrix. Everything else comes from here.
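As a minimal sketch of the four buckets above, the counts can be tallied directly from labels and predictions. The function name `confusion_counts` and the sample labels are made up for illustration, and the code assumes binary labels encoded as 1 (positive) and 0 (negative).

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        else:  # predicted == 0 and actual == 1
            fn += 1
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Made-up labels, purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
```

Every metric in the rest of this post is a ratio built from these four numbers.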

Now the key intuition: FP and FN are not equal.

Think in real scenarios:

  • Spam filter:
    • FP → real email goes to spam → bad user experience
    • FN → spam reaches inbox → annoying but tolerable
    • Result: You care more about precision.
  • Medical diagnosis:
    • FP → unnecessary follow-up test
    • FN → missed disease → potentially fatal
    • Result: You care more about recall.

So before metrics, decide: Which error is more expensive?
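One way to make "which error is more expensive?" concrete is to put a price on each error type and compare models by total cost. The sketch below is illustrative only: the error counts, the cost ratios, and the helper `total_cost` are all hypothetical, chosen to mirror the spam and medical scenarios.

```python
def total_cost(errors, cost_fp, cost_fn):
    """Total cost of a model's mistakes under the given per-error costs."""
    return errors["FP"] * cost_fp + errors["FN"] * cost_fn


# Hypothetical error counts for two models on the same test set.
model_a = {"FP": 40, "FN": 5}    # raises many alarms, misses little
model_b = {"FP": 10, "FN": 20}   # raises few alarms, misses more

# Hypothetical cost ratios, chosen only to mirror the two scenarios above.
scenarios = {
    "medical diagnosis": {"cost_fp": 1, "cost_fn": 50},
    "spam filter": {"cost_fp": 10, "cost_fn": 1},
}

best = {}
for name, costs in scenarios.items():
    cost_a = total_cost(model_a, **costs)
    cost_b = total_cost(model_b, **costs)
    best[name] = "A" if cost_a < cost_b else "B"
    print(f"{name}: model A costs {cost_a}, model B costs {cost_b}, prefer {best[name]}")
```

The same two models swap places depending on the costs, which is why the cost question has to be answered first.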

The Metrics (Built on That Intuition)

Accuracy

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

→ Informative only when classes are roughly balanced; on a heavily skewed dataset, always predicting the majority class already scores high.
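A small sketch of why accuracy misleads on imbalanced data. The dataset is made up: 10 positives in 1,000 examples, scored against a "model" that always predicts negative.

```python
# Made-up, heavily imbalanced dataset: 10 positives out of 1,000 examples.
y_true = [1] * 10 + [0] * 990
# A "model" that always predicts the negative class.
y_pred = [0] * 1000

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

positives_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = positives_caught / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # 0.99
print(f"recall = {recall:.2f}")      # 0.00
```

The model is 99% accurate and catches none of the positives.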

Precision

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$

→ Of what we predicted as positive, how much was correct?

Recall

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

→ Of actual positives, how many did we catch?

F1 Score

$$\text{F}_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

→ The harmonic mean of precision and recall: it stays high only when both are high, and drops sharply when either one is low.

[!IMPORTANT] $\text{F}_1$ is not “better”; it’s just a compromise when you can’t clearly prioritise FP vs FN.
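The precision, recall, and F1 formulas above can be sketched from raw counts as follows. The helper `precision_recall_f1` and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention, not something the formulas dictate.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    Assumption: a metric with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: a cautious model that is usually right when it
# fires, but misses most of the actual positives.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=810)
arithmetic_mean = (precision + recall) / 2

print(f"precision = {precision:.2f}")            # 0.90
print(f"recall = {recall:.2f}")                  # 0.10
print(f"F1 = {f1:.2f}")                          # 0.18
print(f"arithmetic mean = {arithmetic_mean:.2f}")  # 0.50
```

A plain average would report 0.50 here; F1 reports 0.18 because the low recall dominates.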

Regression: Now It’s About Distance

Regression is different. You’re not classifying—you’re estimating a value. So evaluation becomes: How far off are we?

Mean Absolute Error (MAE)

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|$$

→ Average absolute error, in the same unit as the target; easy to interpret.

Mean Squared Error (MSE)

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

→ Penalizes large errors more heavily.

Root Mean Squared Error (RMSE)

$$\text{RMSE} = \sqrt{\text{MSE}}$$

→ Same unit as the target, so it is easier to read than MSE while still penalizing large errors.
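A minimal sketch of the three regression metrics, using made-up numbers. The two hypothetical prediction sets have the same MAE, but one spreads its error evenly and the other concentrates it in a single large miss.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))


# Made-up targets and two hypothetical sets of predictions.
y_true = [10, 12, 11, 13]
evenly_off = [11, 11, 12, 12]   # every prediction is off by 1
one_big_miss = [10, 12, 11, 17]  # three exact, one off by 4

for name, y_pred in [("evenly off", evenly_off), ("one big miss", one_big_miss)]:
    print(f"{name}: MAE={mae(y_true, y_pred):.1f}, "
          f"MSE={mse(y_true, y_pred):.1f}, RMSE={rmse(y_true, y_pred):.1f}")
```

Both prediction sets have an MAE of 1.0, but the single large miss doubles the RMSE (2.0 versus 1.0).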

The Subtle but Critical Insight

A number by itself means nothing.

$$\text{RMSE} = 10$$

  • Great for house prices
  • Terrible for predicting a £2 item

So always anchor error to:

  1. Scale of the problem
  2. Baseline (e.g., mean prediction)
  3. Domain tolerance
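One way to anchor an error to scale and baseline, sketched under assumptions: divide the model's RMSE by the RMSE of a baseline that always predicts the mean. The helper `rmse_vs_mean_baseline` and all numbers are hypothetical, and for brevity the mean is taken from the evaluation targets themselves; in practice the baseline mean would normally come from training data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rmse_vs_mean_baseline(y_true, y_pred):
    """Ratio of model RMSE to the RMSE of always predicting the mean.

    Below 1 means the model beats the baseline; above 1 means it is worse.
    Simplification: the mean is computed from y_true itself.
    """
    mean_value = sum(y_true) / len(y_true)
    baseline_rmse = rmse(y_true, [mean_value] * len(y_true))
    return rmse(y_true, y_pred) / baseline_rmse


# Made-up house prices, each prediction off by exactly 10.
house_prices = [200_000, 250_000, 300_000, 350_000]
house_preds = [p + 10 for p in house_prices]

# Made-up prices of cheap items, each prediction also off by exactly 10.
item_prices = [1, 2, 3, 2]
item_preds = [p + 10 for p in item_prices]

house_ratio = rmse_vs_mean_baseline(house_prices, house_preds)
item_ratio = rmse_vs_mean_baseline(item_prices, item_preds)

print(f"houses: RMSE = 10, ratio to baseline = {house_ratio:.5f}")
print(f"items:  RMSE = 10, ratio to baseline = {item_ratio:.2f}")
```

Both models have an RMSE of 10, yet the house-price ratio is far below 1 and the cheap-item ratio is far above 1. Whether either is acceptable still depends on the domain tolerance.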

What Most People Miss

Metrics are not objective truth. They are proxies for business cost.

  • Precision = cost of false alarms
  • Recall = cost of missing real events
  • RMSE = sensitivity to large deviations

If you don’t define the cost, you will optimize the wrong thing—very efficiently. And that’s where most ML systems quietly fail.

In this series: ML Classics (3 / 3)