The ML Process: From Raw Data to Deployed Model

[Figure: ML process diagram]

Once you get past the hype and framework wars, most ML work reduces to a surprisingly small loop:

Decide what you’re trying to predict and why.
Get data that actually reflects that problem.
Turn that data into something a model can use.
Train, evaluate, and tune a model.
Deploy it, watch it break, and repeat.

Tools change. This loop doesn’t.

This piece is a short refresher on that loop, ignoring language and library specifics (Python, scikit‑learn, etc.), so that when you do see the code, it feels like implementation detail rather than magic.

1. Start with the Problem, Not the Model

Many failed ML projects start with “Let’s use XGBoost / deep learning for this.”

Better starting point:

What decision are we trying to support or automate?
Is the output a category (spam / not spam, disease / no disease) or a number (price, demand, time)?
How will we judge “good enough”? Accuracy? RMSE? Something cost-based?

That gives you three things:

Type of problem: classification, regression, or something else.
Rough idea of acceptable error.
Language to talk to stakeholders about trade‑offs.

Until this is clear, everything else is just busywork.

2. Data Acquisition: Reality in a Messy CSV

Next: get data that reflects the reality your model will see.

Typical sources:

Application logs or event data.
Product databases (transactions, clicks, tickets).
External sources (APIs, open datasets, purchased data).

Two checks that matter more than volume:

Coverage: Do we see the full variety of cases we care about?
Alignment: Are the labels really what we think they are? (e.g., “churn” defined consistently.)

If your data doesn’t reflect the real decision context, no amount of modelling will fix it.
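As a rough sketch of those two checks, assume a hypothetical list of records with a `plan` field and a `churned` label. The field names, values, and the set of expected plans are all invented for illustration.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
records = [
    {"plan": "free", "churned": "yes"},
    {"plan": "free", "churned": "no"},
    {"plan": "pro", "churned": "Y"},
    {"plan": "pro", "churned": "no"},
]

# Cases we expect the model to meet in production (an assumption).
expected_plans = {"free", "pro", "enterprise"}

# Coverage: which expected cases never appear in the data?
seen_plans = Counter(r["plan"] for r in records)
missing_plans = expected_plans - set(seen_plans)

# Alignment: are all labels drawn from one agreed vocabulary?
allowed_labels = {"yes", "no"}
odd_labels = Counter(
    r["churned"] for r in records if r["churned"] not in allowed_labels
)

print("missing plans:", sorted(missing_plans))
print("unexpected labels:", dict(odd_labels))
```

Here the data never shows an enterprise customer, and one label uses a different spelling for the same outcome. Both are cheaper to catch now than after training.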

3. Data Cleaning and Feature Preparation

This is usually the unglamorous bulk of the work; 70–80% is the figure often quoted. You are:

Handling missing values (drop, impute, or flag them).
Dealing with inconsistent categories (different spellings, encodings).
Normalising or scaling numeric features where needed.
Encoding categorical variables (one‑hot, target encoding, etc.).
Engineering features that capture useful signal (ratios, counts, time deltas).
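A minimal sketch of the first four steps in plain Python, assuming hypothetical `age` and `country` columns. Mean imputation and standardisation are just one reasonable choice each, not the only one.

```python
from statistics import mean, pstdev

# Hypothetical raw rows; column names are made up for illustration.
rows = [
    {"age": 34, "country": "UK "},
    {"age": None, "country": "uk"},
    {"age": 50, "country": "France"},
]

# Missing values: impute with the mean of the known values and keep a flag.
known_ages = [r["age"] for r in rows if r["age"] is not None]
age_mean = mean(known_ages)
for r in rows:
    r["age_missing"] = r["age"] is None
    if r["age"] is None:
        r["age"] = age_mean

# Inconsistent categories: trim whitespace and unify case.
for r in rows:
    r["country"] = r["country"].strip().lower()

# Scaling: standardise to zero mean and unit variance.
ages = [r["age"] for r in rows]
mu, sigma = mean(ages), pstdev(ages)
for r in rows:
    r["age_scaled"] = (r["age"] - mu) / sigma

# Encoding: one-hot columns, one per distinct category.
categories = sorted({r["country"] for r in rows})
for r in rows:
    for c in categories:
        r[f"country_{c}"] = int(r["country"] == c)
```

In a real project the imputation mean, the scaling statistics, and the category list would be computed on the training data only and then reused on new data. Otherwise information from the evaluation data leaks into the features.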

A simple rule of thumb:

If a human would look at two columns together to understand the story, there is probably a useful feature to build from them.
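For instance, assuming hypothetical customer rows with a sign-up date, a last-order date, an order count, and total spend (all names invented here), two pairs of columns each yield one feature:

```python
from datetime import date

# Hypothetical customer rows; field names are illustrative only.
customers = [
    {"signup": date(2024, 1, 1), "last_order": date(2024, 3, 1),
     "orders": 6, "spend": 120.0},
    {"signup": date(2024, 2, 1), "last_order": date(2024, 2, 11),
     "orders": 0, "spend": 0.0},
]

for c in customers:
    # Time delta: how long the customer stayed active.
    c["days_active"] = (c["last_order"] - c["signup"]).days
    # Ratio: average order value, guarding against division by zero.
    c["avg_order_value"] = c["spend"] / c["orders"] if c["orders"] else 0.0
```

Neither raw column tells the story alone, but the difference and the ratio do.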

Good models on bad features underperform. Reasonable models on good features often win.

4. Train / Test Split: Protect Yourself from Self‑Deception

If you train and test on the same data, you’re measuring memorisation, not learning. So we split:

Training set: for fitting the model.
Test set: for checking performance on unseen data. If you also tune hyperparameters, carve out a separate validation set for that.

The point is not the exact percentage (80/20, 70/30, etc.). The point is:

Test data should simulate future data.
You never tune hyperparameters directly on the final test set.
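One way to honour both points is a three-way split, sketched below. The function name `three_way_split` and the 60/20/20 fractions are assumptions for illustration, not a recommendation.

```python
import random

def three_way_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into train / validation / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Tune on the validation slice and touch the test slice once, at the end. For time-ordered data a random shuffle does not simulate the future; cutting by date is usually closer to what production will look like.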

Cross‑validation just formalises this idea by rotating which slice is “held out” at each step.
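The rotation can be sketched in a few lines. `k_fold_indices` is a hypothetical helper written for this example; it does not shuffle, which real code would usually do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, held_out_idx) pairs, rotating the held-out slice."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, held_out_idx in folds:
    print("held out:", held_out_idx)
```

Every sample is held out exactly once, so averaging the per-fold scores uses all of the data for evaluation without ever scoring a model on rows it was fitted on.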

5. Model Training: Fitting Patterns, Not Finding Truth

Once the data is ready and split, the training step is conceptually simple:

Choose a family of models (linear, tree‑based, neural nets, etc.).
Pick initial hyperparameters (or use defaults).
Fit on the training data.

The important mental model:

You are choosing a bias about what patterns you think are plausible (linear vs non‑linear, additive vs interactions) and then letting the optimiser find the best fit under that bias.
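A small illustration of that bias, using a hand-written least-squares line fit on toy data invented for this example. The same "straight line" assumption is applied to a linear target and a curved one:

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept

xs = [0, 1, 2, 3, 4]
linear_ys = [2 * x + 1 for x in xs]  # the linear bias is correct here
curved_ys = [x * x for x in xs]      # and wrong here

slope, intercept = fit_line(xs, linear_ys)
c_slope, c_intercept = fit_line(xs, curved_ys)

# Residuals show what the bias cannot express.
curved_residuals = [y - (c_slope * x + c_intercept) for x, y in zip(xs, curved_ys)]
print(slope, intercept)
print(curved_residuals)
```

On the linear target the fit is exact. On the curved one the optimiser still returns the best available line, but the residuals stay systematically non-zero: the best fit under the wrong bias is still a poor fit.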

For most practical problems, simple models with good features beat exotic ones built on a poor understanding of the data.

6. Model Evaluation: Choosing Your Mistakes

Now we’re back to the earlier classification/regression story. I have written about this in detail in Evaluating ML Models: It’s About Choosing Your Mistakes.

For classification:

Use the confusion matrix to reason about TP / FP / TN / FN.
Choose metrics that reflect risk: precision, recall, F1, ROC and PR curves.
Remember: False Positives vs False Negatives have very different costs in different domains.
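To make this concrete, here is a minimal sketch that counts the four outcomes for binary labels and derives the metrics from them. The labels, predictions, and per-error costs are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

# Hypothetical ground truth and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # of the flagged cases, how many were real
recall = tp / (tp + fn)     # of the real cases, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# Assumed costs: a missed positive is ten times worse than a false alarm.
cost_fp, cost_fn = 1.0, 10.0
total_cost = fp * cost_fp + fn * cost_fn

print(tp, fp, tn, fn, precision, recall, f1, total_cost)
```

Precision and recall are equal in this toy case, yet the single false negative accounts for most of the cost. That asymmetry is what the domain-specific weighting is meant to expose.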

For regression:

Use MAE, MSE, RMSE to capture how far off predictions are.
Always interpret error relative to the scale of the target and a simple baseline (like predicting the mean).
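A sketch of that comparison, with made-up target values and predictions. The baseline predicts the training mean for every test row.

```python
from math import sqrt
from statistics import mean

def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error; penalises large misses more than MAE."""
    return sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))

# Hypothetical values for illustration.
y_train = [10.0, 12.0, 14.0]
y_test = [11.0, 13.0, 15.0]
model_pred = [10.0, 13.0, 16.0]

# Baseline: always predict the mean seen during training.
baseline_pred = [mean(y_train)] * len(y_test)

print("model    MAE, RMSE:", mae(y_test, model_pred), rmse(y_test, model_pred))
print("baseline MAE, RMSE:", mae(y_test, baseline_pred), rmse(y_test, baseline_pred))
```

An error figure only means something next to the baseline's figure and the scale of the target: an MAE of about 0.67 is good against a baseline of about 1.67 on values around 13, and would be poor on values around 1.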

The key is not “What is the metric?” but:

“Which mistakes are we willing to make more often?”
“Is this better than a dumb but stable baseline?”

Without those answers, you are just optimising numbers.

7. Deployment and Monitoring: The Part Everyone Forgets

A model is only useful once it’s in the loop of real decisions.

Deployment can be:

As simple as a batch job that writes predictions to a table.
As complex as a low‑latency API integrated into a product.

But after deployment, two things matter:

Data drift: The world changes; your training data becomes stale.
Feedback loop: You need to capture actual outcomes to retrain and recalibrate.
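A deliberately crude drift check, as a sketch: compare the live mean of one feature with its training mean, measured in training standard deviations. The function name, the sample values, and the alert threshold are all assumptions. Production monitoring would normally compare whole distributions rather than means.

```python
from statistics import mean, pstdev

def mean_shift(train_values, live_values):
    """Distance of the live mean from the training mean, in training std devs."""
    sigma = pstdev(train_values)
    return abs(mean(live_values) - mean(train_values)) / sigma

# Hypothetical values of one numeric feature.
train_values = [10, 12, 11, 13, 9, 11]
live_values = [15, 16, 14, 17, 15, 16]

ALERT_THRESHOLD = 2.0  # arbitrary cut-off chosen for this example
shift = mean_shift(train_values, live_values)
drifted = shift > ALERT_THRESHOLD

print(f"shift = {shift:.2f} std devs, drifted = {drifted}")
```

A check like this needs no labels, so it can run as soon as predictions are served. The feedback loop still has to wait for the actual outcomes to arrive.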

The ML process is therefore not linear; it’s a loop: $\text{New data} \rightarrow \text{re-clean} \rightarrow \text{retrain} \rightarrow \text{re-evaluate} \rightarrow \text{redeploy}$

Ignoring this is how models quietly degrade while dashboards still look “green”.

Why This Process Matters More than the Library

Once you internalise this flow, libraries become interchangeable:

In scikit‑learn, most estimators look like fit $\rightarrow$ predict $\rightarrow$ score.
In other frameworks, the syntax may differ, but the steps don’t.
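To show how little is hidden behind that shape, here is a toy class with the same three methods. It is a hand-written illustration, not scikit‑learn code; `MeanBaseline` is a hypothetical name, and `score` returns R², which is what scikit‑learn regressors report by default.

```python
from statistics import mean

class MeanBaseline:
    """Toy regressor that always predicts the mean of the training targets."""

    def fit(self, X, y):
        # "Training" is just remembering one number.
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

    def score(self, X, y):
        """R^2: 1 minus residual sum of squares over total sum of squares."""
        preds = self.predict(X)
        y_bar = mean(y)
        ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
        ss_tot = sum((t - y_bar) ** 2 for t in y)
        return 1 - ss_res / ss_tot

# Hypothetical data for illustration.
X_train, y_train = [[1], [2], [3]], [10.0, 20.0, 30.0]

model = MeanBaseline().fit(X_train, y_train)
preds = model.predict([[4], [5]])
train_r2 = model.score(X_train, y_train)
print(preds, train_r2)
```

The mean predictor scores an R² of 0 on its own training data, which is the "dumb but stable baseline" any real model has to beat.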

When you’re reading API docs or example notebooks, you can always map them back to:

Where is the data acquired and cleaned?
How do they split training vs evaluation?
What assumptions does the model family make?
Which metrics do they pick, and do those align with real‑world cost?
How would this survive contact with production data?

That’s the mental scaffolding this refresher is trying to reinforce.