
Testing in Machine Learning

A practical checklist for testing data, models, ML systems, and CI/CD pipelines.

2024.04.19 · 6 min read · by Zhenlin Wang

Machine learning testing is broader than ordinary software testing. A model can pass unit tests and still fail because the data distribution changed, the evaluation metric is wrong, the serving path is overloaded, or the product behavior is biased.

In MLOps, a useful testing strategy needs to cover four layers: data, models, ML systems, and CI/CD pipelines.

The sections below are a practical checklist for those layers.

Data Quality and Diversity

Data quality is the first testing surface. If the data is inconsistent, leaked, mis-split, or unrepresentative, later model evaluation becomes difficult to trust.

Data Consistency

Consistency checks validate the integrity, accuracy, completeness, shape, and range of the datasets used for training, validation, and testing. Common tools include data profiling, schema validation, anomaly detection, and simple invariant checks.
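As a minimal sketch of schema validation and simple invariant checks, the snippet below validates a batch of rows against a hand-written schema. The column names, types, and bounds in `SCHEMA` are hypothetical and only illustrate the idea; a real project would more likely use a dedicated validation library.

```python
from typing import Any

# Hypothetical schema: column -> (expected type, min, max). None means unbounded.
SCHEMA = {
    "age": (int, 0, 120),
    "income": (float, 0.0, None),
    "country": (str, None, None),
}


def validate_rows(rows: list[dict[str, Any]], schema=SCHEMA) -> list[str]:
    """Return human-readable violations; an empty list means the batch passed."""
    errors = []
    for i, row in enumerate(rows):
        for col, (typ, lo, hi) in schema.items():
            # Completeness: every schema column must be present and non-null.
            if col not in row or row[col] is None:
                errors.append(f"row {i}: missing {col}")
                continue
            value = row[col]
            # Type check: strict here, so an int in a float column is flagged.
            if not isinstance(value, typ):
                errors.append(
                    f"row {i}: {col} is {type(value).__name__}, expected {typ.__name__}"
                )
                continue
            # Range checks: only applied when a bound is defined.
            if lo is not None and value < lo:
                errors.append(f"row {i}: {col}={value} below minimum {lo}")
            if hi is not None and value > hi:
                errors.append(f"row {i}: {col}={value} above maximum {hi}")
    return errors
```

Running the same function on the training, validation, and test splits makes it harder for one split to drift away from the shared schema unnoticed.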

At minimum, test for:

Data Drift

Data drift is a change in the input distribution over time. Concept drift is a change in the relationship between inputs and labels. Both can degrade production performance even when the model code has not changed.
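One common way to quantify input drift for a single numeric feature is the population stability index (PSI). The sketch below is an assumption about how such a check could look, not a method prescribed by this post: it bins the reference sample into equal-width bins and compares bin proportions with a later sample. The bin count and the `eps` floor are illustrative choices.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population stability index between two numeric samples.

    Bins are equal-width and derived from the reference sample only.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp outliers into the edge bins
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    p_ref, p_cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))
```

A PSI of zero means the binned distributions match. Alert thresholds are a team decision; a frequently quoted rule of thumb treats values above roughly 0.2 as worth investigating. Note that PSI on inputs only detects data drift. Concept drift needs labels or a proxy for them.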

At minimum, monitor:

Model Quality

Model quality testing asks whether the model is correct enough, stable enough, and useful enough for the product context. It should happen before full system testing, but it depends on a reliable evaluation framework.

Evaluation Framework

Before regression or robustness testing, test the evaluation pipeline itself. It is easy to waste days improving a model against a broken metric.
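A small sketch of what testing the evaluation pipeline can mean in practice: check the metric against cases small enough to compute by hand, and check that evaluation examples do not also appear in training. The `accuracy` and `leaked_ids` helpers are illustrative names, not part of any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of exact matches; fails loudly on malformed input."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred have different lengths")
    if not y_true:
        raise ValueError("empty evaluation set")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def leaked_ids(train_ids, eval_ids):
    """Example IDs that appear in both the training and evaluation splits."""
    return set(train_ids) & set(eval_ids)


def test_metric_on_known_cases():
    labels = [1, 0, 1, 1]
    assert accuracy(labels, [1, 0, 1, 1]) == 1.0  # perfect predictions
    assert accuracy(labels, [0, 1, 0, 0]) == 0.0  # all wrong
    assert accuracy(labels, [1, 0, 0, 0]) == 0.5  # two of four correct


def test_no_train_eval_overlap():
    assert leaked_ids(["a", "b", "c"], ["d", "e"]) == set()
```

Hand-computed cases like these are cheap, and they catch the most expensive class of bug: a metric that silently reports the wrong number.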

Check that:

Regression Testing

Regression testing checks whether a new model, prompt, feature, or serving change makes existing behavior worse. Regression can happen during training or inference, so it should be covered in development, staging, and production.
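A regression gate can be as simple as comparing a candidate's metrics with a stored baseline. The sketch below assumes every metric is "higher is better" and uses made-up metric names and tolerances; lower-is-better metrics such as latency would need the comparison flipped.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: dict) -> dict:
    """Return {metric: (baseline, candidate)} for metrics that got worse.

    Assumes higher is better. A metric missing from the candidate is also
    reported, because a silently dropped metric is itself a regression.
    """
    regressions = {}
    for name, base in baseline.items():
        cand = candidate.get(name)
        if cand is None or base - cand > tolerance.get(name, 0.0):
            regressions[name] = (base, cand)
    return regressions


# Hypothetical numbers: overall accuracy improved, but a slice got worse.
baseline = {"accuracy": 0.91, "recall_rare_class": 0.80}
candidate = {"accuracy": 0.92, "recall_rare_class": 0.70}
tolerance = {"accuracy": 0.01, "recall_rare_class": 0.02}

print(find_regressions(baseline, candidate, tolerance))
```

The example is chosen to show why slice-level metrics matter: the aggregate score goes up while a specific behavior regresses.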

At minimum, test:

Robustness

Robustness testing checks whether the model behaves consistently under perturbations, adversarial inputs, noisy data, and out-of-distribution examples.
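One way to make "behaves consistently under perturbations" testable is an invariance check: apply label-preserving changes to each input and measure how often the prediction stays the same. In the sketch below, `toy_sentiment` is a stand-in keyword rule used only to keep the example runnable, and the perturbation (random casing plus padding whitespace) is one illustrative choice among many.

```python
import random


def toy_sentiment(text: str) -> str:
    """Stand-in model: a keyword rule, not a real classifier."""
    return "positive" if "good" in text.lower() else "negative"


def perturb(text: str, rng: random.Random) -> str:
    """Label-preserving perturbation: random upper-casing and extra whitespace."""
    chars = [c.upper() if rng.random() < 0.3 else c for c in text]
    return "  " + "".join(chars) + " "


def consistency_rate(model, inputs, n_variants=20, seed=0):
    """Fraction of perturbed inputs whose prediction matches the unperturbed one."""
    rng = random.Random(seed)  # fixed seed keeps the test reproducible
    consistent = 0
    total = 0
    for text in inputs:
        expected = model(text)
        for _ in range(n_variants):
            consistent += model(perturb(text, rng)) == expected
            total += 1
    return consistent / total
```

A test can then assert that `consistency_rate` stays above an agreed threshold. A model that is sensitive to casing would score below 1.0 on this check even if its accuracy on clean data is unchanged.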

Useful methods include:

Fairness and Bias

Fairness testing is part of product quality, not an optional ethics appendix. The relevant risks depend heavily on the domain, but teams should explicitly test for disparate performance, representational gaps, and harmful outputs.

For traditional ML systems, this can mean testing metrics across demographic groups, auditing labels and features for proxy variables, and monitoring bias after launch. For LLM systems, it can also mean alignment evaluations, red-team prompts, harmful-output checks, and reviewer audits.
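Testing metrics across groups can start from something very small. The sketch below computes accuracy per group and the largest gap between groups. The group labels and records are invented for illustration, and accuracy gap is only one of many possible fairness measures.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, y_true, y_pred). Returns {group: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {g: correct[g] / total[g] for g in total}


def max_gap(per_group: dict) -> float:
    """Difference between the best-served and worst-served group."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical evaluation records: (group, label, prediction).
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
    ("B", 1, 1), ("B", 0, 1), ("B", 1, 0), ("B", 0, 0),
]
per_group = accuracy_by_group(records)
print(per_group, max_gap(per_group))
```

An aggregate accuracy of 0.75 on these records would hide the fact that one group is served perfectly and the other only half the time.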

There is no single universal baseline here. The right tests depend on the affected users, business context, legal constraints, and product failure modes. A good starting point is IBM AI Fairness 360, which catalogs fairness metrics and mitigation techniques. LLM-based evaluators can help with review at scale, but they also introduce their own evaluation bias, so they should not be treated as the only judge.

System Testing

System testing checks whether the training and inference pipelines can run reliably, recover from failure, and meet production constraints. The details differ between training and serving, but both need correctness, scalability, and fault tolerance.
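Fault tolerance in training can be checked by simulating a crash and verifying that resuming from a checkpoint gives the same result as an uninterrupted run. The sketch below uses a toy deterministic update and a JSON checkpoint; the `train`, `resume`, and `fail_at` names are hypothetical and stand in for a real training loop.

```python
import json
import os
import tempfile


def train(steps, state=None, checkpoint_path=None, fail_at=None):
    """Toy training loop that checkpoints after every step."""
    state = dict(state) if state else {"step": 0, "weight": 0.0}
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated crash")
        state["weight"] += 0.1 * (1.0 - state["weight"])  # deterministic toy update
        state["step"] += 1
        if checkpoint_path:
            with open(checkpoint_path, "w") as f:
                json.dump(state, f)
    return state


def resume(steps, checkpoint_path):
    """Reload the last checkpoint and continue to the target step count."""
    with open(checkpoint_path) as f:
        state = json.load(f)
    return train(steps, state=state, checkpoint_path=checkpoint_path)


def resume_matches_uninterrupted(steps=10, fail_at=4) -> bool:
    """True if a crashed-and-resumed run ends in the same state as a clean run."""
    expected = train(steps)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ckpt.json")
        try:
            train(steps, checkpoint_path=path, fail_at=fail_at)
        except RuntimeError:
            pass  # the crash is expected; the checkpoint on disk is what matters
        return resume(steps, path) == expected
```

Real training loops also need to restore optimizer state, data-loader position, and random-number generators for this equivalence to hold.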

For training systems, test:

For inference systems, test:

ML Testing and Software Engineering

High-quality ML code still needs ordinary software engineering discipline. The difference is that ML tests often need to avoid large fixtures, expensive models, and hidden assumptions about learned weights.
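One way to avoid large fixtures and assumptions about learned weights is to test the output contract of a tiny, randomly initialized model: shapes, value ranges, and invariants rather than specific predictions. The classifier below is a hypothetical stand-in written in plain Python so the test runs in milliseconds.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_tiny_classifier(n_features, n_classes, seed=0):
    """Linear classifier with random weights; stands in for a real model."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_classes)]

    def predict_proba(x):
        if len(x) != n_features:
            raise ValueError(f"expected {n_features} features, got {len(x)}")
        return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])

    return predict_proba


def test_output_contract():
    model = make_tiny_classifier(n_features=4, n_classes=3)
    probs = model([0.5, -1.0, 2.0, 0.0])
    assert len(probs) == 3                        # one probability per class
    assert all(0.0 <= p <= 1.0 for p in probs)    # valid probabilities
    assert math.isclose(sum(probs), 1.0, rel_tol=1e-9)  # distribution sums to one
```

None of these assertions depend on what the weights are, so the test keeps passing after retraining and still fails if the model's interface or post-processing breaks.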

Good unit-test practices include:

ML Testing in CI/CD

ML testing should be continuous, but it is not always deterministic. Data changes, stochastic training, and human evaluation can make CI/CD harder than in traditional software projects.
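For stochastic training, a CI check can fix seeds for reproducibility and assert on a tolerance band instead of an exact score. In the sketch below, `train_and_evaluate` is a hypothetical stand-in that returns a noisy score, and the thresholds are illustrative rather than recommended values.

```python
import random
import statistics


def train_and_evaluate(seed: int) -> float:
    """Stand-in for a training run: a seeded, noisy score around 0.90."""
    rng = random.Random(seed)
    return 0.90 + rng.uniform(-0.02, 0.02)


def passes_quality_gate(seeds, min_mean=0.87, max_spread=0.05) -> bool:
    """Gate on the mean score across seeds and on run-to-run spread."""
    scores = [train_and_evaluate(s) for s in seeds]
    mean_ok = statistics.mean(scores) >= min_mean
    spread_ok = max(scores) - min(scores) <= max_spread
    return mean_ok and spread_ok
```

Gating on several seeds costs more compute than a single run, so teams often reserve it for a slower pipeline stage and keep only fast deterministic tests on every commit.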

A practical pipeline usually combines:

Jeremy Jordan’s testing guide has a helpful diagram of the model-development loop. The key point is the same one used throughout this post: combine software tests, data tests, model behavior tests, and production monitoring instead of treating validation score as the only check.

For a deeper enterprise MLOps view, the Databricks Big Book of MLOps is a useful reference for how development, UAT, production testing, and monitoring fit together.

References