Blogs · Deep Learning · System Design

MLOps Post-Training Considerations

A practical overview of experiment tracking, model registries, serving, and monitoring after model training.

2024.03.08 · 7 min read · by Zhenlin Wang

Introduction

Training a model is only the middle of an ML system. After training, the team still needs to preserve the experiment, store the model artifact, deploy it behind a reliable interface, and monitor whether it continues to behave well in production.

This post is a practical map of the post-training lifecycle:

Machine learning workflow from model training to registry, serving, monitoring, and drift detection
A compact post-training workflow: track experiments, register models, serve inference, and monitor behavior.

The details can become deep quickly, so this post focuses on the main design decisions rather than every tool-specific setup step.

Experiment Tracking

Experiment tracking makes model development reproducible and comparable. Without it, the team can lose track of which dataset, code version, hyperparameters, metric implementation, and environment produced a result.

Good tracking should capture:

Common tools include MLflow, Weights & Biases, Neptune, TensorBoard, and cloud-native experiment trackers.

MLflow Example

MLflow is a common default because it covers experiment tracking and model registry workflows.

from datetime import datetime

import mlflow


experiment_name = "credit-risk"
run_name = datetime.now().strftime("%Y%m%d-%H%M")

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name=run_name):
    mlflow.log_param("model_type", "random_forest")
    mlflow.log_param("n_estimators", 200)

    # Train and evaluate the model.
    model = train_model()
    metrics = evaluate_model(model)

    mlflow.log_metric("validation_auc", metrics["auc"])
    mlflow.log_artifact("reports/confusion_matrix.png")

The exact API depends on the model framework, but the design principle is stable: a run should tell you what happened, what it produced, and whether it is worth comparing to another run.

Tracking Configuration

Configuration should be stored as a first-class artifact. A plain YAML file is often enough:

project: home-credit-default-risk

data:
  n_cv_splits: 5
  validation_size: 0.2
  stratified_cv: true

model:
  type: random_forest
  n_estimators: 2000
  max_depth: 40
  min_samples_split: 50
  class_weight: balanced

post_processing:
  aggregation_method: rank_mean

Load it safely:

import yaml


with open(config_path, "r", encoding="utf-8") as file:
    config = yaml.safe_load(file)

print(config["data"]["n_cv_splits"])

For larger projects, Hydra can help compose environment-specific configs:

import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="conf", config_name="config")
def train(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))
    print(cfg.model.n_estimators)


if __name__ == "__main__":
    train()

Tracking Data and Environment

Data is harder to version than code because it may live in object storage, databases, feature stores, or vendor systems. Use a data registry or versioning layer when the dataset matters for reproducibility. DVC, lakehouse tables, feature stores, and cloud experiment platforms can all play this role.

The runtime environment should also be recoverable. Docker is usually the most reliable option:

FROM python:3.11-slim

WORKDIR /app
COPY pyproject.toml README.md ./
COPY src ./src

RUN python -m pip install --upgrade pip \
    && python -m pip install -e .

CMD ["python", "-m", "your_project.train"]

Conda or uv can also work. The important part is that the environment is described explicitly and stored with the experiment or model artifact.

Model Registry

A model registry stores trained model artifacts and their metadata. It answers questions like:

Some teams keep registry functionality inside MLflow or a cloud ML platform. Others use object storage plus metadata tables. The implementation matters less than the contract: a production model should never be a mystery file on someone’s machine.

Development vs Production Registries

Development registries are optimized for iteration:

Production registries are optimized for reliability:

The two environments can use the same tool, but they should not have the same permissions or promotion rules.

Save, Package, Register

It helps to separate three actions:

For PyTorch, saving model state may look like this:

import torch


checkpoint = {
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "config": config,
}

torch.save(checkpoint, "model.pt")

For serving across runtimes, packaging might involve ONNX, TorchScript, a Python wheel, or a Docker image. For registry, MLflow can log and register framework-specific models:

import mlflow


with mlflow.start_run():
    mlflow.log_params(config)
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="credit-risk-model",
    )

Model Serving

Model serving turns a trained model into a usable product capability. The simplest version loads a model, transforms input features, runs inference, and returns predictions. Production serving adds schemas, latency budgets, scaling, error handling, logging, and rollback.

Common serving modes:

API Architecture

Three common interface styles are REST, gRPC, and WebSocket.

REST is simple, widely supported, and easy to debug. It is often the best default for prediction endpoints that do not require streaming.

gRPC is efficient and strongly typed. It is useful when low latency, internal service communication, or streaming matters.

WebSocket supports long-lived bidirectional communication. It is useful for real-time updates, but connection management is more complex.

Pick the interface based on the product path, not based on novelty. A recommendation batch job, an LLM streaming chat endpoint, and a fraud-scoring service may all need different serving patterns.

Online and Offline Together

Many ML systems mix online and offline inference. A recommender system might precompute candidate embeddings offline, then do final ranking online. A search system might refresh document embeddings in batch, then use them for real-time retrieval.

This split can reduce latency and cost, but it introduces a contract: the online service must know which offline artifacts, feature versions, and embedding versions it is using.

ETL-Based Deployment

ETL-style deployment is useful when predictions do not need to happen in real time. The job extracts records, transforms features, runs inference, and loads results into a destination table or storage system.

This fits:

Tools such as Airflow, Apache Beam, Spark, and cloud batch systems are often better for this than a web server.

Event-Driven Serving

In a larger distributed system, model inference may be triggered by messages rather than HTTP requests. Kafka, RabbitMQ, Celery, or cloud queues can decouple producers from inference workers.

This is useful when inference can be asynchronous, when workloads spike, or when multiple downstream consumers need the same prediction event.

Model Monitoring

Monitoring ML systems has three layers.

Model metrics

System metrics

Resource metrics

Software monitoring tells you whether the service is healthy. Model monitoring tells you whether the predictions are still useful.

Prometheus and Grafana Pattern

Jeremy Jordan’s monitoring example uses a practical open-source stack: expose a model service through FastAPI, instrument it with metrics, collect those metrics with Prometheus, visualize them with Grafana, and simulate traffic with Locust.

The general flow is:

  1. Create a containerized model service with prediction and health endpoints.
  2. Expose metrics through a /metrics endpoint.
  3. Configure Prometheus to scrape the metrics endpoint.
  4. Build Grafana dashboards for model, system, and resource metrics.
  5. Use load testing to generate traffic before relying on production dashboards.

For Kubernetes, Prometheus can discover scrape targets through service discovery or explicit service monitors:

endpoints:
  - path: /metrics
    port: app
    interval: 15s

Drift Detection

Drift detection can be handled in several ways:

Be careful with what you store. Full payload logging can be useful for debugging, but it may create privacy, cost, and compliance problems.

Monitoring Practices

For Prometheus:

For Grafana:

Summary

Post-training work is where a model becomes a system. The important questions are practical:

The more serious the model’s product role becomes, the more these post-training practices matter.

References