
Deep Learning System Design: A Checklist, Part II

A practical checklist for the production side of deep learning systems: packaging, deployment, serving, monitoring, logging, and model operations.

2024.02.10 · 4 min read · by Zhenlin Wang

Recap

Part I covered the early system-design work: data, modeling, evaluation, training, and experiment tracking.

This part covers the production side:

  1. Packaging and model artifacts.
  2. Deployment.
  3. Serving.
  4. Monitoring.
  5. Logging and operations.

This is where a trained model becomes a system that users and other services can depend on.

Step 5: Packaging and Model Artifacts

Packaging means turning a trained model into a reproducible artifact that another process can load safely.

A production artifact should include:

The artifact should answer a simple question: “Can this model be loaded and evaluated without guessing what produced it?”
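One way to make that question answerable is to ship a small manifest next to the weights. The sketch below is illustrative only: the `manifest.json` file name, the metadata keys, and the helper names are assumptions for this example, not a standard format. It records a checksum and free-form metadata, then verifies the weights against the manifest before loading.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(weights: Path, metadata: dict) -> Path:
    """Write manifest.json (hypothetical name) next to the weights file."""
    manifest = {
        "weights_file": weights.name,
        "weights_sha256": sha256_of(weights),
        # Metadata keys are up to the team, e.g. model version, training
        # commit, framework version. None of them are prescribed here.
        **metadata,
    }
    out = weights.parent / "manifest.json"
    out.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return out


def verify_artifact(manifest_path: Path) -> bool:
    """Check that the weights file exists and matches the recorded hash."""
    manifest = json.loads(manifest_path.read_text())
    weights = manifest_path.parent / manifest["weights_file"]
    return weights.exists() and sha256_of(weights) == manifest["weights_sha256"]
```

A loader can call `verify_artifact` first and refuse to start if it returns `False`, which turns a silently corrupted or swapped file into an obvious failure.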

Storage Choices

Common storage patterns:

Avoid mystery files. A file named best_model_final_v7.pt is not a deployment strategy.

Step 6: Deployment

Deployment is the process of promoting a model into an environment where it can be used.

Common patterns:

The right pattern depends on risk. A low-stakes internal classifier may only need a simple rollout. A high-traffic ranking model should usually use shadowing, canaries, and rollback.
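As a rough illustration of a canary with a rollback switch, the sketch below routes a fixed fraction of requests to the new model by hashing the request ID. The function names, the `"stable"` and `"canary"` labels, and the `rolled_back` flag are assumptions made for this example. A real rollout would usually sit in a load balancer or serving framework instead of application code.

```python
import hashlib


def canary_bucket(request_id: str) -> float:
    """Map a request ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Four bytes give 2**32 buckets; the result is always strictly below 1.0.
    return int.from_bytes(digest[:4], "big") / 2**32


def choose_model(request_id: str, canary_fraction: float, rolled_back: bool = False) -> str:
    """Return "canary" for roughly `canary_fraction` of requests, else "stable".

    `rolled_back` is a hypothetical kill switch: when set, every request
    goes to the stable model regardless of the configured fraction.
    """
    if rolled_back:
        return "stable"
    return "canary" if canary_bucket(request_id) < canary_fraction else "stable"
```

Hashing the request ID keeps the assignment deterministic, so the same request always reaches the same model and results can be compared after the fact. Hashing a user ID instead would keep each user on one variant.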

Deployment Checklist

Before rollout:

Step 7: Serving

Serving exposes the model through an interface that users and other services can call.

Decide:

Online Serving

Online serving is request-response inference. It is appropriate when users or downstream services need fresh predictions immediately.

Watch:

Batch Serving

Batch serving is useful when predictions can be computed on a schedule.

Good use cases:

Batch jobs still need monitoring. Silent failure can be worse than an obvious API error.
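A minimal sketch of a post-run check for a batch job is shown below. The report fields, thresholds, and messages are assumptions chosen for illustration. The idea is only that each run emits a small summary and something inspects it, so a job that quietly wrote nothing, or wrote mostly nulls, gets noticed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class BatchRunReport:
    """Summary a batch job could emit when it finishes (hypothetical schema)."""
    finished_at: datetime
    rows_in: int
    rows_out: int
    null_predictions: int


def check_batch_run(
    report: BatchRunReport,
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed daily job plus slack
    min_coverage: float = 0.99,                # assumed share of inputs scored
    max_null_rate: float = 0.01,               # assumed tolerated null share
) -> List[str]:
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    if now - report.finished_at > max_age:
        problems.append("stale: last successful run is older than max_age")
    if report.rows_in == 0:
        problems.append("empty input: the job read zero rows")
    elif report.rows_out / report.rows_in < min_coverage:
        problems.append("low coverage: too few input rows received predictions")
    if report.rows_out and report.null_predictions / report.rows_out > max_null_rate:
        problems.append("null predictions: too many outputs are null")
    return problems
```

Any non-empty result can be routed to whatever alerting channel the team already uses.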

Feedback Collection

Serving should preserve enough information for debugging and improvement:

Be careful with privacy. Log enough to debug, not everything by default.
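One possible shape for "enough to debug, not everything" is an allow-list for logged features plus a keyed hash in place of the raw user identifier. Everything named below (the record fields, `ALLOWED_FEATURES`, the `secret_key` parameter) is hypothetical. Note that a keyed hash is pseudonymization, not anonymization, so it reduces exposure without removing privacy obligations.

```python
import hashlib
import hmac
import json

# Hypothetical allow-list: only these feature names are ever written to logs.
ALLOWED_FEATURES = {"country", "device_type"}


def build_feedback_record(
    request_id: str,
    model_version: str,
    user_id: str,
    features: dict,
    prediction: float,
    secret_key: bytes,
) -> dict:
    """Build a log record that omits the raw user ID and unlisted features."""
    user_key = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
    return {
        "request_id": request_id,
        "model_version": model_version,
        "user_key": user_key,  # stable per user, not reversible without the key
        "features": {k: v for k, v in features.items() if k in ALLOWED_FEATURES},
        "prediction": prediction,
    }


if __name__ == "__main__":
    record = build_feedback_record(
        "req-1", "v3", "alice@example.com",
        {"country": "SG", "device_type": "mobile", "email": "alice@example.com"},
        0.87, b"replace-with-a-real-secret",
    )
    print(json.dumps(record, indent=2))
```

An allow-list fails closed: a newly added feature is not logged until someone decides it should be.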

Step 8: Monitoring

Monitoring should cover both the system and the model.

System Metrics

Track:

These metrics tell you whether the service is operationally healthy.

Model Metrics

Track:

These metrics tell you whether the model is still behaving well.

Logging

Use structured logs with timestamps, severity, request IDs, model versions, and relevant metadata.
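A small sketch using only the standard `logging` module is shown below: a formatter that emits one JSON object per line with the fields listed above. The class name, the `build_logger` helper, and the exact key names are assumptions for this example. Teams often use a dedicated structured-logging library instead.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "severity": record.levelname,
            "message": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
            "model_version": getattr(record, "model_version", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stderr) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("serving")
    log.info("prediction served", extra={"request_id": "req-1", "model_version": "v3"})
```

Because every line carries a request ID and a model version, logs can be filtered by either one without parsing free-form text.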

Good logs make it possible to answer:

Step 9: Updating the Model

Model updates should be deliberate.

Trigger retraining or replacement when:

Do not retrain blindly on a schedule if nobody reviews whether the new model is better. Automated retraining still needs gates.
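A gate can be as simple as a function that compares the candidate against the current model on the same evaluation set and refuses promotion unless it clears explicit thresholds. The metric names and threshold values below are placeholders, not recommendations. A real gate would use whatever metrics the team already agreed on during evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Metrics for one model on a fixed evaluation set (hypothetical fields)."""
    accuracy: float
    p95_latency_ms: float


def should_promote(
    candidate: EvalResult,
    baseline: EvalResult,
    min_gain: float = 0.005,               # assumed minimum accuracy improvement
    max_latency_regression: float = 0.10,  # assumed tolerated p95 slowdown (10%)
) -> bool:
    """Promote only if the candidate is measurably better and not much slower."""
    better = candidate.accuracy >= baseline.accuracy + min_gain
    latency_ok = candidate.p95_latency_ms <= baseline.p95_latency_ms * (1 + max_latency_regression)
    return better and latency_ok
```

A scheduled retraining job can then end with `should_promote(...)`. If it returns `False`, the pipeline keeps the current model and records why.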

Production Quality Bar

A production ML system should be:

The model is only one part of that quality bar.

Closing

Deep learning systems become valuable when the model is surrounded by engineering discipline: packaged artifacts, tested deployment paths, clear serving contracts, monitoring, logs, and ownership.

The checklist is not meant to slow the team down. It is meant to keep the team from discovering production requirements after users already depend on the system.
