Deep Learning System Design - A Checklist (Part I)
Introduction
After reviewing many blog posts and working on several DL-based projects myself, I've compiled a list of must-do's for a robust, complete Deep Learning system. In general, for a DL system to be considered “complete”, it needs the following components:
- Data
- Modeling
- Training & Optimization
- Experiments
- Packaging and Deployment
- Serving
- Monitoring
I'll walk through each step and provide a checklist for each, detailing rationales and providing examples wherever possible.
Step 1: Data
Data Source
- What is the availability of data?
- What is the size/scale of data?
- Do we have user feedback data?
- Do we use system/operation data (logs? API req/resp?)
- Are there privacy issues?
- A note about logs: store logs for as long as they are useful, and discard them once they are no longer relevant for debugging your current system
Data ETL
- What is the data size before/after transformation? This often depends on the chosen granularity
- What is the data format?
- JSON
- CSV (Row-format)
- Parquet (Column format, Hadoop, AWS Redshift)
- Row-major vs Column-major
- Overall, row-major formats are better when you have to do a lot of writes, whereas column-major ones are better when you have to do a lot of column-based reads.
- Note: `pandas` is column-major, while `NumPy` is row-major by default (if not specified). Accessing `pandas DataFrame` rows is faster after we call `df.to_numpy()` (see the sketch at the end of this Data ETL section)
- Model related:
- Metadata
- Training data
- Monitoring data (sometimes for iterative deployment with model updates)
- Where is the data stored (Cloud? Local? Edge?)
- Most of the time, it is the cloud. After all, it costs little for a school-level project to store data in AWS S3.
- Consider splitting app-related data from model-related data (e.g., WandB vs. MongoDB)
- Processing
- Recall ACID (atomicity, consistency, isolation, durability) and BASE (basically available, soft state, eventual consistency)
- Transactional: OLTP
- Low latency (often serves online/streaming traffic)
- High availability
- Traditionally ACID with strong consistency: a transaction won't go through if the system cannot process it fully
- Often row-major
- Many modern distributed OLTP systems relax ACID to BASE, settling for eventual consistency
- Analytical: OLAP
- Tolerant of higher query latency (queries often involve heavy transformation/aggregation)
- Less availability-critical: can afford some downtime
- Operations may be delayed (e.g., batched during system overload), but will eventually go through
- Often uses a columnar storage format for better query performance
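To make the earlier row-major vs. column-major note concrete, here is a minimal sketch (the array shape, column names, and loop counts are arbitrary) comparing row access on a `pandas` DataFrame with row access on its `NumPy` copy:

```python
import time

import numpy as np
import pandas as pd

# A toy DataFrame; shape and column names are arbitrary for illustration.
df = pd.DataFrame(np.random.rand(100_000, 20), columns=[f"f{i}" for i in range(20)])

# Row-wise access through pandas (column-major storage) is comparatively slow.
start = time.perf_counter()
_ = [df.iloc[i] for i in range(1_000)]
print(f"pandas row access: {time.perf_counter() - start:.4f}s")

# Converting to a NumPy array (row-major by default) makes row access much cheaper.
arr = df.to_numpy()
start = time.perf_counter()
_ = [arr[i] for i in range(1_000)]
print(f"numpy row access:  {time.perf_counter() - start:.4f}s")
```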
Data Routine
- ETL daily routine
- Example: using Airflow
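As an illustration, a daily ETL routine in Airflow might look like the sketch below. The DAG id, task names, and the `extract`/`transform`/`load` functions are placeholders; the `schedule` argument assumes Airflow 2.4+ (older versions use `schedule_interval`):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():      # placeholder: pull raw data from the source system
    ...

def transform():    # placeholder: clean / aggregate the raw data
    ...

def load():         # placeholder: write the result to the warehouse
    ...

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",   # run once per day
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load  # linear dependency: extract -> transform -> load
```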
Data Quality & Data Validation
- Is the feature information complete? Any missing data?
- Is the training/testing data fully labeled? (Can we use self-supervised learning for ML-based annotation?)
- Is there data drift? Is there bias in the data? Are there packages to detect them?
- Is there a routine to validate data?
- Example: using the `pandera` package:

```python
import pandera as pa
```
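Building on that import, a minimal validation routine might look like the following sketch. The column names, dtypes, and checks are hypothetical, not from the original post; adapt them to your dataset:

```python
import pandas as pd
import pandera as pa

# Hypothetical schema: column names, dtypes, and checks are placeholders.
schema = pa.DataFrameSchema(
    {
        "user_id": pa.Column(int, checks=pa.Check.ge(0)),
        "age": pa.Column(int, checks=pa.Check.in_range(0, 120)),
        "country": pa.Column(str, checks=pa.Check.isin(["US", "SG", "CN"])),
    }
)

df = pd.DataFrame(
    {"user_id": [1, 2, 3], "age": [25, 40, 33], "country": ["US", "SG", "CN"]}
)

# With lazy=True, all failures are collected and raised together if validation fails.
validated = schema.validate(df, lazy=True)
print(validated.head())
```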
Step 2: Modeling
Model selection
- Start with a model suitable for the task -> categorize the task first
- with / partial / without labels (supervised, semi-supervised, unsupervised)
- numeric/categorical output
- generation/prediction (for generation you need to learn the latent space)
- Baseline selection
- Random Baseline
- Human Heuristic
- Simplest ML model
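Before reaching for a deep model, it helps to pin down these baselines in code. A quick sketch using scikit-learn (the dataset, split, and models are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset standing in for your real data.
X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Majority-class baseline: any real model must beat this.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print("baseline accuracy:", baseline.score(X_te, y_te))

# Simplest ML model as the next reference point.
simple = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
print("logistic regression accuracy:", simple.score(X_te, y_te))
```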
Metric Selection
What is the task type?
Classification Metrics: Binary Classification
- Accuracy
- Precision
- Recall
- F1 Score
- Area Under the Receiver Operating Characteristic curve (AUC-ROC)
- Area Under the Precision-Recall curve (AUC-PR)
- True Positive Rate (Sensitivity or Recall)
- True Negative Rate (Specificity)
- False Positive Rate
- False Negative Rate
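Most of the binary classification metrics above are one-liners in scikit-learn; a small sketch for reference (the labels, predictions, and probabilities are made up):

```python
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up labels, hard predictions, and predicted probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))      # a.k.a. TPR / sensitivity
print("f1       :", f1_score(y_true, y_pred))
print("auc-roc  :", roc_auc_score(y_true, y_prob))      # uses scores, not hard labels
print("auc-pr   :", average_precision_score(y_true, y_prob))
```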
Classification Metrics: Multi-Class Classification
- Micro/Macro/Average Precision
- Micro/Macro/Average Recall
- Micro/Macro/Average F1 Score
- Confusion Matrix
- Multi-class Log Loss
- Cohen's Kappa
- Jaccard Similarity Score
Regression Metrics
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- R-squared (Coefficient of Determination)
- Mean Squared Logarithmic Error (MSLE)
- Mean Absolute Percentage Error (MAPE)
- Huber Loss
Clustering Metrics
- Silhouette Score
- Davies-Bouldin Index
- Calinski-Harabasz Index
- Inertia (within-cluster sum of squares)
- Adjusted Rand Index
- Normalized Mutual Information (NMI)
- Homogeneity, Completeness, and V-Measure
Anomaly Detection Metrics
- Precision at a given recall
- Area Under the Precision-Recall curve (AUC-PR)
- F1 Score
- Receiver Operating Characteristic curve (ROC)
- Area Under the Receiver Operating Characteristic curve (AUC-ROC)
Natural Language Processing (NLP) Metrics
- BLEU Score
- ROUGE Score
- METEOR Score
- CIDEr Score
- Perplexity
- Accuracy, Precision, Recall for NER tasks
Ranking Metrics
- Mean Reciprocal Rank (MRR)
- Normalized Discounted Cumulative Gain (NDCG)
- Mean Average Precision
- Precision at K
- Recall at K
Recommender System Metrics
- Precision at K
- Recall at K
- Mean Average Precision (MAP)
- Bayesian Personalized Ranking (BPR)
- Root Mean Squared Error (RMSE) for collaborative filtering
Image Segmentation Metrics
- Intersection over Union (IoU)
- Dice Coefficient
- Pixel Accuracy
- Mean Intersection over Union (mIoU)
- F1 Score
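IoU, the Dice coefficient, and pixel accuracy are easy to compute directly from binary masks; a small NumPy sketch (the masks are made up):

```python
import numpy as np

# Toy binary masks (True = foreground); in practice these come from your model and labels.
pred = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]], dtype=bool)
gt   = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]], dtype=bool)

intersection = np.logical_and(pred, gt).sum()
union = np.logical_or(pred, gt).sum()

iou = intersection / union                           # Intersection over Union
dice = 2 * intersection / (pred.sum() + gt.sum())    # Dice coefficient
pixel_acc = (pred == gt).mean()                      # Pixel accuracy

print(f"IoU={iou:.3f}, Dice={dice:.3f}, PixelAcc={pixel_acc:.3f}")
```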
Time Series Forecasting Metrics
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Percentage Error (MAPE)
- Symmetric Mean Absolute Percentage Error (SMAPE)
- Mean Directional Accuracy (MDA)
Reinforcement Learning Metrics
- Average Reward
- Discounted Sum of Rewards
- Entropy of Policy
- Exploration-Exploitation Tradeoff Metrics
What is the business objective?
Imbalance and Cost Sensitivity
Threshold Selection
Data Type
Interpretability
Robustness
(IMPT!) Evaluation methods for Model comparison & Model quality control
- When drawing conclusions about model performance, consider a Student's t-test
- Perturbation test (corruption, adversarial attack)
- Invariance test (Bias removal)
- Directional Expectation test (Common sense directions. E.g.: rainy season shouldn't have much higher temperature than dry season)
- Model calibration (when standalone probability in the output matters) see page 10
- Confidence Evaluation (usefulness threshold for each individual prediction)
- Slice-based Evaluation (model performance on subgroups)
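Slice-based evaluation, for instance, often boils down to a group-by over metadata plus a per-group metric. A minimal sketch (the `segment` column, labels, and predictions are made up):

```python
import pandas as pd
from sklearn.metrics import f1_score

# Hypothetical evaluation frame: true labels, predictions, and a slicing attribute.
eval_df = pd.DataFrame(
    {
        "y_true": [1, 0, 1, 1, 0, 1, 0, 0],
        "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
        "segment": ["mobile", "mobile", "mobile", "web", "web", "web", "web", "mobile"],
    }
)

# The overall metric can hide large gaps between subgroups.
print("overall F1:", f1_score(eval_df.y_true, eval_df.y_pred))

# The per-slice metric reveals which subgroups the model underperforms on.
per_slice = eval_df.groupby("segment").apply(lambda g: f1_score(g.y_true, g.y_pred))
print(per_slice)
```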
Step 3: Training & Optimization
- On what platform is the model trained?
- Do we use distributed training?
- What are the potential issues?
- Hardware (GPU memory, inter-GPU communication speed)
- Overfitting/underfitting
- Concept Drift
- training stability (fewer fluctuations)
- dead neuron
- Local minima
- vanishing/exploding gradients
- How to do debugging
- Start simple and gradually add more components
- (*) Overfit a single batch: if the model can't overfit a small amount of data, there's something wrong with your implementation (see the sketch at the end of this step)
- Set seed properly
- Is hyperparameter tuning needed? Setup routine for tuning?
- How to optimize the training to make it feasible/efficient/fault tolerant?
- Mixed Precision
- Quantization
- Parallelism: FSDP / DDP / tensor / model / pipeline parallelism
- Checkpointing
- Gradient accumulation
- Knowledge Distillation
- PEFT? (LoRA, Prefix Tuning)
- What optimizer do we use? What learning-rate scheduler?
- What loss do we use?
- Often the loss mirrors the evaluation metric, but many metrics are not differentiable and need a surrogate (e.g., cross-entropy as a proxy for accuracy)
- other scenarios include:
- Reconstruction loss: mean squared error (MSE) for continuous data or binary cross-entropy for binary data
- KL Divergence
- Contrastive Loss: Encourages similarity between augmented versions of the same sample and dissimilarity between different samples. (Siamese Networks / Triplet Loss / SimCLR / Contrastive Divergence Loss (restricted boltzmann machine))
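The "overfit a single batch" check from the debugging list above takes only a few lines in PyTorch. A minimal sketch, where the model, batch size, and hyperparameters are placeholders:

```python
import torch
from torch import nn

torch.manual_seed(0)  # set seeds so the check is reproducible

# Placeholder model and a single fixed batch.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
x = torch.randn(8, 16)          # one small batch of inputs
y = torch.randint(0, 2, (8,))   # labels for that batch

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Train on the same batch repeatedly: the loss should drop close to zero.
for step in range(500):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

print(f"final loss on the single batch: {loss.item():.4f}")
# If this loss does not approach zero, suspect the data pipeline, the loss,
# or the optimization setup rather than model capacity.
```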
Step 4: Experiments
Experiment tracking is important, especially when the scale of training is large and teamwork is involved. There are a lot of tracking tools out there: `kubeflow`, `mlflow`, `wandb`, `neptune.ai`… you name it. When using these tools, what's critical is to decide what to keep track of.
Must-have's
- Code:
- Preprocessing + training + evaluation scripts,
- Notebooks for feature engineering
- Other utility code
- Environment:
- Save the environment configuration files like `Dockerfile` (Docker), `requirements.txt` (pip), `pyproject.toml` (e.g., hatch or poetry), or `conda.yml` (conda)
- (IMPT) Saving Docker images to Docker Hub or your own container registry before running an experiment is always good practice
- Data:
- Saving data versions (as a hash or locations of immutable data resources)
- You can also use modern data versioning tools like DVC (and save the .dvc files to your experiment tracking tool).
- Parameters:
- Experiment run’s configuration
- Save parameters used via the command line (e.g., through argparse, click, or hydra)
- Metrics:
- Logging evaluation metrics on train, validation, and test sets for every run.
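As a concrete illustration of logging parameters and metrics per run, here is a minimal MLflow sketch (MLflow is one of the trackers mentioned above; the experiment name, parameters, and metric values are made up, and tools like `wandb` follow a very similar pattern):

```python
import mlflow

mlflow.set_experiment("my-dl-system")  # hypothetical experiment name

with mlflow.start_run(run_name="baseline-run"):
    # Parameters: the run's configuration.
    mlflow.log_params({"lr": 1e-3, "batch_size": 64, "model": "resnet18"})

    # Metrics: log them per epoch/step so curves can be compared across runs.
    for epoch, (train_loss, val_loss) in enumerate([(0.9, 1.0), (0.6, 0.8), (0.4, 0.7)]):
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)

    # Artifacts: evaluation charts, config files, data snapshots, etc.
    # mlflow.log_artifact("confusion_matrix.png")
```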
Task-specific
- Traditional ML
- Model weights
- Evaluation charts (ROC curves, Confusion matrix)
- Prediction distributions
- Deep Learning
- Model checkpoints (both during and after training, but beware of the cost)
- Gradient norms (to control for vanishing or exploding gradient problems)
- Best/worst predictions on the validation and test set after training
- Hardware resources: CPU/GPU utilization, memory utilization, disk I/O, network utilization, throughput
- Computer Vision
- Model predictions after every epoch (labels, overlaid masks or bounding boxes)
- NLP, LLM
- Inference time
- Prompts (in the case of generative LLMs)
- Specific evaluation metrics (e.g., ROUGE for text summarization or BLEU for translation between languages)
- Embedding size and dimensions, type of tokenizer, and number of attention heads (when training transformer models from scratch)
- Feature importance, attention-based, or example-based explanations (see this overview for specific algorithms and more ideas)
- Structured Data
- Input data snapshot (`.head()` on DataFrames if you are using pandas)
- Feature importance (e.g., permutation importance)
- Prediction explanations like SHAP or partial dependence plots (they are all available in DALEX)
- RL
- Episode info: return, length, intermediate states
- Total environment steps, wall time, steps per second
- Value and policy function losses
- Aggregate statistics over multiple environments and/or runs
- Hyper Optim
- Run score: the metric you are optimizing after every iteration
- Run parameters: parameter configuration tried at each iteration
- Best parameters: best parameters so far and overall best parameters after all runs have concluded
- Parameter comparison charts: there are various visualizations that you may want to log during or after training, like parallel coordinates plot or slice plot (they are all available in Optuna, by the way)
Final thoughts on ML training (aka model exploration)
When developing ML models through exploratory experiments, I've always enjoyed the pseudocode from this blog post. This is really what it means to do ML in a real industrial setup.
```python
time, budget, business_goal = business_specification()
```
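Only the first line of that pseudocode is reproduced above. As a loose illustration of the kind of loop it describes (entirely my own placeholder functions, not the original author's code), it might continue along these lines:

```python
# Rough paraphrase of a business-driven exploration loop; every function is a placeholder.
time, budget, business_goal = business_specification()

while time and budget and not business_goal:
    training_data = get_more_or_better_data(budget)
    model = train_and_tune(training_data)
    offline_metrics = evaluate(model)
    business_goal = deploy_and_measure_business_impact(model, offline_metrics)
    time, budget = update_remaining_resources(time, budget)
```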
To be continued…
Usually this is where school-level projects end. However, to fully develop a system from the model and the idea you derive, you still need some engineering skills to build the pipelines for deployment, serving, and performance monitoring. Luckily, we have a lot of tools that do the dirty work for us. Nonetheless, not paying attention to some details in these steps could lead to serious bugs or even cost you thousands of dollars!
Hence, to both remind myself and give some suggestions to readers, I've created a Part II of the checklist covering these steps. Please check it out if you are interested!