Deep Learning System Design - A Checklist (Part I)

An Overview of how to design a full-stack Deep Learning System

2024.02.09 · 8 min read · by Zhenlin Wang

Introduction

After reviewing many blog posts and working on several DL-based projects myself, I’ve compiled a list of must-do’s for a robust, complete Deep Learning System. In general, when we consider a DL system to be “complete”, it needs to have the following components:

  1. Data
  2. Modeling
  3. Training & Optimization
  4. Experiments
  5. Packaging and Deployment
  6. Serving
  7. Monitoring

I’ll walk through each step and provide a checklist for each of them, detailing rationales and providing examples wherever possible.

Step 1: Data

Data Source

Data ETL

Data Routine

Data Quality & Data Validation

import pandera as pa
from azureml.core import Run

run = Run.get_context(allow_offline=True)

if run.id.startswith("OfflineRun"):
    import os

    from azureml.core.dataset import Dataset
    from azureml.core.workspace import Workspace
    from dotenv import load_dotenv

    load_dotenv()

    ws = Workspace.from_config(path=os.getenv("AML_CONFIG_PATH"))

    liko_data = Dataset.get_by_name("liko_data")
else:
    liko_data = run.input_datasets["liko_data"]

df = liko_data.to_pandas_dataframe()

# ---------------------------------
# Include code to prepare data here
# ---------------------------------

liko_data_schema = pa.DataFrameSchema({
    "Id": pa.Column(pa.Int, nullable=False),
    "AccountNo": pa.Column(pa.Bool, nullable=False),
    "BVN": pa.Column(pa.Bool, nullable=True, required=False),
    "IdentificationType": pa.Column(pa.String, checks=pa.Check.isin([
        "NIN", "Passport", "Driver's license"
    ])),
    "Nationality": pa.Column(pa.String, pa.Check.isin([
        "NG", "GH", "UG", "SA"
    ])),
    "DateOfBirth": pa.Column(
        pa.DateTime,
        nullable=True,
        checks=pa.Check.less_than_or_equal_to('2000-01-01')
    ),
    "*_Risk": pa.Column(
        pa.Float,
        coerce=True,
        regex=True
    )
}, ordered=True, strict=True)

run.log_table("liko_data_schema", liko_data_schema)
run.parent.log_table("liko_data_schema", liko_data_schema)

# -----------------------------------------------
# Include code to save dataframe to output folder
# -----------------------------------------------

##### Downstream task
liko_data_schema.validate(data_sample)

Step 2: Modeling

Model selection

Metric Selection

(IMPT!) Evaluation methods for Model comparison & Model quality control

  1. When drawing conclusions about model performance, consider Student’s t-test
  2. Perturbation test (corruption, adversarial attack)
  3. Invariance test (Bias removal)
  4. Directional Expectation test (Common sense directions. E.g.: rainy season shouldn’t have much higher temperature than dry season)
  5. Model calibration (when standalone probability in the output matters) see page 10
  6. Confidence Evaluation (usefulness threshold for each individual prediction)
  7. Slice-based Evaluation (model performance on subgroups)
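To make the last item concrete, here is a small sketch of slice-based evaluation: compute overall accuracy, then accuracy per subgroup, and compare. The column names and slices below are hypothetical, purely for illustration:

```python
import pandas as pd

def slice_accuracy(df, label_col, pred_col, slice_col):
    """Return overall accuracy plus accuracy per value of `slice_col`."""
    overall = (df[label_col] == df[pred_col]).mean()
    per_slice = (
        df.assign(correct=df[label_col] == df[pred_col])
          .groupby(slice_col)["correct"]
          .mean()
    )
    return overall, per_slice

# Toy predictions sliced by a hypothetical Nationality column
df = pd.DataFrame({
    "Nationality": ["NG", "NG", "GH", "GH", "UG", "UG"],
    "label":       [1, 0, 1, 1, 0, 1],
    "pred":        [1, 0, 0, 1, 0, 0],
})
overall, per_slice = slice_accuracy(df, "label", "pred", "Nationality")
```

A model can look fine in aggregate while being badly wrong on one subgroup; the gap between `overall` and the worst entry of `per_slice` is the number to watch.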

Step 3: Training & Optimization

Step 4: Experiments

Experiment tracking is important, especially when training at a large scale and when teamwork is involved. There are plenty of tracking tools out there: kubeflow, mlflow, wandb, neptune.ai… you name it. Whichever tool you use, what’s critical is deciding what to keep track of.
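Whatever the tool, the record per run is roughly the same. Below is a tool-agnostic sketch of that record; the helper `snapshot_run` and its field names (`data_sha256`, `code_version`, …) are my own illustration, not any tracker’s API:

```python
import hashlib
import json
import platform
import time

def snapshot_run(params, metrics, data_path, code_version):
    """Bundle everything needed to reproduce a run into one record."""
    with open(data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()
    return {
        "timestamp": time.time(),
        "params": params,              # hyperparameters used for this run
        "metrics": metrics,            # final evaluation metrics
        "data_sha256": data_hash,      # pin the exact dataset version
        "code_version": code_version,  # e.g. the git commit hash
        "python": platform.python_version(),
    }

# Demo with a throwaway data file
with open("train.csv", "w") as f:
    f.write("Id,AccountNo\n1,True\n")

record = snapshot_run(
    params={"lr": 3e-4, "batch_size": 64},
    metrics={"val_auc": 0.91},
    data_path="train.csv",
    code_version="deadbeef",
)
print(json.dumps(record, indent=2, default=str))
```

If a run can’t be reproduced from its record alone, the record is incomplete — that is the practical test for your tracking setup.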

Must-have’s

Task-specific

  1. Traditional ML
  2. Deep Learning
  3. Computer Vision
  4. NLP, LLM
  5. Structured Data
  6. RL
  7. Hyperparameter Optimization

Final thoughts on ML training (aka model exploration)

When developing ML models through exploratory experiments, I’ve always enjoyed the pseudocode from this blog post. This is really what it means to do ML in a real industrial setting.

time, budget, business_goal = business_specification()

creative_idea = initial_research(business_goal)

best_metrics, best_solution = None, None

while time and budget and not business_goal:
    solution = develop(creative_idea)
    metrics = evaluate(solution, validation_data)
    if best_metrics is None or metrics > best_metrics:
        best_metrics = metrics
        best_solution = solution
    creative_idea = explore_results(best_solution)

    time.update()
    budget.update()

To be continued…

Usually this is where school-based projects end. However, to fully develop a system from the model and the ideas you derive, you still need some engineering skills to build the pipeline for deployment, serving, and performance monitoring. Luckily, we have a lot of tools that do the dirty work for us. Nonetheless, overlooking details in these steps can lead to serious bugs, or even cost you thousands of dollars!

Hence, both to remind myself and to offer some suggestions to readers, I’ve created a Part II of the checklist covering these steps. Please check it out if you are interested!