
Deep Learning System Design - A Checklist (Part II)

An Overview of how to design a full-stack Deep Learning System

2024.02.10 · 5 min read · by Zhenlin Wang

A quick recap

In the previous post, checklist part I, we talked about the early stages of designing a deep learning system. Those steps are often what matters most when we build ML projects for coursework, and at the end of them we usually have a ready-to-use model that solves the problem at hand. However, if we really want to turn it into a product that benefits thousands of users or the community, a lot of engineering work on the backend still needs to be done. This includes packaging and deployment, serving, and monitoring.

Let’s go through each of the steps one by one.

Step 5: Packaging and Deployment

Technically speaking, "packaging" isn’t quite the right word for the process of saving a trained model for later use. When people talk about packaging a model, they usually mean storing the trained model somewhere so that it can be deployed later, which is why packaging is closely tied to deployment. When it comes to saving the model, here are the things to look out for:

  1. What platform do you use to store the model: local? cloud? edge?
  2. What metadata do you need?
    • model hyperparams?
    • dependencies? (these are often tricky to pin down)
    • model json files? (example: hugging face models)
  3. how do you do the
  4. What’s the size requirement?
  5. Can we containerize it? (i.e. build an environment that is easy to deploy and serve)
  6. Is model versioning done effectively?
  7. Does the saved model work properly on the target infrastructure? (GPU? Memory? Network?)
  8. Do you know when to update the model?
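
To make a few of the checklist items concrete (metadata, size, versioning), here is a minimal sketch of saving a model next to a manifest file. It assumes a local directory as the store and uses `pickle` purely as a stand-in for whatever format your framework provides; the names `package_model`, `load_model`, and `manifest.json` are hypothetical, not part of any library.

```python
import hashlib
import json
import pickle
import platform
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def package_model(model, hyperparams, dependencies, version, out_dir):
    """Save a model artifact next to a JSON manifest describing it."""
    target = Path(out_dir) / version
    # Refuse to overwrite an existing version so old artifacts stay reproducible.
    target.mkdir(parents=True, exist_ok=False)

    artifact = target / "model.pkl"
    artifact.write_bytes(pickle.dumps(model))  # stand-in for a framework save call

    manifest = {
        "version": version,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "hyperparams": hyperparams,
        "dependencies": dependencies,
        "size_bytes": artifact.stat().st_size,
        "sha256": hashlib.sha256(artifact.read_bytes()).hexdigest(),
    }
    (target / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest


def load_model(out_dir, version):
    """Load a packaged model after checking it against the manifest checksum."""
    target = Path(out_dir) / version
    manifest = json.loads((target / "manifest.json").read_text())
    payload = (target / "model.pkl").read_bytes()
    if hashlib.sha256(payload).hexdigest() != manifest["sha256"]:
        raise ValueError("artifact does not match manifest checksum")
    return pickle.loads(payload), manifest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as store:
        toy_model = {"weights": [0.1, 0.2, 0.3]}  # placeholder for a real model
        package_model(toy_model, {"lr": 0.01}, {"numpy": "1.26.4"}, "v1", store)
        restored, info = load_model(store, "v1")
        print(info["version"], info["size_bytes"], restored == toy_model)
```

Keeping the manifest as plain JSON means the size and dependency questions can be answered without loading the model itself. Only unpickle artifacts you trust, since `pickle` can execute arbitrary code on load.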

When deploying the model, several strategies can be considered as well. For example:
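
As one illustration, here is a rough sketch of a canary rollout, where only a small, stable share of users sees the new model. Choosing canary rollout as the example is my assumption of a representative strategy, and the function name `pick_model` and the labels `"stable"` and `"candidate"` are hypothetical.

```python
import hashlib


def pick_model(user_id: str, canary_percent: int) -> str:
    """Route a stable slice of users to the candidate model."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the user id so the same user always lands in the same bucket,
    # regardless of which server handles the request.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "candidate" if bucket < canary_percent else "stable"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    share = sum(pick_model(u, 10) == "candidate" for u in users) / len(users)
    print(f"share routed to candidate: {share:.2%}")
```

Hashing instead of random sampling keeps each user's experience consistent across requests, which makes it easier to compare the two models afterwards.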

Step 6: Serving

This is where the endpoint becomes crucial. You need to consider several components:
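
To give a feel for what an endpoint involves, here is a minimal sketch using only the Python standard library. The `/predict` route, the `features` field, and the `predict` stand-in are all assumptions made for illustration; a production service would more likely sit behind a dedicated serving framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a real model call: returns the mean of the inputs."""
    return sum(features) / len(features)


def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def handle_predict(raw_body: bytes):
    """Validate a JSON request body and return (status_code, response_dict)."""
    try:
        payload = json.loads(raw_body)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return 400, {"error": "body must be valid JSON"}
    features = payload.get("features") if isinstance(payload, dict) else None
    if not isinstance(features, list) or not features or not all(map(is_number, features)):
        return 422, {"error": "'features' must be a non-empty list of numbers"}
    return 200, {"prediction": predict(features)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_predict(self.rfile.read(length))
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(host="127.0.0.1", port=8000):
    """Start the blocking HTTP server (call this from your own entry point)."""
    HTTPServer((host, port), PredictHandler).serve_forever()
```

Keeping the validation in `handle_predict`, separate from the HTTP handler, means the request logic can be unit-tested without starting a server. Calling `serve()` starts the endpoint locally.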

Step 7: Monitoring

Don’t forget logging; it is extremely important. Make it structured, with timestamps and severity levels. Some of the objects in the data and model components that you should log include:
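
For the structured logging with timestamps and severity levels mentioned here, this is a minimal sketch built on Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the convention of passing extra data under a `fields` key are my own assumptions, not a standard API.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "severity": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra structured data arrives via logger.info(..., extra={"fields": {...}}).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger("serving", buffer)
    log.info("prediction served", extra={"fields": {"model_version": "v1", "latency_ms": 12.5}})
    log.warning("input drift suspected", extra={"fields": {"feature": "age"}})
    print(buffer.getvalue(), end="")
```

In a real service you would pass `sys.stderr` or a file stream instead of the in-memory buffer. One JSON object per line keeps the logs easy to filter by severity or model version later.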

Some best practices include:

Conclusion

While a deep learning system can almost always be built by following the checklist I made here, we must stay close to our business objective for the system to be truly useful. In that sense, a close connection to our users is very important, and things like defensive programming, a friendly UI, and user feedback play key roles. In future posts, I’ll talk about some of them. Stay tuned ~
