Needle: High-performance DL System

2023-07-29 to 2024-07-28
Projects, C++, Python, CUDA, Deep Learning System

Introduction

Needle is a deep learning framework written in C++ and Python, with custom CPU and GPU backends. It is an attempt to emulate PyTorch's imperative style, in particular its approach to automatic differentiation and computational graph traversal. At the same time, it accelerates computation with a custom ndarray implementation written in low-level C++ and CUDA, which lets tensor operations run on GPUs and other specialized hardware.
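
To make the idea of automatic differentiation over a computational graph concrete, here is a minimal scalar sketch in Python. It is not Needle's actual API: the `Value` class and its `inputs` and `grad_fns` fields are hypothetical names chosen for illustration. The sketch assumes the common approach of recording each operation's inputs as the graph is built, then walking the graph in reverse topological order to accumulate gradients.

```python
class Value:
    """Scalar node in a computational graph (hypothetical, not Needle's API)."""

    def __init__(self, data, inputs=(), grad_fns=()):
        self.data = float(data)
        self.inputs = inputs      # nodes this value was computed from
        self.grad_fns = grad_fns  # one local-gradient function per input
        self.grad = 0.0

    def __add__(self, other):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        # d(a * b)/da = b and d(a * b)/db = a
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Depth-first traversal that yields a topological order of the graph.
        order, visited = [], set()

        def visit(node):
            if id(node) in visited:
                return
            visited.add(id(node))
            for parent in node.inputs:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        # Reverse topological order: a node's gradient is complete
        # before it is propagated to the node's inputs.
        for node in reversed(order):
            for parent, grad_fn in zip(node.inputs, node.grad_fns):
                parent.grad += grad_fn(node.grad)


x = Value(3.0)
y = Value(4.0)
z = x * y + x   # dz/dx = y + 1, dz/dy = x
z.backward()
print(z.data, x.grad, y.grad)
```

Because `x` feeds into `z` along two paths, its gradient is the sum of both contributions, which is why the traversal accumulates with `+=` instead of assigning.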

Key contributions

Modular DL framework
Build models from scratch
- ResNet (Residual Blocks, Skip Connection, BatchNorm2D)
- LSTM (Cell State, Hidden State, Forget/Input/Output Gate, Activations)
- Transformer (Multi-Head Self/Cross Attention, LayerNorm, Dropout, Positional Encoding, FFN, Skip Connection, Attention Masking)
Optimization with GPU & CUDA
- SIMT
- Tensor Model Parallelism
- Register Tiling
- Block Tiling
- GPipe
- Best result: ~5x speedup in 4-core distributed training
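
As an illustration of the tiling items above, the following pure-Python sketch shows the loop structure behind a tiled matrix multiplication. It is not the project's CUDA kernel: `tiled_matmul`, `naive_matmul`, and the `tile` parameter are hypothetical names, and plain Python lists stand in for device memory. On a GPU, each tile would typically be staged in shared memory (block tiling) or in registers (register tiling) so that loaded values are reused many times; here only the blocking of the loops is shown.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply matrices a (m x k) and b (k x n) one tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # tile row of the output
        for j0 in range(0, n, tile):      # tile column of the output
            for p0 in range(0, k, tile):  # tile along the shared dimension
                # Multiply one tile of a by one tile of b and accumulate
                # into the output tile. min() handles ragged edge tiles.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = out[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        out[i][j] = acc
    return out


def naive_matmul(a, b):
    """Reference implementation used only to check the tiled version."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0], [0, 1], [2, 3]]
result = tiled_matmul(a, b, tile=2)
print(result == naive_matmul(a, b))
```

The tiled version performs the same arithmetic as the naive one; the benefit on real hardware comes from memory locality, which this sketch does not measure.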

Tech Stack & Methodology

Acknowledgement

This project is inspired by the 10-414/714 Deep Learning Systems course at Carnegie Mellon University. Extensions built on top of it are still under development (more to come!).

https://criss-wang.github.io/post/projects/Needle/

Author

Zhenlin Wang

Posted on

2023-07-29

Updated on

2024-07-28
