Needle: High-performance DL System
Introduction
Needle is a deep learning framework with custom CPU and GPU backends, written in C++ and Python. It emulates PyTorch's imperative style, in particular its approach to automatic differentiation and computational graph traversal. It also accelerates computation through a custom ndarray implementation written in low-level C++ and CUDA, which lets tensor operations run on GPUs and other specialized hardware.
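As a rough illustration of the imperative autodiff style described above, here is a minimal sketch of reverse-mode differentiation over a dynamically built graph. It is not Needle's actual API: the `Value` class and the `topo_order` and `backward` helpers are hypothetical names chosen for this example, and it handles scalars only.

```python
class Value:
    """A scalar node in a dynamically built computational graph (hypothetical class)."""

    def __init__(self, data, inputs=(), grad_fns=()):
        self.data = data
        self.inputs = inputs      # nodes this value was computed from
        self.grad_fns = grad_fns  # one function per input: output grad -> input grad
        self.grad = 0.0

    def __add__(self, other):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        # d(a * b)/da = b and d(a * b)/db = a
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))


def topo_order(root):
    """Return nodes with inputs before outputs, using a post-order DFS."""
    order, visited = [], set()

    def visit(node):
        if id(node) in visited:
            return
        visited.add(id(node))
        for parent in node.inputs:
            visit(parent)
        order.append(node)

    visit(root)
    return order


def backward(root):
    """Accumulate d(root)/d(node) into node.grad for every node in the graph."""
    root.grad = 1.0
    # Walk from the output back to the leaves so that each node's gradient
    # is complete before it is propagated to its inputs.
    for node in reversed(topo_order(root)):
        for parent, grad_fn in zip(node.inputs, node.grad_fns):
            parent.grad += grad_fn(node.grad)


x = Value(3.0)
y = Value(4.0)
z = x * y + x   # dz/dx = y + 1 = 5, dz/dy = x = 3
backward(z)
print(x.grad, y.grad)  # prints "5.0 3.0"
```

The graph is recorded as operations execute, and the backward pass visits nodes in reverse topological order, which is the general pattern that imperative frameworks such as PyTorch follow.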
Key contributions
- Modular DL framework
- Build models from scratch
  - ResNet (Residual Blocks, Skip Connection, BatchNorm2D)
  - LSTM (Cell State, Hidden State, Forget/Input/Output Gate, Activations)
  - Transformer (Multi-Head Self/Cross Attention, LayerNorm, Dropout, Positional Encoding, FFN, Skip Connection, Attention Masking)
- Optimization with GPU & CUDA
  - SIMT
  - Tensor Model Parallelism
  - Register Tiling
  - Block Tiling
  - GPipe
  - Best result: ~5x speedup in 4-core distributed training
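To give a feel for the block tiling idea listed above, here is a minimal pure-Python sketch of a tiled matrix multiply. It is not Needle's kernel, which lives in C++ and CUDA: the function name `matmul_tiled` and the `tile` parameter are assumptions made for this example, and the code only illustrates the loop structure, not the performance.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), given as lists of lists, one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    # Outer loops pick a tile of the output and a matching slice of the
    # reduction dimension; the min() calls handle ragged edge tiles.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Inner loops accumulate this tile's partial products.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = out[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        out[i][j] = acc
    return out


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
print(matmul_tiled(a, b))  # prints [[58.0, 64.0], [139.0, 154.0]]
```

In a typical GPU version of this pattern, each output tile is assigned to a thread block and the input tiles are staged in shared memory, so every value loaded from global memory is reused across the tile. Register tiling applies the same reuse idea one level down, with each thread computing a small sub-tile held in registers.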
Tech Stack & Methodology
Acknowledgement
This project is inspired by 10-414/714 Deep Learning Systems at Carnegie Mellon University. Extensions built on top of the course material are still under development (more to come!).