Needle: High-performance DL System


Needle is a deep learning framework with custom CPU and GPU backends, written in C++ and Python. It is an attempt to emulate PyTorch's imperative style, in particular its approach to automatic differentiation and computational graph traversal. At the same time, it accelerates computation through a custom ndarray implementation written in low-level C++ and CUDA, which lets tensor operations run on GPUs and other specialized hardware.
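To make the idea of automatic differentiation over a computational graph concrete, here is a minimal sketch in plain Python. It is not Needle's actual API: the `Value` class, its fields, and the scalar-only design are assumptions chosen for illustration. Each operation records its inputs, and `backward` walks the graph in reverse topological order to accumulate gradients.

```python
class Value:
    """Scalar graph node (hypothetical name, not Needle's real tensor class)."""

    def __init__(self, data, inputs=(), backward_fn=None):
        self.data = float(data)
        self.grad = 0.0
        self.inputs = inputs            # nodes this value was computed from
        self.backward_fn = backward_fn  # maps upstream grad -> grads for inputs

    def __add__(self, other):
        # d(a + b)/da = 1, d(a + b)/db = 1
        return Value(self.data + other.data, (self, other), lambda g: (g, g))

    def __mul__(self, other):
        # d(a * b)/da = b, d(a * b)/db = a
        return Value(self.data * other.data, (self, other),
                     lambda g: (g * other.data, g * self.data))

    def backward(self):
        # Post-order DFS yields a topological order (inputs before outputs).
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for inp in node.inputs:
                visit(inp)
            order.append(node)

        visit(self)
        self.grad = 1.0
        # Traverse outputs first so each node's grad is complete before use.
        for node in reversed(order):
            if node.backward_fn is None:
                continue
            for inp, g in zip(node.inputs, node.backward_fn(node.grad)):
                inp.grad += g


x = Value(2.0)
y = Value(3.0)
z = x * y + x   # dz/dx = y + 1, dz/dy = x
z.backward()
print(x.grad, y.grad)
```

A real framework would apply the same traversal to tensors rather than scalars and dispatch each operation to a CPU or GPU backend.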

Key contributions

  • Modular DL framework
  • Build models from scratch
    • ResNet (Residual Blocks, Skip Connection, BatchNorm2D)
    • LSTM (Cell State, Hidden State, Forget/Input/Output Gate, Activations)
    • Transformer (Multi-Head Self/Cross Attention, LayerNorm, Dropout, Positional Encoding, FFN, Skip Connection, Attention Masking)
  • Optimization with GPU & CUDA
    • SIMT
    • Tensor Model Parallelism
    • Register Tiling
    • Block Tiling
    • GPipe
    • Best result: ~5x speedup in 4-core distributed training
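
As a rough illustration of the block tiling idea listed above, the sketch below shows only the loop structure in plain Python. It is an assumption-laden stand-in, not the project's CUDA kernel: the function name `matmul_tiled` and the `tile` parameter are hypothetical, and on a GPU each tile would typically be staged in shared memory and processed by a thread block.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), visiting the output in tile-sized blocks."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row of the output
        for j0 in range(0, m, tile):      # tile column of the output
            for k0 in range(0, k, tile):  # slice of the reduction dimension
                # Work within one tile: these operands are reused many times,
                # which is what makes tiling pay off on cache or shared memory.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = out[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        out[i][j] = acc
    return out


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result = matmul_tiled(a, b)
print(result)
```

The result is the same as an untiled triple loop; only the order in which elements are visited changes.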

Tech Stack & Methodology



This project is inspired by Carnegie Mellon University's 10-414/714 Deep Learning Systems course. Extensions based on it have been built and are still under development (more to come!).

Zhenlin Wang
