Large generative models (e.g., large language models, diffusion models) have shown remarkable performance, but they require a massive amount of computational resources. To make them more accessible, it is crucial to improve their efficiency.

This course introduces efficient AI computing techniques that enable powerful deep learning applications on resource-constrained devices. Topics include model compression, pruning, quantization, neural architecture search, distributed training, data/model parallelism, gradient compression, and on-device fine-tuning. It also covers application-specific acceleration techniques for large language models, diffusion models, video recognition, and point clouds, as well as topics in quantum machine learning. Students will get hands-on experience deploying large language models (e.g., LLaMA 2) on a laptop.

Slides and lab assignments from last semester are available here.


  • Online Lectures: Lecture recordings are available on YouTube.
  • Live Streaming: Lectures are livestreamed at live.efficientml.ai every Tuesday/Thursday, 3:35-5:00 pm Eastern Time.
  • Location: 36-156
  • Office Hours: Thursday 5:00-6:00 pm Eastern Time, 38-344 Meeting Room
  • Discussion: Discord
  • Homework submission: Canvas
  • Resources: MIT HAN Lab, HAN Lab Github, TinyML, MCUNet, OFA, SmoothQuant
  • Contact:
    • Students can ask all course-related questions on Discord.
    • For external inquiries, personal matters, or emergencies, you can email us at [email protected].
    • To receive course updates, sign up here to join our mailing list.
