Syllabus for CS6787

Advanced Machine Learning Systems — Spring 2024

Term: Spring 2024
Instructor: Christopher De Sa
Room: Phillips Hall 101
E-mail: [email hidden]
Schedule: MW 7:30pm – 8:45pm
Office hours: W 2:30pm – 3:30pm
Forum: Ed Discussion
Office: Gates 426

So you've taken a machine learning class. You know the models people use to solve their problems. You know the algorithms they use for learning. You know how to evaluate the quality of their solutions.

But when we look at a large-scale machine learning application that is deployed in practice, it's not always exactly what you learned in class. Sure, the basic models, the basic algorithms are all there. But they're modified a bit, in a bunch of different ways, to run faster and more efficiently. And these modifications are really important—they often are what make the system tractable to run on the data it needs to process.

CS6787 is a graduate-level introduction to these system-focused aspects of machine learning, covering guiding principles and commonly used techniques for scaling up learning to large data sets. Informally, we will cover the techniques that lie between a standard machine learning course and an efficient systems implementation: both statistical/optimization techniques based on improving the convergence rate of learning algorithms and techniques that improve performance by leveraging the capabilities of the underlying hardware. Topics will include stochastic gradient descent, acceleration, variance reduction, methods for choosing hyperparameters, parallelization within a chip and across a cluster, popular ML frameworks, and innovations in hardware architectures. An open-ended project in which students apply these techniques is a major part of the course.

Prerequisites: Knowledge of machine learning at the level of CS4780. Undergraduates must have taken CS4780 or an equivalent course. Knowledge of computer systems and hardware at the level of CS 3410 is recommended, but is not required.

Format: About half of the classes will involve traditionally formatted lectures. For the other half of the classes, we will read and discuss two seminal papers relevant to the course topic. These classes will involve presentations by groups of students of the paper contents (each student will sign up in a group to present one paper for 15-20 minutes) followed by breakout discussions about the material. Historically, the lectures have occurred on Mondays and the discussions have occurred on Wednesdays, but due to the non-standard timeline this semester, these course elements will be scheduled irregularly (see schedule below).

Grading: Students will be evaluated on the following basis.

20%  Paper presentation
10%  Discussion participation
20%  Paper reviews
10%  Programming assignments
40%  Final project

Paper review parameters: Paper reviews should be about one page (single-spaced) in length. Your review should mirror what an actual conference review would look like (although you needn't assign scores or anything like that). In particular, you should at least: (1) summarize the paper, (2) discuss the paper's strengths and weaknesses, and (3) discuss the paper's impact. For reference, you can read the ICML reviewer guidelines. Of course, your review will not be precisely like a real review, in large part because we already know the impact of these papers. You can submit any review up to two days late with no penalty. Students who presented a paper do not have to submit a review of that paper (although you can if you want).

Final project parameters (subject to change): The final project can be done in groups of up to three (although more work will be expected from groups with more people). The subject of the project is open-ended, but it must include:

The project proposal should satisfy the following constraints: The project will culminate in a project report of at least four pages, not including references. The project report should be formatted similarly to a workshop paper, and should use the ICML 2019 style or a similar style. The project proposal is due on Monday, March 25, 2024. A draft of the final abstract is due for presentation and discussion in class on Monday, April 29, 2024. Per the registrar, the final project report is due on May 15, 2024 at 4:30 PM.


Course Calendar

Monday, January 22
In Person
Lecture #1: Overview.
[Slides] [Demo Notebook]
  • Overview
  • Course outline and syllabus
  • Learning with gradient descent
  • Stochastic gradient descent: the workhorse of machine learning
  • Theory of SGD for convex objectives: our first look at trade-offs
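For those who want a concrete refresher before the first lecture, here is a minimal NumPy sketch of stochastic gradient descent on a least-squares objective. The problem, step size, and iteration count are illustrative choices, not settings from the lecture.

    import numpy as np

    # Illustrative least-squares problem: minimize (1/n) * sum_i (x_i^T w - y_i)^2
    rng = np.random.default_rng(0)
    n, d = 1000, 10
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true + 0.1 * rng.standard_normal(n)

    w = np.zeros(d)
    alpha = 0.01  # step size (learning rate)
    for t in range(5000):
        i = rng.integers(n)                     # sample one example uniformly at random
        grad = 2.0 * (X[i] @ w - y[i]) * X[i]   # stochastic gradient at example i
        w -= alpha * grad                       # SGD update
    print("distance to w_true:", np.linalg.norm(w - w_true))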
Wednesday, January 24
In Person
Lecture #2: Backpropagation & ML Frameworks.
[Slides] [Demo Notebook]
  • Backpropagation and automatic differentiation
  • Machine learning frameworks I: the user interface
  • Overfitting
  • Generalization error
  • Early stopping
Optional extra reading. Some older papers on SGD and backpropagation!
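To make the backpropagation material concrete, here is a hand-written forward and backward pass for a tiny two-layer network; a framework's automatic differentiation computes the same gradients for you. The network sizes and data are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 4))        # a minibatch of 32 examples, 4 features
    y = rng.standard_normal((32, 1))        # regression targets
    W1 = rng.standard_normal((4, 8)) * 0.1
    W2 = rng.standard_normal((8, 1)) * 0.1

    # Forward pass
    h_pre = x @ W1
    h = np.maximum(h_pre, 0.0)              # ReLU
    pred = h @ W2
    loss = np.mean((pred - y) ** 2)
    print("loss:", loss)

    # Backward pass (reverse-mode differentiation by hand)
    d_pred = 2.0 * (pred - y) / len(x)      # dL/dpred
    d_W2 = h.T @ d_pred                     # dL/dW2
    d_h = d_pred @ W2.T                     # dL/dh
    d_hpre = d_h * (h_pre > 0)              # backprop through the ReLU
    d_W1 = x.T @ d_hpre                     # dL/dW1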
Monday, January 29
In Person
Lecture #3: Hyperparameters and Tradeoffs.
[Slides] [Demo Notebook]
  • Our first hyperparameters: step size/learning rate, minibatch size
  • Regularization
  • Application-specific forms of regularization
  • The condition number
  • Momentum and acceleration
  • Momentum for quadratic optimization
  • Momentum for convex optimization
Released: Programming Assignment 1.
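As a concrete reference for the momentum discussion, here is an illustrative heavy-ball (Polyak momentum) update on a random positive-definite quadratic; the step size and momentum coefficient are placeholder values, not recommendations.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 20))
    A = A.T @ A + np.eye(20)           # positive-definite quadratic: f(w) = 0.5 w^T A w
    w = rng.standard_normal(20)
    v = np.zeros_like(w)
    alpha, beta = 0.01, 0.9            # step size and momentum coefficient (illustrative)
    for t in range(500):
        grad = A @ w                   # gradient of 0.5 w^T A w
        v = beta * v - alpha * grad    # accumulate a velocity term
        w = w + v                      # heavy-ball momentum step
    print("objective:", 0.5 * w @ A @ w)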
Wednesday, January 31
In Person
Paper Discussion 1a.
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin.
In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Paper Discussion 1b.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
Sergey Ioffe, Christian Szegedy.
Proceedings of the International Conference on Machine Learning (ICML), 2015.
Monday, February 5
In Person
Lecture #4: Kernels and Dimensionality Reduction.
[Slides] [Demo Notebook]
  • The kernel trick
  • Gram matrix versus feature extraction: systems tradeoffs
  • Adaptive/data-dependent feature mappings
  • Dimensionality reduction
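To illustrate the Gram-matrix-versus-feature-extraction tradeoff, here is a sketch that computes an exact RBF kernel matrix and compares it against an explicit random Fourier feature map (in the spirit of the Rahimi and Recht paper we discuss later). The sizes and bandwidth are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, D = 200, 5, 2000
    X = rng.standard_normal((n, d))
    gamma = 0.5

    # Exact RBF kernel: K[i, j] = exp(-gamma * ||x_i - x_j||^2), an n x n Gram matrix
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

    # Random Fourier features: z(x)^T z(x') approximates K(x, x') with D explicit features
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)
    print("max approximation error:", np.abs(Z @ Z.T - K).max())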
Wednesday, February 7
In Person
Paper Discussion 2a.
PaLM: Scaling language modeling with pathways.
Aakanksha Chowdhery et al.
Journal of Machine Learning Research (JMLR), 2023.

Paper Discussion 2b.
Language models are few-shot learners.
Tom Brown et al.
In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Due: Review of paper 1a or 1b.
Monday, February 12
In Person
Lecture #5: Adaptive Methods & Non-Convex Optimization.
[Slides] [Demo Notebook]
  • Adaptive methods
  • AdaGrad
  • Adam
  • Non-convex optimization
Due: Programming Assignment 1.
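For reference, here is the Adam update written out in NumPy on an illustrative least-squares problem; the hyperparameter values below are the commonly cited defaults, not course recommendations.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1000, 10
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d)

    w = np.zeros(d)
    m = np.zeros(d)   # first-moment estimate
    v = np.zeros(d)   # second-moment estimate
    alpha, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-8
    for t in range(1, 5001):
        i = rng.integers(n)
        g = 2.0 * (X[i] @ w - y[i]) * X[i]           # stochastic gradient
        m = beta1 * m + (1 - beta1) * g              # EMA of gradients
        v = beta2 * v + (1 - beta2) * g * g          # EMA of squared gradients
        m_hat = m / (1 - beta1 ** t)                 # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= alpha * m_hat / (np.sqrt(v_hat) + eps)  # Adam step
    print("training loss:", np.mean((X @ w - y) ** 2))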
Wednesday, February 14
In Person
Paper Discussion 3a.
Random features for large-scale kernel machines.
Ali Rahimi and Benjamin Recht.
In Advances in Neural Information Processing Systems (NeurIPS), 2007.

Paper Discussion 3b.
Feature Hashing for Large Scale Multitask Learning.
Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford and Alex Smola.
Proceedings of the International Conference on Machine Learning (ICML), 2009.

Released: Programming Assignment 2.
Monday, February 19
Online Only
Lecture #6: Hyperparameter Optimization.
[Slides] [Demo Notebook]
  • Hyperparameter optimization
  • Assigning parameters from folklore
  • Random search over parameters
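Here is the basic shape of random search over hyperparameters, sketched on a toy problem; the search ranges and the inner training routine are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 10))
    y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(500)
    X_tr, y_tr, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

    def train_and_eval(step_size, l2):
        """Toy training routine: SGD with the given hyperparameters, then validate."""
        w = np.zeros(10)
        for _ in range(2000):
            i = rng.integers(len(X_tr))
            g = 2 * (X_tr[i] @ w - y_tr[i]) * X_tr[i] + 2 * l2 * w
            w -= step_size * g
        return np.mean((X_val @ w - y_val) ** 2)     # validation loss

    best = None
    for trial in range(20):
        # Sample hyperparameters at random, on a log scale where that makes sense
        step_size = 10 ** rng.uniform(-4, -1.5)
        l2 = 10 ** rng.uniform(-6, -1)
        loss = train_and_eval(step_size, l2)
        if best is None or loss < best[0]:
            best = (loss, step_size, l2)
    print("best (val loss, step size, l2):", best)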
Wednesday, February 21
In Person
Paper Discussion 4a.
Random shuffling beats sgd after finite epochs.
Jeff Haochen and Suvrit Sra.
Proceedings of the International Conference on Machine Learning (ICML), 2019.

Paper Discussion 4b.
Adam: A method for stochastic optimization.
Diederik Kingma and Jimmy Ba.
Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Due: Review of paper 3a or 3b.
Monday, February 26
February Break: No classes.
Wednesday, February 28
In Person
Paper Discussion 5a.
Random search for hyper-parameter optimization.
James Bergstra and Yoshua Bengio.
Journal of Machine Learning Research (JMLR), 2012.

Paper Discussion 5b.
Scaling laws for neural language models.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei.
arXiv preprint arXiv:2001.08361, 2020.
Monday, March 4
In Person
Lecture #7: Parallelism.
[Slides] [Demo Notebook]
  • Hardware trends that lead to parallelism
  • Sources of parallelism in hardware
  • Data parallelism
  • Extracting parallelism at different places in the computation
  • Simple parallelism on multicore
Due: Programming Assignment 2.
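Here is a single-process sketch of the data-parallel pattern: partition each minibatch across workers, have each worker compute a gradient on its shard, then average. Real implementations run the shards concurrently; this simulation only illustrates the arithmetic, and all sizes are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, workers = 512, 10, 4
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d)
    w = np.zeros(d)

    def shard_gradient(Xs, ys, w):
        """Gradient of the mean-squared loss on one worker's shard of the minibatch."""
        return 2.0 * Xs.T @ (Xs @ w - ys) / len(ys)

    alpha = 0.05
    for step in range(200):
        idx = rng.choice(n, size=64, replace=False)              # draw a minibatch
        shards = np.array_split(idx, workers)                    # partition across workers
        grads = [shard_gradient(X[s], y[s], w) for s in shards]  # each worker works locally
        w -= alpha * np.mean(grads, axis=0)                      # average gradients, update
    print("training loss:", np.mean((X @ w - y) ** 2))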
Wednesday, March 6
In Person
Paper Discussion 6a.
Map-reduce for machine learning on multicore.
Cheng-Tao Chu, Sang K Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, and Kunle Olukotun.
In Advances in Neural Information Processing Systems (NeurIPS), 2007.

Paper Discussion 6b.
Hogwild!: A lock-free approach to parallelizing stochastic gradient descent.
Feng Niu, Benjamin Recht, Christopher Re, and Stephen Wright.
In Advances in Neural Information Processing Systems (NeurIPS), 2011.
Monday, March 11
In Person
Lecture #8: Distributed Learning.
[Slides]
  • Learning on multiple machines
  • SGD with all-reduce
  • The parameter server
  • Asynchronous parallelism on multiple machines
  • Decentralized and local SGD
  • Model and pipeline parallelism

Due: Review of paper 5a or 5b.
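Here is a single-process simulation of local SGD with periodic model averaging (a stand-in for an all-reduce): each worker runs SGD on its own shard of the data, and the models are averaged every few steps. The number of workers, rounds, and local steps are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    workers, n_per, d = 4, 250, 10
    w_true = rng.standard_normal(d)
    data = []
    for _ in range(workers):                           # each worker holds its own data shard
        Xk = rng.standard_normal((n_per, d))
        data.append((Xk, Xk @ w_true + 0.1 * rng.standard_normal(n_per)))

    models = [np.zeros(d) for _ in range(workers)]
    alpha, local_steps = 0.01, 10
    for round_ in range(100):
        for k in range(workers):                       # in a real system these run in parallel
            Xk, yk = data[k]
            for _ in range(local_steps):               # several local SGD steps per round
                i = rng.integers(n_per)
                g = 2.0 * (Xk[i] @ models[k] - yk[i]) * Xk[i]
                models[k] = models[k] - alpha * g
        avg = np.mean(models, axis=0)                  # periodic all-reduce: average the models
        models = [avg.copy() for _ in range(workers)]
    print("distance to w_true:", np.linalg.norm(avg - w_true))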
Wednesday, March 13
In Person
Paper Discussion 7a.
FlashAttention: Fast and memory-efficient exact attention with IO-awareness.
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré.
In Advances in Neural Information Processing Systems (NeurIPS), 2022.

Paper Discussion 7b.
A System for Massively Parallel Hyperparameter Tuning.
Liam Li et al.
Proceedings of the 2nd Conference on Machine Learning and Systems (MLSys), 2020.
Monday, March 18
In Person
Lecture #9: Low-Precision Arithmetic.
[Slides]
  • Memory
  • Low-precision formats
  • Floating-point machine epsilon
  • Low-precision training
  • Scan order

Due: Review of paper 6a or 6b.

In-class project feedback activity.
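Two small NumPy experiments that preview the low-precision material: the machine epsilon of float32 versus float16, and what goes wrong when a long sum is accumulated in float16.

    import numpy as np

    # Machine epsilon: the gap between 1.0 and the next representable number in each format
    print(np.finfo(np.float32).eps)   # about 1.19e-07
    print(np.finfo(np.float16).eps)   # about 9.77e-04

    # A classic low-precision pitfall: accumulating many small values in float16.
    # Once the running sum is large enough, adding 1e-3 rounds to no change at all.
    acc16 = np.float16(0.0)
    acc32 = np.float32(0.0)
    for _ in range(100000):
        acc16 += np.float16(1e-3)
        acc32 += np.float32(1e-3)
    print(acc16, acc32)               # the float16 sum stalls far below the true value 100.0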
Wednesday, March 20
In Person
Paper Discussion 8a.
Large scale distributed deep networks.
Jeff Dean et al.
In Advances in Neural Information Processing Systems (NeurIPS), 2012.

Paper Discussion 8b.
Towards federated learning at scale: System design.
Keith Bonawitz et al.
In Proceedings of the 2nd MLSys Conference (MLSys), 2019.
Monday, March 25
In Person
Lecture #10: Inference and Compression.
[Demo Notebook]
  • Efficient inference
  • Metrics we care about when inferring
  • Compression
  • Fine-tuning
  • Hardware for inference

Due: Review of paper 7a or 7b.

Due: Final project proposals.
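Here is a sketch of magnitude pruning, one of the simplest compression techniques (in the spirit of the Deep Compression paper we discuss later): zero out the smallest-magnitude weights and check how much the layer's output changes. The layer size and sparsity level are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 256))        # a dense weight matrix from some trained layer
    sparsity = 0.9                             # keep only the largest 10% of weights

    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= threshold              # True for the weights we keep
    W_pruned = W * mask
    print("fraction of weights kept:", mask.mean())

    # A quick sanity check of the damage done to this layer's output on random inputs
    x = rng.standard_normal((100, 256))
    rel_err = np.linalg.norm(x @ W_pruned - x @ W) / np.linalg.norm(x @ W)
    print("relative output error:", rel_err)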
Wednesday, March 27
In Person
Paper Discussion 9a.
GPipe: Efficient training of giant neural networks using pipeline parallelism.
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Yonghui Wu.
In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Paper Discussion 9b.
Efficiently scaling transformer inference.
Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean.
In Proceedings of Machine Learning and Systems (MLSys), 2023.
Monday, April 1
Spring Break: No classes.
Wednesday, April 3
Spring Break: No classes.
Monday, April 8
In Person
Lecture #11: Machine Learning Frameworks II.
  • Large scale numerical linear algebra
  • Eager vs lazy
  • ML frameworks in Python

Due: Review of paper 8a or 8b.
Wednesday, April 10
In Person
Paper Discussion 10a.
Deep learning with limited numerical precision.
Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan.
Proceedings of the International Conference on Machine Learning (ICML), 2015.

Paper Discussion 10b.
LoRA: Low-Rank Adaptation of Large Language Models.
Edward J. Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
Proceedings of the International Conference on Learning Representations (ICLR), 2021.
Monday, April 15
In Person
Lecture #12: Hardware for Machine Learning.
  • CPUs vs GPUs
  • What makes for good ML hardware?
  • How can hardware help with ML?
  • What does modern ML hardware look like?

Due: Review of paper 9a or 9b.
Wednesday, April 17
In Person
Paper Discussion 11a.
Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding.
Song Han, Huizi Mao, and William J Dally.
Proceedings of the International Conference on Learning Representations (ICLR), 2016.

Paper Discussion 11b.
GPTQ: Accurate post-training quantization for generative pre-trained transformers.
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh.
Proceedings of the International Conference on Learning Representations (ICLR), 2023.
Monday, April 22
In Person
Lecture #13: Modern Generative AI.
  • Scaling for large language models
  • Challenges for LLM inference
  • What does the future of generative AI look like?
  • What are the policy and social implications of this technology?

Due: Review of paper 10a or 10b.
Wednesday, April 24
Online Only
Paper Discussion 12a.
In-datacenter performance analysis of a tensor processing unit.
Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al.
In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017.

Paper Discussion 12b.
A Configurable Cloud-Scale DNN Processor for Real-Time AI.
Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, et al.
In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA), 2018.
Monday, April 29
In Person
Lecture #14: Large Scale ML on the Cloud.
[Slides]
  • Challenges of deployment
  • Distributed learning at datacenter scale

Due: Review of paper 11a or 11b.

Due: Final project abstract draft. Can be submitted late until Wednesday afternoon; will discuss in class on Wednesday.
Wednesday, May 1
In Person
Lecture #15: Final Project Discussion.
Monday, May 6
In Person
Lecture #16: Final Project Discussion.