Term | Spring 2021 | Instructor | Christopher De Sa
Course website | www.cs.cornell.edu/courses/cs4787/2021sp/ | Email | cdesa@cs.cornell.edu
Schedule | MW 7:30–8:45PM | Office hours | Wednesdays 2PM
Room | Zoom | Office | Zoom
[Canvas] [Discussion] [CMS]
Description: CS4787 explores the principles behind scalable machine learning systems. The course covers the algorithmic and implementation principles that power the current generation of machine learning on big data. We will cover training and inference both for traditional ML algorithms, such as linear and logistic regression, and for deep models. Topics include: estimating statistics of data quickly with subsampling, stochastic gradient descent and other scalable optimization methods, mini-batch training, accelerated methods, adaptive learning rates, methods for scalable deep learning, hyperparameter optimization, parallel and distributed training, and quantization and model compression.
Prerequisites: CS4780 or equivalent, and CS 2110 or equivalent.
Format: Lectures during the scheduled lecture period will cover the course content. Problem sets will be used to encourage familiarity with the content and to develop competence with the more mathematical aspects of the course. Programming assignments will help build intuition and familiarity with how machine learning algorithms run. There will be one midterm exam and one final exam, each of which will test both theoretical knowledge and programming implementation of concepts.
Material: The course is based on books, papers, and other texts in machine learning, scalable optimization, and systems. Texts will be provided ahead of time on the website on a per-lecture basis. You aren't required to read the texts, but they provide useful background for the material we are discussing.
Grading: Students will be evaluated on the following basis.
20% | Problem sets
40% | Programming assignments
15% | Prelim exam
25% | Final exam
Inclusiveness: You should expect and demand to be treated by your classmates and the course staff with respect. You belong here, and we are here to help you learn—and enjoy—this course. If any incident occurs that challenges this commitment to a supportive and inclusive environment, please let the instructor know so that we can address the issue. We are personally committed to this, and subscribe to the Computer Science Department's Values of Inclusion.
The course calendar is subject to change.
Monday, February 8
Lecture 1. Introduction and course overview. [Notes]
Problem Set 1 Released.
Wednesday, February 10
Lecture 2. Linear algebra done efficiently: Mapping mathematics to numpy. [Slides Notebook] [Slides HTML]
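As a quick unofficial illustration of this lecture's theme (synthetic data; timings will vary by machine), the snippet below computes the same quantity with an interpreted Python loop and with one vectorized numpy call:

```python
import numpy as np
import time

# Same computation two ways: an elementwise Python loop vs. one
# vectorized numpy expression over synthetic data.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)

t0 = time.perf_counter()
s_loop = sum((a - b) ** 2 for a, b in zip(x, y))   # interpreted loop
t1 = time.perf_counter()
s_vec = np.sum((x - y) ** 2)                       # single vectorized call
t2 = time.perf_counter()
print(f"loop: {t1 - t0:.2f}s   vectorized: {t2 - t1:.4f}s")
print(np.isclose(s_loop, s_vec))                   # same result
```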
Monday, February 15
Lecture 3. Scaling to complex models by learning with optimization algorithms. Gradient descent, convex optimization and conditioning. [Notes] [Slides Notebook] [Slides HTML] [Demo Notebook] [Demo HTML]
Programming Assignment 1 Released. Background reading material: posted on the course website.
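For a flavor of the algorithm, here is a minimal sketch of gradient descent on a least-squares objective; the data, step size, and iteration count are invented for illustration, and this is not the assignment's code.

```python
import numpy as np

# Gradient descent on f(w) = (1/2n) ||Xw - y||^2, whose gradient is
# (1/n) X^T (Xw - y). Data and step size are illustrative only.
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
alpha = 0.1                                # step size (learning rate)
for _ in range(500):
    w -= alpha * X.T @ (X @ w - y) / n     # full-gradient step
print(0.5 * np.mean((X @ w - y) ** 2))     # training loss after descent
```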
Wednesday, February 17
Lecture 4. Gradient descent continued. Stochastic gradient descent. [Notes] [Slides Notebook] [Slides HTML] [Demo Notebook] [Demo HTML]
Background reading material: posted on the course website.
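In the same toy setting, plain SGD replaces the full gradient with the gradient of a single randomly chosen example; a hedged sketch:

```python
import numpy as np

# One epoch of SGD: each step uses the gradient of a single example.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
alpha = 0.01                              # constant step size, chosen by hand
for i in rng.permutation(n):              # shuffle, then one pass over the data
    w -= alpha * (X[i] @ w - y[i]) * X[i]  # stochastic gradient from example i
print(0.5 * np.mean((X @ w - y) ** 2))
```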
Monday, February 22
Lecture 5. Stochastic gradient descent continued. Scaling to huge datasets with subsampling. [Notes] [Slides Notebook] [Slides HTML] [Demo Notebook] [Demo HTML]
Problem Set 1 Due. Background reading material: posted on the course website.
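A tiny unofficial demo of the subsampling idea: estimating the mean of a million stand-in per-example losses from a random sample of a thousand.

```python
import numpy as np

# Estimate the empirical risk over n examples from a subsample of size m.
rng = np.random.default_rng(0)
losses = rng.exponential(size=1_000_000)           # stand-in per-example losses
sample = rng.choice(losses, size=1_000, replace=False)
print(losses.mean(), sample.mean())                # the estimate lands close
```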
Wednesday, February 24
Lecture 6. Adapting algorithms to hardware. Minibatching and the effect of the learning rate. Our first hyperparameters. [Notes] [Slides Notebook] [Slides HTML] [Demo Notebook] [Demo HTML]
Problem Set 2 Released. Note that this is a half-length problem set, designed to be done in one week rather than two so that it can be finished before the prelim exam. Background reading material: posted on the course website.
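A sketch of minibatch SGD in the same invented setting; the batch size B and step size are arbitrary choices for illustration.

```python
import numpy as np

# Minibatch SGD: average the gradient over B examples per step, so each
# update is one matrix multiply and vectorizes well on hardware.
rng = np.random.default_rng(0)
n, d, B = 1000, 5, 32
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
alpha = 0.1
for _ in range(300):
    idx = rng.choice(n, size=B, replace=False)           # sample a minibatch
    w -= alpha * X[idx].T @ (X[idx] @ w - y[idx]) / B    # averaged batch gradient
print(0.5 * np.mean((X @ w - y) ** 2))
```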
Monday, March 1
Lecture 7. The mathematical hammers behind subsampling. Estimating large sums with samples, e.g. the empirical risk. Concentration inequalities. [Notes] [Slides Notebook] [Slides HTML] [Demo Notebook] [Demo HTML]
Background reading material: posted on the course website.
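For intuition, an informal numpy check of Hoeffding's inequality, which for the mean of m i.i.d. samples in [0, 1] bounds P(|estimate - mean| >= t) <= 2 exp(-2 m t^2):

```python
import numpy as np

# Empirically checking Hoeffding's bound for means of m samples in [0, 1].
rng = np.random.default_rng(0)
m, t, trials = 200, 0.1, 10_000
data = rng.uniform(size=(trials, m))              # true mean is 0.5
deviation = np.abs(data.mean(axis=1) - 0.5)
print("empirical:", (deviation >= t).mean())      # observed failure rate
print("bound:    ", 2 * np.exp(-2 * m * t**2))    # Hoeffding upper bound
```

The bound holds with room to spare here; concentration inequalities are worst-case guarantees, so the observed failure rate is typically far below them.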
Wednesday, March 3
Lecture 8. Optimization techniques for efficient ML. Accelerating SGD with momentum. [Notes] [Slides Notebook] [Slides HTML] [Demo Notebook] [Demo HTML]
Problem Set 2 Due. So that late days do not conflict with the prelim, this problem set may be submitted late, until Monday, March 8, with no penalty. Background reading material: posted on the course website.
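A hedged sketch of SGD with heavy-ball momentum on the same toy model; the momentum coefficient beta = 0.9 is a conventional but arbitrary choice.

```python
import numpy as np

# SGD with (Polyak) heavy-ball momentum: keep a running velocity v and
# step with it instead of the raw gradient.
rng = np.random.default_rng(0)
n, d, B = 1000, 5, 32
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w, v = np.zeros(d), np.zeros(d)
alpha, beta = 0.05, 0.9               # step size and momentum coefficient
for _ in range(300):
    idx = rng.choice(n, size=B, replace=False)
    g = X[idx].T @ (X[idx] @ w - y[idx]) / B
    v = beta * v - alpha * g          # accumulate velocity
    w += v
print(0.5 * np.mean((X @ w - y) ** 2))
```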
Thursday, March 4
Prelim Exam. 8:30PM. Exam released on Gradescope and on Canvas. The exam may cover topics up to Lecture 8, including scalability in ML, gradient descent, stochastic gradient descent, convexity and strong convexity, the computational cost of learning algorithms, concentration inequalities, momentum, and writing learning algorithms in numpy.
Monday, March 8
Lecture 9. Optimization techniques for efficient ML, continued. Accelerating SGD with preconditioning and adaptive learning rates. [Notes] [Slides Notebook] [Slides HTML]
Programming Assignment 2 Released. Background reading material: posted on the course website.
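AdaGrad is one concrete example of an adaptive learning rate; a toy sketch (constants invented for illustration):

```python
import numpy as np

# AdaGrad: scale each coordinate's step by the inverse root of its
# accumulated squared gradients, so frequently-large coordinates slow down.
rng = np.random.default_rng(0)
n, d, B = 1000, 5, 32
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w, r = np.zeros(d), np.zeros(d)
alpha, eps = 0.5, 1e-8
for _ in range(300):
    idx = rng.choice(n, size=B, replace=False)
    g = X[idx].T @ (X[idx] @ w - y[idx]) / B
    r += g ** 2                              # accumulate squared gradients
    w -= alpha * g / (np.sqrt(r) + eps)      # per-coordinate step size
print(0.5 * np.mean((X @ w - y) ** 2))
```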
Wednesday, March 10
Wellness Day. No classes; no lecture.
Monday, March 15
Lecture 10. Optimization techniques for efficient ML, continued. Accelerating SGD with variance reduction and averaging. [Notes] [Slides Notebook] [Slides HTML]
Problem Set 3 Released. Background reading material: posted on the course website.
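SVRG is one well-known variance-reduction method; the sketch below shows its control-variate update on toy least squares, and is an illustration rather than the lecture's exact presentation.

```python
import numpy as np

# SVRG sketch: an outer loop computes a full gradient at a snapshot; inner
# steps use g_i(w) - g_i(w_snap) + full_grad, which has lower variance.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def gi(w, i):                                # gradient at single example i
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(d)
alpha = 0.05
for epoch in range(10):
    w_snap = w.copy()
    full = X.T @ (X @ w_snap - y) / n        # full gradient at the snapshot
    for _ in range(n):
        i = rng.integers(n)
        w -= alpha * (gi(w, i) - gi(w_snap, i) + full)
print(0.5 * np.mean((X @ w - y) ** 2))
```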
Wednesday, March 17
Lecture 11. Dimensionality reduction and sparsity. [Notes] [Slides Notebook] [Slides HTML] [Demo Notebook] [Demo HTML]
Background reading material: posted on the course website.
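A minimal sketch of one dimensionality-reduction tool, Gaussian random projection, which roughly preserves distances in the Johnson–Lindenstrauss sense (sizes invented for illustration):

```python
import numpy as np

# Project d-dimensional points down to k << d dimensions with a scaled
# Gaussian matrix; pairwise distances are approximately preserved.
rng = np.random.default_rng(0)
n, d, k = 100, 10_000, 400
X = rng.normal(size=(n, d))
P = rng.normal(size=(d, k)) / np.sqrt(k)   # scaled random projection
Z = X @ P

print(np.linalg.norm(X[0] - X[1]))         # original distance
print(np.linalg.norm(Z[0] - Z[1]))         # distance after projection
```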
Monday, March 22
Lecture 12. Deep neural networks. Matrix multiply as computational core of learning. [Notes] [Demo Notebook] [Demo HTML]
Programming Assignment 3 Released. Background reading material: posted on the course website.
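To see why matrix multiply is the computational core: the forward pass of a toy two-layer fully connected network is just two matmuls with an elementwise nonlinearity in between (all shapes and weights invented).

```python
import numpy as np

# Forward pass of a tiny MLP on a minibatch: matmul, ReLU, matmul.
rng = np.random.default_rng(0)
B, d_in, d_h, d_out = 64, 784, 256, 10
x = rng.normal(size=(B, d_in))             # a minibatch of inputs
W1 = rng.normal(size=(d_in, d_h)) * 0.01
W2 = rng.normal(size=(d_h, d_out)) * 0.01

h = np.maximum(x @ W1, 0.0)                # hidden layer: one matrix multiply
logits = h @ W2                            # output layer: another one
print(logits.shape)                        # (64, 10)
```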
Wednesday, March 24
Lecture 13. Automatic differentiation and ML frameworks. [Notes] [Demo Notebook] [Demo HTML]
Background reading material: posted on the course website.
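A toy sketch of reverse-mode automatic differentiation; real frameworks use tapes and topological ordering, but the chain-rule bookkeeping looks like this:

```python
# Each Scalar records the local derivatives of the op that produced it;
# backward() recursively applies the chain rule along every path.
class Scalar:
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, other):
        return Scalar(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Scalar(self.value * other.value,
                      [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        self.grad += seed
        for parent, local_deriv in self.parents:
            parent.backward(seed * local_deriv)   # chain rule, recursively

x, y = Scalar(2.0), Scalar(3.0)
z = x * y + x          # z = xy + x, so dz/dx = y + 1 and dz/dy = x
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```

Frameworks like PyTorch and TensorFlow apply the same idea at the granularity of tensor operations rather than scalars.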
Monday, March 29
Lecture 14. Accelerating DNN training: early stopping and batch normalization. [Notes] [Demo Notebook] [Demo HTML]
Problem Set 3 Due. Problem Set 4 Released. Background reading material: posted on the course website.
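A minimal sketch of the batch-normalization forward pass (inference-time running statistics and the backward pass are omitted; data is synthetic):

```python
import numpy as np

# Batch norm, forward only: normalize each feature over the minibatch,
# then apply a learnable scale (gamma) and shift (beta).
rng = np.random.default_rng(0)
h = rng.normal(loc=3.0, scale=2.0, size=(64, 256))   # pre-activation batch

gamma, beta, eps = np.ones(256), np.zeros(256), 1e-5
mu = h.mean(axis=0)                       # per-feature batch mean
var = h.var(axis=0)                       # per-feature batch variance
out = gamma * (h - mu) / np.sqrt(var + eps) + beta
print(out.mean(), out.std())              # ~0 and ~1 after normalization
```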
Wednesday, March 31
Lecture 15. Hyperparameter optimization. Grid search. Random search. [Notes] [Slides Notebook] [Slides HTML]
Background reading material: posted on the course website.
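A toy comparison of the two methods on an invented validation-loss surface: with a budget of 16 trials, a 4x4 grid tries only 4 distinct values per hyperparameter, while random search tries 16.

```python
import numpy as np

# Grid search vs. random search with the same trial budget.
rng = np.random.default_rng(0)

def val_loss(lr, reg):        # invented stand-in for a validation loss
    return (np.log10(lr) + 2) ** 2 + 0.1 * (np.log10(reg) + 3) ** 2

grid = [(lr, reg)
        for lr in 10.0 ** np.linspace(-4, 0, 4)
        for reg in 10.0 ** np.linspace(-5, -1, 4)]
rand = [(10.0 ** rng.uniform(-4, 0), 10.0 ** rng.uniform(-5, -1))
        for _ in range(16)]

print("grid:  ", min(val_loss(lr, reg) for lr, reg in grid))
print("random:", min(val_loss(lr, reg) for lr, reg in rand))
```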
Monday, April 5
Lecture 16. Kernels and kernel feature extraction. [Notes] [Slides Notebook] [Slides HTML]
Programming Assignment 4 Released. Background reading material: posted on the course website.
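One classic kernel feature-extraction technique is random Fourier features; a hedged numpy sketch approximating an RBF kernel (dimensions arbitrary):

```python
import numpy as np

# Random Fourier features: approximate the RBF kernel
# k(x, y) = exp(-||x - y||^2 / 2) by an explicit random feature map.
rng = np.random.default_rng(0)
d, D = 5, 2000                         # input dim, number of random features
W = rng.normal(size=(D, d))            # frequencies ~ N(0, I) for this kernel
b = rng.uniform(0, 2 * np.pi, size=D)  # random phases

def phi(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
print(np.exp(-np.sum((x - y) ** 2) / 2))   # exact kernel value
print(phi(x) @ phi(y))                     # random-feature approximation
```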
Wednesday, April 7
Lecture 17. Bayesian optimization 1. [Notes]
Background reading material: posted on the course website.
Monday, April 12
Lecture 18. Bayesian optimization 2. [Notes]
Problem Set 4 Due. Problem Set 5 Released. Background reading material: same as Bayesian optimization 1.
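A compact unofficial sketch of one Bayesian-optimization step on a 1-D toy problem: fit a Gaussian-process posterior to the points evaluated so far, then pick the next point by expected improvement. The kernel, lengthscale, and objective are all invented.

```python
import numpy as np
from math import erf

def k(a, b, ell=0.3):                     # squared-exponential kernel
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

f = lambda x: np.sin(3 * x) + 0.5 * x     # "unknown" objective to minimize
X = np.array([0.1, 0.5, 0.9])             # points evaluated so far
y = f(X)

Xs = np.linspace(0, 1, 200)               # candidate grid
K = k(X, X) + 1e-8 * np.eye(len(X))       # jitter for numerical stability
Ks = k(X, Xs)
mu = Ks.T @ np.linalg.solve(K, y)                        # GP posterior mean
var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)  # posterior variance
sd = np.sqrt(np.maximum(var, 1e-12))

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / 2 ** 0.5)))  # normal CDF
z = (y.min() - mu) / sd
ei = sd * (z * Phi(z) + np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi))
print(Xs[np.argmax(ei)])                  # next point to evaluate
```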
Wednesday, April 14
Lecture 19. Parallelism. [Notes] [Slides Notebook] [Slides HTML] [Demo Notebook] [Demo HTML]
Background reading material: posted on the course website.
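A single-machine sketch of data-parallel gradient computation using Python's multiprocessing; shard count, worker count, and data are illustrative, and real distributed training replaces the pool with workers on separate machines.

```python
import numpy as np
from multiprocessing import Pool

# Shard the examples, compute partial gradients in worker processes,
# then average the equal-size shards' results.
n, d = 10_000, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
w = np.zeros(d)

def partial_grad(shard):
    Xs, ys = shard
    return Xs.T @ (Xs @ w - ys) / len(ys)

if __name__ == "__main__":
    shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
    with Pool(4) as pool:
        grads = pool.map(partial_grad, shards)  # one shard per worker
    print(np.mean(grads, axis=0))               # averaged full-batch gradient
```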
Monday, April 19
Lecture 20. Memory locality and memory bandwidth. [Notes] [Slides Notebook] [Slides HTML]
Programming Assignment 5 Released. Background reading material: same as Parallelism (Lecture 19).
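A small unofficial demo of memory-bandwidth effects: summing the same number of elements contiguously versus with a large stride (absolute timings will vary by machine).

```python
import numpy as np
import time

# Identical arithmetic, different memory traffic: the strided sum touches
# a new cache line per element, while the contiguous sum streams 8 doubles
# per cache line.
x = np.ones(8_000_000)
m = len(x) // 16

t0 = time.perf_counter(); s1 = x[:m].sum();  t1 = time.perf_counter()
s2 = x[::16].sum();                          t2 = time.perf_counter()
print(f"contiguous: {t1 - t0:.4f}s   strided: {t2 - t1:.4f}s")
print(s1 == s2)   # same answer either way
```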
Wednesday, April 21
Lecture 21. Machine learning on GPUs; matrix multiply returns. [Notes] [Slides Notebook] [Slides HTML]
Background reading material: posted on the course website.
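Assuming a CUDA GPU and the CuPy package (an assumption, not a course requirement), moving a matrix multiply to the GPU mirrors the numpy code:

```python
import numpy as np
import cupy as cp   # GPU-backed drop-in for much of numpy

# Copy host -> device, multiply on the device, copy the result back.
A = np.random.default_rng(0).normal(size=(4096, 4096)).astype(np.float32)
A_gpu = cp.asarray(A)       # host -> device copy
B_gpu = A_gpu @ A_gpu       # runs as a GPU kernel
B = cp.asnumpy(B_gpu)       # device -> host copy
print(B.shape)
```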
Monday, April 26
Wellness Day. No classes; no lecture.
Wednesday, April 28
Lecture 22. Quantized, low-precision machine learning. [Notes] [Slides] [Demo Notebook] [Demo HTML]
Problem Set 5 Due. Problem Set 6 Released. Background reading material: posted on the course website.
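A minimal sketch of symmetric uniform 8-bit quantization of a weight vector (the scale is chosen from the max magnitude; all numbers synthetic):

```python
import numpy as np

# Map float weights to int8 in [-127, 127] with one scale factor, then
# dequantize and check the rounding error.
rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)

scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # int8 storage
w_hat = q.astype(np.float32) * scale                         # dequantized
print(scale / 2, np.max(np.abs(w - w_hat)))   # max error is about scale / 2
```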
Monday, May 3
Lecture 23. Distributed learning and the parameter server. [Notes] [Slides]
Programming Assignment 6 Released. Background reading material: posted on the course website.
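A toy in-process simulation of the parameter-server pattern; real systems run workers asynchronously over a network, which this round-robin loop only imitates.

```python
import numpy as np

# Workers pull the current parameters, compute a gradient on their shard,
# and push it back; the "server" applies each pushed gradient.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

server_w = np.zeros(d)          # parameters held by the server
alpha = 0.1
for step in range(200):
    Xs, ys = shards[step % 4]   # round-robin stand-in for async workers
    w_local = server_w.copy()   # worker pulls parameters
    g = Xs.T @ (Xs @ w_local - ys) / len(ys)
    server_w -= alpha * g       # server applies the pushed gradient
print(0.5 * np.mean((X @ server_w - y) ** 2))
```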
Wednesday, May 5
Lecture 24. Deployment and low-latency inference. Deep neural network compression and pruning. [Notes] [Slides] [Demo Notebook] [Demo HTML]
Background reading material: posted on the course website.
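A hedged sketch of magnitude pruning: zero out the smallest-magnitude 90% of a weight matrix and keep the survivors in sparse form (the sparsity level is arbitrary).

```python
import numpy as np

# Keep only the largest 10% of weights by absolute value, then store the
# survivors as (indices, values).
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)

threshold = np.quantile(np.abs(W), 0.9)     # cutoff for the top 10%
mask = np.abs(W) >= threshold
W_pruned = W * mask                         # dense matrix with zeros

idx = np.nonzero(mask)                      # sparse storage of survivors
vals = W[idx]
print(mask.mean(), vals.size)               # ~0.1 density, ~6554 weights kept
```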
Monday, May 10
Lecture 25. Online Learning and Realtime Learning. [Notes] [Slides Notebook] [Slides HTML]
Background reading material: posted on the course website.
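The classic perceptron is a simple example of an online learner: it processes a stream one example at a time and updates only on mistakes (the stream here is synthetic and linearly separable).

```python
import numpy as np

# Mistake-driven online learning: no stored dataset, one example per step.
rng = np.random.default_rng(0)
d = 10
w_true = rng.normal(size=d)      # defines the (hidden) separating hyperplane
w = np.zeros(d)
mistakes = 0
for t in range(10_000):          # examples arrive as a stream
    x = rng.normal(size=d)
    label = np.sign(w_true @ x)
    if np.sign(w @ x) != label:  # update only when the prediction is wrong
        w += label * x
        mistakes += 1
print(mistakes)
```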
Wednesday, May 12
Lecture 26. Machine learning accelerators, and Course Summary. [Notes] [Slides Notebook] [Slides HTML]
Problem Set 6 Due. Background reading material: posted on the course website.
Friday, May 14
(No lecture.)
Tuesday, May 18
Final Exam. 9:30AM. Exam released on Gradescope and on Canvas. The exam may include any topics covered in the course.