Term | Spring 2019 | Instructor | Christopher De Sa |
Course website | www.cs.cornell.edu/courses/cs4787/2019sp/ | Email | [email hidden] |
Schedule | MW 7:30-8:45PM | Office hours | Wednesdays 2PM |
Room | Hollister Hall B14 | Office | Bill and Melinda Gates Hall 450 |
[Piazza] [CMS] [Gradescope]
Description: CS4787 will explore the principles behind scalable machine learning systems. The course will cover the algorithmic and implementation principles that power the current generation of machine learning on big data. We will cover training and inference for both traditional ML algorithms, such as linear and logistic regression, and deep models. Topics will include: estimating statistics of data quickly with subsampling, stochastic gradient descent and other scalable optimization methods, mini-batch training, accelerated methods, adaptive learning rates, methods for scalable deep learning, hyperparameter optimization, parallel and distributed training, and quantization and model compression.
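To give a flavor of the algorithms listed above, here is a minimal sketch of mini-batch stochastic gradient descent in NumPy. This is illustrative only and not course-provided code; the logistic-regression gradient, hyperparameters, and synthetic data are assumptions made for the example.

```python
import numpy as np

def minibatch_sgd(grad, w0, X, y, alpha=0.1, batch_size=32, num_epochs=10, seed=0):
    """Mini-batch SGD: each step uses a small random sample of the training
    set to form an unbiased estimate of the empirical-risk gradient."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = X.shape[0]
    for _ in range(num_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            w -= alpha * grad(w, X[batch], y[batch])  # step along the stochastic gradient
    return w

def logistic_grad(w, Xb, yb):
    # Gradient of the average logistic loss over a mini-batch; labels are in {-1, +1}.
    sigma = 1.0 / (1.0 + np.exp(yb * (Xb @ w)))
    return -(Xb * (yb * sigma)[:, None]).mean(axis=0)

if __name__ == "__main__":
    # Synthetic linearly separable data, purely for illustration.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((1000, 5))
    w_true = rng.standard_normal(5)
    y = np.sign(X @ w_true)
    w_hat = minibatch_sgd(logistic_grad, np.zeros(5), X, y, alpha=0.5, num_epochs=20)
    cosine = w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true))
    print(f"cosine similarity to true weights: {cosine:.3f}")
```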
Prerequisites: CS4780 or equivalent, and CS2110 or equivalent.
Format: Lectures during the scheduled lecture period will cover the course content. Problem sets will build familiarity with the content and develop competence with the more mathematical aspects of the course. Programming assignments will help build intuition and familiarity with how machine learning algorithms run. There will be one midterm exam and one final exam.
Material: The course is based on books, papers, and other texts in machine learning, scalable optimization, and systems. Texts will be provided ahead of time on the website on a per-lecture basis. You are not necessarily expected to read the texts, but they will provide useful background for the material we discuss.
Grading: Students will be evaluated on the following basis.
15% | Problem sets |
40% | Programming assignments |
15% | Midterm Exam |
30% | Final Exam |
Resources: Download the course VM: https://cornell.app.box.com/s/r32b1mnw4sl4k5kdk64ctp9phhpqnqqh
The course calendar is subject to change.
Wednesday, January 23 | Lecture 1. Introduction and course overview. [Notes] |
Monday, January 28 |
Lecture 2. Estimating large sums with samples, e.g. the empirical risk. Concentration inequalities. [Notes] [Demo]
Background reading material:
|
Wednesday, January 30 |
Lecture 3. Exponential concentration inequalities and empirical risk minimization. [Notes]
Background reading material:
|
Monday, February 4 |
Lecture 4. Learning with gradient descent, convex optimization and conditioning. [Notes]
Background reading material:
|
Wednesday, February 6 |
Lecture 5. Stochastic gradient descent. [Notes] [Demo Jupyter] [Demo HTML]
Background reading material:
|
Monday, February 11 |
Lecture 6. Minibatching and the effect of the learning rate. Our first hyperparameters. [Notes] [Demo Jupyter] [Demo HTML]
Background reading material:
|
Wednesday, February 13 |
Lecture 7. Accelerating SGD with momentum. [Notes] [Demo Jupyter] [Demo HTML]
Background reading material:
|
Monday, February 18 |
Lecture 8. Accelerating SGD with preconditioning and adaptive learning rates. [Notes]
Background reading material:
|
Wednesday, February 20 |
Lecture 9. Accelerating SGD with variance reduction and averaging. [Notes]
Background reading material:
|
Monday, February 25 | No lecture. February break. |
Wednesday, February 27 |
Lecture 10. Dimensionality reduction and sparsity. [Notes]
Background reading material:
|
Monday, March 4 |
Lecture 11. Deep neural networks. Matrix multiply as computational core of learning. [Notes]
Background reading material:
|
Wednesday, March 6 |
Lecture 12. Automatic differentiation and ML frameworks. [Notes]
Background reading material:
|
Monday, March 11 |
Lecture 13. Accelerating DNN training: early stopping and batch normalization. [Notes] [Demo Jupyter] [Demo PDF]
Background reading material:
|
Wednesday, March 13 |
In-class midterm.
|
Monday, March 18 |
Lecture 14. Hyperparameter optimization. Grid search. Random search. [Notes]
Background reading material:
|
Wednesday, March 20 |
Lecture 15. Kernels and kernel feature extraction. [Notes]
Background reading material:
|
Monday, March 25 |
Lecture 16. Bayesian optimization 1. [Notes]
Background reading material:
|
Wednesday, March 27 |
Lecture 17. Bayesian optimization 2. [Notes]
Background reading material: same as Bayesian optimization 1. |
Monday, April 1 | No lecture. Spring break. |
Wednesday, April 3 | No lecture. Spring break. |
Monday, April 8 |
Lecture 18. Parallelism 1. [Notes]
Background reading material:
|
Wednesday, April 10 |
Lecture 19. Parallelism 2. [Notes]
Background reading material: same as Parallelism 1. |
Monday, April 15 |
Lecture 20. Memory locality and memory bandwidth. [Notes]
Background reading material: same as Parallelism 1. |
Wednesday, April 17 |
Lecture 21. Machine learning on GPUs; matrix multiply returns. [Notes]
Background reading material:
|
Monday, April 22 |
Lecture 22. Distributed learning and the parameter server. [Notes]
Background reading material:
|
Wednesday, April 24 |
Lecture 23. Quantized, low-precision machine learning. [Notes]
Background reading material:
|
Monday, April 29 |
Lecture 24. Deployment and low-latency inference. Deep neural network compression and pruning. [Notes]
Background reading material:
|
Wednesday, May 1 |
Lecture 25. Machine learning accelerators. [Notes]
Background reading material:
|
Monday, May 6 |
Lecture 26. Online learning, real-time learning, and course summary. [Notes]
Background reading material:
|
Tuesday, May 14, 9:00 AM | Final Exam. |