Syllabus for CS6787

Advanced Machine Learning Systems — Fall 2020

Term	Fall 2020	Instructor	Christopher De Sa
Room	Statler Auditorium 185	E-mail	cdesa@cs.cornell.edu
Schedule	MW 7:30pm – 8:45pm	Office hours	W 2:00pm – 3:00pm

[Piazza site] [Lecture zoom link] [Office hours zoom link] [Canvas lecture videos link]

Course Modality Info. CS6787 will be offered both hybrid-in-person and online, subject to the following policies.

While most of the lectures will be given in person (with online component via zoom), some of the lectures will be given online-only. The modality of each lecture will be listed on the course calendar. If a class is listed as being given online-only, do not come to the classroom. Note that lectures may be converted from in-person to online-only on an ad hoc basis as developments occur; you will be sent an email if this happens.
The classroom has a limited number of seats. You will be assigned a seat number sometime before the second lecture (once students have had a chance to add or drop following the first lecture's course overview). If you have not been assigned a seat, please do not come to the classroom (attend remotely instead).
If you are taking the class online, you will generally still be expected to join the zoom call for the lectures synchronously. While lectures will be recorded, in-class discussion is central to the learning goals of CS6787. Note, though, that you will not be graded on attendance, and students enrolled in the in-person class are not required to attend the lectures in person.
For the time being, students on the waitlist will be given full access to lectures and course materials, to enable course shopping as smoothly as possible.
Office hours will be given remotely via zoom only.

So you've taken a machine learning class. You know the models people use to solve their problems. You know the algorithms they use for learning. You know how to evaluate the quality of their solutions.

But when we look at a large-scale machine learning application that is deployed in practice, it's not always exactly what you learned in class. Sure, the basic models, the basic algorithms are all there. But they're modified a bit, in a bunch of different ways, to run faster and more efficiently. And these modifications are really important—they often are what make the system tractable to run on the data it needs to process.

CS6787 is a graduate-level introduction to these system-focused aspects of machine learning, covering guiding principles and commonly used techniques for scaling up learning to large data sets. Informally, we will cover the techniques that lie between a standard machine learning course and an efficient systems implementation: both statistical/optimization techniques based on improving the convergence rate of learning algorithms and techniques that improve performance by leveraging the capabilities of the underlying hardware. Topics will include stochastic gradient descent, acceleration, variance reduction, methods for choosing hyperparameters, parallelization within a chip and across a cluster, popular ML frameworks, and innovations in hardware architectures. An open-ended project in which students apply these techniques is a major part of the course.

Prerequisites: Knowledge of machine learning at the level of CS4780. If you are an undergraduate, you should have taken CS4780, since it is a prerequisite. Optionally, knowledge of computer systems and hardware at the level of CS3410 would be useful, but this is not a prerequisite.

Format: About half of the classes will involve traditionally formatted lectures. For the other half of the classes, we will read and discuss two seminal papers relevant to the course topic. These classes will involve presentations of the paper contents by groups of students (each student will sign up in a group to present one paper for 15-20 minutes), followed by breakout discussions about the material. Historically, the lectures have occurred on Mondays and the discussions have occurred on Wednesdays, but due to the non-standard timeline this semester, these course elements will be scheduled irregularly (see schedule below).

Grading: Students will be evaluated on the following basis.

20%	Paper presentation
10%	Discussion participation
20%	Paper reviews
10%	Programming assignments
40%	Final project

Paper review parameters: Paper reviews should be about one page (single-spaced) in length. The review guidelines should mirror what an actual conference review would look like (although you needn't assign scores or anything like that). In particular you should at least: (1) summarize the paper, (2) discuss the paper's strengths and weaknesses, and (3) discuss the paper's impact. For reference, you can read the ICML reviewer guidelines. Of course, your review will not be precisely like a real review, in large part because we already know the impact of these papers. You can submit any review up to two days late with no penalty. Students who presented a paper do not have to submit a review of that paper (although you can if you want).

Final project parameters (subject to change): The final project can be done in groups of up to three (although more work will be expected from groups with more people). The subject of the project is open-ended, but it must include:

an implementation of a machine learning system for some task,
an exploration of one or more of the techniques discussed in the course (or similar techniques subject to instructor approval), and
an empirical evaluation of the system's performance, compared against some baseline method, in two ways:

statistical performance (e.g. iterations to converge to some accuracy threshold), and
hardware performance (e.g. throughput or wall-clock time).
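
As one possible way to collect both kinds of measurement in a single run, here is a minimal sketch in Python with NumPy. Everything in it is an illustrative assumption rather than a course requirement: the synthetic least-squares task, the function names, the step size, and the loss threshold are all hypothetical, and full-batch gradient descent simply stands in for "some baseline method".

```python
import time
import numpy as np


def make_problem(n=2000, d=20, seed=0):
    """Synthetic least-squares data (a stand-in for a real task)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.01 * rng.normal(size=n)
    return X, y


def mean_squared_loss(w, X, y):
    residual = X @ w - y
    return 0.5 * float(np.mean(residual ** 2))


def run_to_threshold(X, y, batch_size, step_size=0.05, target_loss=1e-2,
                     max_iters=5000, seed=0):
    """Run (minibatch) gradient descent until the full-data loss reaches target_loss.

    Reports a statistical metric (iterations to reach the threshold) and
    hardware metrics (time spent in updates, examples processed per second).
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    train_seconds = 0.0
    converged = False
    iterations = 0
    for iterations in range(1, max_iters + 1):
        start = time.perf_counter()
        if batch_size >= n:
            Xb, yb = X, y  # full-batch gradient descent (the baseline here)
        else:
            idx = rng.integers(0, n, size=batch_size)  # sample with replacement
            Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(yb)
        w -= step_size * grad
        train_seconds += time.perf_counter() - start
        # The convergence check is deliberately left out of the timing.
        if mean_squared_loss(w, X, y) <= target_loss:
            converged = True
            break
    examples_processed = iterations * min(batch_size, n)
    return {
        "converged": converged,
        "iterations": iterations,
        "train_seconds": train_seconds,
        "examples_processed": examples_processed,
        "examples_per_second": examples_processed / max(train_seconds, 1e-12),
        "final_loss": mean_squared_loss(w, X, y),
    }


X, y = make_problem()
sgd = run_to_threshold(X, y, batch_size=32)          # technique under study
gd = run_to_threshold(X, y, batch_size=X.shape[0])   # baseline
for name, result in [("minibatch SGD", sgd), ("full-batch GD", gd)]:
    print(name, result)
```

Timing only the update step keeps the evaluation overhead out of the hardware numbers; a real project would also repeat each run several times and report the variation across runs.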

There will be an in-class feedback activity on Monday, October 19, and you should prepare a two-minute pitch of your ideas by then. Project proposals are due on Monday, October 26. The project proposal should satisfy the following constraints:

The main body should be about one page in length.
It should describe the project you intend to do.
It should contain at least one citation of a relevant paper that we did not cover in class (but preferably more).
It should include some preliminary or exploratory work you've already done that helps to support the idea that your project is feasible (this preliminary work can be very minimal, but should indicate that you have gotten started, or at least have a clear idea of how to do so).
In addition to the one-page text proposal, it should contain one short experiment plan per person, which should consist of:
- a hypothesis
- a proxy statement which describes what metric you are going to use to measure the variables you care about
- a short protocol statement describing what you are going to do
- the results you expect to get
The experiment plan should not be longer than half a page, and may be much shorter.

The project will culminate in a project report of at least four pages, not including references. The project report should be formatted similarly to a workshop paper, and should use the ICML 2019 style or a similar style. An abstract for the report is due on Wednesday, December 9, and we will discuss the abstracts in class on Monday, December 14 (these abstracts may be submitted late until Sunday, December 13 with no penalty). The final project report is due on Wednesday, December 16.

Course Calendar

Course calendar may be subject to change as events unfold.

Wednesday, September 2 (Online Only)	Lecture #1: Overview. [Slides] [Demo Notebook] [Demo HTML] Topics: Overview; Course outline and syllabus; Learning with gradient descent; Stochastic gradient descent: the workhorse of machine learning; Theory of SGD for convex objectives: our first look at trade-offs.
Monday, September 7 (Online Only)	Lecture #2: Backpropagation & ML Frameworks. [Slides] [Demo Notebook] [Demo HTML] Topics: Backpropagation and automatic differentiation; Machine learning frameworks I: the user interface; Overfitting; Generalization error; Early stopping. Optional extra reading. Some older papers on SGD and backpropagation! Hinton, Geoffrey E. Learning distributed representations of concepts. Proceedings of the eighth annual conference of the cognitive science society. Vol. 1. 1986. Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Cognitive modeling 5.3 (1988): 1. Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. Proceedings of the International Conference on Machine Learning (ICML), 2004. Presentation signup. (survey link)
Wednesday, September 9 (Online Only)	Lecture #3: Hyperparameters and Tradeoffs. [Slides] [Demo Notebook] [Demo HTML] Topics: Our first hyperparameters: step size/learning rate, minibatch size; Regularization; Application-specific forms of regularization; The condition number; Momentum and acceleration; Momentum for quadratic optimization; Momentum for convex optimization.
Monday, September 14 (In Person/Online)	Paper Discussion 1a. On the importance of initialization and momentum in deep learning. Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. Proceedings of the International Conference on Machine Learning (ICML), 2013. Paper Discussion 1b. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Sergey Ioffe, Christian Szegedy. Proceedings of the International Conference on Machine Learning (ICML), 2015.
Wednesday, September 16 (In Person/Online)	Lecture #4: Kernels and Dimensionality Reduction. [Slides] [Demo Notebook] [Demo HTML] Topics: The kernel trick; Gram matrix versus feature extraction: systems tradeoffs; Adaptive/data-dependent feature mappings; Dimensionality reduction.
Monday, September 21 (In Person/Online)	Paper Discussion 2a. Random features for large-scale kernel machines. Ali Rahimi and Benjamin Recht. In Advances in Neural Information Processing Systems (NeurIPS), 2007. Paper Discussion 2b. Feature Hashing for Large Scale Multitask Learning. Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford and Alex Smola. Proceedings of the International Conference on Machine Learning (ICML), 2009. Due: Review of paper 1a or 1b. Released: Programming Assignment 1.
Wednesday, September 23 (In Person/Online)	Lecture #5: Online Learning and Variance Reduction. [Slides] [Demo Notebook] [Demo HTML] Topics: Online versus offline learning; Variance reduction; SVRG; Fast linear rates for convex objectives.
Monday, September 28 (In Person/Online)	Paper Discussion 3a. Identifying Suspicious URLs: An Application of Large-Scale Online Learning. Justin Ma, Lawrence K. Saul, Stefan Savage and Geoffrey M. Voelker. Proceedings of the International Conference on Machine Learning (ICML), 2009. Paper Discussion 3b. Accelerating stochastic gradient descent using predictive variance reduction. Rie Johnson and Tong Zhang. In Advances in Neural Information Processing Systems (NeurIPS), 2013. Due: Review of paper 2a or 2b.
Wednesday, September 30 (In Person/Online)	Lecture #6: Hyperparameter Optimization. [Slides] [Demo Notebook] [Demo HTML] Topics: Hyperparameter optimization; Assigning parameters from folklore; Random search over parameters.
Monday, October 5 (In Person/Online)	Paper Discussion 4a. Random search for hyper-parameter optimization. James Bergstra and Yoshua Bengio. Journal of Machine Learning Research (JMLR), 2012. Paper Discussion 4b. Practical Bayesian optimization of machine learning algorithms. Jasper Snoek, Hugo Larochelle, and Ryan P Adams. In Advances in Neural Information Processing Systems (NeurIPS), 2012. Due: Review of paper 3a or 3b. Due: Programming Assignment 1.
Wednesday, October 7 (In Person/Online)	Lecture #7: Adaptive Methods & Non-Convex Optimization. [Slides] [Demo Notebook] [Demo HTML] Topics: Adaptive methods; AdaGrad; Adam; Non-convex optimization.
Monday, October 12 (In Person/Online)	Paper Discussion 5a. The Marginal Value of Adaptive Gradient Methods in Machine Learning. Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro and Benjamin Recht. In Advances in Neural Information Processing Systems (NeurIPS), 2017. Paper Discussion 5b. Adam: A method for stochastic optimization. Diederik Kingma and Jimmy Ba. Proceedings of the International Conference on Learning Representations (ICLR), 2015. Due: Review of paper 4a or 4b. Released: Programming Assignment 2.
Wednesday, October 14	Fall break: No classes.
Monday, October 19 (In Person/Online)	Lecture #8: Parallelism. [Slides] [Demo Notebook] [Demo HTML] Topics: Hardware trends that lead to parallelism; Sources of parallelism in hardware; Data parallelism; Extracting parallelism at different places in the computation; Simple parallelism on multicore. Due: Review of paper 5a or 5b. In-class project feedback activity.
Wednesday, October 21 (In Person/Online)	Paper Discussion 6a. Map-reduce for machine learning on multicore. Cheng-Tao Chu, Sang K Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, and Kunle Olukotun. In Advances in Neural Information Processing Systems (NeurIPS), 2007. Paper Discussion 6b. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. Feng Niu, Benjamin Recht, Christopher Re, and Stephen Wright. In Advances in Neural Information Processing Systems (NeurIPS), 2011.
Monday, October 26 (In Person/Online)	Lecture #9: Distributed Learning. [Slides] Topics: Learning on multiple machines; SGD with all-reduce; The parameter server; Asynchronous parallelism on multiple machines; Decentralized and local SGD; Model and pipeline parallelism. Due: Review of paper 6a or 6b. Due: Final project proposals.
Wednesday, October 28 (In Person/Online)	Paper Discussion 7a. Large scale distributed deep networks. Jeff Dean et al. In Advances in Neural Information Processing Systems (NeurIPS), 2012. Paper Discussion 7b. Towards federated learning at scale: System design. Keith Bonawitz, et al. In Proceedings of the 2nd MLSys Conference (MLSys), 2019. Due: Programming Assignment 2.
Monday, November 2 (In Person/Online)	Lecture #10: Low-Precision Arithmetic. [Slides] [Demo Notebook] [Demo HTML] Topics: Memory; Low-precision formats; Floating-point machine epsilon; Low-precision training; Scan order. Due: Review of paper 7a or 7b.
Wednesday, November 4 (In Person/Online)	Paper Discussion 8a. Deep learning with limited numerical precision. Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Proceedings of the International Conference on Machine Learning (ICML), 2015. Paper Discussion 8b. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
Monday, November 9 (In Person/Online)	Lecture #11: Inference and Compression. [Demo Notebook] Topics: Efficient inference; Metrics we care about when inferring; Compression; Fine-tuning; Hardware for inference. Due: Review of paper 8a or 8b.
Wednesday, November 11 (In Person/Online)	Paper Discussion 9a. MobileNets: Efficient convolutional neural networks for mobile vision applications. Andrew G. Howard et al. arXiv preprint, 2017. Paper Discussion 9b. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. Song Han, Huizi Mao, and William J Dally. Proceedings of the International Conference on Learning Representations (ICLR), 2016.
Monday, November 16	Semi-final study days: No classes.
Wednesday, November 18	Semi-final exams: No classes.
Monday, November 23	Semi-final exams: No classes.
Wednesday, November 25	Thanksgiving break: No classes.
Monday, November 30 (Online Only)	Lecture #12: Machine Learning Frameworks II. Topics: Large scale numerical linear algebra; Eager vs lazy ML frameworks in Python. Due: Review of paper 9a or 9b.
Wednesday, December 2 (Online Only)	Paper Discussion 10a. TensorFlow: A System for Large-Scale Machine Learning. Martin Abadi et al. USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016. Paper Discussion 10b. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Adam Paszke et al. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Monday, December 7 (Online Only)	Lecture #13: Hardware for Machine Learning. Topics: CPUs vs GPUs; What makes for good ML hardware?; How can hardware help with ML?; What does modern ML hardware look like? Due: Review of paper 10a or 10b.
Wednesday, December 9 (Online Only)	Paper Discussion 11a. In-datacenter performance analysis of a tensor processing unit. Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017. Paper Discussion 11b. A Configurable Cloud-Scale DNN Processor for Real-Time AI. Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, et al. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA), 2018. Due: Final project abstract. Can be submitted late until Sunday; will discuss in class on Monday.
Monday, December 14 (Online Only)	Lecture #15: Large Scale ML on the Cloud. Due: Review of paper 11a or 11b. Abstract discussion.
Wednesday, December 16 (Online Only)	Lecture #16: Final Project Discussion. Due: Final project report.