CS 6241: Numerical Methods for Data Science
Cornell University, Spring 2020.
Lectures: 2:55 PM–4:10 PM, Tuesdays and Thursdays, virtual on Zoom (originally 203 Phillips Hall).
Instructor: Austin Benson, Assistant Professor, Computer Science (arb@cs.cornell.edu).
Office hours: 4:15 PM–5:15 PM Thursdays (after class) or by appointment, virtual on Zoom (originally 413B Gates).
TA: Kangbo Li, PhD student, Computer Science (kl935@cornell.edu).
Office hours: 2:30–3:30 PM Mondays, virtual on Zoom (originally 405 Rhodes).
Resources
Course administration:
- CMS for managing course work.
- Jupyter notebooks and data for class demos.
- CS 6241 Spring 2019 web page.
- David Bindel's 2019 SJTU summer short course on numerical methods for data science.
- Numerical Linear Algebra. L. N. Trefethen and D. Bau III.
Coursework schedule
- Homework 1. Due Th Feb 13 at 11:59pm ET.
- Homework 2. Due Th Mar 5 at 11:59pm ET.
- Reaction paper. Due Th Apr 9 at 11:59pm ET.
- Project proposal. Due Th Apr 23 at 11:59pm ET.
- Project progress report. Due Th May 7 at 11:59pm ET.
- Project final report. Due Th May 21 at 11:59pm ET.
Class schedule
This schedule is tentative and subject to change.
Week 1
Lecture 1 (Tu 1/21): overview and background
Topics: course overview, vector and matrix norms, linear systems, eigenvalues, basic matrix factorizations, basic optimization
Readings:
- Background notes from David Bindel's 2019 SJTU summer course.
- Numerical Optimization (Chapters 1 and 2).
Lecture 2 (Th 1/23)
Topics: solving linear least squares with matrix factorizations, statistical interpretations
Readings:
- The Elements of Statistical Learning (sections 2.9, 3.1, and 3.2).
- Numerical Optimization (Chapters 1 and 2).
- Numerical Linear Algebra (Sections I and II).
- D. Bindel's lecture notes on linear least squares.
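Not part of the course materials, but as a small illustration of the factorization approach to least squares covered in this lecture, here is a minimal NumPy sketch on synthetic data (sizes and data are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))          # tall synthetic data matrix
b = rng.standard_normal(100)               # synthetic observations

# Thin QR factorization A = QR; the least squares solution solves R x = Q^T b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Sanity check against the library least squares solver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_qr, x_ref)
```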
Week 2
Lecture 3 (Tu 1/28): regularized linear least squares
Topics: Tikhonov regularization / ridge regression, Lasso, pivoted QR
Readings:
- Applied Numerical Linear Algebra (section 3.5).
- Rank Revealing QR Factorizations. T. F. Chan. LAA, 1987.
- The Elements of Statistical Learning (section 3.4).
- D. Bindel's lecture notes on regularized linear least squares.
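As a quick, unofficial illustration of Tikhonov regularization / ridge regression from this lecture, here is a NumPy sketch on synthetic data (the regularization parameter is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 8))
b = rng.standard_normal(100)
lam = 0.1                                  # arbitrary regularization strength
n = A.shape[1]

# Normal-equations form: (A^T A + lam I) x = A^T b.
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

# Equivalent augmented least squares formulation: min ||[A; sqrt(lam) I] x - [b; 0]||_2.
A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n)])
b_aug = np.concatenate([b, np.zeros(n)])
x_aug, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
assert np.allclose(x_ridge, x_aug)
```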
Lecture 4 (Th 1/30)
Topics: (stochastic) gradient descent, (quasi-)Newton
Readings:
- Numerical Optimization (sections 3.1–3.3)
- Optimization Methods for Large-Scale Machine Learning. L. Bottou, F. E. Curtis, and J. Nocedal. SIREV, 2018.
- D. Bindel's lecture notes on regularized linear least squares.
- D. Bindel's lecture notes on optimization.
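For reference (not from the lecture notes), a minimal stochastic gradient descent loop for a linear least squares objective, with a synthetic problem and a hand-picked fixed step size:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 10
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(n)

x = np.zeros(d)
step = 0.01                                # fixed step size (a tuning choice)
for epoch in range(50):
    for i in rng.permutation(n):           # one pass over the rows in random order
        g = (A[i] @ x - b[i]) * A[i]       # gradient of 0.5 * (a_i^T x - b_i)^2
        x -= step * g

print(np.linalg.norm(x - x_true))          # should be small
```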
Week 3
Lecture 5 (Tu 2/4): iterative optimization and start of latent factor models
Topics: gradient descent with errors, more SGD, momentum, acceleration, basic ideas for latent factor models
Readings:
- D. Bindel's lecture notes on regularized linear least squares.
- D. Bindel's lecture notes on optimization.
- D. Bindel's lecture notes on latent factor models.
- Optimization Methods for Large-Scale Machine Learning. L. Bottou, F. E. Curtis, and J. Nocedal. SIREV, 2018.
- A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights. W. Su, S. Boyd, and E. J. Candès. JMLR, 2016.
- The Advanced Matrix Factorization Jungle. I. Carron.
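A small, unofficial sketch of the heavy-ball (momentum) iteration from this lecture, applied to a synthetic strongly convex quadratic; the step size and momentum parameter are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
Q = rng.standard_normal((20, 20))
H = Q.T @ Q + np.eye(20)                   # SPD Hessian of f(x) = 0.5 x^T H x - c^T x
c = rng.standard_normal(20)
grad = lambda x: H @ x - c

x, v = np.zeros(20), np.zeros(20)
alpha = 1.0 / np.linalg.norm(H, 2)         # step size ~ 1/L
beta = 0.9                                 # momentum parameter
for _ in range(500):
    v = beta * v - alpha * grad(x)         # heavy-ball momentum update
    x = x + v

print(np.linalg.norm(grad(x)))             # near zero at the minimizer
```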
Lecture 6 (Th 2/6)
Readings:
- C. Shalizi's lecture notes on PCA.
- D. Bindel's lecture notes on collaborative filtering and other stories.
- Robust Principal Component Analysis?. E. J. Candès et al. JACM, 2011.
- Proximal Algorithms. N. Parikh and S. Boyd. Foundations and Trends in Optimization, 2014.
- S. Boyd and L. Vandenberghe's lecture notes on subgradients.
- S. Boyd's lecture notes on ADMM.
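These readings center on PCA and its robust and regularized variants; as a baseline reference (not course code), here is classical PCA computed from an SVD of centered synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 10))   # roughly rank-3 data

Xc = X - X.mean(axis=0)                    # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3
scores = U[:, :k] * s[:k]                  # coordinates in the top-k principal subspace
components = Vt[:k]                        # principal directions (rows)
explained = s[:k] ** 2 / np.sum(s ** 2)    # fraction of variance explained
print(explained)
```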
Week 4 [HW1 due Th 2/13 at 11:59pm ET]
Lecture 7 (Tu 2/11): latent factor models, dimensionality reduction, and matrix factorizations
Topics: SVD-based latent factors and matrix completion
Readings:
- D. Bindel's lecture notes on collaborative filtering and other stories.
- Netflix Update: Try This at Home. S. Funk, 2006.
- Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. Y. Koren. KDD, 2008.
- GloVe: Global Vectors for Word Representation. Pennington et al. EMNLP, 2014.
- Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. Yang et al. ICLR, 2018.
- Exact Matrix Completion via Convex Optimization. E. J. Candès and B. Recht. FOCM, 2009.
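In the spirit of the Funk "Try This at Home" reading above, here is a hypothetical, minimal SGD matrix-factorization sketch for a partially observed ratings matrix (synthetic data; the rank, step size, and regularization are made-up choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n_users, n_items, rank = 50, 40, 5
U_true = rng.standard_normal((n_users, rank))
V_true = rng.standard_normal((n_items, rank))
obs = []                                   # observed (user, item, rating) triples
for _ in range(800):
    u, i = rng.integers(n_users), rng.integers(n_items)
    obs.append((u, i, U_true[u] @ V_true[i] + 0.1 * rng.standard_normal()))

U = 0.1 * rng.standard_normal((n_users, rank))
V = 0.1 * rng.standard_normal((n_items, rank))
step, reg = 0.02, 0.05                     # tuning choices
for epoch in range(30):
    for u, i, r in obs:
        err = r - U[u] @ V[i]              # residual on one observed entry
        gu = err * V[i] - reg * U[u]
        gv = err * U[u] - reg * V[i]
        U[u] += step * gu
        V[i] += step * gv

rmse = np.sqrt(np.mean([(r - U[u] @ V[i]) ** 2 for u, i, r in obs]))
print(rmse)                                # training RMSE of the learned factors
```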
Lecture 8 (Th 2/13)
Topics: nonnegative matrix factorization (NMF)
Readings:
- D. Bindel's lecture notes on non-negative matrix factorization.
- The Why and How of Nonnegative Matrix Factorization. N. Gillis. arXiv, 2014.
- Learning the parts of objects by non-negative matrix factorization. D. D. Lee and H. S. Seung. Nature, 1999.
- Algorithms for Non-negative Matrix Factorization. D. D. Lee and H. S. Seung. NeurIPS, 2001.
- When Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts? D. Donoho and V. Stodden. NeurIPS, 2003.
- On the complexity of nonnegative matrix factorization. S. A. Vavasis. SIOPT, 2009.
- Computing a nonnegative matrix factorization—provably. S. Arora, R. Ge, R. Kannan, and A. Moitra. SICOMP, 2016.
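A minimal, unofficial sketch of the Lee-Seung multiplicative updates for NMF (minimizing ||X - WH||_F^2 with nonnegative factors) on random nonnegative data; the rank and iteration count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.random((60, 40))                   # nonnegative data matrix
r = 5
W = rng.random((60, r))
H = rng.random((r, 40))
eps = 1e-10                                # guard against division by zero

for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + eps)   # multiplicative update for H
    W *= (X @ H.T) / (W @ H @ H.T + eps)   # multiplicative update for W

print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))   # relative reconstruction error
```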
Week 5
Lecture 9 (Tu 2/18): interpretable latent factor models and start of tensors
Topics: interpolative decomposition, CUR, basics on tensors, problems with border rank
Readings:
- D. Bindel's lecture notes on SVD and other low rank decompositions.
- Randomized algorithms for the low-rank approximation of matrices. E. Liberty et al. PNAS, 2007.
- CUR matrix decompositions for improved data analysis. M. W. Mahoney and P. Drineas. PNAS, 2009.
- Tensor Rank and the Ill-Posedness of the Best Low-Rank Approximation Problem. V. de Silva and L.-H. Lim. SIMAX, 2008.
Lecture 10 (Th 2/20)
Readings:
- Most Tensor Problems Are NP-Hard. C. J. Hillar, L.-H. Lim. JACM, 2013.
- A Counterexample to the Possibility of an Extension of the Eckart–Young Low-Rank Approximation Theorem for the Orthogonal Rank Tensor Decomposition. T. G. Kolda. SIMAX, 2003.
- Tensor Decompositions and Applications. T. G. Kolda and B. W. Bader. SIREV, 2009.
- On the Best Rank-1 and Rank-(R1, R2,...,RN) Approximation of Higher-Order Tensors. L. De Lathauwer, B. De Moor, and J. Vandewalle. SIMAX, 2000.
- Unsupervised Discovery of Demixed, Low-dimensional Neural Dynamics across Multiple Timescales through Tensor Components Analysis. A. H. Williams et al. Neuron, 2018.
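Not from the course materials, but a compact sketch of alternating least squares for a rank-R CP decomposition of a synthetic third-order tensor (the einsum calls play the role of the matricized-tensor-times-Khatri-Rao products):

```python
import numpy as np

rng = np.random.default_rng(7)
I, J, K, R = 20, 15, 10, 3
A0, B0, C0 = rng.standard_normal((I, R)), rng.standard_normal((J, R)), rng.standard_normal((K, R))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)  # exact rank-R synthetic tensor

A, B, C = [rng.standard_normal(s) for s in [(I, R), (J, R), (K, R)]]
for _ in range(100):
    # Each factor solves a linear least squares problem with the other two fixed.
    A = np.einsum('ijk,jr,kr->ir', T, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
    B = np.einsum('ijk,ir,kr->jr', T, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
    C = np.einsum('ijk,ir,jr->kr', T, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))

T_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.linalg.norm(T - T_hat) / np.linalg.norm(T))   # relative fit error
```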
Week 6
No lecture Tu 2/25 (February break)
Lecture 11 (Th 2/27): Nonlinear dimensionality reduction
Topics: ISOMAP, LLE, t-SNE
Readings:
- A Global Geometric Framework for Nonlinear Dimensionality Reduction. J. B. Tenenbaum, V. de Silva, and J. C. Langford. Science, 2000.
- Nonlinear Dimensionality Reduction by Locally Linear Embedding. S. T. Roweis and L. K. Saul. Science, 2000.
- Stochastic Neighbor Embedding. G. Hinton and S. Roweis. NeurIPS, 2002.
- Visualizing Data using t-SNE. L. van der Maaten and G. Hinton. JMLR, 2008.
- Accelerating t-SNE using Tree-Based Algorithms. L. van der Maaten. JMLR, 2014.
- How to Use t-SNE Effectively. M. Wattenberg, F. Viégas, and I. Johnson. Distill, 2016.
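A one-line usage example (not course code) of an off-the-shelf t-SNE implementation; it assumes scikit-learn is available and uses its built-in digits dataset with default-ish parameters:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)        # 1797 samples of 64-dimensional digit images
emb = TSNE(n_components=2, perplexity=30, init='pca', random_state=0).fit_transform(X)
print(emb.shape)                           # (1797, 2) low-dimensional coordinates for plotting
```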
Week 7 [HW2 due Th 3/5 at 11:59pm ET]
Lecture 12 (Tu 3/3): Basic network analysis
Topics: matrices associated with graphs, common network properties
Readings:
- The Structure and Function of Complex Networks. M. E. J. Newman. SIREV, 2003.
- Mining Large Graphs. D. F. Gleich and M. W. Mahoney. Handbook of Big Data, Handbooks of modern statistical methods, 2016.
- D. Spielman's spectral graph theory course.
- The Anatomy of the Facebook Social Graph. J. Ugander et al. arXiv, 2011.
- Collective dynamics of ‘small-world’ networks. D. J. Watts and S. H. Strogatz. Nature, 1998.
- An Experimental Study of the Small World Problem. J. Travers and S. Milgram. Sociometry, 1969.
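For concreteness (not from the notes), a tiny example of the matrices associated with a graph that this lecture discusses, built from a hypothetical five-node edge list:

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]   # hypothetical undirected graph
n = 5
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1                  # symmetric adjacency matrix

d = A.sum(axis=1)                          # degree vector
L = np.diag(d) - A                         # combinatorial Laplacian
D_inv_sqrt = np.diag(1 / np.sqrt(d))
L_norm = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian

print(np.linalg.eigvalsh(L))               # smallest eigenvalue is 0; the second is > 0 iff connected
```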
Lecture 13 (Th 3/5)
Topics: common network properties, random graph models
Readings:
- The Structure and Function of Complex Networks. M. E. J. Newman. SIREV, 2003.
- The Anatomy of the Facebook Social Graph. J. Ugander et al. arXiv, 2011.
- The Diameter of Sparse Random Graphs. F. Chung and L. Lu. Advances in Applied Mathematics, 2001.
- Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. J. Leskovec, J. Kleinberg, and C. Faloutsos. KDD, 2005.
- Configuring Random Graph Models with Fixed Degree Sequences. B. K. Fosdick et al. SIREV, 2018.
- Stochastic blockmodels and community structure in networks. B. Karrer and M. E. J. Newman. PRE, 2011.
- Community Detection and Stochastic Block Models: Recent Developments. E. Abbe. JMLR, 2017.
- A Brief History of Generative Models for Power Law and Lognormal Distributions. M. Mitzenmacher. Internet Mathematics, 2004.
- Power-Law Distributions in Empirical Data. A. Clauset, C. R. Shalizi, and M. E. J. Newman. SIREV, 2009.
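A small, unofficial sketch of the Chung-Lu random graph model from the readings: each edge (i, j) appears independently with probability proportional to the product of target expected degrees (the weights below are made up):

```python
import numpy as np

rng = np.random.default_rng(8)
w = np.array([5, 4, 4, 3, 3, 2, 2, 2, 2, 1], dtype=float)   # hypothetical expected degrees
n, S = len(w), w.sum()

A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        if rng.random() < min(w[i] * w[j] / S, 1.0):         # Chung-Lu edge probability
            A[i, j] = A[j, i] = 1

print(A.sum(axis=1))                       # realized degrees; roughly w in expectation
```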
Week 8
Lecture 14 (Tu 3/10): Unsupervised learning on graphs
Topics: spectral methods for bipartitioning, ratio cut, normalized cut
Readings:
- A tutorial on spectral clustering. U. von Luxburg. Statistics and Computing, 2007.
- Communities in Networks. M. A. Porter, J.-P. Onnela, and P. J. Mucha. Notices of the AMS, 2009.
- Community detection in networks: A user guide. S. Fortunato and D. Hric. Physics Reports, 2016.
Lecture 15 (Th 3/12)
Topics: conductance, Cheeger's inequality, k-way spectral clustering
Readings:
- A tutorial on spectral clustering. U. von Luxburg. Statistics and Computing, 2007.
- D. A. Spielman's lecture notes on Conductance, the Normalized Laplacian, and Cheeger’s Inequality.
- Trace optimization and eigenproblems in dimension reduction methods. E. Kokiopoulou, J. Chen, and Y. Saad. Numer. Linear Algebra Appl., 2011.
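To make the spectral bipartitioning idea of these two lectures concrete, here is an unofficial sketch that plants a two-block graph, computes the Fiedler vector of the Laplacian, and splits on its sign (block sizes and probabilities are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 40
blocks = np.repeat([0, 1], n // 2)         # planted ground-truth partition
P = np.where(blocks[:, None] == blocks[None, :], 0.5, 0.05)
A = np.triu(rng.random((n, n)) < P, 1).astype(float)
A = A + A.T                                # symmetric adjacency of the sampled graph

L = np.diag(A.sum(axis=1)) - A
evals, evecs = np.linalg.eigh(L)
fiedler = evecs[:, 1]                      # eigenvector of the second-smallest eigenvalue
labels = (fiedler > 0).astype(int)         # sign-based bipartition
print(np.mean(labels == blocks), np.mean(labels != blocks))   # one should be near 1 (sign ambiguity)
```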
Week 9
No lecture Tu 3/17 or Th 3/19 (Cornell classes suspended due to COVID-19)
Week 10
No lecture Tu 3/24 or Th 3/26 (Cornell classes suspended due to COVID-19)
Week 11
No lecture Tu 3/31 or Th 4/2 (spring break)
Week 12 [Reaction paper due Th 4/9 at 11:59pm ET]
Lecture 16 (Tu 4/7): Graph-based semi-supervised learning [VIRTUAL SYNCHRONOUS LECTURE]
Topics: classical graph-based semi-supervised techniques with connections to random walks and spectral clustering
Readings:
- Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. X. Zhu, Z. Ghahramani, and J. Lafferty. ICML, 2003.
- Learning with Local and Global Consistency. D. Zhou et al. NeurIPS, 2004.
- Empirical stationary correlations for semi-supervised learning on graphs. Y. Xu, J. S. Dyer, and A. B. Owen. Ann. Appl. Stat., 2010.
- Using Local Spectral Methods to Robustify Graph-Based Learning Algorithms. D. F. Gleich and M. W. Mahoney. KDD, 2015.
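A tiny, unofficial example of the Zhu-Ghahramani-Lafferty harmonic-function solution on a hypothetical path graph with the two endpoints labeled; the unlabeled values come from one linear solve with the Laplacian:

```python
import numpy as np

n = 6
A = np.zeros((n, n))
for i in range(n - 1):                     # path graph 0-1-2-3-4-5
    A[i, i + 1] = A[i + 1, i] = 1
L = np.diag(A.sum(axis=1)) - A

labeled = np.array([0, 5])
unlabeled = np.array([1, 2, 3, 4])
f_l = np.array([0.0, 1.0])                 # known labels at the endpoints

# Harmonic solution: f_u = -L_uu^{-1} L_ul f_l.
L_uu = L[np.ix_(unlabeled, unlabeled)]
L_ul = L[np.ix_(unlabeled, labeled)]
f_u = np.linalg.solve(L_uu, -L_ul @ f_l)
print(f_u)                                 # linear interpolation: [0.2, 0.4, 0.6, 0.8]
```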
Lecture 17 (Th 4/9)
Readings:
- Latent Space Approaches to Social Network Analysis. P. D. Hoff, A. E. Raftery, and M. S. Handcock. JASA, 2002.
- RolX: Structural Role Extraction & Mining in Large Graphs. K. Henderson et al. KDD, 2012.
- DeepWalk: Online Learning of Social Representations. B. Perozzi, R. Al-Rfou, and S. Skiena. KDD, 2014.
- node2vec: Scalable Feature Learning for Networks. A. Grover and J. Leskovec. KDD, 2016.
- Representation Learning on Graphs: Methods and Applications. W. L. Hamilton, R. Ying, and J. Leskovec. Bull. of the IEEE Comp. Soc. TCDE, 2017.
- Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. J. Qiu et al. WSDM, 2018.
Week 13
Lecture 18 (Tu 4/14): Graph neural networks [VIRTUAL SYNCHRONOUS LECTURE]
Topics: high-level ideas, basic derivation of graph convolutional networks
Readings:
- C. Shalizi's lecture notes on Logistic Regression.
- Learning with Local and Global Consistency. D. Zhou et al. NeurIPS, 2004.
- Representation Learning on Graphs: Methods and Applications. W. L. Hamilton, R. Ying, and J. Leskovec. Bull. of the IEEE Comp. Soc. TCDE, 2017.
- Semi-Supervised Classification with Graph Convolutional Networks. T. N. Kipf and M. Welling. ICLR, 2017.
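For orientation (not course code), one forward pass of the Kipf-Welling graph convolution rule H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W) on a hypothetical six-node cycle with random features and weights:

```python
import numpy as np

rng = np.random.default_rng(10)
n, f_in, f_out = 6, 4, 3
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]:
    A[i, j] = A[j, i] = 1                  # cycle graph

X = rng.standard_normal((n, f_in))         # node feature matrix
W = rng.standard_normal((f_in, f_out))     # layer weights (learned in practice)

A_hat = A + np.eye(n)                      # add self-loops
d_hat = A_hat.sum(axis=1)
S = A_hat / np.sqrt(np.outer(d_hat, d_hat))   # symmetric normalization D^{-1/2} A_hat D^{-1/2}
H = np.maximum(S @ X @ W, 0)               # one GCN layer with ReLU
print(H.shape)                             # (6, 3) updated node features
```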
Lecture 19 (Th 4/16)
Readings:
- Outcome Correlation in Graph Neural Network Regression. J. Jia and A. R. Benson. arXiv, 2020.
- Inductive Representation Learning on Large Graphs. W. L. Hamilton, R. Ying, and J. Leskovec. NeurIPS, 2017.
- Spectral networks and locally connected networks on graphs. J. Bruna et al. ICLR, 2014.
Week 14 [Project proposal due Th 4/23 at 11:59pm ET]
"Lecture" 20 (Tu 4/21): small subgraph patterns [READ ON YOUR OWN] Topics: network motifs and network structureReadings:
- Network Motifs: Simple Building Blocks of Complex Networks. R. Milo et al. Science, 2002.
- Higher-order organization of complex networks. A. R. Benson, D. F. Gleich, and J. Leskovec. Science, 2016.
- Subgraph Frequencies: Mapping the Empirical and Extremal Geography of Large Graph Collections. J. Ugander, L. Backstrom, and J. Kleinberg. WWW, 2013. [See also the companion page.]
Lecture 21 (Th 4/23)
Readings:
- Arboricity and Subgraph Listing Algorithms. N. Chiba and T. Nishizeki. SICOMP, 1985.
- Main-memory triangle computations for very large (sparse (power-law)) graphs. M. Latapy. Theoretical Computer Science, 2008.
- Triadic Measures on Graphs: The Power of Wedge Sampling. C. Seshadhri, A. Pinar, and T. G. Kolda. SDM, 2013.
- ESCAPE: Efficiently Counting All 5-Vertex Subgraphs. A. Pinar, C. Seshadhri, and V. Vishal. WWW, 2017.
- The Power of Pivoting for Exact Clique Counting. S. Jain and C. Seshadhri. WSDM, 2020.
- Retrieving Top Weighted Triangles in Graphs. R. Kumar et al. WSDM, 2020.
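A toy, unofficial illustration of exact triangle and wedge counting via the adjacency matrix (fine for tiny graphs; the readings above are about doing this at scale):

```python
import numpy as np

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]   # hypothetical graph: a triangle plus a pendant edge
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1

# Each triangle is counted 6 times in trace(A^3) (3 starting vertices x 2 orientations).
triangles = np.trace(A @ A @ A) / 6
wedges = sum(d * (d - 1) / 2 for d in A.sum(axis=1))    # paths of length 2
print(triangles, 3 * triangles / wedges)   # triangle count and global clustering coefficient
```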
Week 15
Lecture 22 (Tu 4/28): kernels [VIRTUAL SYNCHRONOUS LECTURE]
Topics: feature maps, kernel trick, function approximation, RKHSs
Readings:
- D. Bindel's lecture notes on Many interpretations of kernels.
- The Elements of Statistical Learning (section 14.5.4).
- scikit-learn kernel PCA example.
- Kernel Methods in Machine Learning. T. Hofmann, B. Schölkopf, and A. J. Smola. Ann. Stat., 2008.
Lecture 23 (Th 4/30)
Topics: RKHSs, native spaces, Moore–Aronszajn, Gaussian Processes, GP likelihood and hyperparameter optimization
Readings:
- D. Bindel's lecture notes on Many interpretations of kernels.
- D. Bindel's lecture notes on Approaches to kernel selection.
- D. Bindel's lecture notes on Computing with GPs.
- Gaussian Processes for Machine Learning. C. E. Rasmussen and C. K. I. Williams. MIT Press, 2006.
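A minimal, unofficial Gaussian process regression sketch in the style of Rasmussen and Williams: a squared-exponential kernel, a Cholesky solve for the posterior mean and variance, synthetic 1-D data, and made-up hyperparameters:

```python
import numpy as np

def rbf(x1, x2, ell=0.5):
    # Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2)) for 1-D inputs.
    return np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2 * ell ** 2))

rng = np.random.default_rng(11)
X = np.sort(rng.uniform(0, 5, 15))         # training inputs
y = np.sin(X) + 0.1 * rng.standard_normal(15)
Xs = np.linspace(0, 5, 100)                # test inputs
sigma2 = 0.01                              # noise variance (a modeling choice)

K = rbf(X, X) + sigma2 * np.eye(len(X))
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # K^{-1} y via triangular solves

Ks = rbf(Xs, X)
mean = Ks @ alpha                           # posterior mean at the test inputs
v = np.linalg.solve(L, Ks.T)
var = rbf(Xs, Xs).diagonal() - np.sum(v ** 2, axis=0)   # posterior variance
print(mean[:3], var[:3])
```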
Week 16 [Progress report due Th 5/7 at 11:59pm ET]
"Lecture" 24 (Tu 5/5): Gaussian Processes [READ ON YOUR OWN] Topics: fast computation for hyperparameter optimizationReadings:
- Scalable Log Determinants for Gaussian Process Kernel Learning. K. Dong et al. NeurIPS, 2017.
Lecture 25 (Th 5/7)
Readings:
- scikit-learn Kernels for Gaussian Processes.
- Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP). A. G. Wilson and H. Nickisch. ICML, 2015.
- Scaling Gaussian Process Regression with Derivatives. D. Eriksson et al. NeurIPS, 2018.
- Exact Gaussian Processes on a Million Data Points. K. A. Wang et al. NeurIPS, 2019.
- Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. S. Ubaru, J. Chen, and Y. Saad. SIMAX, 2017.
- D. Bindel's lecture notes on Computing with GPs.
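To illustrate the stochastic trace estimation idea behind these readings (log det K = tr(log K)), here is an unofficial toy sketch using Hutchinson probes; for simplicity it forms the dense matrix logarithm explicitly, whereas the papers above replace that step with Lanczos quadrature on matrix-vector products:

```python
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(12)
n = 200
Q = rng.standard_normal((n, n))
K = Q @ Q.T / n + np.eye(n)                # SPD matrix standing in for a kernel matrix

logK = logm(K).real                        # dense matrix logarithm (the expensive step)
num_probes = 30
est = 0.0
for _ in range(num_probes):
    z = rng.choice([-1.0, 1.0], size=n)    # Rademacher probe vector
    est += z @ logK @ z / num_probes       # Hutchinson estimate of tr(log K)

sign, exact = np.linalg.slogdet(K)
print(est, exact)                          # estimate vs. exact log-determinant
```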
Week 17
Project feedback sessions (Mon 5/11 and Tu 5/12)
Week 18 [Final project report due Th 5/21 at 11:59pm ET]
Coursework
Coursework will be managed through CMS, where assignments will be submitted.
The required coursework consists of three components:
- Homework (20%)
There are two homeworks, each with a theoretical component and/or an implementation and data analysis component.
- Reaction paper (20%)
The second component of the course is composing a (roughly) 5-page reaction paper that summarizes and critiques at least two closely related published papers relevant to the class. The motivation for this assignment is to get everyone thinking about research issues related to the class material and to stimulate ideas for the course project (described below). You can work individually or in groups of two or three. Your group members can be the same as your project partners, but they do not have to be. There may be a group presentation involved. Additional details will be posted soon.
- Course project (60%)
Most of your grade will be determined by a course project, where the topic is of your choosing but must be related to the class. You can work individually or in groups of two or three (the scope of the project should be scaled according to the group size). We will be looking for three key pieces in the course project: (i) some theoretical or mathematical discussion of a numerical method or algorithm (broadly construed); (ii) some software implementation of a method or algorithm; and (iii) some analysis of a real-world dataset. The project will be split into three components: a proposal, a progress report, and a final report, worth 10%, 15%, and 35% of the final grade, respectively. More details will be posted soon.