Programming Assignment 3: Accelerating SGD

CS4787 — Principles of Large-Scale Machine Learning — Spring 2020

Project Due: April 8, 2020 at 11:59pm

Late Policy: Up to two slip days can be used for the final submission.

Please submit all required documents to CMS.

This is a partner project. You can either work alone, or work ONLY with your chosen partner. Failure to adhere to this rule (by e.g. copying code) may result in an Academic Integrity Violation.

Overview: In this project, you will be testing some of the methods we discussed in class for accelerating stochastic gradient descent on the same MNIST task as in Programming Assignment 1. These methods are all used in practice for machine learning systems at even the largest scales, and the goal of this assignment is to give you some experience working with them so that you can build your intuition for how they work.

Background: In the last programming assignment, we looked at minibatched SGD with sequential sampling (Algorithm 4 from that project):

\begin{algorithm} \caption{Minibatch SGD with Fixed $\alpha$ and Sequential Sampling} \begin{algorithmic} \PROCEDURE{SGD}{$(f_1, \ldots, f_n), \alpha, w_0, B, \texttt{num\_epochs}$} \STATE \textbf{initialize} $w \leftarrow w_0$ \FOR{$t = 1$ \TO $\texttt{num\_epochs}$} \FOR{$i = 0$ \TO $(n / B) - 1$} \STATE $w \leftarrow w - \alpha \cdot \frac{1}{B} \sum_{b = 1}^B \nabla f_{i \cdot B + b}(w)$ \ENDFOR \ENDFOR \RETURN $w$ \ENDPROCEDURE \end{algorithmic} \end{algorithm}

As usual, \(\texttt{num\_epochs}\) denotes the number of epochs, or passes through the dataset, used in training, and the total number of iterations here will be \(T = \texttt{num\_epochs} \cdot n / B\), where we assume that \(B\) divides \(n\) evenly. When using stochastic gradients in this programming assignment, we are going to base our algorithms off of this one, by using for all our stochastic training algorithms both (1) minibatching and (2) sequential scan through the data.

Instructions: This project is split into three parts: the implementation and evaluation of momentum on top of gradient descent, the implementation and evaluation of momentum with stochastic gradients, and the exploration of Adam (an algorithm that uses adaptive learning rates).

Part 1: Momentum with gradient descent.

  1. Implement a function to compute the gradient of a batch of training examples for multinomial logistic regression (with l2 regularization) on MNIST.
  2. Implement a function to compute the training/test error of multinomial logistic regression on MNIST.
  3. Implement a function to compute the training loss of multinomial logistic regression (with l2 regularization) on MNIST.
  4. Implement a function to run gradient descent on multinomial logistic regression (with l2 regularization) on MNIST.
  5. Implement a function to run gradient descent using Nesterov's momentum on multinomial logistic regression (with l2 regularization) on MNIST. Here is some pseudocode for that algorithm.
    \begin{algorithm} \caption{Nesterov's momentum GD} \begin{algorithmic} \PROCEDURE{Nesterov}{$f, \alpha, \beta, w_0, \texttt{num\_epochs}$} \STATE \textbf{initialize} $w_0 \leftarrow w_0$ \STATE \textbf{initialize} $v_0 \leftarrow w_0$ \FOR{$t = 1$ \TO $\texttt{num\_epochs}$} \STATE $v_t \leftarrow w_{t-1} - \alpha \nabla f(w_{t-1})$ \STATE $w_t \leftarrow v_t + \beta (v_t - v_{t-1})$ \ENDFOR \RETURN $w$ \ENDPROCEDURE \end{algorithmic} \end{algorithm}
  6. Using an L2 regularization parameter of \(\gamma = 0.0001\) and step size \(\alpha = 1.0\) for 100 epochs/iterations, record the value of the parameters every iteration (i.e. once per full gradient step), for the following three algorithms and parameter settings:
  7. Evaluate the training error, test error, and training loss of each recorded model exactly, using the entire MNIST training and test dataset.
  8. Plot the resulting error and loss against the number of epochs in three figures, one for Training error, one for Test error, and one for Training Loss. How does the performance of the three methods compare?
  9. Measure the runtime of the two algorithms (Gradient descent and Nesterov's momentum) under the above parameter settings. How does the time to run an iteration compare between the two algorithms?
  10. Now test some other values of the hyperparameters \(\alpha\) and \(beta\) for both algorithms. How do the best settings you can come up with compare between gradient descent and Nesterov's momentum? List at least two hyperparameter settings you tried for each of the two algorithms, and describe what you observed by looking at the training error, test error, and training loss.

Part 2: Momentum with SGD.

  1. Implement a function to compute minibatch sequential sampling SGD for multinomial logistic regression (with l2 regularization) on MNIST.
  2. Implement a function to compute minibatch SGD with momentum and sequential sampling for multinomial logistic regression (With l2 regularization) on MNIST. Here is some pseudocode for that algorithm.
    \begin{algorithm} \caption{SGD with Momentum using Minibatching and Sequential Sampling} \begin{algorithmic} \PROCEDURE{MomentumSGD}{$(f_1, \ldots, f_n), \alpha, \beta, w_0, B, \texttt{num\_epochs}$} \STATE \textbf{initialize} $w \leftarrow w_0$ \STATE \textbf{initialize} $v \leftarrow 0$ \FOR{$t = 1$ \TO $\texttt{num\_epochs}$} \FOR{$i = 0$ \TO $(n / B) - 1$} \STATE $v \leftarrow \beta v - \alpha \cdot \frac{1}{B} \sum_{b = 1}^B \nabla f_{i \cdot B + b}(w)$ \STATE $w \leftarrow w + v$ \ENDFOR \ENDFOR \RETURN $w$ \ENDPROCEDURE \end{algorithmic} \end{algorithm}
  3. Using an L2 regularization parameter of \(\gamma = 0.0001\), step size \(\alpha = 0.2\), and minibatch size \(B = 600\) for 10 epochs, record the value of the parameters every 10 iterations (i.e. ten times per epoch), for the following three algorithms and parameter settings:
  4. Evaluate the training error, test error, and training loss of each recorded model exactly, using the entire MNIST training and test dataset.
  5. Plot the resulting error and loss against the number of epochs in three figures, one for Training error, one for Test error, and one for Training Loss. How does the performance of the three methods compare?
  6. Measure the runtime of the two algorithms (SGD and SGD+Momentum) under the above parameter settings. How does the time to run an iteration compare between the two algorithms?
  7. Now test some other values of the hyperparameters \(\alpha\) and \(beta\) for both algorithms. How do the best settings you can come up with compare between SGD and SGD+momentum? List at least two hyperparameter settings you tried for each of the two algorithms, and describe what you observed by looking at the training error, test error, and training loss.

Part 3: ADAM.

  1. Implement a function to compute the Adam optimization algorithm (using minibatching and sequential sampling) for multinomial logistic regression (with l2 regularization) on MNIST. Here is some pseudocode for that algorithm.
    \begin{algorithm} \caption{Adam} \begin{algorithmic} \State \textbf{input:} learning rate factor $\alpha$, decay rates $\rho_1$ and $\rho_2$, initial parameters $w$ \State \textbf{input:} minibatch size $B$, small constant $\epsilon$ to avoid division by $0$ (e.g. $\epsilon = 10^{-5}$) \State \textbf{initialize } $r \leftarrow 0$ (a vector in $\R^d$) \State \textbf{initialize } $s \leftarrow 0$ (a vector in $\R^d$) \State \textbf{initialize timestep } $t \leftarrow 0$ \For{$k = 1$ \TO $\texttt{num\_epochs}$} \For{$i = 0$ \TO $(n / B) - 1$} \State $t \leftarrow t + 1$ \State sample a stochastic gradient $g \leftarrow \frac{1}{B} \sum_{b = 1}^B \nabla f_{i \cdot B + b}(w)$ \State accumulate first moment estimate $s_j \leftarrow \rho_1 s_j + (1 - \rho_1) g_j$ for all $j \in \{1, \ldots, d \}$ \State accumulate second moment estimate $r_j \leftarrow \rho_2 r_j + (1 - \rho_2) g_j^2$ for all $j \in \{1, \ldots, d \}$ \State correct first moment bias $\hat s \leftarrow \frac{s}{1 - \rho_1^t}$ \State correct second moment bias $\hat r \leftarrow \frac{r}{1 - \rho_2^t}$ \State update model $w_j \leftarrow w_j - \frac{\alpha}{\sqrt{\hat r_j + \epsilon}} \cdot \hat s_j$ for all $j \in \{1, \ldots, d \}$ \EndFor \EndFor \Return $w$ \end{algorithmic} \end{algorithm}
  2. Using an L2 regularization parameter of \(\gamma = 0.0001\), and minibatch size \(B = 600\) for 10 epochs, record the value of the parameters every 10 iterations (i.e. ten times per epoch), for the following two algorithms and parameter settings:
  3. Evaluate the training error, test error, and training loss of each recorded model exactly, using the entire MNIST training and test dataset.
  4. Plot the resulting error and loss against the number of epochs in three figures, one for Training error, one for Test error, and one for Training Loss. How does the performance of the two methods compare?
  5. Measure the runtime of the two algorithms (SGD and Adam) under the above parameter settings. How does the time to run an iteration compare between the two algorithms?
  6. Now test some other values of the hyperparameters \(\alpha\), \(rho_1\), and \(\rho_2\) for Adam. How do the best settings you can come up with compare between SGD (from Part 2) and Adam? List at least three hyperparameter settings you tried for Adam, and describe what you observed by looking at the training error, test error, and training loss.

What to submit:

  1. An implementation of the functions in main.py.
  2. A lab report containing:

Setup:

  1. Run pip3 install -r requirements.txt to install the required python packages