Programming Assignment 3: Accelerating SGD

CS4787 — Principles of Large-Scale Machine Learning — Spring 2020

Project Due: April 8, 2020 at 11:59pm

Late Policy: Up to two slip days can be used for the final submission.

Please submit all required documents to CMS.

This is a partner project. You can either work alone, or work ONLY with your chosen partner. Failure to adhere to this rule (by e.g. copying code) may result in an Academic Integrity Violation.

Overview: In this project, you will be testing some of the methods we discussed in class for accelerating stochastic gradient descent on the same MNIST task as in Programming Assignment 1. These methods are all used in practice for machine learning systems at even the largest scales, and the goal of this assignment is to give you some experience working with them so that you can build your intuition for how they work.

Background: In the last programming assignment, we looked at minibatched SGD with sequential sampling (Algorithm 4 from that project):

Algorithm 1 Minibatch SGD with Fixed $\alpha$ and Sequential Sampling

procedure SGD($(f_1, \ldots, f_n)$, $\alpha$, $w_0$, $B$, $\texttt{num\_epochs}$)
    initialize $w \leftarrow w_0$
    for $t = 1$ to $\texttt{num\_epochs}$ do
        for $i = 0$ to $(n/B) - 1$ do
            $w \leftarrow w - \alpha \cdot \frac{1}{B} \sum_{b=1}^{B} \nabla f_{i \cdot B + b}(w)$
        end for
    end for
    return $w$
end procedure

As usual, $\texttt{num\_epochs}$ denotes the number of epochs (passes through the dataset) used in training, and the total number of iterations is $T = \texttt{num\_epochs} \cdot n / B$, where we assume that $B$ divides $n$ evenly. All of the stochastic training algorithms in this programming assignment build on this one: each uses both (1) minibatching and (2) a sequential scan through the data.
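
For concreteness, here is a minimal Python/NumPy sketch of Algorithm 1. The helper batch_gradient(Xs, Ys, idxs, gamma, w), assumed to return the average gradient of the L2-regularized loss over the minibatch indexed by idxs, is a hypothetical stand-in for the gradient function you will write in Part 1; its name, signature, and the assumption that examples are stored as columns of Xs are illustrative, not the interface in main.py.

    import numpy as np

    def minibatch_sgd_sequential(Xs, Ys, gamma, w0, alpha, B, num_epochs, batch_gradient):
        # Algorithm 1: minibatch SGD with a fixed step size and sequential sampling.
        # batch_gradient(Xs, Ys, idxs, gamma, w) is assumed to return the average
        # gradient of the L2-regularized loss over the examples indexed by idxs.
        n = Xs.shape[1]                 # number of training examples (assumed stored as columns)
        w = w0.copy()
        for t in range(num_epochs):
            for i in range(n // B):
                idxs = np.arange(i * B, (i + 1) * B)   # sequential scan, no shuffling
                w = w - alpha * batch_gradient(Xs, Ys, idxs, gamma, w)
        return w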

Instructions: This project is split into three parts: the implementation and evaluation of momentum on top of gradient descent, the implementation and evaluation of momentum with stochastic gradients, and the exploration of Adam (an algorithm that uses adaptive learning rates).

Part 1: Momentum with gradient descent.

  1. Implement a function to compute the gradient of a batch of training examples for multinomial logistic regression (with L2 regularization) on MNIST.
  2. Implement a function to compute the training/test error of multinomial logistic regression on MNIST.
  3. Implement a function to compute the training loss of multinomial logistic regression (with L2 regularization) on MNIST.
  4. Implement a function to run gradient descent on multinomial logistic regression (with L2 regularization) on MNIST.
  5. Implement a function to run gradient descent using Nesterov's momentum on multinomial logistic regression (with L2 regularization) on MNIST. Here is some pseudocode for that algorithm (a Python sketch also appears after this list).

    Algorithm 2 Nesterov's Momentum GD

    procedure Nesterov($f$, $\alpha$, $\beta$, $w_0$, $\texttt{num\_epochs}$)
        initialize $w_0 \leftarrow w_0$
        initialize $v_0 \leftarrow w_0$
        for $t = 1$ to $\texttt{num\_epochs}$ do
            $v_t \leftarrow w_{t-1} - \alpha \nabla f(w_{t-1})$
            $w_t \leftarrow v_t + \beta (v_t - v_{t-1})$
        end for
        return $w_{\texttt{num\_epochs}}$
    end procedure

  6. Using an L2 regularization parameter of $\gamma = 0.0001$ and step size $\alpha = 1.0$ for 100 epochs/iterations, record the value of the parameters every iteration (i.e. once per full gradient step), for the following three algorithms and parameter settings:
  7. Evaluate the training error, test error, and training loss of each recorded model exactly, using the entire MNIST training and test dataset.
  8. Plot the resulting error and loss against the number of epochs in three figures, one for Training error, one for Test error, and one for Training Loss. How does the performance of the three methods compare?
  9. Measure the runtime of the two algorithms (Gradient descent and Nesterov's momentum) under the above parameter settings. How does the time to run an iteration compare between the two algorithms?
  10. Now test some other values of the hyperparameters $\alpha$ and $\beta$ for both algorithms. How do the best settings you can come up with compare between gradient descent and Nesterov's momentum? List at least two hyperparameter settings you tried for each of the two algorithms, and describe what you observed by looking at the training error, test error, and training loss.
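
Here is the Python sketch referenced in item 5 above: a minimal translation of Algorithm 2 intended only as a starting point. It assumes a hypothetical helper full_gradient(Xs, Ys, gamma, w) that returns the gradient of the L2-regularized training objective over the entire dataset; that name and signature are assumptions, not the interface required in main.py.

    def nesterov_gd(Xs, Ys, gamma, w0, alpha, beta, num_epochs, full_gradient):
        # Algorithm 2: gradient descent with Nesterov's momentum.
        # full_gradient(Xs, Ys, gamma, w) is assumed to return the gradient of the
        # L2-regularized training loss over the whole training set.
        v_prev = w0.copy()   # v_{t-1}
        w = w0.copy()        # w_{t-1}
        for t in range(num_epochs):
            v = w - alpha * full_gradient(Xs, Ys, gamma, w)   # gradient step from w_{t-1}
            w = v + beta * (v - v_prev)                       # momentum extrapolation
            v_prev = v
        return w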

Part 2: Momentum with SGD.

  1. Implement a function to run minibatch SGD with sequential sampling for multinomial logistic regression (with L2 regularization) on MNIST.
  2. Implement a function to run minibatch SGD with momentum and sequential sampling for multinomial logistic regression (with L2 regularization) on MNIST. Here is some pseudocode for that algorithm (a Python sketch also appears after this list).

    Algorithm 3 SGD with Momentum using Minibatching and Sequential Sampling

    procedure MomentumSGD($(f_1, \ldots, f_n)$, $\alpha$, $\beta$, $w_0$, $B$, $\texttt{num\_epochs}$)
        initialize $w \leftarrow w_0$
        initialize $v \leftarrow 0$
        for $t = 1$ to $\texttt{num\_epochs}$ do
            for $i = 0$ to $(n/B) - 1$ do
                $v \leftarrow \beta v - \alpha \cdot \frac{1}{B} \sum_{b=1}^{B} \nabla f_{i \cdot B + b}(w)$
                $w \leftarrow w + v$
            end for
        end for
        return $w$
    end procedure

  3. Using an L2 regularization parameter of $\gamma = 0.0001$, step size $\alpha = 0.2$, and minibatch size $B = 600$ for 10 epochs, record the value of the parameters every 10 iterations (i.e. ten times per epoch), for the following three algorithms and parameter settings:
  4. Evaluate the training error, test error, and training loss of each recorded model exactly, using the entire MNIST training and test dataset.
  5. Plot the resulting error and loss against the number of epochs in three figures, one for Training error, one for Test error, and one for Training Loss. How does the performance of the three methods compare?
  6. Measure the runtime of the two algorithms (SGD and SGD+Momentum) under the above parameter settings. How does the time to run an iteration compare between the two algorithms?
  7. Now test some other values of the hyperparameters $\alpha$ and $\beta$ for both algorithms. How do the best settings you can come up with compare between SGD and SGD+momentum? List at least two hyperparameter settings you tried for each of the two algorithms, and describe what you observed by looking at the training error, test error, and training loss.
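
Here is the Python sketch referenced in item 2 above: a minimal translation of Algorithm 3. As in the earlier sketch, batch_gradient(Xs, Ys, idxs, gamma, w) is a hypothetical helper assumed to return the average gradient of the L2-regularized loss over the minibatch indexed by idxs; the names and data layout are assumptions.

    import numpy as np

    def sgd_momentum_sequential(Xs, Ys, gamma, w0, alpha, beta, B, num_epochs, batch_gradient):
        # Algorithm 3: minibatch SGD with momentum and sequential sampling.
        n = Xs.shape[1]          # number of training examples (assumed stored as columns)
        w = w0.copy()
        v = np.zeros_like(w)     # momentum (velocity) term
        for t in range(num_epochs):
            for i in range(n // B):
                idxs = np.arange(i * B, (i + 1) * B)   # sequential scan, no shuffling
                v = beta * v - alpha * batch_gradient(Xs, Ys, idxs, gamma, w)
                w = w + v
        return w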

Part 3: Adam.

  1. Implement a function to run the Adam optimization algorithm (using minibatching and sequential sampling) for multinomial logistic regression (with L2 regularization) on MNIST. Here is some pseudocode for that algorithm (a Python sketch also appears after this list).

    Algorithm 4 Adam

    input: learning rate factor $\alpha$, decay rates $\rho_1$ and $\rho_2$, initial parameters $w$
    input: minibatch size $B$, small constant $\epsilon$ to avoid division by $0$ (e.g. $\epsilon = 10^{-5}$)
    initialize $r \leftarrow 0$ (a vector in $\mathbb{R}^d$)
    initialize $s \leftarrow 0$ (a vector in $\mathbb{R}^d$)
    initialize timestep $t \leftarrow 0$
    for $k = 1$ to $\texttt{num\_epochs}$ do
        for $i = 0$ to $(n/B) - 1$ do
            $t \leftarrow t + 1$
            sample a stochastic gradient $g \leftarrow \frac{1}{B} \sum_{b=1}^{B} \nabla f_{i \cdot B + b}(w)$
            accumulate first moment estimate $s_j \leftarrow \rho_1 s_j + (1 - \rho_1) g_j$ for all $j \in \{1, \ldots, d\}$
            accumulate second moment estimate $r_j \leftarrow \rho_2 r_j + (1 - \rho_2) g_j^2$ for all $j \in \{1, \ldots, d\}$
            correct first moment bias $\hat{s} \leftarrow \frac{s}{1 - \rho_1^t}$
            correct second moment bias $\hat{r} \leftarrow \frac{r}{1 - \rho_2^t}$
            update model $w_j \leftarrow w_j - \frac{\alpha}{\sqrt{\hat{r}_j + \epsilon}} \cdot \hat{s}_j$ for all $j \in \{1, \ldots, d\}$
        end for
    end for
    return $w$

  2. Using an L2 regularization parameter of $\gamma = 0.0001$ and minibatch size $B = 600$ for 10 epochs, record the value of the parameters every 10 iterations (i.e. ten times per epoch), for the following two algorithms and parameter settings:
  3. Evaluate the training error, test error, and training loss of each recorded model exactly, using the entire MNIST training and test dataset.
  4. Plot the resulting error and loss against the number of epochs in three figures, one for Training error, one for Test error, and one for Training Loss. How does the performance of the two methods compare?
  5. Measure the runtime of the two algorithms (SGD and Adam) under the above parameter settings. How does the time to run an iteration compare between the two algorithms?
  6. Now test some other values of the hyperparameters $\alpha$, $\rho_1$, and $\rho_2$ for Adam. How do the best settings you can come up with compare between SGD (from Part 2) and Adam? List at least three hyperparameter settings you tried for Adam, and describe what you observed by looking at the training error, test error, and training loss.
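
Here is the Python sketch referenced in item 1 above: a minimal translation of Algorithm 4, using the same hypothetical batch_gradient helper as the earlier sketches. Note that, following the pseudocode, $\epsilon$ is added inside the square root.

    import numpy as np

    def adam_sequential(Xs, Ys, gamma, w0, alpha, rho1, rho2, B, eps, num_epochs, batch_gradient):
        # Algorithm 4: Adam with minibatching and sequential sampling.
        # batch_gradient(Xs, Ys, idxs, gamma, w) is assumed to return the average
        # gradient of the L2-regularized loss over the minibatch indexed by idxs.
        n = Xs.shape[1]
        w = w0.copy()
        s = np.zeros_like(w)     # first moment estimate
        r = np.zeros_like(w)     # second moment estimate
        t = 0
        for k in range(num_epochs):
            for i in range(n // B):
                t += 1
                idxs = np.arange(i * B, (i + 1) * B)
                g = batch_gradient(Xs, Ys, idxs, gamma, w)
                s = rho1 * s + (1 - rho1) * g         # accumulate first moment
                r = rho2 * r + (1 - rho2) * g * g     # accumulate second moment
                s_hat = s / (1 - rho1 ** t)           # bias-corrected first moment
                r_hat = r / (1 - rho2 ** t)           # bias-corrected second moment
                w = w - alpha * s_hat / np.sqrt(r_hat + eps)   # epsilon inside the sqrt, per Algorithm 4
        return w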

What to submit:

  1. An implementation of the functions in main.py.
  2. A lab report containing:

Setup:

  1. Run pip3 install -r requirements.txt to install the required Python packages.