Programming Assignment 2: Scaling with SGD

CS4787 — Principles of Large-Scale Machine Learning — Spring 2021

Project Due: March 22, 2021 at 11:59pm

Late Policy: Up to two slip days can be used for the final submission.

Please submit all required documents to CMS.

This is a group project. You can either work alone, or work ONLY with the other members of your group, which may contain AT MOST three people in total. Failure to adhere to this rule (by e.g. copying code) may result in an Academic Integrity Violation.

Overview: In this project, you will be implementing and evaluating stochastic gradient descent on the same MNIST task as in Programming Assignment 1. You will also be evaluating the effects of different learning rates and batch sizes, as well as exploring empirically the impact of using different sampling schemes for stochastic gradient descent.

Background: So far in class, we've been talking about stochastic gradient descent implemented as the following

\begin{algorithm} \caption{Stochastic Gradient Descent with Fixed $\alpha$ and Random Sampling} \begin{algorithmic} \PROCEDURE{SGD}{$(f_1, \ldots, f_n), \alpha, w_0, T$} \STATE \textbf{initialize} $w \leftarrow w_0$ \FOR{$t = 1$ \TO $T$} \STATE \textbf{sample } $\tilde i$ \textbf{ uniformly at random from } $\{1, \ldots, n\}$ \STATE $w \leftarrow w - \alpha \cdot \nabla f_{\tilde i}(w)$ \ENDFOR \RETURN $w$ \ENDPROCEDURE \end{algorithmic} \end{algorithm}

(Possibly we may want to use some averaging or random sampling of the iterates, but we'll ignore that for now.) One problem with Algorithm 1 is that since the training examples are chosen at random, they are unpredictable (or to use a more technical term, they have poor memory locality). This can make our results difficult to replicate, which can be bad when we want to scale our systems and test them automatically. Sampling at random can also be bad for the memory subsystem of the processor we are running on, especially if it depends on caches that try to predict which data to prefetch and make available for quick access (e.g. a CPU). We will discuss this more in detail later in the course, when we talk about hardware for machine learning. But for the moment, one (naive) way to address this problem is to just sample the training examples in a predictable order. And what order is more predictable than the order in which the training examples appear in memory? This motivates the development of the following sequential-scan-order version of SGD.

\begin{algorithm} \caption{Stochastic Gradient Descent with Fixed $\alpha$ and Sequential Sampling} \begin{algorithmic} \PROCEDURE{SGD}{$(f_1, \ldots, f_n), \alpha, w_0, \texttt{num\_epochs}$} \STATE \textbf{initialize} $w \leftarrow w_0$ \FOR{$t = 1$ \TO $\texttt{num\_epochs}$} \FOR{$i = 1$ \TO $n$} \STATE $w \leftarrow w - \alpha \cdot \nabla f_i(w)$ \ENDFOR \ENDFOR \RETURN $w$ \ENDPROCEDURE \end{algorithmic} \end{algorithm}

Here, \(\texttt{num\_epochs}\) denotes the number of epochs, or passes through the dataset, used in training. The total number of iterations here will be \(T = \texttt{num\_epochs} \cdot n \). Algorithm 2 can run individual iterations faster than Algorithm 1 because its memory accesses are more predictable. (Although note that our convergence proofs from class do not apply to this modified sequential-sampling algorithm, and more care is needed to get convergence guarantees here.)

We can do the same thing with minibatching. The original version of minibatched SGD we discussed in class uses a random with-replacement subsample of size \(B\) as follows:

\begin{algorithm} \caption{Minibatch SGD with Fixed $\alpha$ and Random Sampling} \begin{algorithmic} \PROCEDURE{SGD}{$(f_1, \ldots, f_n), \alpha, w_0, B, T$} \STATE \textbf{initialize} $w \leftarrow w_0$ \FOR{$t = 1$ \TO $T$} \FOR{$b = 1$ \TO $B$} \STATE \textbf{sample } $\tilde i_b$ \textbf{ uniformly at random from } $\{1, \ldots, n\}$ \ENDFOR \STATE $w \leftarrow w - \alpha \cdot \frac{1}{B} \sum_{b = 1}^B \nabla f_{\tilde i_b}(w)$ \ENDFOR \RETURN $w$ \ENDPROCEDURE \end{algorithmic} \end{algorithm}

In order to make the accesses here more predictable, we can do the same thing as we did to baseline SGD, provided that the minibatch size \(B\) evenly divides the training set size \(n\).

\begin{algorithm} \caption{Minibatch SGD with Fixed $\alpha$ and Sequential Sampling} \begin{algorithmic} \PROCEDURE{SGD}{$(f_1, \ldots, f_n), \alpha, w_0, B, \texttt{num\_epochs}$} \STATE \textbf{initialize} $w \leftarrow w_0$ \FOR{$t = 1$ \TO $\texttt{num\_epochs}$} \FOR{$i = 0$ \TO $(n / B) - 1$} \STATE $w \leftarrow w - \alpha \cdot \frac{1}{B} \sum_{b = 1}^B \nabla f_{i \cdot B + b}(w)$ \ENDFOR \ENDFOR \RETURN $w$ \ENDPROCEDURE \end{algorithmic} \end{algorithm}

Once again, \(\texttt{num\_epochs}\) denotes the number of epochs, or passes through the dataset, used in training. The total number of iterations here will be \(T = \texttt{num\_epochs} \cdot n / B\), and we assume that \(B\) divides \(n\) evenly. In this programming assignment, we will implement, and explore the performance of all of these algorithms, as well as some variants that use a different step size at each iteration.

Instructions: This project is split into three parts: the implementation of the basic functionality of SGD, the exploration of the effects of the hyperparameters (the step size and minibatch size), and evaluation of the systems cost of running SGD.

Part 1: Implementation.

  1. Implement a function to compute the gradient of a batch of training examples for multinomial logistic regression (with l2 regularization) on MNIST.
  2. Implement the four stochastic gradient descent algorithms described above in the Background section.
  3. Run both non-minibatched SGD algorithms (Algorithms 1 and 2) with L2 regularization parameter \(\gamma = 0.0001\) and step size \(\alpha = 0.001\) for 10 epochs. Set the monitor period to 6000 i.e. record the value of the parameters every 6000 update iterations per epoch; ex: n=60000, monitor_period=6000 results in monitoring 10 times per epoch.
  4. Run both minibatched SGD algorithms (Algorithms 3 and 4) with L2 regularization parameter \(\gamma = 0.0001\), step size \(\alpha = 0.05\), and minibatch size \(B = 60\) for 10 epochs, and record the value of the parameters every 100 batch update iterations (i.e. 10 times per epoch).
  5. Evaluate the training error and test error of each recorded model exactly, using the entire MNIST training and test dataset. (You can use your code from Programming Assignment 1 to do this.)
  6. Plot the resulting error against the number of epochs in two figures, one for Training error and one for Test error. How does the performance of the four methods compare?

Part 2: Exploration.

  1. For stochastic gradient descent (Algorithm 1) evaluate some other values of the step size other than the one given to you in Part 1.
  2. For SGD (Algorithm 1), find a value of the step size that allows the algorithm to reach a lower training error after 10 epochs than the step size given to you in Part 1. What effect does this changed step size have on the final test error?
  3. For SGD (Algorithm 1), can you find a value of the step size that allows the algorithm to reach the same (or better) training error after 5 epochs than could be achieved in 10 epochs using the step size given to you in Part 1? (This may be the same step size that you used in the previous step.) What effect does this changed step size have on the final test error after 5 epochs?
  4. Now do the same thing for minibatch SGD with sequential scan (Algorithm 4). Can you find a value for the batch size and learning rate that allows the algorithm to reach the same (or better) training error after 5 epochs than could be achieved in 10 epochs using the step size given to you in Part 1? What effect does this changed step size have on the final test error after 5 epochs?
  5. For at least three different algorithm configurations you explored in this Part, plot the resulting error against the number of epochs in two figures, one for Training error and one for Test error, just as you did for the evaluation in Part 1. You should be plotting the optimal step size and/or batch size for a given algorithm against eachother.

Part 3: Systems Evaluation.

  1. Measure the runtime of all four algorithms using the configurations prescribed in Part 1. How does the runtime of the algorithms compare?
  2. You should observe that some of the algorithms run an epoch in substantially less time than the other algorithms do. Can you explain why?

What to submit:

  1. An implementation of the functions in main.py.
  2. A lab report containing:

Setup:

  1. Run pip3 install -r requirements.txt to install the required python packages