Programming Assignment 1: ERM, Gradient Descent, and Subsampling

CS4787 — Principles of Large-Scale Machine Learning — Spring 2019

Project Due: February 17, 2019 at 11:59pm

Late Policy: Up to two slip days can be used for the final submission.

Please submit all required documents to CMS.

This is a partner project. You can either work alone, or work ONLY with your chosen partner. Failure to adhere to this rule (by e.g. copying code) may result in an Academic Integrity Violation.

Overview: In this project, you will be implementing and evaluating gradient descent for empirical risk minimization on a multinomial logistic regression task on the MNIST dataset. We'll start by reviewing multinomial logistic regression as a method. Consider a multi-class classification task with $c$ classes in which the predictions $\hat y$ are distributions over the $c$ classes. That is, \[ \hat y \in \R^c, \hspace{2em} \hat y_j \ge 0 \text{ for all } j \in \{1, \ldots, c\} \hspace{2em} \text{and} \hspace{2em} \sum_{j=1}^c \hat y = 1. \] The example labels $y$ are one-hot vectors identifying a single class. That is, $y \in \R^c$, each $y_j \in \{0, 1\}$, and there exists exactly one $j \in \{1, \ldots, c\}$ such that $y_j = 1$. The examples themselves are vectors $x \in \R^d$. Multinomial logistic regression uses a linear model hypothesis with a softmax function. Specifically, given a parameter matrix $W \in \R^{c \times d}$, the hypothesis is \[ h_W(x) = \operatorname{softmax}(Wx) \] where the softmax function maps from $\R^c$ to $\R^c$ and is defined by \[ \left( \operatorname{softmax}(u) \right)_j = \frac{\exp(u_j)}{\sum_{k=1}^c \exp(u_k)}. \] We want to perform empirical risk minimization using the cross entropy loss function \[ L(\hat y, y) = -\sum_{j=1}^c y_j \cdot \log(\hat y_j), \] where here $\log$ refers to the natural logarithm, and using $\ell_2$ regularization. Our regularized empirical risk for some dataset $\mathcal{D}$, using regularization parameter $\gamma > 0$ is defined as \[ R(h_W) = \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} L(h_W(x), y) + \frac{\gamma}{2} \| W \|_F^2, \] where $\| W \|_F$ denotes the Frobenius norm (also known as Euclidean norm) is just the square root of the sum of the squares of the entries of $W$. \[ \| W \|_F=\sqrt(\sum_{i=1}^m \sum_{j=1}^n |w_{i,j}|^2 ) \]

Part 1: The gradient of the loss function. A key part of gradient descent is computing the gradient of the loss. For this problem, the gradient of the (unregularized) loss for a single example with respect to the parameters $W$ is \[ \nabla_W L(h_W(x), y) = \left( \operatorname{softmax}(Wx) - y \right) x^T. \]

Derive an expression for the gradient of the total loss with regularization over a whole dataset with respect to the weights $W$. Also write out an expression for a single update step of gradient descent. Include these expressions (and the derivation) in your report.

Implement a function multinomial_logreg_loss_i to compute the loss of a single example $(x,y)$ given parameters $W$.

Implement a function multinomial_logreg_grad_i to compute the gradient of a single example $(x,y)$ given parameters $W$.

One quick and rough way to check if your gradient function is actually the derivative of your loss function is to use the loss function to compute a numerical approximation of the derivative, and compare it to the value the gradient function produces. Recall that, for a function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ and vectors $x$ and $v$, \[ v^T \nabla f(x) = \lim_{\eta \rightarrow 0} \frac{f(x + \eta v) - f(x)}{\eta}, \] so for sufficiently small $\eta$, \[ v^T \nabla f(x) \approx \frac{f(x + \eta v) - f(x)}{\eta}. \] How might you use this formula to give a rough test of whether your gradient and loss functions are correct? Write code to run the test you developed, using the function test_gradient. test_gradient is only meant to be self-contained i.e. don't integrate it with the rest of the steps. The functions will run through the autograder to determine if you are checking accurately, this could consist of print statements or any other way to show that the two values are approximately the same. Include the results of your test and an explanation in your report.

Part 2: Gradient descent.

Implement the function gradient_descent, the gradient descent algorithm. Your implementation should use the function multinomial_logreg_total_grad which computes the total gradient of the loss function for the whole dataset, using your implementation of multinomial_logreg_grad_i.

Run the gradient descent algorithm with L2 regularization parameter $\gamma = 0.0001$ and step size $\alpha = 1.0$ for 10 iterations, and record the value of the parameters every 10 iterations (i.e. only at the beginning and end). Time how long it took to run the algorithm. How long do you expect it would take to run 1000 iterations with this method? Report these results in your report.

Part 3: Computing more efficiently

Re-implement the functions multinomial_logreg_total_loss and multinomial_logreg_total_grad to use NumPy matrix arithmetic rather than Python for-loops to express computations of the loss and gradients of the whole dataset more efficiently. You should be able to implement this using no Python for-loops at all.

Re-run the gradient descent algorithm with the same parameters as in Part 2, for 10 epochs as before. Time how long it took to run the algorithm. How does this compare to your measurements from Part 2? Which approach is faster? Can you explain why? Report these results in your report.

Part 4: Evaluating gradient descent

Implement a function multinomial_logreg_error to compute the error of the classifier given parameters $W$.

Run your gradient descent algorithm with L2 regularization parameter $\gamma = 0.0001$ and step size $\alpha = 1.0$ for 1000 iterations, and record the value of the parameters every 10 iterations (i.e. a total of 101 records).

Evaluate the training error, test error, training loss, and test loss of each recorded model exactly, using the entire MNIST training or test dataset. Plot the resulting error and loss with your results in part in four figures, one for Training error, one for Test error, one for Training loss, and one for Test loss. Here, the $x$-axis of your figures should report iterations.

Now write a function estimate_multinomial_logreg_error to estimate the error of the model using subsampling. Estimate the training error and test error of each recorded model using a random subsample of the dataset (of sizes 100 and 1000). How do these results compare to the exact error? What subsample size do you think you need to get a good sense of what the error is? Explain. Report these results in your report.

Measure the runtime of these error evaluations, comparing the exact (full dataset) error evaluating with your subsampled error evaluation method (at sizes 100 and 1000). Which one seems to be faster? Can you explain why? Report these results in your report.

What to submit:

  1. An implementation of the functions in main.py.
  2. A lab report containing:

Setup:

  1. Run pip3 install -r requirements.txt to install the required python packages