Programming Assignment 4: Training Neural Networks

CS4787 — Principles of Large-Scale Machine Learning — Spring 2020

Project Due: Monday, April 27, 2020 at 11:59pm

Late Policy: Up to two slip days can be used for the final submission.

Please submit all required documents to CMS.

This is a partner project. You can either work alone, or work ONLY with your chosen partner. Failure to adhere to this rule (by e.g. copying code) may result in an Academic Integrity Violation.

Overview: In this project, you will be learning how to train a deep neural network using a machine learning framework. While you've seen in the homework that differentiating even simple deep neural networks by hand can be tedious, machine learning frameworks that run backpropagation automatically make this easy. This assignment instructs you to use TensorFlow, the most popular machine learning framework at the moment, but the experience should transfer to whatever machine learning frameworks you decide to use in your own projects. These frameworks and learning methods drive machine learning systems at the largest scales, and the goal of this assignment is to give you some experience working with them so that you can build your intuition for how they work.

This assignment is designed and tested with TensorFlow 2.1.0, using Keras (the default front-end for TensorFlow 2). If you run into issues with the code, please check you have the right version of TensorFlow installed.
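A quick way to check the installed version is sketched below. It assumes the package was installed under the distribution name `tensorflow` (builds published under other names would need that name substituted); the helper `installed_version` is a hypothetical name used only for illustration.

```python
from importlib import metadata

# Version this assignment states it was designed and tested with.
EXPECTED_VERSION = "2.1.0"


def installed_version(package="tensorflow"):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("TensorFlow does not appear to be installed.")
elif found != EXPECTED_VERSION:
    print(f"Found TensorFlow {found}; the assignment was tested with {EXPECTED_VERSION}.")
else:
    print(f"TensorFlow {found} matches the tested version.")
```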

In this assignment, you are going to explore training a neural network on the MNIST dataset, the same dataset you have been working with so far in this class. MNIST is actually a relatively small dataset to use with Deep Learning, but it's a good first dataset to use to start playing around with these frameworks and learning how they work. While image datasets like MNIST usually use convolutional neural networks (CNNs), here for simplicity we'll mostly look at how a fully connected neural network performs on MNIST, since this is closest to what we've discussed in class.

Please do not wait until the last minute to do this assignment! While I have constructed it so that the programming part will not take long, actually training the networks can take some time, depending on the machine you run on. It takes my implementation about five minutes to train all the networks (without any hyperparameter optimization).

Instructions: This project is split into three parts: the training and evaluation of a fully connected neural network for MNIST, the exploration of hyperparameter optimization for this network, and the training and evaluation of a convolutional neural network (CNN), which is better suited to image classification tasks.

Part 1: Fully Connected Neural Network on MNIST.

  1. Implement a function, train_fully_connected_sgd, that uses TensorFlow and Keras to train a neural network with the following architecture. Run your function to train this network with a cross entropy loss (hint: you might find the sparse_categorical_crossentropy loss from Keras to be useful here) using stochastic gradient descent (SGD). Use the following hyperparameter settings and instructions: Your code should save the following statistics:
  2. Now modify the function you designed above in Part 1.1 (train_fully_connected_sgd) to support learning with momentum. Train your network with momentum SGD using the following hyperparameter settings and instructions: You should save the same statistics as listed above.
  3. Now implement a function, train_fully_connected_adam, that trains the same neural network using Adam. Train your network using the following hyperparameter settings and instructions: You should save the same statistics as listed above.
  4. Finally, let's explore whether batch normalization can help improve our accuracy or convergence speed here. Implement a function, train_fully_connected_bn_sgd, that trains the same neural network using momentum SGD and batch normalization. Add a batch norm layer after each linear layer in the original network. Train this network using the following hyperparameter settings and instructions: You should save the same statistics as listed above.
  5. For each of the four training algorithms you ran above, plot the following two figures: This is a total of eight figures.
  6. Report the wall-clock time used by each algorithm for training. How does the performance of the different algorithms compare? Please explain briefly.
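The general shape of Part 1 can be sketched as follows. This is only an illustration, not the reference solution: the architecture (two hidden ReLU layers of width 64), the step sizes, the momentum value, the epoch count, the validation split, and the helper names `build_fully_connected`, `make_optimizer`, and `train_and_time` are all placeholder assumptions, since the required settings are specified by the assignment itself. Random arrays stand in for MNIST so the sketch runs quickly.

```python
import time

import numpy as np
import tensorflow as tf


def build_fully_connected(use_batch_norm=False):
    """Placeholder architecture (an assumption, not the assigned one)."""
    layers = [tf.keras.Input(shape=(28, 28)), tf.keras.layers.Flatten()]
    for _ in range(2):
        layers.append(tf.keras.layers.Dense(64))
        if use_batch_norm:
            # Batch norm placed directly after the linear (Dense) layer.
            layers.append(tf.keras.layers.BatchNormalization())
        layers.append(tf.keras.layers.Activation("relu"))
    layers.append(tf.keras.layers.Dense(10))
    if use_batch_norm:
        layers.append(tf.keras.layers.BatchNormalization())
    layers.append(tf.keras.layers.Activation("softmax"))
    return tf.keras.Sequential(layers)


def make_optimizer(name, alpha=0.1, beta=0.9):
    """Return a Keras optimizer; alpha and beta are placeholder values."""
    if name == "sgd":
        return tf.keras.optimizers.SGD(learning_rate=alpha)
    if name == "momentum":
        return tf.keras.optimizers.SGD(learning_rate=alpha, momentum=beta)
    if name == "adam":
        return tf.keras.optimizers.Adam(learning_rate=0.001)
    raise ValueError(f"unknown optimizer: {name}")


def train_and_time(model, optimizer, x, y, epochs=1, batch_size=32):
    """Train the model and return (per-epoch statistics, wall-clock seconds)."""
    model.compile(
        optimizer=optimizer,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    start = time.perf_counter()
    history = model.fit(
        x, y,
        epochs=epochs,
        batch_size=batch_size,
        validation_split=0.1,  # hold out 10% of the data for validation
        verbose=0,
    )
    elapsed = time.perf_counter() - start
    return history.history, elapsed


if __name__ == "__main__":
    # Synthetic stand-in data with MNIST's shapes (not the real dataset).
    x = np.random.rand(256, 28, 28).astype("float32")
    y = np.random.randint(0, 10, size=256)
    for name in ("sgd", "momentum", "adam"):
        stats, seconds = train_and_time(
            build_fully_connected(), make_optimizer(name), x, y
        )
        print(name, f"{seconds:.2f}s", "final loss:", stats["loss"][-1])
```

The dictionary returned by `model.fit(...).history` holds one list per tracked quantity, with one entry per epoch, which is a convenient source for per-epoch plots. Timing only the `fit` call keeps model construction and data loading out of the reported wall-clock number.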

Part 2: Hyperparameter Optimization.

  1. For the SGD with momentum algorithm, use grid search to select the step size parameter from the options \(\alpha \in \{1.0, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001\} \) that maximizes the validation accuracy. Report the values of the validation accuracy and validation loss you observed for each setting of \(\alpha\), and report the \(\alpha\) that the grid search selected. Did the step size found by grid search improve over the step size given in the instructions in Part 1?
  2. Now choose any one of the four algorithms from Part 1, and choose any three hyperparameters you want to explore (e.g. the momentum, the layer width, the number of layers, et cetera). For each hyperparameter, choose a grid of possible values you think will be good to explore. Report your grid, and justify your selection. Then run grid search using the grid you selected. Report the best parameters you found, and the corresponding validation accuracy, validation loss, test accuracy, and test loss.
  3. Now use random search to explore the same space as you did above in Part 2.2. For each hyperparameter you explored, choose a distribution over possible values that covers the same range as the grid you chose for that hyperparameter in Part 2.2. Report your distribution, and justify your selection. Then run random search using the distribution you selected, running at least 10 random trials. Report the best parameters you found, and the corresponding validation accuracy, validation loss, test accuracy, and test loss.
  4. How did the performance of grid search compare to random search? How did the total amount of time spent running grid search compare to the total amount of time spent running random search?
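The difference between the two search strategies can be sketched in a few lines. To keep the example fast, `validation_accuracy` below is a made-up stand-in for "train a network with these hyperparameters and return its validation accuracy"; it is not a real training run, and the choice of momentum as the second hyperparameter, its grid, and the sampling ranges are assumptions made for illustration.

```python
import itertools
import math
import random


def validation_accuracy(alpha, beta):
    """Hypothetical score that peaks at alpha = 0.1, beta = 0.9.

    In the assignment this would train a network and evaluate it on the
    validation set instead.
    """
    return 1.0 - 0.1 * (math.log10(alpha) + 1.0) ** 2 - (beta - 0.9) ** 2


def grid_search(alphas, betas):
    """Evaluate every combination in the grid; return (best score, params)."""
    best = None
    for alpha, beta in itertools.product(alphas, betas):
        score = validation_accuracy(alpha, beta)
        if best is None or score > best[0]:
            best = (score, {"alpha": alpha, "beta": beta})
    return best


def random_search(num_trials, rng):
    """Sample hyperparameters at random; return (best score, params)."""
    best = None
    for _ in range(num_trials):
        # Log-uniform step size over [0.001, 1.0], the same range as the grid.
        alpha = 10 ** rng.uniform(-3.0, 0.0)
        beta = rng.uniform(0.5, 0.99)
        score = validation_accuracy(alpha, beta)
        if best is None or score > best[0]:
            best = (score, {"alpha": alpha, "beta": beta})
    return best


alphas = [1.0, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001]
betas = [0.5, 0.9, 0.99]  # assumed momentum grid
print("grid:  ", grid_search(alphas, betas))
print("random:", random_search(10, random.Random(0)))
```

Grid search here costs 7 x 3 = 21 evaluations while random search costs exactly the number of trials requested, which is one reason the total running times of the two can differ. Sampling the step size log-uniformly mirrors the roughly geometric spacing of the grid. In both cases the selection is made on validation accuracy; the test set would be evaluated only for the configuration finally chosen.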

Part 3: Convolutional Neural Networks.

  1. Implement a function, train_CNN_sgd, that uses TensorFlow and Keras to train a convolutional neural network with the following architecture. Run your function to train this network with a cross entropy loss (as before), using the Adam optimizer and the following hyperparameter settings and instructions: You should save the same statistics as listed above.
  2. Plot the following two figures: This is a total of two figures.
  3. Report the wall-clock time used by the algorithm for training. How does the performance compare to the performance of the fully connected network you studied in Part 1?
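For orientation, a convolutional model in Keras might be put together as in the sketch below. The layer sizes are placeholder assumptions and not the architecture required by this part, and `build_cnn` is a hypothetical helper name. The main practical difference from the fully connected case is that convolutional layers expect an explicit channel axis, so 28 x 28 images need to be reshaped to 28 x 28 x 1.

```python
import numpy as np
import tensorflow as tf


def build_cnn():
    """Small placeholder CNN (an assumption, not the assigned architecture)."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(16, kernel_size=3, activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])


model = build_cnn()
model.compile(
    optimizer=tf.keras.optimizers.Adam(),  # default Adam settings
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Stand-in batch shaped like MNIST images; add the trailing channel axis.
images = np.zeros((4, 28, 28), dtype="float32")
batch = images[..., np.newaxis]
probabilities = model.predict(batch, verbose=0)
print(probabilities.shape)
```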

Hints! You may find the following functions useful for doing this assignment.

What to submit:

  1. An implementation of the functions in main.py.
  2. A lab report containing:

Setup:

  1. Run pip3 install -r requirements.txt to install the required Python packages.