Project Due: Monday, May 3, 2021 at 11:59pm
Late Policy: Up to two slip days can be used for the final submission.
Please submit all required documents to CMS.
This is a group project. You can either work alone, or work ONLY with the other members of your group, which may contain AT MOST three people in total. Failure to adhere to this rule (by e.g. copying code) may result in an Academic Integrity Violation.
Overview: In this project, you will learn how to use Bayesian optimization to automatically tune the hyperparameters of a machine learning algorithm. This project combines the skills you've previously developed from working with TensorFlow with the technical skills you've learned in class about Bayesian optimization. Note that while Project 4 explored TensorFlow's high-level support for deep learning with Keras, here we'll see some more low-level aspects of TensorFlow as we use it to compute gradients for the inner optimization problem of Bayesian optimization.
Background: In class, we discussed the Gaussian process prior for Bayesian optimization. The Gaussian process we described in class used a prior determined by a kernel, such as the RBF kernel, for which \(K(x, x) = 1\). This implicitly assumes that if we evaluate the objective function multiple times with the same parameters \(x\), then the value will be exactly the same (i.e. the variance of the difference between different evaluations is \(0\)). It also implicitly assumes that for values of \(x\) and \(z\) that are very close together, \( K(x, z) \) becomes arbitrarily close to \(1\), and so the variance of \(f(x) - f(z)\) also becomes small, implying that \(f(x)\) and \(f(z)\) are very close together as well. More explicitly, for a zero-mean prior, \[ \begin{aligned} \mathbf{E}\left[ \left( f(x) - f(z) \right)^2 \right] &= \mathbf{E}\left[ f(x)^2 \right] - 2 \mathbf{E}\left[ f(x) f(z) \right] + \mathbf{E}\left[ f(z)^2 \right] \\ &= K(x, x) - 2 K(x, z) + K(z, z) \\ &= 2 - 2 K(x,z) \\ &= 2 - 2 \cdot \exp\left( -\gamma \| x - z \|^2 \right), \end{aligned} \] which comes arbitrarily close to zero as \( x \) approaches \( z \).

But for a randomized process like training an ML model, this is not necessarily the case, and the objective measurements \(f(x)\) can be noisy or discontinuous (which is particularly the case for metrics like error that take on discrete values). To better handle this situation, it is common to model the observations of the objective \(f(x)\) as noisy with some noise parameter \(\sigma^2\): this is discussed in more detail in the "Additive Gaussian Noise" section of this old CS4780 lecture on Gaussian processes. Using this noisy-observations assumption, the predicted value of the objective observation at a test point \(x_*\) is \[ f(x_*) \sim \mathcal{N}\left(\mathbf{k}_*^T (\Sigma + \sigma^2 I)^{-1} y, \; K(x_*, x_*) + \sigma^2 - \mathbf{k}_*^T (\Sigma + \sigma^2 I)^{-1} \mathbf{k}_*\right). \]

Using this assumption can improve the behavior of Bayesian optimization by making the matrix inverses in this expression better-behaved (nothing ensures that \(\Sigma\) is invertible, but \(\Sigma + \sigma^2 I\) certainly will be if \(\sigma^2 > 0\)). So we will be using this formulation with a nonzero \(\sigma^2\) in this assignment.
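For concreteness, here is a minimal numpy sketch of the noisy-prediction formula above. The function names and interface here are illustrative only, not the interface required later in this assignment.

import numpy as np

def rbf(x, z, gamma):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def gp_posterior(points, ys, x_star, gamma, sigma2):
    # points: previously evaluated inputs; ys: the observed objective values.
    m = len(points)
    Sigma = np.array([[rbf(points[i], points[j], gamma) for j in range(m)] for i in range(m)])
    k_star = np.array([rbf(p, x_star, gamma) for p in points])
    A = Sigma + sigma2 * np.eye(m)
    # mean = k_*^T (Sigma + sigma^2 I)^{-1} y
    mean = k_star @ np.linalg.solve(A, np.asarray(ys))
    # variance = K(x_*, x_*) + sigma^2 - k_*^T (Sigma + sigma^2 I)^{-1} k_*
    variance = rbf(x_star, x_star, gamma) + sigma2 - k_star @ np.linalg.solve(A, k_star)
    return mean, variance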
There are many ways to solve the inner optimization problem of choosing the next point \(x_*\) at which we are going to evaluate the objective. In this assignment, we are going to use the following simple heuristic. First, choose a random starting point from some distribution and run gradient descent with a fixed learning rate for some fixed number of steps. Then choose another random starting point and repeat some number of times. Finally, return the parameter vector \(x_*\) that has the smallest value of the acquisition function among all the parameter values reached at the end of these gradient descent runs. A rough sketch of this heuristic appears below.
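The sketch below illustrates this heuristic with TensorFlow computing the gradient of the acquisition value with respect to the candidate point. All names here are placeholders, not the required interface.

import tensorflow as tf

def minimize_acquisition(acq_value, random_x, gd_nruns, gd_alpha, gd_niters):
    # acq_value(x) should return a scalar (differentiable) acquisition value at x.
    best_x, best_val = None, float("inf")
    for _ in range(gd_nruns):
        x = tf.Variable(random_x(), dtype=tf.float64)    # random starting point
        for _ in range(gd_niters):
            with tf.GradientTape() as tape:
                value = acq_value(x)
            grad = tape.gradient(value, x)
            x.assign_sub(gd_alpha * grad)                # fixed-learning-rate gradient step
        final_value = float(acq_value(x))
        if final_value < best_val:                       # keep the endpoint with the smallest value
            best_x, best_val = x.numpy(), final_value
    return best_x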
Please do not wait until the last minute to do this assignment! Just as in Programming Assignment 4, actually training the models that we will be exploring can take some time, depending on the machine you run on.
Instructions: This project is split into three parts: the implementation of Bayesian optimization for general objective functions, the exploration of its performance for a synthetic objective function, and the investigation of how Bayesian optimization can be used to automatically set the hyperparameters of models you've previously explored.
Part 1: Bayesian Optimization.
Implement a function rbf_kernel_matrix that computes a Gaussian RBF kernel matrix. The function should take in two matrix inputs \( X \in \mathbb{R}^{d \times m} \) and \( Z \in \mathbb{R}^{d \times n} \) and one scalar input \( \gamma \in \mathbb{R} \), and it should return a matrix \( \Sigma \in \mathbb{R}^{m \times n} \) such that \( \Sigma_{ij} = K(X_i, Z_j) = \exp(-\gamma \cdot \| X_i - Z_j \|^2 ) \), where \( X_i \) denotes the \(i\)th column of \( X \) and similarly for \( Z_j \). You may find tf.norm, tf.matmul, and tf.math.exp to be useful; one possible approach is sketched below.
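As an unofficial sketch of one vectorized approach (your rbf_kernel_matrix may be organized differently), the pairwise squared distances can be expanded as \( \|X_i\|^2 - 2 X_i^T Z_j + \|Z_j\|^2 \):

import tensorflow as tf

def rbf_kernel_matrix_sketch(X, Z, gamma):
    # X: (d, m), Z: (d, n); returns the (m, n) kernel matrix.
    X_sq = tf.reshape(tf.norm(X, axis=0) ** 2, (-1, 1))   # (m, 1) squared column norms
    Z_sq = tf.reshape(tf.norm(Z, axis=0) ** 2, (1, -1))   # (1, n) squared column norms
    cross = tf.matmul(X, Z, transpose_a=True)             # (m, n) inner products X_i . Z_j
    sq_dists = X_sq - 2.0 * cross + Z_sq                  # ||X_i - Z_j||^2 by expansion
    return tf.math.exp(-gamma * sq_dists)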
Next, implement a function gp_prediction that computes the distribution predicted by the Gaussian process using an RBF kernel.
This function should take as input the observations you've already made, the parameter \(\gamma\) of the RBF kernel, the parameter \(\sigma^2\) of the noise as described above, and the point \(x_*\) at which we want to make a prediction.
Importantly, we will want to make predictions for many different points \(x_*\), and there is a lot of computation (e.g. the forming of the covariance matrix \(\Sigma\)) that can be shared among these many predictions.
To enable this sharing, your function gp_prediction
should use currying in the sense that it takes as input the observations already made and the kernel parameters \(\gamma\) and \(\sigma^2\), performs any precomputation that is necessary on those parameters, and then returns a second function that takes as input \(x_*\) and returns the prediction made by the GP.
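Here is a structural sketch of that currying pattern, assuming an RBF kernel matrix function like the sketch above. The math follows the prediction formula in the Background section, but the exact interface and bookkeeping are up to your implementation.

import tensorflow as tf

def gp_prediction_sketch(Xs, ys, gamma, sigma2_noise):
    # Precompute everything that does not depend on the test point x_*.
    m = Xs.shape[1]
    Sigma = rbf_kernel_matrix_sketch(Xs, Xs, gamma)            # (m, m) covariance of observations
    A = Sigma + sigma2_noise * tf.eye(m, dtype=Sigma.dtype)    # Sigma + sigma^2 I
    y_col = tf.cast(tf.reshape(ys, (-1, 1)), A.dtype)
    alpha = tf.linalg.solve(A, y_col)                          # (Sigma + sigma^2 I)^{-1} y

    def predict(x_star):
        # x_star: a single d-dimensional test point; returns the predicted mean and variance.
        x_star = tf.reshape(x_star, (-1, 1))
        k_star = rbf_kernel_matrix_sketch(Xs, x_star, gamma)   # (m, 1)
        mean = tf.squeeze(tf.matmul(k_star, alpha, transpose_a=True))
        var = (1.0 + sigma2_noise
               - tf.squeeze(tf.matmul(k_star, tf.linalg.solve(A, k_star), transpose_a=True)))
        return mean, var

    return predict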
Implement the acquisition functions, including lcb_acquisition for the lower confidence bound. Unlike the other acquisition functions, lcb_acquisition is curried: it first takes its exploration parameter as input, and then it returns a function that has the same signature as the other acquisition functions. A sketch of this currying appears below.
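For example, the currying for the lower-confidence-bound acquisition might look like the following sketch; the parameter name kappa and the sign convention are assumptions (they suppose the acquisition function is being minimized).

def lcb_acquisition_sketch(kappa):
    def acquisition(ybest, mean, stdev):
        # Lower confidence bound: prefer points with a low predicted mean
        # and/or high uncertainty.
        return mean - kappa * stdev
    return acquisition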
Finally, implement a function bayes_opt that runs the Bayesian optimization algorithm. This function should take as input:
objective: the objective function we want to optimize using Bayesian optimization.
d: the dimension of the parameter vector we are optimizing over.
gamma and sigma2_noise: the RBF kernel and noise parameters for the Gaussian process.
acquisition: the acquisition function to use. It takes three parameters (\(y_{\text{best}}\), the mean, and the standard deviation of the prediction) and returns a scalar.
random_x: a function that takes no inputs and returns a random parameter vector suitable to pass to the objective function. To be used in the initialization and minimization steps.
gd_nruns: the number of independent random initializations to use for gradient descent.
gd_alpha: the learning rate for gradient descent.
gd_niters: the number of steps of gradient descent to run.
n_warmup: the number of warmup iterations to run.
num_iters: the total number of outer iterations to run (equivalently, the total number of evaluations of the objective function to use).
You may find your gradient_descent function helpful for evaluating the inner minimization problem; one possible skeleton for bayes_opt is sketched below.
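The following skeleton is shown only to illustrate the control flow. The helpers make_gp_prediction and minimize_acquisition are hypothetical stand-ins for your own gp_prediction and inner-minimization code, and the assumption that warmup evaluations count toward num_iters is just one possible convention.

def bayes_opt_sketch(objective, d, gamma, sigma2_noise, acquisition, random_x,
                     gd_nruns, gd_alpha, gd_niters, n_warmup, num_iters):
    # d is implicit here, since random_x already produces d-dimensional vectors.
    xs, ys = [], []
    for _ in range(n_warmup):                      # warmup: evaluate at random points
        x = random_x()
        xs.append(x)
        ys.append(objective(x))
    for _ in range(num_iters - n_warmup):          # assuming warmup evaluations count toward num_iters
        y_best = min(ys)
        predict = make_gp_prediction(xs, ys, gamma, sigma2_noise)   # curried GP posterior
        def acq_of_x(x):
            mean, variance = predict(x)
            return acquisition(y_best, mean, variance ** 0.5)
        x_next = minimize_acquisition(acq_of_x, random_x, gd_nruns, gd_alpha, gd_niters)
        xs.append(x_next)
        ys.append(objective(x_next))
    i_best = min(range(len(ys)), key=lambda i: ys[i])
    return xs[i_best], ys[i_best], xs, ys          # best point found, plus the full history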
Part 2: Evaluating Bayesian Optimization on a Synthetic Objective.
First, test your implementation of Bayesian optimization on the synthetic objective function test_objective,
\[ f(x) = \cos(8x) - 0.3 + (x - 0.5)^2. \]
Use the parameters specified for this part; for the random_x argument, you can use a function built on numpy's rand function. An illustrative example call is sketched below.
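For example, a call for this part might look like the sketch below. The specific numeric settings are placeholders chosen for illustration (not the required values), and bayes_opt_sketch and lcb_acquisition_sketch refer to the earlier sketches rather than your actual functions.

import numpy as np

random_x = lambda: np.random.rand(1)      # uniform random starting point in [0, 1]
acq = lcb_acquisition_sketch(2.0)         # illustrative exploration parameter

x_best, y_best, xs, ys = bayes_opt_sketch(
    test_objective, d=1, gamma=10.0, sigma2_noise=0.001, acquisition=acq,
    random_x=random_x, gd_nruns=10, gd_alpha=0.05, gd_niters=20,
    n_warmup=3, num_iters=20)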
Next, use the animate_predictions
function to visualize what your Gaussian process is predicting at each step of Bayesian optimization. (If you are running on the VM, you may need to install the library ffmpeg
by running sudo apt-get update
and sudo apt-get install ffmpeg
to get this to work.) This function should take in the outputs of your Bayesian optimization function, and generate an animation. Here's an example of the output of this function, plotted for the PI acquisition function.
In this video, the blue curve represents the true objective function, the red curve represents the mean predicted by Bayesian optimization, the light blue region is a \(2\sigma\)-confidence interval about the mean, and the blue dots are the parameters explored by Bayesian optimization. Briefly report what you observed in this video, and include the video in your submission. (Warning: the video may take some time to render, so start early!)
When saving the animation produced by your animate_predictions function, be sure to include a .mp4 extension or another supported video extension in the filename. If you don't include an appropriate extension, the animation library may throw a cryptic error.

Part 3: Bayesian Optimization for Learning Hyperparameters.
In this part, we will use Bayesian optimization to set the hyperparameters of a classifier trained on MNIST with SGD+Momentum. The function load_MNIST_dataset_with_validation_split splits the MNIST training set into a new training set of 50000 examples and a validation set of 10000 examples.
The hyperparameters we want to set are \(\alpha\), the learning rate, \(\beta\), the momentum parameter, and \(\gamma\), the \(\ell_2\)-regularization constant (not to be confused with the RBF kernel hyperparameter).
Since these hyperparameters are all of different scales, we need to rescale them somehow to be on a similar scale.
When you're using Bayesian optimization in practice, this is something that you'll need to do based on your intuition about the problem, but for this assignment I've suggested the following parameterization.
Let \(x \in \mathbb{R}^3\) be the parameter optimized by Bayesian optimization, and define
\[
\gamma = 10^{-8 \cdot x_1}, \;
\alpha = 0.5 \cdot x_2, \;
\beta = x_3.
\]
This scales all the parameters such that randomly sampling each coordinate of \(x\) from \([0,1]\) will result in a reasonable setting of the hyperparameters.
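As a small sketch of this rescaling (the function name is just illustrative), one might write:

def hyperparameters_from_x(x):
    # x is the 3-dimensional vector searched by Bayesian optimization, with each
    # coordinate in [0, 1]; x[0] corresponds to x_1, x[1] to x_2, and x[2] to x_3.
    gamma_reg = 10.0 ** (-8.0 * x[0])   # l2-regularization constant
    alpha = 0.5 * x[1]                  # learning rate
    beta = x[2]                         # momentum parameter
    return gamma_reg, alpha, beta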
Implement the function mnist_sgd_mss_with_momentum
, which provides an objective function for Bayesian optimization.
Specifically, this function trains the classifier using SGD+Momentum with the specified parameters, and then returns the validation error minus \(0.9\). (This is the validation error minus what we expect the validation error would be for random guessing).
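For the timing measurement described in the next paragraph, one simple approach (a sketch; the wrapper here is an illustrative helper, not part of the assignment's API) is to wrap the objective so that it accumulates the time spent inside mnist_sgd_mss_with_momentum and compare that against the total running time of Bayesian optimization.

import time

def timed_objective(objective):
    # Returns a wrapped objective plus a mutable counter of seconds spent inside it.
    counter = {"seconds": 0.0}
    def wrapped(x):
        start = time.time()
        value = objective(x)
        counter["seconds"] += time.time() - start
        return value
    return wrapped, counter

# Illustrative usage:
# wrapped, counter = timed_objective(mnist_objective)
# start = time.time()
# bayes_opt(wrapped, ...)                  # run Bayesian optimization as usual
# fraction = counter["seconds"] / (time.time() - start)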
Run Bayesian optimization with this objective to tune the hyperparameters. As in Part 2, you can use a function built on numpy's rand function as the random_x argument. Then measure what fraction of the total running time is spent inside calls to mnist_sgd_mss_with_momentum (as opposed to the other computations internal to Bayesian optimization). Report your methodology, your observations, and your conclusions about what the fraction is. What does this say about the overhead of Bayesian optimization for this task?

What to submit:
Setup:
Run pip3 install -r requirements.txt to install the required Python packages.