$\newcommand{\R}{\mathbb{R}}$ $\newcommand{\norm}[1]{\left\|#1\right\|}$ $\newcommand{\Exv}[1]{\mathbf{E}\left[#1\right]}$ $\newcommand{\Prob}[1]{\mathbf{P}\left(#1\right)}$ $\newcommand{\Var}[1]{\operatorname{Var}\left(#1\right)}$ $\newcommand{\Abs}[1]{\left|#1\right|}$
import numpy
import scipy
import matplotlib
from matplotlib import pyplot
import time
matplotlib.rcParams.update({'font.size': 14, 'figure.figsize': (6.0, 6.0)})
We're still grading the prelim. The solutions have been released on Canvas.
Second pset with late days due tonight.
Second programming assignment out tonight.
Strongly convex optimization problems are poorly conditioned when the condition number $\kappa$ is large.
We said last time that intuitively, we’d like to set the step size larger for directions with less curvature, and smaller for directions with more curvature.
Last time: momentum
Today we'll talk about two other methods for addressing the issue of conditioning: preconditioning, and adaptive learning rates.
Motivation: One way to think about a large condition number is graphically/geometrically.
The "level sets" of a function $f: \R^d \rightarrow \R$ are the sets $\{ w \mid f(w) = C \}$ at which $f$ takes on one particular value. They are one way to visualize the curvature of the function.
Let's look at our poorly conditioned optimization problem example from last time, the two-dimensional quadratic
$$f(w) = f(w_1, w_2) = \frac{L}{2} w_1^2 + \frac{\mu}{2} w_2^2 = \frac{1}{2} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}^T \begin{bmatrix} L & 0 \\ 0 & \mu \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}.$$Here, the level sets of $f$ are the sets on which $f(w) = C$ for some constant $C$, which take the form
$$2C = L w_1^2 + \mu w_2^2.$$What is the geometric shape this equation represents?
These are ellipses! Let's plot some. We can parameterize this with $w_1 = \sqrt{\frac{2C}{L}} \cdot \cos(\theta)$ and $w_2 = \sqrt{\frac{2C}{\mu}} \cdot \sin(\theta)$.
def plot_levelsets(L, mu):
    # trace each level set f(w) = C as a parameterized ellipse
    t = numpy.linspace(0, 2*numpy.pi, 128)
    for C in (((L + mu)/2) * numpy.linspace(0.1,1.9,8)**2):
        x = numpy.cos(t) / numpy.sqrt(L / (2*C))   # w_1 = sqrt(2C/L) * cos(theta)
        y = numpy.sin(t) / numpy.sqrt(mu / (2*C))  # w_2 = sqrt(2C/mu) * sin(theta)
        pyplot.plot(x, y, label=f"C = {C:.3f}")
    pyplot.axis("square")
    pyplot.xlim([-2,2])
    pyplot.ylim([-2,2])
    pyplot.title(f"Level Sets of $f(w_1, w_2) = \\frac{{{L}}}{{2}} w_1^2 + \\frac{{{mu}}}{{2}} w_2^2$")
    pyplot.xlabel("$w_1$")
    pyplot.ylabel("$w_2$")
    pyplot.legend(bbox_to_anchor=(1.04,1), loc="upper left")
plot_levelsets(100, 1)
plot_levelsets(1, 1)
The learning problem we are solving is invariant to shifts and scales of the weight space. That is, if I scale the weight space, solve the problem in the scaled space, and then scale it back, I'll have the solution to the original problem. Idea: Rescale the underlying space we're optimizing over to make the level sets more like circles.
To do this for an objective function $f$, let's imagine solving the modified optimization problem where we minimize $g(u) = f(R u)$ for some fixed matrix $R$.
First, we need to find the gradient of $g$. Observe that
$$g(u + \eta v) = f(R (u + \eta v)) = f(Ru + \eta Rv)$$so
$$\left. \frac{d}{d \eta} g(u + \eta v) \; \right|_{\eta = 0} = \left. \frac{d}{d \eta} f(Ru + \eta Rv) \; \right|_{\eta = 0} = (Rv)^T \nabla f(Ru) = v^T R^T \nabla f(Ru)$$which lets us conclude that $\nabla g(u) = R^T \nabla f(Ru)$, so gradient descent will look like
$$u_{t+1} = u_t - \alpha \nabla g(u_t) = u_t - \alpha R^T \nabla f(R u_t).$$If we multiply both sides by $R$, and let $w_t$ denote the corresponding value to $u_t$ in the original weight space, i.e. $w_t = R u_t$, we get
$$w_{t+1} = R u_{t+1} = R u_t - \alpha R R^T \nabla f(R u_t) = w_t - \alpha R R^T \nabla f(w_t).$$Effectively, we're just running gradient descent with gradients scaled by some positive semidefinite matrix $P = R R^T$ (why is it positive semidefinite?). This method is called preconditioned gradient descent, and we can apply the same idea to precondition SGD.
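To see this update in action, here is a minimal sketch of preconditioned gradient descent (the function name precond_gd, the starting point, and the step size are my own illustrative choices, not from the notes); passing $P = I$ recovers ordinary gradient descent on the quadratic above.

def precond_gd(grad_f, w0, P, alpha, num_iters):
    # runs w_{t+1} = w_t - alpha * P @ grad_f(w_t) and records the trajectory
    w = numpy.array(w0, dtype=float)
    trajectory = [w.copy()]
    for _ in range(num_iters):
        w = w - alpha * (P @ grad_f(w))
        trajectory.append(w.copy())
    return numpy.array(trajectory)

# gradient of the quadratic f(w) = (L/2) w_1^2 + (mu/2) w_2^2
L_, mu_ = 100.0, 1.0
grad_quadratic = lambda w: numpy.array([L_ * w[0], mu_ * w[1]])

# P = I gives ordinary (unpreconditioned) gradient descent
traj = precond_gd(grad_quadratic, [1.0, 1.0], numpy.eye(2), alpha=0.01, num_iters=50)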
Activity: What would be the best preconditioner $P$ to choose for our two-dimensional quadratic example of $f$ above? Is the best preconditioner unique?
$P$ should be the inverse of the Hessian,
$$P = \begin{bmatrix} \frac{1}{L} & 0 \\ 0 & \frac{1}{\mu} \end{bmatrix},$$which gives the update
$$w_{t+1} = w_t - \alpha \begin{bmatrix} \frac{1}{L} & 0 \\ 0 & \frac{1}{\mu} \end{bmatrix} \begin{bmatrix} L & 0 \\ 0 & \mu \end{bmatrix} w_t = w_t - \alpha w_t.$$The best preconditioner is not unique: any positive scalar multiple of this $P$ works equally well, since the scalar can be absorbed into the step size $\alpha$.
Suppose that we (naively) run the preconditioned gradient descent update
$$w_{t+1} = w_t - \alpha P \nabla f(w_t),$$where $w_t \in \R^d$ and $P \in \R^{d \times d}$.
How much memory is needed to store $P$? How much compute time is needed to multiply by $P$ in the preconditioned GD update step?
Memory: $O(d^2)$
Compute: $O(d^2)$
This can be very expensive, especially when the model size is large!
One common way to address this is to use a diagonal preconditioner: we restrict $P$ to be a diagonal matrix.
When we restrict $P$ to be diagonal, how much memory is needed to store $P$? How much compute time is needed to multiply by $P$ in the preconditioned GD update step?
Memory: $O(d)$
Compute: $O(d)$
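This works because a diagonal $P$ can be stored as a length-$d$ vector and applied with an elementwise product. A minimal sketch of one such step (the function and argument names are my own illustrative choices):

def diag_precond_gd_step(w, grad, p, alpha):
    # p holds the diagonal of P; the elementwise product p * grad equals P @ grad
    # when P = diag(p), but costs O(d) time and O(d) memory instead of O(d^2)
    return w - alpha * (p * grad)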
Our setup so far is all very interesting, but it requires us to pull a matrix $P$ from somewhere. How do we pick a useful $P$?
Recall: we said last time that intuitively, we’d like to set the step size larger for directions with less curvature, and smaller for directions with more curvature.
Idea of adaptive learning rates: adjust the learning rate per-component dynamically at each step of GD/SGD so that we can have a different step size in different directions.
Adjust the per-component step size based on observed statistics of the gradients.
Explicitly, in maths notation,
$$w_{j,t+1} = w_{j,t} - \alpha_{j,t} \left( \nabla f_{\tilde i_t}(w_t) \right)_j$$where $w_{j,t}$ denotes the $j$th entry of the model at time $t$, and the learning rate $\alpha_{j,t}$ is now (1) allowed to be different per-parameter, and (2) allowed to vary as a function of the previously observed gradient samples.
AdaGrad sets the step size for each parameter to be inversely proportional to the square root of the sum of all observed squared values of that component of the gradient.
Algorithm AdaGrad
Input: global learning rate factor $\alpha$, initial parameters $w \in \R^d$.
Initialize $r \leftarrow 0 \in \R^d$
loop
$\quad$ sample a stochastic gradient $g \leftarrow \nabla f_{\tilde i}(w)$
$\quad$ $r_j \leftarrow r_j + g_j^2$ for all $j \in \{1, \ldots, d\}$
$\quad$ $w_j \leftarrow w_j - \dfrac{\alpha}{\sqrt{r_j}} \cdot g_j$ for all $j \in \{1, \ldots, d\}$
end loop
Here $w_j$ and $r_j$ denote the $j$th element of the vectors. (Typically we also need to add some small correction factor to avoid dividing by zero in this expression if $r_j$ is zero.)
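A minimal NumPy sketch of this update (the stoch_grad interface, the default eps, and the function name are my own assumptions, not a reference implementation):

def adagrad(stoch_grad, w0, alpha, num_iters, eps=1e-8):
    # stoch_grad(w) should return a sampled gradient of the objective at w
    w = numpy.array(w0, dtype=float)
    r = numpy.zeros_like(w)                      # running sum of squared gradients
    for _ in range(num_iters):
        g = stoch_grad(w)
        r += g**2                                # accumulate per-component squared gradients
        w -= alpha * g / (numpy.sqrt(r) + eps)   # per-component step size alpha / sqrt(r_j)
    return w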
The motivation behind AdaGrad: a component that has consistently seen large gradients (typically a high-curvature direction) gets a smaller effective step size $\alpha / \sqrt{r_j}$, while a component that has seen only small gradients gets a larger one. This is exactly the behavior we said we wanted: smaller steps where there is more curvature, larger steps where there is less.
One problem with AdaGrad: it does not necessarily work well in the non-convex setting. Since $r$ accumulates squared gradients over the entire history of the run, the effective step size can only shrink; large gradients observed early in training continue to suppress the step size long after the iterate has moved to a very different region of the loss landscape. This could lead to a step size that is very small in some directions in which we don't want the step size to be small.
RMSProp is a modification of AdaGrad that uses an exponential moving average instead of a sum. This can be more effective for non-convex optimization problems such as neural networks.
Algorithm RMSProp
Input: global learning rate factor $\alpha$, decay rate $\rho$, initial parameters $w \in \R^d$.
Initialize $r \leftarrow 0 \in \R^d$
loop
$\quad$ sample a stochastic gradient $g \leftarrow \nabla f_{\tilde i}(w)$
$\quad$ $r_j \leftarrow \rho r_j + (1 - \rho) g_j^2$ for all $j \in \{1, \ldots, d\}$
$\quad$ $w_j \leftarrow w_j - \dfrac{\alpha}{\sqrt{r_j}} \cdot g_j$ for all $j \in \{1, \ldots, d\}$
end loop
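As with AdaGrad, a minimal NumPy sketch (the interface and the default eps are my own assumptions):

def rmsprop(stoch_grad, w0, alpha, rho, num_iters, eps=1e-8):
    w = numpy.array(w0, dtype=float)
    r = numpy.zeros_like(w)                      # exponential moving average of squared gradients
    for _ in range(num_iters):
        g = stoch_grad(w)
        r = rho * r + (1 - rho) * g**2           # decaying average replaces AdaGrad's running sum
        w -= alpha * g / (numpy.sqrt(r) + eps)
    return w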
RMSProp estimates the second moment of the gradients by using an exponential moving average. This is an update that works by initializing $r$ to $0$ and then updating
$$r \leftarrow \rho r + (1 - \rho) x$$to average a sequence of $x$. If we let $r_t$ and $x_t$ now denote the values of these quantities at step $t$, we can write this more formally as
$$r_t = \rho r_{t-1} + (1 - \rho) x_t.$$Problem: this is not necessarily an unbiased estimator of the sequence $x$. (By "unbiased" here I mean, if all the $x_t$ are independent and identically distributed, $r_t$ should have the same mean as $x_t$.) E.g. at the first timestep,
$$r_1 = (1 - \rho) x_1$$but for it to be unbiased, we'd want $r_1 = x_1$.
How can we correct the bias?
One approach: if all the input $x$ were the same, what would $r_t$ look like?
\begin{align*} r_t &= \rho r_{t-1} + (1 - \rho) x \\&= \rho \left( \rho r_{t-2} + (1 - \rho) x \right) + (1 - \rho) x \\&= \rho^2 \left( \rho r_{t-3} + (1 - \rho) x \right) + \rho (1 - \rho) x + (1 - \rho) x \\&= \rho^3 \left( \rho r_{t-4} + (1 - \rho) x \right) + \rho^2 (1 - \rho) x + \rho (1 - \rho) x + (1 - \rho) x \\&= \sum_{i = 0}^{t-1} \rho^{i} (1 - \rho) x \\&= (1 - \rho) x \cdot \frac{1 - \rho^t}{1 - \rho} \\&= x \cdot (1 - \rho^t). \end{align*}So we can correct for the bias by dividing by $1 - \rho^t$.
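A quick numerical check of this correction (the i.i.d. Gaussian stream with mean 5 and the choice $\rho = 0.9$ are my own illustrative assumptions): without the correction the early averages are badly biased toward $0$, and dividing by $1 - \rho^t$ fixes this.

rho = 0.9
numpy.random.seed(0)
x = 5.0 + numpy.random.randn(5, 100000)      # i.i.d. samples with mean 5
r = numpy.zeros(100000)
for t in range(1, 6):
    r = rho * r + (1 - rho) * x[t-1]
    print(t, r.mean(), (r / (1 - rho**t)).mean())   # biased vs. bias-corrected estimate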
Adam modifies RMSProp to (1) use momentum with exponential weighting, and (2) correct for bias when estimating the first-order and second-order moments of the gradients. It is a very popular learning algorithm, with over 67,000 citations!
Algorithm Adam
Input: global learning rate factor $\alpha$, decay rates $\rho_1$ and $\rho_2$, initial parameters $w \in \R^d$.
Initialize timestep $t \leftarrow 0$, first-moment estimate $s \leftarrow 0 \in \R^d$, and second-moment estimate $r \leftarrow 0 \in \R^d$
loop
$\quad$ $t \leftarrow t + 1$
$\quad$ sample a stochastic gradient $g \leftarrow \nabla f_{\tilde i}(w)$
$\quad$ $s \leftarrow \rho_1 s + (1 - \rho_1) g$
$\quad$ $r_j \leftarrow \rho_2 r_j + (1 - \rho_2) g_j^2$ for all $j$
$\quad$ bias-corrected estimates: $\hat s \leftarrow s / (1 - \rho_1^t)$ and $\hat r_j \leftarrow r_j / (1 - \rho_2^t)$
$\quad$ $w_j \leftarrow w_j - \dfrac{\alpha}{\sqrt{\hat r_j}} \cdot \hat s_j$ for all $j$
end loop
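A minimal NumPy sketch following the update above (the interface, eps, and default decay rates are my own assumptions, not copied from the Adam paper):

def adam(stoch_grad, w0, alpha, rho1=0.9, rho2=0.999, num_iters=1000, eps=1e-8):
    w = numpy.array(w0, dtype=float)
    s = numpy.zeros_like(w)              # first-moment (momentum) estimate
    r = numpy.zeros_like(w)              # second-moment estimate
    for t in range(1, num_iters + 1):
        g = stoch_grad(w)
        s = rho1 * s + (1 - rho1) * g
        r = rho2 * r + (1 - rho2) * g**2
        s_hat = s / (1 - rho1**t)        # bias-corrected first moment
        r_hat = r / (1 - rho2**t)        # bias-corrected second moment
        w -= alpha * s_hat / (numpy.sqrt(r_hat) + eps)
    return w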
How can we think of these methods as relating to diagonal preconditioning?