$\newcommand{\R}{\mathbb{R}}$ $\newcommand{\norm}[1]{\left\|#1\right\|}$ $\newcommand{\Exv}[1]{\mathbf{E}\left[#1\right]}$ $\newcommand{\Prob}[1]{\mathbf{P}\left(#1\right)}$ $\newcommand{\Var}[1]{\operatorname{Var}\left(#1\right)}$ $\newcommand{\Abs}[1]{\left|#1\right|}$
import numpy
import scipy
import matplotlib
from matplotlib import pyplot
import time
matplotlib.rcParams.update({'font.size': 12})
When we analyzed gradient descent and SGD for strongly convex objectives, the convergence rate depended on the condition number $\kappa = L / \mu$.
For example, the noise ball size for SGD with a constant step size was determined by
$$\Exv{ f(w_T) - f^* } \le \left(1 - \alpha \mu \right)^T \cdot (f(w_0) - f^*) + \frac{\alpha \sigma^2 \kappa}{2}$$and for gradient descent with constant step size, the convergence rate was
$$ f(w_T) - f^* \le \exp\left(-\frac{T}{\kappa} \right) \cdot \left( f(w_0) - f^* \right).$$Takeaway: when the condition number is high, convergence can become slow. This motivates the question: how can we speed up gradient descent when the condition number is high?
To build our intuition about the condition number, let's look at the simplest possible setting with a condition number greater than 1: a 2D quadratic.
$$f(w) = f(w_1, w_2) = \frac{L}{2} w_1^2 + \frac{\mu}{2} w_2^2 = \frac{1}{2} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}^T \begin{bmatrix} L & 0 \\ 0 & \mu \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}.$$The optimum value occurs at $w^* = 0$ (where $f(w^*) = 0$), and the second derivative matrix is the constant
$$\nabla^2 f(w) = \begin{bmatrix} L & 0 \\ 0 & \mu \end{bmatrix}.$$Let's think about what gradient descent (with constant learning rate) is going to do on this problem. The update rule will be \begin{align*} (w_{t+1})_1 &= (w_{t})_1 - \alpha L (w_t)_1 \\ (w_{t+1})_2 &= (w_{t})_2 - \alpha \mu (w_t)_2. \end{align*} This means that \begin{align*} \left( (w_T)_1 \right)^2 &= (1 - \alpha L)^{2T} \left( (w_0)_1 \right)^2 \\ \left( (w_T)_2 \right)^2 &= (1 - \alpha \mu)^{2T} \left( (w_0)_2 \right)^2 \end{align*}
and so
$$f(w_T) = \frac{L}{2} (1 - \alpha L)^{2T} \left( (w_0)_1 \right)^2 + \frac{\mu}{2} (1 - \alpha \mu)^{2T} \left( (w_0)_2 \right)^2.$$So we have this expression for our quadratic... $$f(w_T) = \mathcal{O}\left( \max(\Abs{1 - \alpha L}, \Abs{1 - \alpha \mu})^{2T} \right).$$ Let's look at this inner max term as a function of $\alpha$.
mu = 1.0
L = 2.0
alpha = numpy.arange(0,1.5,0.01)
alpha_opt = 2 / (L + mu);  # step size that balances the two contraction factors
# plot each coordinate's contraction factor, and their max, as a function of the step size
pyplot.plot(alpha, numpy.abs(1 - mu * alpha), label = "$|1 - \\mu \\alpha|$");
pyplot.plot(alpha, numpy.abs(1 - L * alpha), label = "$|1 - L \\alpha|$");
pyplot.plot(alpha, numpy.maximum(numpy.abs(1 - L * alpha),numpy.abs(1 - mu * alpha)), "--", label = "$\\max(|1 - L \\alpha|,|1 - \\mu \\alpha|)$");
pyplot.plot([alpha_opt], [numpy.abs(1 - mu * alpha_opt)], "o", label = "optimal point")
pyplot.title("Convergence Rate of GD on Quadratic Objective");
pyplot.xlabel("step size $\\alpha$");
pyplot.legend();
So while we would want to pick a smaller learning rate for the $L$ term, and a larger learning rate for the $\mu$ term, we're forced to pick something in the middle. (Incidentally, this optimal learning rate occurs at
$$\alpha = \frac{2}{L + \mu} \hspace{1em} \Rightarrow \hspace{1em} \max(\Abs{1 - \alpha L}, \Abs{1 - \alpha \mu}) = \frac{L - \mu}{L + \mu} = \frac{\kappa - 1}{\kappa + 1} = 1 - \frac{2}{\kappa + 1},$$but it's not super important to know this.) That is, we'd like to set the step size larger for the dimension with less curvature, and smaller for the dimension with more curvature.
But we can't do this with plain GD or SGD, because there is only one step-size parameter.
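To make this concrete, here's a small numerical sketch (the values of $L$ and $\mu$ below are illustrative choices giving $\kappa = 10$, not anything used above): run GD with the compromise step size $\alpha = 2/(L+\mu)$ and check that the loss decays roughly like $\left(\frac{\kappa - 1}{\kappa + 1}\right)^{2T}$.
# gradient descent on f(w) = (L/2) w_1^2 + (mu/2) w_2^2 with the compromise step size
L_demo, mu_demo = 10.0, 1.0                      # illustrative curvatures; kappa = 10
alpha_demo = 2 / (L_demo + mu_demo)              # best single step size, alpha = 2/(L + mu)
w = numpy.array([1.0, 1.0])
T = 50
for t in range(T):
    w = w - alpha_demo * numpy.array([L_demo * w[0], mu_demo * w[1]])   # gradient step
f_T = 0.5 * L_demo * w[0]**2 + 0.5 * mu_demo * w[1]**2
rate = (L_demo - mu_demo) / (L_demo + mu_demo)   # (kappa - 1)/(kappa + 1)
print(f_T, rate**(2*T))                          # f(w_T) decays at this squared rate (up to constants)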
Three common solutions:
Motivation: try to tell the difference between more and less curved directions using information already available in gradient descent.
Idea: in the one-dimensional case, if the gradients are reversing sign, then the step size is too large, because we're overshooting the optimum.
Conversely, if the gradients are staying in the same direction, then the step size is too small.
Can we use this to make steps smaller when gradients reverse sign and larger when gradients are consistently in the same direction?
Polyak momentum adds an extra momentum term to gradient descent.
$$w_{t+1} = w_t - \alpha \nabla f(w_t) + \beta (w_t - w_{t-1}).$$Intuition: this is also known as the heavy ball method, because it approximately simulates the dynamics of a ball that has physical momentum sliding through the space we want to optimize over.
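As a direct translation of this update rule into code (a minimal sketch: the name `polyak_momentum_step`, the gradient oracle `grad_f`, and the specific $\alpha$, $\beta$, and toy quadratic below are all just illustrative choices, not defined anywhere above):
def polyak_momentum_step(w, w_prev, grad_f, alpha, beta):
    # w_{t+1} = w_t - alpha * grad f(w_t) + beta * (w_t - w_{t-1})
    w_next = w - alpha * grad_f(w) + beta * (w - w_prev)
    return w_next, w            # also return w, which becomes the new "previous" iterate

# try it on a toy 2D quadratic with curvatures 10 and 1
grad_f = lambda w: numpy.array([10.0 * w[0], 1.0 * w[1]])
w, w_prev = numpy.array([1.0, 1.0]), numpy.array([1.0, 1.0])
for t in range(100):
    w, w_prev = polyak_momentum_step(w, w_prev, grad_f, alpha=0.1, beta=0.5)
print(w)                        # both coordinates shrink toward the optimum at w = 0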
Rather than showing you the full analysis for convex functions, we'll look at a simple case that lets us build intuition: the case of 1D quadratics.
$$f(w) = \frac{\lambda}{2} w^2.$$Using this, Polyak momentum's update step looks like
\begin{align*} w_{t+1} &= w_t - \alpha \lambda w_t + \beta (w_t - w_{t-1}) \\ &= (1 + \beta - \alpha \lambda) w_t - \beta w_{t-1}. \end{align*}We can write this in terms of a matrix multiplication as
$$\begin{bmatrix} w_{t+1} \\ w_t \end{bmatrix} = \begin{bmatrix} (1 + \beta - \alpha \lambda) & -\beta \\ 1 & 0 \end{bmatrix} \begin{bmatrix} w_t \\ w_{t-1} \end{bmatrix}.$$This is sometimes called the companion matrix of the process.
If we call this matrix $M \in \R^{2 \times 2}$, then by induction we get
$$\begin{bmatrix} w_{K+1} \\ w_K \end{bmatrix} = M \begin{bmatrix} w_K \\ w_{K-1} \end{bmatrix} = M^2 \begin{bmatrix} w_{K-1} \\ w_{K-2} \end{bmatrix} = \cdots = M^K \begin{bmatrix} w_1 \\ w_0 \end{bmatrix}.$$We need to learn some things about $M$ to make sense of this expression $M^K$.
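Before doing that, here's a quick numerical sanity check (with arbitrary illustrative values of $\alpha$, $\beta$, and $\lambda$, chosen only for this check) that iterating the scalar recurrence really does match multiplying by powers of $M$:
# check that the scalar recurrence matches the companion-matrix formulation
alpha_c, beta_c, lam = 0.1, 0.5, 2.0              # arbitrary illustrative parameters
M = numpy.array([[1 + beta_c - alpha_c * lam, -beta_c],
                 [1.0, 0.0]])
ws = [1.0, 0.9]                                   # arbitrary starting iterates w_0 and w_1
for t in range(1, 10):
    ws.append((1 + beta_c - alpha_c * lam) * ws[t] - beta_c * ws[t - 1])
K = 9
print(numpy.array([ws[K + 1], ws[K]]))            # [w_{K+1}, w_K] from the recurrence
print(numpy.linalg.matrix_power(M, K) @ numpy.array([ws[1], ws[0]]))   # M^K [w_1, w_0]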
Let's assume (without checking) that $M$ is diagonalizable, and it has two eigenvectors $u_1$ and $u_2$ with eigenvalues $\chi_1$ and $\chi_2$. I.e.
$$M u_1 = \chi_1 u_1$$and similarly for $u_2$. Then,
$$M^K u_1 = M \cdot M \cdots M u_1 = \chi_1 \cdot \chi_1 \cdots \chi_1 u_1 = \chi_1^K u_1$$,
and so in particular if our input is $\begin{bmatrix}w_1 \\ w_0\end{bmatrix} = c_1 u_1 + c_2 u_2$, for some constants $c_1$ and $c_2$, then
$$M^K \begin{bmatrix}w_1 \\ w_0\end{bmatrix} = c_1 \chi_1^K u_1 + c_2 \chi_2^K u_2 = \mathcal{O}\left( \max(|\chi_1|,|\chi_2|)^K \right).$$So (barring issues with $M$ not being diagonalizable) this algorithm is going to converge at a linear rate determined by the maximum absolute value of an eigenvalue of $M$.
To understand how Polyak momentum converges we need to understand the eigenvalues of $M$.
Recall: the eigenvalues of a matrix are the zeros of its characteristic polynomial.
\begin{align*} |\chi I - M| & = \left| \begin{bmatrix} \chi - (1 + \beta - \alpha \lambda) & \beta \\ -1 & \chi \end{bmatrix} \right| \\& = \chi \left( \chi - (1 + \beta - \alpha \lambda) \right) + \beta \\& = \chi^2 - (1 + \beta - \alpha \lambda) \chi + \beta. \end{align*}So, by the quadratic formula, the eigenvalues of $M$ are going to be
$$\chi = \frac{ (1 + \beta - \alpha \lambda) \pm \sqrt{ (1 + \beta - \alpha \lambda)^2 - 4 \beta } }{ 2 }.$$Now we make an important observation that will simplify our lives. If
$$(1 + \beta - \alpha \lambda)^2 - 4 \beta \le 0,$$which is equivalent to
$$-1 \le \frac{1 + \beta - \alpha \lambda}{2 \sqrt{\beta}} \le 1,$$then the term under the square root will be non-positive, and the two eigenvalues $\chi_1$ and $\chi_2$ will be complex conjugates of each other.
In this case, the two eigenvalues have the same magnitude.
Explicitly:
$$\chi_1^* = \chi_2 \hspace{2em}\text{and}\hspace{2em} \Abs{\chi_1}^2 = \Abs{\chi_2}^2 = \chi_1^* \chi_1 = \chi_1 \chi_2.$$But this last term is just the product of the eigenvalues of $M$. And we know what the product of the eigenvalues of $M$ is: that's just the determinant of $M$, which is $\beta$.
This means that
$$\Abs{\chi_1} = \Abs{\chi_2} = \sqrt{\Abs{\chi_1 \chi_2}} = \sqrt{\beta}.$$Since $\Abs{\chi} = \sqrt{\beta}$, this means that the iterates $w_t$ of momentum converge at a rate of $\sqrt{\beta}$ in this case!
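We can check this numerically. The values of $\alpha$, $\beta$, and $\lambda$ below are arbitrary illustrative choices, picked only so that the condition above is satisfied:
# verify that the eigenvalues of M are a complex-conjugate pair with magnitude sqrt(beta)
alpha_c, beta_c, lam = 0.3, 0.5, 2.0
assert abs(1 + beta_c - alpha_c * lam) <= 2 * numpy.sqrt(beta_c)   # the condition holds
M = numpy.array([[1 + beta_c - alpha_c * lam, -beta_c],
                 [1.0, 0.0]])
eigs = numpy.linalg.eigvals(M)
print(eigs)                                       # a complex-conjugate pair
print(numpy.abs(eigs), numpy.sqrt(beta_c))        # both magnitudes equal sqrt(beta)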
Note that this happens no matter what the step size is, as long as our condition is satisfied that
$$-1 \le \frac{1 + \beta - \alpha \lambda}{2 \sqrt{\beta}} \le 1.$$For a multidimensional quadratic, we'll see the same dynamics as the single-dimension quadratic along each of the eigenvectors of the second derivative matrix.
Importantly, they will all converge at the same rate of $\left(\sqrt{\beta}\right)^t$, even for directions of different curvature!
This will happen as long as the condition
$$ -1 \le \frac{1 + \beta - \alpha \lambda}{2 \sqrt{\beta}} \le 1$$is satisfied for all the directions' curvatures $\lambda$.
Since $\mu \le \lambda \le L$, this will hold if and only if
$$-1 \le \frac{1 + \beta - \alpha L}{2 \sqrt{\beta}} \hspace{1em}\text{ and }\hspace{1em} \frac{1 + \beta - \alpha \mu}{2 \sqrt{\beta}} \le 1.$$Now, we want to make $\beta$ as small as possible to converge as fast as possible. How small can we make $\beta$ to guarantee that our conditions hold for a quadratic?
This smallest value of $\beta$ will result in our inequalities holding with equality, so we'll get
$$-1 = \frac{1 + \beta - \alpha L}{2 \sqrt{\beta}} \hspace{1em}\text{ and }\hspace{1em} \frac{1 + \beta - \alpha \mu}{2 \sqrt{\beta}} = 1.$$...and when we solve this for $\alpha$ and $\beta$ (two equations, two unknowns), we get
$$\alpha = \frac{2 + 2 \beta}{L + \mu} \hspace{1em}\text{ and }\hspace{1em} \sqrt{\beta} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} = 1 - \frac{2}{\sqrt{\kappa} + 1}.$$How does this compare to the rate that we got for gradient descent without momentum?
GD w/o Momentum: Converges like $\left(1 - \frac{1}{\kappa}\right)^T \approx \exp\left(- \frac{T}{\kappa}\right)$
GD with Momentum: Converges like $\left(1 - \frac{2}{\sqrt{\kappa} + 1}\right)^T \approx \exp\left(- \frac{2T}{\sqrt{\kappa} + 1}\right)$
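To see this gap concretely, here's a small experiment (illustrative, not from the notes above except for the formulas for $\alpha$ and $\beta$): run plain GD and Polyak momentum on a badly conditioned 2D quadratic with $\kappa = 100$ and compare the final objective values.
# compare plain GD with Polyak momentum on an ill-conditioned 2D quadratic
L_cmp, mu_cmp = 100.0, 1.0                         # kappa = 100 (illustrative)
kappa = L_cmp / mu_cmp
H = numpy.array([L_cmp, mu_cmp])                   # diagonal of the Hessian
f = lambda w: 0.5 * numpy.sum(H * w**2)
grad = lambda w: H * w
T = 200

# plain gradient descent with the best single step size alpha = 2/(L + mu)
w_gd = numpy.array([1.0, 1.0])
for t in range(T):
    w_gd = w_gd - (2 / (L_cmp + mu_cmp)) * grad(w_gd)

# Polyak momentum with the alpha and beta derived above
beta = ((numpy.sqrt(kappa) - 1) / (numpy.sqrt(kappa) + 1))**2
alpha = (2 + 2 * beta) / (L_cmp + mu_cmp)
w, w_prev = numpy.array([1.0, 1.0]), numpy.array([1.0, 1.0])
for t in range(T):
    w, w_prev = w - alpha * grad(w) + beta * (w - w_prev), w

print("GD:       f(w_T) =", f(w_gd))               # decays roughly like ((kappa-1)/(kappa+1))^(2T)
print("momentum: f(w_T) =", f(w))                  # decays roughly like beta^T: much smaller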
Activity: What is the additional computational cost of running momentum? What is the memory requirement?
...
Takeaways: momentum improves the dependence on the condition number from roughly $\kappa$ to roughly $\sqrt{\kappa}$, and (as the activity above suggests) it does so at very little extra cost: a few extra vector operations per iteration and one extra vector of memory.
Nesterov momentum is slightly different from Polyak momentum, and it is guaranteed to work for convex functions.
The main difference: it separates the momentum state from the point at which we evaluate the gradient.
Usually we run something like this:
\begin{align*} v_{t+1} &= \beta v_t - \alpha \nabla f_{\tilde i_t}(w_t) \\ w_{t+1} &= w_t + v_{t+1}. \end{align*}
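Here's a minimal sketch of this update in code. The gradient oracle `stochastic_grad` and the specific $\alpha$, $\beta$, and toy problem below are stand-ins assumed for illustration, not anything defined above:
def sgd_with_momentum(w0, stochastic_grad, alpha, beta, num_steps):
    # v_{t+1} = beta * v_t - alpha * grad f_i(w_t);   w_{t+1} = w_t + v_{t+1}
    w = numpy.array(w0, dtype=float)
    v = numpy.zeros_like(w)
    for t in range(num_steps):
        v = beta * v - alpha * stochastic_grad(w)
        w = w + v
    return w

# illustrative usage: noisy gradients of the quadratic f(w) = ||w||^2 / 2
rng = numpy.random.default_rng(0)
stochastic_grad = lambda w: w + 0.01 * rng.standard_normal(w.shape)
print(sgd_with_momentum(numpy.ones(2), stochastic_grad, alpha=0.1, beta=0.9, num_steps=100))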