Gradient descent (and beyond)

Cornell CS 4/5780

Fall 2022

In the previous lecture on Logistic Regression we wrote down expressions for the parameters in our model as solutions to optimization problems that do not have closed form solutions. Specifically, given data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ with $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$ we saw that
$$\hat{\mathbf{w}}_{\text{MLE}} = \operatorname*{argmin}_{\mathbf{w} \in \mathbb{R}^d,\, b \in \mathbb{R}} \sum_{i=1}^n \log\left(1 + e^{-y_i(\mathbf{w}^T\mathbf{x}_i + b)}\right) \tag{1}$$
and
$$\hat{\mathbf{w}}_{\text{MAP}} = \operatorname*{argmin}_{\mathbf{w} \in \mathbb{R}^d,\, b \in \mathbb{R}} \sum_{i=1}^n \log\left(1 + e^{-y_i(\mathbf{w}^T\mathbf{x}_i + b)}\right) + \lambda \mathbf{w}^T\mathbf{w}. \tag{2}$$

These notes will discuss general strategies to solve these problems and, therefore, we abstract our problem to
$$\min_{\mathbf{w}} \ell(\mathbf{w}), \tag{3}$$
where $\ell: \mathbb{R}^d \rightarrow \mathbb{R}$. In other words, we would like to find the vector $\mathbf{w}$ that makes $\ell(\mathbf{w})$ as small as possible. This is a very general problem that arises in a broad range of applications and often the best way to approach it depends on properties of $\ell$. While we will discuss some basic algorithmic strategies here, in practice it is important to use well developed software—there are a lot of little details that go into practically solving an optimization problem and we will not cover them all here. Numerical optimization is both a very well studied area and one that is continually evolving.
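To make the abstraction concrete, here is a minimal sketch of eq. 2 written as a single objective $\ell$ that the methods below could minimize; the function name, the NumPy dependency, and the array shapes are our own choices, not anything prescribed by the notes.

```python
import numpy as np

def logistic_loss(w, b, X, y, lam=0.0):
    """Regularized logistic loss from eq. (2); lam = 0 recovers eq. (1).

    X: (n, d) array of inputs, y: (n,) array of +/-1 labels,
    w: (d,) weight vector, b: scalar bias, lam: regularization weight.
    """
    margins = y * (X @ w + b)
    # np.logaddexp(0, -m) evaluates log(1 + exp(-m)) in a numerically stable way.
    return np.sum(np.logaddexp(0.0, -margins)) + lam * (w @ w)
```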

Before diving into the mathematical and algorithmic details we will make some assumptions about our problem to simplify the discussion. Specifically, we will assume that:

  1. $\ell$ is convex. This allows us to assert that any local minimum we find is also a global minimum,1 and helps simplify our discussion of Newton’s method.
  2. $\ell$ is at least thrice continuously differentiable. We are going to extensively use Taylor approximations and this assumption simplifies the discussion greatly.
  3. There are no constraints placed on $\mathbf{w}$. It is common to consider the problem $\min_{\mathbf{w} \in \mathcal{C}} \ell(\mathbf{w})$ where $\mathcal{C}$ represents some constraints on $\mathbf{w}$ (e.g., $\mathbf{w}$ is entrywise non-negative). Adding constraints is a level of complexity we will not address here—it is best left to a course that focuses on numerical optimization.

What is a (local) minimizer

The first question we might ask is what it actually means to solve eq. 3. We call $\mathbf{w}^*$ a local minimizer of $\ell$ if:

There is some $\epsilon > 0$ such that for all $\mathbf{w} \in \{\mathbf{w} \mid \|\mathbf{w} - \mathbf{w}^*\|_2 < \epsilon\}$ we have that $\ell(\mathbf{w}^*) \leq \ell(\mathbf{w})$.

In other words, there is some small radius around $\mathbf{w}^*$ where $\mathbf{w}^*$ makes $\ell$ as small as possible. It turns out that our earlier assumption that $\ell$ is convex implies that if we find such a $\mathbf{w}^*$ we can immediately assert that $\ell(\mathbf{w}^*) \leq \ell(\mathbf{w})$ for all $\mathbf{w} \in \mathbb{R}^d$. Note that none of the inequalities involving $\ell$ are strict; we can also define a strict local minimizer by requiring $\ell(\mathbf{w}^*) < \ell(\mathbf{w})$.2 Notably, some convex functions have no strict local minimizers, e.g., the constant function $\ell(\mathbf{w}) = 1$ is convex, and some convex functions have no finite local minimizers, e.g., for any non-zero vector $\mathbf{c} \in \mathbb{R}^d$, $\mathbf{c}^T\mathbf{w}$ is convex but can be made arbitrarily small by letting $\|\mathbf{w}\|_2 \rightarrow \infty$ appropriately.

Mathematically, we often talk about (first and second order) necessary and sufficient conditions for a point to be a local minimizer. We will omit many of the details here, but a key necessary condition is that the gradient of $\ell$ is zero at $\mathbf{w}^*$, i.e., $\nabla \ell(\mathbf{w}^*) = 0$. If not, we can actually show there is a direction we can move in that further decreases $\ell$. Assuming the gradient at $\mathbf{w}^*$ is zero, a sufficient condition for the point to be a strict local minimizer is for the Hessian, i.e., $\nabla^2 \ell(\mathbf{w}^*)$, to be positive definite.

Taylor expansions

While we made some assumptions on $\ell$, they do not actually tell us much about the function—it could behave in all sorts of ways. Moreover, as motivated by the logistic regression example it may not be so easy to work with the function globally. Therefore, we will often leverage local information about the function $\ell$. We accomplish this through the use of first and second order Taylor expansions.3

Recall that the first order Taylor expansion of $\ell$ centered at $\mathbf{w}$ can be written as
$$\ell(\mathbf{w} + \mathbf{p}) \approx \ell(\mathbf{w}) + g(\mathbf{w})^T\mathbf{p}, \tag{4}$$
where $g(\mathbf{w})$ is the gradient of $\ell$ evaluated at $\mathbf{w}$, i.e., $(g(\mathbf{w}))_j = \frac{\partial}{\partial w_j}\ell(\mathbf{w})$ for $j = 1, \ldots, d$. Similarly, the second order Taylor expansion of $\ell$ centered at $\mathbf{w}$ can be written as
$$\ell(\mathbf{w} + \mathbf{p}) \approx \ell(\mathbf{w}) + g(\mathbf{w})^T\mathbf{p} + \frac{1}{2}\mathbf{p}^T H(\mathbf{w})\mathbf{p}, \tag{5}$$
where $H(\mathbf{w})$ is the Hessian of $\ell$ evaluated at $\mathbf{w}$, i.e., $[H(\mathbf{w})]_{i,j} = \frac{\partial^2}{\partial w_i \partial w_j}\ell(\mathbf{w})$ for $i,j = 1, \ldots, d$. These correspond to linear and quadratic approximations of $\ell$ as illustrated in fig. 1. In general, these approximations are reasonably valid if $\|\mathbf{p}\|_2$ is small (concretely, eq. 4 has error $O(\|\mathbf{p}\|_2^2)$ and eq. 5 has error $O(\|\mathbf{p}\|_2^3)$).

Figure 1: First and second order Taylor approximations.
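As a quick numerical sanity check on these error rates, the following sketch compares both approximations as the perturbation shrinks; the particular test function, the random perturbation, and the use of NumPy are our own choices.

```python
import numpy as np

# A smooth convex test function with a hand-computable gradient and Hessian.
f = lambda w: np.sum(np.log(1.0 + np.exp(w)))
grad = lambda w: 1.0 / (1.0 + np.exp(-w))
hess = lambda w: np.diag(np.exp(-w) / (1.0 + np.exp(-w)) ** 2)

rng = np.random.default_rng(0)
w = rng.normal(size=5)
p = rng.normal(size=5)

for t in [1e-1, 1e-2, 1e-3]:
    first = f(w) + grad(w) @ (t * p)                    # eq. (4)
    second = first + 0.5 * (t * p) @ hess(w) @ (t * p)  # eq. (5)
    print(t, abs(f(w + t * p) - first), abs(f(w + t * p) - second))
# Shrinking the perturbation by 10x should shrink the first-order error roughly
# 100x and the second-order error roughly 1000x, matching the stated orders.
```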

Search direction methods

The basic approach we will take to solving eq. 3 here falls into the broad category of search direction methods. The core idea is that given a starting point $\mathbf{w}_0$ we construct a sequence of iterates $\mathbf{w}_1, \mathbf{w}_2, \ldots$ with the goal that $\mathbf{w}_k \rightarrow \mathbf{w}^*$ as $k \rightarrow \infty$. In a search direction method we will think of constructing $\mathbf{w}_{k+1}$ from $\mathbf{w}_k$ by writing it as $\mathbf{w}_{k+1} = \mathbf{w}_k + \mathbf{s}$ for some “step” $\mathbf{s}$. Concretely, this means our methods will have the following generic format:

Input: initial guess $\mathbf{w}_0$

$k = 0$;

While not converged:

  1. Pick a step $\mathbf{s}$
  2. $\mathbf{w}_{k+1} = \mathbf{w}_k + \mathbf{s}$ and $k = k + 1$
  3. Check for convergence; if converged set $\hat{\mathbf{w}} = \mathbf{w}_k$

Return: $\hat{\mathbf{w}}$

This process is schematically outlined in fig. 2.

Figure 2: A search direction method applied to optimize a function of two variables.
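The meta algorithm translates almost directly into code. Below is a minimal sketch of such a loop; the name `step_fn` and the particular stopping test are illustrative choices of ours, not part of the notes.

```python
import numpy as np

def search_direction_method(step_fn, w0, max_iter=1000, tol=1e-6):
    """Generic search direction loop; step_fn(w, k) returns the step s.

    The convergence test (a small relative change in the iterates) is one of
    the simple checks discussed later; real packages are more careful.
    """
    w = np.asarray(w0, dtype=float)
    for k in range(max_iter):
        s = step_fn(w, k)          # 1. pick a step
        w_new = w + s              # 2. take the step
        if np.linalg.norm(w_new - w) <= tol * max(np.linalg.norm(w), 1.0):
            return w_new           # 3. converged
        w = w_new
    return w
```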

There are two clearly ambiguous steps in the above algorithm: how do we pick $\mathbf{s}$ and how do we determine when we have converged. We will spend most of our time addressing the former and then briefly touch on the latter—robustly determining convergence is one of the little details that a good optimization package should do well.

Gradient descent

The core idea behind gradient descent can be summed up as follows: given we are currently at $\mathbf{w}_k$, determine the direction in which the function decreases the fastest (at this point) and take a step in that direction. Mathematically, this can be achieved by considering the linear approximation to $\ell$ at $\mathbf{w}_k$ provided by the Taylor series. Specifically, if we consider the linear function $\ell(\mathbf{w}_k) + g(\mathbf{w}_k)^T\mathbf{s}$ then the fastest direction to descend is simply $\mathbf{s} \propto -g(\mathbf{w}_k)$. However, the linear approximation does not tell us how far to go (it runs off to minus infinity). Therefore, what we actually do in gradient descent is set $\mathbf{s} = -\alpha g(\mathbf{w}_k)$ for some step size $\alpha > 0$. The linear approximation is only good locally, so it is not actually clear that this choice even decreases $\ell$ if $\alpha$ is chosen poorly. However, assuming the gradient is non-zero we can show that there is always some small enough $\alpha$ such that $\ell(\mathbf{w}_k - \alpha g(\mathbf{w}_k)) < \ell(\mathbf{w}_k)$. In particular, we have that $\ell(\mathbf{w}_k - \alpha g(\mathbf{w}_k)) = \ell(\mathbf{w}_k) - \alpha g(\mathbf{w}_k)^T g(\mathbf{w}_k) + O(\alpha^2)$. Since $g(\mathbf{w}_k)^T g(\mathbf{w}_k) > 0$ and $\alpha^2 \rightarrow 0$ faster than $\alpha$ as $\alpha \rightarrow 0$, we conclude that for some sufficiently small $\alpha > 0$ we have that $\ell(\mathbf{w}_k - \alpha g(\mathbf{w}_k)) < \ell(\mathbf{w}_k)$.

In classical optimization, $\alpha$ is often referred to as the step size (in this case $-g(\mathbf{w}_k)$ is the search direction). However, in machine learning $\alpha$ is typically referred to as the learning rate. In numerical optimization there are good strategies for picking $\alpha$ in an adaptive manner at each step. However, they can be more expensive than a fixed strategy for setting $\alpha$. The catch is that setting $\alpha$ too small can lead to slow convergence and setting $\alpha$ too large can actually lead to divergence—see fig. 3. A safe choice that guarantees convergence is to set $\alpha = c/k$ at iteration $k$ for any constant $c > 0$.

Figure 3: Choices of the step size in gradient descent that lead to convergence (left) or divergence (right).
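Putting the pieces together, a bare-bones gradient descent loop might look like the sketch below; the function names, the toy objective, and treating the constant $c$ in the decaying schedule as the initial `alpha` are our own choices.

```python
import numpy as np

def gradient_descent(grad, w0, alpha=0.1, decay=False, max_iter=500):
    """Plain gradient descent: s = -alpha * g(w_k).

    grad(w) returns the gradient of the objective. If decay=True, the 'safe'
    schedule alpha_k = c / k mentioned above is used, with c = alpha.
    """
    w = np.asarray(w0, dtype=float)
    for k in range(1, max_iter + 1):
        step = alpha / k if decay else alpha
        w = w - step * grad(w)
    return w

# Toy example: l(w) = 0.5 * ||w||^2 has gradient w and minimizer at the origin.
print(gradient_descent(lambda w: w, np.array([3.0, -2.0]), alpha=0.1))
```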

Adagrad

One related strategy is to set the step size per entry of the gradient; though this is actually best thought of as modifying the step direction on a per entry basis. Adagrad (technically, diagonal Adagrad) accomplishes this by keeping a running sum of the squared gradients with respect to each optimization variable. It then sets a small learning rate for variables with large accumulated gradients and a large learning rate for variables with small accumulated gradients. This can be important if the entries of $\mathbf{w}$ are attached to features (e.g., in logistic regression we can associate each entry of $\mathbf{w}$ with a feature) that vary in scale or frequency.

Input: $\ell$, its gradient $g$, parameter $\epsilon > 0$, and initial learning rate $\alpha$.

Set $w_j^0 = 0$ and $z_j = 0$ for $j = 1, \ldots, d$. $k = 0$;

While not converged:

  1. Compute entries of the gradient $g_j = \frac{\partial}{\partial w_j}\ell(\mathbf{w}_k)$
  2. $z_j = z_j + g_j^2$ for $j = 1, \ldots, d$.
  3. $w_j^{k+1} = w_j^k - \alpha\frac{g_j}{\sqrt{z_j + \epsilon}}$ for $j = 1, \ldots, d$.
  4. $k = k + 1$
  5. Check for convergence; if converged set $\hat{\mathbf{w}} = \mathbf{w}_k$

Return: $\hat{\mathbf{w}}$
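A direct transcription of this pseudocode into Python might look like the sketch below; note that placing $\epsilon$ inside the square root follows our reconstruction above, and some implementations instead use $\sqrt{z_j} + \epsilon$ in the denominator.

```python
import numpy as np

def adagrad(grad, d, alpha=0.1, eps=1e-8, max_iter=500):
    """Diagonal Adagrad following the pseudocode above (starting from w = 0).

    grad(w) returns the gradient vector; z accumulates squared gradients and
    each coordinate's step is scaled by 1 / sqrt(z_j + eps).
    """
    w = np.zeros(d)
    z = np.zeros(d)
    for k in range(max_iter):
        g = grad(w)                           # step 1: gradient entries
        z += g ** 2                           # step 2: accumulate squares
        w = w - alpha * g / np.sqrt(z + eps)  # step 3: per-coordinate step
    return w
```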

Newton’s method

In gradient descent we used first order information about $\ell$ at $\mathbf{w}_k$ to determine which direction to move. A natural follow-up is then to use second order information—this leads to Newton’s method. Specifically, we now choose a step by explicitly minimizing the quadratic approximation to $\ell$ at $\mathbf{w}_k$. Recall that because $\ell$ is convex, $H(\mathbf{w})$ is positive semi-definite for all $\mathbf{w}$, so this is a sensible thing to attempt.4 In fact, Newton’s method has very good properties in the neighborhood of a strict local minimizer and once close enough to a solution it converges rapidly.

To derive the Newton step consider the quadratic approximation $\ell(\mathbf{w}_k + \mathbf{s}) \approx \ell(\mathbf{w}_k) + g(\mathbf{w}_k)^T\mathbf{s} + \frac{1}{2}\mathbf{s}^T H(\mathbf{w}_k)\mathbf{s}$. Since this approximation is a convex quadratic we can pick $\mathbf{s}$ such that $\mathbf{w}_{k+1}$ is a local minimizer. To accomplish this we explicitly solve $\min_{\mathbf{s}} \ell(\mathbf{w}_k) + g(\mathbf{w}_k)^T\mathbf{s} + \frac{1}{2}\mathbf{s}^T H(\mathbf{w}_k)\mathbf{s}$ by differentiating and setting the gradient equal to zero. For simplicity, let’s assume for the moment that $H(\mathbf{w}_k)$ is positive definite.5 Since the gradient of our quadratic approximation is $g(\mathbf{w}_k) + H(\mathbf{w}_k)\mathbf{s}$, this implies that $\mathbf{s}$ solves the linear system $H(\mathbf{w}_k)\mathbf{s} = -g(\mathbf{w}_k)$. In practice there are many approaches that can be used to solve this system, but it is important to note that relative to gradient descent Newton’s method is more expensive as we have to compute the Hessian and solve a $d \times d$ linear system.
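In code, a single (undamped) Newton step could be sketched as follows, assuming the Hessian is positive definite as in the discussion above; the function names are our own.

```python
import numpy as np

def newton_step(grad, hess, w):
    """One Newton step: solve H(w) s = -g(w) and return w + s.

    grad(w) returns the gradient and hess(w) the d x d Hessian; we assume
    the Hessian is positive definite so the system has a unique solution.
    """
    g = grad(w)
    H = hess(w)
    s = np.linalg.solve(H, -g)  # solve the d x d linear system for the step
    return w + s
```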

A simple example

While we will discuss the pros and cons of Newton’s method in the next section, there is a simple example that clearly illustrates how incorporating second order information can help. Pretend the function $\ell$ was actually a strictly convex quadratic, i.e., $\ell(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T A\mathbf{w} + \mathbf{b}^T\mathbf{w} + c$ where $A$ is a positive definite matrix, $\mathbf{b}$ is an arbitrary vector, and $c$ is some number. In this case, Newton converges in one step (since the strict global minimizer $\mathbf{w}^*$ of $\ell$ is the unique solution to $A\mathbf{w} = -\mathbf{b}$).

Meanwhile, gradient descent yields the sequence of iterates $\mathbf{w}_k = (I - \alpha A)\mathbf{w}_{k-1} - \alpha\mathbf{b}$. Using the fact that $\mathbf{w}^* = (I - \alpha A)\mathbf{w}^* - \alpha\mathbf{b}$ we can see that $\|\mathbf{w}_k - \mathbf{w}^*\|_2 \leq \|I - \alpha A\|_2\|\mathbf{w}_{k-1} - \mathbf{w}^*\|_2 \leq \|I - \alpha A\|_2^k\|\mathbf{w}_0 - \mathbf{w}^*\|_2$. Therefore, as long as $\alpha$ is small enough such that all the eigenvalues of $I - \alpha A$ are in $(-1, 1)$ the iteration will converge—albeit slowly if we have eigenvalues close to $\pm 1$. More generally, fig. 4 shows an example where we see the accelerated convergence of Newton’s method as we approach the local minimizer.

Figure 4: Comparison of gradient descent and Newton’s method; in this case the function and starting point are favorable to Newton’s method.
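The following small sketch illustrates both behaviors on a hand-picked quadratic; the particular $A$, $\mathbf{b}$, step size, and iteration count are arbitrary choices of ours.

```python
import numpy as np

# Strictly convex quadratic l(w) = 0.5 w^T A w + b^T w, gradient A w + b.
A = np.array([[3.0, 0.0], [0.0, 0.5]])
b = np.array([1.0, -1.0])
grad = lambda w: A @ w + b
w_star = np.linalg.solve(A, -b)           # exact minimizer

w = np.array([5.0, 5.0])
w_newton = w + np.linalg.solve(A, -grad(w))
print(np.allclose(w_newton, w_star))      # True: one Newton step suffices

alpha = 0.25                              # eigenvalues of I - alpha*A: 0.25, 0.875
for k in range(50):
    w = w - alpha * grad(w)
print(np.linalg.norm(w - w_star))         # small but nonzero after 50 steps
```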

Potential issues

While Newton’s method converges very quickly once $\mathbf{w}_k$ is sufficiently close to $\mathbf{w}^*$, that does not mean it always converges (see fig. 5). In fact, you can set up examples where it diverges or enters a limit cycle. This typically means the steps are taking you far from where the quadratic approximation is valid. Practically, a fix for this is to introduce a step size $\alpha > 0$ and formally set $\mathbf{s} = -\alpha[H(\mathbf{w}_k)]^{-1}g(\mathbf{w}_k)$. We typically start with $\alpha = 1$ since that is the proper step to take if the quadratic is a good approximation. However, if this step seems poor (e.g., $\ell(\mathbf{w}_k + \mathbf{s}) > \ell(\mathbf{w}_k)$) then we can decrease $\alpha$ at that iteration.
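A crude version of this safeguard is sketched below; halving $\alpha$ until the objective decreases is a simple stand-in for the more careful line searches used by real optimization packages.

```python
import numpy as np

def damped_newton_step(f, grad, hess, w, max_halvings=20):
    """Newton step with a simple safeguard: start at alpha = 1 and halve the
    step size until the objective actually decreases."""
    direction = np.linalg.solve(hess(w), -grad(w))
    alpha = 1.0
    for _ in range(max_halvings):
        if f(w + alpha * direction) < f(w):
            return w + alpha * direction
        alpha *= 0.5
    return w  # give up; the quadratic model is not helping at this point
```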

Figure 5: Example where Newton’s method converges or diverges depending on the starting point.

Another issue that we sidestepped is that when $\ell$ is convex but not strictly convex we can only assume that $H(\mathbf{w}_k)$ is positive semi-definite. In principle this means that $H(\mathbf{w}_k)\mathbf{s} = -g(\mathbf{w}_k)$ may not have a solution. Or, if $H(\mathbf{w}_k)$ has a zero eigenvalue and a solution does exist, then there are infinitely many solutions. A “quick fix” is to simply solve $(H(\mathbf{w}_k) + \epsilon I)\mathbf{s} = -g(\mathbf{w}_k)$ instead for some small parameter $\epsilon$. This lightly regularizes the quadratic approximation to $\ell$ at $\mathbf{w}_k$ and ensures it has a strict global minimizer.6
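In code, the quick fix is a one-line change to the Newton step; the default value of $\epsilon$ below is a hypothetical choice of ours.

```python
import numpy as np

def regularized_newton_step(grad, hess, w, eps=1e-6):
    """'Quick fix' Newton step: solve (H + eps*I) s = -g so the system is
    positive definite even when H is only positive semi-definite."""
    H = hess(w)
    g = grad(w)
    s = np.linalg.solve(H + eps * np.eye(H.shape[0]), -g)
    return w + s
```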

Checking for convergence

In the above meta algorithm we simply defined the sequence of iterates $\mathbf{w}_0, \mathbf{w}_1, \ldots$. However, even in exact arithmetic one of the iterates will likely never be a local minimizer $\mathbf{w}^*$—if that somehow happens we just got lucky or had a very special function. Rather, all we hope for is $\mathbf{w}_k \rightarrow \mathbf{w}^*$ as $k \rightarrow \infty$. Therefore, we need rules to determine when to stop. Doing so can be delicate in practice since we don’t know what the function looks like. Some common factors that go into assessing convergence include:

  1. Relative change in the iterates, i.e., $\frac{\|\mathbf{w}_{k+1} - \mathbf{w}_k\|_2}{\|\mathbf{w}_k\|_2} < \delta_1$ for some small $\delta_1 > 0$.
  2. A reasonably small gradient, i.e., $\|g(\mathbf{w}_k)\|_2 < \delta_2$ for some small $\delta_2 > 0$.

Both strategies have their failure modes, so in practice composite conditions are often used to ensure convergence has been reached.
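For instance, one possible composite stopping rule simply combines the two tests above; the specific tolerances and the guard against dividing by a tiny norm are our own choices.

```python
import numpy as np

def converged(w_new, w_old, g, delta1=1e-6, delta2=1e-6):
    """Composite stopping rule: require BOTH a small relative change in the
    iterates and a reasonably small gradient norm."""
    rel_change = np.linalg.norm(w_new - w_old) / max(np.linalg.norm(w_old), 1e-12)
    return rel_change < delta1 and np.linalg.norm(g) < delta2
```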

Best practices and extensions

While we are covering a reasonably narrow slice of numerical optimization here (see, e.g., CS 4220 for a more thorough treatment), there are a few miscellaneous points worth stressing.

While Newton’s method has very good local convergence properties, it is considerably more expensive than gradient descent in practice. We have to both compute the Hessian matrix and solve a $d \times d$ linear system at each step. Therefore, a lot of time has been devoted to the construction of so-called quasi-Newton methods. Generically, these methods compute a step $\mathbf{s}$ by solving the linear system $M_k\mathbf{s} = -g(\mathbf{w}_k)$ where $M_k \approx H(\mathbf{w}_k)$ is some sort of approximation to the Hessian that is easier to compute and yields a linear system that is easier to solve.

For example, we could let $M_k$ be the diagonal of the Hessian, i.e., $[M_k]_{i,i} = [H(\mathbf{w}_k)]_{i,i}$ and $[M_k]_{i,j} = 0$ if $i \neq j$. Now we only have to compute $d$ entries of the Hessian and solving the associated linear system only requires computing $s_i = -[g(\mathbf{w}_k)]_i / [H(\mathbf{w}_k)]_{i,i}$ for $i = 1, \ldots, d$. In fact, Adagrad can be viewed through this lens where $M_k$ is a diagonal matrix with $[M(\mathbf{w}_k)]_{i,i} = \frac{\sqrt{z_i + \epsilon}}{\alpha}$. These are only some simple examples; there are many more.
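As a sketch, a diagonal quasi-Newton step only needs the $d$ diagonal Hessian entries; the helper name `hess_diag` is hypothetical and not part of the notes.

```python
import numpy as np

def diagonal_newton_step(grad, hess_diag, w):
    """Quasi-Newton step with M_k = diag(H(w_k)): only d Hessian entries are
    needed and the 'linear solve' is an elementwise division.

    hess_diag(w) returns the d diagonal entries of the Hessian (assumed > 0).
    """
    return w - grad(w) / hess_diag(w)
```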

Sometimes we can mix methods; for example, we could start with gradient descent to try and get near a local minimizer and then switch to Newton’s method to rapidly converge. This is cheaper than the idea above where we introduced a step size, since we omit even computing the Hessian until we think it may help us. There are ways to formalize these sorts of ideas that we will not cover here.


  1. All the methods we discuss here are also applicable to nonconvex problems. However, in that setting we only look for local solutions and recognize that when/if we find one it may not be the global minimizer of our function.↩︎

  2. If we further assume our function is strictly convex then we can say similar things about strict local and global minimizers.↩︎

  3. Here, we omit an explicit discussion of error terms, but they are sometimes necessary in the analysis of optimization methods.↩︎

  4. This can also be a sensible thing to do when $\ell$ is not convex. However, some modifications have to be made since the best quadratic approximation to $\ell$ may not be a convex quadratic and, therefore, may not have a finite local minimizer. This often takes the form of a modification of the Hessian to get a convex quadratic we can minimize.↩︎

  5. We will return to this point later.↩︎

  6. Formally speaking this is not necessary. If $g(\mathbf{w}_k)$ is in the range of $H(\mathbf{w}_k)$ and $H(\mathbf{w}_k)$ has a zero eigenvalue (so there are infinitely many solutions), we can simply pick the solution of minimal norm. This moves us to the local minimizer of the quadratic approximation closest to $\mathbf{w}_k$. Similarly, if $g(\mathbf{w}_k)$ is not in the range of $H(\mathbf{w}_k)$ then we can just take a gradient step. In fact, if $g(\mathbf{w}_k)$ is not in the range of $H(\mathbf{w}_k)$ it implies that the quadratic has no finite minimizer. So, a gradient step is a sensible thing to do. Moreover, if $\ell$ has a strict local minimizer then $H(\mathbf{w})$ is guaranteed to be positive definite once $\mathbf{w}$ is close enough to the minimizer and this all becomes a moot point.↩︎