In the previous lecture on Logistic Regression we wrote down expressions for the parameters in our model as solutions to optimization problems that do not have closed form solutions. Specifically, given data $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$ we saw that
$$\hat{w}_{\text{MLE}} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \; -\sum_{i=1}^{n} \left[ y_i \log \sigma(w^T x_i) + (1 - y_i) \log\big(1 - \sigma(w^T x_i)\big) \right] \quad (1)$$
and
$$\hat{w}_{\text{MAP}} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \; -\sum_{i=1}^{n} \left[ y_i \log \sigma(w^T x_i) + (1 - y_i) \log\big(1 - \sigma(w^T x_i)\big) \right] + \lambda \|w\|_2^2, \quad (2)$$
where $\sigma(z) = 1/(1 + e^{-z})$.
These notes will discuss general strategies to solve these problems and, therefore, we abstract our problem to
$$\min_{w \in \mathbb{R}^d} f(w), \quad (3)$$
where $f : \mathbb{R}^d \to \mathbb{R}$. In other words, we would like to find the vector $w$ that makes $f(w)$ as small as possible. This is a very general problem that arises in a broad range of applications, and often the best way to approach it depends on properties of $f$. While we will discuss some basic algorithmic strategies here, in practice it is important to use well developed software: there are a lot of little details that go into practically solving an optimization problem and we will not cover them all here. Numerical optimization is both a very well studied area and one that is continually evolving.
Before diving into the mathematical and algorithmic details we will make some assumptions about our problem to simplify the discussion. Specifically, we will assume that:
$f$ is convex. This allows us to assert that any local minimum we find is also a global minimum,1 and helps simplify our discussion of Newton’s method.
$f$ is at least thrice continuously differentiable. We are going to extensively use Taylor approximations and this assumption simplifies the discussion greatly.
There are no constraints placed on $w$. It is common to consider the problem $\min_{w \in \mathcal{C}} f(w)$ where $\mathcal{C} \subseteq \mathbb{R}^d$ represents some constraints on $w$ (e.g., that $w$ is entrywise non-negative). Adding constraints is a level of complexity we will not address here; it is best left to a course that focuses on numerical optimization.
What is a (local) minimizer?
The first question we might ask is what it actually means to solve eq. 3. We call $w^\star$ a local minimizer of $f$ if:
There is some $\epsilon > 0$ such that for all $w$ with $\|w - w^\star\|_2 \leq \epsilon$ we have that $f(w^\star) \leq f(w)$.
In other words, there is some small radius around $w^\star$ where $w^\star$ makes $f$ as small as possible. It turns out that our earlier assumption that $f$ is convex implies that if we find such a $w^\star$ we can immediately assert that $f(w^\star) \leq f(w)$ for all $w \in \mathbb{R}^d$. Note that none of the inequalities involving $f$ are strict; we can also define a strict local minimizer by forcing $f(w^\star) < f(w)$ for all $w \neq w^\star$ with $\|w - w^\star\|_2 \leq \epsilon$.2 Notably, some convex functions have no strict local minimizers, e.g., the constant function $f(w) = c$ is convex, and some convex functions have no finite local minimizers, e.g., $f(w) = v^T w$ for any non-zero vector $v$ is convex but can be made arbitrarily small by taking $w = -tv$ and letting $t \to \infty$.
Mathematically, we often talk about (first and second order) necessary and sufficient conditions for a point to be a local minimizer. We will omit many of the details here, but a key necessary condition is that the gradient of $f$ is zero at $w^\star$, i.e., $\nabla f(w^\star) = 0$. If not, we can actually show there is a direction we can move in that further decreases $f$. Assuming the gradient at $w^\star$ is zero, a sufficient condition for the point to be a strict local minimizer is for the Hessian, i.e., $\nabla^2 f(w^\star)$, to be positive definite.
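To make these conditions concrete, here is a small numerical check on the hypothetical function $f(w) = (w_1 - 1)^2 + 2(w_2 + 3)^2$ with candidate minimizer $w^\star = (1, -3)$; the function is an example chosen for illustration, not one from the notes.

```python
import numpy as np

# Hypothetical example: f(w) = (w1 - 1)^2 + 2 (w2 + 3)^2, candidate minimizer w* = (1, -3).
w_star = np.array([1.0, -3.0])
grad = np.array([2 * (w_star[0] - 1), 4 * (w_star[1] + 3)])    # gradient of f at w*
hess = np.array([[2.0, 0.0], [0.0, 4.0]])                      # Hessian of f (constant here)

print(np.allclose(grad, 0.0))                   # necessary condition: zero gradient
print(np.all(np.linalg.eigvalsh(hess) > 0))     # sufficient condition: positive definite Hessian
```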
Taylor expansions
While we made some assumptions on $f$, they do not actually tell us much about the function: it could behave in all sorts of ways. Moreover, as motivated by the logistic regression example, it may not be so easy to work with the function globally. Therefore, we will often leverage local information about the function $f$. We accomplish this through the use of first and second order Taylor expansions.3
Recall that the first order Taylor expansion of $f$ centered at $w_0$ can be written as
$$f(w) \approx f(w_0) + \nabla f(w_0)^T (w - w_0), \quad (4)$$
where $\nabla f(w_0) \in \mathbb{R}^d$ is the gradient of $f$ evaluated at $w_0$, i.e., $[\nabla f(w_0)]_i = \frac{\partial f}{\partial w_i}(w_0)$ for $i = 1, \ldots, d$. Similarly, the second order Taylor expansion of $f$ centered at $w_0$ can be written as
$$f(w) \approx f(w_0) + \nabla f(w_0)^T (w - w_0) + \frac{1}{2} (w - w_0)^T \nabla^2 f(w_0) (w - w_0), \quad (5)$$
where $\nabla^2 f(w_0) \in \mathbb{R}^{d \times d}$ is the Hessian of $f$ evaluated at $w_0$, i.e., $[\nabla^2 f(w_0)]_{i,j} = \frac{\partial^2 f}{\partial w_i \partial w_j}(w_0)$ for $i, j = 1, \ldots, d$. These correspond to linear and quadratic approximations of $f$, as illustrated in fig. 1. In general, these approximations are reasonably valid if $\|w - w_0\|_2$ is small (concretely, eq. 4 has error $\mathcal{O}(\|w - w_0\|_2^2)$ and eq. 5 has error $\mathcal{O}(\|w - w_0\|_2^3)$).
Figure 1: First and second order Taylor approximations.
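As a quick numerical check (not from the notes), the sketch below compares $f$, its first order, and its second order Taylor approximations for the hypothetical one-dimensional function $f(w) = e^w$ centered at $w_0 = 0$.

```python
import numpy as np

# Hypothetical 1-D example: f(w) = exp(w), expanded around w0 = 0 (so f(w0) = f'(w0) = f''(w0) = 1).
f = np.exp
w0, w = 0.0, 0.3
first_order = f(w0) + f(w0) * (w - w0)
second_order = first_order + 0.5 * f(w0) * (w - w0) ** 2

print(abs(f(w) - first_order))    # error is O(|w - w0|^2)
print(abs(f(w) - second_order))   # error is O(|w - w0|^3), noticeably smaller
```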
Search direction methods
The basic approach we will take to solving eq. 3 here falls into the broad category of search direction methods. The core idea is that given a starting point $w^0$ we construct a sequence of iterates $w^1, w^2, \ldots$ with the goal that $w^k \to w^\star$ as $k \to \infty$. In a search direction method we will think of constructing $w^{k+1}$ from $w^k$ by writing it as $w^{k+1} = w^k + p^k$ for some “step” $p^k$. Concretely, this means our methods will have the following generic format:
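Input: starting point $w^0$
Set $k = 0$;
While not converged:
Pick a step $p^k$ and set $w^{k+1} = w^k + p^k$
Check for convergence; if converged set $w^\star = w^{k+1}$; otherwise set $k = k + 1$
Return: $w^\star$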
Figure 2: A search direction method applied to optimize a function of two variables.
There are two clearly ambiguous steps in the above algorithm: how do we pick $p^k$, and how do we determine when we have converged? We will spend most of our time addressing the former and then briefly touch on the latter; robustly determining convergence is one of the little details that a good optimization package should do well.
Gradient descent
The core idea behind gradient descent can be summed up as follows: given that we are currently at $w^k$, determine the direction in which the function decreases the fastest (at this point) and take a step in that direction. Mathematically, this can be achieved by considering the linear approximation to $f$ at $w^k$ provided by the Taylor series. Specifically, if we consider the linear function $f(w^k) + \nabla f(w^k)^T (w - w^k)$, then the fastest direction to descend is simply $-\nabla f(w^k)$. However, the linear approximation does not tell us how far to go (it runs off to minus infinity). Therefore, what we actually do in gradient descent is set $w^{k+1} = w^k - \alpha \nabla f(w^k)$ for some step size $\alpha > 0$. The linear approximation is only good locally, so it is not actually clear that this choice even decreases $f$ if $\alpha$ is chosen poorly. However, assuming the gradient is non-zero, we can show that there is always some small enough $\alpha$ such that $f(w^{k+1}) < f(w^k)$. In particular, we have that
$$f(w^k - \alpha \nabla f(w^k)) = f(w^k) - \alpha \|\nabla f(w^k)\|_2^2 + \mathcal{O}(\alpha^2).$$
Since $\|\nabla f(w^k)\|_2^2 > 0$ and the $\mathcal{O}(\alpha^2)$ term vanishes faster than $\alpha \|\nabla f(w^k)\|_2^2$ as $\alpha \to 0$, we conclude that for some sufficiently small $\alpha$ we have that $f(w^{k+1}) < f(w^k)$.
In classical optimization, $\alpha$ is often referred to as the step size (in this case $-\nabla f(w^k)$ is the search direction). However, in machine learning $\alpha$ is typically referred to as the learning rate. In numerical optimization there are good strategies for picking $\alpha$ in an adaptive manner at each step. However, they can be more expensive than a fixed strategy for setting $\alpha$. The catch is that setting $\alpha$ too small can lead to slow convergence and setting $\alpha$ too large can actually lead to divergence; see fig. 3. A safe choice that guarantees convergence is to set $\alpha = c/k$ at iteration $k$ for any constant $c > 0$.
Figure 3: Choices of the step size in gradient descent that lead to convergence (left) or divergence (right).
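For concreteness, here is a minimal gradient descent sketch in NumPy; the test function, the fixed learning rate, and the iteration cap are arbitrary illustrative choices, not prescribed by the notes.

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha=0.05, max_iters=1000, tol=1e-8):
    """Minimal gradient descent sketch: w_{k+1} = w_k - alpha * grad_f(w_k)."""
    w = np.array(w0, dtype=float)
    for k in range(max_iters):
        g = grad_f(w)
        if np.linalg.norm(g) <= tol:          # crude convergence check on the gradient
            break
        # A fixed step is used here; a diminishing schedule like alpha / (k + 1)
        # is the "safe" choice mentioned above.
        w = w - alpha * g
    return w

# Illustrative use on f(w) = ||A w - b||_2^2, whose gradient is 2 A^T (A w - b).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
w_hat = gradient_descent(lambda w: 2 * A.T @ (A @ w - b), w0=np.zeros(2))
```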
Adagrad
One related strategy is to set the step size per entry of the gradient, though this is actually best thought of as modifying the step direction on a per-entry basis. Adagrad (technically, diagonal Adagrad) accomplishes this by keeping a running sum of the squared gradient with respect to each optimization variable. It then sets a small learning rate for variables with large gradients and a large learning rate for variables with small gradients. This can be important if the entries of $w$ are attached to features (e.g., in logistic regression we can associate each entry of $w$ with a feature) that vary in scale or frequency.
Input: parameter $\delta$ and initial learning rate $\eta$
Set $w^0$, $r_i = 0$ for $i = 1, \ldots, d$, and $k = 0$;
While not converged:
Compute entries of the gradient $g_i = [\nabla f(w^k)]_i$ for $i = 1, \ldots, d$
$r_i \leftarrow r_i + g_i^2$ for $i = 1, \ldots, d$
$w_i^{k+1} = w_i^k - \frac{\eta}{\delta + \sqrt{r_i}} g_i$ for $i = 1, \ldots, d$
Check for convergence; if converged set $w^\star = w^{k+1}$; otherwise set $k = k + 1$
Return: $w^\star$
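A compact NumPy sketch of the diagonal Adagrad update above; the default values of the learning rate and the stability parameter are illustrative, not prescribed by the notes.

```python
import numpy as np

def adagrad(grad_f, w0, eta=0.1, delta=1e-8, max_iters=1000, tol=1e-8):
    """Diagonal Adagrad sketch: per-entry steps scaled by accumulated squared gradients."""
    w = np.array(w0, dtype=float)
    r = np.zeros_like(w)                      # running sum of squared gradient entries
    for _ in range(max_iters):
        g = grad_f(w)
        if np.linalg.norm(g) <= tol:          # crude convergence check
            break
        r += g * g                            # accumulate squared gradients per entry
        w = w - (eta / (delta + np.sqrt(r))) * g   # entrywise scaled step
    return w
```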
Newton’s method
In gradient descent we used first order information about $f$ at $w^k$ to determine which direction to move. A natural follow-up is then to use second order information; this leads to Newton’s method. Specifically, we now choose a step by explicitly minimizing the quadratic approximation to $f$ at $w^k$. Recall that because $f$ is convex, $\nabla^2 f(w)$ is positive semi-definite for all $w$, so this is a sensible thing to attempt.4 In fact, Newton’s method has very good properties in the neighborhood of a strict local minimizer and once close enough to a solution it converges rapidly.
To derive the Newton step consider the quadratic approximation
$$f(w^k + p) \approx f(w^k) + \nabla f(w^k)^T p + \frac{1}{2} p^T \nabla^2 f(w^k) p.$$
Since this approximation is a convex quadratic in $p$ we can pick $p^k$ such that $w^k + p^k$ is a local minimizer of the approximation. To accomplish this we explicitly solve for the minimizer by differentiating and setting the gradient equal to zero. For simplicity, let’s assume for the moment that $\nabla^2 f(w^k)$ is positive definite.5 Since the gradient of our quadratic approximation (as a function of $p$) is $\nabla f(w^k) + \nabla^2 f(w^k) p$, this implies that $p^k$ solves the linear system $\nabla^2 f(w^k) p^k = -\nabla f(w^k)$. In practice there are many approaches that can be used to solve this system, but it is important to note that, relative to gradient descent, Newton’s method is more expensive as we have to compute the Hessian and solve a linear system.
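A minimal NumPy sketch of the resulting iteration, assuming callables grad_f and hess_f for the gradient and Hessian (the names and default parameters are illustrative choices):

```python
import numpy as np

def newtons_method(grad_f, hess_f, w0, max_iters=50, tol=1e-10):
    """Newton's method sketch: solve hess_f(w) p = -grad_f(w) and step to w + p."""
    w = np.array(w0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(w)
        if np.linalg.norm(g) <= tol:
            break
        p = np.linalg.solve(hess_f(w), -g)    # Newton step; assumes the Hessian is positive definite
        w = w + p
    return w
```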
A simple example
While we will discuss the pros and cons of Newton’s method in the next section, there is a simple example that clearly illustrates how incorporating second order information can help. Pretend the function were actually a strictly convex quadratic, i.e., $f(w) = \frac{1}{2} w^T A w - b^T w + c$, where $A$ is a positive definite matrix, $b$ is an arbitrary vector, and $c$ is some number. In this case, Newton’s method converges in one step (since the strict global minimizer of $f$ is the unique solution to $A w = b$).
Meanwhile, gradient descent yields the sequence of iterates $w^{k+1} = w^k - \alpha (A w^k - b)$. Using the fact that $b = A w^\star$ we can see that $w^{k+1} - w^\star = (I - \alpha A)(w^k - w^\star)$. Therefore, as long as $\alpha$ is small enough such that all the eigenvalues of $I - \alpha A$ are in $(-1, 1)$ the iteration will converge, albeit slowly if we have eigenvalues close to $1$ in magnitude. More generally, fig. 4 shows an example where we see the accelerated convergence of Newton’s method as we approach the local minimizer.
Figure 4: Comparison of gradient descent and Newton’s method; in this case the function and starting point are favorable to Newton’s method.
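The following NumPy sketch illustrates this contrast on a small, arbitrarily chosen positive definite quadratic; it is an illustration, not code from the notes.

```python
import numpy as np

# Arbitrarily chosen positive definite quadratic f(w) = 0.5 w^T A w - b^T w.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda w: A @ w - b
w_star = np.linalg.solve(A, b)               # exact minimizer solves A w = b

w0 = np.array([5.0, -5.0])

# Newton's method: one step from any starting point lands exactly on w_star.
w_newton = w0 + np.linalg.solve(A, -grad(w0))
print(np.allclose(w_newton, w_star))         # True

# Gradient descent: the error is multiplied by (I - alpha * A) each iteration.
alpha, w = 0.1, w0.copy()
for _ in range(200):
    w = w - alpha * grad(w)
print(np.linalg.norm(w - w_star))            # small, but only after many iterations
```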
Potential issues
While Newton’s method converges very quickly once $w^k$ is sufficiently close to $w^\star$, that does not mean it always converges (see fig. 5). In fact, you can set up examples where it diverges or enters a limit cycle. This typically means the steps are taking you far from where the quadratic approximation is valid. Practically, a fix for this is to introduce a step size $\eta$ and formally set $w^{k+1} = w^k + \eta p^k$, where $p^k$ solves $\nabla^2 f(w^k) p^k = -\nabla f(w^k)$. We typically start with $\eta = 1$ since that is the proper step to take if the quadratic is a good approximation. However, if this step seems poor (e.g., $f(w^k + \eta p^k) > f(w^k)$) then we can decrease $\eta$ at that iteration.
Figure 5: Example where Newton’s method converges or diverges depending on the starting point.
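One simple way to implement this safeguard is a backtracking loop that shrinks the step until the function value decreases; the halving factor and the cap on backtracking attempts below are illustrative choices, not part of the notes.

```python
import numpy as np

def damped_newton_step(f, grad_f, hess_f, w):
    """One Newton step with a crude backtracking safeguard on the step size."""
    g = grad_f(w)
    p = np.linalg.solve(hess_f(w), -g)        # full Newton direction
    eta = 1.0                                 # start with the full step
    for _ in range(30):                       # halve eta until f decreases (or give up)
        if f(w + eta * p) < f(w):
            break
        eta *= 0.5
    return w + eta * p
```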
Another issue that we sidestepped is that when $f$ is convex but not strictly convex we can only assume that $\nabla^2 f(w^k)$ is positive semi-definite. In principle this means that the linear system $\nabla^2 f(w^k) p^k = -\nabla f(w^k)$ may not have a solution. Or, if it is solvable but $\nabla^2 f(w^k)$ has a zero eigenvalue, then we have infinitely many solutions. A “quick fix” is to simply solve $(\nabla^2 f(w^k) + \lambda I) p^k = -\nabla f(w^k)$ instead for some small parameter $\lambda > 0$. This lightly regularizes the quadratic approximation to $f$ at $w^k$ and ensures it has a strict global minimizer.6
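A one-line sketch of this regularized solve, with an illustrative (not prescribed) value of $\lambda$:

```python
import numpy as np

def regularized_newton_step(g, H, lam=1e-6):
    """Solve (H + lam * I) p = -g; a small lam keeps the system well posed when H is only PSD."""
    return np.linalg.solve(H + lam * np.eye(H.shape[0]), -g)
```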
Checking for convergence
In the above meta algorithm we simply defined the sequence of iterates $w^1, w^2, \ldots$. However, even in exact arithmetic it is unlikely that any iterate is exactly a local minimizer $w^\star$; if that somehow happens we just got lucky or had a very special function. Rather, all we hope for is $w^k \to w^\star$ as $k \to \infty$. Therefore, we need rules to determine when to stop. Doing so can be delicate in practice since we don’t know what the function looks like. Some common factors that go into assessing convergence include:
Relative change in the iterates, i.e., $\|w^{k+1} - w^k\|_2 \leq \tau \|w^k\|_2$ for some small tolerance $\tau > 0$.
A reasonably small gradient, i.e., $\|\nabla f(w^{k+1})\|_2 \leq \tau$ for some small $\tau > 0$.
Both strategies have their failure modes, so in practice composite conditions are often used to ensure convergence has been reached.
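For instance, one might require both tests above to pass before stopping; the sketch below (with illustrative tolerance values, not prescribed by the notes) combines a small relative change with a small gradient.

```python
import numpy as np

def converged(w_new, w_old, grad_new, tol_step=1e-8, tol_grad=1e-6):
    """Composite stopping test: small relative change in the iterates AND a small gradient."""
    small_step = np.linalg.norm(w_new - w_old) <= tol_step * max(1.0, np.linalg.norm(w_old))
    small_grad = np.linalg.norm(grad_new) <= tol_grad
    return small_step and small_grad
```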
Best practices and extensions
While we are covering a reasonably narrow slice of numerical optimization here (see, e.g., CS 4220 for a more thorough treatment), there are a few miscellaneous points worth stressing. Moreover, as illustrated in the class demo
While Newton’s method has very good local convergence properties, it is considerably more expensive than gradient descent in practice: we have to both compute the Hessian matrix and solve a linear system at each step. Therefore, a lot of time has been devoted to the construction of so-called quasi-Newton methods. Generically, these methods compute a step by solving the linear system $B_k p^k = -\nabla f(w^k)$, where $B_k$ is some sort of approximation to the Hessian that is easier to compute and yields a linear system that is easier to solve.
For example, we could let $B_k$ be the diagonal of the Hessian, i.e., $[B_k]_{i,i} = [\nabla^2 f(w^k)]_{i,i}$ and $[B_k]_{i,j} = 0$ if $i \neq j$. Now we only have to compute $d$ entries of the Hessian, and solving the associated linear system only requires computing $p_i^k = -[\nabla f(w^k)]_i / [B_k]_{i,i}$ for $i = 1, \ldots, d$. In fact, Adagrad can be viewed through this lens, where $B_k$ is a diagonal matrix with $[B_k]_{i,i} = (\delta + \sqrt{r_i})/\eta$. These are only some simple examples; there are many more.
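A sketch of this diagonal approximation, assuming the diagonal Hessian entries are available and positive (the function name is chosen here for illustration):

```python
import numpy as np

def diagonal_newton_step(g, hess_diag):
    """Quasi-Newton sketch with B_k = diag(Hessian): p_i = -g_i / [B_k]_{ii} (assumes positive entries)."""
    return -g / hess_diag
```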
Sometimes we can mix methods: for example, we could start with gradient descent to try to get near a local minimizer and then switch to Newton’s method to rapidly converge. This is cheaper than the idea above where we introduced a step size, since we omit even computing the Hessian until we think it may help us. There are ways to formalize these sorts of ideas that we will not cover here.
All the methods we discuss here are also applicable to nonconvex problems. However, in that setting we only look for local solutions and recognize that when/if we find one it may not be the global minimizer of our function.
If we further assume our function is strictly convex then we can say similar things about strict local and global minimizers.
Here, we omit an explicit discussion of error terms, but they are sometimes necessary in the analysis of optimization methods.
This can also be a sensible thing to do when $f$ is not convex. However, some modifications have to be made since the best quadratic approximation to $f$ may not be a convex quadratic and, therefore, may not have a finite local minimizer. This often takes the form of a modification of the Hessian to get a convex quadratic we can minimize.
Formally speaking this is not necessary. If $\nabla f(w^k)$ is in the range of $\nabla^2 f(w^k)$ and $\nabla^2 f(w^k)$ has a zero eigenvalue (so there are infinitely many solutions), we can simply pick the solution of minimal norm. This moves us to the local minimizer of the quadratic approximation closest to $w^k$. Similarly, if $\nabla f(w^k)$ is not in the range of $\nabla^2 f(w^k)$ then we can just take a gradient step. In fact, if $\nabla f(w^k)$ is not in the range of $\nabla^2 f(w^k)$ it implies that the quadratic has no finite minimizer, so a gradient step is a sensible thing to do. Moreover, if $f$ has a strict local minimizer then $\nabla^2 f(w^k)$ is guaranteed to be positive definite once $w^k$ is close enough to the minimizer and this all becomes a moot point.