Optimality conditions
We will frequently phrase model fitting problems in terms of optimization, which means we need to remember how optimality conditions work. This includes not only remembering first-order and second-order optimality conditions for derivatives, but also knowing a little about optimality conditions for non-differentiable functions. The key concept behind optimality conditions is to approximate a function locally by a polynomial, where it usually suffices to go out to first or second order.
Minima and maxima
Let $f : \Omega \subseteq \mathcal{V} \rightarrow \mathbb{R}$ be a continuous function. The set $\Omega$ is the feasible set. We say the problem of minimizing or maximizing $f$ is an unconstrained optimization problem when $\Omega = \mathcal{V}$ (all points in the space are feasible). Otherwise, it is a constrained optimization problem.
We say $x_* \in \Omega$ is a local minimizer if any $x \in \Omega$ close enough to $x_*$ satisfies $f(x) \geq f(x_*)$ (more precisely: there is some $\epsilon > 0$ so that for all $x \in \Omega$ within $\epsilon$ of $x_*$, $f(x) \geq f(x_*)$). The value $f(x_*)$ is the local minimum value. For a strong local minimizer, the inequality is strict: any $x \neq x_*$ in a neighborhood of $x_*$ satisfies $f(x) > f(x_*)$. We say $x_*$ is a global minimizer if $f(x) \geq f(x_*)$ for all $x \in \Omega$, and a strong global minimizer if $f(x) > f(x_*)$ for all $x \in \Omega$ other than $x_*$.
We can define a local maximizer or a strong local maximizer analogously, but we usually focus on the case of minimization rather than maximization (and in any event, maximizing $f$ is the same as minimizing $-f$).
When the feasible set $\Omega$ is compact (for finite-dimensional vector spaces, this means closed and bounded), we are guaranteed that there is a global minimizer in $\Omega$. In other cases, $f$ may not be bounded below, or it may be bounded below but with no point in $\Omega$ where the greatest lower bound (the infimum) is achieved.
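As a concrete numeric sketch of this distinction (the function $e^{-x}$ here is an illustrative choice, not one discussed above): $e^{-x}$ is bounded below by zero on the unbounded set $[0, \infty)$, yet no point attains the infimum, while on a compact interval the minimum is attained.

```python
import numpy as np

# f(x) = exp(-x) is bounded below by 0 on [0, inf), but no x attains
# the infimum: the grid minimum keeps shrinking as the domain grows.
f = lambda x: np.exp(-x)
for upper in [10.0, 100.0]:
    grid = np.linspace(0.0, upper, 10001)
    print(f(grid).min())  # decreases toward 0, never reaching it

# On the compact set [0, 1], continuity guarantees the minimum is
# attained; here it occurs at the right endpoint x = 1.
x = np.linspace(0.0, 1.0, 10001)
print(x[np.argmin(f(x))])  # 1.0
```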
Derivative and gradient
If $f$ is differentiable, a stationary point is a point $x$ such that $f'(x) = 0$. At a non-stationary point, there is guaranteed to be some direction $u$ such that $f'(x) u < 0$ (and $f'(x)(-u) > 0$), so that $f$ can neither attain an (unconstrained) minimum nor maximum at that point. Stationary points can be minima, maxima, or saddles; we usually classify them as such by the second derivative test.
In an inner product space, the gradient $\nabla f(x)$ (the Riesz map of the derivative $f'(x)$) gives us the “direction of steepest ascent,” and the negative gradient gives us the direction of steepest descent. It is important to realize that this direction depends on the inner product used! Moreover, the concept of steepest ascent or descent generalizes to other normed linear spaces where we do not assume an inner product: all we need is the notion of the change in the function value relative to the distance from a starting point. For example, if $\mathcal{V}$ is $\mathbb{R}^n$ with the $\ell^\infty$ norm, then the direction of steepest descent (or a direction, if $\nabla f(x)$ has some zero components) is given by a vector $u$ of $\pm 1$ entries where $u_j = -\operatorname{sign}(\partial f/\partial x_j)$; here $u$ is the vector of unit norm such that $-f'(x) u$ is maximal.
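To make the norm dependence concrete, the sketch below (the gradient vector $g$ is an illustrative choice) compares the steepest descent directions under the $\ell^2$ and $\ell^\infty$ norms for a linear model $f(x + u) \approx f(x) + g \cdot u$:

```python
import numpy as np

# Steepest descent for the local linear model f(x + u) ~ f(x) + g @ u.
# The "steepest" unit direction depends on the norm measuring unit length.
g = np.array([3.0, -1.0, 0.5])   # illustrative gradient at x

u_l2 = -g / np.linalg.norm(g)    # l2 unit ball is a sphere: rescaled -g
u_linf = -np.sign(g)             # l-inf unit ball is a cube: a corner of +/-1s

# Both are unit vectors in their respective norms.
assert np.isclose(np.linalg.norm(u_l2), 1.0)
assert np.isclose(np.abs(u_linf).max(), 1.0)

# Each minimizes g @ u over its own unit ball:
# g @ u_l2 = -||g||_2 and g @ u_linf = -||g||_1.
print(g @ u_l2, g @ u_linf)
```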
Second derivative test
When $f$ is twice differentiable, we can characterize the function near a stationary point $x$ by the second derivative: the point $x$ is a local minimizer if the quadratic form given by $f''(x)$ is positive definite and a local maximizer if $f''(x)$ is negative definite. If the Hessian is semidefinite, we need to look at higher derivatives to classify $x$. If $f''(x)$ is strongly indefinite (i.e. the inertia has nonzero positive and negative components), then $x$ is a saddle point.
For computation, of course, we usually express $f''$ with respect to a basis. In this case, we describe the second derivative test in terms of the inertia of the Hessian matrix $H$ with entries $h_{ij} = \partial^2 f / \partial x_i \, \partial x_j$, i.e. the counts of positive, zero, and negative eigenvalues.
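The inertia computation can be done by counting eigenvalue signs of the symmetric Hessian. The helper `inertia` and the example matrices below are an illustrative sketch, not library routines:

```python
import numpy as np

def inertia(H, tol=1e-12):
    """Return (#positive, #zero, #negative) eigenvalues of symmetric H."""
    w = np.linalg.eigvalsh(H)
    return (int((w > tol).sum()),
            int((np.abs(w) <= tol).sum()),
            int((w < -tol).sum()))

H_min = np.array([[2.0, 0.0], [0.0, 1.0]])      # positive definite: local min
H_saddle = np.array([[2.0, 0.0], [0.0, -1.0]])  # indefinite: saddle point
print(inertia(H_min))     # (2, 0, 0)
print(inertia(H_saddle))  # (1, 0, 1)
```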
Constraints and cones
We have described the first and second derivative tests in the context of unconstrained optimization, but the general approach is equally sensible for constrained optimization problems. We generally define the feasible domain in terms of a set of constraints, equations and inequalities involving functions $c_i$ that (for our purposes) are at least continuously differentiable:
\[ \Omega = \{ x \in \mathcal{V} : c_i(x) = 0 \text{ for } i \in \mathcal{E} \text{ and } c_i(x) \leq 0 \text{ for } i \in \mathcal{I} \}. \]
We say the $i$th inequality constraint is active at $x \in \Omega$ if $c_i(x) = 0$. We write the active set at $x$ as
\[ \mathcal{A}(x) = \{ i \in \mathcal{I} : c_i(x) = 0 \}. \]
We do not worry about trying to classify the equality constraints, though in a sense they are always active.
Tangent cone
The first derivative test tells us that if $f$ is locally minimized at $x \in \Omega$, then a local linear model should not predict a nearby point with smaller function values. That is, there should not be a direction $u$ with $f'(x) u < 0$ and a (differentiable) feasible path $\gamma : [0, \epsilon) \rightarrow \Omega$ with $\gamma(0) = x$ and $\gamma'(0) = u$, or we will have $f(\gamma(s)) < f(x)$ for all sufficiently small $s > 0$. The set of directions $u$ that are tangent to some feasible path at $x$ is known as the tangent cone at $x$. A function $f$ satisfies a (first-order) geometric optimality condition at $x$ if
\[ \forall u \in T_\Omega(x), \quad f'(x) u \geq 0, \]
where $T_\Omega(x)$ denotes the tangent cone. Equivalently, $-f'(x) \in T_\Omega(x)^\circ$, where $T_\Omega(x)^\circ$ is the polar cone of $T_\Omega(x)$, i.e.
\[ T_\Omega(x)^\circ = \{ w \in \mathcal{V}^* : \forall u \in T_\Omega(x), \, wu \leq 0 \}. \]
A local minimizer must satisfy this condition, though not all points that satisfy this condition need be local minimizers.
Linearized cone
The trouble with the geometric optimality condition is that we somehow need to characterize the tangent cone at each feasible point. Because we are characterizing the feasible set in terms of differentiable constraint functions, it is tempting to try to characterize feasible directions via the derivatives of the constraints by looking at the linearized cone
\[ L(x) = \{ u \in \mathcal{V} : c_i'(x) u = 0 \text{ for } i \in \mathcal{E} \text{ and } c_i'(x) u \leq 0 \text{ for } i \in \mathcal{A}(x) \}. \]
It is true that $T_\Omega(x) \subseteq L(x)$, and the polars satisfy $L(x)^\circ \subseteq T_\Omega(x)^\circ$. Unfortunately, the linearized cone can be strictly bigger than the tangent cone. This is true even in 1D. For example, consider the domain $\{0\}$ written as $\{x \in \mathbb{R} : x^2 \leq 0\}$. The tangent cone at $0$ is $\{0\}$, but the linearized cone is all of $\mathbb{R}$, since the derivative of $x \mapsto x^2$ vanishes at $0$. Hence, if we seek to minimize $f(x) = x$ subject to $x^2 \leq 0$, the point $x = 0$ satisfies the geometric optimality condition, but the condition is not satisfied if we replace the tangent cone with the linearized cone.
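The gap between the two cones for the constraint $x^2 \leq 0$ can be checked mechanically; the snippet below is an illustrative sketch of why the linearized cone overshoots:

```python
# The feasible set {x in R : c(x) = x^2 <= 0} is just {0}; its tangent
# cone at 0 is {0}.  But c'(0) = 0, so the linearized cone
# {u : c'(0) u <= 0} is ALL of R: strictly larger than the tangent cone.
c_prime = lambda x: 2.0 * x
assert c_prime(0.0) == 0.0       # every u satisfies c'(0) u <= 0

# Minimizing f(x) = x over {0}: x = 0 is trivially optimal, yet the
# linearized-cone test fails, since f'(0) u = u < 0 for u = -1.
f_prime = 1.0                    # f'(0) for f(x) = x
u = -1.0
assert c_prime(0.0) * u <= 0.0   # u lies in the linearized cone...
assert f_prime * u < 0.0         # ...but the linear model predicts descent
```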
Constraint qualification
A constraint qualification condition is a hypothesis on the constraints that guarantees that $L(x)^\circ = T_\Omega(x)^\circ$, so that we can use the linearized cone in lieu of the tangent cone in the geometric optimality condition. Two common examples are the linearly independent constraint qualification (LICQ), which holds at $x$ if
\[ \{ c_i'(x) : i \in \mathcal{E} \cup \mathcal{A}(x) \} \text{ is linearly independent,} \]
and the Mangasarian-Fromovitz constraint qualification (MFCQ), which holds at $x$ if
\[ \{ c_i'(x) : i \in \mathcal{E} \} \text{ is linearly independent and } \exists u : c_i'(x) u = 0 \text{ for } i \in \mathcal{E}, \; c_i'(x) u < 0 \text{ for } i \in \mathcal{A}(x). \]
Note that our example of the constraint $x^2 \leq 0$ satisfies neither LICQ nor MFCQ at $x = 0$.
Multipliers and KKT
We have already seen one theorem of alternatives in our discussion of linear algebra: the Fredholm alternative theorem, which deals with solvability of linear equations. There are many other theorems of alternatives for dealing with inequalities, associated with mathematicians like Motzkin, Kuhn, and Farkas. One such theorem is Farkas' lemma: if $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^n$, then exactly one of the two statements is true:
- There exists $x \in \mathbb{R}^n$ such that $Ax \leq 0$ and $b^T x > 0$.
- There exists $y \in \mathbb{R}^m$ such that $y \geq 0$ and $A^T y = b$.
Here the inequalities are taken elementwise.
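One way to see Farkas' lemma in action is to phrase the second alternative (a nonnegative $y$ with $A^T y = b$) as a linear-programming feasibility problem. The sketch below uses `scipy.optimize.linprog` with a zero objective; the helper name `farkas_case2_feasible` and the example data are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def farkas_case2_feasible(A, b):
    """Is there y >= 0 with A^T y = b?  Checked as an LP with zero objective."""
    m = A.shape[0]
    res = linprog(c=np.zeros(m), A_eq=A.T, b_eq=b, bounds=[(0, None)] * m)
    return res.status == 0  # status 0: a feasible point was found

A = np.eye(2)
print(farkas_case2_feasible(A, np.array([1.0, 1.0])))   # y = b works
print(farkas_case2_feasible(A, np.array([-1.0, 0.0])))  # would need y_1 < 0

# When the second alternative fails, a certificate for the first exists:
# here x = (-1, 0) gives A x <= 0 and b^T x = 1 > 0.
```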
Using Farkas' lemma, one can rewrite the polar of the linearized cone as
\[ L(x)^\circ = \Bigl\{ \sum_{i \in \mathcal{E}} \lambda_i \, c_i'(x) + \sum_{i \in \mathcal{A}(x)} \mu_i \, c_i'(x) : \mu_i \geq 0 \Bigr\}. \]
It is usually convenient to define $\mu_i = 0$ for inequality constraints that are inactive; in this case, we can rewrite
\[ L(x)^\circ = \{ \lambda \, c_E'(x) + \mu \, c_I'(x) : \mu \geq 0 \text{ and } \mu_i c_i(x) = 0 \text{ for all } i \in \mathcal{I} \}, \]
where $c_E$ and $c_I$ are (column) vectors of equality and inequality constraint functions and $\lambda$ and $\mu$ are concrete row vectors. The statement that $\mu_i c_i(x) = 0$ for all $i \in \mathcal{I}$, typically called a complementary slackness condition, is equivalent to saying that inactive inequality constraints (where $c_i(x) < 0$) must be associated with zero multipliers.
With this machinery in place, we can now rewrite the condition $-f'(x) \in L(x)^\circ$ in the form that we usually use. We define the Lagrangian function
\[ \mathcal{L}(x, \lambda, \mu) = f(x) + \lambda \, c_E(x) + \mu \, c_I(x). \]
The variables $\lambda$ and $\mu$ that multiply the constraints are known as Lagrange multipliers. The Karush-Kuhn-Tucker (KKT) conditions at $x$, equivalent to $-f'(x) \in L(x)^\circ$ together with feasibility of $x$, are
\[ \begin{aligned}
\frac{\partial \mathcal{L}}{\partial x}(x, \lambda, \mu) &= 0 && \text{(stationarity)} \\
c_E(x) = 0, \quad c_I(x) &\leq 0 && \text{(primal feasibility)} \\
\mu &\geq 0 && \text{(dual feasibility)} \\
\mu_i c_i(x) &= 0 \text{ for all } i \in \mathcal{I} && \text{(complementary slackness)}.
\end{aligned} \]
When a constraint qualification condition holds, the KKT conditions are necessary first-order conditions for optimality at $x$.
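For an equality-constrained quadratic example (chosen purely for illustration), the KKT conditions reduce to a linear system in the primal variables and the multiplier:

```python
import numpy as np

# Minimize x^2 + y^2 subject to x + y - 1 = 0.  Stationarity of
# L(x, y, lam) = x^2 + y^2 + lam*(x + y - 1) plus primal feasibility
# gives a 3x3 linear system in (x, y, lam).
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(K, rhs)
print(x, y, lam)  # 0.5 0.5 -1.0

# Check the KKT conditions directly.
assert np.allclose([2*x + lam, 2*y + lam], 0.0)  # stationarity
assert np.isclose(x + y - 1.0, 0.0)              # primal feasibility
```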
Multipliers and adjoints
Now suppose that we wanted to find a minimum (or maximum or other stationary point) of $f(x)$ subject to an equation constraint of the type we saw in Section 5.4.7, say $g(x) = 0$. The Lagrangian for this system is
\[ \mathcal{L}(x, \lambda) = f(x) + \lambda^T g(x), \]
where $x$ are the primal variables and $\lambda$ are multipliers. Stationarity gives
\[ \frac{\partial \mathcal{L}}{\partial x} = f'(x) + \lambda^T g'(x) = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = g(x)^T = 0. \]
This is precisely the set of equations that we saw in Section 5.4.7, except that where we said “dual variables” in Section 5.4.7, here we are saying (negative) “Lagrange multipliers.”
Constrained second derivatives
The second-order geometric optimality condition for a twice-differentiable $f$ at $x \in \Omega$ is a straightforward extension of the first-order geometric optimality condition. If $x$ satisfies the first-order condition and we satisfy that
\[ \forall u \in T_\Omega(x) \setminus \{0\} \text{ with } f'(x) u = 0, \quad f''(x)(u, u) > 0, \]
then $x$ is a constrained local minimum. If there are nonzero directions $u \in T_\Omega(x)$ where $f'(x) u = 0$ and $f''(x)(u, u) < 0$, then $x$ is not a constrained local minimum. Otherwise, we cannot determine minimality without looking at higher derivatives.
As with the geometric first-order condition, the geometric second-order condition is complicated by the need to get a handle on the tangent cone. Assuming that we satisfy the linearly independent constraint qualification condition at $x$, we have a more straightforward version of the second derivative test. Let
\[ \mathcal{C}(x) = \{ u \in L(x) : f'(x) u = 0 \} \]
be the set of linearized directions that are flat to first order. Then a sufficient condition for $x$ to be a strong constrained local minimizer is
\[ \forall u \in \mathcal{C}(x) \setminus \{0\}, \quad \frac{\partial^2 \mathcal{L}}{\partial x^2}(x, \lambda, \mu)(u, u) > 0. \]
That is, the Hessian should be positive definite in all directions that are not already “uphill” at first order. Such a condition is known as conditional positive definiteness, since the Hessian is positive definite conditioned on some constraints on the directions considered. If there are such directions for which the quadratic form is negative, then we do not have a local minimizer; otherwise, we may need to consider higher derivatives to make a diagnosis.
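A common computational version of this test restricts the Hessian to the null space of the active constraint Jacobian (a “reduced Hessian”) and checks its eigenvalues. The example below is an illustrative sketch: a saddle of the objective becomes a constrained minimizer once a constraint removes the downhill direction.

```python
import numpy as np
from scipy.linalg import null_space

# f(x, y) = x^2 - y^2 has a saddle at the origin, but restricted to the
# constraint c(x, y) = y = 0 the origin is a strong local minimizer:
# the Hessian is positive definite on directions with c'(0) u = 0.
H = np.diag([2.0, -2.0])      # Hessian of the Lagrangian at (0, 0)
J = np.array([[0.0, 1.0]])    # Jacobian of c at (0, 0)

Z = null_space(J)             # basis for directions tangent to the constraint
H_reduced = Z.T @ H @ Z       # reduced (projected) Hessian
print(np.linalg.eigvalsh(H_reduced))  # all positive => constrained min

assert np.all(np.linalg.eigvalsh(H_reduced) > 0)
assert np.any(np.linalg.eigvalsh(H) < 0)  # yet H itself is indefinite
```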
Convexity
A set $\Omega \subseteq \mathcal{V}$ is convex if every line segment between points in $\Omega$ also lies in $\Omega$, i.e.
\[ \forall x, y \in \Omega, \, s \in [0, 1] : (1 - s) x + s y \in \Omega. \]
A function $f : \Omega \rightarrow \mathbb{R}$ is convex if the graph of $f$ on a line segment in $\Omega$ always lies below the secant connecting the endpoint values:
\[ \forall x, y \in \Omega, \, s \in [0, 1] : f((1 - s) x + s y) \leq (1 - s) f(x) + s f(y). \]
We say $f$ is strictly convex if the inequality is strict on the interior of the segment connecting distinct $x$ and $y$:
\[ \forall x \neq y \in \Omega, \, s \in (0, 1) : f((1 - s) x + s y) < (1 - s) f(x) + s f(y). \]
If $f$ is twice differentiable, convexity corresponds to positive semi-definiteness of the Hessian, and a positive definite Hessian implies strict convexity (though strict convexity does not require it: $x \mapsto x^4$ is strictly convex even though its second derivative vanishes at $0$). However, functions can be convex even if they are not twice differentiable.
A convex function $f$ on a closed, bounded set $\Omega$ in finite dimensions is guaranteed to take on its minimum value at some point $x_* \in \Omega$. There are no merely local minimizers: every local minimizer of a convex function is a global minimizer. The set of global minimizers is itself a convex set. If $f$ is strictly convex, then the global minimizer is unique.
Subderivatives
The subderivative of a convex $f$ at $x \in \Omega$ is the set of functionals $w \in \mathcal{V}^*$ such that
\[ f(y) \geq f(x) + w(y - x) \text{ for all } y \in \Omega; \]
that is, the subderivative corresponds to the set of “supporting hyperplanes” that agree with $f$ at $x$ and lie below $f$ elsewhere on $\Omega$. When $\mathcal{V}$ is an inner product space, we say $g$ is in the subgradient at $x$ if $f(y) \geq f(x) + \langle g, y - x \rangle$ whenever $y \in \Omega$.
When $f$ is differentiable at $x$, the only element of the subderivative is the derivative $f'(x)$ (and the only element of the subgradient is the gradient $\nabla f(x)$). However, the concept of a subderivative continues to make sense even for nonsmooth functions. For example, the absolute value function on $\mathbb{R}$ is not differentiable at 0, but has the subgradient $[-1, 1]$ there.
The notion of a subderivative allows us to generalize the usual notion of stationary points for differentiable functions: a point $x$ is a stationary point for a convex $f$ if the subderivative of $f$ at $x$ contains the zero functional, and a minimizer must be a stationary point.
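The subgradient claims for the absolute value can be checked numerically; the grid and the sample slopes below are illustrative choices:

```python
import numpy as np

# For f(x) = |x|, any slope g in [-1, 1] gives a supporting line at 0:
# |y| >= 0 + g*(y - 0) for all y.  Check on a grid of points.
y = np.linspace(-2.0, 2.0, 401)
for g in [-1.0, -0.3, 0.0, 0.7, 1.0]:
    assert np.all(np.abs(y) >= g * y - 1e-12)

# A slope outside [-1, 1] is NOT a subgradient: the line cuts above |x|.
assert not np.all(np.abs(y) >= 1.5 * y)

# Since 0 lies in the subgradient [-1, 1] at x = 0, the origin is a
# stationary point, hence the global minimizer of this convex function.
```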