6  Optimization theory

6.1 Optimality conditions

We will frequently phrase model-fitting problems in terms of optimization, which means we need to remember how optimality conditions work. This includes not only remembering the first-order and second-order conditions expressed via derivatives, but also knowing a little about optimality conditions for non-differentiable functions. The key idea behind optimality conditions is to approximate a function locally by a polynomial, where it usually suffices to go out to first or second order.

6.1.1 Minima and maxima

Let $f : \Omega \subseteq V \to \mathbb{R}$ be a continuous function. The set $\Omega$ is the feasible set. We say the problem of minimizing or maximizing $f$ is an unconstrained optimization problem when $\Omega = V$ (all points in the space are feasible). Otherwise, it is a constrained optimization problem.

We say $x \in \Omega$ is a local minimizer if any $y \in \Omega$ close enough to $x$ satisfies $f(y) \geq f(x)$ (more precisely: there is some $\epsilon > 0$ so that for all $y$ within $\epsilon$ of $x$, $f(y) \geq f(x)$). The value $f(x)$ is the local minimum value. For a strong local minimizer, the inequality is strict: any $y \neq x$ in a neighborhood of $x$ satisfies $f(y) > f(x)$. We say $x$ is a global minimizer if $f(x) \leq f(y)$ for all $y \in \Omega$, and a strong global minimizer if $f(x) < f(y)$ for all $y \in \Omega$ other than $x$.

We can define a local maximizer or a strong local maximizer analogously, but we usually focus on the case of minimization rather than maximization (and in any event, maximizing $f$ is the same as minimizing $-f$).

When the feasible set $\Omega$ is compact (for finite-dimensional vector spaces, this means closed and bounded), we are guaranteed that there is a global minimizer in $\Omega$. In other cases, $f$ may not be bounded below, or it may be bounded below but with no point where the greatest lower bound (the infimum) is achieved.

6.1.2 Derivative and gradient

If $f : V \to \mathbb{R}$ is differentiable, a stationary point is a point $x$ such that $f'(x) = 0$. At a non-stationary point, there is guaranteed to be some direction $u$ such that $f'(x) u > 0$ (and $f'(x)(-u) < 0$), so that $f$ can attain neither an (unconstrained) minimum nor a maximum at that point. Stationary points can be minima, maxima, or saddles; we usually classify them as such by the second derivative test.

In an inner product space, the gradient $\nabla f(x)$ (the Riesz map of $f'(x)$) gives us the “direction of steepest ascent,” and the negative gradient gives us the direction of steepest descent. It is important to realize that this direction depends on the inner product used! Moreover, the concept of steepest ascent or descent generalizes to other normed linear spaces where we do not assume an inner product: all we need is the notion of the change in the function value relative to the distance from a starting point. For example, if $V$ is $\mathbb{R}^n$ with the $\ell^\infty$ norm, then the direction of steepest descent (or a direction, if $f'(x)$ has some zero components) is given by $-s$ where $s_j = \operatorname{sign}(f'(x)_j)$; here $s$ is the vector of unit $\ell^\infty$ norm such that $f'(x) s$ is maximal.
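As a concrete sketch (in Python with NumPy; the function name is our own), the $\ell^\infty$ steepest descent direction is just the negated sign pattern of the gradient:

```python
import numpy as np

def steepest_descent_linf(grad):
    """Steepest descent direction for the l-infinity norm.

    Among all s of unit l-infinity norm, s_j = sign(grad_j) maximizes
    the directional derivative grad @ s, so -s is steepest descent.
    """
    return -np.sign(grad)

grad = np.array([3.0, -0.5, 0.0])
d = steepest_descent_linf(grad)   # components lie in {-1, 0, 1}
```

Note that zero gradient components leave the corresponding components of $s$ free, which is why this is only “a” direction of steepest descent in general.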

6.1.3 Second derivative test

When $f : V \to \mathbb{R}$ is twice differentiable, we can characterize the function near a stationary point $x$ by the second derivative:
$$f(x+u) = f(x) + \frac{1}{2} f''(x)(u \otimes u) + o(\|u\|^2).$$
The point $x$ is a local minimizer if the quadratic form given by $f''(x)$ is positive definite and a local maximizer if $f''(x)$ is negative definite. If the Hessian is semidefinite, we need to look at higher derivatives to classify $x$. If $f''(x)$ is strongly indefinite (i.e., the inertia has nonzero positive and negative components), then $x$ is a saddle point.

For computation, of course, we usually express $f''(x)$ with respect to a basis. In this case, we describe the second derivative test in terms of the inertia of the Hessian matrix $H_f(x)$.
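A minimal sketch of this test (Python with NumPy; the function name and tolerance are our own choices) reads the inertia off the eigenvalues of a symmetric Hessian:

```python
import numpy as np

def classify_stationary(H, tol=1e-10):
    """Classify a stationary point from the inertia of a symmetric Hessian."""
    evals = np.linalg.eigvalsh(H)
    if np.all(evals > tol):
        return "local minimizer"   # positive definite
    if np.all(evals < -tol):
        return "local maximizer"   # negative definite
    if np.any(evals > tol) and np.any(evals < -tol):
        return "saddle point"      # strongly indefinite
    return "indeterminate"         # semidefinite: need higher derivatives

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at its stationary point (0, 0).
H = np.diag([2.0, -2.0])
```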

6.1.4 Constraints and cones

We have described the first and second derivative tests in the context of unconstrained optimization, but the general approach is equally sensible for constrained optimization problems. We generally define the feasible domain $\Omega$ in terms of a set of constraints, equations and inequalities involving functions $c_i : V \to \mathbb{R}$ that (for our purposes) are at least continuously differentiable:
$$c_i(x) = 0, \; i \in \mathcal{E}, \qquad c_i(x) \leq 0, \; i \in \mathcal{I}.$$
We say the $i$th inequality constraint is active at $x$ if $c_i(x) = 0$. We write the active set at $x$ as
$$\mathcal{A}(x) = \{ i \in \mathcal{I} : c_i(x) = 0 \}.$$
We do not worry about trying to classify the equality constraints, though in a sense they are always active.

6.1.4.1 Tangent cone

The first derivative test tells us that if $f$ is locally minimized at $x$, then a local linear model should not predict a nearby point with smaller function values. That is, there should not be $f'(x) u < 0$ and a (differentiable) feasible path $\gamma : [0, \tau) \to V$ with $\gamma(0) = x$ and $\gamma'(0) = u$, or we will have
$$f(\gamma(\epsilon)) = f(x) + \epsilon f'(x) u + o(\epsilon) < f(x)$$
for all sufficiently small $\epsilon > 0$. The set of directions $u$ that are tangent to some feasible path at $x$ is known as the tangent cone at $x$. A function $f$ satisfies a (first-order) geometric optimality condition at $x$ if
$$\forall u \in T(x), \quad f'(x) u \geq 0,$$
where $T(x)$ denotes the tangent cone. Equivalently, $-f'(x) \in T(x)^\circ$, where $T(x)^\circ$ is the polar cone of $T(x)$, i.e.
$$T(x)^\circ = \{ w^* \in V^* : \forall u \in T(x), \; w^* u \leq 0 \}.$$
A local minimizer must satisfy this condition, though not all points that satisfy this condition need be local minimizers.

6.1.4.2 Linearized cone

The trouble with the geometric optimality condition is that we somehow need to characterize the tangent cone at each feasible point. Because we are characterizing the feasible set in terms of differentiable constraint functions, it is tempting to try to characterize feasible directions via the derivatives of the constraints by looking at the linearized cone
$$\mathcal{L}(x) = \{ u \in V : (\forall i \in \mathcal{E}, \; c_i'(x) u = 0) \wedge (\forall i \in \mathcal{A}(x), \; c_i'(x) u \leq 0) \}.$$
It is true that $T(x) \subseteq \mathcal{L}(x)$, and the polars satisfy $\mathcal{L}(x)^\circ \subseteq T(x)^\circ$. Unfortunately, the linearized cone can be strictly bigger than the tangent cone. This is true even in 1D. For example, consider the domain $x \leq 0$ written as $c(x) = x^3 \leq 0$. The tangent cone at $x = 0$ is $T(0) = \{ u \in \mathbb{R} : u \leq 0 \}$, but the linearized cone is $\mathcal{L}(0) = \mathbb{R}$ (since $c'(0) = 0$). Hence, if we seek to minimize $f(x) = -x$ subject to $x^3 \leq 0$, the point $x = 0$ satisfies the geometric optimality condition, but the condition is not satisfied if we replace the tangent cone with the linearized cone.

6.1.4.3 Constraint qualification

A constraint qualification condition is a hypothesis on the constraints that guarantees that $\mathcal{L}(x)^\circ = T(x)^\circ$, so that we can use the linearized cone in lieu of the tangent cone in the geometric optimality condition. Two common examples are the linearly independent constraint qualification (LICQ), which holds at $x$ if
$$\{ c_i'(x) : i \in \mathcal{A}(x) \cup \mathcal{E} \} \text{ is linearly independent,}$$
and the Mangasarian-Fromovitz constraint qualification (MFCQ), which holds at $x$ if
$$\{ c_i'(x) : i \in \mathcal{E} \} \text{ is linearly independent, and } \exists u \in V : c_i'(x) u < 0 \text{ for } i \in \mathcal{A}(x) \text{ and } c_i'(x) u = 0 \text{ for } i \in \mathcal{E}.$$
Note that our example of the constraint $x^3 \leq 0$ satisfies neither LICQ nor MFCQ at $0$.

6.1.5 Multipliers and KKT

We have already seen one theorem of alternatives in our discussion of linear algebra — the Fredholm alternative theorem, which deals with solvability of linear equations. There are many other theorems of alternatives for dealing with inequalities, associated with mathematicians like Motzkin, Kuhn, and Farkas. One such theorem is Farkas’ lemma: if $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, then exactly one of the two statements is true:

  • There exists $x \in \mathbb{R}^n$ such that $Ax = b$ and $x \geq 0$.
  • There exists $y \in \mathbb{R}^m$ such that $y^T A \geq 0$ and $y^T b < 0$.

Here the inequalities are taken elementwise.
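A small numerical illustration (Python with NumPy; the matrices are our own toy instance): when $Ax = b$, $x \geq 0$ is infeasible, we can exhibit a certificate $y$ for the second alternative.

```python
import numpy as np

# Toy instance: Ax = b forces x = (1, -1), so no feasible x >= 0 exists.
A = np.eye(2)
b = np.array([1.0, -1.0])
x_solve = np.linalg.solve(A, b)   # has a negative component

# Certificate for the second alternative: y^T A >= 0 and y^T b < 0.
y = np.array([0.0, 1.0])
```

Farkas’ lemma says exactly one alternative holds, so exhibiting such a $y$ proves infeasibility of the first system.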

Using Farkas’ lemma, one can rewrite the polar of the linearized cone as
$$\mathcal{L}(x)^\circ = \left\{ w^* \in V^* : w^* = \sum_{i \in \mathcal{E}} \lambda_i c_i'(x) + \sum_{i \in \mathcal{A}(x)} \mu_i c_i'(x), \; \mu_i \geq 0 \right\}.$$
It is usually convenient to define $\mu_i = 0$ for inequality constraints that are inactive; in this case, we can rewrite
$$\mathcal{L}(x)^\circ = \{ w^* \in V^* : w^* = \lambda c_\mathcal{E}'(x) + \mu c_\mathcal{I}'(x), \; \mu c_\mathcal{I}(x) = 0, \; \mu \geq 0 \},$$
where $c_\mathcal{E}(x)$ and $c_\mathcal{I}(x)$ are (column) vectors of equality and inequality constraint functions and $\lambda$ and $\mu$ are concrete row vectors. The statement that $\mu c_\mathcal{I}(x) = 0$, typically called a complementary slackness condition, is equivalent to saying that inactive inequality constraints (where $c_i(x) < 0$) must be associated with zero multipliers.

With this machinery in place, we can now rewrite the condition $-f'(x) \in \mathcal{L}(x)^\circ$ in the form that we usually use. We define the Lagrangian function
$$L(x, \lambda, \mu) = f(x) + \lambda c_\mathcal{E}(x) + \mu c_\mathcal{I}(x).$$
The variables $\lambda$ and $\mu$ that multiply the constraints are known as Lagrange multipliers. The Karush-Kuhn-Tucker (KKT) conditions at $x$, equivalent to $-f'(x) \in \mathcal{L}(x)^\circ$, are
$$\begin{aligned}
D_1 L(x, \lambda, \mu) &= 0 & &\text{constrained stationarity} \\
c_\mathcal{E}(x) &= 0 & &\text{primal feasibility (equality)} \\
c_\mathcal{I}(x) &\leq 0 & &\text{primal feasibility (inequality)} \\
\mu &\geq 0 & &\text{dual feasibility} \\
\mu c_\mathcal{I}(x) &= 0 & &\text{complementary slackness.}
\end{aligned}$$
When a constraint qualification condition holds, the KKT conditions are necessary first-order conditions for optimality at $x$.
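To make this concrete, here is a sketch (Python with NumPy; the problem is our own toy example) verifying the KKT conditions for minimizing $x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$:

```python
import numpy as np

# Minimize f(x) = x1^2 + x2^2 subject to c(x) = x1 + x2 - 1 = 0.
# By symmetry the minimizer is (1/2, 1/2); we check KKT there.
x = np.array([0.5, 0.5])
grad_f = 2 * x                   # f'(x)
grad_c = np.array([1.0, 1.0])    # c'(x)
lam = -1.0                       # equality multiplier (sign unimportant)

stationarity = grad_f + lam * grad_c   # D_1 L(x, lambda) -- should vanish
feasibility = x.sum() - 1.0            # c(x) -- should vanish
```

With no inequality constraints, dual feasibility and complementary slackness hold vacuously, so stationarity plus primal feasibility is the whole KKT system here.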

6.1.6 Multipliers and adjoints

Now suppose that we wanted to find a minimum (or maximum or other stationary point) of $y$ subject to the equations
$$u - f(x) = 0, \qquad v - g(x, u) = 0, \qquad y - h(x, u, v) = 0.$$
The Lagrangian for this system is
$$L = y - \bar{u} (u - f(x)) - \bar{v} (v - g(x, u)) - \bar{y} (y - h(x, u, v)),$$
where $x, u, v, y$ are the primal variables and $\bar{u}, \bar{v}, \bar{y}$ are multipliers. Stationarity gives
$$\begin{bmatrix} \bar{u} & \bar{v} & \bar{y} \end{bmatrix}
\begin{bmatrix}
D_1 f & -1 & 0 & 0 \\
D_1 g & D_2 g & -1 & 0 \\
D_1 h & D_2 h & D_3 h & -1
\end{bmatrix}
= \begin{bmatrix} 0 & 0 & 0 & -1 \end{bmatrix}.$$
We can write this equivalently as $\bar{x} = 0$ where
$$\begin{bmatrix} \bar{x} & \bar{u} & \bar{v} & \bar{y} \end{bmatrix}
\begin{bmatrix}
-1 & 0 & 0 & 0 \\
D_1 f & -1 & 0 & 0 \\
D_1 g & D_2 g & -1 & 0 \\
D_1 h & D_2 h & D_3 h & -1
\end{bmatrix}
= \begin{bmatrix} 0 & 0 & 0 & -1 \end{bmatrix}.$$
This is precisely the set of equations we saw before, except that where we previously said “dual variables,” here we are saying (negative) “Lagrange multipliers.”
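As a sketch (Python with NumPy; the scalar functions $f$, $g$, $h$ are our own toy choices), back-substituting this triangular adjoint system reproduces the derivative $dy/dx$:

```python
import numpy as np

# Toy scalar chain: u = f(x), v = g(x, u), y = h(x, u, v).
f = lambda x: x**2
g = lambda x, u: x + np.sin(u)
h = lambda x, u, v: x * u + v**2

def forward(x):
    u = f(x); v = g(x, u); y = h(x, u, v)
    return u, v, y

x = 0.7
u, v, y = forward(x)

# Back-substitute the triangular adjoint system: ybar first, then vbar,
# ubar, and finally xbar, using the partial derivatives of f, g, h.
ybar = 1.0
vbar = ybar * (2 * v)                          # D3h = 2v
ubar = ybar * x + vbar * np.cos(u)             # D2h = x, D2g = cos(u)
xbar = ybar * u + vbar * 1.0 + ubar * (2 * x)  # D1h = u, D1g = 1, D1f = 2x

# Sanity check against a centered finite difference for dy/dx.
step = 1e-6
fd = (forward(x + step)[2] - forward(x - step)[2]) / (2 * step)
```

The back-substitution order (last multiplier first) is exactly why adjoint, or reverse-mode, differentiation costs one extra sweep through the chain.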

6.1.7 Mechanical analogies

6.1.8 Constrained second derivatives

The second-order geometric optimality condition for a twice-differentiable $f$ at $x$ is a straightforward extension of the first-order geometric optimality condition. If $x$ satisfies the first-order condition
$$\forall u \in T(x), \quad f'(x) u \geq 0,$$
and in addition
$$\forall\, 0 \neq u \in T(x) \text{ s.t. } f'(x) u = 0, \quad f''(x)(u \otimes u) > 0,$$
then $x$ is a constrained local minimum. If there are nonzero directions $u \in T(x)$ where $f'(x) u = 0$ and $f''(x)(u \otimes u) < 0$, then $x$ is not a constrained local minimum. Otherwise, we cannot determine minimality without looking at higher derivatives.

As with the geometric first-order condition, the geometric second-order condition is complicated by the need to get a handle on the tangent cone. Assuming that we satisfy the linearly independent constraint qualification condition at $x$, we have a more straightforward version of the second derivative test. Let $\mathcal{J} = \mathcal{E} \cup \{ i \in \mathcal{I} : \mu_i > 0 \}$. Then a sufficient condition for $x$ to be a strong constrained local minimizer is
$$\forall \text{ nonzero } u \in \mathcal{N}(c_\mathcal{J}'(x)), \quad f''(x)(u \otimes u) > 0.$$
That is, the Hessian should be positive definite in all directions that are not already “uphill” at first order. Such a condition is known as conditional positive definiteness, since $f''(x)$ is positive definite conditioned on some constraints on the directions considered. If there are such directions for which $f''(x)(u \otimes u) < 0$, then we do not have a local minimizer; otherwise, we may need to consider higher derivatives to make a diagnosis.
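A sketch of this reduced test (Python with NumPy and SciPy; the problem is our own toy example): restrict the Hessian to a basis for $\mathcal{N}(c_\mathcal{J}'(x))$ and check definiteness there.

```python
import numpy as np
from scipy.linalg import null_space

# Toy problem: minimize f(x, y) = x^2 - y^2 subject to c(x, y) = y = 0.
# At (0, 0) the full Hessian is indefinite, but it is positive definite
# on the null space of the constraint Jacobian.
H = np.diag([2.0, -2.0])        # Hessian of f at (0, 0)
J = np.array([[0.0, 1.0]])      # Jacobian c'(x) at (0, 0)

Z = null_space(J)               # columns span N(c'(x))
H_reduced = Z.T @ H @ Z         # reduced (projected) Hessian
evals = np.linalg.eigvalsh(H_reduced)
```

Here the full Hessian has a negative eigenvalue, yet the reduced Hessian is positive definite, so the point is a strong constrained local minimizer.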

6.2 Vector optimization

6.3 Convexity

A set $\Omega \subseteq V$ is convex if every line segment between points in $\Omega$ also lies in $\Omega$, i.e.
$$\forall x, y \in \Omega, \forall s \in [0,1], \quad (1-s) x + s y \in \Omega.$$
A function $f : \Omega \subseteq V \to \mathbb{R}$ is convex if the graph of $f$ on a line segment in $\Omega$ always lies below the secant connecting the endpoint values:
$$\forall x, y \in \Omega, \forall s \in [0,1], \quad f((1-s) x + s y) \leq (1-s) f(x) + s f(y).$$
We say $f$ is strictly convex if the inequality is strict on the interior of the segment connecting $x$ and $y$:
$$\forall x \neq y \in \Omega, \forall s \in (0,1), \quad f((1-s) x + s y) < (1-s) f(x) + s f(y).$$
If $f$ is twice differentiable, convexity corresponds to positive semi-definiteness of the Hessian, and positive definiteness of the Hessian implies strict convexity (though not conversely: $x \mapsto x^4$ is strictly convex with vanishing second derivative at $0$). However, functions can be convex even if they are not twice differentiable.
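Convexity can be spot-checked numerically straight from the secant inequality (a sketch in Python with NumPy; the sampling scheme and function name are our own):

```python
import numpy as np

def convex_on_samples(f, xs, n_s=11, tol=1e-12):
    """Spot-check f((1-s)x + s y) <= (1-s) f(x) + s f(y) on sample points."""
    ss = np.linspace(0.0, 1.0, n_s)
    for x in xs:
        for y in xs:
            for s in ss:
                z = (1 - s) * x + s * y
                if f(z) > (1 - s) * f(x) + s * f(y) + tol:
                    return False   # found a violated secant
    return True

xs = np.linspace(-2.0, 2.0, 9)
abs_ok = convex_on_samples(np.abs, xs)   # |x| is convex (not differentiable at 0)
sin_ok = convex_on_samples(np.sin, xs)   # sin is not convex on [-2, 2]
```

Of course, passing on samples does not prove convexity; a single violated secant does disprove it.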

A convex function on a closed, bounded set in finite dimensions is guaranteed to take on its minimum value $f_{\min}$ at some point $x \in \Omega$. Every local minimizer is a global minimizer, and the set of global minimizers $\{ x \in \Omega : f(x) = f_{\min} \}$ is itself a convex set. If $f$ is strictly convex, then the global minimizer is unique.

6.3.1 Subderivatives

The subderivative of a convex $f$ at $x \in \Omega$ is the set of functionals $w^* \in V^*$ such that
$$f(x+u) \geq f(x) + w^* u$$
for all $x + u \in \Omega$; that is, the subderivative corresponds to the set of “supporting hyperplanes” that agree with $f(x)$ and lie below $f$ elsewhere on $\Omega$. When $V$ is an inner product space, we say $w \in V$ is in the subgradient at $x \in \Omega$ if
$$f(x+u) \geq f(x) + \langle u, w \rangle$$
whenever $x + u \in \Omega$.

When $f$ is differentiable at $x$, the only element of the subderivative is the derivative (and the only element of the subgradient is the gradient). However, the concept of a subderivative continues to make sense even for nonsmooth functions. For example, the absolute value function $x \mapsto |x|$ on $\mathbb{R}$ is not differentiable at $0$, but its subgradient there is the interval $[-1, 1]$.
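A quick numerical check (Python with NumPy; the sample grids are our own) that every slope in $[-1, 1]$ satisfies the subgradient inequality for $|x|$ at $0$, while slopes outside that interval do not:

```python
import numpy as np

us = np.linspace(-2.0, 2.0, 401)   # test directions u
ws = np.linspace(-1.0, 1.0, 21)    # candidate subgradient slopes

# Subgradient inequality at x = 0: |0 + u| >= |0| + w * u for all u.
inside_ok = all(np.all(np.abs(us) >= w * us) for w in ws)

# A slope outside [-1, 1] violates the inequality for some u.
outside_ok = bool(np.all(np.abs(us) >= 1.5 * us))
```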

The notion of a subderivative allows us to generalize the usual notion of stationary points for differentiable functions: a point $x \in \Omega$ is a stationary point for a convex $f$ if the subderivative of $f$ at $x$ contains the zero functional $0 \in V^*$, and a minimizer must be a stationary point.


  1. For equality constraints, the signs of the Lagrange multipliers are unimportant.