Notebook for 2023-04-19
Levenberg-Marquardt
Recall from our discussion of nonlinear least squares the Levenberg-Marquardt iteration where we solve a regularized least squares problem to compute the step; that is,
$$p_k = \operatorname{argmin}_p \frac{1}{2} \|f(x_k) + f'(x_k) p\|^2 + \frac{\mu}{2} \|Dp\|^2.$$
The scaling matrix $D$ may be an identity matrix (per Levenberg), or we may choose $D^2 = \operatorname{diag}(f'(x_k)^T f'(x_k))$ (as suggested by Marquardt).
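Concretely, the regularized step can be computed by stacking $\sqrt{\mu}\,D$ below the Jacobian and solving one augmented least squares problem, which avoids forming the normal equations explicitly. A minimal Python sketch (the name `lm_step` and its interface are illustrative, not the notebook's Julia code):

```python
import numpy as np

def lm_step(J, f, mu, D=None):
    """One Levenberg-Marquardt step: minimize
    0.5*||f + J p||^2 + 0.5*mu*||D p||^2 via an augmented least squares solve."""
    m, n = J.shape
    if D is None:
        D = np.eye(n)  # Levenberg's choice; Marquardt would use
                       # D = np.diag(np.sqrt(np.diag(J.T @ J)))
    # Stack [J; sqrt(mu) D] against [-f; 0] and solve in the least squares sense
    A = np.vstack([J, np.sqrt(mu) * D])
    b = np.concatenate([-f, np.zeros(n)])
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p
```

For $\mu = 0$ this reduces to the Gauss-Newton step, and increasing $\mu$ shrinks the step, consistent with the limit described below.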
For $\mu = 0$, the Levenberg-Marquardt step is the same as a Gauss-Newton step. As $\mu$ becomes large, though, we have the (scaled) gradient step
$$p_k = -\frac{1}{\mu} D^{-2} f'(x_k)^T f(x_k) + O(\mu^{-2}).$$
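To see where this limit comes from, write $J = f'(x_k)$ and expand the normal equations for the regularized problem in powers of $\mu^{-1}$:

```latex
\begin{aligned}
(J^T J + \mu D^2)\, p_k &= -J^T f(x_k), \\
p_k &= -\left( I + \mu^{-1} D^{-2} J^T J \right)^{-1} \mu^{-1} D^{-2} J^T f(x_k)
     = -\frac{1}{\mu} D^{-2} J^T f(x_k) + O(\mu^{-2}).
\end{aligned}
```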
Unlike Gauss-Newton with line search, changing the parameter $\mu$ affects not only the distance we move, but also the direction.
In order both to ensure global convergence (under suitable hypotheses on $f$, as usual) and to ensure that convergence is not too slow, a variety of methods have been proposed that adjust $\mu$ dynamically. To judge whether $\mu$ has been chosen too aggressively or too conservatively, we monitor the gain ratio, i.e. the ratio of the actual reduction in the objective to the reduction predicted by the (Gauss-Newton) model:
$$\rho = \frac{\|f(x_k)\|^2-\|f(x_k+p_k)\|^2} {\|f(x_k)\|^2 - \|f(x_k)+f'(x_k)p_k\|^2}.$$
If the step decreases the function value enough ($\rho$ is sufficiently positive), then we accept the step; otherwise, we reject it. For the next step (or the next attempt), we may increase or decrease the damping parameter $\mu$ depending on whether $\rho$ is close to one or far from one.
levenberg_marquardt (generic function with 1 method)
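The notebook's `levenberg_marquardt` lives in a hidden Julia cell; a minimal Python sketch of the logic just described might look as follows. The damping update constants (halving/doubling, the acceptance threshold `1e-4`) are illustrative choices, not the notebook's:

```python
import numpy as np

def levenberg_marquardt_sketch(f, J, x0, mu=1.0, gtol=1e-8, maxiter=100):
    """Sketch of Levenberg-Marquardt with gain-ratio control.
    f(x) returns the residual vector, J(x) its Jacobian."""
    x = np.array(x0, dtype=float)
    I = np.eye(len(x))
    for _ in range(maxiter):
        fx, Jx = f(x), J(x)
        g = Jx.T @ fx
        if np.linalg.norm(g) < gtol:
            break
        # Damped step: (J^T J + mu I) p = -J^T f  (Levenberg scaling D = I)
        p = np.linalg.solve(Jx.T @ Jx + mu * I, -g)
        # Gain ratio: actual vs predicted decrease in ||f||^2
        fp = f(x + p)
        rlin = fx + Jx @ p
        rho = (fx @ fx - fp @ fp) / (fx @ fx - rlin @ rlin)
        if rho > 1e-4:                  # sufficient decrease: accept
            x = x + p
            if rho > 0.75:
                mu = max(mu / 2, 1e-12)  # model is good: damp less
        else:
            mu = 2 * mu                 # reject: damp more, shrink the step
    return x
```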
Consider constraints
There is another way to think of the Levenberg-Marquardt step. Consider the minimization problem
$$p_k = \operatorname{argmin}_p \frac{1}{2} \|f(x_k) + f'(x_k)p \|^2 \mbox{ s.t. } \|Dp\| \leq \Delta.$$
There are two possible cases in this problem:
$\|D f'(x_k)^\dagger f(x_k)\| < \Delta$, and the solution is the (unconstrained) Gauss-Newton step.
Otherwise the Gauss-Newton step is too big, and we have to enforce the constraint $\|Dp\| = \Delta$. For convenience, we rewrite this constraint as $(\|Dp\|^2-\Delta^2)/2 = 0$.
We define the Lagrangian for the optimization problem to be
$$L(p,\lambda) = \frac{1}{2} \|f(x_k)+f'(x_k) p\|^2 + \frac{\lambda}{2} \left( \|Dp\|^2-\Delta^2 \right).$$
The solution to the constrained optimization problem satisfies the critical point equations $\partial L/\partial p = 0$ and $\partial L/\partial \lambda = 0$. The equation $\partial L/\partial p = 0$ is the same as the Tikhonov-regularized least squares problem with regularization parameter $\lambda$. Whether $\lambda$ is treated as a regularization parameter or as a multiplier that enforces a constraint is thus simply a matter of perspective. Hence, we can consider the Levenberg-Marquardt method as minimizing the model $\|f(x_k) + f'(x_k) p\|$ subject to the constraint $\|Dp\| \leq \Delta$, where a larger or smaller value of $\lambda$ corresponds to a smaller or larger value of $\Delta$. We think of the region $\|Dp\| \leq \Delta$ as the region where the Gauss-Newton model provides good guidance for optimization; that is, it is a region where we trust the model.
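Concretely, differentiating the Lagrangian (writing $J = f'(x_k)$) reproduces the regularized normal equations together with the constraint:

```latex
\begin{aligned}
\frac{\partial L}{\partial p} &= J^T \left( f(x_k) + J p \right) + \lambda D^T D\, p = 0
  &&\Longleftrightarrow& (J^T J + \lambda D^T D)\, p &= -J^T f(x_k), \\
\frac{\partial L}{\partial \lambda} &= \frac{1}{2} \left( \|Dp\|^2 - \Delta^2 \right) = 0
  &&\Longleftrightarrow& \|Dp\| &= \Delta.
\end{aligned}
```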
Trust regions
A trust region method for minimizing $\phi$ involves a model $\mu(p)$ that is supposed to approximate the decrease $\phi(x_k+p)-\phi(x_k)$ associated with taking a step $p$, and a trust region, often chosen to be a sphere $\|p\| \leq \Delta$, where we believe the model to provide reasonable predictions.
The simplest model $\mu(p)$ is linear, but the more interesting (and common) case involves a quadratic model
$$\mu(p) = g^T p + \frac{1}{2} p^T H p.$$
Minimizing a quadratic $\mu(p)$ subject to the constraint $\|p\| \leq \Delta$ is not easy. We turn to this trust region subproblem next.
Compared to a line search strategy, trust region methods have the advantage that we adapt not just the step length but also the direction of the search. Consequently, trust region methods often exhibit more robust convergence, though both line search and trust region approaches exhibit good global convergence properties, and both approaches lead to eventual superlinear convergence when paired with a Newton model (i.e. a quadratic approximation centered at $x_k$) or a quasi-Newton method such as BFGS.
The trust region subproblem
The problem
$$\mbox{minimize } g^T p + \frac{1}{2} p^T H p \mbox{ s.t. } \|p\| \leq \Delta$$
is the trust region subproblem. Sometimes people use the more general constraint
$$p^T M p \leq \Delta^2$$
for some positive definite $M$, but we will stick to the usual 2-norm. There are two possible solutions:
If $H$ is positive definite and $\|H^{-1} g\| \leq \Delta$, then the solution is $p = -H^{-1} g$. This is the interior case.
If $H$ is not positive definite or $\|H^{-1} g\| > \Delta$, then the solution is $p = -(H+\lambda I)^{-1} g$ for some $\lambda > 0$ such that $\|p\| = \Delta$. At the appropriate $\lambda$, the matrix $H+\lambda I$ is positive semi-definite. This is the boundary case.
Most of the effort is spent on the boundary case, which itself has two subcases:
If $H + \lambda I$ is positive definite, then there is a unique solution to the trust region subproblem.
If $H + \lambda I$ is singular, then there are multiple solutions to the trust region subproblem, and we seek the solution with minimum norm.
The case when $H + \lambda I$ is singular (i.e. $-\lambda$ is an eigenvalue of $H$) is commonly known as the hard case in the literature.
Exact solves
The standard solver for the trust-region subproblem is due to Moré and Sorensen, and involves a safeguarded Newton iteration for finding the relevant $\lambda$, with careful treatment of the hard case. A number of authors have also adapted this approach to the large sparse case. However, I am particularly fond of a method proposed by Gander, Golub, and von Matt that recasts the trust-region subproblem in terms of an eigenvalue problem. That paper concluded that the eigenvalue formulation was numerically inferior to the Moré-Sorensen approach, but a 2017 paper of Adachi, Iwata, Nakatsukasa, and Takeda concluded that this was in part because the eigensolvers available in 1989 were not as good as the solvers currently available. The Adachi et al. paper provides a nice discussion of the formulation, including the hard case, which results in a mercifully brief code (which you are nonetheless not required to digest). One of the nice things about this formulation is that it adapts naturally to large-scale problems where $H$ is sparse or data sparse, though we will only worry about the dense case in our code.
solve_tr (generic function with 1 method)
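The notebook's `solve_tr` follows the eigenvalue formulation of Adachi et al.; the Python sketch below instead uses a much simpler safeguarded bisection on $\lambda$ (closer in spirit to Moré-Sorensen, but without their Newton iteration), and it deliberately ignores the hard case:

```python
import numpy as np

def solve_tr_sketch(g, H, Delta, tol=1e-10, maxiter=100):
    """Sketch of an exact trust-region subproblem solve: bisect on lambda so
    that p = -(H + lambda I)^{-1} g satisfies ||p|| = Delta.  Hard case ignored."""
    n = len(g)
    lam_min = np.linalg.eigvalsh(H)[0]
    # Interior case: H positive definite and the unconstrained minimizer fits.
    if lam_min > 0:
        p = np.linalg.solve(H, -g)
        if np.linalg.norm(p) <= Delta:
            return p
    # Boundary case: ||p(lambda)|| is decreasing for lambda > max(0, -lam_min).
    lo = max(0.0, -lam_min) + 1e-14
    hi = lo + 1.0
    while np.linalg.norm(np.linalg.solve(H + hi * np.eye(n), -g)) > Delta:
        hi *= 2  # grow until the step is inside the region
    for _ in range(maxiter):
        lam = (lo + hi) / 2
        p = np.linalg.solve(H + lam * np.eye(n), -g)
        if np.linalg.norm(p) > Delta:
            lo = lam
        else:
            hi = lam
        if hi - lo < tol:
            break
    return np.linalg.solve(H + hi * np.eye(n), -g)
```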
A useful picture is a plot of the step for various $\Delta$ values for a sample quadratic model.
Inexact solves
One of the main difficulties with the trust region approach is solving a constrained quadratic optimization as a subproblem. As with line search, the thinking goes, the cost of doing an exact search is probably not worthwhile — we would rather get a good-enough approximate solution and move on.
A popular inexact search approach is the dog leg method. The idea of the dog leg method is to approximate the shape of the curve
$$p(\Delta) = \operatorname{argmin}_p \mu(p) \mbox{ s.t. } \|p\| \leq \Delta$$
based on the observation that
$p(0) = 0$.
$p'(0) \propto -\nabla \phi(x_k)$.
For large $\Delta$, $p(\Delta) = p_{\infty}$ is the unconstrained minimizer of $\mu$.
We thus approximate the $p(\Delta)$ curve by a piecewise linear curve with
A line segment from $0$ to $-\alpha \nabla \phi(x_k)$, where $\alpha$ is chosen so that $\mu(-\alpha \nabla \phi(x_k))$ is minimized.
Another line segment from $-\alpha \nabla \phi(x_k)$ to $p_{\infty}$.
dogleg_tr (generic function with 1 method)
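A Python sketch of the dogleg step (an illustrative stand-in for the notebook's Julia `dogleg_tr`; it assumes the model Hessian is positive definite, so the second leg always exits the region):

```python
import numpy as np

def dogleg_tr_sketch(g, H, Delta):
    """Dogleg step for the model g^T p + 0.5 p^T H p with ||p|| <= Delta."""
    pN = np.linalg.solve(H, -g)            # unconstrained (Newton) minimizer
    if np.linalg.norm(pN) <= Delta:
        return pN                          # whole path fits: take pN
    # Cauchy point: minimizer of the model along -g
    alpha = (g @ g) / (g @ H @ g)
    pC = -alpha * g
    if np.linalg.norm(pC) >= Delta:
        return -(Delta / np.linalg.norm(g)) * g   # clip the gradient leg
    # Second leg: p(t) = pC + t (pN - pC); find t in [0,1] with ||p(t)|| = Delta
    d = pN - pC
    a, b, c = d @ d, 2 * (pC @ d), pC @ pC - Delta**2
    t = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return pC + t * d
```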
A plot illustrates what happens with the dogleg path as a function of $\Delta$, compared to the true trust region solution path: the dogleg is a piecewise linear approximation to the true path.
A related approach is two-dimensional subspace minimization, which involves a constrained minimization over the two-dimensional subspace spanned by $-\nabla \phi(x_k)$ and $p_{\infty}$.
The Steihaug method combines the trust region approach with a (linear) conjugate gradient solve on the quadratic model problem. The idea is to trace out a polygonal path (as in the dog leg method) connecting the CG iterates, until that path intersects the trust region boundary. If the (approximate) Hessian used by the model is indefinite, CG runs until it discovers the indefiniteness, then plots a path toward where the model descends to $-\infty$. There are more recent variants which combine Newton, trust regions, and Krylov subspaces in various clever ways; other than mentioning that they exist, though, we leave this topic for the interested student to pursue in her copious free time.
Adapting the trust region
At each step of the method, we (approximately) minimize the model within the trust region to get a proposed step $p$, then check the gain ratio associated with taking that step:
$$\rho_k = \frac{\phi(x_k)-\phi(x_k+p_k)}{\mu(0)-\mu(p_k)}.$$
Depending on the gain ratio, we adjust $\Delta$; a strategy proposed in Nocedal and Wright is:
If $\rho_k < 1/4$, we were too aggressive; set $\Delta_{k+1} = \Delta_k/4$.
If $\rho_k > 3/4$ and $\|p_k\| = \Delta_k$, we were too conservative; set $\Delta_{k+1} = \min(2\Delta_k, \Delta_{\max})$.
Otherwise, leave $\Delta_{k+1} = \Delta_k$.
We also use the gain ratio to decide whether to accept or reject the step. If $\rho_k > \eta$ for a fixed $\eta \in [0,1/4)$, we accept ($x_{k+1} = x_k+p_k$); otherwise we reject ($x_{k+1} = x_k$).
tr_newton (generic function with 1 method)
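The notebook's `tr_newton` is again a hidden Julia cell; the Python sketch below wires together the pieces above, using a dogleg subproblem solve (which assumes the Hessian stays positive definite along the way) and the Nocedal-Wright style radius update:

```python
import numpy as np

def tr_newton_sketch(phi, grad, hess, x0, Delta=1.0, Dmax=100.0,
                     eta=0.1, gtol=1e-8, maxiter=200):
    """Sketch of a trust-region Newton iteration with gain-ratio control."""
    x = np.array(x0, dtype=float)
    for _ in range(maxiter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < gtol:
            break
        p = _dogleg(g, H, Delta)
        model_decrease = -(g @ p + 0.5 * p @ H @ p)   # mu(0) - mu(p)
        rho = (phi(x) - phi(x + p)) / model_decrease  # gain ratio
        if rho < 0.25:                                # too aggressive
            Delta /= 4
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), Delta):
            Delta = min(2 * Delta, Dmax)              # too conservative
        if rho > eta:                                 # accept or reject
            x = x + p
    return x

def _dogleg(g, H, Delta):
    """Dogleg subproblem step (assumes H positive definite)."""
    pN = np.linalg.solve(H, -g)
    if np.linalg.norm(pN) <= Delta:
        return pN
    pC = -((g @ g) / (g @ H @ g)) * g
    if np.linalg.norm(pC) >= Delta:
        return -(Delta / np.linalg.norm(g)) * g
    d = pN - pC
    a, b, c = d @ d, 2 * (pC @ d), pC @ pC - Delta**2
    t = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return pC + t * d
```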
An illustrative computation
It is always useful to see how these things work on a problem we've already looked at in another context. Let's consider as an example the Rosenbrock banana function.
Hrosen (generic function with 1 method)
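For reference, a Python transcription of the standard Rosenbrock function and its derivatives (the notebook's hidden `Hrosen` cell is the Julia analogue of the Hessian below):

```python
import numpy as np

def rosen(x):
    """Rosenbrock banana function (standard coefficients)."""
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def grosen(x):
    """Gradient of the Rosenbrock function."""
    return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                     200.0 * (x[1] - x[0]**2)])

def Hrosen(x):
    """Hessian of the Rosenbrock function."""
    return np.array([[1200.0 * x[0]**2 - 400.0 * x[1] + 2.0, -400.0 * x[0]],
                     [-400.0 * x[0], 200.0]])
```

The minimizer is $x^* = (1,1)$, where the gradient vanishes and the Hessian is positive definite but badly conditioned, which is what makes the banana a good stress test.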