Notebook for 2023-04-10
Nonlinear least squares
Today we consider a special class of optimization problems: nonlinear least squares. Given $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ for $m > n$, we seek to minimize the objective function
$$\phi(x) = \frac{1}{2} \|f(x)\|^2.$$
Differentiating gives us the stationarity (critical point) condition
$$\forall \delta x \in \mathbb{R}^n, \delta \phi(x) = f(x)^T f'(x) \delta x = 0;$$
or, equivalently,
$$0 = \nabla \phi(x) = f'(x)^T f(x) = J(x)^T f(x).$$
Like linear least squares, nonlinear least squares problems often arise from model fitting. Some important special cases include
Separable problems of the form $f(x, y) = \Phi(y) x - b$
Linear models with nonlinear loss functions, where $f_j(x) = \psi((Ax-b)_j)$
For this lecture, let's consider three small example problems: one general nonlinear least squares problem, one separable problem, and one robust regression problem involving a nonlinear loss function. We will use these examples to illustrate different ways we can use problem structure – and also show how to think about initial guesses.
Newton and Gauss-Newton
The Hessian of $\phi$ is
$$H_{\phi}(x) = J(x)^T J(x) + \sum_{i=1}^m f_i(x) H_{f_i}(x)$$
and while it is certainly possible to run an ordinary Newton iteration on this optimization problem, the second term is potentially small when we can solve to a small residual (i.e. when $f(x)$ gets close to zero). Hence, we consider the Gauss-Newton iteration, a scaled gradient descent iteration with scaling matrix $J(x)^T J(x)$, i.e.
$$\begin{align*}x^{k+1} &= x^k - \left[ J(x^k)^T J(x^k) \right]^{-1} J(x^k)^T f(x^k) \\ &= x^k - J(x^k)^\dagger f(x^k) \end{align*}$$
As long as $J$ remains full rank, the Gauss-Newton direction is a descent direction. Assuming $J$ is Lipschitz with constant $L$ near a stationary point $x^*$, one can also derive the error iteration
$$\|e^{k+1}\| \leq L \|J(x^*)^\dagger\|^2 \|f(x^*)\| \|e^k\| + O(\|e^k\|^2)$$
This is an upper bound, but it suggests (accurately) that Gauss-Newton is not guaranteed to converge even from a small initial error. At the same time, convergence can be very rapid for well-behaved problems ($J(x^*)$ not too close to singular, Lipschitz constant not too big) when the residual norm at the solution is small. If $f(x^*) = 0$, we recover the same type of quadratic convergence we see for Newton's method.
gauss_newton (generic function with 1 method)
gauss_newton (generic function with 2 methods)
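For reference, a minimal sketch of a Gauss-Newton iteration along these lines; the function name, keyword arguments, and stopping test here are assumptions of the sketch, not necessarily the interface of the gauss_newton routine above.

```julia
using LinearAlgebra

# Sketch of a Gauss-Newton iteration: f returns the residual vector, J the Jacobian.
# The keyword arguments and stopping test are illustrative assumptions.
function gauss_newton_sketch(f, J, x0; gtol=1e-12, maxiter=100)
    x = copy(x0)
    for k = 1:maxiter
        fx, Jx = f(x), J(x)
        norm(Jx'*fx) < gtol && break  # stop when the gradient Jᵀf is (nearly) zero
        x -= Jx \ fx                  # least squares solve gives the Gauss-Newton step
    end
    return x
end
```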
Questions
Show how to write one Gauss-Newton step in terms of a linear least squares problem.
Suppose $m = n$. What does Gauss-Newton look like in this case?
Justify the claim that if $J$ is full rank, the Gauss-Newton direction is a descent direction.
A reaction rate example
This example is taken from Wikipedia and involves fitting the chemical reaction rate for an enzyme mediated reaction. The larger the substrate concentration of the enzyme ($[S]$), the faster the rate of reaction $R$, according to a relation of the form
$$R = \frac{V_{\max} [S]}{K_M + [S]}$$
Writing the unknowns as $\beta = \begin{bmatrix} V_{\max} & K_M \end{bmatrix}^T$, we solve the nonlinear least squares problem
$$\mbox{minimize } \sum_i \left( R_i - \frac{\beta_1 S_i}{\beta_2 + S_i} \right)^2$$
over a collected data set. To get an initial guess, we multiply each residual by $\beta_2 + S_i$ to solve the ordinary least squares problem
$$\mbox{minimize } \sum_i \left( (\beta_2 + S_i) R_i - \beta_1 S_i \right)^2$$
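In Julia, this linearized initial guess is a one-line least squares solve; a sketch, with S and R assumed to hold the substrate concentrations and measured reaction rates:

```julia
# Sketch: rearranging (β₂ + Sᵢ) Rᵢ ≈ β₁ Sᵢ gives the linear system β₁ Sᵢ - β₂ Rᵢ ≈ Sᵢ Rᵢ,
# which we solve in the least squares sense for the initial guess β⁰.
β0 = hcat(S, -R) \ (S .* R)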
f
J
H
0.357625
0.481568
According to the Wikipedia article, the true parameters are about $\hat{\beta}_1 = 0.362$ and $\hat{\beta}_2 = 0.556$; a quick sanity check reassures us that we have a good initial guess from the linear least squares problem.
It is also useful to sanity-check that our Jacobian and Hessian are implemented correctly.
test_jacobian (generic function with 1 method)
test_jacobian (generic function with 2 methods)
4.390115950196215e-11
2.0560589229672195e-11
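A finite-difference check along these lines might look like the sketch below (the helper name and step size are assumptions, not the notebook's exact test_jacobian); small relative errors like the values above are what we expect from a correct analytic Jacobian.

```julia
using LinearAlgebra

# Sketch of a finite-difference Jacobian check: compares a central-difference
# directional derivative of f to J(x)*δx for a random direction δx.
function check_jacobian(f, J, x; h=1e-6)
    δx = randn(length(x))
    fd = (f(x + h*δx) - f(x - h*δx)) / (2h)
    return norm(fd - J(x)*δx) / norm(J(x)*δx)   # relative error, O(h²) for a correct J
end
```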
Having reassured ourselves that we have a good initial guess and a correct Jacobian, let's apply Gauss-Newton to solve the problem.
0.361837
0.556266
Compared to the Newton iteration, Gauss-Newton is getting (fast) linear convergence, and ends up taking about twice as many iterations to drive the residual down close to $10^{-15}$. At the same time, Gauss-Newton doesn't require that we form the Hessian matrix!
0.361837
0.556266
An actual bound for Gauss-Newton (optional reading)
We are not going to do the actual error iteration for Gauss-Newton iteration in class, but it is not that conceptually difficult – it's just that the algebra is a bit tedious. We'll keep it in the notes for the sake of completeness.
In the interest of keeping the notation concise, let's write $J_* = J(x^*)$ and $J_k = J(x^k) = J_* + E_k$. Similarly, let $f_* = f(x^*)$. We assume throughout the derivation that $J$ is Lipschitz with constant $L$ (in the 2-norm) and that $\sigma_{\min}(J_*) > L \|e^k\|$, implying that
$$\sigma_{\min}(J_k) \geq \sigma_{\min}(J_*) - \|E_k\| \geq \sigma_{\min}(J_*)-L\|e^k\| > 0.$$
By the fundamental theorem of calculus,
$$f(x^k) = f_* + \int_0^1 J(x^* + \xi e^k) e^k \, d\xi.$$
Therefore we can write $p^k = -J_k^\dagger f(x^k)$ as
$$p^k = -(J_k^T J_k)^{-1} J_k^T f_* - J_k^\dagger \int_0^1 J(x^* + \xi e^k) e^k \, d\xi.$$
Using the fact that $J_*^T f_* = 0$ (by stationarity), we have
$$-(J_k^T J_k)^{-1} J_k^T f_* = -(J_k^T J_k)^{-1} E_k^T f_*,$$
which gives us the bound
$$\|-(J_k^T J_k)^{-1} J_k^T f_*\| \leq \frac{L \|f_*\| \|e^k\|}{\left( \sigma_{\min}(J_*)-L\|e^k\| \right)^2}.$$
Using the fact that $J_k^\dagger J_k = I$ (assuming $J_k$ has full column rank), we have
$$-J_k^\dagger \int_0^1 J(x^* + \xi e^k) e^k \, d\xi = -e^k - J_k^\dagger \int_0^1 (J(x^* + \xi e^k)-J_k) e^k \, d\xi,$$
and by the Lipschitz condition and consistency,
$$\left\| J_k^\dagger \int_0^1 (J(x^* + \xi e^k)-J_k) e^k \, d\xi \right\| \leq \frac{L\|e^k\|^2}{2(\sigma_{\min}(J_*)-L\|e^k\|)}.$$
Substituting these bounds into $e^{k+1} = e^k + p^k$, we have
$$\|e^{k+1}\| \leq L \left( \frac{\|f_*\|}{(\sigma_{\min}(J_*)-L\|e^k\|)^2} + \frac{\|e^k\|/2}{\sigma_{\min}(J_*)-L\|e^k\|} \right) \|e^k\|,$$
and if $L \|f_*\|/\sigma_{\min}(J_*)^2 < 1$, then a sufficient condition for the error to decrease by some factor $\alpha < 1$ (where $\alpha$ does not depend on $e^k$) is
$$\|e^k\| \leq \frac{2\sigma_{\min}(J_*)}{5L} \left( 1 - \frac{L\|f_*\|}{\sigma_{\min}(J_*)^2} \right),$$
and so the Gauss-Newton iteration is guaranteed to converge if the initial error satisfies
$$\|e^0\| \leq \frac{2\sigma_{\min}(J_*)}{5L} \left( 1 - \frac{L\|f_*\|}{\sigma_{\min}(J_*)^2} \right).$$
Levenberg-Marquardt
Gauss-Newton can converge quite quickly under some circumstances, but suffers when
The minimal residual is large
The Jacobian is close to singular
The Lipschitz constant on $J$ is large
There is not much we can do about $\|f(x_*)\|$ or about the Lipschitz constant on the Jacobian. But we do know how to deal with ill-conditioned least squares problems! The Levenberg-Marquardt algorithm replaces the least squares solve in the Gauss-Newton algorithm with a regularized least squares solve:
$$x^{k+1} = x^k - (J_k^T J_k + \lambda^2 D_k^2)^{-1} J_k^T f(x^k)$$
The scaling matrix $D_k$ may be an identity (suggested by Levenberg) or a diagonal matrix whose diagonal entries are the column norms of $J_k$ (suggested by Marquardt).
When $\lambda$ approaches zero, the Levenberg-Marquardt step approaches the Gauss-Newton step. As $\lambda$ becomes large, it approaches a short step in a scaled gradient descent direction, with scaling matrix $D_k^{-2}$ and step length proportional to $\lambda^{-2}$. Unlike Gauss-Newton, we can always get Levenberg-Marquardt to converge given a good enough initial guess and a large enough $\lambda$. In practice, though, we usually choose $\lambda$ adaptively. We'll return to how we do this in another lecture.
levenberg_marquardt (generic function with 1 method)
levenberg_marquardt (generic function with 2 methods)
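A common way to implement the regularized solve is to border the least squares system with $\lambda D_k$. The following fixed-$\lambda$ sketch illustrates the idea; the interface is an assumption, not the notebook's levenberg_marquardt routine.

```julia
using LinearAlgebra

# Sketch of a fixed-λ Levenberg-Marquardt iteration.  The step solves
#   min_p ‖[J; λD] p + [f; 0]‖²,  i.e.  (JᵀJ + λ²D²) p = -Jᵀf,
# with D the Marquardt scaling (column norms of J).
function levenberg_marquardt_sketch(f, J, x0; λ=1.0, gtol=1e-10, maxiter=200)
    x = copy(x0)
    for k = 1:maxiter
        fx, Jx = f(x), J(x)
        norm(Jx'*fx) < gtol && break
        D = Diagonal(vec(sqrt.(sum(abs2, Jx; dims=1))))  # column norms of J
        x -= [Jx; λ*D] \ [fx; zeros(size(Jx, 2))]
    end
    return x
end
```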
Questions
Justify the claims about the asymptotic behavior of Levenberg-Marquardt as $\lambda$ approaches 0 and $\infty$.
Explain the implementation of the Levenberg-Marquardt step in the code.
A peak-fitting example
In various types of spectroscopy, one sees "peaks" in frequency space associated with different types of resonances. A Lorentzian peak (also known as a Breit-Wigner peak or a Cauchy distribution) with center $x_0$, width parameter $\Gamma$, and amplitude parameter $c$ has the form
$$\frac{c}{\pi} \frac{\Gamma/2}{(x-x_0)^2 + (\Gamma/2)^2}$$
Our problem here is to fit empirical measurements to a sum of Lorentzian peaks. This problem occurs often enough that there are specific approaches people take (we wouldn't always treat this as a general nonlinear least squares problem), but we will only take advantage of the partially linear nature of the problem, i.e. that we are minimizing
$$\sum_j \left( y_j - \sum_i \phi(x_j; x_{c,i}, \Gamma_i) c_i \right)^2$$
where $\phi(x_j; x_{c,i}, \Gamma_i)$ is the value at $x_j$ of the Lorentzian peak with amplitude one at center $x_{c,i}$ and with scale parameter $\Gamma_i$. That is, the amplitudes are determined by a linear least squares problem in which the coefficient matrix depends on some other parameters.
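A sketch of how the Lorentzian basis matrix $\Phi$ might be formed; the function names here are illustrative, not necessarily those used in the notebook's code.

```julia
# Sketch: unit-amplitude Lorentzian peak with center xc and width parameter Γ
lorentzian(x, xc, Γ) = (Γ/2) / (π*((x - xc)^2 + (Γ/2)^2))

# Sketch: basis matrix Φ with Φ[j,i] the i-th peak evaluated at x[j];
# the amplitudes c then come from the linear least squares problem Φ*c ≈ y.
lorentzian_basis(x, xc, Γ) = [lorentzian(xj, xc[i], Γ[i]) for xj in x, i in eachindex(xc)]
```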
fJp
ΦJp
0.0770435
0.0979035
0.146941
0.18336
0.0895584
0.153379
0.189965
0.17448
0.22621
0.0719385
0.5
1.2
1.6
0.2
0.2
0.2
1.0
1.0
1.0
lorx_plot
1.4590000946289105e-10
We start with a not-fantastic initial guess (plotted below). Getting a good initial guess is actually an interesting problem, but one that we will not address here.
From this initial guess, Gauss-Newton does not have any hope of convergence, but Levenberg-Marquardt does.
Questions
Play with the parameter $\lambda$ in the Levenberg-Marquardt call. What happens?
Play with the initial guess. Does convergence improve from a better starting guess?
Argue that the "true" signal looks like $p(x)/q(x)$ for low degree polynomials $p$ and $q$, where $q$ has complex roots $x_{c,i} \pm \iota \Gamma_i/2$. We can find $p$ and $q$ by setting $p(x_i) - y_i q(x_i) \approx 0$ in a least squares sense; how would we implement this directly to get a better initial guess for the centers and widths of the Lorentzians?
Variable projection
In the Lorentzian peak-fitting example, our function $f$ had the form
$$f(x_c, \Gamma, c) = y-\Phi(x_c, \Gamma) c.$$
But if we know the centers and width parameters, then computing the amplitude parameters becomes a standard linear least squares problem. Putting in the solution gives us the reduced problem (or projected problem)
$$f_{\mathrm{proj}}(x_c, \Gamma) = \left( I-\Phi(x_c, \Gamma) \Phi(x_c, \Gamma)^\dagger \right) y = Py$$
where $P = I-\Phi \Phi^\dagger$ is the residual projector. The reduced problem can then be solved with any of our nonlinear least-squares solvers. This technique, known as variable projection, is widely used in a variety of applications.
Minimizing the norm of $f_{\mathrm{proj}}$ is often nicer than minimizing the norm of the original $f$. Of course, in order to do this, we need to be able to compute the Jacobian of $f_{\mathrm{proj}}$! Fortunately, we have lots of experience at this point with computing such derivatives. Using the definition of the pseudoinverse and mumbling over algebra for a while yields:
$$\begin{align*} \delta P &= -(\delta \Phi) \Phi^\dagger - (\Phi^\dagger)^T (\delta \Phi)^T + (\Phi^\dagger)^T (\Phi^T \delta \Phi + (\delta \Phi)^T \Phi) \Phi^\dagger \\ &= -P (\delta \Phi) \Phi^\dagger - (\Phi^\dagger)^T (\delta \Phi)^T P \\ \delta f_{\mathrm{proj}} &= (\delta P) y = -P (\delta \Phi) c - (\Phi^\dagger)^T (\delta \Phi)^T r \end{align*}$$
where $c = \Phi^\dagger y$ and $r = f_{\mathrm{proj}} = y - \Phi c$. Given the full QR decomposition
$$\Phi = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_1 \\ 0 \end{bmatrix}$$
and observing that $P = I-Q_1 Q_1^T = Q_2 Q_2^T$ and $(\Phi^\dagger)^T = Q_1 R_1^{-T}$, we have
$$\delta f_{\mathrm{proj}} = -Q_2 Q_2^T (\delta \Phi) c - Q_1 R_1^{-T} (\delta \Phi)^T r = -Q \begin{bmatrix} R_1^{-T} (\delta \Phi)^T r \\ Q_2^T (\delta \Phi) c \end{bmatrix}$$
We know that the Householder QR factorization stores $Q$ implicitly, and while we can apply $Q$ and $Q^T$ quickly, we probably do not want to form $Q_2$ explicitly. But we can evaluate the above expression without forming $Q_2$ (see the sketch after this list) by
Forming $Q^T (\delta \Phi) c$
Overwriting the leading $n$ rows with $R_1^{-T} (\delta \Phi)^T r$
Multiplying the result by $Q$
Negating the whole thing
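Concretely, a sketch of this recipe in Julia, assuming F = qr(Φ) is a dense Householder QR factorization; the function name is illustrative.

```julia
using LinearAlgebra

# Sketch: apply δf_proj = -Q [R₁⁻ᵀ (δΦ)ᵀ r; Q₂ᵀ (δΦ) c] without forming Q₂,
# where F = qr(Φ) stores Q implicitly as Householder reflectors.
function dfproj_apply(F, δΦ, c, r)
    n = size(δΦ, 2)
    w = F.Q' * (δΦ*c)            # step 1: Qᵀ (δΦ) c, a length-m vector
    w[1:n] = F.R' \ (δΦ'*r)      # step 2: overwrite the leading n rows with R₁⁻ᵀ (δΦ)ᵀ r
    return -(F.Q * w)            # steps 3-4: multiply by Q and negate
end
```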
In the case of the peak-finding problem, each of the parameters affects only one basis vector. So, for example
$$\frac{\partial f_{\mathrm{proj}}}{\partial x_{c,j}} = -Q \begin{bmatrix} R_1^{-T} e_j \left( \frac{\partial \phi_j}{\partial x_{c,j}} \right)^T r \\ Q_2^T \left(\frac{\partial \phi_j}{\partial x_{c,j}}\right) c_j \end{bmatrix}$$
and similarly for the derivatives with respect to the width parameters $\Gamma_j$.
lorx_fJproj (generic function with 1 method)
Because there was plenty of room to make an algebra error in our manipulations above, let's do our usual sanity check.
2.7176355015749446e-10
Unlike the original full problem, the projected version of the problem can be solved rather quickly with Gauss-Newton, even if we use our not-so-fantastic initial guess.
Questions
Can you derive the expression for $\delta P$ yourself?
Explain the expression for the derivative of $f_{\mathrm{proj}}$ with respect to $x_{c,i}$.
We claimed that $P = I-Q_1 Q_1^T = Q_2 Q_2^T$; why are these two expressions equivalent?
Iteratively reweighted least squares
A number of optimization problems from statistics have the form
$$\mbox{minimize } \phi(x) = \sum_{i=1}^m \rho(r_i), \quad r = Ax-b.$$
where we will assume $\rho$ is an even function with $\rho(0) = 0$. This can, of course, be converted to a nonlinear least squares problem, but we will stick with this form. Let $\psi$ denote the derivative of $\rho$; then we have
$$\begin{align*} \nabla \phi(x) &= A^T \psi(r) \\ H_{\phi}(x) &= A^T \operatorname{diag}(\psi'(r)) A \end{align*}$$
where $\psi(r)$ and $\psi'(r)$ should be interpreted elementwise. A Newton step for this problem is
$$p = -(A^T \operatorname{diag}(\psi'(r)) A)^{-1} A^T \psi(r)$$
and this is equivalent to solving the weighted least squares problem
$$\mbox{minimize } \left\| \frac{\psi(r)}{\psi'(r)} + Ap \right\|_{\operatorname{diag}(\psi'(r))}^2$$
where $\psi(r)/\psi'(r)$ should be interpreted elementwise.
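A sketch of one such Newton step, posed as a weighted least squares solve; it assumes every $\psi'(r_i) > 0$, and the breakdown when some $\psi'(r_i) = 0$ is exactly the issue discussed next.

```julia
# Sketch of one Newton step for φ(x) = Σᵢ ρ((Ax-b)ᵢ), with ψ = ρ' and dψ = ρ''.
# Solves (Aᵀ diag(ψ'(r)) A) p = -Aᵀ ψ(r) as a weighted least squares problem.
function newton_wls_step(A, b, x, ψ, dψ)
    r = A*x - b
    s = sqrt.(dψ.(r))                # square roots of the weights ψ'(r), assumed positive
    p = -((s .* A) \ (ψ.(r) ./ s))   # scaled rows give the weighted least squares solve
    return x + p
end
```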
Of course, we already know that Newton iteration is not globally convergent in general, but this iteration has another irritating issue: it is undefined when any component of $\psi'(r)$ is exactly zero! There are interesting cases where this is a real issue. However, a slight modification avoids this problem and is globally convergent for convex loss functions. Define $W(r)$ to be the diagonal matrix of weights $w_k = \psi(r_k)/r_k$; then $x$ is a stationary point for $\phi$ if
$$A^T \psi(r) = A^T W(r) r = 0.$$
This suggests the fixed point iteration
$$A^T W(r^k) r^{k+1} = 0,$$
that is,
$$x^{k+1} = \operatorname{argmin}_x \|Ax-b\|_{W(r^k)}^2.$$
In words, at each step we compute a new weighted least squares fit to the data. Observe that
$$w_k = \frac{\psi(r_k)}{r_k} = \frac{\psi(r_k)-\psi(0)}{r_k-0} = \psi'(r_k) + o(r_k)$$
and
$$\frac{\psi(r_k)}{\psi'(r_k)} = \frac{\psi'(0) r_k}{\psi'(0)} + o(r_k) = r_k + o(r_k);$$
hence, as with Gauss-Newton, this iteration is similar to Newton iteration when the residuals are small.
This algorithm is an example of an iteratively reweighted least squares (IRLS) algorithm. Several algorithms share the IRLS name; all have the property that each iterate is the solution to a weighted least squares problem, where the weights vary from iteration to iteration.
irls (generic function with 1 method)
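The irls routine itself is not reproduced here, but a minimal sketch consistent with the description above (and containing the line referenced in the questions below) might look like:

```julia
using LinearAlgebra

# Sketch of an IRLS loop for min Σᵢ ρ((Ax-b)ᵢ), with ψ = ρ'.
# Each step solves a weighted least squares problem with weights w = ψ(r)/r ≥ 0.
function irls_sketch(A, b, ψ; gtol=1e-10, maxiter=100)
    x = A \ b                      # start from the ordinary least squares fit
    for k = 1:maxiter
        r = A*x - b
        norm(A'*ψ.(r)) < gtol && break
        ws = ψ.(r) ./ r
        ws[r .== 0] .= 0.0         # avoid NaN from 0/0 at exactly-zero residuals
        s = sqrt.(ws)
        x = (s .* A) \ (s .* b)    # weighted least squares fit
    end
    return x
end
```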
Questions
Write the IRLS iteration in additive update form – that is, what least squares problem does $p^k = x^{k+1}-x^k$ satisfy? Compare to the Newton least squares problem.
Explain the line
ws[r .== 0] .= 0.0
in the irls routine.
Robust regression
Least squares regression works well for Gaussian noise, but tends to perform poorly in the presence of outliers. Fitting procedures that better tolerate outliers are generally known as robust regression procedures. The most popular robust regression techniques involve solving a nonlinear least squares problem
$$\mbox{minimize } \sum_{j=1}^m \rho(r_j), \quad r = Ax-b$$
for some loss function $\rho$ that grows more slowly than the squared loss. The coefficients from this estimation procedure are known as $M$-estimators.
In order to compute $M$-estimators, we first want some initial guess at the solution and at the typical level of (non-outlier) noise in the data. A typical trick is to draw random subsets of the data until we think that with high probability at least one subset contains no outliers. For each subset, we do an ordinary least squares fit, and then judge quality based on the median absolute deviation (MAD), i.e. the median absolute value of the residuals. We return as our initial guess for $x$ the solution that has the smallest MAD. Based on the assumption that the non-outlier data is subject to normal noise, we also return a scale factor of MAD/0.6745 (since the MAD of the standard normal distribution is about 0.6745).
rr_init (generic function with 1 method)
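A sketch of an initializer along these lines; the number of trials and the interface are assumptions of the sketch, not the notebook's exact rr_init.

```julia
using Statistics, Random

# Sketch: draw random row subsets, fit each by ordinary least squares, keep the
# fit with the smallest median absolute deviation (MAD) over all the data, and
# return MAD/0.6745 as the scale estimate.
function rr_init_sketch(A, b; ntrials=100)
    m, n = size(A)
    xbest = A \ b                          # fall back to the full OLS fit
    madbest = median(abs.(A*xbest - b))
    for t = 1:ntrials
        s = randperm(m)[1:n]               # random subset of n rows
        x = A[s,:] \ b[s]
        mad = median(abs.(A*x - b))        # judge quality by MAD over all the data
        if mad < madbest
            xbest, madbest = x, mad
        end
    end
    return xbest, madbest/0.6745           # MAD of the standard normal ≈ 0.6745
end
```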
Two of the most common loss functions used in robust regression are
Huber loss: quadratic near the origin and then grows linearly away from the origin
Tukey loss: roughly quadratic near the origin, then flattens out to a constant
The Huber loss function is convex, so the resulting regression problem is convex and every local minimizer is global. The Tukey loss is less sensitive to extreme outliers, but may in general lead to many local minimizers.
dtukey (generic function with 1 method)
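For reference, a sketch of the derivatives $\psi = \rho'$ of these two losses; the cutoff $c$ is a scale parameter of the sketch, and these are the standard forms, not necessarily the notebook's exact dtukey.

```julia
# Sketch: ψ = ρ' for the Huber and Tukey losses with cutoff c.
# Huber clips the residual; Tukey (biweight) redescends to zero beyond c.
dhuber_sketch(r, c) = abs(r) <= c ? r : c*sign(r)
dtukey_sketch(r, c) = abs(r) <= c ? r*(1 - (r/c)^2)^2 : 0.0
```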
Now we set up and solve a robust regression problem with 10% outliers, using rr_init to get an initial guess and the iteratively reweighted least squares procedure to optimize the Tukey loss with an appropriately chosen scale factor.
0.0693697
0.0778182
0.0127841
3.70888
0.368664
0.041952
0.00491948
0.000590419
7.21037e-5
8.91969e-6
1.11396e-6
1.40104e-7
6.03477e-13
Running this example gives an initial estimate with error of about 0.0778 and an associated scale factor of about 0.0694. After 15 steps of IRLS, we get a converged result with an estimation error of about 0.0128. The convergence plot is shown below.
Questions
Will the rr_init routine fail if we unluckily draw a subset such that As[s,:] is singular?
Plot $\psi(r)/r$ at the converged solution. Explain what you observe.