Notebook for 2023-03-31
Last time, we mostly focused on solving systems of nonlinear equations. Today, we consider optimization problems. The two perspectives are very complementary, but we will delve into that further after break.
Recall from your calculus classes the second-order Taylor expansion. If $\phi : \mathbb{R}^n \rightarrow \mathbb{R}$ is $\mathcal{C}^2$ (i.e. if it is at least twice continuously differentiable) then we have the expansion
$$\phi(x+u) = \phi(x) + \phi'(x) u + \frac{1}{2} u^T H_{\phi}(x) u + o(\|u\|^2)$$
where $\phi'(x) \in \mathbb{R}^{1 \times n}$ is the derivative of $\phi$ and $H_{\phi}$ is the Hessian matrix consisting of second derivatives:
$$(H_\phi)_{ij} = \frac{\partial^2 \phi}{\partial x_i \partial x_j}.$$
The gradient $\nabla \phi(x) = \phi'(x)^T$ is a column vector (rather than a row vector).
Let's give an illustration comparing a function $\phi$ to the first and second-order Taylor expansions about some point $x_0$.
[Plot cell omitted: plot_vs_approx compares $\phi$ with its first- and second-order Taylor expansions about a point $x_0$.]
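Since the notebook cells are not reproduced here, the following is a minimal sketch of what such a comparison might look like in one dimension; the test function and expansion point are stand-ins, not the notebook's originals.

```julia
using Plots

# Stand-in 1D test function and hand-computed derivatives (hypothetical example)
ϕ(x)   = x^4 - 3x^2 + x
dϕ(x)  = 4x^3 - 6x + 1
d2ϕ(x) = 12x^2 - 6

# Compare ϕ with its first- and second-order Taylor expansions about x0
function plot_vs_approx(x0; xs=range(x0-1.5, x0+1.5, length=200))
    t1(x) = ϕ(x0) + dϕ(x0)*(x-x0)         # first-order expansion
    t2(x) = t1(x) + d2ϕ(x0)*(x-x0)^2/2    # second-order expansion
    plot(xs, ϕ.(xs), label="ϕ")
    plot!(xs, t1.(xs), label="1st order", linestyle=:dash)
    plot!(xs, t2.(xs), label="2nd order", linestyle=:dot)
end

plot_vs_approx(0.5)
```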
If $\nabla \phi(x) \neq 0$ then $\nabla \phi(x)$ and $-\nabla \phi(x)$ are the directions of steepest ascent and descent, respectively (we will have more to say on this point presently). If $\nabla \phi(x) = 0$, then we say $x$ is a stationary point or critical point. The first derivative test says that if $x$ minimizes $\phi$ (and $\phi$ is differentiable) then the gradient of $\phi$ at $x$ must be zero; otherwise, there is a "downhill" direction, and a point near $x$ achieves a smaller function value.
A stationary point need not be a local minimizer; it might also be a maximizer or a saddle point. The second derivative test says that for a critical point $x$ to be a (local) minimizer, the Hessian $H_{\phi}(x)$ must be at least positive semi-definite. Conversely, if $x$ is a stationary point and $H_{\phi}(x)$ is positive definite, then $x$ must be a local minimizer; in this case, we call $x$ a strong local minimizer.
One approach to the problem of minimizing $\phi$ is to run Newton iteration on the critical point equation $\nabla \phi(x) = 0$. The Jacobian of the function $\nabla \phi(x)$ is simply the Hessian matrix, so Newton's iteration for finding the critical point is just
$$x_{k+1} = x_k - H_{\phi}(x_k)^{-1} \nabla \phi(x_k).$$
We can derive this in the same way that we derived Newton's iteration for other nonlinear equations; or we can derive it from finding the critical point of a quadratic approximation to $\phi$:
$$\hat{\phi}(x_k+p_k) = \phi(x_k) + \phi'(x_k) p_k + \frac{1}{2} p_k^T H_{\phi}(x_k) p_k.$$
The critical point occurs for $p_k = -H_{\phi}(x_k)^{-1} \nabla \phi(x_k)$; but this critical point is a strong local minimum iff $H_{\phi}(x_k)$ is positive definite.
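To make this concrete, here is a bare-bones Newton iteration for minimization. The Rosenbrock function below is a standard test problem (not one from the lecture), with hand-coded gradient and Hessian.

```julia
using LinearAlgebra

# Newton iteration on the critical point equation ∇ϕ(x) = 0
function newton_min(∇ϕ, Hϕ, x0; maxiter=50, gtol=1e-10)
    x = copy(x0)
    for k = 1:maxiter
        g = ∇ϕ(x)
        norm(g) < gtol && break
        x -= Hϕ(x) \ g        # solve Hϕ(x) p = -g, then step x += p
    end
    x
end

# Rosenbrock: ϕ(x) = (1-x₁)² + 100(x₂-x₁²)², minimized at (1,1)
∇f(x) = [-2(1-x[1]) - 400x[1]*(x[2]-x[1]^2), 200(x[2]-x[1]^2)]
Hf(x) = [2 - 400x[2] + 1200x[1]^2   -400x[1];
         -400x[1]                    200.0]
newton_min(∇f, Hf, [-1.2, 1.0])     # should converge to ≈ [1.0, 1.0]
```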
There are several reasons we might not want to simply say "run Newton to find a critical point" and call it a day. Three key ones are:
Computing Hessians can be a bit of a pain.
We can take advantage of the fact that this is not a general system of nonlinear equations in devising and analyzing methods.
If we only seek to solve the critical point equation, we might end up finding a maximizer or saddle point as easily as a minimizer.
For this reason, we will discuss a different class of iterations, the (scaled) gradient descent methods and their relatives.
Questions
For all these questions, assume $\phi$ is twice continuously differentiable and that the Hessian $H_{\phi}$ is Lipschitz with constant $M$.
Change the starting point in the iteration above (e.g. to the point $(0, 0.6)$). What changes?
Explain why $\phi(x+u) = \phi(x) + \nabla \phi(x)^T u + \frac{1}{2} u^T H_{\phi}(x+\xi u) u$ for some $\xi \in [0,1]$.
Argue that $|\phi(x) + \nabla \phi(x)^T u + \frac{1}{2} u^T H_{\phi}(x) u - \phi(x+u)| \leq \frac{M}{2} \|u\|^3$.
Let $p$ be a Newton update starting from $x$, and let $w = \phi(x)-\phi(x+p)$ be the improvement in the function value during that Newton step. Show that $\left| w-\frac{1}{2} p^T H_{\phi}(x) p \right| \leq \frac{M}{2} \|p\|^3$.
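A numerical sanity check of the bound in the third question is straightforward. The cubic $\phi$ below is a stand-in chosen so that $H_\phi$ is Lipschitz with constant $M = 1$.

```julia
using LinearAlgebra

# Stand-in: ϕ(x) = Σᵢ xᵢ³/6, so ∇ϕ(x) = x.²/2, Hϕ(x) = Diagonal(x),
# and Hϕ is Lipschitz with constant M = 1.
ϕ(x)  = sum(x.^3)/6
∇ϕ(x) = x.^2 ./ 2
Hϕ(x) = Diagonal(x)

x = randn(5)
for h in (1e-1, 1e-2, 1e-3)
    u = h*randn(5)
    err = abs(ϕ(x) + ∇ϕ(x)'*u + u'*Hϕ(x)*u/2 - ϕ(x+u))
    println("‖u‖ = $(norm(u)):  err/‖u‖³ = $(err/norm(u)^3)")  # stays ≤ M/2 = 0.5
end
```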
Gradient descent
One of the simplest optimization methods is the steepest descent or gradient descent method
$$x_{k+1} = x_k + \alpha_k p_k$$
where $\alpha_k$ is a step size and $p_k = -\nabla \phi(x_k)$.
Let's try an example using the test function above. Play with $\alpha$. What happens to the convergence?
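The notebook cell is omitted here; what follows is a minimal sketch of fixed-step gradient descent, with a stand-in quadratic in place of the notebook's test function.

```julia
# Fixed-step gradient descent, recording the iterates
function gradient_descent(∇ϕ, x0; α=0.1, maxiter=100)
    x = copy(x0)
    xs = [copy(x)]
    for k = 1:maxiter
        x -= α*∇ϕ(x)
        push!(xs, copy(x))
    end
    xs
end

# Stand-in: ϕ(x) = (x₁² + 4x₂²)/2, so ∇ϕ(x) = [x₁, 4x₂].
# Converges for α < 1/2; try α near and beyond that threshold.
hist = gradient_descent(x -> [x[1], 4x[2]], [1.0, 1.0]; α=0.3)
```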
To understand the convergence of this method, consider gradient descent with a fixed step size $\alpha$ for the quadratic model problem
$$\phi(x) = \frac{1}{2} x^T A x + b^T x + c$$
where $A$ is symmetric positive definite. We have computed the gradient for a quadratic before:
$$\nabla \phi(x) = Ax + b,$$
which gives us the iteration equation
$$x_{k+1} = x_k - \alpha (A x_k + b).$$
Subtracting the fixed point equation
$$x_* = x_* - \alpha (A x_* + b)$$
yields the error iteration
$$e_{k+1} = (I-\alpha A) e_k.$$
If $\{ \lambda_j \}$ are the eigenvalues of $A$, then the eigenvalues of $I-\alpha A$ are $\{ 1-\alpha \lambda_j \}$. The spectral radius of the iteration matrix is thus
$$\max_j |1-\alpha \lambda_j| = \max \left( |1-\alpha \lambda_{\min}|, |1-\alpha \lambda_{\max}| \right).$$
The iteration converges provided $\alpha < 2/\lambda_{\max}$, and the optimal $\alpha$ is
$$\alpha_* = \frac{2}{\lambda_{\min} + \lambda_{\max}},$$
which leads to the spectral radius
$$1 - \frac{2 \lambda_{\min}}{\lambda_{\min} + \lambda_{\max}} = 1 - \frac{2}{1 + \kappa(A)}$$
where $\kappa(A) = \lambda_{\max}/\lambda_{\min}$ is the condition number for the (symmetric positive definite) matrix $A$. If $A$ is ill-conditioned, then, we are forced to take very small steps to guarantee convergence, and convergence may be heartbreakingly slow. We will get to the minimum in the long run; but, then again, in the long run we all die.
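To see these formulas in action, here is a quick check with a small hypothetical SPD matrix (an illustration, not the notebook's example):

```julia
using LinearAlgebra

A = [5.0 1.0; 1.0 2.0]                   # hypothetical SPD example
λmin, λmax = extrema(eigvals(Symmetric(A)))
ρ(α) = max(abs(1-α*λmin), abs(1-α*λmax)) # spectral radius of I - αA
α_opt = 2/(λmin + λmax)
ρ(α_opt) ≈ 1 - 2/(1 + λmax/λmin)         # the two rate expressions agree
```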
Our example problem is not quadratic, of course. But close to the minimum at $(0,0)$, it is close enough to quadratic that we get a good picture of the convergence from the quadratic approximation. Try plugging in the optimal $\alpha$ and some nearby values in this case to confirm that it gives about as good convergence as we can manage.
If we take a somewhat less well-conditioned problem, we get a significantly slower optimal rate of convergence.
The behavior of steepest descent iteration on a quadratic model problem is indicative of the behavior more generally: if $x_*$ is a strong local minimizer of some general nonlinear $\phi$, then gradient descent with sufficiently small step size will converge locally to $x_*$. But if $H_{\phi}(x_*)$ is ill-conditioned, then one has to take small steps, and the rate of convergence can be quite slow.
Not all problems are terribly ill-conditioned, and so in many cases simple gradient descent algorithms can work quite well. For ill-conditioned problems, though, we would like to change something about the algorithm. One approach is to keep the gradient descent direction and adapt the step size in a clever way; the Barzilai-Borwein (BB) method and its relatives take this approach. These remarkable methods deserve to be better known, but in the interest of fitting the course into the semester, we will turn instead to the problem of choosing better directions.
"Steepest" descent
At a point $x$, a direction $p$ is a descent direction for $\phi$ if $\phi'(x) p < 0$. We measure how "steep" a descent direction is by the rate at which the function value decreases if we move at unit speed in that direction: $-\phi'(x) p/\|p\|$. If $\|\cdot\|$ is the ordinary Euclidean norm on $\mathbb{R}^n$, we have that the direction of steepest descent is $-\phi'(x)^T/\|\phi'(x)\|$. But what happens if we use a different norm for measuring distance?
Put differently, consider the problem of maximizing $-g^T u$ over the ball $\|u\| \leq 1$. Then
Over $\|u\|_2 \leq 1$, we have $u = -g/\|g\|$
Over $\|u\|_\infty \leq 1$, we have $u = -\operatorname{sign}(g)$
Over $\|u\|_1 \leq 1$, we have $u = -\operatorname{sign}(g_k) e_k$ where $g_k$ is the largest magnitude entry of $g$.
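All three cases are easy to check numerically; here is a small sketch with an arbitrary example gradient $g$:

```julia
using LinearAlgebra

g = [3.0, -1.0, 2.0]                     # an arbitrary example gradient
e(k, n) = (v = zeros(n); v[k] = 1.0; v)  # k-th standard basis vector

u2 = -g/norm(g)                          # steepest descent in the 2-norm
u∞ = -sign.(g)                           # steepest descent in the ∞-norm
k  = argmax(abs.(g))
u1 = -sign(g[k])*e(k, length(g))         # steepest descent in the 1-norm
```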
Of course, these are not the only norms out there. In particular, if $M$ is a positive definite matrix, then there is an associated inner product
$$\langle x, y \rangle_M = y^T M x$$
and an associated Euclidean norm
$$\|x\|_M = \sqrt{x^T M x}.$$
Maximizing $-g^T u / \|u\|_M$ gives us $u \propto -M^{-1} g$.
Scaled gradient descent
The scaled gradient descent iteration takes the form
$$x_{k+1} = x_k + \alpha_k p_k, \quad M_k p_k = -\nabla \phi(x_k).$$
where $\alpha_k$ and $p_k$ are the step size and direction, as before, and $M_k$ is a symmetric positive definite scaling matrix. We can also see this as ordinary steepest descent, but with respect to the adaptively chosen Euclidean norms $\|\cdot\|_{M_k}$.
Positive definiteness of $M_k$ guarantees that $p_k$ is a descent direction, i.e.
$$\phi'(x_k) p_k = \nabla \phi(x_k)^T p_k = -\nabla \phi(x_k)^T M_k^{-1} \nabla \phi(x_k) < 0;$$
this in turn guarantees that if $\alpha_k$ is sufficiently small, $\phi(x_{k+1})$ will be less than $\phi(x_k)$, unless $x_k$ is a stationary point (i.e. $\nabla \phi(x_k) = 0$).
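A one-step sketch of the scaled iteration; factoring $M_k$ with Cholesky both exploits and verifies positive definiteness:

```julia
using LinearAlgebra

# One scaled gradient descent step: solve M*p = -∇ϕ(x), then take x + α*p.
# cholesky throws a PosDefException if M is not positive definite.
function scaled_gd_step(x, ∇ϕ, M; α=1.0)
    p = -(cholesky(Symmetric(M)) \ ∇ϕ(x))
    x + α*p
end
```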
How does scaling improve on simple gradient descent? Consider again the quadratic model problem
$$\nabla \phi(x) = Ax + b,$$
and let $M$ and $\alpha$ be fixed. With a little work, we derive the error iteration
$$e_{k+1} = (I-\alpha M^{-1} A) e_k.$$
If $M = \alpha A$ (e.g. $\alpha = 1$ and $M = A$), the iteration converges in a single step! Going beyond the quadratic model problem, if $x_*$ were a known strong local minimizer, we could use $M_k = H_{\phi}(x_*)$, which gives locally quadratic convergence (see the questions below).
In general, we cannot scale with $H_{\phi}(x_*)$, since we don't know the location of $x_*$! But if $H_{\phi}(x_k)$ is positive definite, we might choose $M_k = H_{\phi}(x_k)$, which would correspond to a Newton step. Of course, $H_{\phi}(x_k)$ does not have to be positive definite everywhere! Thus, most minimization codes based on Newton scaling use $M_k = H_{\phi}(x_k)$ when it is positive definite, and otherwise use some modification. One possible modification is to choose a diagonal shift $M_k = H_{\phi}(x_k) + \beta I$ where $\beta$ is sufficiently large to guarantee positive definiteness. Another common approach is to compute a modified Cholesky factorization of $H_{\phi}(x_k)$. The modified Cholesky algorithm looks like ordinary Cholesky, and is identical to ordinary Cholesky when $H_{\phi}(x_k)$ is positive definite. But rather than stopping when it encounters a negative diagonal in a Schur complement, the modified Cholesky approach replaces that element with something else and proceeds.
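Here is a sketch of the diagonal-shift strategy only (not the modified Cholesky algorithm, and not what any particular production code does): attempt a Cholesky factorization, and double the shift $\beta$ until it succeeds.

```julia
using LinearAlgebra

# Factor H + βI, doubling β until the shifted matrix is SPD;
# β = 0 is tried first, so an SPD H is used unmodified (a Newton step).
function shifted_cholesky(H; β0=1e-8)
    F = cholesky(Symmetric(H), check=false)
    issuccess(F) && return F
    β = β0*norm(H)
    while β < 1e8*norm(H)
        F = cholesky(Symmetric(H + β*I), check=false)
        issuccess(F) && return F
        β *= 2
    end
    error("no suitable shift found")
end

# A scaled descent direction is then p = -(shifted_cholesky(H) \ g).
```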
Questions
Suppose $\phi$ is a $C^2$ function and $H_{\phi}$ is Lipschitz with constant $M$. Let $x_*$ be a strong local minimizer for $\phi$ and consider the scaled steepest descent iteration
$$x_{k+1} = x_k - H_{\phi}(x_*)^{-1} \nabla \phi(x_k)$$
Subtract off the fixed point equation to get
$$e_{k+1} = e_k - H_{\phi}(x_*)^{-1} \left( \nabla \phi(x_*+e_k) - \nabla \phi(x_*) \right)$$
Conclude that for some $\xi \in [0,1]$,
$$\|e_{k+1}\| \leq \| H_{\phi}(x_*)^{-1} \left(H_{\phi}(x_*) - H_{\phi}(x_* + \xi e_k)\right) e_k \|,$$
which implies under the Lipschitz assumption that
$$\|e_{k+1}\| \leq \|H_{\phi}(x_*)^{-1}\| M \|e_k\|^2.$$
Therefore, the iteration converges quadratically for good enough initial guesses. Can you also give a condition on $\|e_0\|$ that guarantees an initial guess is "good enough"?
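Before tackling the last question analytically, the quadratic convergence itself is easy to observe numerically. Here is a sketch with a stand-in scalar function whose minimizer is known:

```julia
# ϕ(x) = x²/2 + x³/3 has x* = 0 with Hϕ(x*) = 1, so the scaled
# iteration x ← x - Hϕ(x*)⁻¹ ∇ϕ(x) is just x ← x - (x + x²) = -x².
∇ϕ(x) = x + x^2
x = 0.1
for k = 1:5
    global x -= ∇ϕ(x)
    println("k = $k:  |e_k| = $(abs(x))")   # 1e-2, 1e-4, 1e-8, ...: the error squares each step
end
```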