
Notebook for 2023-03-13

The main topic of this lecture is basic stationary iterations for solving linear systems. As discussed in the notes, these iterations are typically associated with a splitting $A = M-N$ where $M$ is a matrix that is easy to solve with.

md"""
# Notebook for 2023-03-13

The main topic of this lecture is basic *stationary* iterations for solving linear systems. As discussed in the notes, these iterations are typically associated with a *splitting* $A = M-N$ where $M$ is a matrix that is easy to solve with.
"""
132 ms
using LinearAlgebra
287 μs
using Plots
10.6 s

Splittings and convergence

A stationary iteration for $Ax = b$ is associated with a splitting $A = M-N$ where $M$ is a matrix that is easy to solve with. In terms of the splitting, we have

$$Mx = Nx + b$$

which is the fixed point equation for the iteration

$$Mx^{(k+1)} = Nx^{(k)} + b$$

We can also rewrite this in update form as

$$x^{(k+1)} = x^{(k)} + M^{-1}\left( b-Ax^{(k)} \right)$$

The error iteration is $e^{(k+1)} = R e^{(k)}$ where $R = M^{-1} N$. We typically analyze $R$ in terms of a bound on an operator norm (computed via opnorm in Julia). But tight bounds on the rate of convergence of the iteration are given by the spectral radius $\rho(R)$, i.e. the largest magnitude of any eigenvalue of $R$.

If $v$ is an eigenvector associated with the largest magnitude eigenvalue, then $\|Rv\| = |\rho(R)| \|v\|$. So $\rho(R)$ is a lower bound on any operator norm of $R$, as $\rho(R) = \|Rv\|/\|v\| \leq \|R\|$.

Going the other way, if $R$ is diagonalizable, then there exists a vector norm such that $\rho(R)$ is equal to the associated operator norm. If $R$ is not diagonalizable, then for any $\epsilon > 0$ there is a vector norm such that the induced operator norm satisfies $\rho(R) \leq \|R\| \leq \rho(R) + \epsilon$.

md"""
## Splittings and convergence

A stationary iteration for $Ax = b$ is associated with a splitting $A = M-N$ where $M$ is a matrix that is easy to solve with. In terms of the splitting, we have

$$Mx = Nx + b$$

which is the fixed point equation for the iteration

$$Mx^{(k+1)} = Nx^{(k)} + b$$

We can also rewrite this in update form as

$$x^{(k+1)} = x^{(k)} + M^{-1}\left( b-Ax^{(k)} \right)$$

521 μs
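
The update form suggests a generic implementation: all we need is a routine that applies $M^{-1}$ to a residual. The cell below is an added sketch (not part of the original notes); the helper name stationary_update and the argument Msolve, which stands in for "solve a system with $M$", are illustrative. With Msolve = r->r./diag(A) this reproduces the Jacobi sweep defined later in the notebook, and with Msolve = r->LowerTriangular(A)\r it reproduces Gauss-Seidel.

function stationary_update(Msolve, A, b, nsteps=50; x0=zeros(length(b)))
    x = copy(x0)
    for k = 1:nsteps
        x += Msolve(b-A*x)   # x^(k+1) = x^(k) + M^{-1} (b - A x^(k))
    end
    x
end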
spectral_radius (generic function with 1 method)
spectral_radius(R) = maximum(abs.(eigvals(R)))
291 μs
Arand
10×10 Matrix{Float64}:
 0.57905   0.3999    0.950655   0.993927  …  0.655512  0.626451   0.592262
 0.896937  0.638905  0.0958122  0.39459      0.424927  0.80666    0.496504
 0.208288  0.696011  0.614289   0.265439     0.437093  0.319736   0.364303
 0.11411   0.565658  0.601235   0.996179     0.730612  0.330163   0.440749
 0.481979  0.805915  0.788288   0.773162     0.846622  0.192573   0.814053
 0.438884  0.646805  0.319852   0.497737  …  0.966071  0.55545    0.360647
 0.576505  0.788908  0.907664   0.284408     0.443088  0.0834277  0.493762
 0.895204  0.178295  0.152112   0.984811     0.147581  0.0270541  0.80562
 0.769117  0.757536  0.359267   0.716494     0.611082  0.559553   0.0101991
 0.823437  0.313998  0.892947   0.970725     0.960149  0.683754   0.0212497
Arand = rand(10,10)
20.6 μs
5.544225356637162
spectral_radius(Arand) # Spectral radius
2.9 ms
5.620282147857701
opnorm(Arand) # Operator 2-norm
1.1 ms
6.877471368765031
opnorm(Arand,1) # Operator 1-norm
14.4 μs
6.405730168285172
opnorm(Arand,Inf) # Operator ∞-norm
15.5 μs
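
The numbers above show the spectral radius sitting below each of the operator norms, as expected. As a quick numerical illustration of the converse claim (an added sketch, not part of the original notes): if $R = V \Lambda V^{-1}$ is diagonalizable, then in the vector norm $x \mapsto \|V^{-1} x\|_\infty$ the induced operator norm of $R$ is $\|V^{-1} R V\|_\infty$, which is exactly $\rho(R)$.

let
    λ, V = eigen(Arand)
    # Induced operator norm of Arand in the norm x -> norm(V\x, Inf);
    # this should match the spectral radius up to roundoff.
    spectral_radius(Arand), opnorm(V\Arand*V, Inf)
end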

As a side note: if opnorm computes the operator norm, what does norm do when the argument is a matrix? The answer is that the matrix is treated as a vector, and the ordinary vector norm is computed. So norm(A) in Julia computes the ordinary Euclidean norm of a vector composed of all entries of $A$ – that is, it computes the Frobenius norm of the matrix.

md"""
As a side note: if `opnorm` computes the operator norm, what does `norm` do when the argument is a matrix? The answer is that the matrix is treated as a vector, and the ordinary vector norm is computed. So `norm(A)` in Julia computes the ordinary Euclidean norm of a vector composed of all entries of $A$ -- that is, it computes the Frobenius norm of the matrix.
"""
169 μs
6.204586897865003
norm(Arand)
22.6 μs
6.204586897865003
sqrt(sum(Arand.^2))
50.4 ms

Richardson's iteration

We start with an iteration that is not actually in the lecture notes: Richardson's iteration. For $A = I-N$, Richardson's iteration for $Ax = b$ is

$$x^{(k+1)} = N x^{(k)} + b$$

This corresponds to the splitting $A = M-N$ where $M = I$.

If we start with $x^{(0)} = 0$, we have actually seen this iteration before in a different guise:

$$x^{(k)} = \sum_{j=0}^{k-1} N^j b = (I-N)^{-1} (I-N^k) b$$

This is precisely the $k$th partial sum of a Neumann series for $(I-N)^{-1}$ (multiplied by $b$)! In this case, the error at step $k$ is

$$e^{(k)} = (I-N)^{-1} b - x^{(k)} = (I-N)^{-1} N^k b$$

and we therefore have the relative error condition

$$\frac{\|e^{(k)}\|}{\|(I-N)^{-1} b\|} \leq \|N^k\| \leq \|N\|^k,$$

where the last inequality assumes a consistent norm. Under this assumption, convergence is guaranteed if $\|N\| < 1$. However, this condition is only sufficient; the condition that is both necessary and sufficient (for convergence for almost all $b$) is $\rho(N) < 1$, and $\rho(N)$ is the quantity that tightly controls the rate of convergence.

md"""
## Richardson's iteration

We start with an iteration that is not actually in the lecture notes: Richardson's iteration. For $A = I-N$, Richardson's iteration for $Ax = b$ is

$$x^{(k+1)} = N x^{(k)} + b$$

This corresponds to the splitting $A = M-N$ where $A = I$.

If we start with $x^{(0)} = 0$, we have actually seen this iteration before in a different guise:

$$x^{(k)} = \sum_{j=0}^{k-1} N^j b = (I-N)^{-1} (I-N^k) b$$

This is precisely the $k$th partial sum of a Neumann series for $(I-N)^{-1}$ (multiplied by $b$)! In this case, the error at step $k$ is

$$e^{(k)} = (I-N)^{-1} b - x^{(k)} = (I-N)^{-1} N^k b$$

and we therefore have the relative error condition
351 μs
richardson (generic function with 2 methods)
function richardson(N, b, nsteps=50; monitor=(x)->nothing)
    x = zeros(length(b))
    for j = 1:nsteps
        x = N*x+b
        monitor(x)   # Report the new iterate (e.g. to track errors)
    end
    x
end
1.9 ms
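
As a quick sanity check (an added sketch with made-up test data, not part of the original notebook): starting from $x^{(0)} = 0$, the Richardson iterate after $k$ steps should match the $k$-term partial sum of the Neumann series, $\sum_{j=0}^{k-1} N^j b$.

let
    N = 0.3*rand(4,4)   # Made-up test matrix with small norm
    b = rand(4)
    k = 10
    x_iter = richardson(N, b, k)
    x_neumann = sum(N^j*b for j = 0:k-1)
    norm(x_iter-x_neumann)   # Should be zero up to roundoff
end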
test_richardson (generic function with 1 method)
function test_richardson()
    N = 0.2 * randn(5,5)
    normN = opnorm(N)
    rhoN = maximum(abs.(eigvals(N)))
    b = rand(5)
    xref = (I-N)\b
    relerrs = []
    richardson(N, b, monitor=(x)->push!(relerrs, norm(x-xref)/norm(xref)))
    plot(relerrs, yscale=:log10, label="Relative error")
    plot!(normN.^(1:50), yscale=:log10, label="Normwise bound")
    plot!(rhoN.^(1:50), yscale=:log10, style=:dash, label="Rate from spectral radius")
end
1.5 ms
2.8 s

PageRank

Suppose $A$ is the (weighted) adjacency matrix for some graph. PageRank is a way of ranking the importance of different nodes in a graph according to a "random surfer" model. A surfer at node $i$ at time $t$ chooses where to move at time $t+1$ in one of two ways:

  1. With probability $\alpha \in [0,1)$, move from $i$ to neighbor $j$. The "damping probability" $\alpha$ is frequently chosen to be about $0.85$. Conditioned on taking a neighbor step, the neighbor node $j$ is chosen with probability $a_{ji}/d_i$, where $d_i = \sum_j a_{ji}$ is the corresponding column sum of $A$.

  2. Otherwise, with probability $1-\alpha$, "teleport" to some node $j$ in the graph with probability $\pi^{\mathrm{ref}}_j$. The reference probability is often chosen to be uniform ($\pi^{\mathrm{ref}}_j = 1/n$), but choosing a non-uniform reference is useful when defining "personalized" PageRank.

Let $AD^{-1}$ be the column-normalized transition matrix. Then we can write this process as

$$\pi^{(t+1)} = \alpha AD^{-1} \pi^{(t)} + (1-\alpha) \pi^{\mathrm{ref}}$$

With the setup from a moment ago, we can see that this looks like Richardson's iteration for the equation

$$(I-\alpha AD^{-1}) \pi = (1-\alpha) \pi^{\mathrm{ref}}$$

Because $AD^{-1}$ has columns that sum to 1, we know that $\|AD^{-1}\|_1 = 1$. We also know that $\rho(AD^{-1}) = 1$ in this case; this is true whenever we have a column stochastic matrix like $AD^{-1}$. And yet, the PageRank iteration decreases the error by more than a factor of $\alpha$ at each step. Why? Because we start off with the correct sum of the entries: both $\pi$ and $\pi^{(0)} = \pi^{\mathrm{ref}}$ sum to one, so the initial error $e^{(0)} = \pi - \pi^{(0)}$ sums to zero. Writing $AD^{-1} = V \Lambda V^{-1}$ (with the eigenvalue $1$ listed first), $V^{-1} e^{(0)}$ therefore has first entry zero: there is no error in the direction of the first eigenvector when we start with $\pi^{(0)} = \pi^{\mathrm{ref}}$. In this case, convergence is associated with the second largest magnitude eigenvalue.

I encourage you to check this for yourself. What happens to the plot below if you initialize the iteration with $\pi^{(0)} = 0$ rather than with $\pi^{(0)} = \pi^{\mathrm{ref}}$?

md"""
## PageRank

Suppose $A$ is the (weighted) adjacency matrix for some graph. PageRank is a way of ranking the importance of different nodes in a graph according to a "random surfer" model. A surfer at node $i$ at time $t$ chooses where to move at time $t+1$ in one of two ways:

1. With probability $\alpha \in [0,1)$, move from $i$ to neighbor $j$. The "dampling probability" $\alpha$ is frequently chosen to be about $0.85$. Conditioned on taking a neighbor step, the neighbor node $j$ is chosen with probability $a_{ji}/d_i$.
2. Otherwise, with probability $1-\alpha$, "teleport" to some node $j$ in the graph with probability $\pi^{\mathrm{ref}}_j$. The reference probability is often chosen to be uniform ($\pi^{\mathrm{ref}}_j = 1/n$), but choosing a non-uniform reference is useful when defining "personalized" PageRank.

Let $AD^{-1}$ be the column-normalized transition matrix. Then we can write this process as

2.4 ms
pagerank (generic function with 2 methods)
function pagerank(A, α, πref, nsteps=100; monitor=(π)->nothing)
    d = vec(sum(A, dims=1))      # Compute column sums
    π = copy(πref)               # Initialize with the reference probability
    #π = zeros(length(πref))     # For comparison -- initialize with the zero vector
    for t = 1:nsteps
        π[:] = α*(A*(π./d)) + (1-α)*πref
        monitor(π)               # Report the new iterate (e.g. to track residuals)
    end
    π
end
2.0 ms
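
Because the PageRank vector satisfies $(I-\alpha AD^{-1}) \pi = (1-\alpha) \pi^{\mathrm{ref}}$, a direct solve gives a reference to compare the iteration against. The cell below is an added sketch with made-up test data (not part of the original notes).

let
    A = rand(8,8)                  # Made-up weighted adjacency matrix
    d = vec(sum(A, dims=1))
    α = 0.85
    πref = ones(8)/8
    π_direct = (I-α*A*Diagonal(1.0./d))\((1-α)*πref)
    π_iter = pagerank(A, α, πref, 200)
    norm(π_iter-π_direct, 1)       # Should be tiny after enough iterations
end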
test_pagerank (generic function with 1 method)
function test_pagerank()

    # Set up PageRank problem parameters
    A = rand(10,10)
    d = vec(sum(A, dims=1))
    πref = 0.1 * ones(10)
    α = 0.85

    # Get the "right" rate constant
    abs_λs = sort(abs.(eigvals(A*Diagonal(1.0./d))), rev=true)
    ρ2 = α*abs_λs[2]

    # Run PR and track the residual norms with the monitor
    resid(π) = π-α*(A*(π./d))-(1-α)*πref
    resids = []
    pagerank(A, α, πref, 50, monitor=(π)->push!(resids, norm(resid(π),1)))

    # Semi-log plot of residuals, norm^k, and ρ2^k
    plot(resids, label="Residuals", yscale=:log10)
    plot!(α.^(1:50), label="Norm rate bound", style=:dash, yscale=:log10)
    plot!(ρ2.^(1:50), label="Actual rate", style=:dash, yscale=:log10)

end
2.5 ms
10.4 ms

Jacobi iteration

The Jacobi iteration takes $M$ to be the diagonal part of $A$. We say $A$ is strictly (row) diagonally dominant if for every row $i$,

$$|a_{ii}| > \sum_{j\neq i} |a_{ij}|$$

For Jacobi iteration, the error iteration matrix $R = M^{-1} N$ has off-diagonal entries $-a_{ij}/a_{ii}$ (and zeros on the diagonal), and so strict row diagonal dominance means that $\|R\|_\infty < 1$. Therefore, Jacobi iteration is guaranteed to converge for strictly row diagonally dominant matrices, with error strictly decreasing at each step in the $\infty$-norm.

It is convenient not to form $N = M-A$ explicitly. Instead, we work with the update form

$$x^{(k+1)} = x^{(k)} + M^{-1} (b-Ax^{(k)})$$

md"""
## Jacobi iteration

The *Jacobi* iteration takes $M$ to be the diagonal part of $A$. We say $A$ is *strictly (row) diagonally dominant* if for every row $i$,

$$|a_{ii}| > \sum_{j\neq i} |a_{ij}|$$

For Jacobi iteration, the error iteration matrix $R = M^{-1} N$ has off-diagonal entries $-a_{ij}/a_{ii}$ (and zeros on the diagonal), and so strict row diagonal dominance means that $\|R\|_\infty < 1$. Therefore, Jacobi iteration is guaranteed to converge for strictly row diagonally dominant matrices, with error strictly decreasing at each step in the $\infty$-norm.

It is convenient not to form $N = M-A$ explicitly. Instead, we work with the update form

$$x^{(k+1)} = x^{(k)} + M^{-1} (b-Ax^{(k)})$$
"""
333 μs
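
As an added illustration (not part of the original notes): build a strictly row diagonally dominant test matrix and check that the Jacobi iteration matrix $R = I - D^{-1} A$ indeed has $\|R\|_\infty < 1$ (and hence $\rho(R) < 1$).

let
    A = rand(6,6) + 6I                   # Made-up strictly diagonally dominant matrix
    R = I - Diagonal(diag(A))\A          # Jacobi iteration matrix R = I - D^{-1} A
    opnorm(R, Inf), spectral_radius(R)   # Both should be less than 1
end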
jacobi (generic function with 2 methods)
function jacobi(A, b, nsteps=50; monitor=(x, r)->nothing)
    x = zeros(length(b))
    d = diag(A)
    for j = 1:nsteps
        r = b-A*x
        x += r./d
        monitor(x, r)   # Report the new iterate and the residual used to form it
    end
    x
end
2.0 ms
test_jacobi (generic function with 1 method)
function test_jacobi()
    A = I-0.05*rand(10,10) # Diagonally dominant test matrix
    b = rand(10)
    normR = opnorm(Diagonal(diag(A))\A-I, Inf)
    xref = A\b
    errs = []
    jacobi(A, b, 50, monitor=(x,r)->push!(errs, norm(xref-x, Inf)))
    plot(errs, yscale=:log10, label="Errors")
    plot!(norm(xref,Inf)*normR.^(1:50), yscale=:log10, style=:dash, label="Bound")
end
1.3 ms
792 μs

Gauss-Seidel

The Gauss-Seidel iteration sets $M = L$, where $L$ is the lower triangular part of $A$, including the diagonal (assuming the usual ordering of equations and unknowns). In update form, we have

$$x^{(k+1)} = x^{(k)} + L^{-1} (b-Ax^{(k)}).$$

If $A$ is symmetric and positive definite, then we showed in the notes that one Gauss-Seidel step can be seen as coordinate descent on

$$\phi(x) = \frac{1}{2} x^T A x - x^T b$$

where we minimize with respect to each $x_i$ in turn. This guarantees that $\phi(x^{(k)})$ is decreasing with $k$, and a little extra argument gives that Gauss-Seidel always converges for positive definite systems (though it is not guaranteed to converge very fast).

md"""
## Gauss-Seidel

The Gauss-Seidel iteration sets $M = L$, where $L$ is the lower triangular part of $A$, including the diagonal (assuming the usual ordering of equations and unknowns). In update form, we have

$$x^{(k+1)} = x^{(k)} + L^{-1} (b-Ax^{(k)}).$$

If $A$ is symmetric and positive definite, then we showed in the notes that one Gauss-Seidel step can be seen as coordinate descent on

$$\phi(x) = \frac{1}{2} x^T A x - x^T b$$

where we minimize with respect to each $x_i$ in turn. This guarantees that $\phi(x^{(k)})$ is decreasing with $k$, and a little extra argument gives that Gauss-Seidel always converges for positive definite systems (though it is not guaranteed to converge very fast).
"""
238 μs
gauss_seidel (generic function with 2 methods)
function gauss_seidel(A, b, nsteps=50; monitor=(x, r)->nothing)
    x = zeros(length(b))
    L = LowerTriangular(A)
    for j = 1:nsteps
        r = b-A*x
        x += L\r
        monitor(x, r)   # Report the new iterate and the residual used to form it
    end
    x
end
2.0 ms
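
The cell below is an added check with a made-up SPD test problem (not part of the original notes) that $\phi(x) = \frac{1}{2} x^T A x - x^T b$ does not increase along the Gauss-Seidel iterates, as the coordinate descent interpretation predicts.

let
    Z = rand(12,8)
    A = Z'*Z + I                   # SPD test matrix
    b = rand(8)
    ϕ(x) = 0.5*dot(x, A*x) - dot(x, b)
    ϕs = []
    gauss_seidel(A, b, 20, monitor=(x,r)->push!(ϕs, ϕ(x)))
    maximum(diff(ϕs))              # Should be nonpositive (up to roundoff)
end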
test_gs (generic function with 1 method)
function test_gs()
    Z = rand(15,10)
    A = Z'*Z
    b = rand(10)
    xref = A\b
    errs = []
    gauss_seidel(A, b, 1000, monitor=(x,r)->push!(errs, norm(x-xref)))
    plot(errs, yscale=:log10, label="Errors")
end
1.3 ms
1.4 ms

Tridiagonal test problem

Consider a two-point ODE boundary value problem

$$-\frac{d^2 u}{dx^2}(x) = f(x), \quad u(0) = u(1) = 0.$$

We discretize the problem by meshing $[0,1]$ with $N+2$ evenly spaced points $x_j = jh$ for $j = 0, \ldots, N+1$, where $h=1/(N+1)$, and applying the second-order finite difference approximation

$$\frac{d^2 u}{dx^2} = \frac{u(x-h)-2u(x)+u(x+h)}{h^2} + O(h^2)$$

Let $u_j$ be our approximation for $u(x_j)$; then we have

$$-u_{j-1}+2u_j-u_{j+1} = h^2 f_j, \quad j = 1, \ldots, N$$

or, in matrix notation

$$Tu = h^2 f$$

where $T$ is the tridiagonal matrix with $2$ on the diagonal and $-1$ on the first super- and subdiagonals. The vector $u$ in this case corresponds to the unknowns $u_1, \ldots, u_N$; the values $u_0 = u_{N+1} = 0$ are given by the boundary conditions.

md"""
## Tridiagonal test problem

Consider a two-point ODE boundary value problem

$$-\frac{d^2 u}{dx^2}(x) = f(x), \quad u(0) = u(1) = 0.$$

We discretize the problem by meshing $[0,1]$ with $N+2$ evenly spaced points $x_j = jh$ for $j = 0, \ldots, N+1$, where $h=1/(N+1)$, and applying the second-order finite difference approximation

$$\frac{d^2 u}{dx^2} = \frac{u(x-h)-2u(x)+u(x+h)}{h^2} + O(h^2)$$

Let $u_j$ be our approximation for $u(x_j)$; then we have

$$-u_{j-1}+2u_j-u_{j+1} = h^2 f_j, \quad j = 1, \ldots, N$$

or, in matrix notation

$$Tu = h^2 f$$

where $T$ is the tridiagonal matrix with $2$ on the diagonal and $-1$ on the first super- and subdiagonals. The vector $u$ in this case corresponds to the unknowns $u_1, \ldots, u_N$; the values $u_0 = u_{N+1} = 0$ are given by the boundary conditions.
"""
308 μs
Tpoisson (generic function with 1 method)
Tpoisson(N) = SymTridiagonal(2.0*ones(N), -1.0*ones(N-1))
298 μs
10×10 SymTridiagonal{Float64, Vector{Float64}}:
  2.0  -1.0    ⋅     ⋅     ⋅     ⋅     ⋅     ⋅     ⋅     ⋅ 
 -1.0   2.0  -1.0    ⋅     ⋅     ⋅     ⋅     ⋅     ⋅     ⋅ 
   ⋅   -1.0   2.0  -1.0    ⋅     ⋅     ⋅     ⋅     ⋅     ⋅ 
   ⋅     ⋅   -1.0   2.0  -1.0    ⋅     ⋅     ⋅     ⋅     ⋅ 
   ⋅     ⋅     ⋅   -1.0   2.0  -1.0    ⋅     ⋅     ⋅     ⋅ 
   ⋅     ⋅     ⋅     ⋅   -1.0   2.0  -1.0    ⋅     ⋅     ⋅ 
   ⋅     ⋅     ⋅     ⋅     ⋅   -1.0   2.0  -1.0    ⋅     ⋅ 
   ⋅     ⋅     ⋅     ⋅     ⋅     ⋅   -1.0   2.0  -1.0    ⋅ 
   ⋅     ⋅     ⋅     ⋅     ⋅     ⋅     ⋅   -1.0   2.0  -1.0
   ⋅     ⋅     ⋅     ⋅     ⋅     ⋅     ⋅     ⋅   -1.0   2.0
16.8 μs

The case $f = 1$ is simple enough for us to solve by hand; the solution in this case is $u(x) = \frac{1}{2} x(1-x)$. Let's check that this discrete approximation gives us something reasonable.

md"""
The case $f = 1$ is simple enough for us to solve by hand; the solution in this case is $u(x) = \frac{1}{2} x(1-x)$. Let's check that this discrete approximation gives us something reasonable.
"""
444 μs
test_poisson_solve1 (generic function with 1 method)
function test_poisson_solve1(N)
    T = Tpoisson(N)
    f = ones(N)
    x = range(0.0, 1.0, length=N+2)
    h = 1.0/(N+1)

    u = zeros(N+2)       # Storage including end points
    u[2:N+1] = T\(h^2*f) # Solve for interior points

    plot(x, u, label="Numerical solution")
    plot!(x, x.*(1.0.-x)/2, label="Reference solution", style=:dash)
end
1.0 ms
41.0 ms
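
The constant-$f$ case is special: the exact solution is a quadratic, the centered second difference is exact for quadratics, and so the discrete solution matches the true solution at the mesh points up to roundoff. As an added check of the claimed $O(h^2)$ accuracy (a sketch, not part of the original notebook), we can use the manufactured solution $u(x) = \sin(\pi x)$, for which $f(x) = \pi^2 \sin(\pi x)$.

let
    errs = Float64[]
    for N in (10, 20, 40, 80)
        h = 1.0/(N+1)
        x = range(0.0, 1.0, length=N+2)
        f = π^2*sin.(π*x)
        u = zeros(N+2)
        u[2:N+1] = Tpoisson(N)\(h^2*f[2:N+1])
        push!(errs, maximum(abs.(u-sin.(π*x))))
    end
    errs[1:end-1]./errs[2:end]   # Ratios should approach 4 (second-order convergence)
end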

Jacobi and Gauss-Seidel sweeps

We have seen the splitting perspective on Jacobi and Gauss-Seidel. Another perspective on the same iterations is that we are sweeping over the equations and unknowns, using equation $j$ to find an updated value for $u_j$ assuming all the other entries are correct. In the Jacobi case, we compute the updated values using the old values at each step, i.e.

$$-u_{j-1}^{(k)} + 2u_j^{(k+1)} - u_{j+1}^{(k)} = h^2 f_j.$$

In the Gauss-Seidel case, we compute the updated values using the most recent versions of each variable, i.e.

$$-u_{j-1}^{(k+1)} + 2u_j^{(k+1)} - u_{j+1}^{(k)} = h^2 f_j.$$

md"""
## Jacobi and Gauss-Seidel sweeps

We have seen the splitting perspective on Jacobi and Gauss-Seidel. Another perspective on the same iterations is that we are sweeping over the equations and unknowns, using equation $j$ to find an updated value for $u_j$ assuming all the other entries are correct. In the Jacobi case, we compute the updated values using the old values at each step, i.e.

$$-u_{j-1}^{(k)} + 2u_j^{(k+1)} - u_{j+1}^{(k)} = h^2 f_j.$$

In the Gauss-Seidel case, we compute the updated values using the most recent versions of each variable, i.e.

$$-u_{j-1}^{(k+1)} + 2u_j^{(k+1)} - u_{j+1}^{(k)} = h^2 f_j.$$
"""
191 μs
Tjacobi (generic function with 2 methods)
function Tjacobi(f, nsteps=100; monitor=(u)->nothing)
    N = length(f)-2
    h = 1.0/(N+1)
    u = zeros(N+2)
    uold = zeros(N+2)
    for k = 1:nsteps
        u, uold = uold, u    # Previous iterate becomes uold; overwrite u below
        for j = 2:N+1
            # f is indexed like u (values at mesh points x_0, ..., x_{N+1})
            u[j] = (h^2*f[j] + uold[j-1] + uold[j+1])/2
        end
        monitor(u)
    end
    u
end
2.3 ms
Tgs (generic function with 2 methods)
function Tgs(f, nsteps=100; monitor=(u)->nothing)
    N = length(f)-2
    h = 1.0/(N+1)
    u = zeros(N+2)
    for k = 1:nsteps
        for j = 2:N+1
            # f is indexed like u (values at mesh points x_0, ..., x_{N+1})
            u[j] = (h^2*f[j] + u[j-1] + u[j+1])/2
        end
        monitor(u)
    end
    u
end
2.2 ms
test_poisson_solve2 (generic function with 1 method)
function test_poisson_solve2(N)
    T = Tpoisson(N)
    f = ones(N+2)
    x = range(0.0, 1.0, length=N+2)
    h = 1.0/(N+1)

    uref = zeros(N+2)              # Storage including end points
    uref[2:N+1] = T\(h^2*f[2:N+1]) # Solve for interior points

    errs_j = []
    errs_gs = []
    Tjacobi(f, monitor=(u)->push!(errs_j, norm(u-uref)))
    Tgs(f, monitor=(u)->push!(errs_gs, norm(u-uref)))
    plot(errs_j, yscale=:log10, label="Jacobi error")
    plot!(errs_gs, yscale=:log10, label="Gauss-Seidel error")
end
2.1 ms

For this model problem, Gauss-Seidel converges about twice as fast as Jacobi. But neither is all that fast as the number of points gets big.

md"""
For this model problem, Gauss-Seidel converges about twice as fast as Jacobi. But neither is all that fast as the number of points gets big.
"""
109 μs
9.1 ms
746 μs
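
To quantify the slowdown (an added sketch, not part of the original notes): the Jacobi iteration matrix for $T$ is $R = I - D^{-1} T = I - T/2$, whose eigenvalues are $\cos(k\pi h)$, so $\rho(R) = \cos(\pi h) \approx 1 - \pi^2 h^2/2$. The number of sweeps needed per digit of accuracy therefore grows like $O(N^2)$.

let
    for N in (10, 20, 40, 80)
        h = 1.0/(N+1)
        R = I - Matrix(Tpoisson(N))/2   # Jacobi iteration matrix for T
        println((N, spectral_radius(R), cos(π*h)))
    end
end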