md"""
# Notes for 2023-02-22

## Gram-Schmidt

The *Gram-Schmidt* procedure is usually the first method people
learn to convert some existing basis (columns of $A$) into an
orthonormal basis (columns of $Q$). For each column of $A$, the procedure
subtracts off any components in the direction of the previous columns,
and then scales the remainder to be unit length. In Julia, Gram-Schmidt looks
something like this:
"""
using LinearAlgebra
function orth_cgs0(A)
    m, n = size(A)
    Q = zeros(m, n)
    for j = 1:n
        v = A[:,j]                        # Take the jth original basis vector
        v = v-Q[:,1:j-1]*(Q[:,1:j-1]'*v)  # Orthogonalize vs q_1, ..., q_{j-1}
        v = v/norm(v)                     # Normalize what remains
        Q[:,j] = v                        # Add result to Q basis
    end
    Q
end
function test_orth_cgs0()
    A = rand(10,5)
    Q = orth_cgs0(A)

    # Check orthonormality of Q columns, and that Q gives the right projector
    norm(Q'*Q-I), norm(A-Q*(Q'*A))/norm(A)
end

md"""
Where does $R$ appear in this algorithm? As with the lower triangle in the LU factorization, the $R$ factor in QR appears implicitly in the orthogonalization and normalization steps.
"""
function orth_cgs(A)
    m, n = size(A)
    Q = zeros(m, n)
    R = zeros(n, n)
    for j = 1:n
        v = A[:,j]                    # Take the jth original basis vector
        R[1:j-1,j] = Q[:,1:j-1]'*v    # Project onto q_1, ..., q_{j-1}
        v = v-Q[:,1:j-1]*R[1:j-1,j]   # Orthogonalize vs q_1, ..., q_{j-1}
        R[j,j] = norm(v)              # Compute normalization constant
        v = v/R[j,j]                  # Normalize what remains
        Q[:,j] = v                    # Add result to Q basis
    end
    Q, R
end
function test_orth_cgs()
    A = rand(10,5)
    Q, R = orth_cgs(A)

    # Check orthonormality of Q columns,
    # upper triangularity of R,
    # and that A = QR
    norm(Q'*Q-I), norm(tril(R,-1)), norm(A-Q*R)/norm(A)
end

md"""
Testing with random matrices is a particularly gentle way to test our code, and doesn't reveal a lurking numerical instability in the CGS algorithm. The problem is that if $A$ is even somewhat ill conditioned (i.e. the basis vectors in $A$ are nearly co-linear) then the $Q$ computed by the classical Gram-Schmidt algorithm may stop being nearly orthogonal, even if the relationship $A = QR$ is well maintained.
"""
function test_orth_cgs_poor()
    A = rand(10,5)*Diagonal([1, 1e-2, 1e-4, 1e-6, 1e-8])*rand(5,5)
    Q, R = orth_cgs(A)

    # Check orthonormality of Q columns,
    # upper triangularity of R,
    # and that A = QR
    norm(Q'*Q-I), norm(tril(R,-1)), norm(A-Q*R)/norm(A)
end

md"""
This is somewhat ameliorated by the *modified* Gram-Schmidt algorithm, which computes the multipliers in $R$ in a slightly different way: subtracting off the projections of the residual onto each previously computed column of $k$ immediately rather than computing all the projections from the original vector. This is a little less prone to cancellation. We can see that numerical orthogonality with MGS is much better than with CGS, though there is still some loss.
"""
function orth_mgs(A)
    m, n = size(A)
    Q = zeros(m, n)
    R = zeros(n, n)
    for j = 1:n
        v = A[:,j]                # Take the jth original basis vector
        for k = 1:j-1
            R[k,j] = Q[:,k]'*v    # Project what remains onto q_k
            v -= Q[:,k]*R[k,j]    # Subtract off the q_k component
        end
        R[j,j] = norm(v)          # Compute normalization constant
        v = v/R[j,j]              # Normalize what remains
        Q[:,j] = v                # Add result to Q basis
    end
    Q, R
end
function test_orth_mgs_poor()
    A = rand(10,5)*Diagonal([1, 1e-2, 1e-4, 1e-6, 1e-8])*rand(5,5)
    Q, R = orth_mgs(A)

    # Check orthonormality of Q columns,
    # upper triangularity of R,
    # and that A = QR
    norm(Q'*Q-I), norm(tril(R,-1)), norm(A-Q*R)/norm(A)
end
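md"""
As a quick side-by-side check (the exact numbers vary from run to run, and each test draws its own random ill-conditioned matrix), we can call both tests. The first entry of each tuple is $\|Q^TQ - I\|$, which is typically far smaller for MGS than for CGS on these ill-conditioned inputs.
"""
test_orth_cgs_poor(), test_orth_mgs_poor()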

md"""
## Householder transformations

A *Householder transformation* is a simple orthogonal transformation associated with reflection across a plane with some unit normal $v$:

$$H = I-2vv^T$$

We typically choose the transformation so that we can map a target vector (say $x$) to something parallel to the first coordinate axis (so $\pm \|x\| e_1$). In some cases, it is convenient to use a *non-unit* normal $w$ and introduce an associated scaling constant:

$$H = I - \tau ww^T$$

This lets us normalize $w$ in other ways (e.g. by putting a 1 in the first component).
"""
function householder(x)
    normx = norm(x)
    s = -sign(x[1])       # Choose sign to avoid cancellation in u1
    u1 = x[1]-s*normx     # First entry of x - s*normx*e1
    w = x/u1              # Trailing entries agree with (x - s*normx*e1)/u1
    w[1] = 1.0            # Overwrite first entry so w = (x - s*normx*e1)/u1
    τ = -s*u1/normx       # Scaling factor; equals 2/norm(w)^2
    τ, w
end
function test_householder()
    x = rand(10)
    τ, w = householder(x)

    # Check orthogonality and the mapping of x
    norm((I-τ*w*w')^2-I), norm((x-τ*w*(w'*x))[2:end])/norm(x)
end

md"""
Now we can think about QR in the same way we think about LU. The only difference is that we apply a sequence of Householder transformations (rather than Gauss transformations) to introduce the subdiagonal zeros. As with the LU factorization, we can also reuse the storage of A by recognizing that the number of nontrivial parameters in the vector $w$ at each step is the same as the number of zeros produced by that transformation. This gives us the following.
"""
function hqr!(A)
    m, n = size(A)
    τ = zeros(n)

    for j = 1:n

        # Find H = I-τ*w*w' to zero out A[j+1:end,j]
        normx = norm(A[j:end,j])
        s = -sign(A[j,j])
        u1 = A[j,j] - s*normx
        w = A[j:end,j]/u1
        w[1] = 1.0
        A[j+1:end,j] = w[2:end]   # Save trailing part of w
        A[j,j] = s*normx          # Diagonal element of R
        τ[j] = -s*u1/normx        # Save scaling factor

        # Update trailing submatrix by multiplying by H
        A[j:end,j+1:end] -= τ[j]*w*(w'*A[j:end,j+1:end])

    end

    A, τ
end

md"""
If we need $Q$ or $Q^T$ explicitly, we can always form it from the compressed representation. We can also multiply by $Q$ and $Q^T$ implicitly.
"""
function applyQ!(QR, τ, X)
    m, n = size(QR)
    for j = n:-1:1
        w = [1.0; QR[j+1:end,j]]
        X[j:end,:] -= τ[j]*w*(w'*X[j:end,:])
    end
    X
end
function applyQT!(QR, τ, X)
    m, n = size(QR)
    for j = 1:n
        w = [1.0; QR[j+1:end,j]]
        X[j:end,:] -= τ[j]*w*(w'*X[j:end,:])
    end
    X
end
applyQ(QR, τ, X) = applyQ!(QR, τ, copy(X))
applyQT(QR, τ, X) = applyQT!(QR, τ, copy(X))

md"""
To compute a dense representation of $Q$, we can always apply $Q$ to the columns of the identity.
"""
formQ(QR, τ) = applyQ!(QR, τ, Matrix(1.0I, size(QR)[1], size(QR)[1]))

md"""
This gives us all the ingredients to form a dense QR decomposition of $A$.
"""
function hqr(A)
    QR, τ = hqr!(copy(A))
    formQ(QR, τ), triu(QR)
end
function test_hqr()
    A = rand(10,5)
    Q, R = hqr(A)
    norm(Q'*Q-I), norm(A-Q*R)/norm(A)
end

md"""
However, we don't need the dense $Q$ and $R$ factors to solve a least squares problem; we just need the ability to multiply by $Q^T$ and solve with $R$.
"""
function hqr_solve_ls(QR, τ, b)
    m, n = size(QR)
    UpperTriangular(QR[1:n,1:n])\(applyQT!(QR, τ, copy(b))[1:n])
end
function test_hqr_solve_ls()
    A = rand(10,5)
    b = rand(10)

    QR, τ = hqr!(copy(A))
    x = hqr_solve_ls(QR, τ, b)

    xref = A\b
    norm(x-xref)/norm(xref)
end
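md"""
Calling the test should report a relative error on the order of machine precision.
"""
test_hqr_solve_ls()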