md"""
# HW 3 for CS 4220

You may (and probably should) talk about problems with each other, with the TAs, and with me, providing attribution for any good ideas you might get. Your final write-up should be your own.
"""
using LinearAlgebra
using Plots
using QuadGK

md"""
#### 1. Function fitting

Let $x_1, \ldots, x_m$ be sample points in $[-1, 1]$ and let $y_1, \ldots, y_m$ be the associated real-valued responses. Consider the least squares problem over $p \in \mathcal{P}_d$

$$\mbox{minimize } \sum_{i=1}^m (p(x_i)-y_i)^2$$

where $m \geq d+1$. We consider solving this problem with two different bases for $\mathcal{P}_d$, writing

$$p(x) = \sum_{j=0}^d b_j x^j$$

or

$$p(x) = \sum_{j=0}^d c_j T_j(x)$$

where the functions $T_j(x)$ are the Chebyshev polynomials. This gives rise to least squares problems of minimizing $\|Ab-y\|^2$ or $\|Bc-y\|^2$, where column $j+1$ of $A$ contains the values $x_i^j$ and column $j+1$ of $B$ contains the values $T_j(x_i)$. We can compute these matrices via the following functions:
"""

# Form the least squares matrix for the power basis: A[i,j+1] = x[i]^j
function form_ls_power(x, d)
    A = zeros(length(x), d+1)
    for j = 0:d
        A[:,j+1] = x.^j
    end
    A
end

# Form the least squares matrix for the Chebyshev basis: B[i,j+1] = T_j(x[i]),
# built with the recurrence T_j(x) = 2x T_{j-1}(x) - T_{j-2}(x)
function form_ls_cheb(x, d)
    B = zeros(length(x), d+1)
    B[:,1] .= 1.0
    if d >= 1 B[:,2] .= x end
    for j = 3:d+1
        B[:,j] = 2*x .* B[:,j-1] - B[:,j-2]
    end
    B
end
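
md"""
A quick spot check of `form_ls_cheb` (a minimal sketch; the grid of test points is an arbitrary choice) compares one column against the identity $T_j(\cos\theta) = \cos(j\theta)$:
"""

let
    # Column j+1 of B holds T_j at the sample points; compare T_3 against cos(3*acos(x))
    x = collect(range(-1.0, 1.0, length=11))
    B = form_ls_cheb(x, 4)
    norm(B[:,4] - cos.(3*acos.(x)))   # should be near machine precision
end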

md"""
It is also useful to be able to evaluate a polynomial in either basis at a single point or at a vector of points. With careful use of broadcasting operations, the same Julia code can cover both cases.
"""

# Evaluate the power-basis polynomial p(x) = sum_j b[j+1] x^j by Horner's rule;
# works for a scalar x or a vector of points via broadcasting
function eval_poly_power(x, b)
    px = 0.0
    for j = length(b):-1:1
        px = x.*px .+ b[j]
    end
    px
end

# Evaluate the Chebyshev-basis polynomial p(x) = sum_j c[j+1] T_j(x) via the
# three-term recurrence; works for a scalar x or a vector of points
function eval_poly_cheb(x, c)
    p = 0*x .+ c[1]
    if length(c) > 1 p += c[2]*x end
    Tm2 = 0.0*x .+ 1.0
    Tm1 = x
    for j = 3:length(c)
        Tj = 2*x.*Tm1 - Tm2
        p += c[j]*Tj
        Tm1, Tm2 = Tj, Tm1
    end
    p
end
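
md"""
As a sanity check, evaluating a polynomial at the sample points should match multiplying the corresponding least squares matrix by the coefficient vector. The cell below is a minimal sketch of that check (the degree, number of points, and coefficients are arbitrary):
"""

let
    # eval_poly_* at the sample points should agree with form_ls_* times the coefficients
    x = 2.0*rand(10) .- 1.0
    d = 3
    b = randn(d+1)
    c = randn(d+1)
    err_power = norm(eval_poly_power(x, b) - form_ls_power(x, d)*b)
    err_cheb = norm(eval_poly_cheb(x, c) - form_ls_cheb(x, d)*c)
    err_power, err_cheb   # both should be near machine precision
end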

md"""
As an example, consider the case where $y_i = 1.0 + \eta_i$, where $\eta_i$ are independent noise terms with mean zero and standard deviation of $\sigma$ (which we will set to $10^{-4}$). Ideally, we should have $p$ evaluate to the same thing with both fitting processes (they represent the same polynomial, up to roundoff effects), and that thing should be close to 1.

Note that if $p(x) = \sum_{j=0}^d w_j c_j$, $w_j = T_j(x)$, then for our example we expect $p(x)-1 = w^T B^\dagger \eta$ to be a normal random variable with mean zero and standard deviation $\|w^T B^\dagger\| \sigma$. We can write this in Julia as `w' / B`.
"""

let
    m = 6
    d = 4
    x = 2.0*rand(m) .- 1.0
    y = 1.0 .+ 1e-4 * randn(m)

    # Form the least squares matrices for the two bases
    A = form_ls_power(x, d)
    B = form_ls_cheb(x, d)

    # Solve with power basis
    b = A\y

    # Solve with Chebyshev basis, including variance
    F = qr(B)
    c = F\y
    WT = form_ls_cheb([0.0; 1.0], d)
    var = [norm(vi) for vi in eachrow(WT/F.R)]

    # Compare
    p1 = eval_poly_power([0.0; 1.0], b)
    p2 = eval_poly_cheb([0.0; 1.0], c)
    p1.-1.0, p2.-1.0, abs.(p1-p2), var*1e-4
end
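
md"""
To see the variance claim in action, the cell below is a small Monte Carlo sketch (the problem sizes and trial count are arbitrary choices): it refits the Chebyshev model for many independent noise draws and compares the empirical spread of $p(0)-1$ against the predicted $\|w^T B^\dagger\| \sigma$.
"""

let
    # Monte Carlo sketch: the spread of p(0)-1 over many noise draws should be
    # close to the predicted norm(w'/B)*σ, where w holds T_j(0) for j = 0:d
    m, d, σ = 6, 4, 1e-4
    x = 2.0*rand(m) .- 1.0
    B = form_ls_cheb(x, d)
    w = vec(form_ls_cheb([0.0], d))
    predicted = norm(w'/B)*σ
    samples = [eval_poly_cheb(0.0, B\(1.0 .+ σ*randn(m))) - 1.0 for _ = 1:10000]
    observed = sqrt(sum(abs2, samples)/length(samples))   # RMS ≈ std (mean is zero)
    predicted, observed
end
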
md"""
##### Questions

- Change $m = 6, d = 4$ to $m = 60, d = 40$. What do you observe? Try re-running a few times to vary the random parameters.
- What are the condition numbers of $A$ and $B$ for the case of $m = 60$ and $d = 40$? You may use the Julia `cond` function for computing this. How does this explain what you see about the differences between using the two bases?
- In our code, we used `F.R` rather than `F` in the variance computation. Explain why this ends up being equivalent, at least in terms of the values.
"""

md"""
#### 2. Normalized Legendre

The normalized [Legendre polynomials](https://en.wikipedia.org/wiki/Legendre_polynomials) $q_0, q_1, \ldots$ are a family of orthonormal polynomials of increasing degree ($q_d$ has degree $d$) with respect to the $\mathcal{L}^2$ inner product on $[-1, 1]$, i.e.

$$\int_{-1}^1 q_i(x) q_j(x) \, dx = \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & \mbox{ otherwise.} \end{cases}$$

If the matrix $M \in \mathbb{R}^{(d+1) \times (d+1)}$ (with rows and columns indexed from $0$ to $d$) has entries

$$M_{ij} = \int_{-1}^1 x^i x^j \, dx
= \begin{cases} 2/(i+j+1), & i+j \mbox{ even} \\ 0, & i+j \mbox{ odd} \end{cases}$$

then the Cholesky factorization $R^T R = M$ gives the coefficients for $q_0, \ldots, q_d$ in the power basis: if $U = R^{-1}$, then

$$q_j(x) = \sum_{i=0}^j x^i u_{ij}.$$

Explain why this construction works -- that is, why is it that these polynomials are orthonormal in $\mathcal{L}^2$ on $[-1, 1]$?

Hint: This is easiest if we write $Q(x) = P(x) R^{-1}$, where

$$\begin{align*}
Q(x) &= \begin{bmatrix} q_0(x) & \ldots & q_d(x) \end{bmatrix} \\
P(x) &= \begin{bmatrix} x^0 & \ldots & x^d \end{bmatrix}
\end{align*}$$
"""
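
md"""
Though it does not replace the explanation asked for above, the construction can be checked numerically with `quadgk` (from the QuadGK package loaded at the top). The sketch below (with an arbitrary small $d$) builds $M$, inverts its Cholesky factor, and integrates the products $q_i q_j$ on $[-1, 1]$.
"""

let
    # Numerical sanity check: the q_j built from the Cholesky factor of M
    # should be (nearly) orthonormal in L^2 on [-1, 1]
    d = 5
    M = [iseven(i+j) ? 2.0/(i+j+1) : 0.0 for i = 0:d, j = 0:d]
    U = inv(cholesky(M).U)    # column j+1 holds the power-basis coefficients of q_j
    q(j, x) = sum(U[i+1, j+1]*x^i for i = 0:d)
    G = [quadgk(x -> q(i, x)*q(j, x), -1.0, 1.0)[1] for i = 0:d, j = 0:d]
    norm(G - I)               # should be near zero
end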

md"""
#### 3. Chopping Cholesky

Suppose $A$ is a positive definite matrix with Cholesky factorization $A = R^T R$. Then the matrix $A_{2:n,2:n}$ in which we remove the first row and column can be written as $A_{2:n,2:n} = \tilde{U}^T \tilde{U}$ where $\tilde{U} = R_{:,2:n}$ is the $n$-by-$(n-1)$ matrix consisting of columns $2$ through $n$ of $R$. The nonzero structure of $\tilde{U}$ is

$$\begin{bmatrix}
\times & \times & \times & \ldots & \times \\
\times & \times & \times & \ldots & \times \\
0 & \times & \times & \ldots & \times \\
0 & 0 & \times & \ldots & \times \\
0 & 0 & 0 & \ddots & \vdots \\
0 & 0 & 0 & \ldots & \times
\end{bmatrix}$$

If we apply an appropriate Householder reflector to the first two rows of $\tilde{U}$, we zero out the $(2,1)$ entry, giving the nonzero structure

$$\begin{bmatrix}
* & * & * & \ldots & * \\
0 & * & * & \ldots & * \\
0 & \times & \times & \ldots & \times \\
0 & 0 & \times & \ldots & \times \\
0 & 0 & 0 & \ddots & \vdots \\
0 & 0 & 0 & \ldots & \times
\end{bmatrix}$$

Continuing in this fashion, we can apply a total of $n-1$ Householder reflectors, each affecting two rows, to reduce $\tilde{U}$ to upper triangular form $U$, i.e. we (implicitly) compute the QR factorization $\tilde{U} = QU$ in $O(n^2)$ time.

Complete the implementation of this method below.
"""

# Compute Householder reflector vector v s.t.
# Hx = norm(x) e_1, where H = I-2*v*v', norm(v) = 1
function house(x)
    v = copy(x)
    v[1] -= norm(v)
    v /= norm(v)
    v
end

# Sanity check Householder reflector
let
    z = rand(5)
    v = house(z)
    z - 2*v*(v'*z)    # should be ≈ norm(z)*e_1, i.e. entries 2:5 near zero
end

function chop_chol(R)
    # TODO: Replace the naive computation with the algorithm described above
    F = qr(R[:,2:end])
    F.R
end

# Test: U should reproduce the trailing submatrix of A and be upper triangular
let
    A = rand(10,10)
    A = A'*A
    R = cholesky(A).U
    U = chop_chol(R)
    norm(A[2:end,2:end]-U'*U)/norm(A[2:end,2:end]), norm(U-UpperTriangular(U))/norm(U)
end