Proj 1: Approximation with RBFs
A classic function approximation scheme used for interpolating data at scattered points involves the use of a radial basis function (RBF), typically denoted by $\phi$. For a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$, we approximate $f$ by
$$s(x) = \sum_{i=1}^n \phi(\|x-x_i\|_2) c_i$$
where the points $\{x_i\}_{i=1}^n$ are known as centers – we will assume for the purpose of this assignment that these are all distinct. We sometimes write this more concisely as
$$s(x) = \Phi_{xX} c$$
where $\Phi_{xX}$ is a row vector with entries $(\Phi_{xX})_j = \phi(\|x-x_j\|_2)$.
The coefficient vector $c$ may be chosen by interpolating at the centers; that is, we write $s(x_i) = f(x_i),$ or more compactly
$$\Phi_{XX} c = f_X$$
where $\Phi_{XX} \in \mathbb{R}^{n \times n}$ is the matrix with entries $(\Phi_{XX})_{ij} = \phi(\|x_i-x_j\|)$ and $f_X \in \mathbb{R}^n$ is the vector with entries $(f_X)_i = f(x_i)$. When $\phi$ is a positive definite RBF, the matrix $\Phi_{XX}$ is guaranteed to be positive definite.
There are many reasons to like RBF approximations. There is a great deal of theory associated with them, both from a classic approximation theory perspective and from a statistics perspective (where $\phi$ is associated with the covariance of a Gaussian process). But for this project, we also like RBF approximations because they naturally give rise to many different types of numerical linear algebra problems associated with solving linear systems! We will explore some of these in the current project.
Logistics
You should complete tasks 1 and 2 by Friday, Feb 17; tasks 3-5 are due by Friday, Feb 24.
You are encouraged to work in pairs on this project. I particularly encourage you to try pair-thinking and pair-programming – you learn less if you just try to partition the problems! You should produce a short report addressing the analysis tasks, and a few short codes that address the computational tasks. You may use any Julia functions you might want.
You are allowed (even encouraged!) to read outside resources that talk about these types of computations – including my own notes from my "Numerical Methods for Data Science" course. It is possible that you'll find references that tell you outright how to solve a subproblem; if so, feel free to take advantage of them with citation! You may well end up doing more work to find and understand the relevant resources than you would doing it yourself from scratch, but you will learn interesting things along the way.
Most of the code in this project will be short, but that does not make it easy. You should be able to convince both me and your partner that your code is right. A good way to do this is to test thoroughly. Check residuals, compare cheaper or more expensive ways of computing the same thing, and generally use the computer to make sure you don't commit silly errors in algebra or coding. You will also want to make sure that you satisfy the efficiency constraints stated in the tasks.
Code setup
We will be using several built-in packages for this project. We'll use the `LinearAlgebra` and `Plots` packages in almost all of our computational homeworks and projects in this course, but here we also use the `SparseArrays` and `SuiteSparse` packages for dealing with sparse linear algebra. We also use the `SpecialFunctions` package and `QuadGK` for one of the tasks where we use kernel approximations to estimate integrals.
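Concretely, the setup amounts to one cell of imports:

```julia
using LinearAlgebra, SparseArrays, SuiteSparse
using Plots
using SpecialFunctions, QuadGK
```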
A bestiary of basis functions
We consider several possible radial basis functions, outlined below:
- The squared exponential basis function (`ϕ_se`) is the default choice used in many machine learning applications. It is infinitely differentiable, and decays quickly away from the origin.
- The inverse multiquadric function (`ϕ_imq`) is also infinitely differentiable, but decays much more slowly away from the origin.
- The Matérn family of functions is frequently used in geospatial statistics. These functions have a smoothness parameter; the Matérn 1/2 function (`ϕ_mat12`) is continuous but not differentiable, Matérn 3/2 (`ϕ_mat32`) is continuously differentiable but not twice differentiable, and Matérn 5/2 (`ϕ_mat52`) is twice continuously differentiable but not three times differentiable.
- The Wendland family of functions is compactly supported: that means it is exactly zero far enough away from the origin. The definition of these functions depends on the dimension $d$ of the underlying space in which we are doing interpolation. Here we define the 2D version of this function that is twice differentiable (`ϕ_w21`).
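The notebook cells defining these functions are summarized by their outputs below. For reference, here is a sketch of one standard set of definitions consistent with the descriptions above (the exact conventions, e.g. $\exp(-r^2)$ versus $\exp(-r^2/2)$ for the squared exponential, are assumptions on our part; we follow the $\exp(-r^2)$ convention used in the length-scale discussion later on):

```julia
# Sketch of standard definitions (conventions assumed, not the notebook's
# verbatim code). All are normalized so that ϕ(0) = 1.
ϕ_se(r)    = exp(-r^2)                                   # squared exponential
ϕ_imq(r)   = 1/sqrt(1 + r^2)                             # inverse multiquadric
ϕ_mat12(r) = exp(-r)                                     # Matérn 1/2
ϕ_mat32(r) = (1 + sqrt(3)*r)*exp(-sqrt(3)*r)             # Matérn 3/2
ϕ_mat52(r) = (1 + sqrt(5)*r + 5*r^2/3)*exp(-sqrt(5)*r)   # Matérn 5/2
ϕ_w21(r)   = r < 1 ? (1-r)^4*(4r+1) : 0.0                # Wendland (d=2, k=1)
```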
ϕ_w21 (generic function with 1 method)
Derivatives
For some of our computations later, it will also be useful to have the derivatives of the various radial basis functions we work with.
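Matching the sketched definitions above (again, the conventions are our assumption), the derivatives would be:

```julia
# Derivatives of the sketched definitions above. For the Wendland 2,1
# function, d/dr[(1-r)^4 (4r+1)] = -20r(1-r)^3 on [0,1].
dϕ_se(r)    = -2r*exp(-r^2)
dϕ_imq(r)   = -r*(1 + r^2)^(-3/2)
dϕ_mat12(r) = -exp(-r)
dϕ_mat32(r) = -3r*exp(-sqrt(3)*r)
dϕ_mat52(r) = -(5r/3)*(1 + sqrt(5)*r)*exp(-sqrt(5)*r)
dϕ_w21(r)   = r < 1 ? -20r*(1-r)^3 : 0.0
```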
dϕ_w21 (generic function with 1 method)
I am as prone to calculus errors and coding typos as anyone, so when I code derivatives, I also like to include a finite difference check; that is, I compare my supposedly-exact derivative computation to the approximation
$$f'(x) \approx \frac{f(x+h)-f(x-h)}{2h},$$
which has an absolute error bounded by $h^2 \max_{|z-x| \leq h} |f'''(z)|/6$ (which we usually write as $O(h^2)$).
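Such a check can be as simple as the following sketch (the test point and step size here are our choices, not the notebook's):

```julia
# Relative error between a hand-coded derivative dϕ and a central difference.
test_dϕ(ϕ, dϕ; x=0.479, h=1e-4) =
    abs(dϕ(x) - (ϕ(x+h) - ϕ(x-h))/(2h))/abs(dϕ(x))
```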
| RBF type | Relerr |
|---|---|
| SE | 9.899196270977514e-9 |
| IMQ | 1.4411370482301693e-8 |
| Matérn 1/2 | 1.6664242225614643e-9 |
| Matérn 3/2 | 4.1939174510743164e-8 |
| Matérn 5/2 | 1.780860813958562e-8 |
| Wendland 2,1 | 7.970209367545122e-8 |
There are a few other things worth noting about our test code, as we get used to Julia programming:
- We put the entire tester inside a `let` block, which defines a local scope for variable names. The function `test_dϕ` that we defined here cannot be seen outside this block.
- Our tester `test_dϕ` takes functions as input arguments.
- The last expression in the `let` block is the return value, which in this case is a Markdown cell. The Julia Markdown support includes support for rendering tables, which we have done here. We have also used the `Markdown.parse` function directly rather than the `md` macro, making it a little easier to program the output in this case. Note that `*` on strings represents concatenation in Julia.
Length scales
We often define radial basis functions to have a length scale $l$, i.e.
$$\phi(r) = \phi_0(r/l).$$
We left out the length scale in the initial definitions of our RBFs, but we add it in now.
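A sketch of how such a rescaling wrapper might look (the two-method interface is our guess from the output below; note that the derivative picks up a $1/l$ factor from the chain rule):

```julia
# Rescale ϕ0 (and optionally its derivative dϕ0) to length scale l.
scale_rbf(ϕ0, l) = r -> ϕ0(r/l)
scale_rbf(ϕ0, dϕ0, l) = (r -> ϕ0(r/l), r -> dϕ0(r/l)/l)  # d/dr ϕ0(r/l) = ϕ0'(r/l)/l
```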
scale_rbf (generic function with 2 methods)
RBF matrices and vectors
To compute with our interpolating functions, we need to manipulate matrices $\Phi_{XY}$ with entries $\phi(\|x_i-y_j\|)$. We represent collections of points as matrices $X = \begin{bmatrix} x_1 & \ldots & x_n \end{bmatrix} \in \mathbb{R}^{d \times n}$. As is usual in Julia, we use an exclamation mark as part of the name for functions that mutate their inputs.
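In sketch form (with the interface assumed from the naming conventions above), the formation routines look like:

```julia
using LinearAlgebra

# Fill ΦXY[i,j] = ϕ(‖xᵢ - yⱼ‖), where points are columns of X and Y.
function form_ΦXY!(ΦXY, ϕ, X :: AbstractMatrix, Y :: AbstractMatrix)
    for j in axes(Y, 2), i in axes(X, 2)
        ΦXY[i, j] = ϕ(norm(X[:, i] - Y[:, j]))
    end
    ΦXY
end

form_ΦXY(ϕ, X :: AbstractMatrix, Y :: AbstractMatrix) =
    form_ΦXY!(zeros(size(X, 2), size(Y, 2)), ϕ, X, Y)
```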
form_ΦXY! (generic function with 1 method)
form_ΦXY (generic function with 1 method)
We decorated the definitions in the previous block with `AbstractVector` and `AbstractMatrix` types in order to also have a specialized 1D version (where we use numbers to represent individual points, and vectors for collections of points).
form_ΦXY! (generic function with 2 methods)
form_ΦXY (generic function with 2 methods)
RBF matrix factorization
We create an `RBFCenters` structure to keep track of the locations of the centers in our approximation scheme along with the associated factorization of the RBF matrix. We are going to want to allow the interpolation of multiple functions for the same set of centers (for example), so we keep the representation of the centers and the factorization of $\Phi_{XX}$ separate from the coefficient vector for interpolating any specific function.
The implementation is slightly complicated by us typically keeping some extra storage around so that we can add new centers without reallocating storage.
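To make the discussion concrete, here is a minimal sketch of such a container and the reference center-adding routine (the field names are our own, and we skip the extra-storage bookkeeping entirely):

```julia
using LinearAlgebra

# Minimal sketch of an RBFCenters-style container (names assumed).
mutable struct RBFCentersSketch
    ϕ                     # radial basis function
    X :: Matrix{Float64}  # d-by-n matrix of centers (columns)
    F :: Cholesky         # factorization of ΦXX
end

# Reference version: append the new centers, then refactor from scratch,
# at O((n+k)^3) cost.
function add_centers_ref!(rc :: RBFCentersSketch, Xnew :: AbstractMatrix)
    rc.X = [rc.X Xnew]
    ΦXX = [rc.ϕ(norm(xi-xj)) for xi in eachcol(rc.X), xj in eachcol(rc.X)]
    rc.F = cholesky(Symmetric(ΦXX))
    rc
end
```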
add_centers_ref! (generic function with 1 method)
Once we have a factorization of the matrix $\Phi_{XX}$ through $n$ centers (which takes $O(n^3)$ time to compute), we can use that factorization to solve $\Phi_{XX} c = f_X$ in $O(n^2)$ time. We can then evaluate $s(x) = \Phi_{xX} c$ at a new point in $O(n)$ time. The `solve` and `feval` routines compute $c$ and evaluate $s(x)$, respectively. Note that the use of iterators means that we never actually form $\Phi_{xX}$ in memory – we just compute with it element by element.
We sometimes also want to compute gradients $\nabla s(x)$, which we can do with the `dfeval` function. If the implementation of `dfeval` seems slightly cryptic, note that the chain rule gives us
$$\nabla_x \phi(\|x-x_j\|_2) = \phi'(\|x-x_j\|_2) \frac{x-x_j}{\|x-x_j\|}$$
and we have to do something special when $x = x_j$ (the answer there is zero when differentiation makes sense at all). Therefore, by linearity of differentiation,
$$\nabla_x s(x) = \sum_{j=1}^n \phi'(\|x-x_j\|_2) \frac{x-x_j}{\|x-x_j\|} c_j$$
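In sketch form (with the center columns `X`, coefficients `c`, and RBF derivative `dϕ` assumed as inputs), the gradient evaluation looks like:

```julia
using LinearAlgebra

# Gradient of s at x; the x == xⱼ term is skipped (zero limit, as noted above).
function dfeval_sketch(dϕ, X, c, x)
    g = zeros(length(x))
    for j in axes(X, 2)
        dx = x - X[:, j]
        r = norm(dx)
        r > 0 && (g .+= (dϕ(r)*c[j]/r) .* dx)
    end
    g
end
```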
dfeval (generic function with 1 method)
Hello world!
Let's use our machinery now for a toy task of 1D interpolation of $\cos(x)$ on the interval $[0, 2\pi]$. We will use an equi-spaced sampling mesh, with the squared exponential RBF.
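The notebook's cell is not shown here, but a standalone version might look like the following (the mesh size and length scale are our choices):

```julia
using LinearAlgebra, Plots

let
    ϕ(r) = exp(-r^2)                          # squared exponential (assumed form)
    l = 1.0                                   # length scale (assumed)
    xc = range(0, 2π, length=20)              # equi-spaced centers
    ΦXX = [ϕ(abs(xi-xj)/l) for xi in xc, xj in xc]
    c = cholesky(Symmetric(ΦXX)) \ cos.(xc)   # interpolate at the centers
    s(x) = sum(ϕ(abs(x-xj)/l)*cj for (xj, cj) in zip(xc, c))
    xs = range(0, 2π, length=201)
    plot(xs, [cos.(xs) s.(xs)], label=["cos(x)" "s(x)"])
end
```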
Task 1: Fast mean temperatures
So far, we have discussed approximating one function at many points. Sometimes, we would like to quickly approximate something about many functions. For example, in 1D, suppose we have streaming temperature measurements $\theta_j(t)$ taken at $n$ fixed points $x_j \in [0, 1]$, from which we could estimate an instantaneous temperature field
$$\hat{\theta}(x, t) = \sum_{j=1}^n \phi(\|x-x_j\|) c_j$$
by the interpolation condition $\hat{\theta}(x_i, t) = \theta_i(t)$. Assuming we have a factorization for $\Phi_{XX}$, we can compute the estimated mean temperature
$$\tilde{\theta}(t) = \int_0^1 \hat{\theta}(x, t) \, dx = \sum_{j=1}^n \left( \int_0^1 \phi(\|x-x_j\|) \, dx \right) c_j$$
for any given $t$ by solving for the $c$ vector ($O(n^2)$ time) and taking the appropriate linear combination of integrals ($O(n)$ time). Here it is worth noting that
$$\int_0^1 \phi(\|x-x_j\|) \, dx = \int_0^{x_j} \phi(s) \, ds + \int_{0}^{1-x_j} \phi(s) \, ds$$
and for $\phi(r) = \exp(-r^2/l^2)$, we have
$$\int_0^x \phi(s) \, ds = \frac{l \sqrt{\pi}}{2} \operatorname{erf}(x/l),$$
where the error function $\operatorname{erf}(x)$ is implemented as `erf(x)` in the Julia `SpecialFunctions` library. We implement this scheme in the following code, which runs in $O(n^3) + O(n^2 m)$ time with $n$ sensors and $m$ measurements.
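A sketch of this scheme (the interface is assumed: sensor locations `xs` in $[0,1]$, an $n \times m$ measurement matrix `Θ` with one column per time, and length scale `l`):

```julia
using LinearAlgebra, SpecialFunctions

# O(n³) setup + O(n²m) processing: factor once, then solve for c per sample.
function mean_temperatures_sketch(xs, Θ, l)
    ΦXX = Symmetric([exp(-(xi-xj)^2/l^2) for xi in xs, xj in xs])
    F = cholesky(ΦXX)                                            # O(n³), once
    w = [l*sqrt(π)/2*(erf(xj/l) + erf((1-xj)/l)) for xj in xs]   # ∫₀¹ ϕ(|x-xⱼ|) dx
    [dot(w, F \ Θ[:, t]) for t in axes(Θ, 2)]                    # O(n²) per sample
end
```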
mean_temperatures (generic function with 1 method)
An example run with about 2000 virtual sensors and 1000 samples takes a bit under 3 seconds to process this way on my laptop.
Rewrite this code to take $O(n^3) + O(mn)$ time by reassociating the expression $\tilde{\theta}(t) = w^T \Phi_{XX}^{-1} \theta_X(t)$. Compare the timing of the example given above to the timing of your routine; do the numbers seem reasonable? At the same time, do a comparison of the results to make sure that you do not get something different! You may want to use the testing and timing harness below.
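As a hint at the reassociation (a sketch with hypothetical names, to be verified rather than trusted): since $\Phi_{XX}$ is symmetric, $w^T (\Phi_{XX}^{-1} \theta_X(t)) = (\Phi_{XX}^{-1} w)^T \theta_X(t)$, so the solve against $w$ can be done once up front.

```julia
using LinearAlgebra, SpecialFunctions

# Reassociated version: precompute u = ΦXX⁻¹ w once; then each measurement
# costs only a dot product.
function mean_temperatures2_sketch(xs, Θ, l)
    ΦXX = Symmetric([exp(-(xi-xj)^2/l^2) for xi in xs, xj in xs])
    w = [l*sqrt(π)/2*(erf(xj/l) + erf((1-xj)/l)) for xj in xs]
    u = cholesky(ΦXX) \ w                     # O(n³) + O(n²), once
    [dot(u, Θ[:, t]) for t in axes(Θ, 2)]     # O(n) per measurement
end
```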
mean_temperatures2 (generic function with 1 method)
test_sensor_demo (generic function with 1 method)
0.0
2.668168 seconds (12.02 M allocations: 794.908 MiB, 2.46% gc time)
3.304412 seconds (24.03 M allocations: 1.522 GiB, 9.26% gc time)
Task 2: Adding data points
The `add_centers_ref!` function adds centers to an existing `RBFCenters` object, then recomputes the factorization from scratch. If we add $k$ centers to an existing object with $n$ centers, this costs $O((n+k)^3)$. However, we can compute the same thing in $O(n^2 k + k^3)$ time by extending the existing Cholesky factorization, using what we know about block factorization methods (the relevant identity is sketched below).
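For reference, the relevant (standard) block factorization identity: if $\Phi_{XX} = R^T R$ is the current Cholesky factorization, $B$ is the $n \times k$ matrix of RBF evaluations between old and new centers, and $C$ is the $k \times k$ matrix among the new centers, then

$$\begin{bmatrix} \Phi_{XX} & B \\ B^T & C \end{bmatrix} = \begin{bmatrix} R^T & 0 \\ S^T & L^T \end{bmatrix} \begin{bmatrix} R & S \\ 0 & L \end{bmatrix}, \qquad S = R^{-T} B, \quad L^T L = C - S^T S.$$

Computing $S$ costs $O(n^2 k)$, forming the Schur complement $C - S^T S$ costs $O(nk^2)$, and its Cholesky factorization costs $O(k^3)$.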
It may be helpful to use `ldiv!(A, B)`, which overwrites the storage in `B` with `A\B` (where `A` can be a factorization object or a matrix with a simple solver, like an upper triangular or diagonal matrix).
add_centers! (generic function with 1 method)
test_add_centers (generic function with 1 method)
Relative error: 0.0
3.960648 seconds (113.90 M allocations: 7.072 GiB, 12.60% gc time)
4.148284 seconds (113.90 M allocations: 7.072 GiB, 16.45% gc time)
Task 3: Missing data
Now suppose that we are again dealing with streaming sensor data, but every so often there are entries missing. If the $k$th measurement is missing, the interpolation conditions for the remaining points can be written in terms of a smaller system of equations where we remove row and column $k$ from the original problem; or as
$$\begin{align*} f(x_i) &= \sum_{j=1}^n \phi(\|x_i-x_j\|) \hat{c}_j + r_i, & r_i = 0 \mbox{ for } i \neq k \\ \hat{c}_k &= 0. \end{align*}$$
Equivalently, we have
$$\begin{bmatrix} \Phi_{XX} & e_k \\ e_k^T & 0 \end{bmatrix} \begin{bmatrix} \hat{c} \\ r_k \end{bmatrix} = \begin{bmatrix} \tilde{f}_X \\ 0 \end{bmatrix}$$
where $\tilde{f}_X$ agrees with $f_X$ except in the $k$th element. If we set $(\tilde{f}_X)_k = 0$, then $-r_k$ is the value at $x_k$ of the interpolant through the remaining data – potentially a pretty good guess for the true value.
Using block elimination on the system above, complete the following routine to fill in a single missing value in an indicated location. Your code should take $O(n^2)$ time (it would take $O(n^3)$ to refactor from scratch).
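As a hint at the structure (a standard Schur complement step; verify it before you rely on it): eliminating $\hat{c}$ from the bordered system gives

$$r_k = \frac{e_k^T \Phi_{XX}^{-1} \tilde{f}_X}{e_k^T \Phi_{XX}^{-1} e_k} = \frac{(\Phi_{XX}^{-1} \tilde{f}_X)_k}{(\Phi_{XX}^{-1})_{kk}}, \qquad \hat{c} = \Phi_{XX}^{-1} (\tilde{f}_X - e_k r_k),$$

where both solves against $\Phi_{XX}$ reuse the existing factorization and hence cost $O(n^2)$ each.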
fill_missing! (generic function with 1 method)
fill_missing! vs reference rel diff: 1.0
Abs error in approximation: 0.22252093395631445
Task 4: Sparse approximation
So far, we have used dense matrix representations for everything. As you have seen, this is often basically fine for up to a couple thousand data points, provided that we are careful to re-use factorizations and organize our computations for efficiency. When we have much more data, though, the $O(n^3)$ cost of an initial factorization gets to be excessive.
There are several ways to use data sparsity of the RBF matrix for faster (approximate) solves. For this project, though, we will focus on a simple one: using ordinary sparsity of $\Phi_{XX}$ for compactly supported kernels like the Wendland, or approximating that sparsity for rapidly-decaying kernels like the squared exponential.
We are going to solve 2D function approximation problems here. To do this, we will want a couple of additional routines that we will provide.
Code preliminaries
Sampling the space
When we want to sample something in more than one spatial dimension and aren't just using a regular mesh, it is tempting to choose random samples. But taking independent uniform draws is not an especially effective way of covering a space – random numbers tend to clump up. For this reason, low discrepancy sequences are often a better basis for sampling than (pseudo)random draws. There are many such generators; we use a relatively simple one based on an additive recurrence with a multiplier based on the "generalized golden ratio".
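One common construction of this type (a sketch; the seed offset and iteration details are our assumptions) solves $g^{d+1} = g + 1$ for the "generalized golden ratio" $g$ and uses $\alpha_i = g^{-i} \bmod 1$ as the recurrence multipliers:

```julia
# d-dimensional Kronecker sequence; returns points as columns of a d-by-n matrix.
function kronecker_quasirand_sketch(d, n, start=0)
    g = 2.0                              # fixed point iteration for g^(d+1) = g+1
    for _ in 1:100
        g = (1.0 + g)^(1/(d+1))
    end
    α = [mod(g^(-i), 1.0) for i in 1:d]
    [mod(0.5 + (start+j)*α[i], 1.0) for i in 1:d, j in 1:n]
end
```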
kronecker_quasirand (generic function with 2 methods)
Finding neighbors
If we work with a radial basis function $\phi$ that is zero past some distance $r_{\mathrm{cutoff}}$, then $(\Phi_{XX})_{ij} \neq 0$ only if $\|x_i-x_j\| \leq r_{\mathrm{cutoff}}$. We can, of course, find all such pairs in $O(n^2)$ time by checking all possible pairs. But things can go faster with a simple data structure. In 2D, think of putting every point $x_i$ into a square "bucket" with side length $r_{\mathrm{cutoff}}$. Then all points within $r_{\mathrm{cutoff}}$ of $x_i$ must be in one of nine buckets: the one containing $x_i$, or any of its neighbors. If we give the buckets identifiers and sort them in "column major" order (and sort their contents accordingly), this can significantly reduce the number of potential neighbors to any point that we might have to check.
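A sketch of the bucketing idea (here with a `Dict` keyed by integer bucket coordinates, rather than the sorted column-major layout described above):

```julia
using LinearAlgebra

# For each point (a column of X), list the indices of all points within rcut.
function find_neighbors2d_sketch(X :: AbstractMatrix, rcut)
    bucket(i) = (floor(Int, X[1,i]/rcut), floor(Int, X[2,i]/rcut))
    buckets = Dict{Tuple{Int,Int}, Vector{Int}}()
    for i in axes(X, 2)
        push!(get!(buckets, bucket(i), Int[]), i)
    end
    neighbors = [Int[] for _ in axes(X, 2)]
    for i in axes(X, 2)
        bx, by = bucket(i)
        for dx in -1:1, dy in -1:1           # the 3-by-3 block of buckets
            for j in get(buckets, (bx+dx, by+dy), Int[])
                norm(X[:,i] - X[:,j]) <= rcut && push!(neighbors[i], j)
            end
        end
    end
    neighbors
end
```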
find_neighbors2d (generic function with 1 method)
Sparse RBF centers
Using the data structure in `find_neighbors2d`, we can create a 2D sparse version of the `RBFCenters` data structure. We will not provide functionality to add new centers after initialization, so we will not bother with making it mutable.
RBFCenters2Ds
As before, we want the ability to solve for a right hand side, evaluate a function, and evaluate a gradient.
dfeval (generic function with 2 methods)
Hello world
As before, we do a "hello world" test to illustrate the behavior of the solver. In this case, we approximate a simple function from 10K sample points in 2D.
test_hello_world_2d (generic function with 2 methods)
1.44689
1.44226
0.510156 seconds (5.48 M allocations: 430.675 MiB, 11.45% gc time, 34.19% compilation time)
0.150889 seconds (278.17 k allocations: 14.567 MiB, 54.32% gc time, 98.02% compilation time)
0.073137 seconds (421.08 k allocations: 22.505 MiB, 99.48% compilation time)
The questions
For the case of the Wendland kernel, the cutoff is exact, and $\Phi_{XX}$ is truly sparse. The squared exponential kernel is never exactly zero, but it does get very small away from the origin, so we might try the same approach here.
hello_world_2d_se (generic function with 3 methods)
2.87052e-6
1.44689
0.757898 seconds (12.91 M allocations: 976.164 MiB, 10.34% gc time)
0.017487 seconds (70.52 k allocations: 3.536 MiB, 98.15% compilation time)
0.004284 seconds (19 allocations: 320.453 KiB)
Even a little playing around with this code should illustrate that it's delicate. Notice that we have changed the length scale to be even shorter. For example, try running with $l = 0.05$; what happens?
Answer: TODO
The code doesn't crash with $l = 0.02$, and our sanity check computation on a simple test function looks pretty good. But we would like to better understand the error in this approximation. Let $\hat{s}(x)$ be the sparsified approximator that we have computed above, i.e.
$$\hat{s}(x) = \sum_{\|x-x_i\| \leq r_{\mathrm{cutoff}}} \phi(\|x-x_i\|) c_{0,i}$$
Let us start by showing that
$$|\hat{s}(x)-s(x)| \leq \|c_0\|_1 \exp(-\sigma^2) + \|c-c_0\|_1,$$
where $\sigma = r_{\mathrm{cutoff}}/l$ is the cutoff radius measured in length scales, so that $\exp(-\sigma^2)$ bounds the neglected kernel values.
Hint/sketch: Write $\hat{\Phi}_{xX}$ as the vector of evaluations with cutoff, so that $\hat{s}(x) = \hat{\Phi}_{xX} c_0$ and $s(x) = \Phi_{xX} c$. Add and subtract $\Phi_{xX} c_0$; apply the triangle inequality; use the fact that $|u \cdot v| \leq \|u\|_1 \|v\|_\infty$ for any $u$ and $v$; and derive simple bounds on $\|\Phi_{xX}\|_\infty$ and $\|\hat{\Phi}_{xX}-\Phi_{xX}\|_\infty$.
Answer: TODO
The bound is pessimistic, but it gives a sense of what can go wrong: if there is a large error in our coefficients, or if the coefficient norm times the magnitude of the neglected RBF evaluations is big, then we may have large differences between $s$ and $\hat{s}$.
The error in the coefficient vector $\|c-c_0\|_1$ with our initial parameters can be quite large, enough to make our error bounds terrible (even if the actual error might not be as bad). But even if the factorization of $\hat{\Phi}_{XX}$ is not good enough to get a very accurate $c$ alone, it is good enough to make progress. We therefore try an iterative refinement loop, combining our approximate solver based on $\hat{\Phi}_{XX}$ with a matrix-vector product with a (still sparse) $\tilde{\Phi}_{XX}$ that provides a much more accurate approximation to $\Phi_{XX}$. Your goal: carry out the iterative refinement and give evidence of its convergence.
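The generic pattern looks something like the following sketch (the names here are hypothetical: `F̂` is a factorization of the aggressively sparsified matrix, `Φ̃XX` the more accurate sparse matrix):

```julia
using LinearAlgebra

# Iterative refinement: cheap approximate solves plus accurate residuals.
function itref_solve(F̂, Φ̃XX, fX; maxiter=10)
    c = F̂ \ fX
    for iter in 1:maxiter
        r = fX - Φ̃XX*c                  # residual with the accurate operator
        dc = F̂ \ r                      # approximate correction
        c += dc
        println("step $iter: ‖dc‖ = ", norm(dc))  # should shrink if converging
    end
    c
end
```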
hello_world_2d_se_itref (generic function with 4 methods)
We can take a rough guess at the magnitude of the error in $c$ by the magnitude of the corrections taken during the iterative refinement loop. As for the first term in our error bound, we can compute that easily in Julia.
hello_world2d_iterf_err_term (generic function with 4 methods)
0.0022668523126674854
0.719999 seconds (12.91 M allocations: 1003.542 MiB, 8.45% gc time)
0.000074 seconds (4 allocations: 78.234 KiB)
0.004387 seconds (19 allocations: 320.453 KiB)
2.7975179979973083e-7
0.754758 seconds (12.91 M allocations: 1003.542 MiB, 9.67% gc time)
0.000078 seconds (4 allocations: 78.234 KiB)
0.005419 seconds (19 allocations: 320.453 KiB)
Task 5: Fun with Fekete
The Fekete points for RBF interpolation on some domain $\Omega$ are the points that maximize $\det \Phi_{XX}$ – though it will be convenient for us to instead consider $\log \det \Phi_{XX}$. In order to maximize such a quantity, we would like to be able to compute derivatives; that is, we seek to compute the matrix $G$ with entries
$$G_{ij} = \frac{\partial (\log \det \Phi_{XX})}{\partial x_{ij}}$$
where $x_{ij}$ represents component $i$ of the $j$th center. Some matrix calculus that we will not go into now lets us write this as
$$G_{ij} = \operatorname{tr}\left( \Phi_{XX}^{-1} \, \frac{\partial \Phi_{XX}}{\partial x_{ij}} \right),$$
and we can write the second matrix as
$$\frac{\partial \Phi_{XX}}{\partial x_{ij}} = e_j (v^{ij})^T + v^{ij} e_j^T \mbox{ where } v^{ij}_k = \begin{cases} \phi'(\|x_j-x_k\|) \frac{x_{ij}-x_{ik}}{\|x_j-x_k\|}, & k \neq j \\ 0, & k = j. \end{cases}$$
A standard property of traces (the cyclic property) tells us that in general for any matrix $A$ and vectors $u$ and $w$ such that the dimensions make sense,
$$\operatorname{tr}(Auw^T) = w^T A u$$
Therefore, using the $v$ vectors introduced above, we can write
$$G_{ij} = 2 e_j^T \Phi_{XX}^{-1} v^{ij} = 2 (v^{ij})^T \Phi_{XX}^{-1} e_j$$
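One standard identity worth recalling here (a hint, not the full solution): if $\Phi_{XX} = R^T R$ is a Cholesky factorization, then $\log \det \Phi_{XX} = 2 \sum_i \log r_{ii}$, which costs only $O(n)$ once the factorization is in hand.

```julia
using LinearAlgebra

# Log-determinant from an existing Cholesky factorization F of ΦXX.
logdet_from_chol(F :: Cholesky) = 2*sum(log, diag(F.U))
# (LinearAlgebra's logdet(F) computes the same thing for a Cholesky object.)
```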
For this problem, you should complete the following routines to compute the log determinant of $\Phi_{XX}$ and the derivative matrix $G$ in the indicated time bounds. A tester is provided below.
logdet_ΦXX (generic function with 1 method)
dlogdet_ΦXX (generic function with 1 method)
test_logdet (generic function with 1 method)
NaN