CS 6241: Numerics for Data Science
Computing with kernels
The basic tasks
In the past two lectures, we have discussed various interpretations of kernels and how they are used for interpolation and regression. In today’s lecture, we focus on the numerical methods used to fit kernel-based models and to use them for prediction and uncertainty estimation. More specifically, we consider four problems:
1. For a fixed kernel, how do we fit the parameters for a kernel-based model? That is, how do we compute coefficients $c$ such that $s(x) = \sum_{j=1}^n c_j k(x, x_j)$ approximately satisfies $s(x_i) = y_i$, where $(x_i, y_i)$ are our observed data points?
2. Many kernels involve hyper-parameters, such as length scale and smoothness parameters. How should we determine these hyper-parameters?
3. How do we quickly evaluate the predicted values $s(x)$ at new points $x$?
4. How do we quickly evaluate measures that quantify uncertainty, such as the predictive variance?
Because fitting model parameters and hyper-parameters is more costly than evaluating the predicted mean field or predictive variance, we will focus most of our attention on the fitting tasks. We will also focus primarily on the probabilistic interpretation of kernels in terms of Gaussian processes. Because we have in mind downstream tasks in which a user can adaptively obtain new data (as opposed to fitting to a fixed data set), we will also discuss incremental fitting.
Learning at small $n$
We initially consider the case where there are few enough data points that it makes sense to use standard direct factorization methods to solve the fitting problem. These standard factorizations require $O(n^3)$ time.
Cholesky and kernel-only fitting
The Cholesky factorization of a positive definite kernel matrix $K$ is the decomposition $K = R^T R$, where $R$ is upper triangular with positive diagonal entries.
For computation, it is useful to think of the decomposition in block terms:
$$\begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix} =
\begin{bmatrix} R_{11}^T & 0 \\ R_{12}^T & R_{22}^T \end{bmatrix}
\begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix},$$
so that $K_{11} = R_{11}^T R_{11}$, $K_{12} = R_{11}^T R_{12}$, and $K_{22} = R_{12}^T R_{12} + R_{22}^T R_{22}$. The last relation says that $R_{22}$ is the Cholesky factor of the Schur complement $K_{22} - R_{12}^T R_{12} = K_{22} - K_{21} K_{11}^{-1} K_{12}$.
In introductory numerical methods courses, one typically first encounters the (scalar¹) right-looking version of the Cholesky algorithm:
% Overwrite K with the Cholesky factor R
for j = 1:n
  K(j,j) = sqrt(K(j,j));
  K(j,j+1:end) = K(j,j+1:end) / K(j,j);
  K(j+1:end,j) = 0;
  K(j+1:end,j+1:end) = K(j+1:end,j+1:end) - K(j,j+1:end)'*K(j,j+1:end);
end
At step $j$ of the loop, the trailing submatrix K(j+1:end,j+1:end) holds the Schur complement of the leading $j$-by-$j$ block of the original matrix, i.e. the part of the factorization that remains to be computed.
An alternate organization, the left-looking version, defers the Schur complement update until it is needed:
% Overwrite K with the Cholesky factor R
K(1,1) = sqrt(K(1,1));
K(2:end,1) = 0;
for j = 2:n
  K(1:j-1,j) = K(1:j-1,1:j-1)' \ K(1:j-1,j);  % R(1:j-1,j) = R11^{-T} K(1:j-1,j)
  K(j,1:j-1) = 0;
  K(j,j) = sqrt(K(j,j) - K(1:j-1,j)'*K(1:j-1,j));
end
Both organizations of Cholesky do the same operations, but in a different order. Both require $O(n^3)$ time (about $n^3/3$ flops) and can be carried out in place, overwriting $K$.
The advantage of the left-looking factorization is that it can be applied incrementally. That is, suppose that we have already computed the Cholesky factor $R_{11}$ of the kernel matrix $K_{11}$ for an initial set of data points. When new data points arrive, we only need to compute the new columns of the factor, $R_{12} = R_{11}^{-T} K_{12}$, and the Cholesky factor $R_{22}$ of the Schur complement $K_{22} - R_{12}^T R_{12}$; none of the work done on the original points needs to be redone, as in the sketch below.
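As a concrete illustration, the following minimal MATLAB sketch (with variable names of our own choosing) extends an existing factor to cover a new block of points.

% Sketch: extend an existing Cholesky factor when new points are added.
% Inputs: R11 (upper triangular, K11 = R11'*R11), K12 and K22 for the new points.
R12 = R11' \ K12;                          % new columns of the factor
S   = K22 - R12'*R12;                      % Schur complement for the new block
R22 = chol(S);                             % factor the (small) Schur complement
R   = [R11, R12; zeros(size(R12')), R22];  % extended factor of the bordered matrix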
Fitting with a tail
We now consider fitting in the presence of a tail term. Polynomial tails are typically used with conditionally positive definite kernel functions, but they can also be used with positive definite kernels, and indeed they commonly are in geostatistical applications. One can also incorporate non-polynomial terms into the tail if they are useful to the model.
The fitting problem with a tail looks like
$$\begin{bmatrix} K & P \\ P^T & 0 \end{bmatrix} \begin{bmatrix} c \\ d \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix},$$
where the columns of $P$ are the tail basis functions evaluated at the data points. The approximant is $s(x) = \sum_j c_j k(x, x_j) + \sum_i d_i p_i(x)$, and the second block row enforces the discrete orthogonality constraint $P^T c = 0$.
One way to deal with the tail term is the “null space” approach; that is, rather than adding equations that enforce the discrete orthogonality constraint, we find a coordinate system in which the discrete orthogonality constraint is automatic. Specifically, suppose we write the full QR decomposition of $P$ as
$$P = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_1 \\ 0 \end{bmatrix}.$$
Then $P^T c = 0$ exactly when $c = Q_2 w$ for some $w$. Substituting and multiplying the first block equation by $Q_2^T$ gives the smaller positive definite system $Q_2^T K Q_2 w = Q_2^T y$; once $c = Q_2 w$ is known, the tail coefficients follow from $R_1 d = Q_1^T (y - Kc)$. A sketch of this approach appears below.
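The following minimal MATLAB sketch carries out the null-space computation just described, assuming $K$, $P$, and $y$ are already in hand.

% Sketch: null-space approach to fitting with a polynomial tail.
% Inputs: K (n-by-n kernel matrix), P (n-by-m tail basis at the data), y (data).
[Q, Rfull] = qr(P);                    % full QR decomposition of P
m  = size(P, 2);
Q1 = Q(:, 1:m);  Q2 = Q(:, m+1:end);  R1 = Rfull(1:m, 1:m);
w  = (Q2'*K*Q2) \ (Q2'*y);             % reduced positive definite system
c  = Q2*w;                             % kernel coefficients satisfy P'*c = 0
d  = R1 \ (Q1'*(y - K*c));             % tail coefficients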
An alternate approach is to partition the data points into two groups, the first of which is unisolvent and has the same size as the dimension of the tail. Then we can write any coefficient vector $c$ satisfying the discrete orthogonality constraint in terms of its second block: partitioning $P^T c = P_1^T c_1 + P_2^T c_2 = 0$ with $P_1$ square and nonsingular (by unisolvency), we have $c_1 = -P_1^{-T} P_2^T c_2$. Eliminating $c_1$ in this way reduces the constrained problem to an unconstrained one in $c_2$ and $d$.
Likelihoods and gradients
Recall from the last lecture that the log likelihood function for a (mean zero) Gaussian process is
$$\ell = -\frac{1}{2} y^T K^{-1} y - \frac{1}{2} \log \det K - \frac{n}{2} \log(2\pi),$$
where $K$ is the kernel matrix for the observed data (including any noise variance on the diagonal). Given a Cholesky factorization $K = R^T R$, both terms are cheap to evaluate: $y^T K^{-1} y = \|R^{-T} y\|^2$ and $\log \det K = 2 \sum_j \log r_{jj}$.
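The following MATLAB sketch (with illustrative variable names) evaluates the log likelihood given $K$ and $y$ using the Cholesky factorization.

% Sketch: evaluating the Gaussian process log likelihood via Cholesky.
% Inputs: K (kernel matrix, noise included on the diagonal), y (observations).
n = length(y);
R = chol(K);                     % K = R'*R, R upper triangular
z = R' \ y;                      % so that y'*inv(K)*y = z'*z
c = R \ z;                       % c = inv(K)*y, reused for prediction and gradients
ell = -0.5*(z'*z) - sum(log(diag(R))) - 0.5*n*log(2*pi);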
In order to optimize kernel hyper-parameters via maximum likelihood, we would also like to compute the gradient (and maybe the Hessian) with respect to the hyper-parameters. Recall from last time that we computed the derivative
$$\frac{\partial \ell}{\partial \theta_i} = \frac{1}{2} c^T \frac{\partial K}{\partial \theta_i} c - \frac{1}{2} \operatorname{tr}\!\left( K^{-1} \frac{\partial K}{\partial \theta_i} \right), \qquad c = K^{-1} y.$$
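As a minimal sketch of how these derivatives are assembled (the cell array dK holding the matrices $\partial K/\partial\theta_i$ is our own naming, not part of the notes), and assuming R and c from the likelihood computation above:

% Sketch: gradient of the log likelihood with respect to hyper-parameters.
% Assumes: R = chol(K) and c = K\y already computed; dK{i} holds dK/dtheta_i.
p = numel(dK);
g = zeros(p, 1);
for i = 1:p
  g(i) = 0.5*(c'*dK{i}*c) - 0.5*trace(R \ (R' \ dK{i}));  % trace(inv(K)*dK{i})
end

Note that forming $K^{-1}\,\partial K/\partial\theta_i$ for the trace term costs $O(n^3)$ per hyper-parameter, which is why we will later want cheaper alternatives.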
Simply computing gradients of the log likelihood is sufficient for gradient descent or quasi-Newton methods such as BFGS. However, if the number of hyper-parameters is not too great, we may also decide to compute second derivatives and compute a true Newton iteration. Let $K_{,i}$ denote $\partial K/\partial\theta_i$ and $c = K^{-1} y$; differentiating the gradient expression once more gives
$$\frac{\partial^2 \ell}{\partial \theta_i \partial \theta_j} = \frac{1}{2} c^T K_{,ij} c - c^T K_{,i} K^{-1} K_{,j} c - \frac{1}{2} \operatorname{tr}\!\left( K^{-1} K_{,ij} - K^{-1} K_{,i} K^{-1} K_{,j} \right).$$
Optimizing the nugget
An important special case is optimization with only a single hyper-parameter, the noise variance term or “nugget” term.² That is, we seek to maximize
$$\ell(\eta) = -\frac{1}{2} y^T (K + \eta I)^{-1} y - \frac{1}{2} \log \det(K + \eta I) - \frac{n}{2} \log(2\pi).$$
If we compute the eigendecomposition $K = Q \Lambda Q^T$ once (at $O(n^3)$ cost), then each evaluation of $\ell(\eta)$ and its derivatives costs only $O(n)$:
$$\ell(\eta) = -\frac{1}{2} \sum_{j=1}^n \frac{\hat{y}_j^2}{\lambda_j + \eta} - \frac{1}{2} \sum_{j=1}^n \log(\lambda_j + \eta) - \frac{n}{2} \log(2\pi), \qquad \hat{y} = Q^T y,$$
as in the sketch below.
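A minimal MATLAB sketch of this computation follows; the search interval passed to fminbnd is purely illustrative.

% Sketch: optimizing the nugget via a one-time eigendecomposition of K.
% After the O(n^3) eigendecomposition, each likelihood evaluation costs O(n).
n = length(y);
[Q, Lambda] = eig((K + K')/2);    % symmetrize to guard against roundoff
lam  = diag(Lambda);
yhat = Q'*y;
negll = @(eta) 0.5*sum(yhat.^2 ./ (lam + eta)) + ...
        0.5*sum(log(lam + eta)) + 0.5*n*log(2*pi);
eta = fminbnd(negll, 1e-8, 1);    % illustrative search interval for eta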
Smooth kernels and low-rank structure
The cost of parameter fitting for a general kernel matrix is $O(n^3)$, which becomes prohibitive as the number of data points grows. Fortunately, when the kernel is smooth, the kernel matrix is often numerically low rank, and we can exploit that structure to fit much more quickly.
Pivoted Cholesky and kernel approximation
We begin with an outer product factorization of the kernel matrix computed by pivoted Cholesky. Running $k$ steps of Cholesky with diagonal pivoting yields
$$\Pi^T K \Pi \approx R_k^T R_k,$$
where $\Pi$ is a permutation matrix and $R_k \in \mathbb{R}^{k \times n}$ consists of the first $k$ rows of the Cholesky factor.
The approximation error associated with truncating the pivoted Cholesky factorization is just the Schur complement matrix; that is (suppressing the permutation momentarily),
$$K - R_k^T R_k = \begin{bmatrix} 0 & 0 \\ 0 & S \end{bmatrix}, \qquad S = K_{22} - K_{21} K_{11}^{-1} K_{12},$$
where $K_{11}$ is the block associated with the $k$ selected pivots.
Part of what makes pivoted Cholesky attractive is that we can run the algorithm without ever forming the full kernel matrix $K$: we only need the diagonal of $K$ and the rows associated with the pivots chosen so far. Step $j$ of the algorithm involves:
1. Searching the Schur complement diagonal $d$ to find the pivot row $p$, and swapping points $x_j$ and $x_p$ and columns $j$ and $p$ of $R$ (and the entries $d_j$ and $d_p$).
2. Forming the top row of the current Schur complement using the formula $s_{ji} = k(x_j, x_i) - \sum_{l=1}^{j-1} r_{lj} r_{li}$ for $i \ge j$.
3. Extending the factorization with $r_{jj} = \sqrt{s_{jj}}$ and $r_{ji} = s_{ji}/r_{jj}$ for $i > j$.
4. Updating the diagonal of the Schur complement via $d_i \leftarrow d_i - r_{ji}^2$.
Running pivoted Cholesky in this way for $k$ steps requires $O(nk)$ kernel evaluations, $O(nk^2)$ additional arithmetic, and $O(nk)$ storage; a MATLAB sketch appears below.
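The following sketch implements the steps above; the kernel handle kfun and the function name are illustrative choices, not part of the notes. Points are stored one per column of X.

% Sketch: truncated pivoted Cholesky using only O(nk) kernel evaluations.
% kfun(X1, X2) is assumed to return the matrix of kernel values between
% the points (columns) of X1 and X2.
function [R, piv] = pchol_sketch(kfun, X, k)
  n   = size(X, 2);
  piv = 1:n;                                 % current ordering of the points
  d   = zeros(n, 1);
  for i = 1:n, d(i) = kfun(X(:,i), X(:,i)); end   % diagonal of K only
  R = zeros(k, n);
  for j = 1:k
    [~, p] = max(d(j:n));  p = p + j - 1;    % pivot: largest Schur complement diagonal
    piv([j p]) = piv([p j]);                 % swap points, columns of R, and d
    R(:, [j p]) = R(:, [p j]);
    d([j p]) = d([p j]);
    row = kfun(X(:, piv(j)), X(:, piv(j:n)));     % row of K at the pivot point
    s   = row - R(1:j-1, j)' * R(1:j-1, j:n);     % row of the Schur complement
    R(j, j)     = sqrt(s(1));
    R(j, j+1:n) = s(2:end) / R(j, j);
    d(j+1:n)    = d(j+1:n) - (R(j, j+1:n)').^2;   % update remaining diagonal
  end
end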
The effects of truncation error
We now consider the solution of the linear system $(K + \eta I)c = y$ when $K$ is replaced by the truncated factorization $\hat{K} = R_k^T R_k$, and how the truncation error $K - \hat{K}$ (the Schur complement) affects the computed coefficients.
Solving with the kernel
To solve a system $(\hat{K} + \eta I) c = y$, we write $\hat{K} = W W^T$ with $W = R_k^T \in \mathbb{R}^{n \times k}$ and apply the Sherman-Morrison-Woodbury formula:
$$(W W^T + \eta I)^{-1} = \eta^{-1} \left( I - W (W^T W + \eta I)^{-1} W^T \right).$$
While we can certainly work with the Sherman-Morrison-Woodbury formula directly, we can do a little better if we recognize that the equations in the inner solve, $(W^T W + \eta I) z = W^T y$, are the normal equations for the regularized least squares problem
$$\min_z \left\| \begin{bmatrix} W \\ \sqrt{\eta} I \end{bmatrix} z - \begin{bmatrix} y \\ 0 \end{bmatrix} \right\|^2.$$
Solving this problem via a QR factorization of the stacked matrix avoids squaring the condition number, and the top block of the residual gives the solution directly: $c = \eta^{-1}(y - Wz)$, as can be checked by substitution. A sketch of this computation follows.
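A minimal MATLAB sketch, assuming W, eta, and y are given:

% Sketch: solve (W*W' + eta*I) c = y via a stacked least squares problem,
% avoiding the explicitly formed normal equations of the Woodbury formula.
[n, k] = size(W);
A = [W; sqrt(eta)*eye(k)];        % stacked matrix for the regularized problem
b = [y; zeros(k, 1)];
z = A \ b;                        % MATLAB solves the least squares problem via QR
c = (y - W*z) / eta;              % top block of the residual, scaled by 1/eta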
Evaluating the kernel approximant
We have so far treated the pivoted Cholesky approach as a way to approximate the kernel matrix. Another interpretation is that we have computed a semidefinite kernel function
$$\hat{k}(x, x') = k(x, X_1) \, k(X_1, X_1)^{-1} \, k(X_1, x'),$$
where $X_1$ denotes the $k$ selected pivot points; evaluating $\hat{k}$ on the full data set reproduces the approximation $\hat{K}$ (this is sometimes called a Nyström approximation).
Writing $\hat{c}$ for the coefficients fit with the approximate kernel, the predicted values at new points involve only the pivot points: $\hat{s}(x) = \hat{k}(x, X)\hat{c} = k(x, X_1) w$, where $w = k(X_1, X_1)^{-1} k(X_1, X)\hat{c}$ is a vector of length $k$ that can be computed once. After that, each prediction costs only $k$ kernel evaluations rather than $n$.
Predictive variance
Treating $\hat{k}$ as the kernel in the Gaussian process model, the predictive variance at a new point $x$ is
$$\hat{v}(x) = \hat{k}(x, x) - \hat{k}(x, X) (\hat{K} + \eta I)^{-1} \hat{k}(X, x),$$
and the low-rank structure of $\hat{K}$ again lets us evaluate this quickly, for example via the same stacked least squares factorization used for the fitting problem.
Likelihoods and gradients
Recall the log likelihood function is
$$\ell = -\frac{1}{2} y^T \tilde{K}^{-1} y - \frac{1}{2} \log \det \tilde{K} - \frac{n}{2} \log(2\pi),$$
where $\tilde{K}$ now denotes the regularized kernel matrix.
Consider the regularized kernel matrix $\tilde{K} = W W^T + \eta I$ with $W \in \mathbb{R}^{n \times k}$. The quadratic term $y^T \tilde{K}^{-1} y$ can be computed with the Woodbury solve described above, and the log determinant follows from the corresponding determinant identity
$$\log \det(W W^T + \eta I) = \log \det(W^T W + \eta I) + (n - k) \log \eta,$$
so both terms cost $O(nk^2)$ rather than $O(n^3)$, as in the sketch below.
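A minimal MATLAB sketch of the likelihood evaluation under these assumptions (W, eta, and y given):

% Sketch: log likelihood for the low-rank-plus-nugget model Ktilde = W*W' + eta*I.
[n, k] = size(W);
A = [W; sqrt(eta)*eye(k)];
[~, RA] = qr(A, 0);               % economy QR; RA'*RA = W'*W + eta*I
z = RA' \ (W'*y);                 % piece of the Woodbury solve
quad = (y'*y - z'*z) / eta;       % y'*inv(Ktilde)*y
logdet = 2*sum(log(abs(diag(RA)))) + (n-k)*log(eta);
ell = -0.5*quad - 0.5*logdet - 0.5*n*log(2*pi);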
What of the derivative of the log likelihood? That is, we would like a fast method to compute
$$\frac{\partial \ell}{\partial \theta_i} = \frac{1}{2} c^T \frac{\partial \tilde{K}}{\partial \theta_i} c - \frac{1}{2} \operatorname{tr}\!\left( \tilde{K}^{-1} \frac{\partial \tilde{K}}{\partial \theta_i} \right), \qquad c = \tilde{K}^{-1} y,$$
where the trace term is the part that requires some care.
Now suppose we differentiate the log-likelihood for the approximate kernel induced by the pivot points $X_1$. The derivative of $\tilde{K} = k(X, X_1) k(X_1, X_1)^{-1} k(X_1, X) + \eta I$ with respect to a hyper-parameter involves only the derivatives of the blocks $k(X, X_1)$ and $k(X_1, X_1)$, each of which retains the low-rank structure, so the quadratic and trace terms in the gradient can again be evaluated in $O(nk^2)$ time.
Beyond low rank
For a number of kernels, the kernel matrix is not all that low rank — the singular values decay, but not so quickly that we would be happy with a truncated factorization of very low rank. Handling such kernels efficiently requires other types of structure beyond the low-rank approximations discussed here.
Footnotes
1. The scalar version of the algorithm works one column at a time. Various blocked versions of the factorization algorithm update in blocks of several columns at a time, with a small (scalar) Cholesky factorization for the diagonal blocks. Block algorithms have the same complexity as the corresponding scalar algorithms, but achieve better performance on modern hardware because they make better use of the cache.
2. The term “nugget” comes from the use of Gaussian process regression in geostatistical applications. When trying to use Gaussian processes to predict the location of precious mineral deposits based on the mineral content in bore measurements, it is necessary to account for the possibility that a measurement may accidentally happen on a mineral nugget.