CS 6241: Numerics for Data Science

Computing with kernels

Author: David Bindel

Published: March 11, 2025

The basic tasks

In the past two lectures, we have discussed various interpretations of kernels and how they are used for interpolation and regression. In today’s lecture, we focus on the numerical methods used to fit kernel-based models and to use them for prediction and uncertainty estimation. More specifically, we consider four problems:

  • For a fixed kernel, how do we fit the parameters for a kernel-based model? That is, how do we compute $c$ such that $\hat{f}(x) = \sum_{j=1}^n c_j k(x, x_j)$ approximately satisfies $\hat{f}(x_i) \approx y_i$, where $(x_i, y_i)$ are our observed data points?

  • Many kernels involve hyper-parameters, such as length scale and smoothness parameters. How should we determine these hyper-parameters?

  • How do we quickly evaluate the predicted values $\hat{f}(x)$ at new points $x$?

  • How do we quickly evaluate measures that quantify uncertainty, such as the predictive variance?

Because fitting model parameters and hyper-parameters is more costly than evaluating the predicted mean field or predictive variance, we will focus most of our attention on the fitting tasks. We will also focus primarily on the probabilistic interpretation of kernels in terms of Gaussian processes. Because we have in mind downstream tasks in which a user can adaptively obtain new data (as opposed to fitting to a fixed data set), we will also discuss incremental fitting.

Learning at small n

We initially consider the case where there are few enough data points that it makes sense to use standard direct factorization methods to solve the fitting problem. These standard factorizations require $O(n^3)$ time, but if we use well-tuned libraries, the constants need not be large. Using a standard laptop, we can work with data sets of size greater than $n = 1000$ as a matter of routine. The direct factorization approach is also useful as a starting point for scalable algorithms more appropriate to large $n$.

Cholesky and kernel-only fitting

The Cholesky factorization of a positive definite kernel matrix $K$ is $K = R^T R$ where $R$ is upper triangular. The fitting system in the case of a positive definite kernel with no tail is $Kc = y$, and the standard solution algorithm is a Cholesky factorization of $K$ followed by two triangular solves, $c = R^{-1}(R^{-T} y)$. The Cholesky factorization costs $O(n^3)$; the triangular solves each cost $O(n^2)$. Hence, the cost is dominated by the cost of the Cholesky factorization.
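
As a concrete illustration, the following MATLAB sketch carries out the kernel-only fit; the kernel handle kfun, the point array X (points stored as columns), and the data vector y are hypothetical names not defined above.

  % Minimal sketch of kernel-only fitting with a Cholesky factorization.
  % kfun(x1, x2) evaluates the kernel; X is d-by-n; y is n-by-1.
  n = size(X, 2);
  K = zeros(n, n);
  for i = 1:n
    for j = 1:n
      K(i,j) = kfun(X(:,i), X(:,j));
    end
  end
  R = chol(K);         % K = R'*R with R upper triangular
  c = R \ (R' \ y);    % two triangular solves: c = R^{-1} (R^{-T} y)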

For computation, it is useful to think of the decomposition in block terms
$$\begin{bmatrix} K_{11} & K_{12} \\ K_{12}^T & K_{22} \end{bmatrix} = \begin{bmatrix} R_{11}^T & 0 \\ R_{12}^T & R_{22}^T \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix}.$$
Rearranging block-by-block, we have
$$\begin{aligned}
K_{11} &= R_{11}^T R_{11} & &\implies & R_{11}^T R_{11} &= K_{11}, \\
K_{12} &= R_{11}^T R_{12} & &\implies & R_{12} &= R_{11}^{-T} K_{12}, \\
K_{22} &= R_{12}^T R_{12} + R_{22}^T R_{22} & &\implies & R_{22}^T R_{22} &= K_{22} - R_{12}^T R_{12}.
\end{aligned}$$
That is, we can think of the decomposition in terms of a decomposition of the leading submatrix of $K$, a triangular solve to get an off-diagonal block, and then decomposition of the trailing Schur complement
$$S_{22} = R_{22}^T R_{22} = K_{22} - R_{12}^T R_{12} = K_{22} - K_{21} K_{11}^{-1} K_{12}.$$
The Schur complement has a meaning independent of the Cholesky factorization, as we can see by solving the system
$$\begin{bmatrix} R_{11}^T & 0 \\ R_{12}^T & R_{22}^T \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} 0 \\ I \end{bmatrix}.$$
Block forward substitution gives
$$\begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} 0 \\ R_{22}^{-T} \end{bmatrix},$$
and block back substitution then yields
$$\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} -R_{11}^{-1} R_{12} R_{22}^{-1} R_{22}^{-T} \\ R_{22}^{-1} R_{22}^{-T} \end{bmatrix} = \begin{bmatrix} -K_{11}^{-1} K_{12} S_{22}^{-1} \\ S_{22}^{-1} \end{bmatrix}.$$
That is, the inverse of the Schur complement is a submatrix of the inverse of the original matrix $K$: $S_{22}^{-1} = [K^{-1}]_{22}$. In the Gaussian process setting, if $K$ is the covariance matrix, then $K^{-1}$ is the precision matrix, and $S_{22}^{-1}$ is the precision matrix of the posterior for the second block of variables conditioned on observations of data points from the first block.

In introductory numerical methods courses, one typically first encounters the (scalar) right-looking version of the Cholesky algorithm:

  % Overwrite K with the Cholesky factor R
  for j = 1:n
    K(j,j) = sqrt(K(j,j));
    K(j,j+1:end) = K(j,j+1:end) / K(j,j);
    K(j+1:end,j) = 0;
    K(j+1:end,j+1:end) = K(j+1:end,j+1:end) - K(j,j+1:end)'*K(j,j+1:end);
  end

At step $j$ of the loop, we have computed the first $j-1$ rows of $R$, and the trailing submatrix contains the Schur complement. At the end of the step, we perform a rank-1 update to get a new (smaller) Schur complement matrix.

An alternate organization, the left-looking version, defers the Schur complement update until it is needed:

  % Overwrite K with the Cholesky factor R
  K(1,1) = sqrt(K(1,1));
  K(2:end,1) = 0;
  for j = 2:n
    K(1:j-1,j) = K(1:j-1,1:j-1)' \ K(1:j-1,j);
    K(j,1:j-1) = 0;
    K(j,j) = sqrt(K(j,j) - K(1:j-1,j)'*K(1:j-1,j));
  end

Both organizations of Cholesky perform the same operations, but in a different order. Both require $O(n^3)$ time; the dominant cost per step for the right-looking variant is the rank-1 update of the Schur complement, while the dominant cost per step for the left-looking variant is the triangular solve.

The advantage of the left-looking factorization is that it can be applied incrementally. That is, suppose that $K_{11}$ corresponds to the kernel matrix associated with an initial sample of data points, and we have computed $K_{11} = R_{11}^T R_{11}$. Then to add a second set of data points, we can do a left-looking block update
$$\begin{aligned}
R_{12} &= R_{11}^{-T} K_{12}, \\
R_{22}^T R_{22} &= K_{22} - R_{12}^T R_{12}.
\end{aligned}$$
The cost of this update is $O((n_1^2 + n_2^2) n_2)$, which is significantly less than the $O((n_1 + n_2)^3)$ cost of recomputing the factorization from scratch in the case that $n_2 \ll n_1$.
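
In code, the block update is a triangular solve followed by a Cholesky factorization of the new Schur complement. The sketch below assumes we already hold the factor R11 of the old kernel matrix along with the new blocks K12 and K22 (names chosen here for illustration).

  % Incremental (left-looking) block update of a Cholesky factorization.
  % R11 is the Cholesky factor of K11; K12 and K22 are the new blocks.
  R12 = R11' \ K12;                 % triangular solve for the off-diagonal block
  R22 = chol(K22 - R12'*R12);       % factor the new Schur complement
  R   = [R11, R12; zeros(size(R12')), R22];   % extended factor of [K11 K12; K12' K22]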

Fitting with a tail

We now consider fitting in the presence of a tail term. Polynomial tails are typically used with conditionally positive definite kernel functions, but they can also be used with positive definite kernels, as indeed they usually are in some geostatistical applications. One can also incorporate non-polynomial terms into the tail if they are useful to the model.

The fitting problem with a tail looks like
$$\hat{f}(x) = \sum_i c_i k(x, x_i) + \sum_j d_j p_j(x)$$
where the coefficients $c$ satisfy the discrete orthogonality condition $\sum_i p_j(x_i) c_i = 0$ for each basis function $p_j(x)$. This gives us the linear system
$$\begin{bmatrix} K & P \\ P^T & 0 \end{bmatrix} \begin{bmatrix} c \\ d \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}.$$
We can also see this linear system from the perspective of constrained optimization: we are minimizing a quadratic objective (associated with $K$) subject to linear constraints (associated with $P$). Note that this formulation is only well posed if $P$ is full rank, a condition sometimes known as unisolvency of the interpolation points.

One way to deal with the tail term is the “null space” approach; that is, rather than adding equations that enforce the discrete orthogonality constraint, we find a coordinate system in which the discrete orthogonality constraint is automatic. Specifically, suppose we write the full QR decomposition of $P$ as
$$P = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_1 \\ 0 \end{bmatrix}.$$
Then the constraint $P^T c = 0$ can be rewritten as $c = Q_2 w$, giving us
$$Q_2^T K Q_2 w = Q_2^T y,$$
where $Q_2^T K Q_2$ is generally symmetric and positive definite even for a conditionally positive definite kernel. Once we have computed $w$ (and from there $c$), we can compute $d$ by the relation $R_1 d = Q_1^T (y - Kc)$.
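
A minimal MATLAB sketch of the null-space approach follows; it assumes the kernel matrix K, the tail basis matrix P (with entries $p_j(x_i)$), and the data vector y are already formed.

  % Null-space approach for fitting with a polynomial tail (sketch).
  % K is n-by-n, P is n-by-m with m the tail dimension, y is n-by-1.
  m = size(P, 2);
  [Q, RP] = qr(P);            % full QR decomposition of P
  Q1 = Q(:, 1:m);
  Q2 = Q(:, m+1:end);
  R1 = RP(1:m, :);
  w = (Q2'*K*Q2) \ (Q2'*y);   % reduced SPD system in null-space coordinates
  c = Q2*w;                   % kernel coefficients satisfy P'*c = 0
  d = R1 \ (Q1'*(y - K*c));   % tail coefficients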

An alternate approach is to partition the data points into two groups, the first of which is unisolvent and has the same size as the dimension of the tail. Then we can write any $c$ satisfying $P^T c = 0$ in the form
$$\begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} -W^T \\ I \end{bmatrix} c_2, \quad W = P_2 P_1^{-1}.$$
Substituting this into the constrained optimization of $c^T K c$ gives the reduced problem
$$\tilde{K}_{22} c_2 = (K_{22} - W K_{12} - K_{21} W^T + W K_{11} W^T) c_2 = y_2 - W y_1,$$
from which we can recover the remaining coefficients by solving
$$\begin{aligned}
c_1 &= -W^T c_2, \\
P_1 d &= y_1 - K_{11} c_1 - K_{12} c_2.
\end{aligned}$$
This reduction is formally equivalent to Gaussian elimination and substitution on the system
$$\begin{bmatrix} 0 & P_1^T & P_2^T \\ P_1 & K_{11} & K_{12} \\ P_2 & K_{21} & K_{22} \end{bmatrix} \begin{bmatrix} d \\ c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} 0 \\ y_1 \\ y_2 \end{bmatrix}.$$
As with the left-looking factorization described in the previous section, we can form $W$, $\tilde{K}_{22}$, and the Cholesky factorization of $\tilde{K}_{22}$ incrementally as new data points are added.

Likelihoods and gradients

Recall from the last lecture that the log likelihood function for a (mean zero) Gaussian process is
$$\mathcal{L} = -\frac{1}{2} y^T K^{-1} y - \frac{1}{2} \log \det K - \frac{n}{2} \log(2\pi).$$
Given a Cholesky factorization $K = R^T R$, we can rewrite this as
$$\mathcal{L} = -\frac{1}{2} \|R^{-T} y\|^2 - \log \det R - \frac{n}{2} \log(2\pi),$$
and note that $\log \det R = \sum_i \log r_{ii}$. The cost of evaluating the log likelihood is dominated by the cost of the initial Cholesky factorization.
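
For concreteness, the log likelihood evaluation might look like the sketch below; the kernel matrix K and data vector y are assumed to be in hand.

  % Log likelihood of a mean-zero Gaussian process via Cholesky (sketch).
  n = length(y);
  R = chol(K);                       % K = R'*R
  z = R' \ y;                        % z = R^{-T} y
  c = R \ z;                         % coefficient vector, reused for gradients
  logdetK = 2*sum(log(diag(R)));     % log det K = 2 * sum_i log r_ii
  L = -0.5*(z'*z) - 0.5*logdetK - 0.5*n*log(2*pi);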

In order to optimize kernel hyper-parameters via maximum likelihood, we would also like to compute the gradient (and maybe the Hessian) with respect to the hyper-parameters. Recall from last time that we computed the derivative
$$\delta \mathcal{L} = \frac{1}{2} c^T [\delta K] c - \frac{1}{2} \operatorname{tr}(K^{-1} \delta K)$$
where $Kc = y$. Unfortunately, I know of no tricks to compute $\operatorname{tr}(K^{-1} \delta K)$ exactly for arbitrary $\delta K$ in time less than $O(n^3)$ without exploiting additional structure beyond what we have discussed so far.
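
For a single hyper-parameter with derivative matrix dK (computed from the kernel formula elsewhere; the name is illustrative), a direct $O(n^3)$ evaluation of this gradient formula might look like the following sketch.

  % Gradient of the log likelihood with respect to one hyper-parameter (sketch).
  % dK holds the elementwise derivative of K with respect to that parameter.
  R = chol(K);
  c = R \ (R' \ y);
  traceterm = trace(R \ (R' \ dK));   % tr(K^{-1} dK), an O(n^3) computation
  dL = 0.5*(c'*dK*c) - 0.5*traceterm;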

Simply computing gradients of the log likelihood is sufficient for gradient descent or quasi-Newton methods such as BFGS. However, if the number of hyper-parameters is not too great, we may also decide to compute second derivatives and compute a true Newton iteration. Let $\theta$ be the vector of hyper-parameters and use $[f]_{,j}$ to denote the partial derivative of an expression $f$ with respect to $\theta_j$; then
$$\begin{aligned}
[y^T K^{-1} y]_{,ij} &= [-y^T K^{-1} K_{,i} K^{-1} y]_{,j} \\
&= 2 y^T K^{-1} K_{,i} K^{-1} K_{,j} K^{-1} y - y^T K^{-1} K_{,ij} K^{-1} y \\
&= 2 c^T K_{,i} K^{-1} K_{,j} c - c^T K_{,ij} c, \\
[\log \det K]_{,ij} &= [\operatorname{tr}(K^{-1} K_{,i})]_{,j} \\
&= \operatorname{tr}(K^{-1} K_{,ij} - K^{-1} K_{,i} K^{-1} K_{,j}), \\
\mathcal{L}_{,ij} &= \frac{1}{2} c^T K_{,ij} c - c^T K_{,i} K^{-1} K_{,j} c - \frac{1}{2} \operatorname{tr}(K^{-1} K_{,ij} - K^{-1} K_{,i} K^{-1} K_{,j}).
\end{aligned}$$
Barring tricks that take advantage of further structure in the kernel matrix $K$ or its derivatives, computing these second partials has the same $O(n^3)$ complexity as computing the first derivatives.

Optimizing the nugget

An important special case is optimization with only a single hyper-parameter, the noise variance or “nugget” term. That is, we seek to maximize
$$\mathcal{L}(\eta) = -\frac{1}{2} y^T (K + \eta I)^{-1} y - \frac{1}{2} \log\det(K + \eta I) - \frac{n}{2} \log(2\pi).$$
In this case, rather than doing a Cholesky factorization, we may choose to compute an eigenvalue decomposition $K = Q \Lambda Q^T$ in order to obtain
$$\mathcal{L}(\eta) = -\frac{1}{2} \hat{y}^T (\Lambda + \eta I)^{-1} \hat{y} - \frac{1}{2} \log\det(\Lambda + \eta I) - \frac{n}{2} \log(2\pi)$$
where $\hat{y} = Q^T y$. In this case, the stationary points satisfy a rational equation in $\eta$:
$$\sum_{j=1}^n \left( \frac{\hat{y}_j^2}{2(\lambda_j + \eta)^2} - \frac{1}{2(\lambda_j + \eta)} \right) = 0.$$
We can run Newton on this equation in $O(n)$ time per step.
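
A sketch of the Newton iteration on this stationarity condition appears below; the starting guess, iteration cap, and tolerance are arbitrary choices for illustration.

  % Newton iteration for the optimal nugget eta (sketch).
  [Q, Lam] = eig(K);                       % one-time O(n^3) eigendecomposition
  lambda = diag(Lam);
  yhat = Q'*y;
  eta = 1;                                 % arbitrary starting guess
  for iter = 1:50
    g  = sum( yhat.^2 ./ (2*(lambda+eta).^2) - 1 ./ (2*(lambda+eta)) );
    dg = sum( -yhat.^2 ./ (lambda+eta).^3   + 1 ./ (2*(lambda+eta).^2) );
    eta = eta - g/dg;                      % O(n) per Newton step
    if abs(g) < 1e-10, break; end
  end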

Smooth kernels and low-rank structure

The cost of parameter fitting for a general kernel matrix is $O(n^3)$, and the cost of hyper-parameter fitting by a gradient-based method applied to maximization of the log-likelihood involves an additional $O(n^3)$ cost per hyper-parameter per step. However, for smooth kernels, the kernel matrix (without a nugget term) may be effectively very low rank. This case is of enough interest that we give it special treatment here. More specifically, we assume a kernel matrix of the form $\tilde{K} = K + \eta I$ where most of the eigenvalues of $K$ are much less than $\eta$. In this case, we can solve the fitting problem and the computation of the log-likelihood and gradients in much less than $O(n^3)$ time.

Pivoted Cholesky and kernel approximation

We begin with an outer product factorization of the kernel matrix $K$ without a regularizing nugget. Specifically, we consider the pivoted Cholesky factorization
$$\Pi^T K \Pi = R^T R$$
where the diagonal elements of $R$ appear in descending order and the off-diagonal elements in each row of $R$ are dominated by the diagonal element. The pivoted Cholesky factorization algorithm looks very much like the standard Cholesky algorithm, except before adding row $j$ to the $R$ factor, we first swap rows and columns $j$ and $j_{\max}$ of $K$, where $j_{\max}$ is the index of the largest diagonal entry in the Schur complement matrix.

The approximation error associated with truncating the pivoted Cholesky factorization is just the Schur complement matrix; that is (suppressing the permutation momentarily),
$$\begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix} - \begin{bmatrix} R_{11}^T \\ R_{12}^T \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & K_{22} - R_{12}^T R_{12} \end{bmatrix}.$$
Typically, we would stop the factorization when all the diagonal elements in the Schur complement are less than some tolerance. Because the Schur complement is positive definite, the sum of the diagonal elements in the Schur complement is the same as the nuclear norm of the Schur complement, so controlling the size of the largest diagonal element also controls the approximation error in the nuclear norm (which dominates both the Frobenius norm and the spectral norm).

Part of what makes pivoted Cholesky attractive is that we can run the algorithm without ever forming $K$; we only need the rows of $K$ associated with the pivots selected up to the current step, plus a running computation of the diagonal of the Schur complement. If we have formed $j-1$ steps of the pivoted Cholesky factorization, we can extend the factorization by:

  • Searching the Schur complement diagonal $d$ to find the pivot row $j_{\max}$ and swapping points $j$ and $j_{\max}$ and columns $j$ and $j_{\max}$ of $R$ (and $d$).

  • Forming the top row of the current Schur complement using the formula $s_{jl} = k(x_j, x_l) - \sum_{i=1}^{j-1} r_{ij} r_{il}$.

  • Extending the factorization with $r_{jj} = \sqrt{s_{jj}}$ and $r_{jl} = s_{jl} / r_{jj}$ for $l > j$.

  • Updating the diagonal of the Schur complement via $d_{ll}^{\mathrm{new}} = d_{ll} - r_{jl}^2$.

Running pivoted Cholesky in this way for $r$ steps only requires $O(nr)$ kernel evaluations, $O(nr)$ storage, and $O(nr^2)$ time.
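
The sketch below puts these steps together; as before, the kernel handle kfun, the point array X, and the tolerance tol are illustrative names, and the factor is stored with its columns in the original point order.

  % Truncated pivoted Cholesky with on-the-fly kernel evaluation (sketch).
  n = size(X, 2);
  d = zeros(1, n);
  for i = 1:n, d(i) = kfun(X(:,i), X(:,i)); end   % diagonal of K
  piv = 1:n;                 % permutation: selected pivots come first
  R = zeros(0, n);
  j = 0;
  while j < n && max(d(piv(j+1:n))) > tol
    j = j + 1;
    [~, jj] = max(d(piv(j:n)));
    jmax = j + jj - 1;
    piv([j, jmax]) = piv([jmax, j]);              % swap the pivot into position j
    kj = zeros(1, n);
    for l = j:n, kj(l) = kfun(X(:,piv(j)), X(:,piv(l))); end
    sj = kj(j:n) - R(1:j-1, piv(j))' * R(1:j-1, piv(j:n));   % Schur complement row
    R(j, piv(j:n)) = sj / sqrt(sj(1));            % extend the factor
    d(piv(j:n)) = d(piv(j:n)) - R(j, piv(j:n)).^2;  % update the Schur diagonal
  end
  W = R';                    % low-rank factor with K approximately W*W'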

The effects of truncation error

We now consider the solution of the linear system $(K + \eta I) c = y$ where $\eta I$ is the nugget term. Suppose we have a low-rank factorization (computed by truncated pivoted Cholesky, for example) of the form $K \approx W W^T$. Substituting gives us the approximate system $(W W^T + \eta I) \hat{c} = y$. How close are $\hat{c}$ and $c$? Let $K = W W^T + E$, i.e. $E$ is the part of $K$ discarded by truncating a decomposition of $K$. Then subtracting the approximate linear systems gives
$$(K + \eta I)(c - \hat{c}) = -E \hat{c},$$
and norm bounds yield
$$\|c - \hat{c}\| \leq \|(K + \eta I)^{-1}\| \, \|E\| \, \|\hat{c}\| \leq \frac{\|E\|}{\eta} \|\hat{c}\|,$$
where we have throughout used the matrix two-norm. The matrix two-norm is bounded by the nuclear norm; and if $W$ comes from a pivoted Cholesky factorization, we can compute the nuclear norm of $E$ as the sum of the diagonal elements in the truncated Schur complement term. Hence, if we use pivoted Cholesky with the termination criterion $\|S\|_* \leq \delta \eta$ on the Schur complement $S$, then we obtain the relative error bound
$$\frac{\|c - \hat{c}\|}{\|\hat{c}\|} \leq \delta$$
if all computations are done in exact arithmetic.

Solving with the kernel

To solve a system $(W W^T + \eta I) c = y$ efficiently, let us introduce a new variable $z = W^T c$; putting together the definition of $z$ with the equation for $c$ gives us
$$\begin{bmatrix} \eta I & W \\ W^T & -I \end{bmatrix} \begin{bmatrix} c \\ z \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}.$$
Of course, we can eliminate $z$ to get back to the original equation; but what happens if we instead eliminate $c$? Then we get the reduced system
$$-(I + \eta^{-1} W^T W) z = -\eta^{-1} W^T y.$$
Multiplying through by $-\eta$ gives
$$(W^T W + \eta I) z = W^T y,$$
and back substitution gives the formula
$$c = \frac{1}{\eta} \left( y - W (W^T W + \eta I)^{-1} W^T y \right).$$
This is a special case of the famous Sherman-Morrison-Woodbury formula for the inverse of a matrix (in this case $\eta I$) plus a low-rank modification.

While we can certainly work with the Sherman-Morrison-Woodbury formula directly, we can do a little better if we recognize that the equations in $z$ alone are the regularized normal equations for the optimization problem
$$\text{minimize } \frac{1}{2} \|W z - y\|^2 + \frac{\eta}{2} \|z\|^2.$$
Furthermore, we have that $c = \eta^{-1}(y - Wz)$; that is, $c$ is just the scaled residual of the least squares problem. In terms of an economy QR decomposition
$$\begin{bmatrix} W \\ \sqrt{\eta} I \end{bmatrix} = \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} R,$$
we have $c = \eta^{-1}(y - Q_1 (Q_1^T y))$. This computational approach (using an economy QR decomposition) enjoys a modest stability advantage compared to working directly with the Sherman-Morrison-Woodbury formula.
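
A sketch of the QR-based solve, given a low-rank factor W and nugget eta:

  % Solve (W*W' + eta*I) c = y via a regularized least squares problem (sketch).
  [n, r] = size(W);
  [Q, RR] = qr([W; sqrt(eta)*eye(r)], 0);   % economy QR of the stacked matrix
  Q1 = Q(1:n, :);                           % top block of the orthogonal factor
  z  = RR \ (Q1'*y);                        % z minimizes 0.5*||W*z-y||^2 + 0.5*eta*||z||^2
  c  = (y - Q1*(Q1'*y)) / eta;              % c = (y - W*z) / eta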

Evaluating the kernel approximant

We have so far treated the pivoted Cholesky approach as a way to approximate the kernel matrix. Another interpretation is that we have computed a semidefinite kernel function
$$\hat{k}(x, y) = k_{xI} K_{II}^{-1} k_{Iy}$$
where $I$ refers to the $r$ indices selected in the Cholesky factorization. These two interpretations lead to the same linear system to approximate the coefficient vector $c$, but they suggest two different approximations to the function:
$$s(x) = \sum_{i=1}^n k(x, x_i) c_i \quad \text{or} \quad \hat{s}(x) = \sum_{i=1}^n \hat{k}(x, x_i) c_i.$$
The two approximations agree for points $x$ that are sufficiently close to the training data, but they may differ when $x$ is farther away. The former approximation is the one that may have motivated us initially, but it is worth spending a moment on the second approximation.

Writing $\hat{s}$ in matrix terms, we have
$$\hat{s}(x) = \hat{k}_{xX} (\hat{K}_{XX} + \eta I)^{-1} y = k_{xI} K_{II}^{-1} K_{IX} (K_{XI} K_{II}^{-1} K_{IX} + \eta I)^{-1} y.$$
Equivalently, we can write $\hat{s}(x) = k_{xI} d$ where (omitting some algebra) $d$ is the solution to the regularized least squares problem
$$\text{minimize } \frac{1}{2} \|K_{XI} d - y\|^2 + \frac{\eta}{2} d^T K_{II} d.$$
If $K_{II} = R_{II}^T R_{II}$ is the leading part of the pivoted Cholesky factorization, we can rewrite this minimization as
$$\text{minimize } \frac{1}{2} \left\| \begin{bmatrix} K_{XI} \\ \sqrt{\eta} R_{II} \end{bmatrix} d - \begin{bmatrix} y \\ 0 \end{bmatrix} \right\|^2.$$
Alternately, we can think of the pivoted Cholesky factorization as inducing a feature map associated with the approximate kernel $\hat{k}$:
$$\hat{k}(x, y) = \psi(x)^T \psi(y), \quad \psi(x) = R_{II}^{-T} k_{Ix}.$$
Let $W$ denote the matrix of feature vectors (i.e. row $i$ is $\psi(x_i)^T$); in terms of the pivoted Cholesky factor, this is just $R_{I,:}^T$. Then the approximation system involves solving
$$\text{minimize } \frac{1}{2} \|W z - y\|^2 + \frac{\eta}{2} \|z\|^2,$$
which is the same minimization we saw in the previous subsection; and we can write the $\hat{s}(x)$ function as $\hat{s}(x) = \psi(x)^T z$.
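
Carrying this out in MATLAB might look like the following sketch, which reuses the pivot index set I, the block KXI = K(:, I), and the factor RII = chol(K(I, I)) from the pivoted Cholesky step; kfun, X, and xnew are again illustrative names.

  % Fit and evaluate the induced-kernel approximant shat (sketch).
  r = length(I);
  A = [KXI; sqrt(eta)*RII];
  b = [y; zeros(r, 1)];
  d = A \ b;                  % regularized least squares for the coefficients d
  % Evaluate shat at a new point xnew:
  kxI = zeros(1, r);
  for j = 1:r, kxI(j) = kfun(xnew, X(:, I(j))); end
  shat = kxI * d;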

Predictive variance

Treating $k$ as the covariance kernel of a Gaussian process, the predictive variance at a test point $x$ (conditioned on noisy data at points $X$) is
$$v(x) = k(x, x) - k_{xX} (K_{XX} + \eta I)^{-1} k_{Xx}.$$
As in the previous section, we can again think about using low rank approximation in one of two ways: either we can use the low rank structure with the original kernel for the linear solve $(W W^T + \eta I)^{-1} k_{Xx}$, or we can think of the low rank factorization (via pivoted Cholesky) as defining the new kernel $\hat{k}$, for which the predictive variance is
$$\hat{v}(x) = k_{xI} (K_{II} + \eta^{-1} K_{IX} K_{XI})^{-1} k_{Ix}.$$
Using the Cholesky factorization $R_{II}^T R_{II} = K_{II}$, we can rewrite
$$(K_{II} + \eta^{-1} K_{IX} K_{XI})^{-1} = \eta \tilde{R}^{-1} \tilde{R}^{-T}$$
where $\tilde{R}$ comes from the economy QR decomposition
$$\begin{bmatrix} K_{XI} \\ \sqrt{\eta} R_{II} \end{bmatrix} = \begin{bmatrix} \tilde{Q}_1 \\ \tilde{Q}_2 \end{bmatrix} \tilde{R},$$
which we may have already formed in the process of computing the coefficients in $\hat{s}(x) = k_{xI} d$.
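
In code, the induced-kernel predictive variance costs one triangular solve per test point; the sketch below reuses the triangular factor (called Rt here) from the economy QR above, with kfun, X, I, and xnew as before.

  % Predictive variance under the induced kernel khat at a test point (sketch).
  r = length(I);
  kxI = zeros(r, 1);
  for j = 1:r, kxI(j) = kfun(xnew, X(:, I(j))); end
  u = Rt' \ kxI;              % u = Rt^{-T} k_{Ix}
  vhat = eta * (u'*u);        % vhat(x) = eta * || Rt^{-T} k_{Ix} ||^2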

Likelihoods and gradients

Recall that the log likelihood function is
$$\mathcal{L} = -\frac{1}{2} y^T c - \frac{1}{2} \log\det \tilde{K} - \frac{n}{2} \log(2\pi)$$
where $\tilde{K} = K + \eta I$ is the regularized kernel matrix and $\tilde{K} c = y$. The first and last terms need no special treatment in the case when the kernel matrix has special structure, so long as we are able to rapidly compute the coefficient vector $c$. The only non-obvious computation is the log determinant.

Consider the regularized kernel matrix $\tilde{K} = W W^T + \eta I$ where $W \in \mathbb{R}^{n \times r}$, and suppose we write a full SVD for $W$:
$$W = U \Sigma V^T = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Sigma_1 \\ 0 \end{bmatrix} V^T.$$
Then
$$U^T \tilde{K} U = \begin{bmatrix} \Sigma_1^2 + \eta I & 0 \\ 0 & \eta I \end{bmatrix}.$$
Hence,
$$\log\det(\tilde{K}) = \log\det(U^T \tilde{K} U) = \log\det(\Sigma_1^2 + \eta I) + (n - r) \log(\eta).$$
If we use the fact that $\Sigma_1^2 + \eta I$ is the eigenvalue matrix for $W^T W + \eta I = R^T R$, with $R$ computed by the economy QR we used for kernel solves, we have
$$\log\det(\tilde{K}) = 2 \sum_{j=1}^r \log(r_{jj}) + (n - r) \log \eta.$$
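
Putting the pieces together, the log likelihood in the low-rank-plus-nugget setting might be computed as in this sketch, which reuses the triangular factor RR and the coefficient vector c from the earlier QR-based solve.

  % Log likelihood with Ktilde = W*W' + eta*I (sketch).
  [n, r] = size(W);
  logdetKt = 2*sum(log(abs(diag(RR)))) + (n - r)*log(eta);
  L = -0.5*(y'*c) - 0.5*logdetKt - 0.5*n*log(2*pi);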

What of the derivative of the log likelihood? That is, we would like a fast method to compute
$$\delta \mathcal{L} = \frac{1}{2} c^T (\delta \tilde{K}) c - \frac{1}{2} \operatorname{tr}(\tilde{K}^{-1} \delta \tilde{K}).$$
We can simplify the latter term by observing that
$$\tilde{K}^{-1} = \frac{1}{\eta} (I - Q_1 Q_1^T), \quad \text{where } \begin{bmatrix} W \\ \sqrt{\eta} I \end{bmatrix} = \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} R;$$
substitution and using properties of traces gives us
$$\operatorname{tr}(\tilde{K}^{-1} \delta \tilde{K}) = \frac{1}{\eta} \left( \operatorname{tr}(\delta \tilde{K}) - \operatorname{tr}(Q_1^T \delta \tilde{K} Q_1) \right).$$
If we work with the original kernel, it is difficult to do better than this.

Now suppose we differentiate the log-likelihood for the approximate kernel induced by the points $I$, i.e.
$$\tilde{K}_{XX} = K_{XI} K_{II}^{-1} K_{IX} + \eta I.$$
Differentiating this expression gives us
$$\delta \tilde{K} = (\delta K_{:,I}) F^T + F (\delta K_{:,I})^T - F (\delta K_{II}) F^T + (\delta \eta) I$$
where $F = K_{:,I} K_{II}^{-1}$. Therefore
$$\begin{aligned}
\frac{1}{2} c^T (\delta \tilde{K}) c &= ((\delta K_{:,I})^T c)^T (F^T c) - \frac{1}{2} (F^T c)^T (\delta K_{II}) (F^T c) + \frac{1}{2} \|c\|^2 \, \delta\eta, \\
\frac{1}{2} \operatorname{tr}(\tilde{K}^{-1} \delta \tilde{K}) &= \langle \tilde{F}, \delta K_{:,I} \rangle_F - \frac{1}{2} \langle \tilde{F}, F \, \delta K_{II} \rangle_F + \frac{\delta\eta}{2\eta} \left( n - \|Q_1\|_F^2 \right), \qquad \tilde{F} \equiv \frac{1}{\eta} (I - Q_1 Q_1^T) F,
\end{aligned}$$
and so we can compute derivatives of the log likelihood in $O(nr)$ space and $O(nr^2)$ time in this approximation.

Beyond low rank

For a number of kernels, the kernel matrix is not all that low rank: the singular values decay, but not so quickly that we would be happy with $O(nr^2)$ time algorithms. However, in low-dimensional spaces, one may still have low-rank submatrices, and there are fast factorization techniques based on these rank-structured matrices as well. There are also a variety of iterative solvers that take advantage of fast matrix-vector multiplies; for example, we can use the method of conjugate gradients (CG) to solve linear systems using only matrix-vector products with a kernel matrix. For these types of solvers, the low rank approximations using pivoted Cholesky as described above still give us a good preconditioner that can accelerate convergence.

Footnotes

  1. The scalar version of the algorithm works one column at a time. Various blocked versions of the factorization algorithm update in blocks of several columns at a time, with a small (scalar) Cholesky factorization for the diagonal blocks. Block algorithms have the same complexity as the corresponding scalar algorithms, but achieve better performance on modern hardware because they make better use of the cache.

  2. The term “nugget” comes from the use of Gaussian process regression in geostatistical applications. When trying to use Gaussian processes to predict the location of precious mineral deposits based on the mineral content in bore measurements, it is necessary to account for the possibility that a measurement may accidentally hit a mineral nugget.