Kernels
Kernels are key to many data analysis methods. A kernel is a function of two arguments, and often we think of $k(x, y)$ as a measure of how similar or related the objects $x$ and $y$ are to each other. But there are several stories we tell that describe different ways to think about kernels. We discuss four such approaches in this lecture:
Kernels come from composing linear methods (e.g. linear regression) with feature maps, nonlinear functions that map from the original function domain (say $\mathbb{R}^d$) into a space of much higher (possibly infinite) dimension. In this story, the kernel represents an inner product between feature vectors.
Kernels define basis functions for approximation spaces. Unlike schemes that approximate from a fixed function space (e.g. low-degree polynomials), kernel methods are adaptive, allowing more variation in regions of the function domain where we have more data.
Kernels are associated with a quadratic form on a space of functions, which we think of as “energy.” Sometimes, as in the case of cubic splines and thin plate splines, this energy has a physical interpretation, and kernel methods for regression minimize this quadratic form subject to data constraints. Thinking of kernel methods in this way gives us one framework for understanding the error in kernel methods.
Kernels are also used to represent the covariance of random processes in general, and Gaussian processes in particular. In this view of kernel-based regression, we start with a prior distribution over functions, and then use Bayes rule to get a posterior distribution based on measurements of the function.
In each case, using a kernel leads us in the end to a linear algebra problem: all the nonlinearity in a kernel method is summarized in the kernel construction.
These notes are strongly influenced by the Acta Numerica article “Kernel techniques: From machine learning to meshless methods” by Robert Schaback and Holger Wendland. This survey article runs to 98 pages, but if you have the time and inclination to read more deeply, I strongly recommend it as a starting point!
Some common examples
The choice of an appropriate kernel is the key to the success or failure of a kernel method. But we often lack the insight to make an inspired choice of kernels, and so fall back on a handful of standard families of kernels. Let us mention a few that are commonly used in function approximation on $\mathbb{R}^d$.
Squared exponential
The squared exponential kernel has the general form
$$k_{\mathrm{SE}}(x, y) = s^2 \exp\left(-\frac{1}{2} (x-y)^T P^{-1} (x-y)\right)$$
for positive $s$ and positive definite $P$. Most often we choose $P = \ell^2 I$, i.e.
$$k_{\mathrm{SE}}(x, y) = s^2 \exp\left(-\frac{\|x-y\|^2}{2\ell^2}\right).$$
The scale factor $s$ and the length scale $\ell$ are examples of kernel hyper-parameters. In the case where we use a single length scale parameter $\ell$ (rather than a more general $P$), the squared exponential kernel is an example of a radial basis function, i.e. a kernel that depends only on the distance $\|x-y\|$ between the two arguments. In some corners of machine learning, this kernel is referred to as “the” radial basis function, but we will avoid this usage in deference to the many other useful radial basis functions in the world.
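As a small illustration, here is a minimal sketch in Python with NumPy of the isotropic form of this kernel; the function name `sq_exp_kernel` and the default hyper-parameter values are illustrative choices, not fixed by the notes.

```python
import numpy as np

def sq_exp_kernel(x, y, s=1.0, ell=1.0):
    """Isotropic squared exponential: s^2 exp(-||x - y||^2 / (2 ell^2))."""
    r2 = np.sum((np.asarray(x, dtype=float) - np.asarray(y, dtype=float))**2)
    return s**2 * np.exp(-r2 / (2 * ell**2))

# Example evaluation with scale factor 2 and length scale 0.25.
print(sq_exp_kernel([0.0, 0.0], [0.5, 0.0], s=2.0, ell=0.25))
```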
Exponential
The (absolute) exponential kernel has the form
$$k_{\exp}(x, y) = s^2 \exp\left(-\frac{\|x-y\|}{\ell}\right),$$
where $s$ and $\ell$ are again the scale factor and length scale hyper-parameters. As with the squared exponential kernel, there is a more general form in which $\|x-y\|/\ell$ is replaced by $\sqrt{(x-y)^T P^{-1} (x-y)}$ for some positive definite matrix $P$. Where the squared exponential kernel is smooth, the exponential kernel is only continuous — it is not differentiable at $x = y$. This has important implications in modeling, as the function approximations produced by kernel methods inherit the smoothness of the kernel. Hence, a smooth kernel (like the squared exponential) is good for fitting smooth functions, while a non-differentiable kernel (like the absolute exponential) may be a better choice for fitting non-differentiable functions.
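A corresponding sketch for the absolute exponential kernel, with the same caveat that the name and defaults are ours:

```python
import numpy as np

def exp_kernel(x, y, s=1.0, ell=1.0):
    """Absolute exponential: s^2 exp(-||x - y|| / ell); continuous but not smooth."""
    r = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return s**2 * np.exp(-r / ell)
```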
Cubic splines and thin plate splines
At the end of the last lecture, we saw that we can write cubic splines for 1D function approximation in terms of the kernel $k(x, y) = |x-y|^3$. For functions on $\mathbb{R}^2$, the analogous choice is the thin plate spline with kernel $k(x, y) = \|x-y\|^2 \log \|x-y\|$ for $x \neq y$ (and zero when $x = y$). Unlike the squared exponential and exponential kernels, we express these kernels in terms of the distance $\|x-y\|$ and not a scaled distance $\|x-y\|/\ell$, as approximation methods using cubic and thin plate splines are scale invariant: scaling the coordinate system does not change the predictions. Also unlike the squared exponential and exponential kernels, these kernels grow with increasing values of $\|x-y\|$, and so our intuition that kernels measure “how similar things are” founders a bit on these examples. Other ways of explaining kernels are more appropriate for describing why these choices are useful.
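For completeness, here are sketches of these two kernels as well; the names are illustrative, and the thin plate value at coincident points is taken to be zero by convention.

```python
import numpy as np

def cubic_kernel(x, y):
    """Cubic spline kernel |x - y|^3 for 1D approximation."""
    return abs(x - y)**3

def thin_plate_kernel(x, y):
    """Thin plate spline kernel r^2 log(r) on R^2, with value 0 at r = 0."""
    r = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return 0.0 if r == 0.0 else r**2 * np.log(r)
```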
Definitions and notation
A kernel on a set $\Omega$ is a function $k : \Omega \times \Omega \rightarrow \mathbb{R}$. We will restrict our attention to symmetric kernels, for which $k(x, y) = k(y, x)$. Some kernels depend on additional hyper-parameters, e.g. the scale factor and length scale described for the squared exponential kernel in the previous section, or the smoothness parameter of the Matérn family.
Often the set $\Omega$ has special structure, and we want kernels to have invariants that are natural to that structure. For $\Omega = \mathbb{R}^d$, we like to think about invariance under translation and rotation. We say $k$ is stationary if it is invariant under translation, i.e. $k(x+z, y+z) = k(x, y)$ for any $z$. If $k$ is invariant under rotation, i.e. it depends only on $\|x\|$, $\|y\|$, and the distance between $x$ and $y$, we say it is isotropic. When $k$ is both stationary and isotropic, we often identify it with a radial basis function $\phi$, i.e. $k(x, y) = \phi(\|x-y\|)$.
For an ordered set of points $X = \{x_1, \ldots, x_n\} \subset \Omega$, we write the matrix of pairwise evaluations as
$$(K_{XX})_{ij} = k(x_i, x_j).$$
We say $K_{XX}$ is the kernel matrix for the set $X$ and kernel $k$. The kernel function is positive definite if $K_{XX}$ is positive definite whenever $X$ consists of distinct points. The squared exponential and absolute exponential kernels are positive definite; the cubic spline and thin plate spline kernels are not.
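The following sketch (with a hypothetical helper `kernel_matrix`, and arbitrarily chosen points and length scale) illustrates this claim numerically: the squared exponential kernel matrix admits a Cholesky factorization, while the cubic spline kernel matrix has eigenvalues of both signs.

```python
import numpy as np

def kernel_matrix(kernel, X):
    """Form the kernel matrix with entries kernel(X[i], X[j])."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

X = np.linspace(0, 1, 5)

# Squared exponential (ell = 0.25): Cholesky succeeds, so K_se is positive definite.
K_se = kernel_matrix(lambda x, y: np.exp(-8 * (x - y)**2), X)
np.linalg.cholesky(K_se)

# Cubic spline kernel: zero diagonal, so the matrix has eigenvalues of both signs.
K_cub = kernel_matrix(lambda x, y: abs(x - y)**3, X)
print(np.linalg.eigvalsh(K_cub))
```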
A kernel is conditionally positive definite relative to a space $\mathcal{U}$ of functions from $\Omega$ to $\mathbb{R}$ if
$$\sum_{i,j} c_i c_j k(x_i, x_j) > 0$$
for any distinct points $x_1, \ldots, x_n \in \Omega$ and any nonzero coefficient vector $c$ satisfying $\sum_i c_i u(x_i) = 0$ for all $u \in \mathcal{U}$. A kernel on $\mathbb{R}^d$ is conditionally positive definite of order $m$ if it is conditionally positive definite relative to the space of polynomials of total degree at most $m-1$. The cubic spline and thin plate spline kernels are both conditionally positive definite of order 2.
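This condition is easy to check numerically for the cubic kernel in 1D: restrict the quadratic form to coefficient vectors orthogonal to samples of the polynomials $1$ and $x$. A minimal sketch, with arbitrarily chosen test points:

```python
import numpy as np

# Check (order 2) conditional positive definiteness of the cubic kernel in 1D:
# c^T K c should be positive for nonzero c with P^T c = 0, where the columns
# of P sample the polynomials 1 and x at the nodes.
X = np.linspace(0, 1, 6)
K = np.abs(X[:, None] - X[None, :])**3
P = np.column_stack([np.ones_like(X), X])

Q, _ = np.linalg.qr(P, mode='complete')
Z = Q[:, P.shape[1]:]                   # orthonormal basis for the null space of P^T

print(np.linalg.eigvalsh(K))            # K itself is indefinite (mixed signs)
print(np.linalg.eigvalsh(Z.T @ K @ Z))  # restricted form is positive definite
```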
Feature maps
Suppose we want to approximate $f : \mathbb{R}^d \rightarrow \mathbb{R}$. A very simple scheme is to approximate $f$ by a linear function, $\hat{f}(x) = c^T x$, where the coefficients $c$ are determined from samples by a least squares fitting method. That is, we solve
$$\min_c \|X^T c - y\|^2,$$
where $X$ is a matrix whose columns are the data points $x_1, \ldots, x_n$ and $y$ is a vector of function values $y_i = f(x_i)$.
Unfortunately, the space of linear functions is rather limited, so we may want a richer class of models. The next step up in complexity would be to look at a vector space of functions from $\mathbb{R}^d$ to $\mathbb{R}$, with basis functions $\psi_1, \ldots, \psi_N$ that we collect into a single vector-valued function $\psi : \mathbb{R}^d \rightarrow \mathbb{R}^N$. For example, for one-dimensional function approximation on $[-1, 1]$, we might choose a polynomial space of approximating functions and use the Chebyshev basis functions $\psi_j(x) = T_{j-1}(x)$. I think of these as basis vectors for a space of candidate approximating functions, but in the language of machine learning, we say that the functions $\psi_j$ are features and the vector-valued function $\psi$ is a feature map. We write our approximation as
$$\hat{f}(x) = \psi(x)^T d,$$
and if we have function values $y_i = f(x_i)$ at points $x_1, \ldots, x_n$, we can again fit the coefficients by the least squares problem
$$\min_d \|\Psi d - y\|^2,$$
where $\Psi_{ij} = \psi_j(x_i)$. All nonlinearity in the scheme comes from the nonlinear functions in $\psi$; afterward, we have a standard linear problem.
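As a concrete sketch of this setup, here is a least squares fit with a Chebyshev feature map, built with `numpy.polynomial.chebyshev.chebvander`; the target function, sample points, and degree are arbitrary illustration choices.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(0)
f = lambda x: np.exp(x) * np.sin(3 * x)     # illustrative target function
xs = rng.uniform(-1, 1, size=40)            # sample points x_1, ..., x_n
ys = f(xs)

p = 10                                      # feature map psi(x) = (T_0(x), ..., T_p(x))
Psi = C.chebvander(xs, p)                   # 40-by-(p+1) matrix with rows psi(x_i)^T
d, *_ = np.linalg.lstsq(Psi, ys, rcond=None)

xt = np.linspace(-1, 1, 5)
print(C.chebvander(xt, p) @ d - f(xt))      # small residuals at test points
```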
What if the dimension of our approximating space is much larger than the amount of data we have — or even if it is infinite-dimensional? Or, equivalently: what if we have more features than training points? In this case, many different approximations in the space all fit the data equally well, and we need a rule to choose from among them. From a numerical linear algebra perspective, a natural choice is the minimal norm solution; that is, we approximate $\hat{f}(x) = \psi(x)^T d$ as before, but choose the coefficients to minimize $\|d\|^2$ subject to $\Psi d = y$. The solution to this constrained problem is
$$d = \Psi^T (\Psi \Psi^T)^{-1} y,$$
and therefore
$$\hat{f}(x) = \psi(x)^T \Psi^T (\Psi \Psi^T)^{-1} y.$$
We now observe that all the entries of the vector $\psi(x)^T \Psi^T$ and the Gram matrix $\Psi \Psi^T$ can be written in terms of inner products between feature vectors. Let $k(x, y) = \psi(x)^T \psi(y)$; then
$$\hat{f}(x) = k_{xX} K_{XX}^{-1} y,$$
where $k_{xX}$ denotes the row vector with entries $k(x, x_j)$ and $(K_{XX})_{ij} = k(x_i, x_j)$. The function $k$ is the kernel function, and this way of writing the approximation only in terms of these inner products, without writing down a feature map, is sometimes called the “kernel trick.”
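A short sketch of this equivalence: with more Chebyshev features than data points, the minimal norm coefficients and the kernel-form prediction agree to rounding error (the data and degree here are illustrative choices).

```python
import numpy as np
from numpy.polynomial import chebyshev as C

xs = np.linspace(-1, 1, 8)                   # 8 data points
ys = np.sin(4 * xs)
p = 30                                       # 31 features, more than the 8 points

Psi = C.chebvander(xs, p)                    # feature matrix, rows psi(x_i)^T
K = Psi @ Psi.T                              # Gram matrix K_XX of feature inner products
d = Psi.T @ np.linalg.solve(K, ys)           # minimal norm coefficients

xt = np.linspace(-1, 1, 4)
Psit = C.chebvander(xt, p)
pred_features = Psit @ d                              # psi(x)^T d
pred_kernel = (Psit @ Psi.T) @ np.linalg.solve(K, ys) # k_xX K_XX^{-1} y
print(np.max(np.abs(pred_features - pred_kernel)))    # agreement to rounding error
```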
So far we have shown how to get from feature maps to kernels; what if we want to go in the other direction? As it happens, if we are given a positive (semi)definite kernel function, we can always construct a (possibly infinite) feature map associated with the kernel. To do this, we define an integral operator $\mathcal{K}$ on an appropriate space of distributions,
$$[\mathcal{K} g](x) = \int_\Omega k(x, y) g(y) \, dy.$$
This encodes all the information about the kernel — formally, for example, if we let $g = \delta_{y'}$ be a Dirac delta at $y'$, then $[\mathcal{K} \delta_{y'}](x) = k(x, y')$. Mercer’s theorem tells us there is an eigenvalue decomposition with eigenpairs $(\lambda_i, v_i)$, where $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$, so that
$$k(x, y) = \sum_{i=1}^\infty \lambda_i v_i(x) v_i(y),$$
and the features $\psi_i(x) = \sqrt{\lambda_i} \, v_i(x)$ give us the kernel.
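A discrete analogue of this construction can be sketched numerically: eigendecompose a kernel matrix on a grid and check that the resulting features reproduce the kernel values at the grid points (the grid and length scale are arbitrary choices).

```python
import numpy as np

xs = np.linspace(0, 1, 50)
K = np.exp(-(xs[:, None] - xs[None, :])**2 / (2 * 0.1**2))  # SE kernel matrix, ell = 0.1

lam, V = np.linalg.eigh(K)              # eigenpairs of the symmetric kernel matrix
lam = np.clip(lam, 0.0, None)           # clip tiny negative rounding errors
Psi = V * np.sqrt(lam)                  # column i holds the feature sqrt(lambda_i) v_i

print(np.max(np.abs(Psi @ Psi.T - K)))  # small: feature inner products give the kernel
```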
Kernels as basis functions
As we have seen in the previous section, a standard idea for function approximation is to choose a function from some fixed approximation space. For example, if we define the function $\phi(x) = \exp(-x^2/(2\ell^2))$, then a reasonable space for functions on $[0, 1]$ might consist of approximants that are combinations of these radial basis function “bumps” centered at each of the nodes $z_j = j/m$ for $0 \leq j \leq m$:
$$\hat{f}(x) = \sum_{j=0}^m c_j \phi(x - z_j).$$
If we have data $y_i = f(x_i)$ on a set of points $X = \{x_1, \ldots, x_n\}$, and $Z$ denotes the points $\{z_0, \ldots, z_m\}$, then the interpolation equations take the form
$$\sum_{j=0}^m c_j \phi(x_i - z_j) = y_i, \quad i = 1, \ldots, n,$$
which we write in matrix form as $\Phi c = y$ where $\Phi_{ij} = \phi(x_i - z_j)$. With more data points than centers, we use least squares.
What if the data points are not uniformly distributed on $[0, 1]$? For example, what if we have more data points close to 1 than we have close to 0? The choice of a fixed approximation space limits us: we have no way of saying that with more data close to 1, we should allow the approximation more wiggle room to fit the function there. We can do this by putting more basis functions that are “centered” in the area where the data points are dense, e.g. by making the data points and the centers coincide:
$$\hat{f}(x) = \sum_{j=1}^n c_j k(x, x_j).$$
Then the interpolation conditions are
$$\sum_{j=1}^n c_j k(x_i, x_j) = y_i, \quad i = 1, \ldots, n,$$
or $K_{XX} c = y$. By adapting the space to allow more flexibility close to where the data is, we hope to get better approximations, but there is another advantage as well: if the kernel is positive definite, then the matrix $K_{XX}$ is symmetric and positive definite, and we need not worry about singularity of the linear system.
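A sketch of this adaptive construction, with squared exponential bumps centered at data points clustered toward 1; the data, kernel, and length scale are illustrative choices.

```python
import numpy as np

def k_se(x, y, ell=0.1):
    return np.exp(-(x - y)**2 / (2 * ell**2))

rng = np.random.default_rng(2)
xs = np.sort(rng.beta(2, 1, size=8))      # data points clustered toward 1
ys = np.cos(6 * xs)

K = k_se(xs[:, None], xs[None, :])        # symmetric positive definite K_XX
c = np.linalg.solve(K, ys)                # coefficients of sum_j c_j k(x, x_j)

xt = np.linspace(0, 1, 7)
pred = k_se(xt[:, None], xs[None, :]) @ c # evaluate \hat f at test points
print(np.max(np.abs(K @ c - ys)))         # residual of the interpolation conditions
```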
Of course, nothing says we are not allowed to use an approximation space that is adapted to the sample points together with a fixed approximation space. We saw this already in our discussion of cubic splines, where we had a piece associated with the cubic radial basis function together with a linear “tail”:
$$\hat{f}(x) = \sum_{j=1}^n c_j |x - x_j|^3 + d_1 + d_2 x.$$
In order to uniquely solve the linear system, we need some additional constraints; for cubic splines, we use the discrete orthogonality conditions
$$\sum_{j=1}^n c_j = 0 \quad \mbox{and} \quad \sum_{j=1}^n c_j x_j = 0.$$
More generally, if a kernel is conditionally positive definite relative to a space of functions spanned by the basis $p_1, \ldots, p_m$, then the coefficients of the approximation
$$\hat{f}(x) = \sum_{j=1}^n c_j k(x, x_j) + \sum_{l=1}^m d_l p_l(x)$$
are determined by the interpolation conditions $\hat{f}(x_i) = y_i$ and the discrete orthogonality conditions $\sum_{j=1}^n c_j p_l(x_j) = 0$ for $l = 1, \ldots, m$. We can write these conditions in matrix form as
$$\begin{bmatrix} K_{XX} & P \\ P^T & 0 \end{bmatrix} \begin{bmatrix} c \\ d \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix},$$
where $P_{il} = p_l(x_i)$. So long as $P$ is full rank (we call this well-posedness of the points $X$ for interpolation in the space spanned by the $p_l$), this linear system is invertible. Most often, we include polynomial tails to guarantee solvability of the interpolation problem with conditionally positive definite kernels, but nothing prevents us from incorporating such terms in other approximations as well — and, indeed, it may do our approximations a great deal of good.
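Here is a sketch of the resulting saddle point system for the cubic kernel with a linear tail; the nodes and data are chosen arbitrarily for illustration.

```python
import numpy as np

xs = np.linspace(0, 1, 9)                            # interpolation nodes
ys = np.sin(2 * np.pi * xs)

K = np.abs(xs[:, None] - xs[None, :])**3             # cubic kernel matrix
P = np.column_stack([np.ones_like(xs), xs])          # polynomial tail basis: 1, x

n, m = len(xs), P.shape[1]
A = np.block([[K, P], [P.T, np.zeros((m, m))]])      # saddle point system
cd = np.linalg.solve(A, np.concatenate([ys, np.zeros(m)]))
c, d = cd[:n], cd[n:]

def fhat(x):
    """Evaluate sum_j c_j |x - x_j|^3 + (linear tail) at a scalar x."""
    return np.abs(x - xs)**3 @ c + d[0] + d[1] * x

print(max(abs(fhat(x) - y) for x, y in zip(xs, ys))) # interpolates the data at the nodes
```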
Gaussian processes
Our final story comes from Gaussian processes (GPs). Informally, just as the ordinary Gaussian distribution is over numbers and multivariate Gaussian distributions are over vectors, a Gaussian process is a distribution over functions. More formally, a Gaussian process on a set $\Omega$ with mean field $\mu : \Omega \rightarrow \mathbb{R}$ and covariance kernel $k : \Omega \times \Omega \rightarrow \mathbb{R}$ is a collection of Gaussian random variables indexed by $\Omega$, any finite subset of which obeys a multivariate Gaussian distribution; that is, if $f$ is a draw from a Gaussian process, then for any $X = \{x_1, \ldots, x_n\} \subset \Omega$,
$$f_X \sim N(\mu_X, K_{XX}),$$
where $f_X$ and $\mu_X$ denote the vectors with entries $f(x_i)$ and $\mu(x_i)$. In Bayesian inference using GPs, we start with a prior GP from which we assume $f$ is drawn, then compute a posterior GP conditioned on data.
Because GPs are defined in terms of the behavior at finite subsets of points, we can really focus on the multivariate normal case. Suppose a multivariate Gaussian random variable $u$ is partitioned into $u_1$ and $u_2$. For simplicity, we assume $u$ has mean zero. Then the distribution is
$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \sim N\left(0, \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix}\right),$$
i.e. we have the probability density
$$p(u) \propto \exp\left(-\frac{1}{2} u^T C u\right),$$
where $C = K^{-1}$ is the precision matrix. Now, we rewrite the quadratic form $u^T C u$ in terms of the components $u_1$ and $u_2$:
$$u^T C u = u_1^T C_{11} u_1 + 2 u_1^T C_{12} u_2 + u_2^T C_{22} u_2.$$
Now we rewrite the $(1,2)$ block of the equation $CK = I$ as $C_{11} K_{12} + C_{12} K_{22} = 0$, then rearrange to $-C_{11}^{-1} C_{12} = K_{12} K_{22}^{-1}$, to get the more convenient formula
$$u^T C u = (u_1 - K_{12} K_{22}^{-1} u_2)^T C_{11} (u_1 - K_{12} K_{22}^{-1} u_2) + u_2^T (C_{22} - C_{21} C_{11}^{-1} C_{12}) u_2.$$
The same approach gives us the Schur complement relation $C_{11}^{-1} = K_{11} - K_{12} K_{22}^{-1} K_{21}$. Plugging this formulation of the quadratic into the joint density and dividing out by the marginal for $u_2$ gives the conditional density
$$p(u_1 | u_2) \propto \exp\left(-\frac{1}{2} (u_1 - K_{12} K_{22}^{-1} u_2)^T C_{11} (u_1 - K_{12} K_{22}^{-1} u_2)\right).$$
Thus, the conditional distribution for $u_1$ given $u_2$ is again Gaussian:
$$u_1 | u_2 \sim N\left(K_{12} K_{22}^{-1} u_2, \; K_{11} - K_{12} K_{22}^{-1} K_{21}\right).$$
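These block identities are easy to sanity-check numerically; the following sketch verifies them for a random symmetric positive definite covariance matrix (the sizes and random seed are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 6))
K = A @ A.T + 6 * np.eye(6)               # a random symmetric positive definite covariance
K11, K12 = K[:2, :2], K[:2, 2:]
K21, K22 = K[2:, :2], K[2:, 2:]

C = np.linalg.inv(K)                      # precision matrix
C11, C12 = C[:2, :2], C[:2, 2:]

# -C11^{-1} C12 = K12 K22^{-1}  (used for the conditional mean)
print(np.max(np.abs(-np.linalg.solve(C11, C12) - K12 @ np.linalg.inv(K22))))

# C11^{-1} = K11 - K12 K22^{-1} K21  (the Schur complement / conditional covariance)
print(np.max(np.abs(np.linalg.inv(C11) - (K11 - K12 @ np.linalg.solve(K22, K21)))))
```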
Applying the same logic to Gaussian processes, we find that if $f$ is drawn from a GP with mean field $\mu$ and covariance kernel $k$, then conditioning on observations $f_X = y$ at points $X$ gives a new GP with mean
$$\mu'(x) = \mu(x) + k_{xX} K_{XX}^{-1} (y - \mu_X)$$
and covariance kernel
$$k'(x, x') = k(x, x') - k_{xX} K_{XX}^{-1} k_{Xx'}.$$
The conditional mean field is exactly the same as the kernel prediction that we have seen derived in other ways, and we might recognize the predictive variance at $x$ conditioned on data at $X$ as
$$k'(x, x) = k(x, x) - k_{xX} K_{XX}^{-1} k_{Xx} = P_X(x)^2,$$
where $P_X$ is the power function for $X$ that we saw in the last section.
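A sketch of the resulting computation for a zero-mean prior: the posterior mean and predictive variance at test points, computed via a Cholesky factorization of $K_{XX}$ with a small diagonal "jitter" added for numerical stability. The data, kernel parameters, and jitter size are illustrative choices.

```python
import numpy as np

def k_se(X, Y, s=1.0, ell=0.1):
    """Squared exponential kernel matrix between 1D point sets X and Y."""
    return s**2 * np.exp(-(X[:, None] - Y[None, :])**2 / (2 * ell**2))

X = np.linspace(0, 1, 8)                     # observation points
y = np.sin(2 * np.pi * X)                    # observed values (zero prior mean assumed)

KXX = k_se(X, X) + 1e-10 * np.eye(len(X))    # jitter for numerical stability
L = np.linalg.cholesky(KXX)

Xt = np.linspace(0, 1, 5)                    # test points
KtX = k_se(Xt, X)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean = KtX @ alpha                                     # k_xX K_XX^{-1} y
W = np.linalg.solve(L, KtX.T)
var = np.diag(k_se(Xt, Xt)) - np.sum(W * W, axis=0)    # k(x,x) - k_xX K_XX^{-1} k_Xx
print(mean)
print(var)
```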
Deterministic or stochastic?
In the previous two sections, we have developed two ways to think about the error analysis of kernel methods to interpolate a function $f$:
Optimal approximation: suppose $\|f\|_{\mathcal{H}} \leq \rho$ for the native space norm $\|\cdot\|_{\mathcal{H}}$, and let $\hat{f}$ be the kernel interpolant of the data. That is, $\hat{f}$ is the center point of a region of functions that are consistent with both the data constraints and the norm constraint. Because it is at the center of this set, $\hat{f}$ minimizes the worst possible error (in the native space norm) over all possible $f$ that are consistent with what we know. To get pointwise error estimates, we look at bounds on $|f(x) - \hat{f}(x)|$ for all possible $f$ that satisfy $f_X = y$ and $\|f\|_{\mathcal{H}} \leq \rho$.
Bayesian inference: suppose $f$ is drawn from a Gaussian process with some known covariance kernel. Conditioned on the data $f_X = y$, we have a new Gaussian process with mean field $\mu'$ and a conditional kernel $k'$. To get pointwise error estimates, we look at the predictive distribution at a new test point $x$ (itself a Gaussian distribution), including the predictive variance $k'(x, x)$.
The deterministic approach gives us the best worst-case error given what we know about the function; the Bayesian approach gives us an expected value. Both give the same predictions, and both use the same quantities in computing an error result, though with different interpretations. Both approaches also use information that might not be easy to access (a bound on a native space norm of , or an appropriate prior distribution).
Why should we not just pick one approach or the other? Apart from the question of modeling assumptions, the two approaches yield different predictions if the measurements of $f$ are more complex, e.g. if we have information such as an inequality bound or a nonlinear relationship between point values of $f$. However, this is beyond the scope of the current discussion.