Lecture 13: Kernels

Linear classifiers are great, but what if there exists no linear decision boundary? As it turns out, there is an elegant way to incorporate non-linearities into most linear classifiers.

Handcrafted Feature Expansion

We can make linear classifiers non-linear by applying a basis function (feature transformation) to the input feature vectors. Formally, for a data vector $\mathbf{x} \in \mathbb{R}^d$, we apply the transformation $\mathbf{x} \rightarrow \phi(\mathbf{x})$, where $\phi(\mathbf{x}) \in \mathbb{R}^D$. Usually $D \gg d$ because we add dimensions that capture non-linear interactions among the original features.

Advantage: It is simple, and your problem stays convex and well behaved (i.e., you can still use your original gradient descent code, just with the higher-dimensional representation).

Disadvantage: $\phi(\mathbf{x})$ might be very high dimensional.

Consider the following example: $\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix}$, and define $\phi(\mathbf{x}) = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \\ x_1 x_2 \\ \vdots \\ x_{d-1} x_d \\ \vdots \\ x_1 x_2 \cdots x_d \end{pmatrix}$.

Quiz: What is the dimensionality of $\phi(\mathbf{x})$?

This new representation, $\phi(\mathbf{x})$, is very expressive and allows for complicated non-linear decision boundaries, but the dimensionality is extremely high. This makes our algorithm unbearably (and quickly prohibitively) slow.
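To make the blow-up concrete, here is a minimal Python sketch that builds this $\phi(\mathbf{x})$ explicitly by enumerating every subset of the features (the helper name `phi` and the example vector are just for illustration); the resulting vector has $2^d$ entries.

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit expansion from the example above: one entry per subset of
    features, i.e. every product x_{i1} * ... * x_{ik}; the empty subset
    contributes the constant 1. The output has 2^d entries."""
    d = len(x)
    feats = []
    for k in range(d + 1):
        for idx in combinations(range(d), k):
            feats.append(np.prod(x[list(idx)]) if idx else 1.0)
    return np.array(feats)

x = np.array([2.0, 3.0, 5.0])
print(phi(x))        # [1, 2, 3, 5, 6, 10, 15, 30]
print(len(phi(x)))   # 2^3 = 8 dimensions already for d = 3
```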

The Kernel Trick

Gradient Descent with Squared Loss

The kernel trick is a way to get around this dilemma by learning a function in the much higher dimensional space, without ever computing a single vector $\phi(\mathbf{x})$ or ever computing the full vector $\mathbf{w}$. It is a little magical.

It is based on the following observation: If we use gradient descent with any one of our standard loss functions, the gradient is a linear combination of the input samples. For example, let us take a look at the squared loss:

$$\ell(\mathbf{w}) = \sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i - y_i)^2$$
The gradient descent rule, with step-size / learning-rate $s > 0$ (we denoted this as $\alpha > 0$ in our previous lectures), updates $\mathbf{w}$ over time, $$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - s\left(\frac{\partial \ell}{\partial \mathbf{w}}\right) \quad \textrm{where: } \frac{\partial \ell}{\partial \mathbf{w}} = \sum_{i=1}^n \underbrace{2(\mathbf{w}^\top \mathbf{x}_i - y_i)}_{\gamma_i \,:\, \textrm{function of } \mathbf{x}_i, y_i} \mathbf{x}_i = \sum_{i=1}^n \gamma_i \mathbf{x}_i$$
We will now show that we can express $\mathbf{w}$ as a linear combination of all input vectors, $$\mathbf{w} = \sum_{i=1}^n \alpha_i \mathbf{x}_i.$$
Since the loss is convex, the final solution is independent of the initialization, and we can initialize $\mathbf{w}_0$ to be whatever we want. For convenience, let us pick $\mathbf{w}_0 = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}$. For this initial choice of $\mathbf{w}_0$, the linear combination in $\mathbf{w} = \sum_{i=1}^n \alpha_i \mathbf{x}_i$ is trivially $\alpha_1 = \dots = \alpha_n = 0$. We now show that throughout the entire gradient descent optimization such coefficients $\alpha_1, \dots, \alpha_n$ must always exist, as we can re-write the gradient updates entirely in terms of updating the $\alpha_i$ coefficients:
$$
\begin{aligned}
\mathbf{w}_1 &= \mathbf{w}_0 - s\sum_{i=1}^n 2(\mathbf{w}_0^\top \mathbf{x}_i - y_i)\mathbf{x}_i = \sum_{i=1}^n \alpha_i^0 \mathbf{x}_i - s\sum_{i=1}^n \gamma_i^0 \mathbf{x}_i = \sum_{i=1}^n \alpha_i^1 \mathbf{x}_i && (\textrm{with } \alpha_i^1 = \alpha_i^0 - s\gamma_i^0)\\
\mathbf{w}_2 &= \mathbf{w}_1 - s\sum_{i=1}^n 2(\mathbf{w}_1^\top \mathbf{x}_i - y_i)\mathbf{x}_i = \sum_{i=1}^n \alpha_i^1 \mathbf{x}_i - s\sum_{i=1}^n \gamma_i^1 \mathbf{x}_i = \sum_{i=1}^n \alpha_i^2 \mathbf{x}_i && (\textrm{with } \alpha_i^2 = \alpha_i^1 - s\gamma_i^1)\\
\mathbf{w}_3 &= \mathbf{w}_2 - s\sum_{i=1}^n 2(\mathbf{w}_2^\top \mathbf{x}_i - y_i)\mathbf{x}_i = \sum_{i=1}^n \alpha_i^2 \mathbf{x}_i - s\sum_{i=1}^n \gamma_i^2 \mathbf{x}_i = \sum_{i=1}^n \alpha_i^3 \mathbf{x}_i && (\textrm{with } \alpha_i^3 = \alpha_i^2 - s\gamma_i^2)\\
&\;\;\vdots\\
\mathbf{w}_t &= \mathbf{w}_{t-1} - s\sum_{i=1}^n 2(\mathbf{w}_{t-1}^\top \mathbf{x}_i - y_i)\mathbf{x}_i = \sum_{i=1}^n \alpha_i^{t-1} \mathbf{x}_i - s\sum_{i=1}^n \gamma_i^{t-1} \mathbf{x}_i = \sum_{i=1}^n \alpha_i^t \mathbf{x}_i && (\textrm{with } \alpha_i^t = \alpha_i^{t-1} - s\gamma_i^{t-1})
\end{aligned}
$$

Formally, the argument is by induction. $\mathbf{w}$ is trivially a linear combination of our training vectors for $\mathbf{w}_0$ (base case). If we apply the inductive hypothesis for $\mathbf{w}_t$, it follows for $\mathbf{w}_{t+1}$.

The update rule for $\alpha_i^t$ is thus $\alpha_i^t = \alpha_i^{t-1} - s\gamma_i^{t-1}$, and we have $\alpha_i^t = -s\sum_{r=0}^{t-1} \gamma_i^r$.
In other words, we can perform the entire gradient descent update rule without ever expressing $\mathbf{w}$ explicitly. We just keep track of the $n$ coefficients $\alpha_1, \dots, \alpha_n$. Now that $\mathbf{w}$ can be written as a linear combination of the training set, we can also express the inner-product of $\mathbf{w}$ with any input $\mathbf{x}_j$ purely in terms of inner-products between training inputs: $$\mathbf{w}^\top \mathbf{x}_j = \sum_{i=1}^n \alpha_i \mathbf{x}_i^\top \mathbf{x}_j.$$
Consequently, we can also re-write the squared loss from $\ell(\mathbf{w}) = \sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i - y_i)^2$ entirely in terms of inner-products between training inputs: $$\ell(\boldsymbol{\alpha}) = \sum_{i=1}^n \left(\sum_{j=1}^n \alpha_j \mathbf{x}_j^\top \mathbf{x}_i - y_i\right)^2$$
During test-time we also only need these coefficients to make a prediction on a test-input $\mathbf{x}_t$, and can write the entire classifier in terms of inner-products between the test point and training points: $$h(\mathbf{x}_t) = \mathbf{w}^\top \mathbf{x}_t = \sum_{j=1}^n \alpha_j \mathbf{x}_j^\top \mathbf{x}_t.$$
Do you notice a theme? The only information we ever need in order to learn a hyper-plane classifier with the squared-loss is inner-products between all pairs of data vectors.
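As a sanity check on the derivation above, here is a minimal Python sketch (with made-up data, step size, and iteration count) that runs squared-loss gradient descent twice, once on $\mathbf{w}$ directly and once on the coefficients $\alpha_1, \dots, \alpha_n$, and confirms that $\mathbf{w}_t = \sum_{i=1}^n \alpha_i^t \mathbf{x}_i$ and that both give the same predictions.

```python
import numpy as np

# Made-up regression data: rows of X are the training inputs x_i.
rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
s = 0.001                             # step size

w = np.zeros(d)                       # primal iterate, w_0 = 0
alpha = np.zeros(n)                   # coefficients, alpha_i^0 = 0

for t in range(100):
    gamma = 2 * (X @ w - y)           # gamma_i = 2 (w^T x_i - y_i)
    w = w - s * (gamma @ X)           # w <- w - s * sum_i gamma_i x_i
    alpha = alpha - s * gamma         # alpha_i <- alpha_i - s * gamma_i

assert np.allclose(w, alpha @ X)      # w_t = sum_i alpha_i^t x_i

# Predictions agree too: w^T x_t = sum_j alpha_j x_j^T x_t
x_test = rng.normal(size=d)
assert np.isclose(w @ x_test, alpha @ (X @ x_test))
```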

Inner-Product Computation

Let's go back to the previous example, $\phi(\mathbf{x}) = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \\ x_1 x_2 \\ \vdots \\ x_{d-1} x_d \\ \vdots \\ x_1 x_2 \cdots x_d \end{pmatrix}$.

The inner product $\phi(\mathbf{x})^\top \phi(\mathbf{z})$ can be formulated as: $$\phi(\mathbf{x})^\top \phi(\mathbf{z}) = 1 \cdot 1 + x_1 z_1 + x_2 z_2 + \dots + x_1 x_2 z_1 z_2 + \dots + x_1 \cdots x_d z_1 \cdots z_d = \prod_{k=1}^d (1 + x_k z_k).$$

The sum of $2^d$ terms becomes the product of $d$ terms. We can compute the inner-product from the above formula in time $O(d)$ instead of $O(2^d)$! We define the function $$\underbrace{k(\mathbf{x}_i, \mathbf{x}_j)}_{\textrm{this is called the kernel function}} = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j).$$
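A quick numerical check of this identity, reusing the explicit `phi` expansion sketched earlier (the random vectors are just for illustration):

```python
import numpy as np
from itertools import combinations

def phi(x):
    # Explicit 2^d-dimensional expansion: one product per subset of features.
    d = len(x)
    return np.array([np.prod(x[list(idx)]) if idx else 1.0
                     for k in range(d + 1)
                     for idx in combinations(range(d), k)])

rng = np.random.default_rng(1)
x, z = rng.normal(size=4), rng.normal(size=4)

explicit = phi(x) @ phi(z)        # O(2^d) work in the expanded space
kernel = np.prod(1.0 + x * z)     # O(d) work via the product formula
assert np.isclose(explicit, kernel)
```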
With a finite training set of $n$ samples, inner-products are often pre-computed and stored in a Kernel Matrix: $$K_{ij} = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j).$$
If we store the matrix $K$, we only need to do simple inner-product look-ups and low-dimensional computations throughout the gradient descent algorithm. The final classifier becomes: $$h(\mathbf{x}_t) = \sum_{j=1}^n \alpha_j k(\mathbf{x}_j, \mathbf{x}_t).$$

During training in the new high dimensional space of $\phi(\mathbf{x})$ we want to compute $\gamma_i$ through kernels, without ever computing any $\phi(\mathbf{x}_i)$ or even $\mathbf{w}$. We previously established that $\mathbf{w} = \sum_{j=1}^n \alpha_j \phi(\mathbf{x}_j)$, and $\gamma_i = 2(\mathbf{w}^\top \phi(\mathbf{x}_i) - y_i)$. It follows that $\gamma_i = 2\left(\sum_{j=1}^n \alpha_j K_{ij} - y_i\right)$. The gradient update in iteration $t+1$ becomes $$\alpha_i^{t+1} \leftarrow \alpha_i^t - 2s\left(\sum_{j=1}^n \alpha_j^t K_{ij} - y_i\right).$$

As we have $n$ such updates to do, the amount of work per gradient update in the transformed space is $O(n^2)$ --- far better than $O(2^d)$.
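Putting the pieces together, here is a minimal sketch of kernelized squared-loss gradient descent, using the kernel from the worked example, $k(\mathbf{x}, \mathbf{z}) = \prod_{k=1}^d (1 + x_k z_k)$; the data, step size, and iteration count are made up for illustration.

```python
import numpy as np

def kernel(A, B):
    # K[i, j] = prod_k (1 + A[i, k] * B[j, k]),  the product-formula kernel above
    return np.prod(1.0 + A[:, None, :] * B[None, :, :], axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))              # training inputs x_i (rows)
y = X[:, 0] * X[:, 1]                     # a target that needs feature interactions

K = kernel(X, X)                          # pre-computed kernel matrix K_ij
alpha = np.zeros(len(X))                  # one coefficient per training point
s = 1e-3                                  # step size

for t in range(1000):
    gamma = 2 * (K @ alpha - y)           # gamma_i = 2 (sum_j alpha_j K_ij - y_i)
    alpha -= s * gamma                    # alpha_i^{t+1} = alpha_i^t - s * gamma_i

def h(X_test):
    # h(x_t) = sum_j alpha_j k(x_j, x_t) -- never forms phi(x) or w explicitly
    return kernel(X_test, X) @ alpha

print(h(X[:5]))                           # compare fitted values ...
print(y[:5])                              # ... to the targets
```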

General Kernels

Below are some popular kernel functions:

Linear: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z}$.

(The linear kernel is equivalent to just using a good old linear classifier, but it can be faster to use a kernel matrix if the dimensionality $d$ of the data is high.)

Polynomial: $K(\mathbf{x}, \mathbf{z}) = (1 + \mathbf{x}^\top \mathbf{z})^d$.

Radial Basis Function (RBF) (aka Gaussian Kernel): $K(\mathbf{x}, \mathbf{z}) = e^{-\frac{\|\mathbf{x} - \mathbf{z}\|^2}{\sigma^2}}$.

The RBF kernel is the most popular Kernel! It is a Universal approximator!! Its corresponding feature vector is infinite dimensional and cannot be computed. However, very effective low dimensional approximations exist (see this paper).

Exponential Kernel: $K(\mathbf{x}, \mathbf{z}) = e^{-\frac{\|\mathbf{x} - \mathbf{z}\|}{2\sigma^2}}$

Laplacian Kernel: $K(\mathbf{x}, \mathbf{z}) = e^{-\frac{\|\mathbf{x} - \mathbf{z}\|_1}{\sigma}}$

Sigmoid Kernel: $K(\mathbf{x}, \mathbf{z}) = \tanh(a\,\mathbf{x}^\top \mathbf{z} + c)$
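For concreteness, a small sketch of the kernels listed above written as plain functions of two vectors; the hyper-parameter values ($\sigma$, the polynomial degree, $a$, $c$) are placeholders you would tune.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, degree=3):
    return (1.0 + x @ z) ** degree

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def exponential_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / (2 * sigma ** 2))

def laplacian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum(np.abs(x - z)) / sigma)

def sigmoid_kernel(x, z, a=1.0, c=0.0):
    return np.tanh(a * (x @ z) + c)
```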

Kernel functions

Can any function $K(\cdot, \cdot) \rightarrow \mathbb{R}$ be used as a kernel?

No, the matrix $K(\mathbf{x}_i, \mathbf{x}_j)$ has to correspond to real inner-products after some transformation $\mathbf{x} \rightarrow \phi(\mathbf{x})$. This is the case if and only if $K$ is positive semi-definite.

Definition: A matrix $A \in \mathbb{R}^{n \times n}$ is positive semi-definite iff $\forall \mathbf{q} \in \mathbb{R}^n$, $\mathbf{q}^\top A \mathbf{q} \geq 0$.

Remember $K_{ij} = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$. So $K = \Phi^\top \Phi$, where $\Phi = [\phi(\mathbf{x}_1), \dots, \phi(\mathbf{x}_n)]$. It follows that $K$ is p.s.d., because $\mathbf{q}^\top K \mathbf{q} = \|\Phi \mathbf{q}\|^2 \geq 0$. Conversely, if any matrix $A$ is p.s.d., it can be decomposed as $A = \Phi^\top \Phi$ for some realization of $\Phi$.
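A quick numerical illustration with a made-up $\Phi$: the Gram matrix $K = \Phi^\top \Phi$ has only non-negative eigenvalues, and $\mathbf{q}^\top K \mathbf{q}$ equals $\|\Phi \mathbf{q}\|^2$ for any $\mathbf{q}$.

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(10, 6))        # column j plays the role of phi(x_j)
K = Phi.T @ Phi                       # K_ij = phi(x_i)^T phi(x_j)

# K is positive semi-definite: all eigenvalues >= 0 (up to float error).
eigvals = np.linalg.eigvalsh(K)       # eigvalsh: eigenvalues of a symmetric matrix
assert np.all(eigvals >= -1e-10)

# And for any q, q^T K q = ||Phi q||^2 >= 0.
q = rng.normal(size=6)
assert np.isclose(q @ K @ q, np.linalg.norm(Phi @ q) ** 2)
```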

You can even define kernels over sets, strings, graphs and molecules.

Figure 1: The demo shows how a kernel function solves a problem that linear classifiers cannot solve. An RBF kernel produces a good decision boundary in this case.