Linear classifiers are great, but what if there exists no linear decision boundary? As it turns out, there is an elegant way to incorporate non-linearities into most linear classifiers.
Handcrafted Feature Expansion
We can make linear classifiers non-linear by applying a basis function (feature transformation) to the input feature vectors. Formally, for a data vector $\mathbf{x}\in\mathbb{R}^d$, we apply the transformation $\mathbf{x}\rightarrow\phi(\mathbf{x})$ where $\phi(\mathbf{x})\in\mathbb{R}^D$. Usually $D\gg d$ because we add dimensions that capture non-linear interactions among the original features.
Advantage: It is simple, and your problem stays convex and well-behaved (i.e., you can still use your original gradient descent code, just with the higher-dimensional representation).
Disadvantage: $\phi(\mathbf{x})$ might be very high dimensional.
Consider the following example: $\mathbf{x}=\begin{pmatrix}x_1\\ x_2\\ \vdots\\ x_d\end{pmatrix}$, and define $\phi(\mathbf{x})=\begin{pmatrix}1\\ x_1\\ \vdots\\ x_d\\ x_1x_2\\ \vdots\\ x_{d-1}x_d\\ \vdots\\ x_1x_2\cdots x_d\end{pmatrix}$.
Quiz: What is the dimensionality of $\phi(\mathbf{x})$?
This new representation, $\phi(\mathbf{x})$, is very expressive and allows for complicated non-linear decision boundaries, but the dimensionality is extremely high. This makes our algorithm unbearably (and quickly prohibitively) slow.
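To see this blow-up concretely, here is a minimal sketch (assuming NumPy; the helper name phi is ours, not from any library) that builds $\phi(\mathbf{x})$ explicitly by enumerating every subset of the features:

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit feature expansion: one entry per subset of features,
    holding the product of that subset (the empty subset gives the constant 1)."""
    d = len(x)
    feats = []
    for r in range(d + 1):
        for subset in combinations(range(d), r):
            feats.append(np.prod(x[list(subset)]))
    return np.array(feats)

x = np.array([2.0, 3.0, 5.0])
print(phi(x))        # [ 1.  2.  3.  5.  6. 10. 15. 30.]
print(len(phi(x)))   # 8 = 2^3 entries for d = 3 features
```

Already at $d=30$ this representation would have over a billion entries, which is exactly the cost the kernel trick avoids.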
The Kernel Trick
Gradient Descent with Squared Loss
The kernel trick is a way to get around this dilemma by learning a function in the much higher dimensional space, without ever computing a single vector ϕ(x) or ever computing the full vector w. It is a little magical.
It is based on the following observation: If we use gradient descent with any one of our standard loss functions, the gradient is a linear combination of the input samples. For example, let us take a look at the squared loss:
$$\ell(\mathbf{w})=\sum_{i=1}^n \left(\mathbf{w}^\top\mathbf{x}_i-y_i\right)^2$$
The gradient descent rule, with step-size/learning-rate $s>0$ (we denoted this as $\alpha>0$ in our previous lectures), updates $\mathbf{w}$ over time,
$$\mathbf{w}^{t+1}\leftarrow \mathbf{w}^t - s\left(\frac{\partial \ell}{\partial \mathbf{w}}\right)\quad\text{where: }\frac{\partial \ell}{\partial \mathbf{w}}=\sum_{i=1}^n \underbrace{2\left(\mathbf{w}^\top\mathbf{x}_i-y_i\right)}_{\gamma_i\,:\ \text{function of }\mathbf{x}_i,y_i}\mathbf{x}_i=\sum_{i=1}^n \gamma_i\,\mathbf{x}_i$$
We will now show that we can express $\mathbf{w}$ as a linear combination of all input vectors,
$$\mathbf{w}=\sum_{i=1}^n \alpha_i\mathbf{x}_i.$$
Since the loss is convex, the final solution is independent of the initialization, and we can initialize $\mathbf{w}^0$ to be whatever we want. For convenience, let us pick $\mathbf{w}^0=\begin{pmatrix}0\\ \vdots\\ 0\end{pmatrix}$.
For this initial choice of $\mathbf{w}^0$, the linear combination in $\mathbf{w}=\sum_{i=1}^n \alpha_i\mathbf{x}_i$ is trivially $\alpha_1=\cdots=\alpha_n=0$. We now show that throughout the entire gradient descent optimization such coefficients $\alpha_1,\dots,\alpha_n$ must always exist, as we can re-write the gradient updates entirely in terms of updating the $\alpha_i$ coefficients:
$$\begin{aligned}\mathbf{w}^1&=\mathbf{w}^0-s\sum_{i=1}^n 2\left({\mathbf{w}^0}^\top\mathbf{x}_i-y_i\right)\mathbf{x}_i=\sum_{i=1}^n \alpha_i^0\mathbf{x}_i-s\sum_{i=1}^n \gamma_i^0\mathbf{x}_i=\sum_{i=1}^n \alpha_i^1\mathbf{x}_i &&\left(\text{with }\alpha_i^1=\alpha_i^0-s\gamma_i^0\right)\\ \mathbf{w}^2&=\mathbf{w}^1-s\sum_{i=1}^n 2\left({\mathbf{w}^1}^\top\mathbf{x}_i-y_i\right)\mathbf{x}_i=\sum_{i=1}^n \alpha_i^1\mathbf{x}_i-s\sum_{i=1}^n \gamma_i^1\mathbf{x}_i=\sum_{i=1}^n \alpha_i^2\mathbf{x}_i &&\left(\text{with }\alpha_i^2=\alpha_i^1-s\gamma_i^1\right)\\ \mathbf{w}^3&=\mathbf{w}^2-s\sum_{i=1}^n 2\left({\mathbf{w}^2}^\top\mathbf{x}_i-y_i\right)\mathbf{x}_i=\sum_{i=1}^n \alpha_i^2\mathbf{x}_i-s\sum_{i=1}^n \gamma_i^2\mathbf{x}_i=\sum_{i=1}^n \alpha_i^3\mathbf{x}_i &&\left(\text{with }\alpha_i^3=\alpha_i^2-s\gamma_i^2\right)\\ &\;\;\vdots\\ \mathbf{w}^t&=\mathbf{w}^{t-1}-s\sum_{i=1}^n 2\left({\mathbf{w}^{t-1}}^\top\mathbf{x}_i-y_i\right)\mathbf{x}_i=\sum_{i=1}^n \alpha_i^{t-1}\mathbf{x}_i-s\sum_{i=1}^n \gamma_i^{t-1}\mathbf{x}_i=\sum_{i=1}^n \alpha_i^t\mathbf{x}_i &&\left(\text{with }\alpha_i^t=\alpha_i^{t-1}-s\gamma_i^{t-1}\right)\end{aligned}$$
Formally, the argument is by induction. $\mathbf{w}$ is trivially a linear combination of our training vectors for $\mathbf{w}^0$ (base case). If we apply the inductive hypothesis for $\mathbf{w}^t$ it follows for $\mathbf{w}^{t+1}$.
The update rule for $\alpha_i^t$ is thus
$$\alpha_i^t=\alpha_i^{t-1}-s\gamma_i^{t-1},\quad\text{and we have}\quad \alpha_i^t=-s\sum_{r=0}^{t-1}\gamma_i^r.$$
In other words, we can perform the entire gradient descent update rule without ever expressing $\mathbf{w}$ explicitly. We just keep track of the $n$ coefficients $\alpha_1,\dots,\alpha_n$.
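This equivalence is easy to verify numerically. The sketch below (assuming NumPy; the data and step size are made up for illustration) runs ordinary gradient descent on $\mathbf{w}$ and, in parallel, the same descent expressed only through the coefficients $\alpha_i$, and then checks that $\mathbf{w}=\sum_{i=1}^n \alpha_i\mathbf{x}_i$ still holds:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # n = 20 samples, d = 5 features (toy data)
y = rng.normal(size=20)
s = 0.01                       # step size

w = np.zeros(5)                # direct gradient descent on w
alpha = np.zeros(20)           # the same descent, tracked only through alpha

for t in range(100):
    gamma = 2 * (X @ w - y)               # gamma_i = 2 (w^T x_i - y_i)
    w = w - s * (gamma @ X)               # w <- w - s * sum_i gamma_i x_i
    gamma_a = 2 * (X @ (X.T @ alpha) - y) # same gammas, but computed from alpha
    alpha = alpha - s * gamma_a           # alpha_i <- alpha_i - s * gamma_i

# w is recovered exactly as the linear combination sum_i alpha_i x_i
print(np.allclose(w, X.T @ alpha))        # True
```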
Now that $\mathbf{w}$ can be written as a linear combination of the training set, we can also express the inner-product of $\mathbf{w}$ with any input $\mathbf{x}_j$ purely in terms of inner-products between training inputs:
$$\mathbf{w}^\top\mathbf{x}_j=\sum_{i=1}^n \alpha_i\mathbf{x}_i^\top\mathbf{x}_j.$$
Consequently, we can also re-write the squared loss from $\ell(\mathbf{w})=\sum_{i=1}^n(\mathbf{w}^\top\mathbf{x}_i-y_i)^2$ entirely in terms of inner-products between training inputs:
$$\ell(\alpha)=\sum_{i=1}^n\left(\sum_{j=1}^n \alpha_j\mathbf{x}_j^\top\mathbf{x}_i-y_i\right)^2$$
During test-time we also only need these coefficients to make a prediction on a test-input $\mathbf{x}_t$, and we can write the entire classifier in terms of inner-products between the test point and the training points:
$$h(\mathbf{x}_t)=\mathbf{w}^\top\mathbf{x}_t=\sum_{j=1}^n \alpha_j\mathbf{x}_j^\top\mathbf{x}_t.$$
Do you notice a theme? The only information we ever need in order to learn a hyper-plane classifier with the squared-loss is inner-products between all pairs of data vectors.
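As a quick sanity check (again a sketch assuming NumPy, with made-up data and coefficients), the loss and the predictions computed from $\mathbf{w}$ agree with the versions written purely in terms of pairwise inner products:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)
alpha = rng.normal(size=20)     # some coefficient vector
G = X @ X.T                     # matrix of all pairwise inner products x_i^T x_j

w = X.T @ alpha                 # w = sum_j alpha_j x_j

loss_w = np.sum((X @ w - y) ** 2)          # squared loss computed from w ...
loss_alpha = np.sum((G @ alpha - y) ** 2)  # ... and purely from inner products
print(np.isclose(loss_w, loss_alpha))      # True

# Predicting on a new point only needs its inner products with the training points
x_t = rng.normal(size=5)
print(np.isclose(w @ x_t, alpha @ (X @ x_t)))   # True
```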
Inner-Product Computation
Let's go back to the previous example, $\phi(\mathbf{x})=\begin{pmatrix}1\\ x_1\\ \vdots\\ x_d\\ x_1x_2\\ \vdots\\ x_{d-1}x_d\\ \vdots\\ x_1x_2\cdots x_d\end{pmatrix}$.
The inner product $\phi(\mathbf{x})^\top\phi(\mathbf{z})$ can be formulated as:
$$\phi(\mathbf{x})^\top\phi(\mathbf{z})=1\cdot 1+x_1z_1+x_2z_2+\cdots+x_1x_2z_1z_2+\cdots+x_1\cdots x_d z_1\cdots z_d=\prod_{k=1}^d\left(1+x_k z_k\right).$$
The sum of $2^d$ terms becomes a product of $d$ terms. We can compute the inner-product from the above formula in time $O(d)$ instead of $O(2^d)$!
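A small sketch of this identity (assuming NumPy; phi is the same hypothetical helper as above) confirms that the $O(d)$ product matches the explicit $O(2^d)$ inner product:

```python
import numpy as np
from itertools import combinations

def phi(x):
    """All 2^d subset-products of the features (hypothetical helper from above)."""
    d = len(x)
    return np.array([np.prod(x[list(s)])
                     for r in range(d + 1)
                     for s in combinations(range(d), r)])

x = np.array([0.5, -1.0, 2.0, 0.3])
z = np.array([1.5, 0.2, -0.7, 1.0])

explicit = phi(x) @ phi(z)          # O(2^d) work
via_product = np.prod(1 + x * z)    # O(d) work
print(np.isclose(explicit, via_product))   # True
```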
We define the function
$$\underbrace{k(\mathbf{x}_i,\mathbf{x}_j)}_{\text{this is called the kernel function}}=\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j).$$
With a finite training set of n samples, inner products are often pre-computed and stored in a Kernel Matrix:
$$K_{ij}=\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j).$$
If we store the matrix K, we only need to do simple inner-product look-ups and low-dimensional computations throughout the gradient descent algorithm.
The final classifier becomes:
$$h(\mathbf{x}_t)=\sum_{j=1}^n \alpha_j\,k(\mathbf{x}_j,\mathbf{x}_t).$$
During training in the new high-dimensional space of $\phi(\mathbf{x})$ we want to compute $\gamma_i$ through kernels, without ever computing any $\phi(\mathbf{x}_i)$ or even $\mathbf{w}$. We previously established that $\mathbf{w}=\sum_{j=1}^n \alpha_j\phi(\mathbf{x}_j)$, and
$\gamma_i=2\left(\mathbf{w}^\top\phi(\mathbf{x}_i)-y_i\right)$. It follows that $\gamma_i=2\left(\sum_{j=1}^n \alpha_j K_{ij}-y_i\right)$. The gradient update in iteration $t+1$ becomes $$\alpha_i^{t+1}\leftarrow \alpha_i^t-2s\left(\sum_{j=1}^n \alpha_j^t K_{ij}-y_i\right).$$
As we have $n$ such updates to do, the amount of work per gradient update in the transformed space is $O(n^2)$ --- far better than $O(2^d)$.
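Putting the pieces together, here is a sketch of kernelized gradient descent on the squared loss (assuming NumPy; the names rbf_kernel_matrix, kernelized_gd, and predict are ours, and the RBF kernel, data, and hyper-parameters are arbitrary choices for illustration):

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / sigma^2) for all pairs of training points."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

def kernelized_gd(K, y, s=0.01, iterations=5000):
    """Gradient descent on the squared loss, expressed entirely through alpha."""
    alpha = np.zeros(len(y))
    for _ in range(iterations):
        gamma = 2 * (K @ alpha - y)   # gamma_i = 2 (sum_j alpha_j K_ij - y_i)
        alpha = alpha - s * gamma     # alpha_i <- alpha_i - s * gamma_i
    return alpha

def predict(alpha, X_train, x_t, sigma=1.0):
    """h(x_t) = sum_j alpha_j k(x_j, x_t)."""
    k = np.exp(-np.sum((X_train - x_t) ** 2, axis=1) / sigma ** 2)
    return alpha @ k

# Toy usage: labels given by inside vs. outside the unit circle
# (not linearly separable in general)
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)
K = rbf_kernel_matrix(X)
alpha = kernelized_gd(K, y)
print("training accuracy:", np.mean(np.sign(K @ alpha) == y))
print("score at the origin:", predict(alpha, X, np.array([0.0, 0.0])))
```

Note that neither $\phi(\mathbf{x})$ nor $\mathbf{w}$ is ever formed; everything runs through the $n\times n$ matrix $K$ and the $n$ coefficients $\alpha_i$.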
General Kernels
Below are some popular kernel functions (a short code sketch of them follows the list):
Linear: $K(\mathbf{x},\mathbf{z})=\mathbf{x}^\top\mathbf{z}$.
(The linear kernel is equivalent to just using a good old linear classifier - but it can be faster to use a kernel matrix if the dimensionality d of the data is high.)
Polynomial: $K(\mathbf{x},\mathbf{z})=(1+\mathbf{x}^\top\mathbf{z})^d$.
Radial Basis Function (RBF) (aka Gaussian Kernel): $K(\mathbf{x},\mathbf{z})=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}}$.
The RBF kernel is the most popular kernel! It is a universal approximator! Its corresponding feature vector is infinite-dimensional and cannot be computed. However, very effective low-dimensional approximations exist (see this paper).
Exponential Kernel: $K(\mathbf{x},\mathbf{z})=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|}{2\sigma^2}}$
Laplacian Kernel: $K(\mathbf{x},\mathbf{z})=e^{-\frac{|\mathbf{x}-\mathbf{z}|}{\sigma}}$
Sigmoid Kernel: $K(\mathbf{x},\mathbf{z})=\tanh(a\,\mathbf{x}^\top\mathbf{z}+c)$
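For reference, here is a direct transcription of these formulas into code (a sketch assuming NumPy; the function names are ours, and the Laplacian kernel is read as using the $\ell_1$-distance between the vectors):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=3):
    return (1 + x @ z) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def laplacian_kernel(x, z, sigma=1.0):
    # |x - z| interpreted as the l1-distance (an assumption)
    return np.exp(-np.sum(np.abs(x - z)) / sigma)

def sigmoid_kernel(x, z, a=1.0, c=0.0):
    return np.tanh(a * (x @ z) + c)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, rbf_kernel, laplacian_kernel, sigmoid_kernel):
    print(k.__name__, k(x, z))
```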
Kernel Functions
Can any function $K(\cdot,\cdot)\rightarrow\mathbb{R}$ be used as a kernel?
No, the kernel matrix with entries $K(\mathbf{x}_i,\mathbf{x}_j)$ has to correspond to real inner-products after some transformation $\mathbf{x}\rightarrow\phi(\mathbf{x})$. This is the case if and only if $K$ is positive semi-definite.
Definition: A matrix $A\in\mathbb{R}^{n\times n}$ is positive semi-definite iff $\forall \mathbf{q}\in\mathbb{R}^n$, $\mathbf{q}^\top A\mathbf{q}\geq 0$.
Remember $K_{ij}=\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$, so $K=\Phi^\top\Phi$, where $\Phi=[\phi(\mathbf{x}_1),\dots,\phi(\mathbf{x}_n)]$.
It follows that $K$ is p.s.d., because $\mathbf{q}^\top K\mathbf{q}=(\Phi\mathbf{q})^\top(\Phi\mathbf{q})=\|\Phi\mathbf{q}\|^2\geq 0$. Conversely, if a matrix $A$ is p.s.d., it can be decomposed as $A=\Phi^\top\Phi$ for some realization of $\Phi$.
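This property is easy to check numerically (a sketch assuming NumPy): the eigenvalues of a kernel matrix built from a valid kernel, here the RBF kernel, are non-negative up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(15, 4))

# RBF kernel matrix with sigma = 1: K_ij = exp(-||x_i - x_j||^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)

# All eigenvalues should be >= 0 (up to numerical error), confirming K is p.s.d.
eigenvalues = np.linalg.eigvalsh(K)
print(np.all(eigenvalues >= -1e-10))   # True
```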
You can even define kernels over sets, strings, graphs and molecules.
Figure 1: A demo showing how a kernel function can solve a problem that linear classifiers cannot. In this case, the RBF kernel produces a good decision boundary.