Lecture 13: Kernels
ERM is cool, but so far all classifiers are linear. What if there exists no linear decision boundary? As it turns out, there is an elegant way to incorporate non-linearities into ERM.
Handcrafted Feature Expansion
We can make classifiers non-linear by applying a basis function (feature transformation) to the input. For a data vector $\mathbf{x}\in\mathbb{R}^d$, apply the transformation $\mathbf{x}\to\phi(\mathbf{x})$ where $\phi(\mathbf{x})\in\mathbb{R}^D$. Usually $D\gg d$, because we add dimensions that capture non-linear interactions among the original features.
Advantage: It is simple, and your problem stays convex and well-behaved (i.e., you can still use your normal gradient descent code).
Disadvantage: ϕ(x) might be very high dimensional. (Let's worry about this later)
Consider the following example:
$$\mathbf{x}=\begin{pmatrix}x_1\\x_2\\\vdots\\x_d\end{pmatrix},\quad\text{and define}\quad \phi(\mathbf{x})=\begin{pmatrix}1\\x_1\\\vdots\\x_d\\x_1x_2\\\vdots\\x_{d-1}x_d\\\vdots\\x_1x_2\cdots x_d\end{pmatrix}.$$
Quiz: What is the dimensionality of ϕ(x)?
So, as is shown above, ϕ(x) is very expressive but the dimensionality is extremely high.
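As a small sketch (the function name `phi` is my own), the all-subsets expansion above can be written directly, though only for tiny $d$ since the output has $2^d$ entries:

```python
from itertools import combinations
from math import prod

def phi(x):
    """All-subsets feature expansion: one product per subset of the
    d coordinates (the empty subset contributes the constant 1).
    Output dimension is 2**d -- only feasible for very small d."""
    d = len(x)
    feats = []
    for k in range(d + 1):
        for idx in combinations(range(d), k):
            feats.append(prod(x[i] for i in idx))
    return feats

# For d = 3 the expansion has 2**3 = 8 entries.
print(len(phi([2.0, 3.0, 5.0])))  # 8
```

This makes the blow-up concrete: already at $d=30$ the expansion would have over a billion entries.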
The Kernel Trick
Gradient Descent with Squared Loss
Now, note that when we do gradient descent with many loss functions, the gradient is a linear combination of the input samples. Take squared loss for example:
$$\ell(\mathbf{w})=\sum_{i=1}^n\left(\mathbf{w}^\top\mathbf{x}_i-y_i\right)^2$$
The gradient descent rule, with step-size s>0, updates w over time,
$$\mathbf{w}_{t+1}\leftarrow\mathbf{w}_t-s\left(\frac{\partial \ell}{\partial \mathbf{w}}\right)\quad\text{where: }\frac{\partial \ell}{\partial \mathbf{w}}=\sum_{i=1}^n\underbrace{2\left(\mathbf{w}^\top\mathbf{x}_i-y_i\right)}_{\gamma_i\,:\ \text{function of }\mathbf{x}_i,y_i}\mathbf{x}_i=\sum_{i=1}^n\gamma_i\,\mathbf{x}_i$$
We will now show that we can express w as a linear combination of all input vectors,
$$\mathbf{w}=\sum_{i=1}^n\alpha_i\mathbf{x}_i.$$
Since the loss is convex, the final solution is independent of the initialization, and we can initialize $\mathbf{w}^0$ to be whatever we want. For convenience, let us pick $\mathbf{w}^0=\begin{pmatrix}0\\\vdots\\0\end{pmatrix}$.
For this initial choice of $\mathbf{w}^0$, the linear combination in $\mathbf{w}=\sum_{i=1}^n\alpha_i\mathbf{x}_i$ is trivially $\alpha_1=\cdots=\alpha_n=0$. We now show that throughout the entire gradient descent optimization such coefficients $\alpha_1,\dots,\alpha_n$ must always exist, as we can re-write the gradient updates entirely in terms of updating the $\alpha_i$ coefficients:
$$\begin{aligned}
\mathbf{w}^1&=\mathbf{w}^0-s\sum_{i=1}^n 2\left(\mathbf{w}^{0\top}\mathbf{x}_i-y_i\right)\mathbf{x}_i=\sum_{i=1}^n\alpha_i^0\mathbf{x}_i-s\sum_{i=1}^n\gamma_i^0\mathbf{x}_i=\sum_{i=1}^n\alpha_i^1\mathbf{x}_i &&\left(\text{with }\alpha_i^1=\alpha_i^0-s\gamma_i^0\right)\\
\mathbf{w}^2&=\mathbf{w}^1-s\sum_{i=1}^n 2\left(\mathbf{w}^{1\top}\mathbf{x}_i-y_i\right)\mathbf{x}_i=\sum_{i=1}^n\alpha_i^1\mathbf{x}_i-s\sum_{i=1}^n\gamma_i^1\mathbf{x}_i=\sum_{i=1}^n\alpha_i^2\mathbf{x}_i &&\left(\text{with }\alpha_i^2=\alpha_i^1-s\gamma_i^1\right)\\
\mathbf{w}^3&=\mathbf{w}^2-s\sum_{i=1}^n 2\left(\mathbf{w}^{2\top}\mathbf{x}_i-y_i\right)\mathbf{x}_i=\sum_{i=1}^n\alpha_i^2\mathbf{x}_i-s\sum_{i=1}^n\gamma_i^2\mathbf{x}_i=\sum_{i=1}^n\alpha_i^3\mathbf{x}_i &&\left(\text{with }\alpha_i^3=\alpha_i^2-s\gamma_i^2\right)\\
&\;\;\vdots\\
\mathbf{w}^t&=\mathbf{w}^{t-1}-s\sum_{i=1}^n 2\left(\mathbf{w}^{(t-1)\top}\mathbf{x}_i-y_i\right)\mathbf{x}_i=\sum_{i=1}^n\alpha_i^{t-1}\mathbf{x}_i-s\sum_{i=1}^n\gamma_i^{t-1}\mathbf{x}_i=\sum_{i=1}^n\alpha_i^t\mathbf{x}_i &&\left(\text{with }\alpha_i^t=\alpha_i^{t-1}-s\gamma_i^{t-1}\right)
\end{aligned}$$
The update-rule for $\alpha_i^t$ is thus
$$\alpha_i^t=\alpha_i^{t-1}-s\gamma_i^{t-1},\quad\text{and we have}\quad \alpha_i^t=-s\sum_{r=0}^{t-1}\gamma_i^r.$$
In other words, we can perform the entire gradient descent update rule without ever expressing $\mathbf{w}$ explicitly. We just keep track of the $n$ coefficients $\alpha_1,\dots,\alpha_n$.
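The equivalence of the two parameterizations can be checked numerically. A minimal sketch (random data, variable names my own): one loop runs gradient descent on $\mathbf{w}$ directly, the other only updates the $\alpha_i$, and at the end $\sum_i\alpha_i\mathbf{x}_i$ recovers exactly the same $\mathbf{w}$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # n = 5 training points in d = 3
y = rng.standard_normal(5)
s = 0.01                          # step size

w = np.zeros(3)                   # direct gradient descent on w
alpha = np.zeros(5)               # equivalent descent on the n coefficients

for _ in range(50):
    # gamma_i = 2 (w^T x_i - y_i), computed from w ...
    gamma = 2 * (X @ w - y)
    w = w - s * X.T @ gamma       # w <- w - s * sum_i gamma_i x_i
    # ... and the same gammas from alpha, via inner products X X^T only:
    gamma_a = 2 * (X @ (X.T @ alpha) - y)
    alpha = alpha - s * gamma_a

# The alpha parameterization recovers exactly the same weight vector.
print(np.allclose(w, X.T @ alpha))  # True
```

The invariant $\mathbf{w}^t=\sum_i\alpha_i^t\mathbf{x}_i$ holds at every iteration, which is exactly the derivation above.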
Now that $\mathbf{w}$ can be written as a linear combination of the training set, we can also express the inner-product of $\mathbf{w}$ with any input $\mathbf{x}_j$ purely in terms of inner-products between training inputs:
$$\mathbf{w}^\top\mathbf{x}_j=\sum_{i=1}^n\alpha_i\mathbf{x}_i^\top\mathbf{x}_j.$$
Consequently, we can also re-write the squared-loss from $\ell(\mathbf{w})=\sum_{i=1}^n(\mathbf{w}^\top\mathbf{x}_i-y_i)^2$ entirely in terms of inner-products between training inputs:
$$\ell(\boldsymbol{\alpha})=\sum_{i=1}^n\left(\sum_{j=1}^n\alpha_j\mathbf{x}_j^\top\mathbf{x}_i-y_i\right)^2$$
During test-time we also only need these coefficients to make a prediction on a test-input xt, and can write the entire classifier in terms of inner-products between the test point and training points:
$$h(\mathbf{x}_t)=\mathbf{w}^\top\mathbf{x}_t=\sum_{j=1}^n\alpha_j\mathbf{x}_j^\top\mathbf{x}_t.$$
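As a quick sketch (random data for illustration), the two ways of writing this prediction agree: $\mathbf{w}^\top\mathbf{x}_t$ computed explicitly, and $\sum_j\alpha_j\mathbf{x}_j^\top\mathbf{x}_t$ computed from inner products alone:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 2))       # training inputs (rows)
alpha = rng.standard_normal(4)        # coefficients (illustrative values)
x_t = rng.standard_normal(2)          # a test point

w = X.T @ alpha                       # explicit weight vector w = sum_j alpha_j x_j
h_explicit = w @ x_t                  # w^T x_t
h_kernelized = alpha @ (X @ x_t)      # sum_j alpha_j (x_j^T x_t)

print(np.isclose(h_explicit, h_kernelized))  # True
```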
Do you notice a theme? The only information we ever need in order to learn a hyper-plane classifier with the squared-loss is inner-products between all pairs of data vectors.
Inner-Product Computation
Let's go back to the previous example,
$$\phi(\mathbf{x})=\begin{pmatrix}1\\x_1\\\vdots\\x_d\\x_1x_2\\\vdots\\x_{d-1}x_d\\\vdots\\x_1x_2\cdots x_d\end{pmatrix}.$$
The inner product $\phi(\mathbf{x})^\top\phi(\mathbf{z})$ can be formulated as:
$$\phi(\mathbf{x})^\top\phi(\mathbf{z})=1\cdot 1+x_1z_1+x_2z_2+\cdots+x_1x_2z_1z_2+\cdots+x_1\cdots x_d\,z_1\cdots z_d=\prod_{k=1}^d(1+x_kz_k).$$
The sum of $2^d$ terms becomes the product of $d$ terms. We can compute the inner-product with the above formula in time $O(d)$ instead of $O(2^d)$!
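The identity is easy to verify numerically for a small $d$. A sketch (function names my own): compute the $O(2^d)$ inner product on the explicit expansion, and the $O(d)$ product formula, and check they agree:

```python
from itertools import combinations
from math import prod

def phi(x):
    # All-subsets expansion: one product per subset of coordinates.
    d = len(x)
    return [prod(x[i] for i in idx)
            for k in range(d + 1)
            for idx in combinations(range(d), k)]

def k_subsets(x, z):
    # O(d) kernel evaluation: prod_k (1 + x_k z_k).
    return prod(1 + xk * zk for xk, zk in zip(x, z))

x, z = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # O(2^d) inner product
fast = k_subsets(x, z)                                 # O(d) product formula
print(abs(explicit - fast) < 1e-9)  # True
```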
In fact, we can pre-compute them and store them in a kernel matrix
$$\underbrace{K(\mathbf{x}_i,\mathbf{x}_j)}_{\text{this is called the Kernel Matrix}}=\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j).$$
If we store the matrix K, we only need to do simple inner-product look-ups and low-dimensional computations throughout the gradient descent algorithm.
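A minimal sketch of this idea, using the linear kernel $K_{ij}=\mathbf{x}_i^\top\mathbf{x}_j$ on random data (variable names my own): the Gram matrix is built once, and both the loss $\ell(\boldsymbol{\alpha})$ and the $\gamma_i$ updates from the derivation above are then computed from $K$ look-ups alone:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 3))
y = rng.standard_normal(6)

# Precompute the kernel (Gram) matrix once: K[i, j] = x_i^T x_j.
K = X @ X.T

def loss(alpha):
    # ell(alpha) = sum_i (sum_j alpha_j x_j^T x_i - y_i)^2,
    # evaluated from K alone -- no feature vectors needed.
    residual = K @ alpha - y
    return float(residual @ residual)

s = 0.01
alpha = np.zeros(6)
initial = loss(alpha)
for _ in range(200):
    gamma = 2 * (K @ alpha - y)   # gamma_i from kernel look-ups only
    alpha = alpha - s * gamma     # alpha_i <- alpha_i - s * gamma_i

# The kernelized updates decrease the squared loss.
print(loss(alpha) < initial)  # True
```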
General Kernels
The following are some common kernel functions:
Linear: $K(\mathbf{x},\mathbf{z})=\mathbf{x}^\top\mathbf{z}$.
(The linear kernel is equivalent to just using a good old linear classifier - but it can be faster to use a kernel matrix if the dimensionality d of the data is high.)
Polynomial: $K(\mathbf{x},\mathbf{z})=(1+\mathbf{x}^\top\mathbf{z})^d$.
Radial Basis Function (RBF) (aka Gaussian Kernel): $K(\mathbf{x},\mathbf{z})=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}}$.
The RBF kernel is the most popular Kernel! It is a Universal approximator!! Its corresponding feature vector is infinite dimensional.
In the following we provide some other kernels.
Exponential Kernel: $K(\mathbf{x},\mathbf{z})=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|}{2\sigma^2}}$
Laplacian Kernel: $K(\mathbf{x},\mathbf{z})=e^{-\frac{|\mathbf{x}-\mathbf{z}|}{\sigma}}$
Sigmoid Kernel: $K(\mathbf{x},\mathbf{z})=\tanh(a\,\mathbf{x}^\top\mathbf{z}+c)$
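Sketches of the three most common kernels above (the parameter values and function names here are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def linear(x, z):
    # K(x, z) = x^T z
    return x @ z

def polynomial(x, z, degree=3):
    # K(x, z) = (1 + x^T z)^d
    return (1 + x @ z) ** degree

def rbf(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / sigma^2)
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

x = np.array([1.0, 0.0])
z = np.array([1.0, 0.0])
# A point has maximal RBF similarity (1.0) with itself.
print(rbf(x, z))                       # 1.0
print(linear(x, z), polynomial(x, z))  # 1.0 8.0
```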
Think about it: Can any function K(⋅,⋅) be used as a kernel?
No, the matrix $K(\mathbf{x}_i,\mathbf{x}_j)$ has to correspond to real inner-products after some transformation $\mathbf{x}\to\phi(\mathbf{x})$. This is the case if and only if $K$ is positive semi-definite.
Definition: A matrix $A\in\mathbb{R}^{n\times n}$ is positive semi-definite iff $\forall \mathbf{q}\in\mathbb{R}^n$, $\mathbf{q}^\top A\mathbf{q}\geq 0$.
Why is that?
Remember $K(\mathbf{x},\mathbf{z})=\phi(\mathbf{x})^\top\phi(\mathbf{z})$. A matrix of the form
$$A=\begin{pmatrix}\mathbf{x}_1^\top\mathbf{x}_1&\cdots&\mathbf{x}_1^\top\mathbf{x}_n\\\vdots&\ddots&\vdots\\\mathbf{x}_n^\top\mathbf{x}_1&\cdots&\mathbf{x}_n^\top\mathbf{x}_n\end{pmatrix}=\begin{pmatrix}\mathbf{x}_1^\top\\\vdots\\\mathbf{x}_n^\top\end{pmatrix}\begin{pmatrix}\mathbf{x}_1&\cdots&\mathbf{x}_n\end{pmatrix}$$
must be positive semi-definite because:
$$\mathbf{q}^\top A\mathbf{q}=\Big(\underbrace{(\mathbf{x}_1,\cdots,\mathbf{x}_n)\,\mathbf{q}}_{\text{a vector with the same dimension as }\mathbf{x}_i}\Big)^\top\Big((\mathbf{x}_1,\cdots,\mathbf{x}_n)\,\mathbf{q}\Big)\geq 0\quad\text{for all }\mathbf{q}\in\mathbb{R}^n.$$
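One practical way to sanity-check positive semi-definiteness is to build the kernel matrix on a sample of points and inspect its eigenvalues, which must all be non-negative (up to numerical tolerance). A sketch using the RBF kernel on random points:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 2))   # 5 sample points in 2 dimensions

def rbf(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / sigma^2)
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

# Kernel matrix K[i, j] = K(x_i, x_j) on the sample.
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

# A valid kernel matrix is symmetric PSD: every eigenvalue is >= 0.
eigvals = np.linalg.eigvalsh(K)
print(np.all(eigvals >= -1e-10))  # True
```

Note this only verifies PSD-ness on one finite sample; a true kernel must yield a PSD matrix for every choice of points.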
You can even define kernels over sets, strings, graphs and molecules.
Figure 1: The demo shows how a kernel function solves a problem that linear classifiers cannot solve. The RBF kernel produces a good decision boundary in this case.