Linear classifiers are great, but what if there exists no linear decision boundary? As it turns out, there is an elegant way to incorporate non-linearities into most linear classifiers.
Handcrafted Feature Expansion
We can make linear classifiers non-linear by applying basis functions (feature transformations) to the input feature vectors. Formally, for a data vector $\mathbf{x}\in\mathbb{R}^d$, we apply the transformation $\mathbf{x}\rightarrow\phi(\mathbf{x})$, where $\phi(\mathbf{x})\in\mathbb{R}^D$. Usually $D\gg d$, because we add dimensions that capture non-linear interactions among the original features.
Advantage: It is simple, and your problem stays convex and well behaved (i.e., you can still use your original gradient descent code, just with the higher dimensional representation).
Disadvantage: $\phi(\mathbf{x})$ might be very high dimensional.
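Consider the following example: $\mathbf{x}=(x_1,x_2,\dots,x_d)^\top$, and define $\phi(\mathbf{x})=(1,\;x_1,\;\dots,\;x_d,\;x_1x_2,\;\dots,\;x_{d-1}x_d,\;\dots,\;x_1x_2\cdots x_d)^\top$.
Quiz: What is the dimensionality of $\phi(\mathbf{x})$?
This new representation, $\phi(\mathbf{x})$, is very expressive and allows for complicated non-linear decision boundaries, but the dimensionality is extremely high. This makes our algorithm unbearably (and quickly prohibitively) slow.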
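To get a feel for how fast this representation grows, here is a minimal sketch of such an expansion. It assumes the feature map consists of the product of every subset of the input coordinates (the empty subset giving the constant 1); the name `expand_features` is made up for this illustration.

```python
from itertools import combinations
from math import prod


def expand_features(x):
    """Return the products of every subset of the coordinates of x.

    The empty subset contributes the constant 1 (math.prod of an empty
    tuple is 1), so the result has 2**len(x) entries.
    """
    d = len(x)
    return [
        prod(subset)
        for size in range(d + 1)
        for subset in combinations(x, size)
    ]


x = [2.0, 3.0, 5.0]
phi_x = expand_features(x)
print(phi_x)       # constant, single features, pairwise products, full product
print(len(phi_x))  # 2**3 entries for d = 3

# The length doubles with every additional input feature.
for d in (5, 10, 20):
    print(d, len(expand_features([1.0] * d)))
```

Already at 20 input features the expanded vector has over a million entries, which is why working with it explicitly quickly becomes impractical.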
The Kernel Trick
Gradient Descent with Squared Loss
The kernel trick is a way to get around this dilemma by learning a function in the much higher dimensional space, without ever computing a single vector $\phi(\mathbf{x})$ or ever computing the full vector $\mathbf{w}$. It is a little magical.
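It is based on the following observation: If we use gradient descent with any one of our standard loss functions, the gradient is a linear combination of the input samples. For example, let us take a look at the squared loss:
$$\ell(\mathbf{w})=\sum_{i=1}^n(\mathbf{w}^\top\mathbf{x}_i-y_i)^2$$
The gradient descent rule, with step-size/learning-rate $s>0$ (we denoted this as $\alpha>0$ in our previous lectures), updates $\mathbf{w}$ over time,
$$\mathbf{w}_{t+1}\leftarrow\mathbf{w}_t-s\left(\frac{\partial\ell}{\partial\mathbf{w}}\right)\quad\text{where}\quad\frac{\partial\ell}{\partial\mathbf{w}}=\sum_{i=1}^n\underbrace{2(\mathbf{w}^\top\mathbf{x}_i-y_i)}_{\gamma_i}\,\mathbf{x}_i=\sum_{i=1}^n\gamma_i\mathbf{x}_i$$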
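We will now show that we can express $\mathbf{w}$ as a linear combination of all input vectors,
$$\mathbf{w}=\sum_{i=1}^n\alpha_i\mathbf{x}_i.$$
Since the loss is convex, the final solution is independent of the initialization, and we can initialize $\mathbf{w}_0$ to be whatever we want. For convenience, let us pick $\mathbf{w}_0=\mathbf{0}$.
For this initial choice of $\mathbf{w}_0$, the linear combination in $\mathbf{w}=\sum_{i=1}^n\alpha_i\mathbf{x}_i$ is trivially $\alpha_1=\dots=\alpha_n=0$. We now show that throughout the entire gradient descent optimization such coefficients $\alpha_1,\dots,\alpha_n$ must always exist, as we can re-write the gradient updates entirely in terms of updating the $\alpha_i$ coefficients. Writing $\gamma_i^t=2(\mathbf{w}_t^\top\mathbf{x}_i-y_i)$ for the gradient coefficient of $\mathbf{x}_i$ at iteration $t$:
$$\begin{aligned}
\mathbf{w}_1&=\mathbf{w}_0-s\sum_{i=1}^n2(\mathbf{w}_0^\top\mathbf{x}_i-y_i)\mathbf{x}_i=\sum_{i=1}^n\alpha_i^0\mathbf{x}_i-s\sum_{i=1}^n\gamma_i^0\mathbf{x}_i=\sum_{i=1}^n\alpha_i^1\mathbf{x}_i&&(\text{with }\alpha_i^1=\alpha_i^0-s\gamma_i^0)\\
\mathbf{w}_2&=\mathbf{w}_1-s\sum_{i=1}^n2(\mathbf{w}_1^\top\mathbf{x}_i-y_i)\mathbf{x}_i=\sum_{i=1}^n\alpha_i^1\mathbf{x}_i-s\sum_{i=1}^n\gamma_i^1\mathbf{x}_i=\sum_{i=1}^n\alpha_i^2\mathbf{x}_i&&(\text{with }\alpha_i^2=\alpha_i^1-s\gamma_i^1)\\
&\;\;\vdots\\
\mathbf{w}_t&=\mathbf{w}_{t-1}-s\sum_{i=1}^n2(\mathbf{w}_{t-1}^\top\mathbf{x}_i-y_i)\mathbf{x}_i=\sum_{i=1}^n\alpha_i^{t-1}\mathbf{x}_i-s\sum_{i=1}^n\gamma_i^{t-1}\mathbf{x}_i=\sum_{i=1}^n\alpha_i^t\mathbf{x}_i&&(\text{with }\alpha_i^t=\alpha_i^{t-1}-s\gamma_i^{t-1})
\end{aligned}$$
Formally, the argument is by induction. $\mathbf{w}$ is trivially a linear combination of our training vectors for $\mathbf{w}_0$ (base case). If we apply the inductive hypothesis for $\mathbf{w}_t$, it follows for $\mathbf{w}_{t+1}$.
The update rule for $\alpha_i^t$ is thus
$$\alpha_i^t=\alpha_i^{t-1}-s\gamma_i^{t-1},\quad\text{and we have}\quad\alpha_i^t=-s\sum_{r=0}^{t-1}\gamma_i^r.$$
In other words, we can perform the entire gradient descent update rule without ever expressing $\mathbf{w}$ explicitly. We just keep track of the $n$ coefficients $\alpha_1,\dots,\alpha_n$.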
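Now that $\mathbf{w}$ can be written as a linear combination of the training set, we can also express the inner product of $\mathbf{w}$ with any input $\mathbf{x}_j$ purely in terms of inner products between training inputs:
$$\mathbf{w}^\top\mathbf{x}_j=\sum_{i=1}^n\alpha_i\mathbf{x}_i^\top\mathbf{x}_j.$$
Consequently, we can also re-write the squared loss $\ell(\mathbf{w})=\sum_{i=1}^n(\mathbf{w}^\top\mathbf{x}_i-y_i)^2$ entirely in terms of inner products between training inputs:
$$\ell(\boldsymbol{\alpha})=\sum_{i=1}^n\left(\sum_{j=1}^n\alpha_j\mathbf{x}_j^\top\mathbf{x}_i-y_i\right)^2$$
During test-time we also only need these coefficients to make a prediction on a test input $\mathbf{x}_t$, and can write the entire classifier in terms of inner products between the test point and training points:
$$h(\mathbf{x}_t)=\mathbf{w}^\top\mathbf{x}_t=\sum_{j=1}^n\alpha_j\mathbf{x}_j^\top\mathbf{x}_t.$$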
Do you notice a theme? The only information we ever need in order to learn a hyper-plane classifier with the squared-loss is inner-products between all pairs of data vectors.
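As a sanity check of this argument, the following sketch runs gradient descent on the squared loss twice: once on the weight vector directly, and once on the coefficients of the training points using only pairwise inner products. The toy data, the step size 0.01, and all function names are assumptions chosen for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gd_on_weights(X, y, step, iters):
    """Plain gradient descent on w for the squared loss, starting at w = 0."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        gammas = [2.0 * (dot(w, xi) - yi) for xi, yi in zip(X, y)]
        grad = [sum(g * xi[k] for g, xi in zip(gammas, X)) for k in range(d)]
        w = [wk - step * gk for wk, gk in zip(w, grad)]
    return w


def gd_on_coefficients(X, y, step, iters):
    """The same descent, storing only one coefficient per training point."""
    n = len(X)
    # All pairwise inner products between training inputs.
    G = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(iters):
        # w^T x_i = sum_j alpha_j x_j^T x_i, so gamma_i needs only G.
        gammas = [2.0 * (dot(alpha, G[i]) - y[i]) for i in range(n)]
        alpha = [a - step * g for a, g in zip(alpha, gammas)]
    return alpha


X = [[1.0, 2.0], [2.0, 0.5], [-1.0, 1.0]]
y = [1.0, -1.0, 1.0]

w = gd_on_weights(X, y, step=0.01, iters=200)
alpha = gd_on_coefficients(X, y, step=0.01, iters=200)

# Rebuild w from the coefficients: w = sum_i alpha_i x_i.
w_from_alpha = [sum(a * xi[k] for a, xi in zip(alpha, X)) for k in range(2)]
print(w)
print(w_from_alpha)  # agrees with w up to floating point rounding
```

The second routine never touches the inputs again after the table of inner products has been filled in, which is exactly the property the kernel trick exploits.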
Inner-Product Computation
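Let's go back to the previous example, $\phi(\mathbf{x})=(1,\;x_1,\;\dots,\;x_d,\;x_1x_2,\;\dots,\;x_{d-1}x_d,\;\dots,\;x_1x_2\cdots x_d)^\top$.
The inner product $\phi(\mathbf{x})^\top\phi(\mathbf{z})$ can be formulated as:
$$\phi(\mathbf{x})^\top\phi(\mathbf{z})=1\cdot1+x_1z_1+x_2z_2+\cdots+x_1x_2z_1z_2+\cdots+x_1\cdots x_d\,z_1\cdots z_d=\prod_{k=1}^d(1+x_kz_k).$$
The sum of $2^d$ terms becomes the product of $d$ terms. We can compute the inner product from the above formula in time $O(d)$ instead of $O(2^d)$!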
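The identity can be checked numerically. The sketch below assumes the feature map contains the product of every subset of input coordinates, sums the matching terms one by one, and compares the result with the closed-form product over coordinates. Both function names are invented for this example.

```python
from itertools import combinations
from math import isclose, prod


def explicit_inner_product(x, z):
    """Exponential time: add up one term per subset of coordinates."""
    d = len(x)
    total = 0.0
    for size in range(d + 1):
        for idx in combinations(range(d), size):
            # Matching entries of the two expanded vectors, multiplied.
            total += prod(x[k] for k in idx) * prod(z[k] for k in idx)
    return total


def product_kernel(x, z):
    """Linear time: the same value as a product of d factors."""
    return prod(1.0 + xk * zk for xk, zk in zip(x, z))


x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]
print(explicit_inner_product(x, z), product_kernel(x, z))
```

For `d = 3` the slow version already loops over 8 subsets while the fast one multiplies 3 factors; the gap doubles with each extra feature.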
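We define the function
$$\mathsf{k}(\mathbf{x}_i,\mathbf{x}_j)=\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j),$$
which is called the kernel function.
With a finite training set of $n$ samples, inner products are often pre-computed and stored in a kernel matrix:
$$\mathsf{K}_{ij}=\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j).$$
If we store the matrix $\mathsf{K}$, we only need to do simple inner-product look-ups and low-dimensional computations throughout the gradient descent algorithm.
The final classifier becomes:
$$h(\mathbf{x}_t)=\sum_{j=1}^n\alpha_j\mathsf{k}(\mathbf{x}_j,\mathbf{x}_t).$$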
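During training in the new high dimensional space of $\phi(\mathbf{x})$ we want to compute $\gamma_i$ through kernels, without ever computing any $\phi(\mathbf{x}_i)$ or even $\mathbf{w}$. We previously established that $\mathbf{w}=\sum_{j=1}^n\alpha_j\phi(\mathbf{x}_j)$, and $\gamma_i=2(\mathbf{w}^\top\phi(\mathbf{x}_i)-y_i)$. It follows that $\gamma_i=2\left(\sum_{j=1}^n\alpha_j\mathsf{K}_{ij}-y_i\right)$. The gradient update in iteration $t+1$ becomes
$$\alpha_i^{t+1}\leftarrow\alpha_i^t-2s\left(\sum_{j=1}^n\alpha_j^t\mathsf{K}_{ij}-y_i\right).$$
As we have $n$ such updates to do, the amount of work per gradient update in the transformed space is $O(n^2)$, far better than $O(2^d)$.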
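Putting the pieces together, here is a rough sketch of kernelized gradient descent with the squared loss. It assumes the product-form kernel over coordinates from the running example, an XOR-style toy data set that no linear classifier separates, and an arbitrarily chosen step size of 0.05; `train` and `predict` are illustrative names, not part of any library.

```python
from math import prod


def kernel(x, z):
    """Product kernel: the inner product of the subset-product expansions."""
    return prod(1.0 + xk * zk for xk, zk in zip(x, z))


def train(X, y, step, iters):
    """Gradient descent on the coefficients, using only the kernel matrix."""
    n = len(X)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(iters):
        # gamma_i = 2 * (sum_j alpha_j K_ij - y_i); O(n) work per coefficient.
        gammas = [
            2.0 * (sum(a * k for a, k in zip(alpha, K[i])) - y[i])
            for i in range(n)
        ]
        alpha = [a - step * g for a, g in zip(alpha, gammas)]
    return alpha


def predict(alpha, X, x_test):
    """h(x_test) = sum_j alpha_j k(x_j, x_test)."""
    return sum(a * kernel(xj, x_test) for a, xj in zip(alpha, X))


# XOR-style labels: +1 when both coordinates share a sign, -1 otherwise.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [1.0, -1.0, -1.0, 1.0]

alpha = train(X, y, step=0.05, iters=100)
for xi, yi in zip(X, y):
    print(xi, yi, round(predict(alpha, X, xi), 6))
print(predict(alpha, X, [0.5, 0.5]), predict(alpha, X, [0.5, -0.5]))
```

The sign of `predict` gives the class label. Neither the expanded feature vectors nor the weight vector in the expanded space is ever formed.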
General Kernels
Below are some popular kernel functions:
Linear: $\mathsf{K}(\mathbf{x},\mathbf{z})=\mathbf{x}^\top\mathbf{z}$.
(The linear kernel is equivalent to just using a good old linear classifier - but it can be faster to use a kernel matrix if the dimensionality of the data is high.)
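Polynomial: $\mathsf{K}(\mathbf{x},\mathbf{z})=(1+\mathbf{x}^\top\mathbf{z})^p$, where $p$ is the degree of the polynomial.
Radial Basis Function (RBF) (aka Gaussian Kernel): $\mathsf{K}(\mathbf{x},\mathbf{z})=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}}$.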
The RBF kernel is the most popular kernel! It is a universal approximator! Its corresponding feature vector is infinite dimensional and cannot be computed explicitly. However, very effective low dimensional approximations exist (see this paper).
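Exponential Kernel: $\mathsf{K}(\mathbf{x},\mathbf{z})=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|}{2\sigma^2}}$
Laplacian Kernel: $\mathsf{K}(\mathbf{x},\mathbf{z})=e^{-\frac{|\mathbf{x}-\mathbf{z}|}{\sigma}}$
Sigmoid Kernel: $\mathsf{K}(\mathbf{x},\mathbf{z})=\tanh(a\,\mathbf{x}^\top\mathbf{z}+c)$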
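For concreteness, here is one way the first three kernels might be written in plain Python. The parameter names `p` and `sigma`, their default values, and the choice of dividing by `sigma ** 2` in the RBF kernel are assumptions; other texts and libraries parameterize the bandwidth differently.

```python
import math


def linear_kernel(x, z):
    """Plain inner product x^T z."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, p=2):
    """(1 + x^T z) ** p, with p the (assumed) degree parameter."""
    return (1.0 + linear_kernel(x, z)) ** p


def rbf_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / sigma^2); the bandwidth convention is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)


def kernel_matrix(kernel, X):
    """Pre-compute K[i][j] = kernel(X[i], X[j]) for a list of inputs."""
    return [[kernel(xi, xj) for xj in X] for xi in X]


X = [[1.0, 2.0], [3.0, 4.0], [0.0, -1.0]]
for name, k in [("linear", linear_kernel),
                ("polynomial", polynomial_kernel),
                ("rbf", rbf_kernel)]:
    print(name, kernel_matrix(k, X))
```

Every one of these takes time linear in the input dimension per pair of points, no matter how large (or infinite) the implied feature space is.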
Kernel functions
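Can any function $\mathsf{K}(\cdot,\cdot)$ be used as a kernel?
No, the matrix $\mathsf{K}(\mathbf{x}_i,\mathbf{x}_j)$ has to correspond to real inner products after some transformation $\mathbf{x}\rightarrow\phi(\mathbf{x})$. This is the case if and only if $\mathsf{K}$ is positive semi-definite.
Definition: A matrix $A\in\mathbb{R}^{n\times n}$ is positive semi-definite iff $\forall\mathbf{q}\in\mathbb{R}^n$, $\mathbf{q}^\top A\mathbf{q}\geq0$.
Remember $\mathsf{K}_{ij}=\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$. So $\mathsf{K}=\Phi^\top\Phi$, where $\Phi=[\phi(\mathbf{x}_1),\dots,\phi(\mathbf{x}_n)]$.
It follows that $\mathsf{K}$ is p.s.d., because $\mathbf{q}^\top\mathsf{K}\mathbf{q}=(\Phi\mathbf{q})^\top(\Phi\mathbf{q})=\|\Phi\mathbf{q}\|^2\geq0$. Conversely, if any matrix $A$ is p.s.d., it can be decomposed as $A=\Phi^\top\Phi$ for some realization of $\Phi$.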
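The definition suggests a quick numerical spot check: evaluate the quadratic form for many random vectors and look for a negative value. The sketch below does this for an RBF kernel matrix and for a function that is not a valid kernel (the negative squared distance, chosen here purely as a counterexample). Random sampling can expose a violation but cannot prove positive semi-definiteness; the data and helper names are assumptions.

```python
import math
import random


def quadratic_form(K, q):
    """Return q^T K q for a square matrix K given as a list of rows."""
    n = len(q)
    return sum(q[i] * K[i][j] * q[j] for i in range(n) for j in range(n))


def gram_matrix(kernel, X):
    return [[kernel(xi, xj) for xj in X] for xi in X]


def rbf(x, z, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / sigma ** 2)


def neg_sq_dist(x, z):
    """Not a valid kernel; used only as a counterexample."""
    return -sum((a - b) ** 2 for a, b in zip(x, z))


rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(6)]
samples = [[rng.gauss(0, 1) for _ in X] for _ in range(1000)]

K_rbf = gram_matrix(rbf, X)
smallest = min(quadratic_form(K_rbf, q) for q in samples)
print(smallest)  # never meaningfully below zero

K_bad = gram_matrix(neg_sq_dist, X)
ones = [1.0] * len(X)
print(quadratic_form(K_bad, ones))  # negative, so K_bad is not p.s.d.
```

With the all-ones vector, the quadratic form of the second matrix is minus the sum of all pairwise squared distances, which is negative whenever the points are not all identical.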
You can even define kernels over sets, strings, graphs and molecules.
Figure 1: The demo shows how a kernel function solves a problem that linear classifiers cannot solve. The RBF kernel fits the decision boundary well in this case.