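Kernels built by recursively combining one or more of the following rules are
called well-defined kernels:
1. $k(\mathbf{x},\mathbf{z}) = \mathbf{x}^\top \mathbf{z}$
2. $k(\mathbf{x},\mathbf{z}) = c\,k_1(\mathbf{x},\mathbf{z})$
3. $k(\mathbf{x},\mathbf{z}) = k_1(\mathbf{x},\mathbf{z}) + k_2(\mathbf{x},\mathbf{z})$
4. $k(\mathbf{x},\mathbf{z}) = g(k_1(\mathbf{x},\mathbf{z}))$
5. $k(\mathbf{x},\mathbf{z}) = k_1(\mathbf{x},\mathbf{z})\,k_2(\mathbf{x},\mathbf{z})$
6. $k(\mathbf{x},\mathbf{z}) = f(\mathbf{x})\,k_1(\mathbf{x},\mathbf{z})\,f(\mathbf{z})$
7. $k(\mathbf{x},\mathbf{z}) = e^{k_1(\mathbf{x},\mathbf{z})}$
8. $k(\mathbf{x},\mathbf{z}) = \mathbf{x}^\top A\,\mathbf{z}$
where $k_1, k_2$ are well-defined kernels, $c \ge 0$, $g$ is a polynomial function with positive coefficients, $f$ is any function, and $A \succeq 0$ is positive semi-definite.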
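A kernel being well-defined is equivalent to the corresponding kernel matrix, $K$, being
positive semi-definite (not proved here), which is equivalent to
any of the following statements:
1. All eigenvalues of $K$ are non-negative.
2. There exists a real matrix $P$ such that $K = P^\top P$.
3. For every real vector $\mathbf{x}$, $\mathbf{x}^\top K \mathbf{x} \ge 0$.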
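The eigenvalue criterion is easy to check numerically. The following is a minimal sketch, assuming NumPy and training points stored as rows of `X`; the helper name `is_psd` and the tolerance are illustrative choices, not part of the lecture.

```python
import numpy as np

def is_psd(K, tol=1e-8):
    """Return True if the symmetric matrix K has no significantly negative eigenvalue."""
    eigenvalues = np.linalg.eigvalsh(K)           # eigenvalues of a symmetric matrix
    scale = max(1.0, np.abs(eigenvalues).max())   # make the tolerance relative to the spectrum
    return bool(eigenvalues.min() >= -tol * scale)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 points in 3 dimensions, one per row

K_linear = X @ X.T                    # linear kernel: K[i, j] = x_i^T x_j
K_poly = (1.0 + K_linear) ** 3        # polynomial kernel with d = 3, applied entrywise

# A symmetric matrix that is NOT positive semi-definite (eigenvalues 3 and -1).
not_a_kernel = np.array([[1.0, 2.0],
                         [2.0, 1.0]])

print(is_psd(K_linear), is_psd(K_poly), is_psd(not_a_kernel))
```

Because floating-point round-off can push a zero eigenvalue slightly below zero, the check allows a small negative tolerance instead of comparing against exactly 0.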
It is trivial to prove that the linear kernel and the polynomial kernel with integer
$d$ are both well-defined kernels.
The RBF kernel
$$k(\mathbf{x},\mathbf{z}) = e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}}$$
is a well-defined kernel.
Quiz1: Prove it!
You can even define kernels on sets, strings, or molecules.
The following kernel is defined on any two sets $S_1, S_2 \subseteq \Omega$,
$$k(S_1,S_2) = e^{|S_1 \cap S_2|}.$$
Quiz2: Prove it!
List out all possible samples $\Omega$ and arrange them into a sorted list.
We define a vector $\mathbf{x}_S \in \{0,1\}^{|\Omega|}$,
where each of its elements indicates whether the corresponding sample
is included in the set $S$. It is easy to prove that
$$k(S_1,S_2) = e^{\mathbf{x}_{S_1}^\top \mathbf{x}_{S_2}},$$
which is a well-defined kernel by rules 1 and 7.
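The indicator-vector argument can be checked on a small example. This is a sketch using only the Python standard library; the universe of samples and the function names are made up for illustration.

```python
import math

def indicator_vector(subset, universe):
    """0/1 vector marking which elements of the sorted universe belong to subset."""
    return [1 if item in subset else 0 for item in universe]

def set_kernel(s1, s2):
    """k(S1, S2) = exp(|S1 intersect S2|), computed directly on the sets."""
    return math.exp(len(s1 & s2))

def set_kernel_via_vectors(s1, s2, universe):
    """The same kernel, computed as exp(x_S1^T x_S2) on indicator vectors."""
    x1 = indicator_vector(s1, universe)
    x2 = indicator_vector(s2, universe)
    inner_product = sum(a * b for a, b in zip(x1, x2))
    return math.exp(inner_product)

universe = sorted({"a", "b", "c", "d", "e"})   # hypothetical sample space
s1 = {"a", "b", "c"}
s2 = {"b", "c", "e"}

print(set_kernel(s1, s2), set_kernel_via_vectors(s1, s2, universe))
```

The inner product of two indicator vectors counts the positions where both are 1, which is exactly the size of the intersection, so both functions return $e^2$ here.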
Kernel Machines
(In practice) an algorithm can be kernelized in 2 steps:
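1. Prove that the solution lies in the span of the training points (i.e. $\mathbf{w} = \sum_{i=1}^n \alpha_i \mathbf{x}_i$ for some $\alpha_i$).
2. Rewrite the algorithm and the classifier so that all training or testing inputs $\mathbf{x}_i$ are only accessed in inner-products with other inputs, e.g. $\mathbf{x}_i^\top \mathbf{x}_j$.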
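We begin by expressing the solution $\mathbf{w}$ as a linear combination of the training inputs
$$\mathbf{w} = \sum_{i=1}^n \alpha_i \mathbf{x}_i = X\boldsymbol{\alpha},$$
where $X = [\mathbf{x}_1, \dots, \mathbf{x}_n]$ holds the training inputs as columns.
We derived in the previous lecture that such a vector $\boldsymbol{\alpha}$ must always exist by observing the gradient updates that occur if (5) is minimized with gradient descent and the initial vector is set to $\mathbf{w}_0 = \mathbf{0}$ (because the squared loss is convex, the solution is independent of its initialization).
Similarly, during testing a test point $\mathbf{z}$ is only accessed through inner-products with training inputs:
$$h(\mathbf{z}) = \mathbf{w}^\top \mathbf{z} = \sum_{i=1}^n \alpha_i \mathbf{x}_i^\top \mathbf{z}.$$
We can now immediately kernelize the algorithm by substituting $k(\mathbf{x},\mathbf{z})$ for any inner-product $\mathbf{x}^\top \mathbf{z}$.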
It remains to show that we can also solve for the values of $\boldsymbol{\alpha}$ in closed form. As it turns out, this is straightforward.
Kernelized ordinary least squares has the solution $\boldsymbol{\alpha} = K^{-1}\mathbf{y}$.
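One way to see this, sketched under the assumptions that $X = [\mathbf{x}_1, \dots, \mathbf{x}_n]$ holds the training inputs as columns, that $K = X^\top X$ (or its kernelized counterpart with entries $k(\mathbf{x}_i,\mathbf{x}_j)$), and that $K$ is invertible: substitute $\mathbf{w} = X\boldsymbol{\alpha}$ into the squared loss,

$$\sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i - y_i)^2 = \|X^\top \mathbf{w} - \mathbf{y}\|^2 = \|X^\top X \boldsymbol{\alpha} - \mathbf{y}\|^2 = \|K\boldsymbol{\alpha} - \mathbf{y}\|^2 \ge 0.$$

Choosing $\boldsymbol{\alpha} = K^{-1}\mathbf{y}$ gives $K\boldsymbol{\alpha} - \mathbf{y} = \mathbf{0}$, so the loss attains its lower bound of zero and this $\boldsymbol{\alpha}$ is a minimizer.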
Kernel regression can be extended to the kernelized version of ridge regression. The solution then becomes
$$\boldsymbol{\alpha} = (K + \tau^2 I)^{-1}\mathbf{y}.$$
In practice a small value of $\tau^2 > 0$ increases stability, especially if $K$ is not invertible. If $\tau = 0$, kernel ridge regression becomes kernelized ordinary least squares. Typically kernel ridge regression is also referred to as kernel regression.
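As a sanity check, the dual solution with a linear kernel should reproduce the ordinary (primal) ridge regression weights. The sketch below assumes NumPy, training inputs stored as columns of `X`, and a regularizer called `tau2`; the function name `kernel_ridge_fit` is hypothetical.

```python
import numpy as np

def kernel_ridge_fit(K, y, tau2):
    """Solve (K + tau2 * I) alpha = y for the dual weights alpha."""
    n = K.shape[0]
    return np.linalg.solve(K + tau2 * np.eye(n), y)

rng = np.random.default_rng(0)
d, n = 3, 8
X = rng.normal(size=(d, n))       # training inputs as columns (assumed layout)
y = rng.normal(size=n)
tau2 = 0.1

K = X.T @ X                       # linear kernel matrix, n x n
alpha = kernel_ridge_fit(K, y, tau2)
w_from_alpha = X @ alpha          # w = X alpha

# Primal ridge regression solution, computed directly for comparison.
w_primal = np.linalg.solve(X @ X.T + tau2 * np.eye(d), X @ y)

print(np.allclose(w_from_alpha, w_primal))
```

Here $n > d$, so the linear kernel matrix has rank at most $d$ and is not invertible; the system is solvable only because of the added $\tau^2 I$ term.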
Testing
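Remember that we defined $\mathbf{w} = X\boldsymbol{\alpha}$. The prediction of a test point $\mathbf{z}$ then becomes
$$h(\mathbf{z}) = \mathbf{z}^\top \mathbf{w} = \mathbf{z}^\top X \boldsymbol{\alpha},$$
which in kernelized form is $h(\mathbf{z}) = \mathbf{k}_*^\top \boldsymbol{\alpha}$,
or, if everything is in closed form:
$$h(\mathbf{z}) = \mathbf{k}_*^\top (K + \tau^2 I)^{-1}\mathbf{y},$$
where $\mathbf{k}_*$ is the kernel (vector) of the test point with the training points, i.e. the $i^{th}$ dimension corresponds to $[\mathbf{k}_*]_i = \phi(\mathbf{z})^\top \phi(\mathbf{x}_i)$, the inner-product between the test point $\mathbf{z}$ and the training point $\mathbf{x}_i$ after the mapping into feature space through $\phi$.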
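Training and testing can be put together in a few lines. This is a minimal sketch with an RBF kernel, assuming NumPy; note that it stores points as rows (the usual NumPy layout, transposed relative to the text), and the names `rbf_kernel` and `krr_predict` as well as the toy sine data are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Pairwise k(a, b) = exp(-||a - b||^2 / sigma^2) between rows of A and rows of B."""
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / sigma ** 2)  # clip round-off negatives

def krr_predict(X_train, y_train, X_test, tau2=1e-3, sigma=1.0):
    """Kernel ridge regression: h(z) = k_*^T (K + tau2 * I)^{-1} y for each test row z."""
    K = rbf_kernel(X_train, X_train, sigma)
    alpha = np.linalg.solve(K + tau2 * np.eye(len(y_train)), y_train)
    K_star = rbf_kernel(X_test, X_train, sigma)   # row j is k_* for test point j
    return K_star @ alpha

# Toy 1-D regression problem: noiseless samples of sin(x).
X_train = np.linspace(0.0, 2.0 * np.pi, 30).reshape(-1, 1)
y_train = np.sin(X_train).ravel()
X_test = np.array([[1.0], [2.5], [4.0]])

pred = krr_predict(X_train, y_train, X_test)
print(pred, np.sin(X_test).ravel())
```

The training inputs enter only through `K` and `K_star`, so swapping in a different kernel function changes the model without touching the solver.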
Nearest Neighbors
Quiz 3: Let $D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$. How can you kernelize nearest neighbors (with Euclidean distances)?
Kernel SVM
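The original, primal SVM is a quadratic programming problem:
$$\begin{aligned}
&\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^\top \mathbf{w} + C\sum_{i=1}^n \xi_i \\
&\text{s.t. } \forall i:\ y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0
\end{aligned}$$
It has the dual form
$$\begin{aligned}
&\min_{\alpha_1,\dots,\alpha_n}\ \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K_{ij} - \sum_{i=1}^n \alpha_i \\
&\text{s.t. } 0 \le \alpha_i \le C,\quad \sum_{i=1}^n \alpha_i y_i = 0
\end{aligned}$$
where $\mathbf{w} = \sum_{i=1}^n \alpha_i y_i \phi(\mathbf{x}_i)$ (although this is never computed) and
$$h(\mathbf{x}) = \textrm{sign}\left(\sum_{i=1}^n \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b\right).$$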
Support Vectors
There is a very nice interpretation of the dual problem in terms of support vectors. For the primal formulation we know (from a previous lecture) that only support vectors satisfy the constraint with equality:
$$y_i(\mathbf{w}^\top \phi(\mathbf{x}_i) + b) = 1.$$
In the dual, these same training inputs can be identified as their corresponding dual values satisfy $\alpha_i > 0$ (all other training inputs have $\alpha_i = 0$).
At test time you only need to compute the sum in $h(\mathbf{x})$ over the support vectors, and all inputs $\mathbf{x}_i$ with $\alpha_i = 0$ can be discarded after training.
Recovering $b$
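One apparent problem with the dual version is that $b$ is no longer part of the optimization. However, we need it to perform classification. Luckily, we know that the primal solution and the dual solution are identical.
In the dual, support vectors are those with $\alpha_i > 0$; for those with $0 < \alpha_i < C$ the slack $\xi_i$ is zero, so the margin constraint holds with equality.
We can then solve the following equation for $b$ (using $y_i^2 = 1$):
$$\begin{aligned}
y_i(\mathbf{w}^\top \phi(\mathbf{x}_i) + b) &= 1 \\
y_i\Big(\sum_{j} y_j \alpha_j k(\mathbf{x}_j, \mathbf{x}_i) + b\Big) &= 1 \\
y_i - \sum_{j} y_j \alpha_j k(\mathbf{x}_j, \mathbf{x}_i) &= b
\end{aligned}$$
This allows us to solve for $b$ from the support vectors (in practice it is best to average the $b$ from several support vectors, as there may be numerical precision problems).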
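The averaging step might look as follows. This is a sketch assuming NumPy and a dual solution `alpha` that is already available; the helper `recover_bias` is hypothetical, and the toy dual solution was worked out by hand for a two-point problem with a linear kernel rather than produced by a QP solver.

```python
import numpy as np

def recover_bias(alpha, y, K, C, tol=1e-8):
    """Average b = y_i - sum_j y_j alpha_j k(x_j, x_i) over support vectors with 0 < alpha_i < C."""
    on_margin = (alpha > tol) & (alpha < C - tol)
    # Column i of K holds k(x_j, x_i) for all j, so this evaluates the sum for each selected i.
    b_values = y[on_margin] - (alpha * y) @ K[:, on_margin]
    return b_values.mean()

# Toy 1-D problem with a linear kernel: x = 3 has label +1, x = 1 has label -1.
X = np.array([[3.0], [1.0]])      # one training point per row
y = np.array([1.0, -1.0])
K = X @ X.T
alpha = np.array([0.5, 0.5])      # dual optimum for this toy problem, derived by hand
C = 10.0

b = recover_bias(alpha, y, K, C)
w = (alpha * y) @ X               # only possible explicitly because the kernel is linear
print(w, b)
```

For this example both support vectors give the same value, $w = 1$ and $b = -2$, which places the decision boundary at $x = 2$, halfway between the two points.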
Quiz: What is the dual form of the hard-margin SVM?
Kernel SVM - the smart nearest neighbor
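Do you remember the k-nearest neighbor algorithm? For binary classification problems ($y_i \in \{+1,-1\}$), we can write the decision function for a test point $\mathbf{z}$ as
$$h(\mathbf{z}) = \textrm{sign}\left(\sum_{i=1}^n y_i\, \delta^{nn}(\mathbf{x}_i, \mathbf{z})\right),$$
where $\delta^{nn}(\mathbf{x}_i, \mathbf{z}) \in \{0,1\}$ with $\delta^{nn}(\mathbf{x}_i, \mathbf{z}) = 1$ only if $\mathbf{x}_i$ is one of the $k$ nearest neighbors of test point $\mathbf{z}$.
The SVM decision function
$$h(\mathbf{z}) = \textrm{sign}\left(\sum_{i=1}^n y_i \alpha_i k(\mathbf{x}_i, \mathbf{z}) + b\right)$$
is very similar, but instead of limiting the decision to the $k$ nearest neighbors, it considers all training points, while the kernel function assigns more weight to those that are closer (large $k(\mathbf{z}, \mathbf{x}_i)$). In some sense you can view the RBF kernel as a soft nearest neighbor assignment, as the exponential decay with distance will assign almost no weight to all but the neighboring points of $\mathbf{z}$.
The Kernel SVM algorithm also learns a weight $\alpha_i \ge 0$ for each training point and a bias $b$, and it essentially "removes" useless training points by setting many $\alpha_i = 0$.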
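The analogy can be made concrete by computing both kinds of weights for the same test point. The sketch below assumes NumPy; the data, the function names, and the uniform `alpha` are placeholders for illustration (in particular, `alpha` is not the output of a trained SVM).

```python
import numpy as np

def knn_decision(X_train, y_train, z, k=3):
    """sign(sum_i y_i * delta_nn(x_i, z)), where delta_nn = 1 for the k nearest x_i, else 0."""
    dists = np.linalg.norm(X_train - z, axis=1)
    delta_nn = np.zeros(len(y_train))
    delta_nn[np.argsort(dists)[:k]] = 1.0     # hard 0/1 neighbor assignment
    return np.sign(np.sum(y_train * delta_nn))

def rbf_weights(X_train, z, sigma=1.0):
    """Soft neighbor weights k(x_i, z) = exp(-||x_i - z||^2 / sigma^2)."""
    return np.exp(-np.sum((X_train - z) ** 2, axis=1) / sigma ** 2)

def soft_decision(X_train, y_train, z, alpha, b=0.0, sigma=1.0):
    """sign(sum_i y_i * alpha_i * k(x_i, z) + b), the form of the SVM decision function."""
    return np.sign(np.sum(y_train * alpha * rbf_weights(X_train, z, sigma)) + b)

X_train = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
                    [4.0, 4.0], [4.5, 4.0], [4.0, 4.5]])
y_train = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
z = np.array([0.2, 0.2])
alpha = np.ones(len(y_train))     # placeholder weights, NOT a trained SVM solution

weights = rbf_weights(X_train, z)
print(knn_decision(X_train, y_train, z), soft_decision(X_train, y_train, z, alpha))
print(np.round(weights, 3))
```

The three nearby points receive weights close to 1 while the distant cluster receives weights that are numerically negligible, so the soft vote agrees with the 3-nearest-neighbor vote on this example.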