Kernels continued

Cornell CS 4/5780

Fall 2022


Well-defined kernels

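Kernels built by recursively combining one or more of the following rules are called well-defined kernels:
  1. $k(\mathbf{x},\mathbf{z}) = \mathbf{x}^\top\mathbf{z}$
  2. $k(\mathbf{x},\mathbf{z}) = c\,k_1(\mathbf{x},\mathbf{z})$
  3. $k(\mathbf{x},\mathbf{z}) = k_1(\mathbf{x},\mathbf{z}) + k_2(\mathbf{x},\mathbf{z})$
  4. $k(\mathbf{x},\mathbf{z}) = g(k_1(\mathbf{x},\mathbf{z}))$
  5. $k(\mathbf{x},\mathbf{z}) = k_1(\mathbf{x},\mathbf{z})\,k_2(\mathbf{x},\mathbf{z})$
  6. $k(\mathbf{x},\mathbf{z}) = f(\mathbf{x})\,k_1(\mathbf{x},\mathbf{z})\,f(\mathbf{z})$
  7. $k(\mathbf{x},\mathbf{z}) = e^{k_1(\mathbf{x},\mathbf{z})}$
  8. $k(\mathbf{x},\mathbf{z}) = \mathbf{x}^\top\mathbf{A}\mathbf{z}$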
where $k_1, k_2$ are well-defined kernels, $c \geq 0$, $g$ is a polynomial function with positive coefficients, $f$ is any function and $\mathbf{A} \succeq 0$ is positive semi-definite. A kernel being well-defined is equivalent to the corresponding kernel matrix, $\mathbf{K}$, being positive semi-definite (not proved here), which is equivalent to any of the following statements:
  1. All eigenvalues of $\mathbf{K}$ are non-negative.
  2. $\exists$ real matrix $\mathbf{P}$ s.t. $\mathbf{K} = \mathbf{P}^\top\mathbf{P}$.
  3. $\forall$ real vector $\mathbf{x}$, $\mathbf{x}^\top\mathbf{K}\mathbf{x} \geq 0$.
It is straightforward to show that the linear kernel and the polynomial kernel with integer degree $d$ are both well-defined kernels.
The RBF kernel $k(\mathbf{x},\mathbf{z}) = e^{-\frac{(\mathbf{x}-\mathbf{z})^2}{\sigma^2}}$ is also a well-defined kernel.
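
The equivalent statements about the kernel matrix can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated inputs; the helper name `rbf_kernel_matrix`, the sample size and the value of `sigma` are illustrative choices, not part of the notes. It builds an RBF kernel matrix and tests statements 1 and 3 on it. A numerical check like this is only a sanity test on one data set, not a proof.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / sigma^2); the columns of X are the inputs."""
    sq_norms = np.sum(X ** 2, axis=0)
    sq_dists = sq_norms[:, None] - 2.0 * X.T @ X + sq_norms[None, :]
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / sigma ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 20))           # d = 3 features, n = 20 inputs (assumed data)
K = rbf_kernel_matrix(X, sigma=1.5)

# Statement 1: all eigenvalues are non-negative (eigvalsh is for symmetric matrices).
min_eigenvalue = np.linalg.eigvalsh(K).min()

# Statement 3: v^T K v >= 0, tested here for one random vector v.
v = rng.normal(size=20)
quadratic_form = v @ K @ v

print(min_eigenvalue, quadratic_form)
```

Because of floating-point round-off, the smallest eigenvalue should be compared against a small negative tolerance rather than exactly zero.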

Quiz1: Prove it!

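$$\begin{aligned}
k_1(\mathbf{x},\mathbf{z}) &= \mathbf{x}^\top\mathbf{z} && \text{well defined by rule 1}\\
k_2(\mathbf{x},\mathbf{z}) &= \frac{2}{\sigma^2}k_1(\mathbf{x},\mathbf{z}) = \frac{2}{\sigma^2}\mathbf{x}^\top\mathbf{z} && \text{well defined by rule 2}\\
k_3(\mathbf{x},\mathbf{z}) &= e^{k_2(\mathbf{x},\mathbf{z})} = e^{\frac{2\mathbf{x}^\top\mathbf{z}}{\sigma^2}} && \text{well defined by rule 7}\\
k_4(\mathbf{x},\mathbf{z}) &= e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}}\,k_3(\mathbf{x},\mathbf{z})\,e^{-\frac{\mathbf{z}^\top\mathbf{z}}{\sigma^2}} = e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}}\,e^{\frac{2\mathbf{x}^\top\mathbf{z}}{\sigma^2}}\,e^{-\frac{\mathbf{z}^\top\mathbf{z}}{\sigma^2}} && \text{well defined by rule 6 with } f(\mathbf{x}) = e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}}\\
&= e^{\frac{-\mathbf{x}^\top\mathbf{x} + 2\mathbf{x}^\top\mathbf{z} - \mathbf{z}^\top\mathbf{z}}{\sigma^2}} = e^{-\frac{(\mathbf{x}-\mathbf{z})^2}{\sigma^2}} = k_{RBF}(\mathbf{x},\mathbf{z})
\end{aligned}$$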
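
As a quick sanity check of this construction, the sketch below evaluates each intermediate kernel for one pair of random vectors and compares the result with the RBF formula computed directly. NumPy, the random inputs and the value of `sigma` are assumptions made for illustration.

```python
import numpy as np

sigma = 1.5
rng = np.random.default_rng(1)
x, z = rng.normal(size=4), rng.normal(size=4)   # two arbitrary inputs (assumed data)

k1 = x @ z                          # rule 1: inner product
k2 = (2.0 / sigma ** 2) * k1        # rule 2: scale by c = 2 / sigma^2 >= 0
k3 = np.exp(k2)                     # rule 7: exponentiate a kernel

def f(v):
    """The function f used in rule 6."""
    return np.exp(-(v @ v) / sigma ** 2)

k4 = f(x) * k3 * f(z)               # rule 6: f(x) k3(x, z) f(z)

# RBF kernel evaluated directly from its definition
k_rbf = np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

print(k4, k_rbf)
```

The two printed values should agree up to floating-point error.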

You can even define kernels of sets, or strings or molecules.

The following kernel is defined on any two sets $S_1, S_2 \subseteq \Omega$: $k(S_1,S_2) = e^{|S_1 \cap S_2|}$.

Quiz2: Prove it!

List all possible samples in $\Omega$ and arrange them into a sorted list. We define a vector $\mathbf{x}_S \in \{0,1\}^{|\Omega|}$, where each element indicates whether the corresponding sample is included in the set $S$. Since $\mathbf{x}_{S_1}^\top\mathbf{x}_{S_2}$ counts the samples contained in both sets, $k(S_1,S_2) = e^{\mathbf{x}_{S_1}^\top\mathbf{x}_{S_2}}$, which is a well-defined kernel by rules 1 and 7.
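
The indicator-vector argument can be illustrated on a tiny example. This is a minimal sketch; the universe of five letters, the two sets and the helper `indicator_vector` are hypothetical and chosen only for illustration.

```python
import math
import numpy as np

def indicator_vector(subset, universe):
    """Binary vector with a 1 at every position whose item belongs to `subset`."""
    return np.array([1.0 if item in subset else 0.0 for item in universe])

universe = sorted({"a", "b", "c", "d", "e"})   # Omega, arranged as a sorted list
S1 = {"a", "b", "c"}
S2 = {"b", "c", "e"}

x_S1 = indicator_vector(S1, universe)
x_S2 = indicator_vector(S2, universe)

k_direct = math.exp(len(S1 & S2))   # e^{|S1 ∩ S2|}
k_vector = math.exp(x_S1 @ x_S2)    # e^{x_S1^T x_S2}

print(k_direct, k_vector)
```

Both expressions evaluate the same kernel, because the inner product of two indicator vectors counts the elements the sets have in common.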

Kernel Machines

(In practice) an algorithm can be kernelized in 3 steps:

  1. Prove that the solution lies in the span of the training points (i.e. $\mathbf{w} = \sum_{i=1}^n \alpha_i\mathbf{x}_i$ for some $\alpha_i$).
  2. Rewrite the algorithm and the classifier so that all training or testing inputs $\mathbf{x}_i$ are only accessed in inner-products with other inputs, e.g. $\mathbf{x}_i^\top\mathbf{x}_j$.
  3. Define a kernel function and substitute $k(\mathbf{x}_i,\mathbf{x}_j)$ for $\mathbf{x}_i^\top\mathbf{x}_j$.

Kernelized Linear Regression

Recap

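Vanilla Ordinary Least Squares Regression (OLS) [also referred to as linear regression] minimizes the squared loss, $\min_{\mathbf{w}} \sum_{i=1}^n (\mathbf{w}^\top\mathbf{x}_i - y_i)^2$, to find the hyperplane $\mathbf{w}$. The prediction at a test point is simply $h(\mathbf{x}) = \mathbf{w}^\top\mathbf{x}$.
If we let $\mathbf{X} = [\mathbf{x}_1,\dots,\mathbf{x}_n]$ and $\mathbf{y} = [y_1,\dots,y_n]^\top$, the solution of OLS can be written in closed form: $$\mathbf{w} = (\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y} \tag{5}$$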
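
The closed-form solution can be sketched in a few lines of NumPy. The synthetic data, the ground-truth vector `w_true` and the noise level below are assumptions made purely for illustration; the inputs are stored as the columns of `X`, matching the convention of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))              # columns are the training inputs x_i
w_true = np.array([1.0, -2.0, 0.5])      # hypothetical ground-truth weights
y = X.T @ w_true + 0.01 * rng.normal(size=n)

# w = (X X^T)^{-1} X y, computed with a linear solve instead of an explicit inverse
w = np.linalg.solve(X @ X.T, X @ y)

def h(x):
    """Prediction at a test point: h(x) = w^T x."""
    return w @ x

print(w)
```

Solving the linear system is generally preferable to forming the inverse explicitly, since it is cheaper and numerically more stable.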

Kernelization

We begin by expressing the solution $\mathbf{w}$ as a linear combination of the training inputs: $\mathbf{w} = \sum_{i=1}^n \alpha_i\mathbf{x}_i = \mathbf{X}\boldsymbol{\alpha}$. We derived in the previous lecture that such a vector $\boldsymbol{\alpha}$ must always exist by observing the gradient updates that occur if the squared loss above is minimized with gradient descent and the initial vector is set to $\mathbf{w}_0 = \mathbf{0}$ (because the squared loss is convex, the solution is independent of its initialization).

Similarly, during testing a test point is only accessed through inner-products with training inputs: $h(\mathbf{z}) = \mathbf{w}^\top\mathbf{z} = \sum_{i=1}^n \alpha_i\mathbf{x}_i^\top\mathbf{z}$. We can now immediately kernelize the algorithm by substituting $k(\mathbf{x},\mathbf{z})$ for any inner-product $\mathbf{x}^\top\mathbf{z}$. It remains to show that we can also solve for the values of $\boldsymbol{\alpha}$ in closed form. As it turns out, this is straightforward.

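Kernelized ordinary least squares has the solution $\boldsymbol{\alpha} = \mathbf{K}^{-1}\mathbf{y}$.
$$\begin{aligned}
\mathbf{X}\boldsymbol{\alpha} = \mathbf{w} &= (\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y} && \big|\ \text{multiply from the left by } \mathbf{X}^\top\mathbf{X}\mathbf{X}^\top\\
(\mathbf{X}^\top\mathbf{X})(\mathbf{X}^\top\mathbf{X})\boldsymbol{\alpha} &= \mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top)^{-1})\mathbf{X}\mathbf{y} && \big|\ \text{substitute } \mathbf{K} = \mathbf{X}^\top\mathbf{X}\\
\mathbf{K}^2\boldsymbol{\alpha} &= \mathbf{K}\mathbf{y} && \big|\ \text{multiply from the left by } (\mathbf{K}^{-1})^2\\
\boldsymbol{\alpha} &= \mathbf{K}^{-1}\mathbf{y}
\end{aligned}$$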

Kernel regression can be extended to the kernelized version of ridge regression. The solution then becomes $\boldsymbol{\alpha} = (\mathbf{K} + \tau^2\mathbf{I})^{-1}\mathbf{y}$. In practice a small value of $\tau^2 > 0$ increases stability, especially if $\mathbf{K}$ is not invertible. If $\tau = 0$, kernel ridge regression becomes kernelized ordinary least squares. Kernel ridge regression is typically also referred to as kernel regression.
Testing

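Remember that we defined $\mathbf{w} = \mathbf{X}\boldsymbol{\alpha}$. The prediction of a test point $\mathbf{z}$ then becomes $$h(\mathbf{z}) = \mathbf{z}^\top\mathbf{w} = \mathbf{z}^\top\underbrace{\mathbf{X}\boldsymbol{\alpha}}_{\mathbf{w}} = \underbrace{\mathbf{k}_*}_{\mathbf{z}^\top\mathbf{X}}\underbrace{(\mathbf{K} + \tau^2\mathbf{I})^{-1}\mathbf{y}}_{\boldsymbol{\alpha}} = \mathbf{k}_*\boldsymbol{\alpha},$$ or, if everything is in closed form: $h(\mathbf{z}) = \mathbf{k}_*(\mathbf{K} + \tau^2\mathbf{I})^{-1}\mathbf{y}$, where $\mathbf{k}_*$ is the kernel (vector) of the test point with the training points, i.e. the $i^{th}$ dimension corresponds to $[\mathbf{k}_*]_i = \phi(\mathbf{z})^\top\phi(\mathbf{x}_i)$, the inner-product between the test point $\mathbf{z}$ and the training point $\mathbf{x}_i$ after the mapping into feature space through $\phi$.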
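
Training and testing for kernel ridge regression can be sketched as follows. This is a minimal illustration, assuming NumPy, random data, a linear kernel and an arbitrary value `tau2` for $\tau^2$; the helper name `linear_kernel` is hypothetical. The linear kernel is chosen so that the kernelized prediction can be compared against the primal ridge solution $\mathbf{w} = (\mathbf{X}\mathbf{X}^\top + \tau^2\mathbf{I})^{-1}\mathbf{X}\mathbf{y}$, a standard formula that is not stated in these notes.

```python
import numpy as np

def linear_kernel(A, B):
    """Matrix of inner products a_i^T b_j between the columns of A and B."""
    return A.T @ B

rng = np.random.default_rng(0)
d, n = 3, 30
X = rng.normal(size=(d, n))     # columns are the training inputs (assumed data)
y = rng.normal(size=n)
tau2 = 0.1                      # tau^2, an arbitrary illustrative value

# Training: alpha = (K + tau^2 I)^{-1} y
K = linear_kernel(X, X)
alpha = np.linalg.solve(K + tau2 * np.eye(n), y)

# Testing: h(z) = k_* alpha with [k_*]_i = k(z, x_i)
z = rng.normal(size=d)
k_star = linear_kernel(z[:, None], X).ravel()
prediction_kernel = k_star @ alpha

# Primal ridge regression, for comparison (only possible for the linear kernel)
w = np.linalg.solve(X @ X.T + tau2 * np.eye(d), X @ y)
prediction_primal = w @ z

print(prediction_kernel, prediction_primal)
```

Swapping `linear_kernel` for any other well-defined kernel leaves the training and testing lines unchanged; only the comparison with the primal solution no longer applies, since $\mathbf{w}$ then lives in feature space.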

Nearest Neighbors

Quiz 3: Let $D = \{(\mathbf{x}_1,y_1),\dots,(\mathbf{x}_n,y_n)\}$. How can you kernelize nearest neighbors (with Euclidean distances)?

Kernel SVM

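The original, primal SVM is a quadratic programming problem:
$$\begin{aligned}
&\min_{\mathbf{w},b}\ \mathbf{w}^\top\mathbf{w} + C\sum_{i=1}^n \xi_i\\
&\text{s.t. } \forall i,\quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i,\quad \xi_i \geq 0
\end{aligned}$$
It has the dual form
$$\begin{aligned}
&\min_{\alpha_1,\dots,\alpha_n}\ \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K_{ij} - \sum_{i=1}^n \alpha_i\\
&\text{s.t. } 0 \leq \alpha_i \leq C,\quad \sum_{i=1}^n \alpha_i y_i = 0
\end{aligned}$$
where $\mathbf{w} = \sum_{i=1}^n \alpha_i y_i\phi(\mathbf{x}_i)$ (although this is never computed) and $h(\mathbf{x}) = \textrm{sign}\left(\sum_{i=1}^n \alpha_i y_i k(\mathbf{x}_i,\mathbf{x}) + b\right)$.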

Support Vectors
There is a very nice interpretation of the dual problem in terms of support vectors. For the primal formulation we know (from a previous lecture) that only support vectors satisfy the constraint with equality: $y_i(\mathbf{w}^\top\phi(\mathbf{x}_i) + b) = 1$. In the dual, these same training inputs can be identified because their corresponding dual values satisfy $\alpha_i > 0$ (all other training inputs have $\alpha_i = 0$). At test time you only need to compute the sum in $h(\mathbf{x})$ over the support vectors, and all inputs $\mathbf{x}_i$ with $\alpha_i = 0$ can be discarded after training.

Recovering b

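One apparent problem with the dual version is that $b$ is no longer part of the optimization. However, we need it to perform classification. Luckily, we know that the primal solution and the dual solution are identical. In the dual, support vectors are those with $\alpha_i > 0$ (with slack variables, use a support vector with $0 < \alpha_i < C$, for which $\xi_i = 0$ and the constraint holds with equality). We can then solve the following equation for $b$:
$$\begin{aligned}
y_i(\mathbf{w}^\top\phi(\mathbf{x}_i) + b) &= 1\\
y_i\Big(\sum_j y_j\alpha_j k(\mathbf{x}_j,\mathbf{x}_i) + b\Big) &= 1\\
y_i - \sum_j y_j\alpha_j k(\mathbf{x}_j,\mathbf{x}_i) &= b
\end{aligned}$$
The last step uses $y_i \in \{+1,-1\}$, so $1/y_i = y_i$. This allows us to solve for $b$ from the support vectors (in practice it is best to average the $b$ from several support vectors, as there may be numerical precision problems).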
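
Recovering $b$ and classifying with support vectors only can be sketched on a toy problem. Everything specific below is an assumption made for illustration: three one-dimensional inputs, a linear kernel, $C = 1$, and a dual solution `alpha` that was derived by hand for this data set rather than produced by a solver. Only support vectors with $0 < \alpha_i < C$ are used to compute $b$, since those lie exactly on the margin.

```python
import numpy as np

X = np.array([[1.0, -1.0, 3.0]])    # d x n with d = 1; columns are x_1, x_2, x_3
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.5, 0.5, 0.0])   # assumed dual solution, hand-derived for this toy set
C = 1.0

K = X.T @ X                          # linear kernel matrix, K[i, j] = k(x_i, x_j)

# Support vectors strictly inside (0, C) satisfy the margin constraint with equality.
on_margin = (alpha > 1e-8) & (alpha < C - 1e-8)

# b = y_i - sum_j y_j alpha_j k(x_j, x_i), averaged over those support vectors
b_per_sv = y[on_margin] - K[on_margin] @ (alpha * y)
b = b_per_sv.mean()

def h(z):
    """Classify z using only the support vectors (alpha_i > 0)."""
    sv = alpha > 1e-8
    k_z = X[:, sv].T @ z             # k(x_i, z) for each support vector
    return np.sign((alpha[sv] * y[sv]) @ k_z + b)

print(b, h(np.array([2.0])), h(np.array([-0.5])))
```

The third training point has $\alpha_3 = 0$ and never enters `h`, which is the sense in which non-support vectors can be discarded after training.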



Quiz: What is the dual form of the hard-margin SVM?

Kernel SVM - the smart nearest neighbor

Do you remember the k-nearest neighbor algorithm? For binary classification problems ($y_i \in \{+1,-1\}$), we can write the decision function for a test point $\mathbf{z}$ as $$h(\mathbf{z}) = \textrm{sign}\left(\sum_{i=1}^n y_i\,\delta^{nn}(\mathbf{x}_i,\mathbf{z})\right),$$ where $\delta^{nn}(\mathbf{z},\mathbf{x}_i) \in \{0,1\}$ with $\delta^{nn}(\mathbf{z},\mathbf{x}_i) = 1$ only if $\mathbf{x}_i$ is one of the $k$ nearest neighbors of test point $\mathbf{z}$. The SVM decision function $$h(\mathbf{z}) = \textrm{sign}\left(\sum_{i=1}^n y_i\alpha_i k(\mathbf{x}_i,\mathbf{z}) + b\right)$$ is very similar, but instead of limiting the decision to the $k$ nearest neighbors, it considers all training points, and the kernel function assigns more weight to those that are closer (large $k(\mathbf{z},\mathbf{x}_i)$). In some sense you can view the RBF kernel as a soft nearest neighbor assignment, as the exponential decay with distance will assign almost no weight to all but the neighboring points of $\mathbf{z}$. The Kernel SVM algorithm also learns a weight $\alpha_i \geq 0$ for each training point and a bias $b$, and it essentially "removes" useless training points by setting many $\alpha_i = 0$.
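
The contrast between the hard neighbor indicator and the soft RBF weighting can be made concrete with a small sketch. The five one-dimensional training points, the test point, the number of neighbors and `sigma` are hypothetical values chosen for illustration. No SVM is trained here: the code only compares the per-point weights $\delta^{nn}(\mathbf{x}_i,\mathbf{z})$ and $k(\mathbf{x}_i,\mathbf{z})$, leaving out the learned $\alpha_i$ and $b$.

```python
import numpy as np

X = np.array([[0.0, 1.0, 2.0, 5.0, 6.0]])   # d x n with d = 1 (assumed data)
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
z = np.array([0.8])                          # test point
k_neighbors, sigma = 3, 1.0

sq_dists = np.sum((X - z[:, None]) ** 2, axis=0)   # squared distance from z to each x_i

# Hard assignment: delta_nn is 1 for the k nearest neighbors and 0 otherwise.
delta_nn = np.zeros_like(sq_dists)
delta_nn[np.argsort(sq_dists)[:k_neighbors]] = 1.0
h_knn = np.sign(y @ delta_nn)

# Soft assignment: RBF kernel values decay exponentially with squared distance.
rbf_weights = np.exp(-sq_dists / sigma ** 2)

print(delta_nn, h_knn)
print(np.round(rbf_weights, 4))
```

The two distant points receive RBF weights that are numerically close to zero, so they barely influence a kernel-weighted vote even though they are never explicitly excluded.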