Quiz 1: Prove it!
You can even define kernels over sets, strings, or molecules.
Quiz 2: Prove it!
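As one concrete illustration (an example added here, not taken from the lecture): the intersection kernel over finite sets, $k(A,B)=|A\cap B|$, is a valid kernel because it equals the inner product of the two sets' 0/1 indicator vectors over the universe of items. A minimal sketch:

```python
def intersection_kernel(A: set, B: set) -> int:
    """k(A, B) = |A ∩ B|, i.e. the inner product of the two sets'
    binary indicator vectors -- hence a positive semi-definite kernel."""
    return len(A & B)

# usage: two "documents" represented as sets of words
print(intersection_kernel({"kernel", "svm", "margin"}, {"kernel", "regression"}))  # 1
```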
An algorithm can be kernelized in two steps:
1. Prove that the optimal solution lies in the span of the training inputs, i.e. that $\mathbf{w}=\sum_{i=1}^{n}\alpha_i\mathbf{x}_i$ for some $\alpha_1,\dots,\alpha_n$.
2. Rewrite the algorithm and the classifier so that all training and test inputs are accessed only through inner products with other inputs, and then substitute the kernel function $k(\mathbf{x},\mathbf{z})$ for every inner product $\mathbf{x}^\top\mathbf{z}$.
We illustrate this recipe on linear regression. Vanilla Ordinary Least Squares Regression (OLS), also referred to as linear regression, minimizes the following squared loss function,
$$\min_{\mathbf{w}}\sum_{i=1}^{n}\left(\mathbf{w}^\top\mathbf{x}_i-y_i\right)^2,$$
to find the hyperplane $\mathbf{w}$. The prediction at a test point is simply $h(\mathbf{x})=\mathbf{w}^\top\mathbf{x}$.
If we let $\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_n]$ and $\mathbf{y}=[y_1,\dots,y_n]^\top$, the OLS solution can be written in closed form:
$$\mathbf{w}=\left(\mathbf{X}\mathbf{X}^\top\right)^{-1}\mathbf{X}\mathbf{y}\tag{5}$$
We begin by expressing the solution $\mathbf{w}$ as a linear combination of the training inputs, $\mathbf{w}=\sum_{i=1}^{n}\alpha_i\mathbf{x}_i=\mathbf{X}\vec{\alpha}$. We derived in the previous lecture that such a vector $\vec{\alpha}$ must always exist, by observing the gradient updates that occur when (5) is minimized with gradient descent and the initial vector is set to $\mathbf{w}^0=\vec{0}$ (because the squared loss is convex, the solution is independent of the initialization).
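To recap that argument in one line (a sketch of the reasoning referenced above): the gradient of the squared loss is itself a linear combination of the training inputs, so with any step size $s>0$ every gradient descent iterate started at $\mathbf{w}^0=\vec{0}$ stays in their span,
$$\nabla_{\mathbf{w}}\sum_{i=1}^{n}\left(\mathbf{w}^\top\mathbf{x}_i-y_i\right)^2=\sum_{i=1}^{n}2\left(\mathbf{w}^\top\mathbf{x}_i-y_i\right)\mathbf{x}_i\in\operatorname{span}\{\mathbf{x}_1,\dots,\mathbf{x}_n\},\qquad \mathbf{w}^{t+1}=\mathbf{w}^{t}-s\sum_{i=1}^{n}2\left((\mathbf{w}^{t})^\top\mathbf{x}_i-y_i\right)\mathbf{x}_i.$$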
Similarly, during testing a test point is only accessed through inner products with the training inputs: $h(\mathbf{z})=\mathbf{w}^\top\mathbf{z}=\sum_{i=1}^{n}\alpha_i\mathbf{x}_i^\top\mathbf{z}$. We can now immediately kernelize the algorithm by substituting $k(\mathbf{x},\mathbf{z})$ for every inner product $\mathbf{x}^\top\mathbf{z}$. It remains to show that we can also solve for the values of $\vec{\alpha}$ in closed form. As it turns out, this is straightforward: adding a small ridge regularizer $\sigma^2\|\mathbf{w}\|^2$ to the squared loss and substituting $\mathbf{w}=\mathbf{X}\vec{\alpha}$ gives $\|\mathbf{K}\vec{\alpha}-\mathbf{y}\|^2+\sigma^2\vec{\alpha}^\top\mathbf{K}\vec{\alpha}$, where $\mathbf{K}=\mathbf{X}^\top\mathbf{X}$ with $\mathbf{K}_{ij}=\mathbf{x}_i^\top\mathbf{x}_j=k(\mathbf{x}_i,\mathbf{x}_j)$ after kernelization. Setting the gradient with respect to $\vec{\alpha}$ to zero gives $\mathbf{K}\big((\mathbf{K}+\sigma^2\mathbf{I})\vec{\alpha}-\mathbf{y}\big)=0$, which is solved by $\vec{\alpha}=(\mathbf{K}+\sigma^2\mathbf{I})^{-1}\mathbf{y}$.
Remember that we defined $\mathbf{w}=\mathbf{X}\vec{\alpha}$. The prediction for a test point $\mathbf{z}$ then becomes
$$h(\mathbf{z})=\mathbf{z}^\top\mathbf{w}=\mathbf{z}^\top\underbrace{\mathbf{X}\vec{\alpha}}_{\mathbf{w}}=\underbrace{\mathbf{z}^\top\mathbf{X}}_{\mathbf{k}_*}\underbrace{(\mathbf{K}+\sigma^2\mathbf{I})^{-1}\mathbf{y}}_{\vec{\alpha}}=\mathbf{k}_*(\mathbf{K}+\sigma^2\mathbf{I})^{-1}\mathbf{y},$$
where $\mathbf{k}_*$ is the kernel vector of the test point with the training points, i.e. its $i$-th entry is $[\mathbf{k}_*]_i=\phi(\mathbf{z})^\top\phi(\mathbf{x}_i)$, the inner product between the test point $\mathbf{z}$ and the training point $\mathbf{x}_i$ after the mapping into feature space through $\phi$.
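To make the closed form concrete, here is a minimal NumPy sketch of kernelized regression with the ridge term $\sigma^2$; the RBF kernel, the constants, and the function names are illustrative choices, not something prescribed by these notes.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Kernel matrix with entries k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit(X, y, kernel, sigma2=0.1):
    """Solve (K + sigma^2 I) alpha = y for the dual coefficients alpha."""
    K = kernel(X, X)
    return np.linalg.solve(K + sigma2 * np.eye(len(y)), y)

def kernel_ridge_predict(Z, X, alpha, kernel):
    """h(z) = k_* alpha, where k_* collects kernel values between z and the training points."""
    return kernel(Z, X) @ alpha

# tiny usage example on synthetic 1-d data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
alpha = kernel_ridge_fit(X, y, rbf_kernel)
print(kernel_ridge_predict(X[:3], X, alpha, rbf_kernel))
```

Training is a single $n\times n$ linear solve; prediction never needs $\mathbf{w}$ or $\phi$ explicitly, only kernel values between the test point and the training points, exactly the $\mathbf{k}_*$ vector above.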
Quiz 3: Let $D=\{(\mathbf{x}_1,y_1),\dots,(\mathbf{x}_n,y_n)\}$. How can you kernelize nearest neighbors (with Euclidean distances)?
The original, primal SVM is a quadratic programming problem:
$$\begin{aligned}
\min_{\mathbf{w},b}\ &\mathbf{w}^\top\mathbf{w}+C\sum_{i=1}^{n}\xi_i\\
\text{s.t.}\ \forall i:\ &y_i\left(\mathbf{w}^\top\mathbf{x}_i+b\right)\ge 1-\xi_i\\
&\xi_i\ge 0
\end{aligned}$$
This primal problem has the dual form
$$\begin{aligned}
\min_{\alpha_1,\dots,\alpha_n}\ &\frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \mathbf{K}_{ij}-\sum_{i=1}^{n}\alpha_i\\
\text{s.t.}\ &0\le\alpha_i\le C\\
&\sum_{i=1}^{n}\alpha_i y_i=0
\end{aligned}$$
where $\mathbf{w}=\sum_{i=1}^{n}\alpha_i y_i\phi(\mathbf{x}_i)$ (although this is never computed explicitly) and
$$h(\mathbf{x})=\operatorname{sign}\left(\sum_{i=1}^{n}\alpha_i y_i k(\mathbf{x}_i,\mathbf{x})+b\right).$$
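Purely as an illustration of what the dual QP looks like in code (real SVM implementations use specialized solvers such as SMO; the solver choice and helper names below are assumptions, not the notes' method), here is a sketch that hands the dual to a generic constrained optimizer:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual_solve(K, y, C=1.0):
    """Minimize 1/2 a^T Q a - sum(a) with Q_ij = y_i y_j K_ij,
    subject to 0 <= a_i <= C and sum_i a_i y_i = 0."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * K

    def objective(a):
        return 0.5 * a @ Q @ a - a.sum()

    def gradient(a):
        return Q @ a - np.ones(n)

    result = minimize(objective, np.zeros(n), jac=gradient, method="SLSQP",
                      bounds=[(0.0, C)] * n,
                      constraints=[{"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}])
    return result.x

def svm_decision(alpha, y, b, K_test):
    """h(z) = sign(sum_i alpha_i y_i k(x_i, z) + b); K_test[i, j] = k(x_i, z_j)."""
    return np.sign((alpha * y) @ K_test + b)
```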
One seeming problem with the dual version is that $b$ is no longer part of the optimization. However, we still need it to perform classification. Luckily, we know that the primal solution and the dual solution are identical. In the dual, support vectors are those points with $\alpha_i>0$, and for a support vector $\mathbf{x}_i$ the margin constraint is tight, so we can solve the following equation for $b$:
$$\begin{aligned}
y_i\left(\mathbf{w}^\top\phi(\mathbf{x}_i)+b\right)&=1\\
y_i\left(\sum_j y_j\alpha_j k(\mathbf{x}_j,\mathbf{x}_i)+b\right)&=1\\
y_i-\sum_j y_j\alpha_j k(\mathbf{x}_j,\mathbf{x}_i)&=b
\end{aligned}$$
This allows us to recover $b$ from the support vectors (in practice it is best to average the values of $b$ obtained from several support vectors, as there may be numerical precision problems).
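Continuing the earlier sketch, $b$ can be recovered by averaging over support vectors. Restricting the average to the "free" support vectors with $0<\alpha_i<C$, which sit exactly on the margin, is a common refinement that the notes do not spell out; treat the code as an assumption-laden sketch:

```python
import numpy as np

def recover_bias(alpha, y, K, C=1.0, tol=1e-6):
    """Average b = y_i - sum_j y_j alpha_j k(x_j, x_i) over support vectors.
    K is the training kernel matrix with K[j, i] = k(x_j, x_i)."""
    sv = (alpha > tol) & (alpha < C - tol)   # free support vectors on the margin
    b_estimates = y[sv] - K[:, sv].T @ (alpha * y)
    return b_estimates.mean()
```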
Do you remember the $k$-nearest neighbor algorithm? For binary classification problems ($y_i\in\{+1,-1\}$), we can write its decision function for a test point $\mathbf{z}$ as
$$h(\mathbf{z})=\operatorname{sign}\left(\sum_{i=1}^{n}y_i\,\delta^{nn}(\mathbf{x}_i,\mathbf{z})\right),$$
where $\delta^{nn}(\mathbf{x}_i,\mathbf{z})\in\{0,1\}$, with $\delta^{nn}(\mathbf{x}_i,\mathbf{z})=1$ only if $\mathbf{x}_i$ is one of the $k$ nearest neighbors of the test point $\mathbf{z}$. The SVM decision function
$$h(\mathbf{z})=\operatorname{sign}\left(\sum_{i=1}^{n}y_i\alpha_i k(\mathbf{x}_i,\mathbf{z})+b\right)$$
is very similar, but instead of limiting the decision to the $k$ nearest neighbors it considers all training points, with the kernel function assigning more weight to those that are closer (large $k(\mathbf{x}_i,\mathbf{z})$). In some sense you can view the RBF kernel as a soft nearest-neighbor assignment: its exponential decay with distance assigns almost no weight to all but the points neighboring $\mathbf{z}$. The kernel SVM also learns a weight $\alpha_i\ge 0$ for each training point and a bias $b$, and it essentially "removes" useless training points by setting many $\alpha_i=0$.
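To spell out the analogy in code (a sketch; the weights alpha, the bias b, and the RBF width gamma are assumed to be given, e.g. by a solver like the one sketched above):

```python
import numpy as np

def knn_decision(z, X, y, k=3):
    """Hard vote: only the k nearest training points get weight 1, all others weight 0."""
    dists = np.linalg.norm(X - z, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.sign(y[nearest].sum())

def rbf_svm_decision(z, X, y, alpha, b, gamma=1.0):
    """Soft vote: every training point votes, weighted by alpha_i * exp(-gamma ||x_i - z||^2),
    so distant points contribute almost nothing and points with alpha_i = 0 nothing at all."""
    weights = np.exp(-gamma * np.sum((X - z) ** 2, axis=1))
    return np.sign(np.sum(alpha * y * weights) + b)
```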