Quiz 1: Prove it!
You can even define kernels over sets, strings, or molecules.
Quiz 2: Prove it!
(In practice) an algorithm can be kernelized in 2 steps:
Vanilla Ordinary Least Squares (OLS) regression, also referred to as linear regression, minimizes the following squared loss,
\begin{equation}
\min_\mathbf{w} \sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i -y_i)^2,
\end{equation}
to find the hyper-plane \(\mathbf{w}\). The prediction at a test-point is simply \(h(\mathbf{x})=\mathbf{w}^\top \mathbf{x}\).
If we let \(\mathbf{X}=[\mathbf{x}_1,\ldots,\mathbf{x}_n]\) and \(\mathbf{y}=[y_1,\ldots,y_n]^\top\), the solution of OLS can be written in closed form:
\begin{equation}
\mathbf{w}=(\mathbf{X}\mathbf{X}^\top)^{-1} \mathbf{X} \mathbf{y}\qquad(5)%\label{eq:kernel:OLS}
\end{equation}
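The closed form above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and synthetic noise-free data; the names `ols_closed_form`, `w_true`, and the data itself are hypothetical and not part of the lecture. The inputs are stored as the columns of `X`, matching \(\mathbf{X}=[\mathbf{x}_1,\ldots,\mathbf{x}_n]\).

```python
import numpy as np

def ols_closed_form(X, y):
    """Solve w = (X X^T)^{-1} X y, where X is d x n (one input per column)."""
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(X @ X.T, X @ y)

rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))          # columns are the training inputs x_i
w_true = np.array([1.0, -2.0, 0.5])  # hypothetical ground-truth hyperplane
y = X.T @ w_true                     # noise-free labels y_i = w_true^T x_i

w = ols_closed_form(X, y)
z = rng.normal(size=d)               # a test point
prediction = w @ z                   # h(z) = w^T z
```

Because the labels are noise-free, the recovered `w` should match `w_true` up to floating-point error.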
We begin by expressing the solution \(\mathbf{w}\) as a linear combination of the training inputs \begin{equation} \mathbf{w}=\sum_{i=1}^{n} \alpha_i\mathbf{x}_i=\mathbf{X}\vec{\alpha}. \end{equation} We derived in the previous lecture that such a vector \(\vec \alpha\) must always exist by observing the gradient updates that occur when the squared loss is minimized with gradient descent from the initial vector \(\mathbf{w}_0=\vec 0\). (Because the squared loss is convex, the solution is independent of the initialization.)
Similarly, during testing a test point is only accessed through inner products with training inputs: \begin{equation} h(\mathbf{z})=\mathbf{w}^\top \mathbf{z} = \sum_{i=1}^n\alpha_i \mathbf{x}_i^\top\mathbf{z}. \end{equation} We can now immediately kernelize the algorithm by substituting \(k(\mathbf{x},\mathbf{z})\) for every inner product \(\mathbf{x}^\top \mathbf{z}\). It remains to show that we can also solve for the values of \(\vec{\alpha}\) in closed form. As it turns out, this is straightforward.
Remember that we defined \(\mathbf{w}=\mathbf{X}\vec{\alpha}\). With the closed-form solution \(\vec{\alpha}=(\mathbf{K}+\tau^2\mathbf{I})^{-1}\mathbf{y}\), where \(\mathbf{K}_{ij}=k(\mathbf{x}_i,\mathbf{x}_j)\) is the kernel matrix of the training inputs and \(\tau^2\geq 0\) is the ridge regularization strength (\(\tau^2=0\) recovers plain kernelized OLS whenever \(\mathbf{K}\) is invertible), the prediction at a test point \(\mathbf{z}\) becomes $$h(\mathbf{z})=\mathbf{z}^\top \mathbf{w} =\mathbf{z}^\top\underbrace{\mathbf{X}\vec{\alpha}}_{\mathbf{w}} =\underbrace{\mathbf{k}_*}_{\mathbf{z}^\top\mathbf{X}}\vec{\alpha}=\mathbf{k}_*\underbrace{(\mathbf{K}+\tau^2\mathbf{I})^{-1}\mathbf{y}}_{\vec{\alpha}},$$ or, written entirely in closed form: $$h(\mathbf{z})=\mathbf{k}_*(\mathbf{K}+\tau^2\mathbf{I})^{-1}\mathbf{y},$$ where \(\mathbf{k}_*\) is the kernel (row) vector of the test point with the training points, i.e. its \(i^{th}\) entry is \([\mathbf{k}_*]_{i}=\phi(\mathbf{z})^\top\phi(\mathbf{x}_i)\), the inner product between the test point \(\mathbf{z}\) and the training point \(\mathbf{x}_i\) after the mapping into feature space through \(\phi\).
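The closed-form prediction \(h(\mathbf{z})=\mathbf{k}_*(\mathbf{K}+\tau^2\mathbf{I})^{-1}\mathbf{y}\) can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: it uses NumPy, random synthetic data, stores inputs as rows (so `X_rows` plays the role of \(\mathbf{X}^\top\)), and assumes the RBF kernel is parameterized as \(\exp(-\|\mathbf{a}-\mathbf{b}\|^2/(2\sigma^2))\). All function and variable names are hypothetical. With a linear kernel, the kernelized prediction should agree with ordinary ridge regression solved in the primal, which serves as a sanity check.

```python
import numpy as np

def linear_kernel(A, B):
    """Gram matrix of plain inner products between rows of A and rows of B."""
    return A @ B.T

def rbf_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) between rows of A and rows of B."""
    sq = (A**2).sum(axis=1)[:, None] - 2.0 * A @ B.T + (B**2).sum(axis=1)[None, :]
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_alpha(K, y, tau2):
    """alpha = (K + tau^2 I)^{-1} y, computed by solving a linear system."""
    return np.linalg.solve(K + tau2 * np.eye(len(y)), y)

rng = np.random.default_rng(0)
n, d, tau2 = 30, 4, 0.1
X_rows = rng.normal(size=(n, d))   # training inputs, one per row
y = rng.normal(size=n)             # synthetic regression targets
Z_rows = rng.normal(size=(5, d))   # five test points

# Kernelized prediction with a linear kernel: h(z) = k_* alpha.
alpha_lin = fit_alpha(linear_kernel(X_rows, X_rows), y, tau2)
h_kernel = linear_kernel(Z_rows, X_rows) @ alpha_lin

# The same model solved in the primal: w = (X X^T + tau^2 I)^{-1} X y.
w = np.linalg.solve(X_rows.T @ X_rows + tau2 * np.eye(d), X_rows.T @ y)
h_primal = Z_rows @ w

# Swapping in the RBF kernel only changes how K and k_* are computed.
alpha_rbf = fit_alpha(rbf_kernel(X_rows, X_rows), y, tau2)
h_rbf = rbf_kernel(Z_rows, X_rows) @ alpha_rbf
```

Note that the kernelized version never forms \(\mathbf{w}\); it only needs the \(n\times n\) matrix \(\mathbf{K}\) and the vector \(\mathbf{k}_*\) for each test point.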
Quiz 3: Let \(D=\{(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_n,y_n)\}.\) How can you kernelize nearest neighbors (with Euclidean distances)?
Note that in the lecture, we kernelized the gradient descent algorithm for the soft-margin SVM. That's all you need to know. Indeed, I believe that even in practice, running gradient descent on the soft-margin SVM loss is a pretty good idea: the research paper that suggests using GD to optimize the soft-margin SVM loss won a 10-year test-of-time award at ICML. The content below is optional, since the dual formulation of a constrained optimization problem is outside the scope of this class.
The original, primal SVM is a quadratic programming problem:
\[\begin{aligned}
&\min_{\mathbf{w},b}\frac{1}{2}\mathbf{w}^\top\mathbf{w}+C \sum_{i=1}^{n} \xi_i \\
\text{s.t. }\forall i, &\quad y_i(\mathbf{w}^\top\mathbf{x}_i +b) \geq 1 - \xi_i\\
&\quad \xi_i \geq 0
\end{aligned}\]
It has the dual form
\[\begin{aligned}
&\min_{\alpha_1,\cdots,\alpha_n}\frac{1}{2} \sum_{i,j}\alpha_i \alpha_j y_i y_j K_{ij} - \sum_{i=1}^{n}\alpha_i \\
\text{s.t.} &\quad 0 \leq \alpha_i \leq C\\
&\quad \sum_{i=1}^{n} \alpha_i y_i = 0
\end{aligned}\]
where \(\mathbf{w}=\sum_{i=1}^n \alpha_i y_i\phi(\mathbf{x}_i)\) (although this is never computed) and
\begin{equation}
h(\mathbf{x})=\textrm{sign}\left(\sum_{i=1}^n \alpha_i y_i k(\mathbf{x}_i,\mathbf{x})+b\right).
\end{equation}
One apparent problem with the dual version is that \(b\) is no longer part of the optimization. However, we need it to perform classification. Luckily, we know that the primal solution and the dual solution are identical. In the dual, support vectors are those with \(\alpha_i>0\). For a support vector that lies exactly on the margin (\(0<\alpha_i<C\)), the margin constraint holds with equality, so we can solve the following equation for \(b\) (the last step uses \(y_i\in\{+1,-1\}\), so \(1/y_i=y_i\)): \[\begin{aligned} y_i(\mathbf{w}^\top \phi(\mathbf{x}_i)+b)&=1\\ y_i\left(\sum_{j}y_j\alpha_jk(\mathbf{x}_j,\mathbf{x}_i)+b\right)&=1\\ y_i-\sum_{j}y_j\alpha_jk(\mathbf{x}_j,\mathbf{x}_i)&=b \end{aligned} \] This allows us to solve for \(b\) from the support vectors (in practice it is best to average the \(b\) obtained from several support vectors, as there may be numerical precision problems).
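As a small illustration of recovering \(b\), the sketch below averages \(y_i-\sum_j y_j\alpha_j k(\mathbf{x}_j,\mathbf{x}_i)\) over support vectors and then evaluates the decision function. It assumes NumPy, a linear kernel, and a toy one-dimensional problem whose dual solution \(\alpha_1=\alpha_2=\tfrac12\) was derived by hand; the helper names `recover_bias` and `decision` are hypothetical. It also assumes that only support vectors with \(0<\alpha_i<C\) are used, since those are the ones for which the margin constraint holds with equality.

```python
import numpy as np

def recover_bias(alpha, y, K, C, tol=1e-8):
    """Average b = y_i - sum_j y_j alpha_j k(x_j, x_i) over margin support vectors."""
    on_margin = (alpha > tol) & (alpha < C - tol)   # assumed: 0 < alpha_i < C
    # Entry i of K @ (alpha * y) is sum_j y_j alpha_j k(x_j, x_i) (K is symmetric).
    b_values = y[on_margin] - K[on_margin] @ (alpha * y)
    return b_values.mean()

def decision(alpha, y, b, k_star):
    """h(x) = sign(sum_i alpha_i y_i k(x_i, x) + b) for each row of k_star."""
    return np.sign(k_star @ (alpha * y) + b)

# Toy 1-D problem: x_1 = 3 with label +1, x_2 = 1 with label -1.
x = np.array([3.0, 1.0])
y = np.array([1.0, -1.0])
K = np.outer(x, x)                  # linear kernel, K_ij = x_i * x_j
C = 10.0
alpha = np.array([0.5, 0.5])        # dual solution derived by hand for this toy problem

b = recover_bias(alpha, y, K, C)    # both support vectors give b = -2

test_points = np.array([2.5, 0.0])
k_star = np.outer(test_points, x)   # kernel values between test and training points
labels = decision(alpha, y, b, k_star)
```

Here the implied hyperplane is \(w=\sum_i\alpha_i y_i x_i=1\) with \(b=-2\), so the decision boundary sits at \(x=2\), halfway between the two training points.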
Do you remember the k-nearest neighbor algorithm? For binary classification problems (\(y_i\in\{+1,-1\}\)), we can write the decision function for a test point \(\mathbf{z}\) as $$ h(\mathbf{z})=\textrm{sign}\left(\sum_{i=1}^n y_i \delta^{nn}(\mathbf{x}_i,\mathbf{z})\right), $$ where \(\delta^{nn}(\mathbf{x}_i,\mathbf{z})\in\{0,1\}\) with \(\delta^{nn}(\mathbf{x}_i,\mathbf{z})=1\) only if \(\mathbf{x}_i\) is one of the \(k\) nearest neighbors of the test point \(\mathbf{z}\). The SVM decision function $$h(\mathbf{z})=\textrm{sign}\left(\sum_{i=1}^n y_i\alpha_i k(\mathbf{x}_i,\mathbf{z})+b\right)$$ is very similar, but instead of limiting the decision to the \(k\) nearest neighbors, it considers all training points, and the kernel function assigns more weight to those that are closer (large \(k(\mathbf{x}_i,\mathbf{z})\)). In some sense you can view the RBF kernel as a soft nearest neighbor assignment, as the exponential decay with distance will assign almost no weight to all but the neighboring points of \(\mathbf{z}\). The kernel SVM algorithm also learns a weight \(\alpha_i\geq 0\) for each training point and a bias \(b\), and it essentially "removes" useless training points by setting many \(\alpha_i=0\).
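The "soft nearest neighbor" view can be made concrete by comparing the hard indicator \(\delta^{nn}\) with RBF kernel values for the same test point. The sketch below is only an illustration, assuming NumPy, five made-up training points, and the RBF parameterization \(\exp(-\|\mathbf{x}_i-\mathbf{z}\|^2/(2\sigma^2))\); the function names are hypothetical.

```python
import numpy as np

def knn_indicator(X_train, z, k):
    """delta^{nn}: 1 for the k training points closest to z, 0 otherwise."""
    dists = np.linalg.norm(X_train - z, axis=1)
    delta = np.zeros(len(X_train))
    delta[np.argsort(dists)[:k]] = 1.0
    return delta

def rbf_weights(X_train, z, sigma=1.0):
    """RBF kernel value between z and every training point (one per row)."""
    sq_dists = ((X_train - z) ** 2).sum(axis=1)
    return np.exp(-sq_dists / (2.0 * sigma**2))

# Made-up training inputs: three near the test point, two far away.
X_train = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 1.0], [5.0, 5.0], [-6.0, 4.0]])
z = np.array([0.2, 0.1])

hard = knn_indicator(X_train, z, k=2)      # weight 1 for the 2 nearest points only
soft = rbf_weights(X_train, z, sigma=1.0)  # smoothly decaying weight for every point
```

The hard indicator cuts off abruptly after the second neighbor, while the RBF weights shrink smoothly with distance and are numerically negligible for the two far-away points.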