Lecture 14: Kernels continued
Well-defined kernels
Here are the most common kernels:
- Linear: $\mathsf{k}(\mathbf{x}, \mathbf{z})=\mathbf{x}^\top\mathbf{z}$
- RBF: $\mathsf{k}(\mathbf{x},\mathbf{z})=e^{- \frac{(\mathbf{x}-\mathbf{z})^2}{\sigma^2}}$
- Polynomial: $\mathsf{k}(\mathbf{x}, \mathbf{z})=(1+\mathbf{x}^\top\mathbf{z})^d$
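For concreteness, here is a minimal NumPy sketch of these three kernels (the function names and the example vectors are just for illustration):
```python
import numpy as np

def linear_kernel(x, z):
    # k(x, z) = x^T z
    return x @ z

def rbf_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / sigma^2)
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def polynomial_kernel(x, z, d=2):
    # k(x, z) = (1 + x^T z)^d
    return (1 + x @ z) ** d

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(linear_kernel(x, z), rbf_kernel(x, z), polynomial_kernel(x, z))
```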
Kernels built by recursively combining one or more of the following rules are
called well-defined kernels:
- $\mathsf{k}(\mathbf{x}, \mathbf{z})=\mathbf{x}^\top\mathbf{z}\quad\quad\quad(1)$
- $\mathsf{k}(\mathbf{x}, \mathbf{z})=c\mathsf{k_1}(\mathbf{x},\mathbf{z})\quad\quad(2)$
- $\mathsf{k}(\mathbf{x}, \mathbf{z})=\mathsf{k_1}(\mathbf{x},\mathbf{z})+\mathsf{k_2}(\mathbf{x},\mathbf{z})$
- $\mathsf{k}(\mathbf{x}, \mathbf{z})=g(\mathsf{k_1}(\mathbf{x},\mathbf{z}))$
- $\mathsf{k}(\mathbf{x}, \mathbf{z})=\mathsf{k_1}(\mathbf{x},\mathbf{z})\mathsf{k_2}(\mathbf{x},\mathbf{z})$
- $\mathsf{k}(\mathbf{x}, \mathbf{z})=f(\mathbf{x})\mathsf{k_1}(\mathbf{x},\mathbf{z})f(\mathbf{z})\quad(3)$
- $\mathsf{k}(\mathbf{x}, \mathbf{z})=e^{\mathsf{k_1}(\mathbf{x},\mathbf{z})}\quad\quad(4)$
- $\mathsf{k}(\mathbf{x}, \mathbf{z})=\mathbf{x}^\top \mathbf{A} \mathbf{z}$
where $k_1,k_2$ are well-defined kernels, $c\geq 0$, $g$ is a polynomial function with positive coefficients, $f$ is any function and $\mathbf{A}\succeq 0$ is positive semi-definite.
A kernel being well-defined is equivalent to the corresponding kernel matrix, $K$, being
positive semidefinite (not proved here), which is in turn equivalent to
any of the following statements:
- All eigenvalues of $K$ are non-negative.
- $\exists \text{ real matrix } P \text{ s.t. } K=P^\top P.$
- $\forall \text{ real vector } \mathbf{x}, \mathbf{x}^\top K \mathbf{x} \ge 0.$
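As a quick numerical sanity check (a sketch, not a proof), we can build the kernel matrix of the RBF kernel on a few random points and verify that its eigenvalues are indeed non-negative:
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))      # 5 random points in R^3 (rows)
sigma = 1.0

# K_ij = exp(-||x_i - x_j||^2 / sigma^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / sigma ** 2)

eigvals = np.linalg.eigvalsh(K)      # K is symmetric
print(eigvals.min() >= -1e-10)       # True: all eigenvalues are (numerically) non-negative
```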
It is trivial to prove that the linear kernel and the polynomial kernel with integer $d$
are both well-defined kernels.
The RBF kernel
$\mathsf{k}_{RBF}(\mathbf{x},\mathbf{z})=e^{\frac{-(\mathbf{x}-\mathbf{z})^2}{\sigma^2}}$
is also a well-defined kernel, as the following derivation shows.
\[\begin{aligned}
\mathsf{k}_1(\mathbf{x},\mathbf{z})&=\mathbf{x}^\top \mathbf{z} \qquad \textrm{well defined by rule (1)}\\
\mathsf{k}_2(\mathbf{x},\mathbf{z})&=\frac{2}{\sigma^2}\mathsf{k}_1(\mathbf{x},\mathbf{z})=\frac{2}{\sigma^2}\mathbf{x}^\top \mathbf{z} \qquad \textrm{well defined by rule (2)}\\
\mathsf{k}_3(\mathbf{x},\mathbf{z})&=e^{\mathsf{k}_2(\mathbf{x},\mathbf{z})}=e^{\frac{2\mathbf{x}^\top \mathbf{z}}{\sigma^2}} \qquad \textrm{well defined by rule (4)}\\
\mathsf{k}_{4}(\mathbf{x},\mathbf{z})&=e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}}
\mathsf{k}_3(\mathbf{x},\mathbf{z})
e^{-\frac{\mathbf{z}^\top\mathbf{z}}{\sigma^2}}
=e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}}
e^{\frac{2\mathbf{x}^\top \mathbf{z}}{\sigma^2}}
e^{-\frac{\mathbf{z}^\top\mathbf{z}}{\sigma^2}}
\qquad \textrm{well defined by rule (3) with $f(\mathbf{x})= e^{-\frac{\mathbf{x}^\top\mathbf{x}}{\sigma^2}}$}\\
&=e^{\frac{-\mathbf{x}^\top\mathbf{x}+2\mathbf{x}^\top\mathbf{z}-\mathbf{z}^\top\mathbf{z}}{\sigma^2}}=e^{\frac{-(\mathbf{x}-\mathbf{z})^2}{\sigma^2}}=\mathsf{k}_{RBF}(\mathbf{x},\mathbf{z})
\end{aligned}\]
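The factorization used in the last step is easy to verify numerically. The following sketch (with arbitrary test vectors and bandwidth) checks that the product of the three exponentials equals the RBF kernel:
```python
import numpy as np

rng = np.random.default_rng(1)
x, z = rng.standard_normal(4), rng.standard_normal(4)
sigma = 1.5

lhs = np.exp(-x @ x / sigma**2) * np.exp(2 * x @ z / sigma**2) * np.exp(-z @ z / sigma**2)
rhs = np.exp(-np.sum((x - z) ** 2) / sigma**2)
print(np.isclose(lhs, rhs))          # True
```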
You can even define kernels over sets, strings, or molecules.
The following kernel is defined on two sets,
\[\mathsf{k}(S_1,S_2)=e^{|S_1 \cap S_2|}.\]
It can be shown that this is a well-defined kernel. To see this, we can
list all possible samples in $\Omega$ and arrange them into a sorted list.
For each set $S$ we define a vector $\mathbf{x} \in \{0,1\}^{|\Omega|}$,
where each of its elements indicates whether the corresponding sample
is included in $S$. Since $|S_1 \cap S_2|=\mathbf{x}_1^\top \mathbf{x}_2$, we have
\[\mathsf{k}(S_1,S_2)=e^{\mathbf{x}_1^\top \mathbf{x}_2},\]
which is a well-defined kernel by rules (1) and (4).
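This argument can be checked directly with a small sketch; the universe $\Omega$ and the two sets below are just illustrative examples:
```python
import numpy as np

Omega = sorted({"A", "B", "C", "D"})   # all possible samples, in a sorted list
S1, S2 = {"A", "B", "C"}, {"B", "C", "D"}

def indicator(S):
    # x in {0,1}^|Omega|: x_j = 1 iff the j-th sample of Omega is in S
    return np.array([1.0 if e in S else 0.0 for e in Omega])

x1, x2 = indicator(S1), indicator(S2)
print(np.exp(len(S1 & S2)))            # k(S1, S2) = e^{|S1 ∩ S2|}
print(np.exp(x1 @ x2))                 # identical value: e^{x1^T x2}
```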
Kernel Machines
An algorithm can be kernelized in 2 steps:
- Rewrite the algorithm and the classifier entirely in terms of
inner products $\mathbf{x}_i^\top \mathbf{x}_j$.
- Define a kernel function and substitute $\mathsf{k}(\mathbf{x}_i,\mathbf{x}_j)$
for $\mathbf{x}_i^\top \mathbf{x}_j$.
Quiz 1: How can you kernelize nearest neighbors (with Euclidean distances)?
$D=\{(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_n,y_n)\}.$
Observations: Nearest neighbor under squared Euclidean distance is
the same as nearest neighbor under Euclidean distance.
Therefore, out of the original version,
\[ h(\mathbf{x})= y_j \quad \text{where} \quad j=\textrm{argmin}_{(\mathbf{x}_j,y_j) \in D}(\mathbf{x}-\mathbf{x}_j)^2 \]
we can derive the kernel version,
\[h(\mathbf{x})= y_j \quad \text{where} \quad j=\textrm{argmin}_{(\mathbf{x}_j,y_j) \in D} \left(\mathsf{k}({\mathbf{x}}, {\mathbf{x}})-2\mathsf{k}({\mathbf{x}}, {\mathbf{x}_j})+\mathsf{k}({\mathbf{x}_j}, {\mathbf{x}_j})\right) \]
In practice, kernelized nearest neighbor is rarely used:
it brings little improvement, since the original $k$-NN
is already highly non-linear.
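Still, as an illustration of the two kernelization steps, here is a minimal sketch of kernelized 1-nearest neighbor (the RBF kernel and the toy data are assumptions made only for this example):
```python
import numpy as np

def rbf(a, b, sigma=1.0):
    # any well-defined kernel works here; the RBF kernel is just one choice
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

def kernel_1nn(x, X_train, y_train, k=rbf):
    # squared distance in feature space: k(x,x) - 2 k(x,x_j) + k(x_j,x_j)
    dists = [k(x, x) - 2 * k(x, xj) + k(xj, xj) for xj in X_train]
    return y_train[int(np.argmin(dists))]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([-1, -1, +1])
print(kernel_1nn(np.array([1.8, 1.9]), X_train, y_train))   # +1
```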
Kernel Regression
Kernel regression is kernelized Ordinary Least Squares Regression (OLS). Vanilla OLS minimizes the following squared-loss objective,
\begin{equation}
\min_\mathbf{w} \sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i -y_i)^2,
\end{equation}
to find the hyper-plane $\mathbf{w}$. The prediction at a test-point is simply $h(\mathbf{x})=\mathbf{w}^\top \mathbf{x}$.
If we let $\mathbf{X}=[\mathbf{x}_1,\ldots,\mathbf{x}_n]$ and $\mathbf{y}=[y_1,\ldots,y_n]^\top$, the solution of OLS can be written in closed form:
\begin{equation}
\mathbf{w}=(\mathbf{X}\mathbf{X}^\top)^{-1} \mathbf{X} \mathbf{y}\qquad(5)%\label{eq:kernel:OLS}
\end{equation}
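A quick sketch verifying the closed-form solution (5) against a generic least-squares solver, on synthetic data chosen only for illustration:
```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 50
X = rng.standard_normal((d, n))            # columns are the training points x_1, ..., x_n
y = X.T @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(n)

w = np.linalg.solve(X @ X.T, X @ y)        # w = (X X^T)^{-1} X y, i.e. eq. (5)
w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)
print(np.allclose(w, w_lstsq))             # True
```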
Kernelization
We begin by expressing the solution $\mathbf{w}$ as a linear combination of the training inputs
\begin{equation}
\mathbf{w}=\sum_{i=1}^{n} \alpha_i\mathbf{x}_i=\mathbf{X}\vec{\alpha}.
\end{equation}
You can verify that such a vector $\vec \alpha$ must always exist by observing the gradient updates that occur if the squared loss is minimized with gradient descent and the initial vector is set to $\mathbf{w}_0=\vec 0$: every update adds a linear combination of the training inputs, so $\mathbf{w}$ always stays in their span (and because the squared loss is convex, the solution is independent of the initialization).
We now kernelize the algorithm by substituting $k(\mathbf{x},\mathbf{z})$ for any inner-product $\mathbf{x}^\top \mathbf{z}$.
It follows that the prediction at a test point becomes
\begin{equation}
h(\mathbf{x})=\mathbf{w}^\top \mathbf{x} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i^\top\mathbf{x} =\sum_{i=1}^{n} \alpha_i k(\mathbf{x}_i,\mathbf{x})= K_{X:x}\vec{\alpha}.
\end{equation}
It remains to show that we can also solve for the values of $\vec\alpha$ in closed form. As it turns out, this is straightforward.
Kernelized ordinary least squares has the solution $\vec{\alpha}={\mathbf{K}}^{-1} \mathbf{y}$.
\[\begin{aligned}
\mathbf{X}\vec{\alpha}&=\mathbf{w}= (\mathbf{X}\mathbf{X}^\top)^{-1} \mathbf{X} \mathbf{y} \ \ \ \textrm{ | multiply from left by $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$} \\
(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\vec{\alpha} &= (\mathbf{X}^\top\mathbf{X})^{-1}\underbrace{\mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top)^{-1} \mathbf{X}}_{=\mathbf{I}} \mathbf{y}\\
\vec{\alpha}&= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{y} \ \ \ \textrm{ | substitute $\mathbf{K}=\mathbf{X}^\top \mathbf{X}$}\\
\vec{\alpha}&= \mathbf{K}^{-1} \mathbf{y} \\
\end{aligned}\]
Kernel regression can be extended to the kernelized version of ridge regression. The solution then becomes
\begin{equation}
\vec{\alpha}=(\mathbf{K}+\sigma^2\mathbf{I})^{-1}\mathbf{y}.
\end{equation}
In practice a small value of $\sigma^2>0$ increases stability, especially if $\mathbf{K}$ is not invertible. If $\sigma^2=0$, kernel ridge regression becomes kernelized ordinary least squares. Kernel ridge regression is typically also referred to simply as kernel regression.
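A minimal sketch of the training step, assuming an RBF kernel and synthetic one-dimensional data (note that, unlike the column convention used above, the rows of `X_train` are the training points here):
```python
import numpy as np

def rbf_matrix(A, B, kernel_sigma=1.0):
    # K_ij = exp(-||a_i - b_j||^2 / kernel_sigma^2); rows of A and B are points
    sq = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-sq / kernel_sigma**2)

rng = np.random.default_rng(3)
X_train = rng.uniform(-3, 3, size=(40, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(40)

reg = 0.1                                                    # the regularizer sigma^2
K = rbf_matrix(X_train, X_train)
alpha = np.linalg.solve(K + reg * np.eye(len(K)), y_train)   # alpha = (K + sigma^2 I)^{-1} y
```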
Testing
Remember that we defined $\mathbf{w}=\mathbf{X}\vec{\alpha}.$ The prediction of a test point $\mathbf{z}$ then becomes
$$h(\mathbf{z})=\mathbf{z}^\top \mathbf{w}
=\mathbf{z}^\top\underbrace{\mathbf{X}\vec{\alpha}}_{\mathbf{w}}
=\mathbf{z}^\top\mathbf{X}\underbrace{(\mathbf{K}+\sigma^2\mathbf{I})^{-1}\mathbf{y}}_{\vec{\alpha}}
=\underbrace{\mathbf{K}_*}_{\mathbf{z}^\top\mathbf{X}}(\mathbf{K}+\sigma^2\mathbf{I})^{-1}\mathbf{y},$$ where $\mathbf{K}_*$ is the kernel of the test point with the training points, i.e. $[\mathbf{K}_*]_{i}=\phi(\mathbf{z})^\top\phi(\mathbf{x}_i)$, the inner product between the test point $\mathbf{z}$ and the training point $\mathbf{x}_i$ after the mapping into feature space through $\phi$.
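Putting training and testing together, here is a sketch of the full kernel (ridge) regression pipeline, repeating the setup of the previous sketch for completeness. The comparison against scikit-learn's `KernelRidge` at the end assumes that library is available and is only meant as a sanity check:
```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def rbf_matrix(A, B, kernel_sigma=1.0):
    # K_ij = exp(-||a_i - b_j||^2 / kernel_sigma^2); rows of A and B are points
    sq = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-sq / kernel_sigma**2)

rng = np.random.default_rng(3)
X_train = rng.uniform(-3, 3, size=(40, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(40)
Z_test = np.linspace(-3, 3, 9)[:, None]

kernel_sigma, reg = 1.0, 0.1
K = rbf_matrix(X_train, X_train, kernel_sigma)
alpha = np.linalg.solve(K + reg * np.eye(len(K)), y_train)   # training
K_star = rbf_matrix(Z_test, X_train, kernel_sigma)           # kernel of test vs. training points
h = K_star @ alpha                                           # h(z) = K_* (K + sigma^2 I)^{-1} y

# sanity check against scikit-learn's kernel ridge regression (gamma = 1 / kernel_sigma^2)
krr = KernelRidge(alpha=reg, kernel="rbf", gamma=1.0 / kernel_sigma**2).fit(X_train, y_train)
print(np.allclose(h, krr.predict(Z_test)))                   # True
```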
Kernel SVM
The original, primal SVM is a quadratic programming problem:
\[\begin{aligned}
&\min_{\mathbf{w},b}\mathbf{w}^\top\mathbf{w}+C \sum_{i=1}^{n} \xi_i \\
\text{s.t. }\forall i, &\quad y_i(\mathbf{w}^\top\mathbf{x}_i +b) \geq 1 - \xi_i\\
&\quad \xi_i \geq 0
\end{aligned}\]
This primal formulation has the dual form
\[\begin{aligned}
&\min_{\alpha_1,\cdots,\alpha_n}\frac{1}{2} \sum_{i,j}\alpha_i \alpha_j y_i y_j K_{ij} - \sum_{i=1}^{n}\alpha_i \\
\text{s.t.} &\quad 0 \leq \alpha_i \leq C\\
&\quad \sum_{i=1}^{n} \alpha_i y_i = 0
\end{aligned}\]
where $\mathbf{w}=\sum_{i=1}^n \alpha_i y_i\phi(\mathbf{x}_i)$ (although this is never computed) and
$h(\mathbf{x})=\textrm{sign}\left(\sum_{i=1}^n \alpha_i y_i k(\mathbf{x}_i,\mathbf{x})+b\right).$
Almost all $\alpha_i = 0$ (i.e. $\vec \alpha$ is sparse). We refer to those inputs with $\alpha_i>0$ as support vectors. For test-time you only have to store the vectors $\mathbf{x}_i$ and values $\alpha_i$ that correspond to support vectors.
It is easy to show that $0<\alpha_i<C \Rightarrow y_i(\mathbf{w}^\top \phi(\mathbf{x}_i)+b)=1$. This allows us to solve for $b$ from these support vectors (it is best to average the resulting values of $b$, as there may be numerical precision problems).
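As a sketch of this sparsity in practice, one can fit scikit-learn's `SVC` (which solves this dual internally) and inspect how few training points end up as support vectors; the data are synthetic and the attribute names below are scikit-learn's, not notation from the lecture:
```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)     # a labeling that is not linearly separable

clf = SVC(C=1.0, kernel="rbf", gamma=1.0).fit(X, y)

print(len(clf.support_), "support vectors out of", len(X))   # alpha_i > 0 only for these
print(clf.dual_coef_.shape)   # the stored values y_i * alpha_i for the support vectors
print(clf.intercept_[0])      # the bias b
```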
Quiz: What is the dual form of the hard-margin SVM?
Kernel SVM - the smart nearest neighbor
Do you remember the k-nearest neighbor algorithm? For binary classification problems ($y_i\in\{+1,-1\}$), we can write the decision function for a test point $\mathbf{z}$ as
$$
h(\mathbf{z})=\textrm{sign}\left(\sum_{i=1}^n y_i \delta^{nn}(\mathbf{x}_i,\mathbf{z})\right),
$$
where $\delta^{nn}(\mathbf{x}_i,\mathbf{z})\in\{0,1\}$ with $\delta^{nn}(\mathbf{x}_i,\mathbf{z})=1$ only if $\mathbf{x}_i$ is one of the $k$ nearest neighbors of the test point $\mathbf{z}$.
The SVM decision function
$$h(\mathbf{z})=\textrm{sign}\left(\sum_{i=1}^n y_i\alpha_i k(\mathbf{x}_i,\mathbf{z})+b\right)$$
is very similar, but instead of limiting the decision to the $k$ nearest neighbors, it considers all training points, with the kernel function assigning more weight to those that are closer (large $k(\mathbf{z},\mathbf{x}_i)$). In some sense you can view the RBF kernel as a soft nearest-neighbor assignment, as the exponential decay with distance assigns almost no weight to all but the points neighboring $\mathbf{z}$.
The kernel SVM algorithm also learns a weight $\alpha_i\geq 0$ for each training point and a bias $b$, and it essentially "removes" useless training points by setting many of the $\alpha_i$ to $0$.
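The following sketch makes this view concrete: it reconstructs the SVM decision function by hand as the kernel-weighted sum over the support vectors and checks that it matches scikit-learn's `decision_function` (the data and kernel parameters are arbitrary choices for the example):
```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

gamma = 0.5
clf = SVC(C=10.0, kernel="rbf", gamma=gamma).fit(X, y)

Z = rng.standard_normal((5, 2))                            # test points
K_star = rbf_kernel(Z, clf.support_vectors_, gamma=gamma)  # k(x_i, z) for the support vectors

# sum_i (y_i alpha_i) k(x_i, z) + b, using only the support vectors
manual = K_star @ clf.dual_coef_.ravel() + clf.intercept_[0]
print(np.allclose(manual, clf.decision_function(Z)))       # True
```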