The Support Vector Machine (SVM) is a linear classifier that can be viewed as an extension of the Perceptron developed by Rosenblatt in 1958. The Perceptron is guaranteed to find a separating hyperplane if one exists. The SVM finds the maximum margin separating hyperplane.
Setting: We define a linear classifier: \(h(\mathbf{x})=\textrm{sign}(\mathbf{w}^{T}\mathbf{x}+b)\) and we assume a binary classification setting with labels \(\{+1,-1\}\).
Typically, if a data set is linearly separable, there are infinitely many separating hyperplanes. A natural question to ask is: Question: What is the best separating hyperplane? SVM Answer: The one that maximizes the distance to the closest data points from both classes. We say it is the hyperplane with maximum margin.
Figure 1: (Left:) Two different separating hyperplanes for the same data set. (Right:) The maximum margin hyperplane. The margin, \(\gamma\), is the distance from the hyperplane (solid line) to the closest points in either class (which touch the parallel dotted lines).
We already saw the definition of a margin in the context of the Perceptron. A hyperplane is defined through \(\mathbf{w},b\) as a set of points such that \(\mathcal{H}=\left\{\mathbf{x}\vert{}\mathbf{w}^\top\mathbf{x}+b=0\right\}\). Let the margin \(\gamma\) be defined as the distance from the hyperplane to the closest point across both classes.
What is the distance of a point \(\mathbf{x}\) to the hyperplane \(\mathcal{H}\)?
Consider some point \(\mathbf{x}\). Let \(\mathbf{d}\) be the vector from \(\mathcal{H}\) to \(\mathbf{x}\) of minimum length. Let \(\mathbf{x}^P\) be the projection of \(\mathbf{x}\) onto \(\mathcal{H}\). It follows then that:
$$(1)\ \mathbf{x}^P=\mathbf{x}-\mathbf{d}.$$
Since \(\mathbf{d}\) is parallel to \(\mathbf{w}\), there exists some \(\alpha\in\mathbb{R}\) such that
$$(2)\ \mathbf{d}=\alpha\mathbf{w}.$$
\(\mathbf{x}^P\) lies on the hyperplane (i.e. \(\mathbf{x}^P\in\mathcal{H}\) ) which implies
$$(3)\ \mathbf{w}^\top\mathbf{x}^P+b=0.$$
We can now take (3) and substitute in (1) and (2).
$$\mathbf{w}^\top\mathbf{x}^P+b=\mathbf{w}^\top(\mathbf{x}-\mathbf{d})+b=\mathbf{w}^\top(\mathbf{x}-\alpha\mathbf{w})+b=0$$
which implies $$(4)\ \alpha=\frac{\mathbf{w}^\top\mathbf{x}+b}{\mathbf{w}^\top\mathbf{w}}.$$
This allows us to compute the length of \(\mathbf{d}\) by plugging in (4) into (2):
$$\| \mathbf{d} \|_2=\sqrt{\mathbf{d}^\top\mathbf{d}}=\sqrt{\alpha^2\mathbf{w}^\top\mathbf{w}}=\frac{| \mathbf{w}^\top\mathbf{x}+b |}{\sqrt{\mathbf{w}^\top\mathbf{w}}}=\frac{| \mathbf{w}^\top\mathbf{x}+b |}{\| \mathbf{w} \|_{2}}$$
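The derivation above can be checked numerically. The following is a minimal sketch in plain Python; the helper names `dot` and `project_onto_hyperplane`, as well as the hyperplane and query point, are hypothetical choices made for illustration and do not come from the notes.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def project_onto_hyperplane(w, b, x):
    """Return (x_P, distance) for the hyperplane {z : w.z + b = 0}."""
    alpha = (dot(w, x) + b) / dot(w, w)        # equation (4)
    d = [alpha * wi for wi in w]               # equation (2): d = alpha * w
    x_p = [xi - di for xi, di in zip(x, d)]    # equation (1): x_P = x - d
    distance = abs(dot(w, x) + b) / math.sqrt(dot(w, w))
    return x_p, distance

# Hypothetical example: the hyperplane 3*x1 + 4*x2 - 5 = 0 and the point (3, 4).
w, b = [3.0, 4.0], -5.0
x = [3.0, 4.0]
x_p, dist = project_onto_hyperplane(w, b, x)

print("projection:", x_p)                          # lies on the hyperplane
print("distance:", dist)                           # equals the length of d
print("w.x_P + b:", dot(w, x_p) + b)               # zero up to rounding, equation (3)
```

Here \(\alpha=20/25=0.8\), so \(\mathbf{d}=(2.4,3.2)\) has length 4, which matches \(|\mathbf{w}^\top\mathbf{x}+b|/\|\mathbf{w}\|_2=20/5\).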
Margin of \(\mathcal{H}\) with respect to the data set \(D\): $$\gamma(\mathbf{w},b)=\min_{\mathbf{x}\in D}\frac{| \mathbf{w}^\top\mathbf{x}+b |}{\| \mathbf{w} \|_{2}}$$
Useful observation: By definition, the margin and hyperplane are scale invariant. $$\gamma(\beta\mathbf{w},\beta b)=\gamma(\mathbf{w},b), \forall \beta \neq 0$$
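As a quick sanity check of the margin definition and its scale invariance, here is a small sketch in plain Python. The function name `margin`, the four inputs in `D`, and the hyperplane are assumptions made up for this example.

```python
import math

def margin(w, b, points):
    """gamma(w, b): smallest distance from any point in `points` to the hyperplane."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return min(
        abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
        for x in points
    )

# Hypothetical inputs and hyperplane (the vertical line x1 = 1).
D = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 2.0]]
w, b = [1.0, 0.0], -1.0

gamma = margin(w, b, D)

# Rescaling (w, b) by any beta != 0 describes the same hyperplane,
# so the margin should not change.
scaled_margins = [
    margin([beta * wi for wi in w], beta * b, D)
    for beta in (0.5, -2.0, 10.0)
]

print("gamma:", gamma)
print("after rescaling:", scaled_margins)
```

The closest points, \((2,0)\) and \((0,0)\), are each at distance 1 from the line \(x_1=1\), and multiplying \(\mathbf{w},b\) by \(\beta\) scales numerator and denominator by the same factor \(|\beta|\).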
Note that if the hyperplane is such that \(\gamma\) is maximized, it must lie right in the middle of the two classes. In other words, \(\gamma\) must be the distance to the closest point within both classes. (If not, you could move the hyperplane towards data points of the class that is further away and increase \(\gamma\), which contradicts that \(\gamma\) is maximized.)
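Maximizing the margin means solving \(\max_{\mathbf{w},b}\gamma(\mathbf{w},b)\) subject to every point lying on the correct side of the hyperplane, \(\forall i \ y_{i}(\mathbf{w}^\top \mathbf{x}_{i}+b) \geq 0\). Because the hyperplane is scale invariant, we can fix the scale of \(\mathbf{w},b\) such that \(\min_{\mathbf{x}\in D}| \mathbf{w}^\top\mathbf{x}+b |=1\); the margin then equals \(1/\| \mathbf{w} \|_{2}\), and maximizing it is the same as minimizing \(\mathbf{w}^\top\mathbf{w}\). These constraints are still hard to deal with; luckily, we can show that (for the optimal solution) they are equivalent to a much simpler formulation. (Make sure you know how to prove that the two sets of constraints are equivalent.) $$ \begin{align} &\min_{\mathbf{w},b}\mathbf{w}^\top\mathbf{w}&\\ &\textrm{s.t.} \ \ \ \forall i \ y_{i}(\mathbf{w}^\top \mathbf{x}_{i}+b) \geq 1 & \end{align} $$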
This new formulation is a quadratic optimization problem: the objective is quadratic and the constraints are all linear. We can solve it efficiently with any QP (Quadratic Program) solver; because none of the constraints is quadratic, the more general QCQP (Quadratically Constrained Quadratic Program) machinery is not needed. It has a unique solution whenever a separating hyperplane exists. It also has a nice interpretation: Find the simplest hyperplane (where simpler means smaller \(\mathbf{w}^\top\mathbf{w}\)) such that all inputs satisfy \(y_{i}(\mathbf{w}^\top \mathbf{x}_{i}+b) \geq 1\), i.e. lie at distance at least \(1/\| \mathbf{w} \|_{2}\) from the hyperplane on the correct side.
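To make the formulation concrete, the sketch below evaluates the constraints and the objective for a few hand-picked candidates on a toy data set. It does not solve the optimization problem; a QP solver would be needed for that. The function names, the data, and the three candidates are all hypothetical.

```python
def constraint_values(w, b, X, y):
    """Return y_i * (w.x_i + b) for every training point."""
    return [
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        for xi, yi in zip(X, y)
    ]

def is_feasible(w, b, X, y, tol=1e-9):
    """True if every constraint y_i * (w.x_i + b) >= 1 holds."""
    return all(v >= 1.0 - tol for v in constraint_values(w, b, X, y))

def objective(w):
    """The SVM objective w.w."""
    return sum(wj * wj for wj in w)

# Hypothetical toy data: two positive and two negative points.
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 2.0]]
y = [+1, +1, -1, -1]

candidates = {
    "tight":    ([1.0, 0.0], -1.0),   # constraints hold with equality for two points
    "rescaled": ([2.0, 0.0], -2.0),   # same hyperplane, needlessly large w
    "tilted":   ([1.5, -0.5], -1.5),  # a different separating hyperplane
}

report = {
    name: (is_feasible(w, b, X, y), objective(w))
    for name, (w, b) in candidates.items()
}
for name, (feasible, value) in report.items():
    print(name, "feasible:", feasible, "w.w:", value)
```

All three candidates satisfy the constraints, but the first has the smallest objective. For this toy data it is in fact the maximum margin solution: the closest pair of opposite-class points, \((2,0)\) and \((0,0)\), is 2 apart, so no hyperplane can have a margin larger than 1.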
For the optimal \(\mathbf{w},b\) pair, some training points will have tight constraints, i.e. $$y_{i}(\mathbf{w}^\top \mathbf{x}_{i}+b) = 1.$$ (This must be the case, because if for all training points we had a strict \(>\) inequality, it would be possible to scale down both parameters \(\mathbf{w},b\) until the constraints are tight and obtain an even lower objective value.) We refer to these training points as support vectors. Support vectors are special because they are the training points that define the maximum margin of the hyperplane to the data set, and they therefore determine the position and orientation of the hyperplane. If you were to move one of them and retrain the SVM, the resulting hyperplane would change. The opposite is the case for non-support vectors (provided you don't move them too much, or they would turn into support vectors themselves). This will become particularly important in the dual formulation for Kernel-SVMs.
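Given an optimal \(\mathbf{w},b\), the support vectors can be read off by looking for tight constraints. The following is a minimal sketch in plain Python; the function name `support_vector_indices`, the tolerance, and the toy data are assumptions. The pair \(\mathbf{w}=(1,0)\), \(b=-1\) is the maximum margin solution for this particular data, which can be verified by hand because the closest opposite-class points \((2,0)\) and \((0,0)\) are exactly 2 apart.

```python
def support_vector_indices(w, b, X, y, tol=1e-9):
    """Indices i whose constraint y_i * (w.x_i + b) >= 1 holds with equality."""
    tight = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        value = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        if abs(value - 1.0) <= tol:
            tight.append(i)
    return tight

# Hypothetical toy data and its (hand-verified) maximum margin hyperplane.
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 2.0]]
y = [+1, +1, -1, -1]
w, b = [1.0, 0.0], -1.0

sv = support_vector_indices(w, b, X, y)
print("support vectors:", [X[i] for i in sv])

# Nudging a non-support vector leaves the tight constraints untouched.
X_moved = [row[:] for row in X]
X_moved[1] = [3.5, 1.0]
sv_moved = support_vector_indices(w, b, X_moved, y)
print("after moving a non-support vector:", [X_moved[i] for i in sv_moved])
```

A numerical tolerance is needed in practice because a solver returns \(\mathbf{w},b\) only up to rounding error.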
If the data is low dimensional it is often the case that there is no separating hyperplane between the two classes. In this case, there is no solution to the optimization problems stated above. We can fix this by allowing the constraints to be violated ever so slightly with the introduction of slack variables: $$\begin{matrix} \min_{\mathbf{w},b}\mathbf{w}^\top\mathbf{w}+C\sum_{i=1}^{n}\xi _{i} \\ \textrm{s.t.} \ \forall i \ y_{i}(\mathbf{w}^\top\mathbf{x}_{i}+b)\geq 1-\xi_i \\ \forall i \ \xi_i \geq 0 \end{matrix}$$ The slack variable \(\xi_i\) allows the input \(\mathbf{x}_i\) to be closer to the hyperplane (or even be on the wrong side), but there is a penalty in the objective function for such "slack". If \(C\) is very large, the SVM becomes very strict and tries to get all points to be on the right side of the hyperplane. If \(C\) is very small, the SVM becomes very loose and may "sacrifice" some points to obtain a simpler (i.e. lower \(\|\mathbf{w}\|_2^2\)) solution.
Let us consider the value of \(\xi_i\) for the case of \(C> 0\). Because the objective will always try to minimize \(\xi_i\) as much as possible, the constraint must hold with equality whenever \(\xi_i>0\), and we have: $$ \xi_i=\left\{ \begin{array}{cc} \ 1-y_{i}(\mathbf{w}^\top \mathbf{x}_{i}+b) & \textrm{ if \(y_{i}(\mathbf{w}^\top \mathbf{x}_{i}+b)<1\)}\\ 0 & \textrm{ if \(y_{i}(\mathbf{w}^\top \mathbf{x}_{i}+b)\geq 1\)} \end{array} \right. $$ This is equivalent to the following closed form: $$ \xi_i=\max(1-y_{i}(\mathbf{w}^\top \mathbf{x}_{i}+b) ,0). $$ If we plug this closed form into the objective of our SVM optimization problem, we obtain the following unconstrained version as loss function and regularizer: $$ \min_{\mathbf{w},b}\underbrace{\mathbf{w}^\top\mathbf{w}}_{l_{2}-regularizer}+C\ \sum_{i=1}^{n}\underbrace{\max[ 1-y_{i}(\mathbf{w}^\top \mathbf{x}_{i}+b),0 ]}_{hinge-loss} $$ This formulation allows us to optimize the SVM parameters (\(\mathbf{w},b\)) just like logistic regression (e.g. through gradient descent, using a subgradient where the hinge-loss is not differentiable). The only difference is that we have the hinge-loss instead of the logistic loss.
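The unconstrained objective can be minimized with a few lines of code. The sketch below uses plain subgradient descent with a constant step size in pure Python; it is an illustration under stated assumptions, not a recommended solver. The function names `hinge_objective` and `train_svm`, the step size, the number of epochs, the value of \(C\), and the toy data are all hypothetical choices.

```python
def hinge_objective(w, b, X, y, C):
    """w.w + C * sum_i max(1 - y_i * (w.x_i + b), 0)."""
    reg = sum(wj * wj for wj in w)
    loss = sum(
        max(1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b), 0.0)
        for xi, yi in zip(X, y)
    )
    return reg + C * loss

def train_svm(X, y, C=1.0, lr=0.001, epochs=3000):
    """Subgradient descent on the unconstrained SVM objective (illustrative only)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w = [2.0 * wj for wj in w]   # gradient of the regularizer w.w
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1.0:          # hinge is active: subgradient is -y_i * x_i
                grad_w = [g - C * yi * xj for g, xj in zip(grad_w, xi)]
                grad_b -= C * yi
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: two positive and two negative points.
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 2.0]]
y = [+1, +1, -1, -1]
C = 10.0

w, b = train_svm(X, y, C=C)
initial = hinge_objective([0.0, 0.0], 0.0, X, y, C)
final = hinge_objective(w, b, X, y, C)
predictions = [
    1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1
    for xi in X
]

print("w:", w, "b:", b)
print("objective before/after:", initial, final)
print("predictions:", predictions)
```

With a constant step size the iterates keep oscillating slightly around the minimizer instead of converging exactly; a decaying step size or a dedicated solver would be the usual remedy.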
Figure 2: The five plots above show how the boundaries and the optimal separating hyperplane change on example data for \(C=0.01, 0.1, 1, 10, 100\).