b is the bias term (without the bias term, the hyperplane that w defines would always have to go through the origin).
Dealing with b separately can be a pain, so we 'absorb' it into the weight vector w by adding one additional constant dimension to every input xi.
Under this convention,
$$\mathbf{x}_i \;\text{becomes}\; \begin{bmatrix}\mathbf{x}_i \\ 1\end{bmatrix}, \qquad \mathbf{w} \;\text{becomes}\; \begin{bmatrix}\mathbf{w} \\ b\end{bmatrix}$$
We can verify that
$$\begin{bmatrix}\mathbf{x}_i \\ 1\end{bmatrix}^\top \begin{bmatrix}\mathbf{w} \\ b\end{bmatrix} = \mathbf{w}^\top \mathbf{x}_i + b$$
Using this, we can simplify the above formulation of h(xi) to
$$h(\mathbf{x}_i) = \operatorname{sign}(\mathbf{w}^\top \mathbf{x}_i)$$
(Left:) The original data is 1-dimensional (top row) or 2-dimensional (bottom row). There is no hyper-plane that passes through the origin and separates the red and blue points. (Right:) After a constant dimension was added to all data points such a hyperplane exists.
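To make the trick concrete, here is a small sketch (plain NumPy; the data values and variable names are made up for illustration) that appends a constant 1 to every xi and the bias b to w, and checks that the augmented inner product reproduces w⊤xi+b:

```python
import numpy as np

# Toy data: n points in d dimensions (illustrative values, not from the notes).
X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [-2.0, 0.3]])
w = np.array([0.4, -0.7])
b = 0.1

# Absorb the bias: append a constant 1 to each x_i and append b to w.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # [x_i, 1]
w_aug = np.append(w, b)                            # [w, b]

# The augmented inner product equals the original affine function w^T x_i + b.
assert np.allclose(X_aug @ w_aug, X @ w + b)

# So the classifier simplifies to h(x_i) = sign(w^T x_i) in the augmented space.
print(np.sign(X_aug @ w_aug))
```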
Observation: Note that
$$y_i(\mathbf{w}^\top \mathbf{x}_i) > 0 \;\Longleftrightarrow\; \mathbf{x}_i \;\text{is classified correctly}$$
where 'classified correctly' means that xi is on the correct side of the hyperplane defined by w.
Also, note that this equivalence relies on the label convention yi∈{−1,+1} (it would not hold if, for example, yi∈{0,+1}).
Perceptron Algorithm
Now that we know what the w is supposed to do (define a hyperplane that separates the data), let's look at how we can obtain such a w.
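Before the geometric picture, here is a minimal sketch of the training loop this section describes: start from the all-zero weight vector, sweep over the data, and whenever a point is misclassified (yi(w⊤xi) ≤ 0) update w ← w + yi xi. The code assumes the bias has already been absorbed into the inputs; the max_epochs cap is an addition here so the sketch terminates even on non-separable data.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron training loop. X is (n, d) with the constant bias
    dimension already appended; y contains labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)                      # initialize with the all-zero vector
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i]) <= 0:   # x_i is misclassified (or lies on the hyperplane)
                w = w + y[i] * X[i]      # update: w <- w + y_i x_i
                mistakes += 1
        if mistakes == 0:                # a full pass with no mistakes: converged
            return w
    return w                             # data may not be linearly separable
```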
Geometric Intuition
Illustration of a Perceptron update. (Left:) The hyperplane defined by wt misclassifies one red (-1) and one blue (+1) point. (Middle:) The red point x is chosen and used for an update. Because its label is -1 we need to subtract x from wt. (Right:) The updated hyperplane wt+1=wt−x separates the two classes and the Perceptron algorithm has converged.
Quiz: Assume a data set consists only of a single data point {(x,+1)}. How often can a Perceptron misclassify this point x repeatedly? What if the initial weight vector w was initialized randomly and not as the all-zero vector?
Perceptron Convergence
The Perceptron was arguably the first algorithm with a strong formal guarantee. If a data set is linearly separable, the Perceptron will find a separating hyperplane in a finite number of updates. (If the data is not linearly separable, it will loop forever.)
The argument goes as follows:
Suppose ∃w∗ such that yi(xi⊤w∗)>0 ∀(xi,yi)∈D.
Now, suppose that we rescale each data point and the w∗ such that
$$\|\mathbf{w}^*\| = 1 \quad\text{and}\quad \|\mathbf{x}_i\| \le 1 \;\;\forall \mathbf{x}_i \in D$$
Let us define the Margin γ of the hyperplane w∗ as
$$\gamma = \min_{(\mathbf{x}_i, y_i)\in D} \left|\mathbf{x}_i^\top \mathbf{w}^*\right|.$$
To summarize our setup:
All inputs xi live within the unit sphere
There exists a separating hyperplane defined by w∗, with ‖w∗‖=1 (i.e. w∗ lies exactly on the unit sphere).
γ is the distance from this hyperplane (blue) to the closest data point.
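With the data and w∗ rescaled this way, γ can be computed directly from its definition. The helper below is a sketch (NumPy; rescale_and_margin is an illustrative name, not from the notes):

```python
import numpy as np

def rescale_and_margin(X, w_star):
    """Rescale so that all ||x_i|| <= 1 and ||w*|| = 1, then return the
    margin gamma = min_i |x_i^T w*| together with the rescaled data."""
    X = X / np.max(np.linalg.norm(X, axis=1))   # all inputs inside the unit sphere
    w_star = w_star / np.linalg.norm(w_star)    # w* exactly on the unit sphere
    gamma = np.min(np.abs(X @ w_star))
    return gamma, X
```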
Theorem: If all of the above holds, then the Perceptron algorithm makes at most 1/γ² mistakes.
Proof:
Keeping what we defined above, consider the effect of an update (w becomes w+yx) on the two terms w⊤w∗ and w⊤w.
We will use two facts:
y(x⊤w)≤0: This holds because x is misclassified by w - otherwise we wouldn't make the update.
y(x⊤w∗)>0: This holds because w∗ is a separating hyper-plane and classifies all points correctly.
Consider the effect of an update on w⊤w∗:
(w+yx)⊤w∗=w⊤w∗+y(x⊤w∗)≥w⊤w∗+γ
The inequality follows from the fact that, for w∗, the distance from the hyperplane defined by w∗ to x must be at least γ (i.e. y(x⊤w∗)=|x⊤w∗|≥γ).
This means that for each update, w⊤w∗ grows by at least γ.
Consider the effect of an update on w⊤w:
$$(\mathbf{w} + y\mathbf{x})^\top (\mathbf{w} + y\mathbf{x}) = \mathbf{w}^\top\mathbf{w} + \underbrace{2y(\mathbf{w}^\top\mathbf{x})}_{\le 0} + \underbrace{y^2(\mathbf{x}^\top\mathbf{x})}_{0 \,\le\, \cdot \,\le\, 1} \;\le\; \mathbf{w}^\top\mathbf{w} + 1$$
The inequality follows from the fact that
2y(w⊤x) ≤ 0, as we only make an update when x is misclassified by w
0 ≤ y²(x⊤x) ≤ 1, as y² = 1 and x⊤x ≤ 1 (because ‖x‖ ≤ 1).
This means that for each update, w⊤w grows by at most 1.
Now we know that after M updates the following two inequalities must hold:
(1) w⊤w∗≥Mγ
(2) w⊤w≤M.
We can then complete the proof:
$$
\begin{aligned}
M\gamma \;&\le\; \mathbf{w}^\top \mathbf{w}^* && \text{by (1)}\\
&= \|\mathbf{w}\|\cos(\theta) && \text{by definition of the inner product (recall } \|\mathbf{w}^*\| = 1\text{), where } \theta \text{ is the angle between } \mathbf{w} \text{ and } \mathbf{w}^*\\
&\le \|\mathbf{w}\| && \text{because } \cos(\theta) \le 1\\
&= \sqrt{\mathbf{w}^\top \mathbf{w}} && \text{by definition of } \|\mathbf{w}\|\\
&\le \sqrt{M} && \text{by (2)}
\end{aligned}
$$
$$
M\gamma \le \sqrt{M} \;\Rightarrow\; M^2\gamma^2 \le M \;\Rightarrow\; M \le \frac{1}{\gamma^2}
$$
And hence, the number of updates M is bounded from above by a constant that depends only on the margin γ.
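As an informal sanity check of the theorem (not part of the original argument), one can generate a linearly separable dataset, count the updates the Perceptron actually makes, and compare against 1/γ². Everything below (data, names, the random seed) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate a linearly separable toy set: labels come from a fixed hyperplane w_true.
d, n = 3, 200
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)                 # ||w*|| = 1
X = rng.normal(size=(n, d))
X /= np.max(np.linalg.norm(X, axis=1))           # all ||x_i|| <= 1
y = np.sign(X @ w_true)

gamma = np.min(np.abs(X @ w_true))               # margin of w_true on this sample

# Run the Perceptron and count the total number of updates (mistakes).
w = np.zeros(d)
mistakes = 0
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:
            w = w + yi * xi
            mistakes += 1
            converged = False

print(f"updates made: {mistakes},  bound 1/gamma^2: {1.0 / gamma**2:.1f}")
```

The observed number of updates should never exceed the printed bound, although the bound can be very loose when some point lies close to the hyperplane (small γ).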
Quiz: Given the theorem above, what can you say about the margin of a classifier (what is more desirable, a large margin or a small margin?) Can you characterize data sets for which the Perceptron algorithm will converge quickly? Draw an example.