Lecture 3: The Perceptron
Assumptions
- Binary classification (i.e. $y_i \in \{-1, +1\}$)
- Data is linearly separable
Classifier
$$h(\vec{x}_i) = \textrm{sign}(\vec{w}^\top \vec{x}_i + b)$$
$b$ is the bias term (without the bias term, the hyperplane that $\vec{w}$ defines would always have to go through the origin).
Dealing with $b$ can be a pain, so we 'absorb' it into the weight vector $\vec{w}$ by adding one additional constant dimension to the feature vector $\vec{x}_i$.
Under this convention,
$$\vec{x}_i \textrm{ becomes } \begin{bmatrix} \vec{x}_i \\ 1 \end{bmatrix}, \qquad \vec{w} \textrm{ becomes } \begin{bmatrix} \vec{w} \\ b \end{bmatrix}$$
We can verify that
$$\begin{bmatrix} \vec{x}_i \\ 1 \end{bmatrix}^\top \begin{bmatrix} \vec{w} \\ b \end{bmatrix} = \vec{w}^\top \vec{x}_i + b$$
Using this, we can simplify the above formulation of $h(\vec{x}_i)$ to
$$h(\vec{x}_i) = \textrm{sign}(\vec{w}^\top \vec{x}_i)$$
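As a quick sanity check of the absorbed-bias trick, here is a small numerical sketch (the array values are made up purely for illustration):

```python
import numpy as np

x = np.array([0.5, -1.2, 2.0])   # a feature vector
w = np.array([1.0, 0.3, -0.7])   # a weight vector
b = 0.25                         # the bias term

x_aug = np.append(x, 1.0)        # [x, 1]
w_aug = np.append(w, b)          # [w, b]

# Both expressions compute the same score, so sign(.) agrees as well.
assert np.isclose(w @ x + b, w_aug @ x_aug)
prediction = np.sign(w_aug @ x_aug)   # h(x) = sign(w^T x) with the absorbed bias
```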
Observation: Note that
$$y_i(\vec{w}^\top \vec{x}_i) > 0 \Longleftrightarrow \vec{x}_i \textrm{ is classified correctly}$$
where 'classified correctly' means that $\vec{x}_i$ is on the correct side of the hyperplane defined by $\vec{w}$.
Also, note that this equivalence relies on $y_i \in \{-1, +1\}$ (it wouldn't work if, for example, $y_i \in \{0, +1\}$).
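Spelling out the case analysis behind this equivalence:

$$y_i(\vec{w}^\top \vec{x}_i) > 0 \Longleftrightarrow \begin{cases} \vec{w}^\top \vec{x}_i > 0 & \textrm{if } y_i = +1 \\ \vec{w}^\top \vec{x}_i < 0 & \textrm{if } y_i = -1 \end{cases}$$

In both cases the sign of $\vec{w}^\top \vec{x}_i$ matches the label, i.e. $h(\vec{x}_i) = y_i$. With labels in $\{0, +1\}$ the product would simply be $0$ for every point labeled $0$, so the test would break.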
Perceptron Algorithm
Now that we know what $\vec{w}$ is supposed to do (define a hyperplane that separates the data), let's look at how we can find such a $\vec{w}$.
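The algorithm is short: start with $\vec{w} = \vec{0}$, and whenever a training point $(\vec{x}_i, y_i)$ is misclassified, update $\vec{w} \leftarrow \vec{w} + y_i \vec{x}_i$; repeat passes over the data until no point is misclassified. A minimal Python sketch of this loop (function and variable names are my own, not from the lecture):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Train a perceptron.

    X: (n, d) array of inputs, with the constant-1 dimension already appended (absorbed bias).
    y: (n,) array of labels in {-1, +1}.
    Returns the learned weight vector w.
    """
    n, d = X.shape
    w = np.zeros(d)                       # start with the all-zeros hyperplane
    for _ in range(max_epochs):           # cap the passes in case the data is not separable
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i]) <= 0:    # misclassified (or exactly on the hyperplane)
                w += y[i] * X[i]          # the perceptron update
                mistakes += 1
        if mistakes == 0:                 # converged: every point is on the correct side
            return w
    return w
```

If the data is linearly separable, the loop is guaranteed to terminate; the convergence proof below bounds the total number of updates.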
Geometric Intuition
Below we can see that the point $\vec{x}_i$ is initially misclassified (left) by the hyperplane defined by $\vec{w}_t$ at time $t$.
The middle portion shows how the perceptron update changes $\vec{w}_t$ to $\vec{w}_{t+1}$.
The right portion shows that the point is no longer misclassified after the update.
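Algebraically, the update $\vec{w}_{t+1} = \vec{w}_t + y\vec{x}$ moves the score of the misclassified point $(\vec{x}, y)$ in the right direction:

$$y(\vec{w}_{t+1}^\top \vec{x}) = y\big((\vec{w}_t + y\vec{x})^\top \vec{x}\big) = y(\vec{w}_t^\top \vec{x}) + y^2(\vec{x}^\top \vec{x}) = y(\vec{w}_t^\top \vec{x}) + \|\vec{x}\|^2$$

so $y(\vec{w}^\top \vec{x})$ increases by $\|\vec{x}\|^2$ with every update on this point, and after enough updates it becomes positive, i.e. the point is classified correctly.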
Perceptron Convergence
Suppose that $\exists \vec{w}^*$ such that $y_i(\vec{x}_i^\top \vec{w}^*) > 0$ $\forall (\vec{x}_i, y_i) \in D$.
Now, suppose that we rescale each data point and $\vec{w}^*$ such that
$$\|\vec{w}^*\| = 1 \quad \textrm{and} \quad \|\vec{x}_i\| \le 1 \;\; \forall \vec{x}_i \in D$$
The Margin of a hyperplane, γ, is defined as
$$\gamma = \min_{(\vec{x}_i, y_i) \in D} \left| \vec{x}_i^\top \vec{w}^* \right|$$
We can visualize this as follows
- All inputs $\vec{x}_i$ live within the unit sphere
- $\gamma$ is the distance from the hyperplane to the closest data point
- $\vec{w}^*$ lies on the unit sphere (since $\|\vec{w}^*\| = 1$)
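To make the margin concrete, here is a small sketch that performs the rescaling above and computes $\gamma$ for a given separating $\vec{w}^*$ (the data and names are illustrative, not from the lecture):

```python
import numpy as np

def margin(X, w_star):
    """gamma = min over the data of |x_i^T w_star|."""
    return np.min(np.abs(X @ w_star))

# Illustrative data and separating direction.
X = np.array([[0.9, 0.1], [-0.4, -0.8], [0.2, 0.7]])
w_star = np.array([1.0, 1.0])

# Rescale: make w_star unit length and shrink the data into the unit sphere.
w_star = w_star / np.linalg.norm(w_star)
X = X / np.max(np.linalg.norm(X, axis=1))

gamma = margin(X, w_star)   # distance from the hyperplane to the closest point
```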
Theorem: If all of the above holds, then the perceptron algorithm makes at most $1/\gamma^2$ mistakes.
Keeping what we defined above, consider the effect of a single update, $\vec{w} \leftarrow \vec{w} + y\vec{x}$ for a misclassified point $(\vec{x}, y)$, on the two quantities $\vec{w}^\top \vec{w}$ and $\vec{w}^\top \vec{w}^*$:
1. Consider $\vec{w}^\top \vec{w}^*$:
$$(\vec{w} + y\vec{x})^\top \vec{w}^* = \vec{w}^\top \vec{w}^* + y(\vec{x}^\top \vec{w}^*) \ge \vec{w}^\top \vec{w}^* + \gamma$$
The inequality follows from the fact that the distance from the hyperplane defined by $\vec{w}^*$ to $\vec{x}$ must be at least $\gamma$, and $\vec{x}$ lies on its correct side, so $y(\vec{x}^\top \vec{w}^*) = |\vec{x}^\top \vec{w}^*| \ge \gamma$.
This means that for each update, $\vec{w}^\top \vec{w}^*$ grows by at least $\gamma$.
2. Now consider $\vec{w}^\top \vec{w}$:
$$(\vec{w} + y\vec{x})^\top (\vec{w} + y\vec{x}) = \vec{w}^\top \vec{w} + 2y(\vec{w}^\top \vec{x}) + y^2(\vec{x}^\top \vec{x}) \le \vec{w}^\top \vec{w} + 1$$
The inequality follows from the fact that
- $2y(\vec{w}^\top \vec{x}) < 0$, as we had to make an update, meaning $\vec{x}$ was misclassified
- $y^2(\vec{x}^\top \vec{x}) \le 1$, as $y^2 = 1$ and all $\vec{x}$ live within the unit sphere
This means that for each update, $\vec{w}^\top \vec{w}$ grows by at most 1.
3. Now we can put the above findings together. Suppose we made $M$ updates.
$$\begin{aligned}
M\gamma &\le \vec{w}^\top \vec{w}^* && \textrm{by (1)} \\
&= |\vec{w}^\top \vec{w}^*| && \textrm{by (1) again: we start at } \vec{w} = \vec{0} \textrm{ and each update adds at least } \gamma \ge 0 \\
&\le \|\vec{w}\|\,\|\vec{w}^*\| && \textrm{by the Cauchy-Schwarz inequality} \\
&= \|\vec{w}\| && \textrm{as } \|\vec{w}^*\| = 1 \\
&= \sqrt{\vec{w}^\top \vec{w}} && \textrm{by definition of } \|\vec{w}\| \\
&\le \sqrt{M} && \textrm{by (2): } \vec{w}^\top \vec{w} \textrm{ starts at } 0 \textrm{ and grows by at most } 1 \textrm{ per update}
\end{aligned}$$

Putting the two ends together, $M\gamma \le \sqrt{M}$, so $\sqrt{M} \le \frac{1}{\gamma}$ and therefore $M \le \frac{1}{\gamma^2}$, which proves the theorem.
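As an informal empirical check of the bound (my own simulation sketch, not part of the lecture), we can generate linearly separable data with a known $\vec{w}^*$, run the perceptron while counting updates, and compare the count to $1/\gamma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable 2-D data: label by the sign of a fixed unit-length w_star,
# and drop points too close to the hyperplane to guarantee a positive margin.
w_star = np.array([1.0, -1.0]) / np.sqrt(2.0)   # ||w_star|| = 1
X = rng.uniform(-1.0, 1.0, size=(200, 2))
X = X / np.max(np.linalg.norm(X, axis=1))       # ||x_i|| <= 1
scores = X @ w_star
keep = np.abs(scores) > 0.05
X, y = X[keep], np.sign(scores[keep])

gamma = np.min(np.abs(X @ w_star))              # the margin of w_star on this data

# Perceptron training loop, counting every update (= mistake).
w, mistakes = np.zeros(2), 0
while True:
    updated = False
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:                  # misclassified (or on the hyperplane)
            w += yi * xi
            mistakes += 1
            updated = True
    if not updated:
        break

print(f"mistakes = {mistakes}, bound 1/gamma^2 = {1.0 / gamma**2:.1f}")
```

By the theorem, the printed mistake count can never exceed the bound.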
History
- Initially, huge wave of excitement ("Digital brains")
- Then it contributed to the A.I. Winter. Famous counter-example: the XOR problem (Minsky 1969):
- If the data is not linearly separable, the algorithm loops forever.