Lecture 3: The Perceptron
Assumptions
- Binary classification (i.e. $y_i \in \{-1, +1\}$)
- Data is linearly separable
Classifier
$$h(\vec{x}_i) = \operatorname{sign}(\vec{w} \cdot \vec{x}_i + b)$$
$b$ is the bias term (without the bias term, the hyperplane that $\vec{w}$ defines would always have to go through the origin).
Dealing with $b$ can be a pain, so we 'absorb' it into the weight vector $\vec{w}$ by adding one additional constant dimension to each feature vector.
Under this convention,
$$\vec{x}_i \text{ becomes } \begin{bmatrix} \vec{x}_i \\ 1 \end{bmatrix} \qquad \vec{w} \text{ becomes } \begin{bmatrix} \vec{w} \\ b \end{bmatrix}$$
We can verify that
$$\begin{bmatrix} \vec{x}_i \\ 1 \end{bmatrix} \cdot \begin{bmatrix} \vec{w} \\ b \end{bmatrix} = \vec{w} \cdot \vec{x}_i + b$$
Using this, we can simplify the above formulation of $h(\vec{x}_i)$ to
$$h(\vec{x}_i) = \operatorname{sign}(\vec{w} \cdot \vec{x}_i)$$
Observation: Note that
$$y_i(\vec{w} \cdot \vec{x}_i) > 0 \iff \vec{x}_i \text{ is classified correctly}$$
where 'classified correctly' means that $\vec{x}_i$ is on the correct side of the hyperplane defined by $\vec{w}$.
Also, note that the left side depends on $y_i \in \{-1, +1\}$ (it wouldn't work if, for example, $y_i \in \{0, +1\}$).
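As a quick sanity check, here is a minimal NumPy sketch of the bias-absorption trick and the correctness condition above (all arrays and values are made-up illustrations, not from the lecture):

```python
import numpy as np

# Toy data: n points in d dimensions with labels in {-1, +1} (illustrative values).
X = np.array([[0.5, 1.0], [-1.0, 0.2], [0.3, -0.8]])   # shape (n, d)
y = np.array([+1, -1, -1])
w, b = np.array([2.0, -1.0]), 0.5

# Absorb the bias: append a constant 1 to every x_i and append b to w.
X_abs = np.hstack([X, np.ones((X.shape[0], 1))])        # each row becomes [x_i, 1]
w_abs = np.append(w, b)                                 # w becomes [w, b]

# The absorbed dot product equals w . x_i + b for every point.
assert np.allclose(X_abs @ w_abs, X @ w + b)

# x_i is classified correctly exactly when y_i * (w . x_i) > 0.
print(y * (X_abs @ w_abs) > 0)
```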
Perceptron Algorithm
Now that we know what $\vec{w}$ is supposed to do (define a hyperplane that separates the data), let's look at how we can get such a $\vec{w}$.
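The lecture presents the algorithm as pseudocode; the sketch below is a minimal Python/NumPy rendering of the same update rule (the names `perceptron_train` and `perceptron_predict` are my own), assuming the bias has already been absorbed so each input ends with a constant 1.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=1000):
    """Perceptron: start at w = 0 and add y_i * x_i whenever x_i is misclassified.

    X: (n, d) inputs with the bias absorbed; y: (n,) labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):                 # guard against non-separable data
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:       # misclassified (or on the hyperplane)
                w += y_i * x_i                  # the Perceptron update
                mistakes += 1
        if mistakes == 0:                       # a full pass with no mistakes: done
            return w
    return w

def perceptron_predict(w, X):
    return np.sign(X @ w)
```

If the data is linearly separable, the outer loop exits as soon as a full pass makes no update; the `max_epochs` cap is only a guard for the non-separable case noted under History below.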
Geometric Intuition
Quiz#1: Can you draw a visualization of a Perceptron update?
Quiz#2: How often can a Perceptron misclassify a point $\vec{x}$ repeatedly?
Perceptron Convergence
Suppose that $\exists \vec{w}^*$ such that $y_i(\vec{w}^* \cdot \vec{x}_i) > 0$ $\forall (\vec{x}_i, y_i) \in D$.
Now, suppose that we rescale each data point and $\vec{w}^*$ such that
$$\|\vec{w}^*\| = 1 \quad \text{and} \quad \|\vec{x}_i\| \leq 1 \;\; \forall \vec{x}_i \in D$$
The Margin of a hyperplane, $\gamma$, is defined as
$$\gamma = \min_{(\vec{x}_i, y_i) \in D} |\vec{w}^* \cdot \vec{x}_i|$$
We can visualize this as follows
- All inputs $\vec{x}_i$ live within the unit sphere
- $\gamma$ is the distance from the hyperplane to the closest data point
- $\vec{w}^*$ lies on the unit sphere
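Under this rescaling, the margin is simply the smallest absolute projection of any training input onto $\vec{w}^*$. A minimal sketch, assuming NumPy, a unit-norm `w_star`, and rows of `X` with norm at most 1:

```python
import numpy as np

def margin(w_star, X):
    # gamma = min_i |w_star . x_i|, assuming ||w_star|| = 1 and ||x_i|| <= 1 for all i.
    return np.min(np.abs(X @ w_star))
```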
Theorem: If all of the above holds, then the Perceptron algorithm makes at most $1/\gamma^2$ mistakes.
Proof:
Keeping what we defined above, consider the effect of an update ($\vec{w}$ becomes $\vec{w} + y\vec{x}$) on the two terms $\vec{w} \cdot \vec{w}^*$ and $\vec{w} \cdot \vec{w}$.
We will use two facts:
- $y(\vec{x} \cdot \vec{w}) \leq 0$: This holds because $\vec{x}$ is misclassified by $\vec{w}$; otherwise we wouldn't make the update.
- $y(\vec{x} \cdot \vec{w}^*) > 0$: This holds because $\vec{w}^*$ defines a separating hyperplane and classifies all points correctly.
1. Consider the effect of an update on $\vec{w} \cdot \vec{w}^*$:
$$(\vec{w} + y\vec{x}) \cdot \vec{w}^* = \vec{w} \cdot \vec{w}^* + y(\vec{x} \cdot \vec{w}^*) \geq \vec{w} \cdot \vec{w}^* + \gamma$$
The inequality follows from the fact that, for $\vec{w}^*$, the distance from the hyperplane defined by $\vec{w}^*$ to $\vec{x}$ must be at least $\gamma$ (i.e. $y(\vec{x} \cdot \vec{w}^*) = |\vec{x} \cdot \vec{w}^*| \geq \gamma$).
This means that for each update, $\vec{w} \cdot \vec{w}^*$ grows by at least $\gamma$.
2. Consider the effect of an update on $\vec{w} \cdot \vec{w}$:
$$(\vec{w} + y\vec{x}) \cdot (\vec{w} + y\vec{x}) = \vec{w} \cdot \vec{w} + 2y(\vec{w} \cdot \vec{x}) + y^2(\vec{x} \cdot \vec{x}) \leq \vec{w} \cdot \vec{w} + 1$$
The inequality follows from the fact that
- $2y(\vec{w} \cdot \vec{x}) \leq 0$, as we had to make an update, meaning $\vec{x}$ was misclassified
- $y^2(\vec{x} \cdot \vec{x}) \leq 1$, as $y^2 = 1$ and $\vec{x} \cdot \vec{x} \leq 1$ (because $\|\vec{x}\| \leq 1$).
This means that for each update, $\vec{w} \cdot \vec{w}$ grows by at most 1.
3. Now we can put together the above findings. Suppose we had $M$ updates.
$$
\begin{aligned}
M\gamma &\leq \vec{w} \cdot \vec{w}^* && \text{By (1)} \\
&= |\vec{w} \cdot \vec{w}^*| && \text{By (1) again (the dot product must be non-negative, because the initialization is } \vec{0} \text{ and each update increases it by at least } \gamma\text{)} \\
&\leq \|\vec{w}\|\,\|\vec{w}^*\| && \text{By the Cauchy-Schwarz inequality} \\
&= \|\vec{w}\| && \text{As } \|\vec{w}^*\| = 1 \\
&= \sqrt{\vec{w} \cdot \vec{w}} && \text{By definition of } \|\vec{w}\| \\
&\leq \sqrt{M} && \text{By (2)}
\end{aligned}
$$
$$\Rightarrow\; M\gamma \leq \sqrt{M} \;\Rightarrow\; M^2\gamma^2 \leq M \;\Rightarrow\; M \leq \frac{1}{\gamma^2}$$
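As an illustrative numerical check of the bound (not part of the lecture; all names and constants below are made up), we can sample linearly separable data with margin at least $\gamma$, count the Perceptron's updates, and compare against $1/\gamma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.05
w_star = np.array([0.6, 0.8])                    # ||w_star|| = 1

# Sample points, clamp them to the unit disk, and keep only those at distance >= gamma
# from the hyperplane defined by w_star.
X = rng.uniform(-1, 1, size=(500, 2))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # enforce ||x_i|| <= 1
keep = np.abs(X @ w_star) >= gamma
X, y = X[keep], np.sign(X[keep] @ w_star)        # labels given by the true separator

# Run the Perceptron until a full pass makes no update, counting updates along the way.
w, updates, converged = np.zeros(2), 0, False
while not converged:
    converged = True
    for x_i, y_i in zip(X, y):
        if y_i * np.dot(w, x_i) <= 0:
            w += y_i * x_i
            updates += 1
            converged = False

print(updates, "updates; theorem bounds this by", 1 / gamma**2)
```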
History
- Initially, huge wave of excitement ("Digital brains")
- Then, it contributed to the A.I. Winter. Famous counter-example: the XOR problem (Minsky 1969).
- If the data is not linearly separable, it loops forever.