Lecture 3: The Perceptron


Assumptions

  1. Binary classification (i.e. $y_i \in \{-1, +1\}$)
  2. Data is linearly separable

Classifier

$$h(\mathbf{x}_i) = \textrm{sign}(\mathbf{w}^\top \mathbf{x}_i + b)$$
$b$ is the bias term (without the bias term, the hyperplane that $\mathbf{w}$ defines would always have to go through the origin). Dealing with $b$ can be a pain, so we 'absorb' it into the feature vector $\mathbf{w}$ by adding one additional constant dimension. Under this convention,
$$\mathbf{x}_i \hspace{0.1in} \text{becomes} \hspace{0.1in} \begin{bmatrix} \mathbf{x}_i \\ 1 \end{bmatrix}, \hspace{0.4in} \mathbf{w} \hspace{0.1in} \text{becomes} \hspace{0.1in} \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix}.$$
We can verify that
$$\begin{bmatrix} \mathbf{x}_i \\ 1 \end{bmatrix}^\top \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix} = \mathbf{w}^\top \mathbf{x}_i + b.$$
Using this, we can simplify the above formulation of $h(\mathbf{x}_i)$ to
$$h(\mathbf{x}_i) = \textrm{sign}(\mathbf{w}^\top \mathbf{x}_i).$$
(Left:) The original data is 1-dimensional (top row) or 2-dimensional (bottom row). There is no hyper-plane that passes through the origin and separates the red and blue points. (Right:) After a constant dimension was added to all data points such a hyperplane exists.
Observation: Note that
$$y_i(\mathbf{w}^\top \mathbf{x}_i) > 0 \Longleftrightarrow \mathbf{x}_i \hspace{0.1in} \text{is classified correctly},$$
where 'classified correctly' means that $\mathbf{x}_i$ is on the correct side of the hyperplane defined by $\mathbf{w}$. Also, note that the left side depends on $y_i \in \{-1, +1\}$ (it wouldn't work if, for example, $y_i \in \{0, +1\}$).
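A minimal NumPy sketch of these two conventions, on a toy data set of our own choosing (the names X_aug and w_aug are purely illustrative, not part of the lecture):

import numpy as np

# Toy 2-dimensional data with labels in {-1, +1}.
X = np.array([[0.5, 1.0],
              [-1.0, 0.3],
              [0.2, -0.7]])
y = np.array([+1, -1, -1])

w, b = np.array([1.0, 2.0]), -0.5

# Absorb the bias: append a constant 1 to every x_i and append b to w.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.append(w, b)

# The two formulations agree: w_aug^T [x_i; 1] == w^T x_i + b.
assert np.allclose(X_aug @ w_aug, X @ w + b)

# A point is classified correctly exactly when y_i (w^T x_i) > 0.
correct = y * (X_aug @ w_aug) > 0
print(correct)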

Perceptron Algorithm

Now that we know what the $\mathbf{w}$ is supposed to do (define a hyperplane that separates the data), let's look at how we can get such a $\mathbf{w}$.

The algorithm itself is remarkably simple: initialize $\mathbf{w} = \vec 0$; then loop over the training set, and whenever a point $(\mathbf{x}_i, y_i)$ is misclassified (i.e. $y_i(\mathbf{w}^\top \mathbf{x}_i) \le 0$), update $\mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i$. Repeat such passes over the data until every point is classified correctly.
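A minimal sketch of this procedure in Python/NumPy (the function name perceptron_train and the max_passes safeguard are our own additions, not part of the lecture; the constant dimension is assumed to have been absorbed already, as above):

import numpy as np

def perceptron_train(X, y, max_passes=1000):
    """Perceptron on data X (one row per example, constant dimension already
    absorbed) with labels y in {-1, +1}. Returns the weight vector w."""
    n, d = X.shape
    w = np.zeros(d)                          # initialize w to the all-zero vector
    for _ in range(max_passes):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:      # xi is misclassified
                w = w + yi * xi              # update: w <- w + y x
                mistakes += 1
        if mistakes == 0:                    # a full pass with no mistakes: converged
            break
    return w

On linearly separable data the loop eventually completes a full pass without mistakes and returns a separating $\mathbf{w}$; on non-separable data it would loop forever, which is the only reason for the max_passes cutoff.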


Geometric Intuition

Illustration of a Perceptron update. (Left:) The hyperplane defined by $\mathbf{w}_t$ misclassifies one red (-1) and one blue (+1) point. (Middle:) The red point $\mathbf{x}$ is chosen and used for an update. Because its label is -1 we need to subtract $\mathbf{x}$ from $\mathbf{w}_t$. (Right:) The updated hyperplane $\mathbf{w}_{t+1} = \mathbf{w}_t - \mathbf{x}$ separates the two classes and the Perceptron algorithm has converged.
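As a quick sanity check (our own aside, separate from the convergence argument below): if $(\mathbf{x}, y)$ is currently misclassified, i.e. $y(\mathbf{w}^\top \mathbf{x}) \le 0$, then after the update $\mathbf{w} \leftarrow \mathbf{w} + y\mathbf{x}$ we have
$$y\big((\mathbf{w} + y\mathbf{x})^\top \mathbf{x}\big) = y(\mathbf{w}^\top \mathbf{x}) + y^2\, \mathbf{x}^\top \mathbf{x} = y(\mathbf{w}^\top \mathbf{x}) + ||\mathbf{x}||^2 \ge y(\mathbf{w}^\top \mathbf{x}),$$
so each update moves the chosen point toward the correct side of the hyperplane, even though a single update need not place it on the correct side yet.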

Quiz: Assume a data set consists only of a single data point {(x,+1)}. How often can a Perceptron misclassify this point x repeatedly? What if the initial weight vector w was initialized randomly and not as the all-zero vector?

Perceptron Convergence

The Perceptron was arguably the first algorithm with a strong formal guarantee. If a data set is linearly separable, the Perceptron will find a separating hyperplane in a finite number of updates. (If the data is not linearly separable, it will loop forever.)

The argument goes as follows: Suppose $\exists \mathbf{w}^*$ such that $y_i(\mathbf{x}_i^\top \mathbf{w}^*) > 0$ $\forall (\mathbf{x}_i, y_i) \in D$.

Now, suppose that we rescale each data point and $\mathbf{w}^*$ such that
$$||\mathbf{w}^*|| = 1 \hspace{0.3in} \text{and} \hspace{0.3in} ||\mathbf{x}_i|| \le 1 \hspace{0.2in} \forall \mathbf{x}_i \in D.$$

Let us define the Margin $\gamma$ of the hyperplane $\mathbf{w}^*$ as
$$\gamma = \min_{(\mathbf{x}_i, y_i) \in D} |\mathbf{x}_i^\top \mathbf{w}^*|.$$
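In code this quantity is a one-liner; a sketch assuming the rows of X are the (rescaled) data points and w_star is the unit-norm separator (the helper name margin is ours):

import numpy as np

def margin(X, w_star):
    """gamma = min_i |x_i^T w*|, assuming ||w*|| = 1 and ||x_i|| <= 1."""
    return np.min(np.abs(X @ w_star))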

To summarize our setup:

  1. All inputs $\mathbf{x}_i$ live within the unit sphere, i.e. $||\mathbf{x}_i|| \le 1$.
  2. There exists a separating hyperplane defined by $\mathbf{w}^*$ with $||\mathbf{w}^*|| = 1$, i.e. $y_i(\mathbf{x}_i^\top \mathbf{w}^*) > 0$ for all $(\mathbf{x}_i, y_i) \in D$.
  3. $\gamma = \min_{(\mathbf{x}_i, y_i) \in D} |\mathbf{x}_i^\top \mathbf{w}^*|$ is the distance from this hyperplane to the closest data point.
  4. The labels satisfy $y_i \in \{-1, +1\}$, and the Perceptron weight vector is initialized as the all-zero vector.

Theorem: If all of the above holds, then the Perceptron algorithm makes at most $1/\gamma^2$ mistakes.
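Before the proof, here is a small, entirely optional empirical check of this bound (our own construction, not from the lecture): generate data that is separable by a known unit-norm $\mathbf{w}^*$, rescale it as in the setup above, run the Perceptron while counting updates, and compare against $1/\gamma^2$.

import numpy as np

rng = np.random.default_rng(0)

# A known unit-norm separator w* and data within the unit sphere.
w_star = np.array([1.0, -1.0, 0.5])
w_star /= np.linalg.norm(w_star)                  # ||w*|| = 1
X = rng.uniform(-1.0, 1.0, size=(200, 3))
X /= np.max(np.linalg.norm(X, axis=1))            # rescale so that ||x_i|| <= 1
y = np.sign(X @ w_star)                           # label by w*, so the data is separable
X, y = X[y != 0], y[y != 0]                       # drop (unlikely) points exactly on the hyperplane

gamma = np.min(np.abs(X @ w_star))                # margin of w*

# Run the Perceptron, counting every update (= every mistake).
w, mistakes = np.zeros(X.shape[1]), 0
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        if yi * np.dot(w, xi) <= 0:               # misclassified -> update
            w += yi * xi
            mistakes += 1
            converged = False

print(mistakes, "<=", 1.0 / gamma ** 2)           # the theorem guarantees this inequality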

Proof:
Keeping what we defined above, consider the effect of an update ($\mathbf{w}$ becomes $\mathbf{w} + y\mathbf{x}$) on the two terms $\mathbf{w}^\top \mathbf{w}^*$ and $\mathbf{w}^\top \mathbf{w}$. We will use two facts: