
Lecture 3: The Perceptron


Assumptions

  1. Binary classification (i.e. $y_i \in \{-1, +1\}$)
  2. Data is linearly separable

Classifier

$$h(\mathbf{x}_i) = \textrm{sign}(\mathbf{w}^\top \mathbf{x}_i + b)$$

$b$ is the bias term (without the bias term, the hyperplane that $\mathbf{w}$ defines would always have to go through the origin). Dealing with $b$ can be a pain, so we 'absorb' it into the feature vector $\mathbf{w}$ by adding one additional constant dimension. Under this convention,
$$\mathbf{x}_i \hspace{0.1in} \text{becomes} \hspace{0.1in} \begin{bmatrix} \mathbf{x}_i \\ 1 \end{bmatrix}, \qquad \mathbf{w} \hspace{0.1in} \text{becomes} \hspace{0.1in} \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix}$$
We can verify that
$$\begin{bmatrix} \mathbf{x}_i \\ 1 \end{bmatrix}^\top \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix} = \mathbf{w}^\top \mathbf{x}_i + b$$
Using this, we can simplify the above formulation of $h(\mathbf{x}_i)$ to
$$h(\mathbf{x}_i) = \textrm{sign}(\mathbf{w}^\top \mathbf{x}_i)$$
Observation: Note that
$$y_i(\mathbf{w}^\top \mathbf{x}_i) > 0 \Longleftrightarrow \mathbf{x}_i \text{ is classified correctly,}$$
where 'classified correctly' means that $\mathbf{x}_i$ is on the correct side of the hyperplane defined by $\mathbf{w}$. Also, note that the left side depends on $y_i \in \{-1, +1\}$ (it wouldn't work if, for example, $y_i \in \{0, +1\}$).
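To make the bias-absorption trick concrete, here is a minimal NumPy sketch (the helper names `absorb_bias` and `predict` are made up for illustration, not part of the lecture):

```python
import numpy as np

def absorb_bias(X):
    """Append a constant 1 to each feature vector so the bias b
    becomes the last entry of the weight vector."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])

def predict(w, X):
    """h(x) = sign(w^T x), applied row-wise."""
    return np.sign(X @ w)

# Tiny check that [x, 1]^T [w, b] equals w^T x + b
X = np.array([[2.0, -1.0],
              [0.5,  3.0]])
w = np.array([1.0, -2.0])
b = 0.5

X_aug = absorb_bias(X)        # shape (2, 3)
w_aug = np.append(w, b)       # shape (3,)
assert np.allclose(X_aug @ w_aug, X @ w + b)
print(predict(w_aug, X_aug))  # [ 1. -1.]
```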

Perceptron Algorithm

Now that we know what $\mathbf{w}$ is supposed to do (define a hyperplane that separates the data), let's look at how we can find such a $\mathbf{w}$.

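A minimal NumPy sketch of the training loop, assuming the bias has already been absorbed into each $\mathbf{x}_i$ and that the labels lie in $\{-1, +1\}$ (the `max_epochs` cap is just an added safeguard, not part of the algorithm):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=1000):
    """Train a Perceptron on bias-augmented inputs X and labels y in {-1, +1}.

    Whenever a point is misclassified, i.e. y_i (w^T x_i) <= 0,
    apply the update w <- w + y_i x_i; stop once a full pass over
    the data makes no mistakes.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:   # misclassified (or exactly on the hyperplane)
                w += y_i * x_i         # the Perceptron update
                mistakes += 1
        if mistakes == 0:              # every point is classified correctly
            return w
    return w
```

If the data is linearly separable, the convergence result below guarantees that this loop stops after at most $1/\gamma^2$ updates.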

Geometric Intuition


Quiz #1: Can you draw a visualization of a Perceptron update?
Quiz #2: How often can the Perceptron misclassify the same point $\mathbf{x}$ repeatedly?

Perceptron Convergence

Suppose that $\exists \mathbf{w}^*$ such that $y_i(\mathbf{w}^{*\top} \mathbf{x}_i) > 0$ for all $(\mathbf{x}_i, y_i) \in D$.

Now, suppose that we rescale each data point and $\mathbf{w}^*$ such that
$$||\mathbf{w}^*|| = 1 \quad \text{and} \quad ||\mathbf{x}_i|| \le 1 \quad \forall \mathbf{x}_i \in D$$
The margin of a hyperplane, $\gamma$, is defined as
$$\gamma = \min_{(\mathbf{x}_i, y_i) \in D} |\mathbf{w}^{*\top} \mathbf{x}_i|$$
We can visualize this as follows:


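As a small numeric illustration of the margin definition (a sketch with made-up data; `w_star` plays the role of $\mathbf{w}^*$ and already satisfies the rescaling above):

```python
import numpy as np

# A unit-norm separating hyperplane and a few rescaled points (||x_i|| <= 1).
w_star = np.array([0.6, 0.8])                # ||w_star|| = 1
X = np.array([[ 0.5,  0.5],
              [-0.3, -0.9],
              [ 0.1,  0.2]])
y = np.array([1, -1, 1])

# w_star separates the data: y_i (w_star^T x_i) > 0 for every i.
assert np.all(y * (X @ w_star) > 0)

# The margin is the smallest |w_star^T x_i| over the data set.
gamma = np.min(np.abs(X @ w_star))
print(gamma)   # ~0.22
```
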
Theorem: If all of the above holds, then the Perceptron algorithm makes at most $1/\gamma^2$ mistakes.

Proof:
Keeping what we defined above, consider the effect of an update ($\mathbf{w}$ becomes $\mathbf{w} + y\mathbf{x}$) on the two terms $\mathbf{w}^\top \mathbf{w}^*$ and $\mathbf{w}^\top \mathbf{w}$. We will use two facts: