
Lecture 3: The Perceptron

Assumptions

  1. Binary classification (i.e. $y_i \in \{-1, +1\}$)
  2. Data is linearly separable

Classifier

$$h(x_i) = \textrm{sign}(w^\top x_i + b)$$

$b$ is the bias term (without the bias term, the hyperplane that $w$ defines would always have to go through the origin). Dealing with $b$ can be a pain, so we 'absorb' it into the feature vector $w$ by adding one additional constant dimension. Under this convention,

$$x_i \hspace{0.1in} \text{becomes} \hspace{0.1in} \begin{bmatrix} x_i \\ 1 \end{bmatrix}, \qquad w \hspace{0.1in} \text{becomes} \hspace{0.1in} \begin{bmatrix} w \\ b \end{bmatrix}$$

We can verify that

$$\begin{bmatrix} x_i \\ 1 \end{bmatrix}^\top \begin{bmatrix} w \\ b \end{bmatrix} = w^\top x_i + b$$

Using this, we can simplify the above formulation of $h(x_i)$ to

$$h(x_i) = \textrm{sign}(w^\top x_i)$$

Observation: Note that

$$y_i(w^\top x_i) > 0 \iff x_i \text{ is classified correctly,}$$

where 'classified correctly' means that $x_i$ is on the correct side of the hyperplane defined by $w$. Also, note that the left side depends on $y_i \in \{-1, +1\}$ (it wouldn't work if, for example, $y_i \in \{0, +1\}$).
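The absorption trick above can be checked numerically. This is an illustrative sketch with made-up values for $w$, $b$, and $x$; the point is only that the augmented dot product equals the original affine function.

```python
import numpy as np

# Illustrative sketch of absorbing the bias b into the weight vector.
# The values of w, b, and x below are arbitrary examples.
w = np.array([2.0, -1.0])   # weight vector
b = 0.5                     # bias term
x = np.array([1.0, 3.0])    # one data point

# Augment: append a constant 1 to x, and append b to w.
x_aug = np.append(x, 1.0)
w_aug = np.append(w, b)

# The two formulations agree: w.x + b == w_aug.x_aug
assert np.isclose(w @ x + b, w_aug @ x_aug)
```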

Perceptron Algorithm

Now that we know what $w$ is supposed to do (define a hyperplane that separates the data), let's look at how we can find such a $w$.

Initialize $w = 0$. Then, repeatedly cycle through the training set: for each $(x_i, y_i)$, if $y_i(w^\top x_i) \le 0$ (i.e. $x_i$ is misclassified), update $w \leftarrow w + y_i x_i$. Stop once every point is classified correctly.
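The algorithm can be sketched in a few lines of NumPy. This is a minimal illustrative implementation (the function name, `max_epochs` cap, and the toy data are not part of the notes); it assumes the bias has already been absorbed as described above.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Minimal perceptron sketch. Assumes the bias is absorbed, i.e.
    X carries a constant-1 column if an offset is needed.
    X: (n, d) array of inputs; y: (n,) array of labels in {-1, +1}."""
    w = np.zeros(X.shape[1])          # initialize w = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:    # xi is misclassified
                w = w + yi * xi       # perceptron update
                mistakes += 1
        if mistakes == 0:             # converged: all points correct
            break
    return w

# Tiny linearly separable example (last column is the absorbed bias).
X = np.array([[ 1.0,  1.0, 1.0],
              [ 2.0,  2.0, 1.0],
              [-1.0, -1.0, 1.0],
              [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
assert all(np.sign(X @ w) == y)   # every point ends up classified correctly
```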


Geometric Intuition

Below we can see that the point $x_i$ is initially misclassified (left) by the hyperplane defined by $w_t$ at time $t$. The middle panel shows how the perceptron update changes $w_t$ to $w_{t+1}$. The right panel shows that the point is no longer misclassified after the update.

Perceptron Convergence

Suppose that $\exists w^*$ such that $y_i(w^{*\top} x_i) > 0 \;\; \forall (x_i, y_i) \in D$.

Now, suppose that we rescale each data point and $w^*$ such that

$$||w^*|| = 1 \hspace{0.3in} \text{and} \hspace{0.3in} ||x_i|| \le 1 \;\; \forall x_i \in D$$

The margin of a hyperplane, $\gamma$, is defined as

$$\gamma = \min_{(x_i, y_i) \in D} |w^{*\top} x_i|$$

We can visualize this as follows:
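Concretely, the margin is just the smallest absolute projection of any data point onto the unit-norm separator. A small sketch with made-up values for $w^*$ and the points:

```python
import numpy as np

# Sketch: computing the margin gamma of a separating hyperplane w_star.
# w_star and the points below are illustrative, not from the notes.
w_star = np.array([1.0, 0.0])          # unit-norm separator, ||w_star|| = 1
X = np.array([[ 0.5,  0.3],
              [-0.2,  0.9],
              [ 0.8, -0.1]])           # every point satisfies ||x_i|| <= 1

gamma = np.min(np.abs(X @ w_star))     # gamma = min_i |w_star^T x_i|
print(gamma)                           # 0.2
```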


Theorem: If all of the above holds, then the perceptron algorithm makes at most $1/\gamma^2$ mistakes.

Keeping what we defined above, consider the effect of a single update ($w \leftarrow w + yx$ for some misclassified $(x, y)$) on the two quantities $w^\top w^*$ and $w^\top w$:
  1. Consider $w^\top w^*$:
     $$(w + yx)^\top w^* = w^\top w^* + y(x^\top w^*) \ge w^\top w^* + \gamma$$
     The inequality follows from the fact that, for $w^*$, the distance from the hyperplane defined by $w^*$ to $x$ must be at least $\gamma$; since $w^*$ classifies every point correctly, $y(x^\top w^*) = |x^\top w^*| \ge \gamma$.

     This means that for each update, $w^\top w^*$ grows by at least $\gamma$.
  2. Now consider $w^\top w$:
     $$(w + yx)^\top (w + yx) = w^\top w + 2y(w^\top x) + y^2(x^\top x) \le w^\top w + 1$$
     The inequality follows from the fact that $2y(w^\top x) \le 0$ (the update was triggered because $x$ was misclassified, i.e. $y(w^\top x) \le 0$) and $y^2(x^\top x) \le 1$ (since $y^2 = 1$ and $||x|| \le 1$).
  3. This means that for each update, $w^\top w$ grows by at most 1.
  4. Now we can put together the above findings. Suppose we had $M$ updates. Then:
     $$\begin{align}
     M\gamma &\le w^\top w^* && \text{By (1)} \\
     &= |w^\top w^*| && \text{By (1) again (the dot product must be non-negative, because the initialization is } 0 \text{ and each update increases it by at least } \gamma\text{)} \\
     &\le ||w||\,||w^*|| && \text{By the Cauchy-Schwarz inequality} \\
     &= ||w|| && \text{As } ||w^*|| = 1 \\
     &= \sqrt{w^\top w} && \text{By definition of } ||w|| \\
     &\le \sqrt{M} && \text{By (2), as } w \text{ starts at } 0 \text{ and each update adds at most } 1 \text{ to } w^\top w
     \end{align}$$
     It follows that $M\gamma \le \sqrt{M}$, and therefore $M \le 1/\gamma^2$.
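The bound can be checked empirically. The sketch below (with made-up data: a fixed unit-norm separator, random points in the unit ball, and an enforced minimum margin of 0.1) counts the perceptron's updates until convergence and verifies that the count never exceeds $1/\gamma^2$.

```python
import numpy as np

# Empirical sanity check of the mistake bound (a sketch, not a proof).
rng = np.random.default_rng(0)
w_star = np.array([1.0, 1.0]) / np.sqrt(2)   # unit-norm true separator

# Sample points in the unit ball and keep those with margin >= 0.1.
X = rng.uniform(-1, 1, size=(500, 2))
X = X[np.linalg.norm(X, axis=1) <= 1]
X = X[np.abs(X @ w_star) >= 0.1]
y = np.sign(X @ w_star)                      # labels from w_star
gamma = np.min(np.abs(X @ w_star))           # realized margin

# Run the perceptron, counting updates (mistakes) until convergence;
# termination is guaranteed because the data is separable with margin gamma.
w = np.zeros(2)
mistakes = 0
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:
            w = w + yi * xi
            mistakes += 1
            converged = False

assert mistakes <= 1 / gamma**2   # the theorem's bound holds
```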

History