Linear Regression

Cornell CS 4/5780

Fall 2021


Assumptions

Data Assumption: $y_i \in \mathbb{R}$
Model Assumption: $y_i = \mathbf{w}^\top \mathbf{x}_i + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$

$$\Rightarrow y_i | \mathbf{x}_i \sim N(\mathbf{w}^\top \mathbf{x}_i, \sigma^2) \quad \Rightarrow \quad P(y_i | \mathbf{x}_i, \mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\mathbf{x}_i^\top \mathbf{w} - y_i)^2}{2\sigma^2}}$$

In words, we assume that the data is drawn from a "line" $\mathbf{w}^\top \mathbf{x}$ through the origin (one can always add a bias / offset through an additional dimension, similar to the Perceptron). For each data point with features $\mathbf{x}_i$, the label $y_i$ is drawn from a Gaussian with mean $\mathbf{w}^\top \mathbf{x}_i$ and variance $\sigma^2$. Our task is to estimate the slope $\mathbf{w}$ from the data.
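As a concrete illustration, here is a minimal Python/NumPy sketch (my own, not part of the original notes) that draws a synthetic dataset according to this model assumption; the number of points, dimension, true weight vector, and noise level are made-up values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (all values chosen only for illustration):
n, d, sigma = 100, 3, 0.5          # number of points, dimension, noise std
w_true = rng.normal(size=d)        # the unknown "slope" w we want to recover

X = rng.normal(size=(n, d))                 # rows are the feature vectors x_i
eps = rng.normal(scale=sigma, size=n)       # eps_i ~ N(0, sigma^2)
y = X @ w_true + eps                        # y_i = w^T x_i + eps_i
```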

Estimating with MLE

$$
\begin{aligned}
\mathbf{w} &= \operatorname*{argmax}_{\mathbf{w}}\; P(y_1, \mathbf{x}_1, \dots, y_n, \mathbf{x}_n | \mathbf{w}) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \prod_{i=1}^n P(y_i, \mathbf{x}_i | \mathbf{w}) && \text{Because data points are independently sampled.} \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \prod_{i=1}^n P(y_i | \mathbf{x}_i, \mathbf{w})\, P(\mathbf{x}_i | \mathbf{w}) && \text{Chain rule of probability.} \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \prod_{i=1}^n P(y_i | \mathbf{x}_i, \mathbf{w})\, P(\mathbf{x}_i) && \text{$\mathbf{x}_i$ is independent of $\mathbf{w}$; we only model $P(y_i | \mathbf{x}_i)$.} \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \prod_{i=1}^n P(y_i | \mathbf{x}_i, \mathbf{w}) && \text{$P(\mathbf{x}_i)$ is a constant and can be dropped.} \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \sum_{i=1}^n \log\left[P(y_i | \mathbf{x}_i, \mathbf{w})\right] && \text{$\log$ is a monotonic function.} \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \sum_{i=1}^n \left[\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) + \log\left(e^{-\frac{(\mathbf{x}_i^\top \mathbf{w} - y_i)^2}{2\sigma^2}}\right)\right] && \text{Plugging in the probability distribution.} \\
&= \operatorname*{argmax}_{\mathbf{w}}\; -\frac{1}{2\sigma^2} \sum_{i=1}^n (\mathbf{x}_i^\top \mathbf{w} - y_i)^2 && \text{First term is a constant, and $\log(e^z) = z$.} \\
&= \operatorname*{argmin}_{\mathbf{w}}\; \frac{1}{n} \sum_{i=1}^n (\mathbf{x}_i^\top \mathbf{w} - y_i)^2 && \text{Always minimize; $\tfrac{1}{n}$ makes the loss interpretable (average squared error).}
\end{aligned}
$$
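To make the last two steps tangible, the following sketch (an illustration, not from the notes) writes out the negative log-likelihood; the only part that depends on $\mathbf{w}$ is $\frac{1}{2\sigma^2}\sum_{i=1}^n(\mathbf{x}_i^\top\mathbf{w}-y_i)^2$, so maximizing the likelihood and minimizing the average squared error yield the same $\mathbf{w}$.

```python
import numpy as np

def neg_log_likelihood(w, X, y, sigma):
    """-log prod_i P(y_i | x_i, w) for the Gaussian noise model; X has points as rows."""
    n = len(y)
    resid = X @ w - y
    # constant term (independent of w)  +  (1 / (2 sigma^2)) * sum of squared residuals
    return 0.5 * n * np.log(2 * np.pi * sigma ** 2) + np.sum(resid ** 2) / (2 * sigma ** 2)
```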

We are minimizing a loss function, $\ell(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n (\mathbf{x}_i^\top \mathbf{w} - y_i)^2$. This particular loss function is also known as the squared loss or Ordinary Least Squares (OLS). OLS can be optimized with gradient descent, Newton's method, or in closed form.
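As one of those options, here is a plain gradient-descent sketch for the squared loss (again an illustration; the step size and iteration count are arbitrary choices, and X stores one data point per row rather than the column convention used in the closed form below).

```python
import numpy as np

def ols_gradient_descent(X, y, lr=0.1, steps=1000):
    """Minimize l(w) = (1/n) sum_i (x_i^T w - y_i)^2 with gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ w - y)   # gradient of the average squared error
        w -= lr * grad
    return w
```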

Closed Form: $\mathbf{w} = (\mathbf{X}\mathbf{X}^\top)^{-1} \mathbf{X}\mathbf{y}^\top$, where $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n]$ and $\mathbf{y} = [y_1, \dots, y_n]$.
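A corresponding NumPy sketch of this closed form (keeping the notes' column convention, so X is $d \times n$; np.linalg.solve is used instead of an explicit matrix inverse for numerical stability):

```python
import numpy as np

def ols_closed_form(X, y):
    """OLS closed form w = (X X^T)^{-1} X y^T, with data points as the columns of X."""
    return np.linalg.solve(X @ X.T, X @ y)
```

With the synthetic data from the first sketch, ols_closed_form(X.T, y) should come out close to w_true.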

Estimating with MAP

Additional Model Assumption: $P(\mathbf{w}) = \frac{1}{\sqrt{2\pi\tau^2}}\, e^{-\frac{\mathbf{w}^\top \mathbf{w}}{2\tau^2}}$
$$
\begin{aligned}
\mathbf{w} &= \operatorname*{argmax}_{\mathbf{w}}\; P(\mathbf{w} | y_1, \mathbf{x}_1, \dots, y_n, \mathbf{x}_n) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \frac{P(y_1, \mathbf{x}_1, \dots, y_n, \mathbf{x}_n | \mathbf{w})\, P(\mathbf{w})}{P(y_1, \mathbf{x}_1, \dots, y_n, \mathbf{x}_n)} \\
&= \operatorname*{argmax}_{\mathbf{w}}\; P(y_1, \mathbf{x}_1, \dots, y_n, \mathbf{x}_n | \mathbf{w})\, P(\mathbf{w}) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \left[\prod_{i=1}^n P(y_i, \mathbf{x}_i | \mathbf{w})\right] P(\mathbf{w}) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \left[\prod_{i=1}^n P(y_i | \mathbf{x}_i, \mathbf{w})\, P(\mathbf{x}_i | \mathbf{w})\right] P(\mathbf{w}) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \left[\prod_{i=1}^n P(y_i | \mathbf{x}_i, \mathbf{w})\, P(\mathbf{x}_i)\right] P(\mathbf{w}) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \left[\prod_{i=1}^n P(y_i | \mathbf{x}_i, \mathbf{w})\right] P(\mathbf{w}) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \sum_{i=1}^n \log P(y_i | \mathbf{x}_i, \mathbf{w}) + \log P(\mathbf{w}) \\
&= \operatorname*{argmin}_{\mathbf{w}}\; \frac{1}{2\sigma^2} \sum_{i=1}^n (\mathbf{x}_i^\top \mathbf{w} - y_i)^2 + \frac{1}{2\tau^2} \mathbf{w}^\top \mathbf{w} \\
&= \operatorname*{argmin}_{\mathbf{w}}\; \frac{1}{n} \sum_{i=1}^n (\mathbf{x}_i^\top \mathbf{w} - y_i)^2 + \lambda \|\mathbf{w}\|_2^2 && \text{where } \lambda = \frac{\sigma^2}{n\tau^2}
\end{aligned}
$$

This objective is known as Ridge Regression. It has a closed form solution of: $\mathbf{w} = (\mathbf{X}\mathbf{X}^\top + \lambda \mathbf{I})^{-1} \mathbf{X}\mathbf{y}^\top$, where $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n]$ and $\mathbf{y} = [y_1, \dots, y_n]$.
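A matching sketch of the ridge closed form, with the same column convention for X and with lam standing in for the regularization strength $\lambda$ (which the MAP derivation ties to $\frac{\sigma^2}{n\tau^2}$):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge closed form w = (X X^T + lambda I)^{-1} X y^T, data points as columns of X."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
```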

Summary

Ordinary Least Squares: $\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^n (\mathbf{x}_i^\top \mathbf{w} - y_i)^2$; squared loss, no regularization; closed form $\mathbf{w} = (\mathbf{X}\mathbf{X}^\top)^{-1} \mathbf{X}\mathbf{y}^\top$.
Ridge Regression: $\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^n (\mathbf{x}_i^\top \mathbf{w} - y_i)^2 + \lambda \|\mathbf{w}\|_2^2$; squared loss with $l_2$-regularization; closed form $\mathbf{w} = (\mathbf{X}\mathbf{X}^\top + \lambda \mathbf{I})^{-1} \mathbf{X}\mathbf{y}^\top$.