Linear Regression

Cornell CS 4/5780

Spring 2022


Assumptions

Data Assumption: $y_i \in \mathbb{R}$

Model Assumption: $y_i = \mathbf{w}^\top \mathbf{x}_i + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$

$$y_i \mid \mathbf{x}_i \sim N(\mathbf{w}^\top \mathbf{x}_i, \sigma^2) \;\Rightarrow\; P(y_i \mid \mathbf{x}_i, \mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\mathbf{x}_i^\top \mathbf{w} - y_i)^2}{2\sigma^2}}$$

In words, we assume that the data is drawn from a "line" $\mathbf{w}^\top\mathbf{x}$ through the origin (one can always add a bias / offset through an additional dimension, similar to the Perceptron). For each data point with features $\mathbf{x}_i$, the label $y_i$ is drawn from a Gaussian with mean $\mathbf{w}^\top\mathbf{x}_i$ and variance $\sigma^2$. Our task is to estimate the slope $\mathbf{w}$ from the data.
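To make the model assumption concrete, here is a minimal NumPy sketch that generates synthetic data from it; the dimensions, noise level, and random seed are illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 500                          # feature dimension and number of points (illustrative)
sigma = 0.5                            # noise standard deviation
w_true = rng.normal(size=d)            # the unknown "slope" we would like to recover

X = rng.normal(size=(d, n))            # columns are the feature vectors x_i, so X is d x n
eps = rng.normal(scale=sigma, size=n)  # epsilon_i ~ N(0, sigma^2)
y = w_true @ X + eps                   # y_i = w_true^T x_i + epsilon_i
```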

Estimating with MLE

$$\begin{aligned}
\hat{\mathbf{w}}_{\text{MLE}} &= \operatorname*{argmax}_{\mathbf{w}}\; P(y_1,\mathbf{x}_1,\dots,y_n,\mathbf{x}_n \mid \mathbf{w}) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \prod_{i=1}^n P(y_i,\mathbf{x}_i \mid \mathbf{w}) && \text{data points are sampled independently} \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \prod_{i=1}^n P(y_i \mid \mathbf{x}_i,\mathbf{w})\, P(\mathbf{x}_i \mid \mathbf{w}) && \text{chain rule of probability} \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \prod_{i=1}^n P(y_i \mid \mathbf{x}_i,\mathbf{w})\, P(\mathbf{x}_i) && \mathbf{x}_i \text{ is independent of } \mathbf{w}\text{; we only model } P(y_i \mid \mathbf{x}_i) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \prod_{i=1}^n P(y_i \mid \mathbf{x}_i,\mathbf{w}) && P(\mathbf{x}_i) \text{ is a constant and can be dropped} \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \sum_{i=1}^n \log P(y_i \mid \mathbf{x}_i,\mathbf{w}) && \log \text{ is a monotonic function} \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \sum_{i=1}^n \left[ \log\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) + \log\!\left(e^{-\frac{(\mathbf{x}_i^\top\mathbf{w} - y_i)^2}{2\sigma^2}}\right) \right] && \text{plug in the probability density} \\
&= \operatorname*{argmax}_{\mathbf{w}}\; -\frac{1}{2\sigma^2} \sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w} - y_i)^2 && \text{first term is a constant, and } \log(e^z) = z \\
&= \operatorname*{argmin}_{\mathbf{w}}\; \frac{1}{n} \sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w} - y_i)^2 && \text{rescale and switch to minimization}
\end{aligned}$$

We are minimizing a loss function, $\ell(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w} - y_i)^2$. This particular loss function is also known as the squared loss or Ordinary Least Squares (OLS). In this form, it has a natural interpretation as the average squared error of the prediction over the training set. OLS can be optimized with gradient descent, Newton's method, or in closed form.
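As one option, here is a sketch of plain gradient descent on this loss, continuing with the synthetic `X` and `y` from the snippet above; the step size and iteration count are untuned illustrative values.

```python
def squared_loss(w, X, y):
    """Average squared error (1/n) * sum_i (x_i^T w - y_i)^2."""
    r = w @ X - y                          # residuals x_i^T w - y_i
    return np.mean(r ** 2)

def ols_gradient_descent(X, y, lr=0.1, steps=2000):
    """Minimize the squared loss with plain gradient descent (sketch)."""
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = (2.0 / n) * X @ (w @ X - y) # gradient of the average squared loss
        w = w - lr * grad
    return w

w_gd = ols_gradient_descent(X, y)          # should approach w_true as n grows
```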

Closed Form Solution: if $\mathbf{X}\mathbf{X}^\top$ is invertible, then $\hat{\mathbf{w}} = (\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y}^\top$, where $\mathbf{X} = [\mathbf{x}_1,\dots,\mathbf{x}_n] \in \mathbb{R}^{d \times n}$ and $\mathbf{y} = [y_1,\dots,y_n] \in \mathbb{R}^{1 \times n}$. Otherwise, there is no unique solution, and any $\mathbf{w}$ that solves the linear system $\mathbf{X}\mathbf{X}^\top\hat{\mathbf{w}} = \mathbf{X}\mathbf{y}^\top$ minimizes the objective.
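In code, the closed-form solution amounts to solving this linear system rather than forming the inverse of $\mathbf{X}\mathbf{X}^\top$ explicitly; a sketch with NumPy, again using the `X` and `y` from the earlier snippet:

```python
# Solve (X X^T) w = X y^T for w; solving the system is cheaper and more
# numerically stable than computing an explicit inverse.
w_ols = np.linalg.solve(X @ X.T, X @ y)

# If X X^T is singular, np.linalg.lstsq still returns a minimizer of the
# same objective (it solves min_w ||X^T w - y||_2).
w_ols_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)
```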

Estimating with MAP

To use MAP, we will need to make an additional modeling assumption of a prior for the weights: $P(\mathbf{w}) = \frac{1}{\sqrt{2\pi\tau^2}}\, e^{-\frac{\mathbf{w}^\top\mathbf{w}}{2\tau^2}}$. With this, our MAP estimator becomes

$$\begin{aligned}
\hat{\mathbf{w}}_{\text{MAP}} &= \operatorname*{argmax}_{\mathbf{w}}\; P(\mathbf{w} \mid y_1,\mathbf{x}_1,\dots,y_n,\mathbf{x}_n) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \frac{P(y_1,\mathbf{x}_1,\dots,y_n,\mathbf{x}_n \mid \mathbf{w})\, P(\mathbf{w})}{P(y_1,\mathbf{x}_1,\dots,y_n,\mathbf{x}_n)} \\
&= \operatorname*{argmax}_{\mathbf{w}}\; P(y_1,\mathbf{x}_1,\dots,y_n,\mathbf{x}_n \mid \mathbf{w})\, P(\mathbf{w}) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \left[\prod_{i=1}^n P(y_i,\mathbf{x}_i \mid \mathbf{w})\right] P(\mathbf{w}) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \left[\prod_{i=1}^n P(y_i \mid \mathbf{x}_i,\mathbf{w})\, P(\mathbf{x}_i \mid \mathbf{w})\right] P(\mathbf{w}) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \left[\prod_{i=1}^n P(y_i \mid \mathbf{x}_i,\mathbf{w})\, P(\mathbf{x}_i)\right] P(\mathbf{w}) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \left[\prod_{i=1}^n P(y_i \mid \mathbf{x}_i,\mathbf{w})\right] P(\mathbf{w}) \\
&= \operatorname*{argmax}_{\mathbf{w}}\; \sum_{i=1}^n \log P(y_i \mid \mathbf{x}_i,\mathbf{w}) + \log P(\mathbf{w}) \\
&= \operatorname*{argmin}_{\mathbf{w}}\; \frac{1}{2\sigma^2} \sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w} - y_i)^2 + \frac{1}{2\tau^2}\mathbf{w}^\top\mathbf{w} \\
&= \operatorname*{argmin}_{\mathbf{w}}\; \frac{1}{n} \left( \sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w} - y_i)^2 + \lambda \|\mathbf{w}\|_2^2 \right) && \text{with } \lambda = \frac{\sigma^2}{\tau^2}
\end{aligned}$$

This objective is known as Ridge Regression. It has the closed-form solution $\hat{\mathbf{w}} = (\mathbf{X}\mathbf{X}^\top + \lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{y}^\top$, where $\mathbf{X} = [\mathbf{x}_1,\dots,\mathbf{x}_n]$ and $\mathbf{y} = [y_1,\dots,y_n]$ as before. This solution always exists and is unique (why?).
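A corresponding NumPy sketch of the ridge solution, with an illustrative choice of $\lambda$ and the same `X` and `y` as before:

```python
# For any lam > 0, X X^T + lam * I is positive definite and hence invertible,
# so this system always has a unique solution.
lam = 0.1                              # illustrative regularization strength
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(X.shape[0]), X @ y)
```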

Summary

Ordinary Least Squares: $\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w} - y_i)^2$; squared loss, no regularization; closed form $\hat{\mathbf{w}} = (\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y}^\top$.

Ridge Regression: $\min_{\mathbf{w}} \frac{1}{n}\left(\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w} - y_i)^2 + \lambda\|\mathbf{w}\|_2^2\right)$; squared loss with $\ell_2$-regularization; closed form $\hat{\mathbf{w}} = (\mathbf{X}\mathbf{X}^\top + \lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{y}^\top$.