Data Assumption: \(y_{i} \in \mathbb{R}\)
Model Assumption: \(y_{i} = \mathbf{w}^T\mathbf{x}_i +
\epsilon_i\) where \(\epsilon_i \sim N(0, \sigma^2)\)
\(\Rightarrow y_i|\mathbf{x}_i \sim N(\mathbf{w}^T\mathbf{x}_i, \sigma^2)
\Rightarrow
P(y_i|\mathbf{x}_i,\mathbf{w})=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(\mathbf{x}_i^T\mathbf{w}-y_i)^2}{2\sigma^2}}\)
In words, we assume that the data is drawn from a "line" \(\mathbf{w}^T
\mathbf{x}\) through the origin (one can always add a bias / offset
through an additional dimension, similar to the
Perceptron). For each data point with
features \(\mathbf{x}_i\), the label \(y_i\) is drawn from a Gaussian with
mean \(\mathbf{w}^T \mathbf{x}_i\) and variance \(\sigma^2\). Our task
is to estimate the slope \(\mathbf{w}\) from the data.
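To make the generative assumption concrete, here is a minimal sketch (in Python with numpy; the values of \(d\), \(n\), \(\mathbf{w}\), and \(\sigma\) are arbitrary illustrative choices, not taken from the notes) of how such data could be sampled:
```python
import numpy as np

# Minimal sketch of the generative assumption above; all numbers are illustrative.
rng = np.random.default_rng(0)

d, n = 3, 100                         # feature dimension, number of data points
w_true = rng.standard_normal(d)       # the unknown weight vector w
sigma = 0.5                           # noise standard deviation

X = rng.standard_normal((d, n))       # columns are the feature vectors x_i
eps = rng.normal(0.0, sigma, size=n)  # epsilon_i ~ N(0, sigma^2)
y = w_true @ X + eps                  # y_i = w^T x_i + epsilon_i
```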
Estimating with MLE
\[ \begin{aligned} \hat{\mathbf{w}}_{\text{MLE}} &=
\operatorname*{argmax}_{\mathbf{\mathbf{w}}} \;
P(y_1,\mathbf{x}_1,...,y_n,\mathbf{x}_n|\mathbf{w})\\ &=
\operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; \prod_{i=1}^n
P(y_i,\mathbf{x}_i|\mathbf{w}) & \textrm{Because data points are
sampled independently}\\ &=
\operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; \prod_{i=1}^n
P(y_i|\mathbf{x}_i,\mathbf{w})P(\mathbf{x}_i|\mathbf{w}) &
\textrm{Chain rule of probability}\\ &=
\operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; \prod_{i=1}^n
P(y_i|\mathbf{x}_i,\mathbf{w})P(\mathbf{x}_i) &
\textrm{\(\mathbf{x}_i\) is independent of \(\mathbf{w}\); we only model
\(P(y_i|\mathbf{x}_i)\)}\\ &=
\operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; \prod_{i=1}^n
P(y_i|\mathbf{x}_i,\mathbf{w}) & \textrm{\(P(\mathbf{x}_i)\) is a
constant - can be dropped}\\ &=
\operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; \sum_{i=1}^n
\log\left[P(y_i|\mathbf{x}_i,\mathbf{w})\right] & \textrm{log is a
monotonic function}\\ &= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \;
\sum_{i=1}^n \left[ \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) +
\log\left(e^{-\frac{(\mathbf{x}_i^T\mathbf{w}-y_i)^2}{2\sigma^2}}\right)\right]
& \textrm{Plugging in probability distribution}\\ &=
\operatorname*{argmax}_{\mathbf{\mathbf{w}}} \;
-\frac{1}{2\sigma^2}\sum_{i=1}^n (\mathbf{x}_i^T\mathbf{w}-y_i)^2 &
\textrm{First term is a constant, and \(\log(e^z)=z\)}\\ &=
\operatorname*{argmin}_{\mathbf{\mathbf{w}}} \; \frac{1}{n}\sum_{i=1}^n
(\mathbf{x}_i^T\mathbf{w}-y_i)^2 & \textrm{Flip sign to switch to a minimization; rescaling by \(\frac{2\sigma^2}{n}>0\) preserves the argmin}\\
\end{aligned} \]
We are minimizing a loss function, \(l(\mathbf{w}) =
\frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^T\mathbf{w}-y_i)^2\). This
particular loss function is known as the squared loss, and minimizing it is known as
Ordinary Least Squares (OLS). In this form, it has a natural interpretation as the average
squared error of the prediction over the training set.
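As a quick sketch (assuming numpy, and storing the data as a \(d \times n\) matrix whose columns are the \(\mathbf{x}_i\), the convention used below), the loss can be written directly as:
```python
import numpy as np

def squared_loss(w, X, y):
    """Average squared error l(w) = (1/n) sum_i (x_i^T w - y_i)^2.

    X: (d, n) array with columns x_i; y: (n,) array of labels.
    """
    residuals = X.T @ w - y       # x_i^T w - y_i for all i
    return np.mean(residuals ** 2)
```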
OLS can be optimized with gradient descent, Newton's
method, or in closed form.
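For instance, gradient descent uses the gradient \(\nabla_\mathbf{w}\, l(\mathbf{w}) = \frac{2}{n}\sum_{i=1}^n (\mathbf{x}_i^T\mathbf{w}-y_i)\,\mathbf{x}_i\). A minimal sketch (assuming numpy; the step size and iteration count are arbitrary, untuned choices):
```python
import numpy as np

def ols_gradient_descent(X, y, lr=0.1, iters=1000):
    """Minimize l(w) = (1/n) sum_i (x_i^T w - y_i)^2 by gradient descent.

    X: (d, n) array whose columns are the x_i; y: (n,) array of labels.
    lr and iters are illustrative defaults, not tuned values.
    """
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        residuals = X.T @ w - y              # x_i^T w - y_i for all i
        grad = (2.0 / n) * (X @ residuals)   # gradient of the averaged squared loss
        w -= lr * grad
    return w
```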
Closed Form Solution: if \( \mathbf{X} \mathbf{X}^T \) is invertible, then \[\hat{\mathbf{w}} = (\mathbf{X
X}^T)^{-1}\mathbf{X}\mathbf{y}^T \text{ where }
\mathbf{X}=\left[\mathbf{x}_1,\dots,\mathbf{x}_n\right] \in \mathbb{R}^{d \times n} \text{ and } \mathbf{y}=\left[y_1,\dots,y_n\right] \in \mathbb{R}^{1 \times n}.\]
Otherwise, there is not a unique solution, and any \( \mathbf{w} \) that is a solution of the linear equation
\[ \mathbf{X
X}^T \hat{\mathbf{w}} = \mathbf{X}\mathbf{y}^T \]
minimizes the objective.
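In code, this could look as follows (a sketch assuming numpy; `np.linalg.lstsq` returns a solution of the linear system even when \(\mathbf{X}\mathbf{X}^T\) is singular):
```python
import numpy as np

def ols_closed_form(X, y):
    """Return a w solving X X^T w = X y^T.

    X: (d, n) array with columns x_i; y: (n,) array of labels.
    """
    A = X @ X.T                                # X X^T, shape (d, d)
    b = X @ y                                  # X y^T, shape (d,)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)  # also handles the singular case
    return w
```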
Estimating with MAP
To use MAP, we need to make an additional modeling assumption: a prior over the
weight vector \( \mathbf{w} \). Here we take a zero-mean isotropic Gaussian with
variance \(\tau^2\) in each dimension,
\[ P(\mathbf{w}) =
\frac{1}{(2\pi\tau^2)^{d/2}}e^{-\frac{\mathbf{w}^T\mathbf{w}}{2\tau^2}}.\]
With this, our MAP estimator becomes (the first step below is Bayes' rule; the denominator does not depend on \(\mathbf{w}\), and the remaining steps mirror the MLE derivation above)
\[ \begin{aligned} \hat{\mathbf{w}}_{\text{MAP}} &=
\operatorname*{argmax}_{\mathbf{\mathbf{w}}} \;
P(\mathbf{w}|y_1,\mathbf{x}_1,...,y_n,\mathbf{x}_n)\\ &=
\operatorname*{argmax}_{\mathbf{\mathbf{w}}} \;
\frac{P(y_1,\mathbf{x}_1,...,y_n,\mathbf{x}_n|\mathbf{w})P(\mathbf{w})}{P(y_1,\mathbf{x}_1,...,y_n,\mathbf{x}_n)}\\
&= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \;
P(y_1,\mathbf{x}_1,...,y_n,\mathbf{x}_n|\mathbf{w})P(\mathbf{w})\\ &=
\operatorname*{argmax}_{\mathbf{\mathbf{w}}} \;
\left[\prod_{i=1}^nP(y_i,\mathbf{x}_i|\mathbf{w})\right]P(\mathbf{w})\\
&= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \;
\left[\prod_{i=1}^nP(y_i|\mathbf{x}_i,\mathbf{w})P(\mathbf{x}_i|\mathbf{w})\right]P(\mathbf{w})\\
&= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \;
\left[\prod_{i=1}^nP(y_i|\mathbf{x}_i,\mathbf{w})P(\mathbf{x}_i)\right]P(\mathbf{w})\\
&= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; \left[\prod_{i=1}^n
P(y_i|\mathbf{x}_i,\mathbf{w})\right]P(\mathbf{w})\\ &=
\operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; \sum_{i=1}^n \log
P(y_i|\mathbf{x}_i,\mathbf{w})+ \log P(\mathbf{w})\\ &=
\operatorname*{argmin}_{\mathbf{\mathbf{w}}} \; \frac{1}{2\sigma^2}
\sum_{i=1}^n (\mathbf{x}_i^T\mathbf{w}-y_i)^2 +
\frac{1}{2\tau^2}\mathbf{w}^T\mathbf{w}\\ &=
\operatorname*{argmin}_{\mathbf{\mathbf{w}}} \; \frac{1}{n} \left( \sum_{i=1}^n
(\mathbf{x}_i^T\mathbf{w}-y_i)^2 + \lambda \| \mathbf{w} \|_2^2 \right)
& \textrm{where \(\lambda=\frac{\sigma^2}{\tau^2}\)}\\ \end{aligned} \]
This objective is known as Ridge Regression. It has the closed-form solution
\(\hat{\mathbf{w}} = (\mathbf{X} \mathbf{X}^T+\lambda
\mathbf{I})^{-1}\mathbf{X}\mathbf{y}^T,\) where
\(\mathbf{X}=\left[\mathbf{x}_1,\dots,\mathbf{x}_n\right]\) and
\(\mathbf{y}=\left[y_1,\dots,y_n\right]\).
Here the solution always exists and is unique (why?).
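As a sketch (assuming numpy), the Ridge Regression estimator can be computed by solving the corresponding linear system rather than forming the inverse explicitly:
```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return the Ridge Regression solution (X X^T + lam*I)^{-1} X y^T.

    X: (d, n) array with columns x_i; y: (n,) array of labels; lam > 0.
    The system is solved directly instead of computing an explicit inverse.
    """
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
```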