Linear Regression
In this lecture we will learn about Linear Regression.
Assumptions
Data Assumption: y_i \in \mathbb{R}
Model Assumption: y_i = \mathbf{w}^\top\mathbf{x}_i + \epsilon_i where \epsilon_i \sim N(0,\sigma^2)
\Rightarrow y_i|\mathbf{x}_i \sim N(\mathbf{w}^\top\mathbf{x}_i,\sigma^2) \Rightarrow P(y_i|\mathbf{x}_i,\mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(\mathbf{x}_i^\top\mathbf{w}-y_i)^2}{2\sigma^2}}
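These assumptions fully specify a generative model for the data. Below is a minimal sketch (Python/NumPy) of sampling a data set according to y_i = \mathbf{w}^\top\mathbf{x}_i + \epsilon_i; the dimensions, noise level, and random seed are illustrative choices, not part of the lecture. The columns of X hold the inputs, matching the notation used in the closed-form solutions later on.

```python
import numpy as np

# Minimal sketch: sample data from the assumed model y_i = w^T x_i + eps_i,
# eps_i ~ N(0, sigma^2). Dimensions and parameter values are illustrative only.
rng = np.random.default_rng(0)
d, n, sigma = 3, 100, 0.5
w_true = rng.normal(size=d)                        # "true" weight vector (unknown in practice)
X = rng.normal(size=(d, n))                        # columns are the inputs x_1, ..., x_n
y = w_true @ X + rng.normal(scale=sigma, size=n)   # labels y_1, ..., y_n with Gaussian noise
```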
Estimating with MLE
\begin{align}
\mathbf{w} &= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} P(y_1,\mathbf{x}_1,...,y_n,\mathbf{x}_n|\mathbf{w})\\
&= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \prod_{i=1}^n P(y_i,\mathbf{x}_i|\mathbf{w})\\
&= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w})P(\mathbf{x}_i|\mathbf{w})\\
&= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w})\\
&= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \sum_{i=1}^n \log P(y_i|\mathbf{x}_i,\mathbf{w})\\
&= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \sum_{i=1}^n \left[\log\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(\mathbf{x}_i^\top\mathbf{w}-y_i)^2}{2\sigma^2}\right]\\
&= \operatorname*{argmin}_{\mathbf{\mathbf{w}}} \frac{1}{2\sigma^2}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2\\
&= \operatorname*{argmin}_{\mathbf{\mathbf{w}}} \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2
\end{align}
The loss is thus l(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2, also known as the squared loss; minimizing it is Ordinary Least Squares (OLS). OLS can be optimized with gradient descent, Newton's method, or in closed form.
Closed Form: \mathbf{w} = (\mathbf{X X^\top})^{-1}\mathbf{X}\mathbf{y}^\top, where the columns of \mathbf{X} = [\mathbf{x}_1, ..., \mathbf{x}_n] are the inputs and \mathbf{y} = [y_1, ..., y_n] is the row vector of labels.
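Both optimization routes are easy to sketch in code. The following is a minimal illustration (assuming \mathbf{X} of shape d \times n with one input per column and \mathbf{y} of length n, as above); the function names, learning rate, and step count are arbitrary choices for the example.

```python
import numpy as np

def ols_closed_form(X, y):
    """Closed form w = (X X^T)^{-1} X y^T; np.linalg.solve avoids forming the inverse explicitly."""
    return np.linalg.solve(X @ X.T, X @ y)

def ols_gradient_descent(X, y, lr=0.01, steps=1000):
    """Minimize the squared loss (1/n) * sum_i (x_i^T w - y_i)^2 by gradient descent."""
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = (2.0 / n) * X @ (X.T @ w - y)  # gradient of the average squared loss
        w -= lr * grad
    return w
```

With a small enough learning rate and enough steps, the gradient-descent estimate approaches the closed-form solution.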
Estimating with MAP
Additional Model Assumption: P(\mathbf{w}) = \frac{1}{\sqrt{2\pi\tau^2}}e^{-\frac{\mathbf{w}^\top\mathbf{w}}{2\tau^2}}
\begin{align}
\mathbf{w} &= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} P(\mathbf{w}|y_1,\mathbf{x}_1,...,y_n,\mathbf{x}_n)\\
&= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \frac{P(y_1,\mathbf{x}_1,...,y_n,\mathbf{x}_n|\mathbf{w})P(\mathbf{w})}{P(y_1,\mathbf{x}_1,...,y_n,\mathbf{x}_n)}\\
&= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} P(y_1,...,y_n|\mathbf{x}_1,...,\mathbf{x}_n,\mathbf{w})P(\mathbf{x}_1,...,\mathbf{x}_n|\mathbf{w})P(\mathbf{w})\\
&= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w})P(\mathbf{w})\\
&= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \left[\sum_{i=1}^n \log P(y_i|\mathbf{x}_i,\mathbf{w})\right] + \log P(\mathbf{w})\\
&= \operatorname*{argmin}_{\mathbf{\mathbf{w}}} \frac{1}{2\sigma^2} \sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2 + \frac{1}{2\tau^2}\mathbf{w}^\top\mathbf{w}\\
&= \operatorname*{argmin}_{\mathbf{\mathbf{w}}} \frac{1}{n} \sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2 + \lambda|| \mathbf{w}||_2^2 \tag*{$\lambda=\frac{\sigma^2}{n\tau^2}$}\\
\end{align}
This formulation is known as Ridge Regression. It has the closed-form solution \mathbf{w} = (\mathbf{X X^{\top}}+\lambda \mathbf{I})^{-1}\mathbf{X}\mathbf{y}^\top.
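A minimal sketch of this closed form, under the same data layout as before (\mathbf{X} of shape d \times n, \mathbf{y} of length n); the function name and interface are illustrative.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge regression closed form w = (X X^T + lambda*I)^{-1} X y^T."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
```

As lam approaches 0 this recovers the OLS solution; larger values of lam shrink the weights toward zero.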
Summary
Ordinary Least Squares:
- \operatorname*{min}_{\mathbf{\mathbf{w}}} \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2.
- Squared loss.
- No regularization.
- Closed form: \mathbf{w} = (\mathbf{X X^\top})^{-1}\mathbf{X} \mathbf{y}^\top.
Ridge Regression:
- \operatorname*{min}_{\mathbf{\mathbf{w}}} \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2 + \lambda ||\mathbf{w}||_2^2.
- Squared loss.
- \ell_2\text{-regularization}.
- Closed form: \mathbf{w} = (\mathbf{X X^{\top}}+\lambda \mathbf{I})^{-1}\mathbf{X} \mathbf{y}^\top.