Loss \(\ell(h_{\mathbf{w}}(\mathbf{x}_i),y_i)\) | Comments |
Squared Loss \(\left.(h(\mathbf{x}_{i})-y_{i})^{2}\right.\)
|
- Most popular regression loss function
- Estimates Mean Label
- ADVANTAGE: Differentiable everywhere
- DISADVANTAGE: Somewhat sensitive to outliers/noise
- Also known as Ordinary Least Squares (OLS)
|
Absolute Loss \(\left.|h(\mathbf{x}_{i})-y_{i}|\right.\)
|
- Also a very popular loss function
- Estimates Median Label
- ADVANTAGE: Less sensitive to noise
- DISADVANTAGE: Not differentiable at \(0\)
|
Huber Loss
- \(\left.\frac{1}{2}\left(h(\mathbf{x}_{i})-y_{i}\right)^{2}\right.\) if \(|h(\mathbf{x}_{i})-y_{i}|<\delta\),
- otherwise \(\left.\delta(|h(\mathbf{x}_{i})-y_{i}|-\frac{\delta}{2})\right.\)
|
- Also known as Smooth Absolute Loss
- ADVANTAGE: "Best of Both Worlds" of Squared and Absolute Loss
- Once-differentiable
- Takes on behavior of Squared-Loss when loss is small, and Absolute Loss when loss is large.
|
Log-Cosh Loss \(\left.\log(\cosh(h(\mathbf{x}_{i})-y_{i}))\right.\), where \(\left.\cosh(x)=\frac{e^{x}+e^{-x}}{2}\right.\)
|
- ADVANTAGE: Similar to Huber Loss, but twice differentiable everywhere
|
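The four regression losses in the table can be written down directly. The following is a minimal NumPy sketch, not a reference implementation: the function names, the default `delta=1.0`, the helper `best_constant`, and the toy labels are all assumptions made for illustration. The log-cosh loss is computed through the identity \(\log(\cosh(r)) = |r| + \log(1+e^{-2|r|}) - \log 2\) to avoid overflow for large residuals.

```python
import numpy as np

def squared_loss(pred, y):
    # (h(x) - y)^2
    return (pred - y) ** 2

def absolute_loss(pred, y):
    # |h(x) - y|
    return np.abs(pred - y)

def huber_loss(pred, y, delta=1.0):
    # Quadratic for small residuals, linear for large ones.
    r = np.abs(pred - y)
    return np.where(r < delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def log_cosh_loss(pred, y):
    # log(cosh(r)) rewritten so that exp never sees a large positive argument.
    a = np.abs(pred - y)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

def best_constant(loss, labels, grid):
    # Hypothetical helper: the constant prediction on `grid` with the lowest total loss.
    totals = np.array([np.sum(loss(c, labels)) for c in grid])
    return grid[np.argmin(totals)]

# Toy labels with one outlier (made up for this illustration).
labels = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
grid = np.linspace(0.0, 100.0, 10001)

c_sq = best_constant(squared_loss, labels, grid)
c_abs = best_constant(absolute_loss, labels, grid)
print("squared loss minimizer:", c_sq, "mean:", np.mean(labels))
print("absolute loss minimizer:", c_abs, "median:", np.median(labels))
```

On these labels the best constant under squared loss sits at the mean (22), which the single outlier drags upward, while the best constant under absolute loss sits at the median (3). This is the "Estimates Mean Label" versus "Estimates Median Label" distinction from the table.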
Regularizer \(r(\mathbf{w})\) | Properties |
\(l_{2}\)-Regularization
\(\left.r(\mathbf{w}) = \mathbf{w}^{\top}\mathbf{w} = \|{\mathbf{w}}\|_{2}^{2}\right.\)
|
- ADVANTAGE: Strictly Convex
- ADVANTAGE: Differentiable
- DISADVANTAGE: Uses weights on all features, i.e. relies on all features to some degree (ideally we would like to avoid this) - these are known as Dense Solutions.
|
\(l_{1}\)-Regularization \(\left.r(\mathbf{w}) = \|\mathbf{w}\|_{1}\right.\)
|
- Convex (but not strictly)
- DISADVANTAGE: Not differentiable at \(0\) (the point that minimization is intended to bring us to)
- Effect: Sparse (i.e. not Dense) Solutions
|
\(l_p\)-Norm
\(\left.\|{\mathbf{w}}\|_{p} = (\sum\limits_{i=1}^d |w_{i}|^{p})^{1/p}\right.\)
|
- (often \(\left.0<p\leq1\right.\))
- DISADVANTAGE: Non-convex for \(p<1\)
- ADVANTAGE: Very sparse solutions
- Solution depends on initialization (a consequence of non-convexity)
- DISADVANTAGE: Not differentiable at \(0\)
|
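A small numerical check can make the sparsity and convexity comments above concrete. This is a sketch under stated assumptions: the function names and the two example weight vectors are made up for illustration. It compares a dense and a sparse vector that have the same squared \(l_2\) norm, and tests the midpoint inequality that any convex function must satisfy.

```python
import numpy as np

def l2_squared(w):
    # r(w) = w^T w
    return float(w @ w)

def l1(w):
    # r(w) = sum_i |w_i|
    return float(np.sum(np.abs(w)))

def lp(w, p):
    # (sum_i |w_i|^p)^(1/p); for p < 1 this is not a true norm.
    return float(np.sum(np.abs(w) ** p) ** (1.0 / p))

# Two hypothetical weight vectors with identical squared l2 norm.
w_dense = np.array([0.5, 0.5, 0.5, 0.5])
w_sparse = np.array([1.0, 0.0, 0.0, 0.0])

for name, w in [("dense", w_dense), ("sparse", w_sparse)]:
    print(name, "l2^2:", l2_squared(w), "l1:", l1(w), "lp (p=0.5):", lp(w, 0.5))

# Convexity requires r((a + b) / 2) <= (r(a) + r(b)) / 2 for all a, b.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid = 0.5 * (a + b)
l1_ok = l1(mid) <= 0.5 * (l1(a) + l1(b))
lp_ok = lp(mid, 0.5) <= 0.5 * (lp(a, 0.5) + lp(b, 0.5))
print("midpoint inequality holds for l1:", l1_ok, "for p=0.5:", lp_ok)
```

The squared \(l_2\) penalty cannot tell the two vectors apart, the \(l_1\) penalty charges the dense vector twice as much, and the \(p=0.5\) penalty charges it eight times as much. The midpoint test fails for \(p=0.5\), which is one concrete witness of the non-convexity listed in the table.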
Loss and Regularizer | Comments |
Ordinary Least Squares
\(\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}\)
|
- Squared Loss
- No Regularization
- Closed form solution:
- \(\left.\mathbf{w}=(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y}^{\top}\right.\)
- \(\left.\mathbf{X}=[\mathbf{x}_{1}, ..., \mathbf{x}_{n}]\right.\)
- \(\left.\mathbf{y}=[y_{1},...,y_{n}]\right.\)
|
Ridge Regression
\(\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\lambda\|\mathbf{w}\|_{2}^{2}\)
|
- Squared Loss
- \(l_{2}\)-Regularization
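- Closed form solution: \(\left.\mathbf{w}=(\mathbf{X}\mathbf{X}^{\top}+n\lambda\mathbb{I})^{-1}\mathbf{X}\mathbf{y}^{\top}\right.\) (the factor \(n\) comes from the \(\frac{1}{n}\) in the objective and is often absorbed into \(\lambda\))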
|
Lasso
\(\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-{y}_{i})^{2}+\lambda\|\mathbf{w}\|_{1}\)
|
- ADVANTAGE: Sparsity inducing (good for feature selection)
- ADVANTAGE: Convex
- DISADVANTAGE: Not strictly convex (the solution may not be unique)
- DISADVANTAGE: Not differentiable (at \(0\))
- Solve with (sub)-gradient descent or SVEN
|
Elastic Net \(\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-{y}_{i})^{2}+\left.\alpha\|\mathbf{w}\|_{1}+(1-\alpha)\|{\mathbf{w}}\|_{2}^{2}\right.\)
\(\left.\alpha\in[0, 1)\right.\)
|
- ADVANTAGE: Strictly convex (i.e. unique solution)
- ADVANTAGE: Sparsity inducing (good for feature selection)
- ADVANTAGE: Dual of squared-loss SVM, see SVEN
- DISADVANTAGE: Not differentiable (at \(0\))
|
Logistic Regression
\(\min_{\mathbf{w},b} \frac{1}{n}\sum\limits_{i=1}^n \log{(1+e^{-y_i(\mathbf{w}^{\top}\mathbf{x}_{i}+b)})}\)
|
- Often \(l_{1}\) or \(l_{2}\) Regularized
- Solve with gradient descent.
- \(\left.\Pr(y|\mathbf{x})=\frac{1}{1+e^{-y(\mathbf{w}^{\top}\mathbf{x}+b)}}\right.\)
|
Linear Support Vector Machine
\(\min_{\mathbf{w},b} C\sum\limits_{i=1}^n \max[1-y_{i}(\mathbf{w}^\top\mathbf{x}_i+b), 0]+\|\mathbf{w}\|_2^2\)
|
- Typically \(l_2\) regularized (sometimes \(l_1\)).
- Quadratic program.
- When kernelized, leads to sparse solutions.
- Kernelized version can be solved very efficiently with specialized algorithms (e.g. SMO)
|
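The closed-form solutions listed for Ordinary Least Squares and Ridge Regression can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a production solver: the function names and the synthetic data are made up, \(\mathbf{X}\) is stored with one data point per column (shape \(d \times n\)) as in the table, and \(\mathbf{y}\) is a 1-D array so that `X @ y` plays the role of \(\mathbf{X}\mathbf{y}^{\top}\). For the ridge objective exactly as written with the \(\frac{1}{n}\) factor, setting the gradient to zero gives \(n\lambda\) on the diagonal; many texts absorb \(n\) into \(\lambda\).

```python
import numpy as np

def ols_closed_form(X, y):
    # Solve (X X^T) w = X y without forming an explicit inverse.
    return np.linalg.solve(X @ X.T, X @ y)

def ridge_closed_form(X, y, lam):
    # Minimizer of (1/n) * sum_i (w^T x_i - y_i)^2 + lam * ||w||_2^2.
    d, n = X.shape
    return np.linalg.solve(X @ X.T + n * lam * np.eye(d), X @ y)

# Synthetic, noise-free data (hypothetical): y_i = w_true^T x_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))           # d = 3 features, n = 50 points
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X

w_ols = ols_closed_form(X, y)
w_ridge = ridge_closed_form(X, y, lam=0.1)
print("OLS:  ", w_ols)
print("Ridge:", w_ridge)
print("norms:", np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a common numerical choice. On noise-free data OLS recovers the generating weights, while the ridge solution has a smaller norm because the \(l_2\) penalty shrinks the weights toward zero.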
[1] In Bayesian Machine Learning, it is common to optimize \(\lambda\),
but for the purposes of this class, it is assumed to be fixed.
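With \(\lambda\) held fixed as assumed here, the \(l_2\)-regularized logistic regression objective from the table can be minimized with plain gradient descent. The sketch below is one way to do it, under assumptions that are not in the notes: the function names, the step size `lr`, the number of steps, the value of `lam`, and the two-blob toy data are all chosen for illustration. Data points are stored as columns of `X`, and labels are in \(\{-1,+1\}\).

```python
import numpy as np

def sigmoid(z):
    # Numerically stable form of 1 / (1 + exp(-z)).
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def objective(w, b, X, y, lam):
    # (1/n) * sum_i log(1 + exp(-y_i (w^T x_i + b))) + lam * ||w||_2^2
    margins = y * (w @ X + b)
    return np.mean(np.logaddexp(0.0, -margins)) + lam * (w @ w)

def fit_logistic_gd(X, y, lam=0.01, lr=0.1, steps=1000):
    d, n = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margins = y * (w @ X + b)
        # Derivative of log(1 + exp(-m)) with respect to the score w^T x + b.
        coeff = -y * sigmoid(-margins)
        w = w - lr * (X @ coeff / n + 2.0 * lam * w)
        b = b - lr * np.mean(coeff)
    return w, b

def prob_positive(w, b, X):
    # Pr(y = +1 | x) = 1 / (1 + exp(-(w^T x + b)))
    return sigmoid(w @ X + b)

# Toy data (hypothetical): two Gaussian blobs in 2-D, one per class.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(2.0, 1.0, size=(2, 50)),
               rng.normal(-2.0, 1.0, size=(2, 50))])
y = np.concatenate([np.ones(50), -np.ones(50)])

w, b = fit_logistic_gd(X, y)
accuracy = np.mean(np.sign(w @ X + b) == y)
print("objective at w=0:", objective(np.zeros(2), 0.0, X, y, 0.01))
print("objective after GD:", objective(w, b, X, y, 0.01))
print("training accuracy:", accuracy)
```

At \(\mathbf{w}=\mathbf{0}, b=0\) every margin is zero, so the objective starts at \(\log 2\). Gradient descent lowers it from there, and `prob_positive` evaluates the \(\Pr(y|\mathbf{x})\) formula from the table for the label \(+1\).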