Loss $\ell(h_{\mathbf{w}}(\mathbf{x}_i),y_i)$ | Comments |
Squared Loss $(h(\mathbf{x}_{i})-y_{i})^{2}$

- Most popular regression loss function
- Estimates Mean Label
- ADVANTAGE: Differentiable everywhere
- DISADVANTAGE: Somewhat sensitive to outliers/noise
- When used without regularization, this is Ordinary Least Squares (OLS), see below

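A minimal numpy sketch (names and data are illustrative, not from the notes) of the squared loss and of the "estimates Mean Label" point: among constant predictions, the sample mean minimizes the average squared loss, and a single outlier pulls that minimizer noticeably.

```python
import numpy as np

def squared_loss(pred, y):
    """Squared loss (h(x_i) - y_i)^2, averaged over the sample."""
    return np.mean((pred - y) ** 2)

# Among all constant predictions c, the mean of the labels minimizes the
# average squared loss -- hence "estimates Mean Label".
y = np.array([1.0, 2.0, 2.0, 10.0])           # 10.0 acts as an outlier
cs = np.linspace(0.0, 10.0, 1001)
best_c = cs[np.argmin([squared_loss(c, y) for c in cs])]
print(best_c, y.mean())                       # both 3.75, dragged up by the outlier
```
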
Absolute Loss $|h(\mathbf{x}_{i})-y_{i}|$

- Also a very popular loss function
- Estimates Median Label
- ADVANTAGE: Less sensitive to noise
- DISADVANTAGE: Not differentiable at $0$

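The same kind of sketch for the absolute loss (again with made-up data): the constant prediction minimizing the average absolute loss is the sample median, which the outlier barely moves.

```python
import numpy as np

def absolute_loss(pred, y):
    """Absolute loss |h(x_i) - y_i|, averaged over the sample."""
    return np.mean(np.abs(pred - y))

# Among all constant predictions c, the median of the labels minimizes the
# average absolute loss -- hence the robustness to outliers.
y = np.array([1.0, 2.0, 2.0, 10.0])           # same outlier as before
cs = np.linspace(0.0, 10.0, 1001)
best_c = cs[np.argmin([absolute_loss(c, y) for c in cs])]
print(best_c, np.median(y))                   # both 2.0, unaffected by the outlier
```
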
Huber Loss
$\frac{1}{2}(h(\mathbf{x}_{i})-y_{i})^{2}$ if $|h(\mathbf{x}_{i})-y_{i}|<\delta$, otherwise $\delta(|h(\mathbf{x}_{i})-y_{i}|-\frac{\delta}{2})$

- Also known as Smooth Absolute Loss
- ADVANTAGE: "Best of both worlds" of Squared and Absolute Loss
- Once-differentiable
- Takes on the behavior of Squared Loss when the residual is small, and of Absolute Loss when the residual is large

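A short numpy sketch of the Huber loss exactly as defined above; the `delta` threshold and the test residuals are arbitrary illustrative choices.

```python
import numpy as np

def huber_loss(pred, y, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = np.abs(pred - y)
    return np.where(r < delta,
                    0.5 * (pred - y) ** 2,        # squared-loss regime
                    delta * (r - 0.5 * delta))    # absolute-loss regime

residuals = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(huber_loss(residuals, 0.0, delta=1.0))
# small residuals give 0.5*r^2 (0.125), large ones give delta*(|r| - delta/2) (2.5)
```
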
Log-Cosh Loss $\log(\cosh(h(\mathbf{x}_{i})-y_{i}))$, where $\cosh(x)=\frac{e^{x}+e^{-x}}{2}$

- ADVANTAGE: Similar to Huber Loss, but twice differentiable everywhere

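An illustrative numpy sketch of the log-cosh loss; its derivative with respect to the residual is $\tanh$, which is itself differentiable (the "twice differentiable" advantage) and saturates at $\pm 1$ for large residuals, mimicking the absolute loss.

```python
import numpy as np

def log_cosh_loss(pred, y):
    """log(cosh(h(x_i) - y_i)); smooth everywhere.

    Note: for very large residuals np.cosh overflows; a numerically
    stable variant would use |r| + log1p(exp(-2|r|)) - log(2).
    """
    return np.log(np.cosh(pred - y))

r = np.array([-5.0, -0.1, 0.0, 0.1, 5.0])
print(log_cosh_loss(r, 0.0))
print(np.tanh(r))   # gradient of the loss at each residual
```
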
Regularizer $r(\mathbf{w})$ | Properties |
$l_{2}$-Regularization $r(\mathbf{w}) = \mathbf{w}^{\top}\mathbf{w} = \|\mathbf{w}\|_{2}^{2}$

- ADVANTAGE: Strictly Convex
- ADVANTAGE: Differentiable
- DISADVANTAGE: Puts (nonzero) weight on all features, i.e. relies on all features to some degree (ideally we would like to avoid this) - such solutions are known as Dense Solutions

$l_{1}$-Regularization $r(\mathbf{w}) = \|\mathbf{w}\|_{1}$

- Convex (but not strictly)
- DISADVANTAGE: Not differentiable at $0$ (the very point to which minimization is intended to drive the weights)
- Effect: Sparse (i.e. not Dense) Solutions

$l_p$-Norm $\|\mathbf{w}\|_{p} = \left(\sum\limits_{i=1}^d |w_{i}|^{p}\right)^{1/p}$

- Often used with $0<p\leq 1$
- DISADVANTAGE: Non-convex for $p<1$
- ADVANTAGE: Very sparse solutions
- Solutions are initialization dependent (a consequence of the non-convexity)
- DISADVANTAGE: Not differentiable

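A small hand-picked comparison of the three regularizers (the weight vectors are made up for illustration): on two vectors with equal $l_1$ norm, $l_2$ prefers the dense one while a $p<1$ penalty clearly prefers the sparse one; $l_1$'s own sparsity effect comes from its kink at $0$ during optimization.

```python
import numpy as np

def l2_penalty(w):
    return np.dot(w, w)                          # ||w||_2^2

def l1_penalty(w):
    return np.sum(np.abs(w))                     # ||w||_1

def lp_penalty(w, p=0.5):
    return np.sum(np.abs(w) ** p) ** (1.0 / p)   # ||w||_p with 0 < p <= 1

# Same l1 norm, but one vector spreads weight over all features (dense)
# while the other concentrates it on two features (sparse).
w_dense = np.array([0.5, 0.5, 0.5, 0.5])
w_sparse = np.array([1.0, 1.0, 0.0, 0.0])

for name, r in [("l2^2", l2_penalty), ("l1", l1_penalty), ("l_0.5", lp_penalty)]:
    print(name, r(w_dense), r(w_sparse))
# l2^2: 1.0 vs 2.0 (prefers dense); l1: 2.0 vs 2.0 (tie); l_0.5: 8.0 vs 4.0 (prefers sparse)
```
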
Loss and Regularizer | Comments |
Ordinary Least Squares
$\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}$

- Squared Loss
- No Regularization
- Closed form solution: $\mathbf{w}=(\mathbf{X}\mathbf{X}^{\top})^{-1}\mathbf{X}\mathbf{y}^{\top}$, where $\mathbf{X}=[\mathbf{x}_{1}, \dots, \mathbf{x}_{n}]$ and $\mathbf{y}=[y_{1},\dots,y_{n}]$

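A minimal numpy sketch of the closed-form solution above, using the convention that the columns of $\mathbf{X}$ are the training points (the synthetic data and names are illustrative); `np.linalg.solve` is preferred over forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 100
X = rng.normal(size=(d, n))                 # X = [x_1, ..., x_n], one column per point
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X + 0.1 * rng.normal(size=n)   # noisy labels

# Closed form w = (X X^T)^{-1} X y^T, solved as a linear system.
w_ols = np.linalg.solve(X @ X.T, X @ y)
print(w_ols)                                # close to w_true
```
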
Ridge Regression
$\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\lambda\|\mathbf{w}\|_{2}^{2}$

- Squared Loss
- $l_{2}$-Regularization
- Closed form solution: $\mathbf{w}=(\mathbf{X}\mathbf{X}^{\top}+\lambda\mathbb{I})^{-1}\mathbf{X}\mathbf{y}^{\top}$

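The same sketch with the $\lambda\mathbb{I}$ term added (again illustrative data); besides shrinking the weights, the added term keeps the linear system well conditioned even when $\mathbf{X}\mathbf{X}^{\top}$ is nearly singular.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 3, 100, 0.1
X = rng.normal(size=(d, n))                 # columns are the training points
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X + 0.1 * rng.normal(size=n)

# Closed form w = (X X^T + lambda*I)^{-1} X y^T.
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
print(w_ridge)                              # slightly shrunk toward zero vs. OLS
```
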
Lasso
$\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\lambda\|\mathbf{w}\|_{1}$

- ADVANTAGE: Sparsity inducing (good for feature selection)
- ADVANTAGE: Convex
- DISADVANTAGE: Not strictly convex (no unique solution)
- DISADVANTAGE: Not differentiable (at $0$)
- Solve with (sub)-gradient descent or SVEN

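A rough (sub)gradient-descent sketch of the Lasso objective, one of the options mentioned above (step size, iteration count, and data are arbitrary illustrative choices). Plain subgradient descent only drives the inactive coordinates near zero; practical solvers typically use coordinate or proximal methods to get exact zeros.

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam, lr=0.01, steps=5000):
    """(Sub)gradient descent on 1/n * sum (w^T x_i - y_i)^2 + lam * ||w||_1.

    X has one column per training point, matching the convention above.
    """
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        residual = w @ X - y                     # shape (n,)
        grad_loss = (2.0 / n) * (X @ residual)   # gradient of the squared-loss term
        subgrad_reg = lam * np.sign(w)           # a subgradient of lam * ||w||_1
        w -= lr * (grad_loss + subgrad_reg)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))
w_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0])    # truly sparse weights
y = w_true @ X + 0.05 * rng.normal(size=200)
print(np.round(lasso_subgradient_descent(X, y, lam=0.1), 2))   # near-sparse estimate
```
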
Elastic Net
$\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\alpha\|\mathbf{w}\|_{1}+(1-\alpha)\|\mathbf{w}\|_{2}^{2}$, with $\alpha\in[0, 1)$

- ADVANTAGE: Strictly convex (i.e. unique solution)
- ADVANTAGE: Sparsity inducing (good for feature selection)
- Dual of squared-loss SVM, see SVEN
- DISADVANTAGE: Non-differentiable

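If a ready-made solver is acceptable, scikit-learn's `ElasticNet` minimizes this kind of combined penalty; note that its scaling of the two penalty terms (via `alpha` and `l1_ratio`) differs from the formula above, so treat this only as an illustrative sketch.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))                 # scikit-learn expects one ROW per example
w_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
y = X @ w_true + 0.05 * rng.normal(size=n)

# l1_ratio plays the role of the l1/l2 mix (alpha in the formula above),
# while scikit-learn's alpha scales the overall penalty strength.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(model.coef_, 2))             # note the (near-)zero coefficients
```
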
Logistic Regression
$\min_{\mathbf{w},b} \frac{1}{n}\sum\limits_{i=1}^n \log{(1+e^{-y_i(\mathbf{w}^{\top}\mathbf{x}_{i}+b)})}$

- Often $l_{1}$ or $l_{2}$ Regularized
- Solve with gradient descent
- $\Pr{(y|\mathbf{x})}=\frac{1}{1+e^{-y(\mathbf{w}^{\top}\mathbf{x}+b)}}$

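A plain gradient-descent sketch of the (unregularized) objective above, with labels in $\{-1,+1\}$; the learning rate, iteration count, and data are illustrative choices.

```python
import numpy as np

def logistic_regression_gd(X, y, lr=0.1, steps=2000):
    """Gradient descent on 1/n * sum log(1 + exp(-y_i (w^T x_i + b))).

    X has one column per training point; y holds labels in {-1, +1}.
    """
    d, n = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margins = y * (w @ X + b)               # y_i (w^T x_i + b)
        coeff = -y / (1.0 + np.exp(margins))    # derivative of the log loss per point
        w -= lr * (X @ coeff) / n
        b -= lr * np.sum(coeff) / n
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
y = np.sign(X[0] - X[1] + 0.1 * rng.normal(size=200))   # nearly separable labels
w, b = logistic_regression_gd(X, y)
print(w, b)                                     # roughly aligned with (1, -1)
print(np.mean(np.sign(w @ X + b) == y))         # training accuracy
```
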
Linear Support Vector Machine
$\min_{\mathbf{w},b} C\sum\limits_{i=1}^n \max[1-y_{i}(\mathbf{w}^{\top}\mathbf{x}_{i}+b), 0]+\|\mathbf{w}\|_2^2$

- Typically $l_2$ regularized (sometimes $l_1$)
- Quadratic program
- When kernelized, leads to sparse solutions (in terms of support vectors)
- Kernelized version can be solved very efficiently with specialized algorithms (e.g. SMO)

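To see the kernelized case and its sparse solutions, a short scikit-learn sketch on synthetic data; `SVC` is backed by libsvm, an SMO-style solver, and exposes the support vectors it ends up relying on.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # one row per example (scikit-learn layout)
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1)  # labels from a circular decision boundary
y[y == 0] = 1

clf = SVC(kernel="rbf", C=1.0).fit(X, y)      # kernelized SVM, SMO-style solver inside
print(len(clf.support_), "support vectors out of", len(X))
print(clf.score(X, y))                        # training accuracy
```
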