Loss $\ell(h_{\mathbf{w}}(\mathbf{x}_i),y_i)$ | Usage | Comments |
1. Hinge Loss $\max\left[1-h_{\mathbf{w}}(\mathbf{x}_{i})y_{i},0\right]^{p}$ | Standard SVM ($p=1$); (differentiable) squared hinge-loss SVM ($p=2$) | When used for the standard SVM, the loss denotes the size of the margin between the linear separator and its closest points in either class. Only differentiable everywhere when $p=2$. |
2. Log-Loss $\log(1+e^{-h_{\mathbf{w}}(\mathbf{x}_{i})y_{i}})$ | Logistic Regression | One of the most popular loss functions in Machine Learning, since its outputs are well-calibrated probabilities. |
3. Exponential Loss $e^{-h_{\mathbf{w}}(\mathbf{x}_{i})y_{i}}$ | AdaBoost | This loss is very aggressive: the loss of a mis-prediction increases exponentially with the value of $-h_{\mathbf{w}}(\mathbf{x}_i)y_i$. |
4. Zero-One Loss $\delta(\mathrm{sign}(h_{\mathbf{w}}(\mathbf{x}_{i}))\neq y_{i})$ | Actual Classification Loss | Non-continuous and thus impractical to optimize. |
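
For concreteness, the snippet below is a minimal NumPy sketch (not part of the original notes) of the four classification losses above, written in terms of the margin $h_{\mathbf{w}}(\mathbf{x}_i)y_i$ with labels $y_i\in\{-1,+1\}$; the function names and example margins are my own.

```python
import numpy as np

def hinge_loss(margin, p=1):
    # max[1 - h_w(x_i) y_i, 0]^p: p=1 is the standard SVM hinge, p=2 the squared hinge
    return np.maximum(1.0 - margin, 0.0) ** p

def log_loss(margin):
    # log(1 + exp(-h_w(x_i) y_i)); logaddexp avoids overflow for very negative margins
    return np.logaddexp(0.0, -margin)

def exponential_loss(margin):
    # exp(-h_w(x_i) y_i): grows exponentially for confident mistakes
    return np.exp(-margin)

def zero_one_loss(margin):
    # delta(sign(h_w(x_i)) != y_i): 1 for a misclassification, 0 otherwise
    return (margin <= 0).astype(float)

# Example margins: confidently correct, barely correct, confidently wrong
margins = np.array([2.0, 0.3, -1.5])
for f in (hinge_loss, log_loss, exponential_loss, zero_one_loss):
    print(f.__name__, f(margins))
```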
Loss $\ell(h_{\mathbf{w}}(\mathbf{x}_i),y_i)$ | Comments |
1. Squared Loss $(h(\mathbf{x}_{i})-y_{i})^{2}$ | Most popular regression loss function. Estimates the mean label. ADVANTAGE: differentiable everywhere. DISADVANTAGE: somewhat sensitive to outliers/noise. Also known as Ordinary Least Squares (OLS). |
2. Absolute Loss $\vert h(\mathbf{x}_{i})-y_{i}\vert$ | Also a very popular loss function. Estimates the median label. ADVANTAGE: less sensitive to noise. DISADVANTAGE: not differentiable at $0$. |
3. Huber Loss $\frac{1}{2}\left(h(\mathbf{x}_{i})-y_{i}\right)^{2}$ if $\vert h(\mathbf{x}_{i})-y_{i}\vert<\delta$, otherwise $\delta\left(\vert h(\mathbf{x}_{i})-y_{i}\vert-\frac{\delta}{2}\right)$ | Also known as Smooth Absolute Loss. ADVANTAGE: "best of both worlds" of Squared and Absolute Loss. Once differentiable. Behaves like Squared Loss when the error is small and like Absolute Loss when the error is large. |
4. Log-Cosh Loss $\log(\cosh(h(\mathbf{x}_{i})-y_{i}))$, where $\cosh(x)=\frac{e^{x}+e^{-x}}{2}$ | ADVANTAGE: similar to Huber Loss, but twice differentiable everywhere. |
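
As a quick illustration, here is a small NumPy sketch (not from the notes) of the four regression losses above, written in terms of the residual $h(\mathbf{x}_i)-y_i$; the function names and the choice $\delta=1$ are assumptions made for the example.

```python
import numpy as np

def squared_loss(r):
    return r ** 2                      # differentiable everywhere, sensitive to outliers

def absolute_loss(r):
    return np.abs(r)                   # robust to noise, not differentiable at 0

def huber_loss(r, delta=1.0):
    # quadratic for |r| < delta, linear otherwise; the two pieces meet smoothly
    quadratic = 0.5 * r ** 2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) < delta, quadratic, linear)

def log_cosh_loss(r):
    return np.log(np.cosh(r))          # ~ r^2 / 2 near 0, ~ |r| - log 2 for large |r|

residuals = np.array([-3.0, -0.2, 0.0, 0.5, 4.0])
for f in (squared_loss, absolute_loss, huber_loss, log_cosh_loss):
    print(f.__name__, f(residuals))
```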
Regularizer $r(\mathbf{w})$ | Properties |
1. $l_{2}$-Regularization $r(\mathbf{w}) = \mathbf{w}^{\top}\mathbf{w} = \|\mathbf{w}\|_{2}^{2}$ | ADVANTAGE: strictly convex. ADVANTAGE: differentiable. DISADVANTAGE: puts weight on all features, i.e. relies on all features to some degree (ideally we would like to avoid this); such solutions are known as dense solutions. |
2. $l_{1}$-Regularization $r(\mathbf{w}) = \|\mathbf{w}\|_{1}$ | Convex (but not strictly). DISADVANTAGE: not differentiable at $0$ (the point to which minimization is intended to bring us). Effect: sparse (i.e. not dense) solutions. |
3. Elastic Net $\alpha\|\mathbf{w}\|_{1}+(1-\alpha)\|\mathbf{w}\|_{2}^{2}$, with $\alpha\in[0, 1)$ | ADVANTAGE: strictly convex (i.e. unique solution). DISADVANTAGE: non-differentiable. |
4. $l_{p}$-Norm, often $0<p\leq 1$: $\|\mathbf{w}\|_{p} = \left(\sum_{i=1}^d \vert w_{i}\vert^{p}\right)^{1/p}$ | DISADVANTAGE: non-convex. ADVANTAGE: very sparse solutions. Initialization dependent. DISADVANTAGE: not differentiable. |
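
The regularizers above are just as easy to evaluate directly; the sketch below (function names, the default $\alpha$ and $p$, and the example weight vector are my own choices) computes each one for a given $\mathbf{w}$.

```python
import numpy as np

def l2_reg(w):
    return np.dot(w, w)                          # ||w||_2^2: strictly convex, differentiable

def l1_reg(w):
    return np.sum(np.abs(w))                     # ||w||_1: convex, sparsity-inducing

def elastic_net_reg(w, alpha=0.5):
    # alpha * ||w||_1 + (1 - alpha) * ||w||_2^2, with alpha in [0, 1)
    return alpha * l1_reg(w) + (1.0 - alpha) * l2_reg(w)

def lp_reg(w, p=0.5):
    # (sum_i |w_i|^p)^(1/p) for 0 < p <= 1: non-convex, drives very sparse solutions
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

w = np.array([0.0, -2.0, 0.5, 0.0, 1.0])
print(l2_reg(w), l1_reg(w), elastic_net_reg(w), lp_reg(w))
```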
Loss and Regularizer | Properties | Solutions |
1. Ordinary Least Squares $\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}$ | Squared Loss. No regularization. | Closed form: $\mathbf{w}=(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y}^{\top}$, where $\mathbf{X}=[\mathbf{x}_{1},\ldots,\mathbf{x}_{n}]$ and $\mathbf{y}=[y_{1},\ldots,y_{n}]$. |
2. Ridge Regression $\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\lambda\|\mathbf{w}\|_{2}^{2}$ | Squared Loss. $l_{2}$-Regularization. | Closed form: $\mathbf{w}=(\mathbf{X}\mathbf{X}^{\top}+\lambda\mathbb{I})^{-1}\mathbf{X}\mathbf{y}^{\top}$. |
3. Lasso $\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\lambda\|\mathbf{w}\|_{1}$ | + Sparsity inducing (good for feature selection). + Convex. - Not strictly convex (no unique solution). - Not differentiable at $0$. | Solve with (sub-)gradient descent or SVEN (a sub-gradient sketch follows this table). |
4. Logistic Regression $\min_{\mathbf{w},b} \frac{1}{n}\sum\limits_{i=1}^n \log\left(1+e^{-y_i(\mathbf{w}^{\top}\mathbf{x}_{i}+b)}\right)$ | Often $l_{1}$- or $l_{2}$-regularized. | Estimation: $\Pr(y=+1\mid\mathbf{x})=\frac{1}{1+e^{-(\mathbf{w}^{\top}\mathbf{x}+b)}}$. |
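
To make the Solutions column concrete, here is a hedged NumPy sketch of the closed-form OLS and Ridge solutions using the same convention $\mathbf{X}=[\mathbf{x}_1,\ldots,\mathbf{x}_n]$ (so $\mathbf{X}$ is $d\times n$ with one column per example), together with a plain sub-gradient-descent loop for Lasso, which has no closed form. The synthetic data, step size, and iteration count are arbitrary choices for the example, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100
w_true = rng.normal(size=d)
X = rng.normal(size=(d, n))                   # columns are the inputs x_i, as in the table
y = w_true @ X + 0.1 * rng.normal(size=n)     # noisy linear labels y = [y_1, ..., y_n]
lam = 0.1                                     # regularization strength lambda

# Ordinary Least Squares: w = (X X^T)^{-1} X y^T
w_ols = np.linalg.solve(X @ X.T, X @ y)

# Ridge Regression: w = (X X^T + lambda I)^{-1} X y^T
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# Lasso: sub-gradient descent on (1/n) sum_i (w^T x_i - y_i)^2 + lambda ||w||_1
w_lasso = np.zeros(d)
step = 0.01
for _ in range(5000):
    grad = (2.0 / n) * (X @ (X.T @ w_lasso - y)) + lam * np.sign(w_lasso)
    w_lasso -= step * grad

print(np.round(w_true, 2))
print(np.round(w_ols, 2))
print(np.round(w_ridge, 2))
print(np.round(w_lasso, 2))
```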
[1] In Bayesian Machine Learning, it is possible to optimize $\lambda$,
but for the purposes of this class, it is assumed to be fixed.