| Loss $\ell(h_{\mathbf{w}}(\mathbf{x}_i),y_i)$ | Usage | Comments |
|---|---|---|
| 1. Hinge-Loss: $\max\left[1-h_{\mathbf{w}}(\mathbf{x}_{i})y_{i},\,0\right]^{p}$ | Standard SVM ($p=1$); squared hinge-loss SVM ($p=2$) | When used for the standard SVM, the loss denotes the margin between the linear separator and its closest point in either class. Only differentiable everywhere with $p=2$. |
| 2. Log-Loss: $\log\left(1+e^{-h_{\mathbf{w}}(\mathbf{x}_{i})y_{i}}\right)$ | Logistic Regression | One of the most popular loss functions in Machine Learning; its outputs can be interpreted as well-calibrated probabilities. |
| 3. Exponential Loss: $e^{-h_{\mathbf{w}}(\mathbf{x}_{i})y_{i}}$ | AdaBoost | Very aggressive: the loss of a mis-prediction increases exponentially with the value of $-h_{\mathbf{w}}(\mathbf{x}_i)y_i$. |
| 4. Zero-One Loss: $\delta\left(\textrm{sign}(h_{\mathbf{w}}(\mathbf{x}_{i}))\neq y_{i}\right)$ | Actual classification loss | Non-continuous and thus impractical to optimize. |
Figure 4.1 below plots these classification loss functions for comparison.

Figure 4.1: Plots of Common Classification Loss Functions. x-axis: $h(\mathbf{x}_{i})y_{i}$, or "correctness" of prediction; y-axis: loss value.
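For concreteness, here is a minimal NumPy sketch of the four classification losses above, written as functions of the "correctness" value $z=h_{\mathbf{w}}(\mathbf{x}_{i})y_{i}$ plotted in Figure 4.1. The function names and sample values are illustrative, not part of the original notes.

```python
import numpy as np

# z = h_w(x_i) * y_i: positive when the prediction has the correct sign,
# large and positive when it is confidently correct.

def hinge_loss(z, p=1):
    # max[1 - h_w(x)y, 0]^p  (p=1: standard SVM, p=2: squared hinge)
    return np.maximum(1.0 - z, 0.0) ** p

def log_loss(z):
    # log(1 + e^{-z}); logaddexp(0, -z) computes this in a numerically stable way
    return np.logaddexp(0.0, -z)

def exponential_loss(z):
    # e^{-z}: grows exponentially for confident mis-predictions (z << 0)
    return np.exp(-z)

def zero_one_loss(z):
    # delta(sign(h_w(x)) != y): 1 whenever the sign of the prediction is wrong
    return (z <= 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, loss in [("hinge", hinge_loss), ("log", log_loss),
                   ("exponential", exponential_loss), ("zero-one", zero_one_loss)]:
    print(f"{name:12s}", np.round(loss(z), 3))
```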
| Loss $\ell(h_{\mathbf{w}}(\mathbf{x}_i),y_i)$ | Comments |
|---|---|
| 1. Squared Loss: $\left(h(\mathbf{x}_{i})-y_{i}\right)^{2}$ | Most popular regression loss; estimates the mean label. ADVANTAGE: differentiable everywhere. DISADVANTAGE: sensitive to outliers and noise. |
| 2. Absolute Loss: $\vert h(\mathbf{x}_{i})-y_{i}\vert$ | Also very popular; estimates the median label. ADVANTAGE: less sensitive to noise. DISADVANTAGE: not differentiable at $0$. |
| 3. Huber Loss: $\frac{1}{2}\left(h(\mathbf{x}_{i})-y_{i}\right)^{2}$ if $\vert h(\mathbf{x}_{i})-y_{i}\vert<\delta$, otherwise $\delta\left(\vert h(\mathbf{x}_{i})-y_{i}\vert-\frac{\delta}{2}\right)$ | Also known as Smooth Absolute Loss. ADVANTAGE: "best of both worlds": behaves like the Squared Loss for small errors and like the Absolute Loss for large errors; once differentiable. |
| 4. Log-Cosh Loss: $\log\left(\cosh\left(h(\mathbf{x}_{i})-y_{i}\right)\right)$, where $\cosh(x)=\frac{e^{x}+e^{-x}}{2}$ | ADVANTAGE: similar to Huber Loss, but twice differentiable everywhere. |
Figure 4.2: Plots of Common Regression Loss Functions. x-axis: $h(\mathbf{x}_{i})-y_{i}$, or "error" of prediction; y-axis: loss value.
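A corresponding sketch of the regression losses, written as functions of the prediction error $r=h(\mathbf{x}_{i})-y_{i}$ shown in Figure 4.2 (again, the names and the choice $\delta=1$ are just for illustration):

```python
import numpy as np

# r = h(x_i) - y_i: the (signed) error of a regression prediction.

def squared_loss(r):
    return r ** 2

def absolute_loss(r):
    return np.abs(r)

def huber_loss(r, delta=1.0):
    # quadratic for |r| < delta (like squared loss), linear beyond (like absolute loss)
    return np.where(np.abs(r) < delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

def log_cosh_loss(r):
    # smooth alternative to the absolute loss; twice differentiable everywhere
    return np.log(np.cosh(r))

r = np.linspace(-3.0, 3.0, 7)
for name, loss in [("squared", squared_loss), ("absolute", absolute_loss),
                   ("huber", huber_loss), ("log-cosh", log_cosh_loss)]:
    print(f"{name:10s}", np.round(loss(r), 3))
```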
| Regularizer $r(\mathbf{w})$ | Properties |
|---|---|
| 1. $l_{2}$-Regularization: $r(\mathbf{w}) = \mathbf{w}^{\top}\mathbf{w} = \Vert\mathbf{w}\Vert_{2}^{2}$ | ADVANTAGE: strictly convex and differentiable everywhere. DISADVANTAGE: places non-zero weight on all features, i.e. the solution relies on every feature to some degree (dense solutions). |
| 2. $l_{1}$-Regularization: $r(\mathbf{w}) = \Vert\mathbf{w}\Vert_{1}$ | Convex (but not strictly). DISADVANTAGE: not differentiable at $0$ (the point minimization drives the weights towards). Effect: sparse (i.e. not dense) solutions. |
| 3. Elastic Net: $\alpha\Vert\mathbf{w}\Vert_{1}+(1-\alpha)\Vert\mathbf{w}\Vert_{2}^{2}$, with $\alpha\in[0, 1)$ | ADVANTAGE: strictly convex (unique solution) while still encouraging sparsity. DISADVANTAGE: not differentiable. |
| 4. $l_{p}$-Norm, often $0<p\leq 1$: $\Vert\mathbf{w}\Vert_{p} = \left(\sum\limits_{i=1}^d \vert w_{i}\vert^{p}\right)^{1/p}$ | DISADVANTAGE: non-convex for $p<1$ and not differentiable; solutions depend on initialization. ADVANTAGE: very sparse solutions. |
Figure 4.3: Plots of Common Regularizers
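The regularizers can be sketched the same way. Function names below are illustrative; note that for $p<1$ the $l_p$ "norm" is non-convex.

```python
import numpy as np

def l2_reg(w):
    # r(w) = w^T w = ||w||_2^2
    return float(np.dot(w, w))

def l1_reg(w):
    # r(w) = ||w||_1
    return float(np.sum(np.abs(w)))

def elastic_net_reg(w, alpha=0.5):
    # alpha * ||w||_1 + (1 - alpha) * ||w||_2^2, with alpha in [0, 1)
    return alpha * l1_reg(w) + (1.0 - alpha) * l2_reg(w)

def lp_norm(w, p=0.5):
    # (sum_i |w_i|^p)^(1/p); encourages very sparse solutions when 0 < p <= 1
    return float(np.sum(np.abs(w) ** p) ** (1.0 / p))

w = np.array([0.0, 0.5, -2.0, 0.0, 1.0])   # a sparse-ish weight vector
print(l2_reg(w), l1_reg(w), elastic_net_reg(w), lp_norm(w))
```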
| Loss and Regularizer | Properties | Solution |
|---|---|---|
| 1. Ordinary Least Squares: $\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n \left(\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i}\right)^{2}$ | Squared loss, no regularization. | Closed form: $\mathbf{w}=\left(\mathbf{X}\mathbf{X}^{\top}\right)^{-1}\mathbf{X}\mathbf{y}$, where $\mathbf{X}=[\mathbf{x}_{1},\dots,\mathbf{x}_{n}]$ and $\mathbf{y}=(y_{1},\dots,y_{n})^{\top}$. |
| 2. Ridge Regression: $\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n \left(\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i}\right)^{2}+\lambda\Vert\mathbf{w}\Vert_{2}^{2}$ | Squared loss, $l_2$-regularization; strictly convex. | Closed form: $\mathbf{w}=\left(\mathbf{X}\mathbf{X}^{\top}+n\lambda\mathbf{I}\right)^{-1}\mathbf{X}\mathbf{y}$ (the factor $n$ accounts for the $\frac{1}{n}$ on the loss term). |
| 3. Lasso: $\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n \left(\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i}\right)^{2}+\lambda\Vert\mathbf{w}\Vert_{1}$ | Squared loss, $l_1$-regularization; sparsity-inducing; convex but not strictly convex and not differentiable at $0$. | No closed form; solve with (sub-)gradient descent or coordinate descent. |
| 4. Logistic Regression: $\min_{\mathbf{w},b} \frac{1}{n}\sum\limits_{i=1}^n \log\left(1+e^{-y_i\left(\mathbf{w}^{\top}\mathbf{x}_{i}+b\right)}\right)$ | Log loss; classification rather than regression; often used with $l_1$- or $l_2$-regularization. | No closed form; solve with gradient descent (or Newton's method). |
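To tie the table together, the sketch below computes the OLS and Ridge solutions in closed form and runs plain gradient descent on the (unregularized) logistic loss. The synthetic data, learning rate, and iteration count are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 3, 200, 0.1
X = rng.normal(size=(d, n))                    # columns are the examples x_i
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X + 0.1 * rng.normal(size=n)      # noisy linear regression targets

# Ordinary Least Squares: w = (X X^T)^{-1} X y
w_ols = np.linalg.solve(X @ X.T, X @ y)

# Ridge Regression: the n*lam term reflects the 1/n factor on the squared loss
w_ridge = np.linalg.solve(X @ X.T + n * lam * np.eye(d), X @ y)

# Logistic Regression on sign(y) labels in {-1, +1}: no closed form,
# so minimize the average log-loss by gradient descent.
y_cls = np.sign(y)
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    z = y_cls * (w @ X + b)                    # z_i = y_i (w^T x_i + b)
    g = -y_cls / (1.0 + np.exp(z))             # g_i = y_i * d/dz_i log(1 + e^{-z_i})
    w -= lr * (X @ g) / n                      # gradient of the average log-loss w.r.t. w
    b -= lr * g.mean()                         # ... and w.r.t. b

print("OLS:     ", np.round(w_ols, 3))
print("Ridge:   ", np.round(w_ridge, 3))
print("Logistic:", np.round(w, 3))
```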