| Loss $\ell(h_{\mathbf{w}}(\mathbf{x}_i), y_i)$ | Usage | Comments |
|---|---|---|
| Hinge-Loss $\max\left[1 - h_{\mathbf{w}}(\mathbf{x}_i) y_i,\, 0\right]^p$ | Standard SVM ($p=1$), (differentiable) squared hinge-loss SVM ($p=2$) | When used for the standard SVM, the loss function denotes the size of the margin between the linear separator and its closest points in either class. Only differentiable everywhere with $p=2$. |
| Log-Loss $\log\left(1 + e^{-h_{\mathbf{w}}(\mathbf{x}_i) y_i}\right)$ | Logistic Regression | One of the most popular loss functions in machine learning, since its outputs are well-calibrated probabilities. |
| Exponential Loss $e^{-h_{\mathbf{w}}(\mathbf{x}_i) y_i}$ | AdaBoost | This function is very aggressive. The loss of a mis-prediction increases exponentially with the value of $-h_{\mathbf{w}}(\mathbf{x}_i) y_i$. This can lead to nice convergence results, for example in the case of AdaBoost, but it can also cause problems with noisy data. |
| Zero-One Loss $\delta\left(\operatorname{sign}(h_{\mathbf{w}}(\mathbf{x}_i)) \neq y_i\right)$ | Actual Classification Loss | Non-continuous and thus impractical to optimize. |
Some questions about the loss functions:

Figure 4.1: Plots of Common Classification Loss Functions. x-axis: $h(\mathbf{x}_i)y_i$, or "correctness" of prediction; y-axis: loss value.
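To make the comparison concrete, here is a minimal NumPy sketch (not from the original notes) that evaluates each classification loss as a function of the margin $z = h_{\mathbf{w}}(\mathbf{x}_i)y_i$; the function names and the grid of margin values are illustrative choices.

```python
import numpy as np

def hinge_loss(z, p=1):
    """Hinge loss of the margin z = h_w(x) * y; p=2 gives the squared hinge loss."""
    return np.maximum(1.0 - z, 0.0) ** p

def log_loss(z):
    """Logistic (log) loss; np.logaddexp(0, -z) computes log(1 + exp(-z)) stably."""
    return np.logaddexp(0.0, -z)

def exponential_loss(z):
    """Exponential loss, as used by AdaBoost."""
    return np.exp(-z)

def zero_one_loss(z):
    """Zero-one loss: 1 if the prediction's sign disagrees with the label (0 counts as a mistake)."""
    return (z <= 0).astype(float)

z = np.linspace(-3, 3, 7)  # margins h_w(x_i) * y_i
for name, fn in [("hinge", hinge_loss), ("log", log_loss),
                 ("exp", exponential_loss), ("0/1", zero_one_loss)]:
    print(name, np.round(fn(z), 3))
```

Evaluating the losses on the same grid of margins mirrors Figure 4.1: all of them penalize negative margins, but at very different rates.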
| Loss $\ell(h(\mathbf{x}_i), y_i)$ | Comments |
|---|---|
| Squared Loss $\left(h(\mathbf{x}_i) - y_i\right)^2$ | The most commonly used regression loss; estimates the mean label. ADVANTAGE: differentiable everywhere. DISADVANTAGE: somewhat sensitive to outliers and noise. |
| Absolute Loss $\lvert h(\mathbf{x}_i) - y_i\rvert$ | Estimates the median label. ADVANTAGE: less sensitive to outliers. DISADVANTAGE: not differentiable at $0$. |
| Huber Loss $\frac{1}{2}\left(h(\mathbf{x}_i) - y_i\right)^2$ if $\lvert h(\mathbf{x}_i) - y_i\rvert < \delta$, otherwise $\delta\left(\lvert h(\mathbf{x}_i) - y_i\rvert - \frac{\delta}{2}\right)$ | Also known as the smooth absolute loss. ADVANTAGE: combines the advantages of the squared and absolute losses (quadratic near zero, linear for large errors) and is once differentiable. |
| Log-Cosh Loss $\log\left(\cosh\left(h(\mathbf{x}_i) - y_i\right)\right)$, $\cosh(x) = \frac{e^x + e^{-x}}{2}$ | ADVANTAGE: similar to the Huber loss, but twice differentiable everywhere. |
Figure 4.2: Plots of Common Regression Loss Functions. x-axis: $h(\mathbf{x}_i) - y_i$, or "error" of prediction; y-axis: loss value.
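A similar NumPy sketch (again illustrative, not part of the original notes) evaluates the regression losses as functions of the prediction error $h(\mathbf{x}_i) - y_i$; the Huber threshold $\delta$ is exposed as an assumed `delta` parameter.

```python
import numpy as np

def squared_loss(err):
    return err ** 2

def absolute_loss(err):
    return np.abs(err)

def huber_loss(err, delta=1.0):
    """Quadratic for |err| < delta, linear beyond; delta is a tunable threshold."""
    quad = 0.5 * err ** 2
    lin = delta * (np.abs(err) - 0.5 * delta)
    return np.where(np.abs(err) < delta, quad, lin)

def log_cosh_loss(err):
    return np.log(np.cosh(err))

err = np.linspace(-4, 4, 9)  # prediction errors h(x_i) - y_i
for name, fn in [("squared", squared_loss), ("absolute", absolute_loss),
                 ("huber", huber_loss), ("log-cosh", log_cosh_loss)]:
    print(name, np.round(fn(err), 3))
```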
| Regularizer $r(\mathbf{w})$ | Properties |
|---|---|
| $l_2$-Regularization $r(\mathbf{w}) = \mathbf{w}^\top\mathbf{w} = \lVert\mathbf{w}\rVert_2^2$ | ADVANTAGE: strictly convex and differentiable everywhere. DISADVANTAGE: places some weight on every feature, i.e. yields dense solutions. |
| $l_1$-Regularization $r(\mathbf{w}) = \lVert\mathbf{w}\rVert_1$ | Convex, but not strictly convex. DISADVANTAGE: not differentiable at $0$. Induces sparse solutions. |
| $l_p$-Norm $\lVert\mathbf{w}\rVert_p = \left(\sum_{i=1}^d \lvert w_i\rvert^p\right)^{1/p}$ | For $p < 1$: non-convex and not differentiable, but yields very sparse solutions (which depend on the initialization). |
Figure 4.3: Plots of Common Regularizers.
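The regularizers themselves are one-liners; the sketch below (illustrative names, not from the notes) evaluates each penalty on a small weight vector to show exactly what term gets added to the training loss.

```python
import numpy as np

def l2_reg(w):
    """Squared l2 norm: w^T w."""
    return np.dot(w, w)

def l1_reg(w):
    """l1 norm: sum of absolute weights."""
    return np.sum(np.abs(w))

def lp_norm(w, p):
    """General lp "norm" (not a true norm for p < 1)."""
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

w = np.array([0.5, -2.0, 0.0, 1.5])
print("l2^2:", l2_reg(w))                     # 6.5
print("l1:  ", l1_reg(w))                     # 4.0
print("l0.5:", round(lp_norm(w, 0.5), 3))     # heavily rewards the zero entry
```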
| Loss and Regularizer | Comments |
|---|---|
| Ordinary Least Squares $\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n \left(\mathbf{w}^\top\mathbf{x}_i - y_i\right)^2$ | Squared loss, no regularization. Admits a closed-form solution. |
| Ridge Regression $\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n \left(\mathbf{w}^\top\mathbf{x}_i - y_i\right)^2 + \lambda\lVert\mathbf{w}\rVert_2^2$ | Squared loss with $l_2$-regularization. Strictly convex, so the solution is unique; also admits a closed-form solution. |
| Lasso $\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n \left(\mathbf{w}^\top\mathbf{x}_i - y_i\right)^2 + \lambda\lVert\mathbf{w}\rVert_1$ | Squared loss with $l_1$-regularization. Convex but not strictly convex and not differentiable at $0$; induces sparse solutions, which is useful for feature selection. |
| Elastic Net $\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n \left(\mathbf{w}^\top\mathbf{x}_i - y_i\right)^2 + \alpha\lVert\mathbf{w}\rVert_1 + (1-\alpha)\lVert\mathbf{w}\rVert_2^2$, $\alpha \in [0, 1)$ | Combines $l_1$- and $l_2$-regularization: strictly convex (unique solution) while still inducing sparsity; not differentiable at $0$. |
| Logistic Regression $\min_{\mathbf{w}, b} \frac{1}{n}\sum_{i=1}^n \log\left(1 + e^{-y_i(\mathbf{w}^\top\mathbf{x}_i + b)}\right)$ | Often $l_1$- or $l_2$-regularized in practice. A common choice when well-calibrated prediction probabilities are needed. |
| Linear Support Vector Machine $\min_{\mathbf{w}, b} C\sum_{i=1}^n \max\left[1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b),\, 0\right] + \lVert\mathbf{w}\rVert_2^2$ | Hinge loss with $l_2$-regularization; a convex quadratic program. The constant $C$ trades off margin violations against margin size. |
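For the first two objectives the minimizer can be written in closed form. Below is a small NumPy sketch (not part of the original notes): with $X \in \mathbb{R}^{n \times d}$ holding one $\mathbf{x}_i$ per row, setting the gradient of the ridge objective to zero gives $(X^\top X + n\lambda I)\,\mathbf{w} = X^\top \mathbf{y}$, and $\lambda = 0$ recovers ordinary least squares. The data and variable names are illustrative.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/n) * sum_i (w^T x_i - y_i)^2 + lam * ||w||_2^2.

    X is n x d with one input x_i per row; lam = 0 recovers ordinary
    least squares. Setting the gradient to zero gives the linear system
    (X^T X + n*lam*I) w = X^T y, solved directly here.
    """
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Toy data: y depends only on the first feature, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

print("OLS:  ", np.round(ridge_closed_form(X, y, 0.0), 3))
print("Ridge:", np.round(ridge_closed_form(X, y, 0.5), 3))  # weights shrink toward zero
```

Lasso, Elastic Net, logistic regression, and the SVM have no such closed form and are instead solved iteratively (e.g. with (sub)gradient-based methods), since their objectives are non-differentiable or have no analytic minimizer.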