
18: Bagging

Bagging, also known as Bootstrap Aggregating (Breiman, 1996), is an ensemble method.

Bagging Reduces Variance

Remember the Bias / Variance decomposition:
$$\underbrace{\mathbb{E}[(h_D(x) - y)^2]}_{\text{Error}} = \underbrace{\mathbb{E}[(h_D(x) - \bar{h}(x))^2]}_{\text{Variance}} + \underbrace{\mathbb{E}[(\bar{h}(x) - \bar{y}(x))^2]}_{\text{Bias}} + \underbrace{\mathbb{E}[(\bar{y}(x) - y(x))^2]}_{\text{Noise}}$$
Our goal is to reduce the variance term $\mathbb{E}[(h_D(x) - \bar{h}(x))^2]$.
For this, we want $h_D \to \bar{h}$.

Weak law of large numbers

The weak law of large numbers says (roughly) that for i.i.d. random variables $x_i$ with mean $\bar{x}$, we have
$$\frac{1}{m}\sum_{i=1}^{m} x_i \to \bar{x} \quad \text{as } m \to \infty.$$
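
As a quick aside (not part of the original notes), the following Python snippet simulates this statement with uniform random variables: the sample mean approaches the true mean as $m$ grows.

```python
# A minimal sketch illustrating the weak law of large numbers:
# the average of m i.i.d. draws approaches the true mean as m grows.
import numpy as np

rng = np.random.default_rng(0)
true_mean = 0.5                      # mean of a Uniform(0, 1) random variable

for m in [10, 100, 10_000, 1_000_000]:
    xs = rng.uniform(0.0, 1.0, size=m)
    print(f"m = {m:>9,}: sample mean = {xs.mean():.4f} (true mean = {true_mean})")
```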

Apply this to classifiers: Assume we have $m$ training sets $D_1, D_2, \dots, D_m$ drawn from $P^n$. Train a classifier on each one and average the results:
$$\hat{h} = \frac{1}{m}\sum_{i=1}^{m} h_{D_i} \to \bar{h} \quad \text{as } m \to \infty.$$
We refer to such an average of multiple classifiers as an ensemble of classifiers.
Good news: If $\hat{h} \to \bar{h}$, the variance component of the error must also vanish, i.e. $\mathbb{E}[(\hat{h}(x) - \bar{h}(x))^2] \to 0$.
Problem: We don't have $m$ data sets $D_1, \dots, D_m$; we only have $D$.

Solution: Bagging (Bootstrap Aggregating)

Simulate drawing from $P$ by drawing uniformly with replacement from the set $D$.
i.e., let $Q(X, Y \mid D)$ be a probability distribution that picks a training sample $(x_i, y_i)$ from $D$ uniformly at random. More formally, $Q((x_i, y_i) \mid D) = \frac{1}{n} \;\; \forall (x_i, y_i) \in D$, with $n = |D|$.
We sample the set $D_i \sim Q^n$, i.e. $|D_i| = n$, and $D_i$ is picked with replacement from $Q(\cdot \mid D)$.

Q: What is $\mathbb{E}[|D \cap D_i|]$?
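One way to build intuition for this question is a quick simulation (an aside, not from the original notes): draw $D_i$ with replacement from $D$ and count how many distinct elements of $D$ it contains. The expectation is $n\left(1 - (1 - \frac{1}{n})^n\right) \approx (1 - \frac{1}{e})\, n \approx 0.632\, n$, i.e. roughly 63% of the original points appear in each bootstrap sample.

```python
# A small simulation estimating E[|D ∩ D_i|] for a bootstrap sample D_i of size n:
# count the distinct indices of D that appear in D_i, averaged over many trials.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 200

unique_counts = [
    np.unique(rng.integers(0, n, size=n)).size   # distinct points of D in one D_i
    for _ in range(trials)
]
print("simulated   E[|D ∩ D_i|] ≈", np.mean(unique_counts))
print("closed form n(1-(1-1/n)^n) =", n * (1 - (1 - 1 / n) ** n))
```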
Bagged classifier: $\hat{h}_D = \frac{1}{m}\sum_{i=1}^{m} h_{D_i}$
Notice: $\hat{h}_D = \frac{1}{m}\sum_{i=1}^{m} h_{D_i} \nrightarrow \bar{h}$ (we cannot use the W.L.L.N. here, since it only holds for i.i.d. samples). However, in practice bagging still reduces variance very effectively.
Analysis
Although we cannot prove that the new samples are i.i.d., we can show that they are drawn from the original distribution $P$. Assume $P$ is discrete, with $P(X = x_i) = p_i$ over some set $\Omega = \{x_1, \dots, x_N\}$ ($N$ very large); let's ignore the labels for now for simplicity.
$$Q(X = x_i) = \sum_{k=1}^{n} \underbrace{\binom{n}{k} p_i^k (1 - p_i)^{n-k}}_{\text{Probability that there are } k \text{ copies of } x_i \text{ in } D} \; \underbrace{\frac{k}{n}}_{\text{Probability to pick one of these copies}} = \frac{1}{n} \underbrace{\sum_{k=1}^{n} \binom{n}{k} p_i^k (1 - p_i)^{n-k} k}_{\text{Expected value of Binomial Distribution with parameter } p_i \text{: } \mathbb{E}[B(p_i, n)] = n p_i} = \frac{1}{n} n p_i = p_i$$
TATAAA!! Each data set $D_l$ is drawn from $P$, but not independently.

There is a simple intuitive argument why $Q(X = x_i) = P(X = x_i)$. So far we assumed that you draw $D$ from $P^n$ and then $Q$ picks a sample from $D$. However, you don't have to do it in that order. You can also view sampling from $Q$ in reverse: first use $Q$ to reserve a "spot" in $D$, i.e. a number $i$ from $1, \dots, n$, meaning that you will use the $i$-th data point in $D$. So far you only have the slot $i$, and you still need to fill it with a data point $(x_i, y_i)$. You do this by sampling $(x_i, y_i)$ from $P$. It is now clear that which slot you picked doesn't really matter, so we have $Q(X = x) = P(X = x)$.
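
The claim $Q(X = x_i) = P(X = x_i)$ can also be checked empirically. The sketch below (an aside, not from the original notes) draws $D$ from a small discrete $P$, resamples from $D$ uniformly with replacement, and compares the resulting empirical distribution to $P$.

```python
# Empirical check that sampling from Q (uniform over D, with replacement)
# reproduces the original distribution P.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])            # discrete P over Omega = {x_1, x_2, x_3}
n = 10_000                               # n = |D|

D = rng.choice(len(p), size=n, p=p)      # D drawn from P^n
bootstrap = rng.choice(D, size=100_000)  # draws from Q(.|D), with replacement
empirical_Q = np.bincount(bootstrap, minlength=len(p)) / bootstrap.size

print("P:", p)
print("Q:", np.round(empirical_Q, 3))    # matches p up to sampling noise in D
```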

Bagging summarized
  1. Sample $m$ data sets $D_1, \dots, D_m$ from $D$ with replacement.
  2. For each $D_j$ train a classifier $h_j(\cdot)$.
  3. The final classifier is $h(x) = \frac{1}{m}\sum_{j=1}^{m} h_j(x)$.
In practice, a larger $m$ results in a better ensemble; however, at some point you will see diminishing returns. Note that setting $m$ unnecessarily high will only slow down your classifier, but it will not increase its error.
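
Below is a minimal sketch of these three steps, assuming scikit-learn decision trees as base learners and averaging of predicted class probabilities as the combination rule (the notes leave both choices open; for classification a majority vote works just as well).

```python
# A minimal bagging sketch: bootstrap m data sets, train one tree per set,
# and average the individual predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
n, m = len(X), 25                                   # n = |D|, m = number of bootstrap sets

trees = []
for _ in range(m):
    idx = rng.integers(0, n, size=n)                # step 1: sample D_j from D with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # step 2: train h_j on D_j

# Step 3: the bagged classifier averages the individual predictions.
avg_proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
y_hat = avg_proba.argmax(axis=1)
print("training accuracy of the bagged ensemble:", (y_hat == y).mean())
```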

Advantages of Bagging

Random Forest

One of the most famous and useful bagged algorithms is the Random Forest! A Random Forest is essentially nothing else but bagged decision trees, with a slightly modified splitting criterion.

The algorithm works as follows:

  1. Sample $m$ data sets $D_1, \dots, D_m$ from $D$ with replacement.
  2. For each $D_j$ train a full decision tree $h_j(\cdot)$ (max-depth $= \infty$) with one small modification: before each split, randomly subsample $k \leq d$ features (without replacement) and only consider these for your split. (This further increases the variance of the trees.)
  3. The final classifier is $h(x) = \frac{1}{m}\sum_{j=1}^{m} h_j(x)$.
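
In practice you rarely need to implement this by hand; for instance, scikit-learn's RandomForestClassifier follows essentially this recipe, with `n_estimators` playing the role of $m$ and `max_features` the role of $k$. A usage sketch (assuming scikit-learn is installed):

```python
# Random Forest via scikit-learn: a bootstrap sample per tree, and a random
# subset of features considered at every split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # m: number of trees
    max_features="sqrt",   # k: features considered at each split (here k = sqrt(d))
    bootstrap=True,        # each tree is trained on a bootstrap sample of D
    random_state=0,
).fit(X_tr, y_tr)

print("test accuracy:", forest.score(X_te, y_te))
```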

The Random Forest is one of the best, most popular, and easiest to use out-of-the-box classifiers. There are two reasons for this:

Useful variants of Random Forests: