$\newcommand{\R}{\mathbb{R}}$ $\newcommand{\norm}[1]{\left\|#1\right\|}$ $\newcommand{\Exv}[1]{\mathbf{E}\left[#1\right]}$ $\newcommand{\Prob}[1]{\mathbf{P}\left(#1\right)}$ $\newcommand{\Var}[1]{\operatorname{Var}\left(#1\right)}$ $\newcommand{\Abs}[1]{\left|#1\right|}$
import numpy
import scipy
import matplotlib
from matplotlib import pyplot
import time
matplotlib.rcParams.update({'font.size': 18})
Prelim Exam This Thursday: Take-home, open-book, open-computer exam covering everything we've done up to that point (but with a focus on things that have appeared in problem sets and programming assignments). 24 hours, starting from the posted exam time. Submit your solutions on gradescope.
If you have an accommodation that would get you extra time on in-class timed exams, I have been told that this generally does not apply to take-home exams like this one, so I will not be automatically granting you any extra time based on your existing SDS letter. If you think this is in error as regards your particular SDS accommodation, please let me know ASAP.
No programming assignment today. Because of the prelim, we will hold off on assigning PA2 until next Monday.
Suppose we have a dataset $\mathcal{D} = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \}$, where $x_i \in \mathcal{X}$ is an example and $y_i \in \mathcal{Y}$ is a label.
Let $h: \mathcal{X} \rightarrow \mathcal{Y}$ be a hypothesized model (mapping from examples to labels) we are trying to evaluate.
Let $L: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$ be a *loss function* which measures how different two labels are.
The empirical risk is
$$R(h) = \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i).$$

We need to compute the empirical risk a lot: during training (to monitor the training loss), during validation (and hyperparameter optimization), and during testing, so it's nice if we can do it fast.
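To make this concrete, here is a small sketch of computing the empirical risk with the 0-1 loss in numpy. The dataset, the thresholding hypothesis $h$, and the loss function here are all made up purely for illustration.

rng = numpy.random.default_rng(0)                       # synthetic data, for illustration only
n = 100000
xs = rng.normal(size=n)                                 # examples
ys = (xs + 0.5 * rng.normal(size=n) > 0).astype(int)    # noisy labels
h = lambda x: (x > 0).astype(int)                       # a hypothesized model to evaluate
loss_01 = lambda yhat, y: (yhat != y).astype(float)     # 0-1 loss
R_full = numpy.mean(loss_01(h(xs), ys))                 # empirical risk over all n examples
print(R_full)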
Suppose we want to know the average number of dogs kept as pets in an American household.
One way to do this is to call up all 100 million or so American households and ask them how many dogs they have, then compute the average.
Is this efficient? Is this an effective way to allocate the resources we have to do science?
How can we gain knowledge of the average more efficiently?
Let $\tilde i$ be a random index drawn uniformly from $\{1, \ldots, n\}$, and let $Z$ be a random variable that takes on the value $L(h(x_{\tilde i}), y_{\tilde i})$. That is, $Z$ is the result of sampling a single element from the sum in the formula for the empirical risk. By the definition of expected value:
\begin{align*} \Exv{Z} &= \sum_{i=1}^n \mathbf{P}(\tilde i = i) \cdot L(h(x_i), y_i) \\ &= \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i) = R(h). \end{align*}

If we sample independent random variables $Z_1, Z_2, \ldots, Z_K$, each distributed identically to $Z$, then their average will be a good approximation of the empirical risk. That is,
$$S_K = \frac{1}{K} \sum_{k=1}^K Z_k \approx R(h).$$

Computing $S_K$ takes only $O(K)$ loss evaluations, rather than the $O(n)$ needed for the full empirical risk.
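As a quick sketch of this subsampled estimator (reusing the illustrative xs, ys, h, and loss_01 defined above):

K = 1000
idx = rng.integers(0, n, size=K)                        # K indices sampled uniformly at random
S_K = numpy.mean(loss_01(h(xs[idx]), ys[idx]))          # average of the K sampled losses
print(S_K, R_full)                                      # S_K should be close to the full empirical risk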
This is an instance of the statistical principle that the average of a collection of independent random variables tends to cluster around their mean.
We can formalize this asymptotically with the strong law of large numbers, which says that
$$\Prob{\lim_{K \rightarrow \infty} \frac{1}{K} \sum_{k=1}^K Z_k = \Exv{Z} = R(h)} = 1,$$i.e. that as the number of samples approaches infinity, their average converges to their mean almost surely.
To get a sense of how this average fluctuates around its mean for finite $K$, we can use the central limit theorem, which characterizes the behavior of large sums of independent random variables. If our random variables $Z_k$ have bounded mean and variance, then
$$\sqrt{K} \left( \frac{1}{K} \sum_{k=1}^K Z_k - \Exv{Z} \right) \text{ converges in distribution to } \mathcal{N}(0, \Var{Z}) \text{ as } K \rightarrow \infty.$$
# plot the density of the standard normal distribution, the limiting shape the CLT predicts
z = numpy.arange(-5,5,0.01)
pz = numpy.exp(-(z**2)/2) / numpy.sqrt(2*numpy.pi)   # standard normal pdf
pyplot.plot(z,pz);
pyplot.title("Normal Distribution");
Takeaway: pretty much any large enough average is going to look like a bell curve.
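To see this concretely, here is a small simulation sketch (again reusing the illustrative synthetic setup from above): we draw many independent subsampled estimates $S_K$ and histogram them; the histogram should look roughly bell-shaped around $R(h)$.

num_trials = 2000                                       # number of independent estimates to draw
estimates = numpy.zeros(num_trials)
for t in range(num_trials):
    idx = rng.integers(0, n, size=K)
    estimates[t] = numpy.mean(loss_01(h(xs[idx]), ys[idx]))
pyplot.figure()
pyplot.hist(estimates, bins=40);
pyplot.title("Distribution of $S_K$");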
We need some way to tell how many samples we need to average to get approximations we can be confident in.
To address this problem, we use something called a concentration inequality: a formula that bounds the probability that a finite sum or average deviates from its expected value by more than a given amount.
Concentration inequalities and subsampling aren't just about making machine learning more efficient. They also drive statistical science. This is how we can do surveys of the population and be confident that our estimates are accurate. And applying this incorrectly can be very embarrassing!
Another way to look at these sorts of questions is: how large do I need to make my sample to be confident in my estimate to within a specified margin of error?
There are other applications in medicine and drug trials.
Note that in practice, there are many other sources of error we also need to take into account (e.g. the random variable we're sampling may have some bias relative to the thing we're trying to measure), but concentration of some sort is usually at the core of the analysis!
The granddaddy of all concentration inequalities is Markov's inequality, which states that if $S$ is a non-negative random variable with finite expected value, then for any constant $a > 0$,
$$\Prob{S \ge a} \le \frac{\Exv{S}}{a}.$$

What bound do we get for our empirical risk sum using Markov's inequality if we are using a 0-1 loss (so $0 \le Z \le 1$)? Is this a useful bound? Since the 0-1 losses are non-negative, the average $S_K$ is a non-negative random variable with $\Exv{S_K} = R(h)$, from which it immediately follows that...
$$\color{green}{\Prob{\frac{1}{K} \sum_{k=1}^K Z_k \ge a} \le \frac{R(h)}{a}}$$...
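As a hedged numerical illustration (the value of $R(h)$ here is made up): if $R(h) = 0.2$ and we take $a = 0.4$, Markov's inequality only tells us that
$$\Prob{S_K \ge 0.4} \le \frac{0.2}{0.4} = 0.5,$$
and this bound does not shrink as $K$ grows; it is also vacuous whenever $a \le R(h)$.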
A perhaps-more-useful concentration inequality is Chebyshev's inequality. This inequality uses the variance of the random variable, in addition to its expected value, to bound its distance from its expected value. If $S$ is a random variable with finite expected value and variance, then for any constant $a > 0$,
$$\Prob{\Abs{S - \Exv{S}} \ge a} \le \frac{\Var{S}}{a^2}.$$

To prove this, apply Markov's inequality to the random variable $Y = (S - \Exv{S})^2$: since $Y$ is the square of something, it is always non-negative, so Markov's inequality applies.
$$\Prob{Y \ge a} \le \frac{\Exv{Y}}{a}.$$

Substituting $a \rightarrow a^2$ and plugging in the definition of $Y$,
$$\Prob{(S - \Exv{S})^2 \ge a^2} \le \frac{\Exv{(S - \Exv{S})^2}}{a^2} = \frac{\Var{S}}{a^2}.$$

Since $(S - \Exv{S})^2 \ge a^2$ if and only if $\Abs{S - \Exv{S}} \ge a$, this is exactly Chebyshev's inequality:
$$\Prob{\Abs{S - \Exv{S}} \ge a} \le \frac{\Var{S}}{a^2}.$$

What bound do we get for our empirical risk sum using Chebyshev's inequality if we are using a 0-1 loss (so $0 \le Z \le 1$)? Is this a useful bound?
$$\color{green}{\Prob{\Abs{ \frac{1}{K} \sum_{k=1}^K Z_k - R(h) } \ge a} \le \frac{1}{4K a^2}}$$

(Here we used $\Var{S_K} = \Var{Z}/K$ by independence, and $\Var{Z} \le \frac{1}{4}$ since $0 \le Z \le 1$.)

Activity: if we want to estimate the empirical risk with 0-1 loss to within $10\%$ error (i.e. $\Abs{S_K - R(h)} \le 10\%$) with probability $99\%$, how many samples $K$ do we need to average up if we use this Chebyshev's inequality bound?
...
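One way to sanity-check an answer to this activity is to solve the bound $\frac{1}{4Ka^2} \le \delta$ for $K$; the sketch below just does that arithmetic (the variable names here are mine, purely for illustration).

a = 0.1                                                 # target error
delta = 0.01                                            # target failure probability (1 - 99%)
K_chebyshev = int(numpy.ceil(1 / (4 * delta * a**2)))   # smallest K with 1/(4*K*a^2) <= delta
print(K_chebyshev)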
But we often want to approximate the empirical risk many times during training, either to validate a model or to monitor convergence of training loss.
For example, suppose we have $M$ hypotheses we want to validate ($h^{(1)}, \ldots, h^{(M)}$), and we use independent subsamples ($S_K^{(1)}, \ldots, S_K^{(M)}$, each of size $K$) to approximate the empirical risk for each of them. What bound can we get using Chebyshev's inequality on the probability that all $M$ of our independent approximations are within a distance $a$ of their true empirical risk?
$$\color{green}{\Prob{\Abs{S_K^{(m)} - R(h^{(m)})} \le a \text{ for all } m \in \{1,\ldots, M\}} \ge 1 - \frac{M}{4K a^2}}$$

(This follows from a union bound: each of the $M$ approximations individually fails with probability at most $\frac{1}{4Ka^2}$.)

Now if we want to estimate the empirical risk with 0-1 loss to within the same $10\%$ error rate with the same probability of $99\%$, but for all of $M = 100$ different hypotheses, how many samples do we need according to this Chebyshev bound?
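The same arithmetic as before, now with the union-bound factor of $M$ (a sketch reusing a and delta from above):

M = 100
K_chebyshev_M = int(numpy.ceil(M / (4 * delta * a**2)))   # smallest K with M/(4*K*a^2) <= delta
print(K_chebyshev_M)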
Takeaways: even for a single hypothesis, this Chebyshev bound calls for quite a few samples, and the number it requires grows linearly with the number of hypotheses $M$ we want to validate. Can we find a bound that scales better?
Hoeffding's inequality states that if $Z_1, \ldots, Z_K$ are independent random variables bounded by $z_{\min} \le Z_k \le z_{\max}$, and
$$S_K = \frac{1}{K} \sum_{k=1}^K Z_k,$$
then
$$\Prob{ \Abs{ S_K - \Exv{S_K} } \ge a } \le 2 \exp\left( -\frac{2 K a^2}{(z_{\max} - z_{\min})^2} \right).$$

Activity: if we want to estimate the empirical risk with 0-1 loss to within $10\%$ error (i.e. $\Abs{S_K - R(h)} \le 10\%$) with probability $99\%$, how many samples $K$ do we need to average up if we use this Hoeffding's inequality bound?
...
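With 0-1 loss we have $z_{\max} - z_{\min} = 1$, so one way to check an answer is to solve $2\exp(-2Ka^2) \le \delta$ for $K$; a sketch of that arithmetic (reusing a and delta from above):

K_hoeffding = int(numpy.ceil(numpy.log(2 / delta) / (2 * a**2)))   # smallest K with 2*exp(-2*K*a^2) <= delta
print(K_hoeffding)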
What if we want to estimate the empirical risk with 0-1 loss to within the same $10\%$ error rate with the same probability of $99\%$, but for all of $M = 100$ different hypotheses? How many samples do we need according to this Hoeffding's inequality bound?
...
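With a union bound over the $M$ hypotheses the failure probability gets multiplied by $M$, so we can solve $2M\exp(-2Ka^2) \le \delta$ instead (again just a sketch, reusing M, a, and delta from above):

K_hoeffding_M = int(numpy.ceil(numpy.log(2 * M / delta) / (2 * a**2)))   # smallest K with 2*M*exp(-2*K*a^2) <= delta
print(K_hoeffding_M)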
Takeaway: the Hoeffding bound is much tighter, and it scales better with the number of times we want to estimate using subsampling. We can use this sort of bound to determine how many samples we need in order to approximate a sum like the empirical risk to within some level of accuracy with high probability.
Concentration inequalities...
This is useful not just for subsampling for efficiency, but also for bounding the errors that result from using a sample of test/validation data rather than the exact statistics on the true "real-world" test distribution.
And it has loads of applications beyond machine learning, to pretty much everywhere subsampling is used...and subsampling is a core tool of statistical science, from drug trials to political surveys.
Many other concentration inequalities exist.