Note: This lecture makes heavy use of some common notational shorthands that can lead to confusion. Keep these in mind:
If c∈R, we use the same variable c to refer to a random variable by letting c(k) ::= c for all k∈S. In other words, c is the constant function / constant random variable that always takes on the value c.
(X=a) refers to an event: it is the set of outcomes k∈S with X(k)=a. Similarly, (X≥a) is {k∈S∣X(k)≥a}.
We do arithmetic on random variables by doing the operations on their outputs. For example, X^2(k) ::= (X(k))^2.
The idea behind the variance of a random variable X is to measure how far X is, on average, from its mean (expected value). It turns out to be more useful to work with the square of the distance rather than the distance itself. This leads to the following definition:
Definition: The variance of a random variable X is given by Var(X) ::= E((X - E(X))^2).
There is another formula for the variance that is often useful:
Claim: Var(X) = E(X^2) - (E(X))^2.
Proof: We make use of the fact that for any constant c∈R, E(c)=c (here we are using c both as a number and as a constant random variable, as described in a previous lecture). In particular, since E(X) is a number, E(E(X))=E(X).
\begin{aligned} Var(X) &= E((X - E(X))^2) && \text{by definition} \\ &= E(X^2 - 2XE(X) + (E(X))^2) && \text{follows from the definitions of addition and multiplication of RVs} \\ &= E(X^2) - E(2XE(X)) + E((E(X))^2) && \text{linearity of expectation} \\ &= E(X^2) - 2E(X)E(X) + E((E(X))^2) && \text{linearity of expectation; $2E(X)$ is a constant} \\ &= E(X^2) - 2(E(X))^2 + (E(X))^2 && \text{$E(X)$ is a number, so $(E(X))^2$ is a constant and $E((E(X))^2) = (E(X))^2$} \\ &= E(X^2) - (E(X))^2 && \text{algebra} \\ \end{aligned}
Note: while walking through this proof (as with any proof), it is helpful to keep track of which things are numbers and which things are random variables, and to pay attention to when we are using one as the other.
Exercise: prove E(c)=c.
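As a quick sanity check of the claim, here is a short computation of both formulas for a fair six-sided die (an arbitrary illustrative choice of distribution):

```python
from fractions import Fraction

# A fair six-sided die (an illustrative distribution): Pr(X = x) = 1/6 for x in 1..6.
dist = {x: Fraction(1, 6) for x in range(1, 7)}

def E(f):
    """Expectation of f(X) over the distribution above."""
    return sum(p * f(x) for x, p in dist.items())

mean = E(lambda x: x)                              # E(X) = 7/2
var_by_definition = E(lambda x: (x - mean) ** 2)   # E((X - E(X))^2)
var_by_claim = E(lambda x: x ** 2) - mean ** 2     # E(X^2) - (E(X))^2

print(mean, var_by_definition, var_by_claim)       # 7/2 35/12 35/12 -- the two formulas agree
```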
If X is in units of inches, then Var(X) is measured in units of inches squared (you can verify this by expanding the definitions of expectation and variance, using the fact that probabilities are unitless).
Therefore it is often useful to work with the square root of the variance, which is called the standard deviation.
Definition: The standard deviation of X, often written σ_X or just σ if X is clear from context, is simply \sqrt{Var(X)}.
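To see the units claim concretely: rescaling a variable from inches to feet divides the standard deviation by 12 but the variance by 12^2 = 144. The heights below are made-up numbers, purely for illustration.

```python
import statistics

# Hypothetical heights in inches (made-up numbers, purely for illustration).
heights_in = [60, 64, 66, 70, 72, 75]
heights_ft = [h / 12 for h in heights_in]     # the same data, measured in feet

# Treat the list as the whole (equally likely) sample space: population variance / stdev.
var_in, sd_in = statistics.pvariance(heights_in), statistics.pstdev(heights_in)
var_ft, sd_ft = statistics.pvariance(heights_ft), statistics.pstdev(heights_ft)

print(var_in / var_ft)   # ~144: variance carries squared units (inches^2 vs feet^2)
print(sd_in / sd_ft)     # ~12:  the standard deviation carries the original units
```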
Markov's and Chebychev's inequalities are useful because they apply to any random variable (almost any: Markov's requires that the random variable only output non-negative numbers). They let you put bounds on the probabilities that certain surprising things occur. Both have the form
Pr(\text{surprising thing}) \leq \text{bound}
For Markov's, the surprising thing is that the variable gives a large answer:
Pr(X \geq a) \leq \text{bound}
while for Chebychev's, the surprising thing is that the variable gives an answer far from the expected value:
Pr(|X - E(X)| \geq a) \leq \text{bound}
Here is how I remember/understand the bounds. For Markov, the bound depends on X and a. If X returns very large values on average (i.e. if E(X) is large), then it is likely that X is large, while if E(X) is very small, then it is quite unlikely that X is large. Bigger E(X) leads to bigger probability, so E(X) is in the numerator.
a is our definition of "large". It is more likely that I am taller than 3' than it is that I am taller than 10'. Pr(X \geq 3) \geq Pr(X \geq 10). Increasing a decreases the probability, so a goes in the denominator. This gives:
Claim (Markov's inequality): If X \geq 0 and a > 0, then Pr(X \geq a) \leq \frac{E(X)}{a}
Proof is below.
Note that X \geq 0 is shorthand for ∀k \in S, X(k) \geq 0.
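To get a feel for the bound, here is a small simulation; the exponential distribution (with mean 2) is just an illustrative choice of nonnegative random variable.

```python
import random

random.seed(0)
n = 100_000
# A nonnegative random variable; an exponential with mean 2 is an illustrative choice.
samples = [random.expovariate(1 / 2) for _ in range(n)]
mean = sum(samples) / n

for a in (2, 5, 10, 20):
    empirical = sum(x >= a for x in samples) / n   # estimate of Pr(X >= a)
    bound = mean / a                               # Markov's bound: E(X)/a
    print(f"a = {a:2d}   Pr(X >= a) ~ {empirical:.4f}   E(X)/a = {bound:.4f}")
```

Note how loose the bound is: for a \leq E(X) it says nothing at all (the bound is at least 1), and even for larger a it leaves plenty of slack. That is the price of assuming nothing about X beyond its mean.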
For Chebychev, the bound also depends on X and a, but now it depends on how spread out the distribution of X is. If X is very spread out (has a large variance), then I am likely to sample a point far away from the expected value. If the values of X are concentrated, then I would be surprised to sample a value far from the mean (the probability would be low). Higher variance leads to higher probability, so Var(X) is in the numerator.
As with Markov's, increasing my notion of "large" decreases the probability that I will cross the threshold. This tells me a is in the denominator. As discussed above, Var is in the units of X squared, while (since we are comparing a to X) a is in the units of X. That reminds me that a should be squared in the denominator (since probabilities are unitless). This leads to:
Claim (Chebychev's inequality): For any X, Pr\left(|X - E(X)| \geq a\right) \leq \frac{Var(X)}{a^2}
Proof in the next lecture.
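The same kind of simulation works for Chebychev's inequality; the normal distribution below (mean 10, standard deviation 3) is again just an illustrative choice.

```python
import random

random.seed(0)
n = 100_000
mu, sigma = 10, 3                 # illustrative choice of distribution
samples = [random.gauss(mu, sigma) for _ in range(n)]

mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n

for a in (3, 6, 9):
    empirical = sum(abs(x - mean) >= a for x in samples) / n  # estimate of Pr(|X - E(X)| >= a)
    bound = var / a ** 2                                      # Chebychev's bound: Var(X)/a^2
    print(f"a = {a}   Pr(|X - E(X)| >= a) ~ {empirical:.4f}   Var(X)/a^2 = {bound:.4f}")
```

The bound holds with room to spare here as well, but knowing the variance already buys a tighter statement than Markov's mean-only bound.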
Suppose we want to build a door such that 90% of the population can walk through it without ducking. How tall should we build the door? Suppose all we know is that the average height is 5.5 feet.
We want to find a height a such that Pr(X \lt a) \geq 9/10. This is the same as requiring that Pr(X \geq a) \leq 1/10.
If we knew that E(X)/a \leq 1/10, then Markov's inequality would tell us that Pr(X \geq a) \leq 1/10. Solving for a, we see that if a \geq (5.5)(10) = 55, then E(X)/a \leq 1/10, so Pr(X \geq a) \leq E(X)/a \leq 1/10.
Therefore, if we build the door 55 feet tall, we are guaranteed that at least 90% of the people can pass through it without ducking.
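The arithmetic is easy to check directly, and a quick comparison against a hypothetical height distribution (mean 5.5 feet, standard deviation 0.3 feet, numbers assumed purely for illustration) shows just how conservative a 55-foot door is:

```python
import random

mean_height = 5.5            # feet: the only fact Markov's inequality uses
target = 1 / 10              # we want Pr(X >= a) <= 1/10
a = mean_height / target     # smallest a with E(X)/a <= 1/10
print(a)                     # 55.0 feet

# Compare against a hypothetical height distribution with the same mean (assumed, not given).
random.seed(0)
heights = [random.gauss(5.5, 0.3) for _ in range(100_000)]
print(sum(h >= a for h in heights) / len(heights))   # 0.0 -- essentially nobody is 55 feet tall
```

Of course, 55 feet is absurdly conservative; the point is that Markov's inequality guarantees it using only the mean.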
Exercise: suppose you also knew that everyone was taller than 4'. Use Markov's to build a smaller door.
Next lecture, we will use Chebychev's to get a much tighter bound.
Claim (Markov's inequality): If X \geq 0, and a > 0, then Pr(X \geq a) \leq E(X)/a.
Proof: We start by expanding E(X):
\begin{aligned} E(X) &= \sum_{x \in \mathbb{R}} x Pr(X = x) && \text{by definition} \\ &= \sum_{x \lt a} x Pr(X = x) + \sum_{x \geq a} x Pr(X = x) && \text{rearranging terms} \\ &\geq \sum_{x \geq a} xPr(X = x) && \text{the first sum is nonnegative since $X \geq 0$; dropping it cannot increase the total} \\ &\geq \sum_{x \geq a} aPr(X = x) && \text{since $x \geq a$ for all terms in the sum} \\ &= a\sum_{x \geq a} Pr(X = x) && \text{algebra} \\ &= aPr(X \geq a) && \text{by the third axiom, since the event $(X \geq a)$ is $\bigcup_{x \geq a} (X = x)$} \\ \end{aligned}
Dividing both sides by a gives the result.