In this section, we provide a brief introduction to some basic probability theory; our goal is to provide the minimal preliminaries needed to understand the material in the book. Part of the text is taken almost verbatim from [PT10].
Originally motivated by gambling, the study of probability is now fundamental to a wide variety of subjects, including social behavior (e.g., economics and game theory) and physical laws (e.g., quantum mechanics and radioactive decay). But what is probability? What does it mean that a fair coin toss comes up heads with probability 50%? One interpretation is frequentist: “50%” means that if we toss the coin 10 million times, it will come up heads in roughly 5 million tosses. A different interpretation is Bayesian (or subjective): “50%” is a statement about our beliefs, and how much we are willing to bet on one coin toss. For the purpose of this book, the second interpretation will be more relevant to us.
A.1 Probability Spaces
In our treatment, we restrict our attention to discrete probability spaces:
Definition A.1 (Probability Space). A probability space is a pair $(S, f)$ where $S$ is a countable set called the sample space, and $f : S \to [0, 1]$ is called the probability mass function. Additionally, $f$ satisfies the property $\sum_{x \in S} f(x) = 1$.
Intuitively, the sample space $S$ corresponds to the set of possible states that the world could be in, and the probability mass function $f$ assigns a probability from 0 to 1 to each of these states. To model our conventional notion of probability, we require that the total probability assigned by $f$ to all possible states should sum up to 1.
Definition A.2 (Event). Given a probability space $(S, f)$, an event is simply a subset $E$ of $S$. The probability of an event $E$, denoted by $\Pr_{(S,f)}[E]$, is defined to be $\sum_{x \in E} f(x)$. In particular, the event that includes “everything,” $E = S$, has probability $\Pr_{(S,f)}[S] = 1$.
Even though events and probabilities are not well-defined without a probability space, by convention, we often omit $S$ and $f$ in our statements (writing simply $\Pr[E]$) when they are clear from context.
Example A.1. Consider rolling a regular 6-sided die. The sample space is $S = \{1, 2, 3, 4, 5, 6\}$, and the probability mass function is constant: $f(x) = 1/6$ for all $x \in S$. We refer to such a probability mass function (i.e., one that is constant) as being equiprobable. The event of an even roll is $E = \{2, 4, 6\}$, and this occurs with probability
$$\Pr[E] = f(2) + f(4) + f(6) = \frac{1}{2}.$$
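These definitions translate directly into code. The following is a minimal sketch in Python (the names `S`, `f`, and `pr` are our own, purely for illustration) encoding the die-roll space of Example A.1 and computing event probabilities as in Definition A.2.

```python
from fractions import Fraction

# Sample space and equiprobable mass function for a 6-sided die
S = {1, 2, 3, 4, 5, 6}
f = {x: Fraction(1, 6) for x in S}
assert sum(f.values()) == 1          # Definition A.1: total probability is 1

def pr(event):
    """Probability of an event (a subset of S), per Definition A.2."""
    return sum(f[x] for x in event)

E = {2, 4, 6}                        # the event of an even roll
print(pr(E))                         # 1/2
```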
Some basic properties Now that probability spaces are defined, we give a few basic properties of probability:
Claim A.1. If $E_1$ and $E_2$ are disjoint events ($E_1 \cap E_2 = \emptyset$), then $\Pr[E_1 \cup E_2] = \Pr[E_1] + \Pr[E_2]$.
Proof. By definition,
$$\Pr[E_1 \cup E_2] = \sum_{x \in E_1 \cup E_2} f(x) = \sum_{x \in E_1} f(x) + \sum_{x \in E_2} f(x) = \Pr[E_1] + \Pr[E_2],$$
where the second equality uses the fact that $E_1$ and $E_2$ are disjoint (so no term is counted twice), which concludes the proof.
■
Corollary A.1. For any event $E$, $\Pr[\bar{E}] = 1 - \Pr[E]$, where $\bar{E} = S \setminus E$ denotes the complement of $E$.
When events are not disjoint, we instead have the following:
Claim A.2. Given events $E_1$ and $E_2$, $\Pr[E_1 \cup E_2] = \Pr[E_1] + \Pr[E_2] - \Pr[E_1 \cap E_2]$.
Proof. First observe that $E_1 \cup E_2 = (E_1 \setminus E_2) \cup (E_2 \setminus E_1) \cup (E_1 \cap E_2)$ and that all the terms on the RHS are disjoint. Therefore,
$$\Pr[E_1 \cup E_2] = \Pr[E_1 \setminus E_2] + \Pr[E_2 \setminus E_1] + \Pr[E_1 \cap E_2]. \tag{A.1}$$
Similarly, we have
$$\Pr[E_1] = \Pr[E_1 \setminus E_2] + \Pr[E_1 \cap E_2] \tag{A.2}$$
and
$$\Pr[E_2] = \Pr[E_2 \setminus E_1] + \Pr[E_1 \cap E_2], \tag{A.3}$$
because, say, $E_1$ is the disjoint union of $E_1 \setminus E_2$ and $E_1 \cap E_2$. Substituting (A.2) and (A.3) into (A.1) gives
$$\Pr[E_1 \cup E_2] = (\Pr[E_1] - \Pr[E_1 \cap E_2]) + (\Pr[E_2] - \Pr[E_1 \cap E_2]) + \Pr[E_1 \cap E_2] = \Pr[E_1] + \Pr[E_2] - \Pr[E_1 \cap E_2],$$
which concludes the proof.
■
A useful corollary of Claim A.2 is the Union Bound.
Corollary A.2 (Union Bound). Given events $E_1$ and $E_2$, $\Pr[E_1 \cup E_2] \le \Pr[E_1] + \Pr[E_2]$. In general, given events $E_1, \dots, E_n$,
$$\Pr\left[\bigcup_{i=1}^{n} E_i\right] \le \sum_{i=1}^{n} \Pr[E_i].$$
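As a quick numeric sanity check, the following Python sketch (the events are our own illustration on the die-roll space) verifies that inclusion-exclusion holds with equality while the union bound may be loose.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}                        # equiprobable die-roll space

def pr(event):
    return Fraction(len(event), len(S))

E1 = {1, 2, 3}                                # roll at most 3
E2 = {2, 4, 6}                                # even roll

assert pr(E1 | E2) == pr(E1) + pr(E2) - pr(E1 & E2)   # Claim A.2
assert pr(E1 | E2) <= pr(E1) + pr(E2)                 # Corollary A.2
print(pr(E1 | E2))                            # 5/6, strictly below 1/2 + 1/2
```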
A.2 Conditional Probability
Suppose that after receiving a random 5-card hand dealt from a standard 52-card deck, we are told that the hand contains “at least a pair” (that is, at least two of the cards have the same rank). How do we calculate the probability of a full house given this extra information? Consider the following thought process: the extra information rules out every hand that does not contain at least a pair, so we restrict the sample space to the hands that do; the remaining hands keep their relative likelihoods, so we scale their probabilities up by a common factor so that they again sum to 1. The probability of a full house given the information is then the probability of drawing a full house divided by the probability of drawing at least a pair (note that every full house contains a pair).
Motivated by this line of reasoning, we define conditional probability in the following way:
Definition A.3. Let $A$ and $B$ be events with $\Pr[B] > 0$. The conditional probability of $A$, conditioned on $B$, denoted by $\Pr[A \mid B]$, is defined as
$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]}.$$
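Conditional probabilities on a small space can be computed by brute-force enumeration. Below is a sketch in Python (the events are our own made-up illustration on two dice, not from the text), conditioning the event that the dice sum to 8 on the first die being even.

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))      # two fair dice, equiprobable

def pr(event):
    return Fraction(sum(1 for x in S if event(x)), len(S))

A = lambda x: x[0] + x[1] == 8                # the dice sum to 8
B = lambda x: x[0] % 2 == 0                   # the first die is even

# Definition A.3: Pr[A | B] = Pr[A ∩ B] / Pr[B]
print(pr(lambda x: A(x) and B(x)) / pr(B))    # 1/6, versus Pr[A] = 5/36
```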
Independence By defining conditional probability, we modeled how the occurrence of one event can affect the probability of another event. An equally important concept is independence, where the events in a set do not affect one another.
Definition A.4 (Independence). A sequence of events $E_1, E_2, \dots, E_n$ are (mutually) independent if and only if for every subset of these events, $\{E_{i_1}, E_{i_2}, \dots, E_{i_k}\}$,
$$\Pr[E_{i_1} \cap E_{i_2} \cap \cdots \cap E_{i_k}] = \prod_{j=1}^{k} \Pr[E_{i_j}].$$
If there are just two events, $A$ and $B$, then they are independent if and only if $\Pr[A \cap B] = \Pr[A] \Pr[B]$. The following claim gives justification to the definition of independence.
Claim A.3. If $A$ and $B$ are independent events and $\Pr[B] > 0$, then $\Pr[A \mid B] = \Pr[A]$.
Proof.
$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]} = \frac{\Pr[A] \Pr[B]}{\Pr[B]} = \Pr[A],$$
which concludes the proof.
■
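The definition is again easy to check by enumeration. In the Python sketch below (our own illustration), the parities of two fair dice are independent, whereas the sum of the dice depends on the first die.

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))      # two fair dice, equiprobable

def pr(event):
    return Fraction(sum(1 for x in S if event(x)), len(S))

E1 = lambda x: x[0] % 2 == 0                  # first die even
E2 = lambda x: x[1] % 2 == 0                  # second die even

assert pr(lambda x: E1(x) and E2(x)) == pr(E1) * pr(E2)    # independent
assert pr(lambda x: E1(x) and E2(x)) / pr(E2) == pr(E1)    # Claim A.3

E3 = lambda x: x[0] + x[1] == 8               # depends on the first die
assert pr(lambda x: E1(x) and E3(x)) != pr(E1) * pr(E3)    # not independent
```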
A.3 Bayes’ Rule
Suppose that we have a test against a rare disease that affects only 0.3% of the population, and that the test is 99% effective (i.e., if a person has the disease the test says YES with probability 0.99, and otherwise it says NO with probability 0.99). If a random person in the populace tested positive, what is the probability that he has the disease? The answer is not 0.99. Indeed, this is an exercise in conditional probability: what are the chances that a random person has the rare disease, given the occurrence of the event that he tested positive?
We start with some preliminaries.
Claim A.4. Let $E_1, E_2, \dots, E_n$ be disjoint events with non-zero probability such that $\bigcup_{i=1}^{n} E_i = S$ (i.e., the events are exhaustive; the events partition the sample space $S$). Let $A$ be an event. Then $\Pr[A] = \sum_{i=1}^{n} \Pr[A \mid E_i] \Pr[E_i]$.
Proof. By definition $\Pr[A \mid E_i] = \Pr[A \cap E_i] / \Pr[E_i]$, and so the RHS equals
$$\sum_{i=1}^{n} \Pr[A \mid E_i] \Pr[E_i] = \sum_{i=1}^{n} \Pr[A \cap E_i].$$
Since $E_1, \dots, E_n$ are disjoint, it follows that the events $A \cap E_1, \dots, A \cap E_n$ are also disjoint. Therefore,
$$\sum_{i=1}^{n} \Pr[A \cap E_i] = \Pr\left[\bigcup_{i=1}^{n} (A \cap E_i)\right] = \Pr\left[A \cap \bigcup_{i=1}^{n} E_i\right] = \Pr[A \cap S] = \Pr[A],$$
which concludes the proof.
■
Theorem A.1 (Bayes’ Rule). Let $A$ and $B$ be events with non-zero probability. Then:
$$\Pr[A \mid B] = \frac{\Pr[B \mid A] \Pr[A]}{\Pr[B]}.$$
Proof. Multiply both sides by $\Pr[B]$. Now, by the definition of conditional probability, both sides equal
$$\Pr[A \mid B] \Pr[B] = \Pr[A \cap B] = \Pr[B \mid A] \Pr[A],$$
which concludes the proof.
■
Sometimes, it is useful to expand the statement of Bayes’ Rule with Claim A.4:
Corollary A.3 (Bayes’ Rule Expanded). Let $A$ and $B$ be events with non-zero probability such that $0 < \Pr[A] < 1$. Then:
$$\Pr[A \mid B] = \frac{\Pr[B \mid A] \Pr[A]}{\Pr[B \mid A] \Pr[A] + \Pr[B \mid \bar{A}] \Pr[\bar{A}]}.$$
Proof. The corollary follows directly by applying Claim A.4 to the denominator, using the facts that (a) $A$ and $\bar{A}$ are disjoint, and (b) $A \cup \bar{A} = S$.
■
We return to our original question of testing for rare diseases. Let us consider the sample space $S = \{(t, d)\}$, where $t \in \{\text{YES}, \text{NO}\}$ represents the outcome of the test on a random person in the populace, and $d \in \{\text{sick}, \text{healthy}\}$ represents whether the same person carries the disease or not. Let $D$ be the event that a randomly drawn person has the disease ($d = \text{sick}$), and $T$ be the event that a randomly drawn person tests positive ($t = \text{YES}$).
We know that $\Pr[D] = 0.003$ (because 0.3% of the population has the disease). We also know that $\Pr[T \mid D] = 0.99$ and $\Pr[T \mid \bar{D}] = 0.01$ (because the test is 99% effective). Using Bayes’ rule, we can now calculate the probability that a random person, who tested positive, actually has the disease:
$$\Pr[D \mid T] = \frac{\Pr[T \mid D] \Pr[D]}{\Pr[T \mid D] \Pr[D] + \Pr[T \mid \bar{D}] \Pr[\bar{D}]} = \frac{0.99 \cdot 0.003}{0.99 \cdot 0.003 + 0.01 \cdot 0.997} \approx 0.23.$$
Notice that 23%, while significant, is a far cry from 99% (the effectiveness of the test). This final probability can vary if we have a different prior (initial belief). For example, if a random patient has other medical conditions that raise the probability of contracting the disease to 10%, then the final probability of having the disease, given a positive test, rises to 92%.
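The calculation is easy to reproduce; here is a small Python sketch (the function name and parameters are ours) implementing Corollary A.3 for this test.

```python
def posterior(prior, sensitivity=0.99, specificity=0.99):
    """Pr[disease | positive test] via Bayes' rule (Corollary A.3)."""
    true_pos = sensitivity * prior                  # Pr[T | D] * Pr[D]
    false_pos = (1 - specificity) * (1 - prior)     # Pr[T | not D] * Pr[not D]
    return true_pos / (true_pos + false_pos)

print(posterior(0.003))   # ~0.23: a positive test is far from conclusive
print(posterior(0.10))    # ~0.92: a stronger prior changes the picture
```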
Updating beliefs after multiple signals Our treatment so far discusses how to update our beliefs after receiving one signal (the outcome of the test). How should we update if we receive multiple signals? That is, how do we compute $\Pr[A \mid B_1 \cap B_2]$? To answer this question, we first need to define a notion of conditional independence.
Definition A.5 (Conditional Independence). A sequence of events $B_1, B_2, \dots, B_n$ are conditionally independent given event $A$ if and only if for every subset of the sequence of events, $\{B_{i_1}, \dots, B_{i_k}\}$,
$$\Pr\left[\bigcap_{j=1}^{k} B_{i_j} \;\middle|\; A\right] = \prod_{j=1}^{k} \Pr[B_{i_j} \mid A].$$
In other words, given that the event $A$ has occurred, the events $B_1, \dots, B_n$ are independent.
When there are only two events, $B_1$ and $B_2$, they are conditionally independent given event $A$ if and only if $\Pr[B_1 \cap B_2 \mid A] = \Pr[B_1 \mid A] \Pr[B_2 \mid A]$.
If the signals we receive are conditionally independent, we can still use Bayes’ rule to update our beliefs. More precisely, if we assume that the signals $B_1$ and $B_2$ are independent when conditioned on $A$, and also independent when conditioned on $\bar{A}$, then:
$$\Pr[A \mid B_1 \cap B_2] = \frac{\Pr[B_1 \mid A] \Pr[B_2 \mid A] \Pr[A]}{\Pr[B_1 \mid A] \Pr[B_2 \mid A] \Pr[A] + \Pr[B_1 \mid \bar{A}] \Pr[B_2 \mid \bar{A}] \Pr[\bar{A}]}.$$
In general, given signals $B_1, \dots, B_n$ that are conditionally independent given $A$ and conditionally independent given $\bar{A}$, we have
$$\Pr\left[A \;\middle|\; \bigcap_{i=1}^{n} B_i\right] = \frac{\Pr[A] \prod_{i=1}^{n} \Pr[B_i \mid A]}{\Pr[A] \prod_{i=1}^{n} \Pr[B_i \mid A] + \Pr[\bar{A}] \prod_{i=1}^{n} \Pr[B_i \mid \bar{A}]}.$$
Spam detection Using “training data” (e-mails classified as spam or not by hand), we can estimate the probability that a message contains a certain string conditioned on being spam (or not), for example $\Pr[\text{“viagra”} \mid \text{spam}]$ and $\Pr[\text{“viagra”} \mid \text{not spam}]$. We can also estimate the probability that a random e-mail is spam; that is, $\Pr[\text{spam}]$ (this is about 80% in real life, although most spam detectors are “unbiased” and assume $\Pr[\text{spam}] = 50\%$ to make calculations nicer).
By choosing a diverse set of keywords, say $W_1, \dots, W_n$, and assuming that the occurrences of these keywords are conditionally independent given a spam message or given a non-spam e-mail, we can use Bayes’ rule to estimate the probability that an e-mail is spam based on the words it contains (we have simplified the expression assuming $\Pr[\text{spam}] = \Pr[\text{not spam}] = 1/2$):
$$\Pr\left[\text{spam} \;\middle|\; \bigcap_{i=1}^{n} W_i\right] = \frac{\prod_{i=1}^{n} \Pr[W_i \mid \text{spam}]}{\prod_{i=1}^{n} \Pr[W_i \mid \text{spam}] + \prod_{i=1}^{n} \Pr[W_i \mid \text{not spam}]}.$$
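This is the core of a “naive Bayes” spam filter. The following Python sketch uses made-up conditional probabilities (the numbers and names are our own, purely for illustration) to score a message with the formula above.

```python
from math import prod

# Hypothetical per-keyword probabilities estimated from training data
p_word_given_spam = {"viagra": 0.08, "meeting": 0.01}
p_word_given_ham  = {"viagra": 0.001, "meeting": 0.05}

def spam_score(words):
    """Pr[spam | words], assuming conditional independence of keywords
    and the 'unbiased' prior Pr[spam] = Pr[not spam] = 1/2."""
    s = prod(p_word_given_spam[w] for w in words)
    h = prod(p_word_given_ham[w] for w in words)
    return s / (s + h)

print(spam_score(["viagra"]))              # ~0.99
print(spam_score(["viagra", "meeting"]))   # ~0.94: "meeting" pulls it down
```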
A.4 Random Variables
We use events to express whether a particular class of outcomes has occurred or not. Sometimes we want to express more: for example, after 100 fair coin tosses, we want to study how many coin tosses were heads (instead of focusing on just one event, say, that there were exactly 50 heads). This takes us to the definition of random variables.
Definition A.6. A random variable $X$ on a probability space $(S, f)$ is a function from the sample space to the real numbers, $X : S \to \mathbb{R}$.
So, in the example of 100 coin tosses, given any outcome of the experiment $x \in S$, we would define $X(x)$ to be the number of heads that occurred in that outcome.
Definition A.7. Given a random variable $X$ on probability space $(S, f)$, we can consider a new probability space $(S_X, f_X)$ where the sample space is the range of $X$, $S_X = \{X(x) \mid x \in S\}$, and the probability mass function is extended from $f$: $f_X(y) = \Pr[\{x \in S \mid X(x) = y\}]$. We call $f_X$ the probability distribution or the probability density function of the random variable $X$.
Example A.2. Suppose we toss two 6-sided dice. The sample space would be pairs of outcomes, $S = \{(i, j) \mid 1 \le i, j \le 6\}$, and the probability mass function is equiprobable. Consider the random variables $X_1((i, j)) = i$, $X_2((i, j)) = j$, and $X((i, j)) = i + j$. These random variables denote the outcome of the first die, the outcome of the second die, and the sum of the two dice, respectively. The probability density function of $X$ would take values:
$$f_X(2) = \frac{1}{36},\quad f_X(3) = \frac{2}{36},\quad f_X(4) = \frac{3}{36},\quad \dots,\quad f_X(7) = \frac{6}{36},\quad \dots,\quad f_X(12) = \frac{1}{36}.$$
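The density $f_X$ in Example A.2 can be tabulated mechanically; here is a short Python sketch (our own illustration) that does so per Definition A.7.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Distribution of X = sum of two fair dice
counts = Counter(i + j for i, j in product(range(1, 7), repeat=2))
f_X = {y: Fraction(c, 36) for y, c in sorted(counts.items())}

for y, p in f_X.items():
    print(y, p)          # 2 1/36, 3 1/18, ..., 7 1/6, ..., 12 1/36
```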
Notation regarding random variables We can describe events by applying predicates to random variables (e.g., the event that $X$, the number of heads, is equal to 50). We often use a short-hand notation, in which we treat random variables as if they were real numbers: if $X$ is a random variable, we let, for example, “$X = 50$” denote the event $\{x \in S \mid X(x) = 50\}$. Using this notation, we may define the probability density function of a random variable $X$ as $f_X(y) = \Pr[X = y]$.
In a similar vein, we can define new random variables from existing random variables. In Example A.2, we can write $X = X_1 + X_2$ to mean that for any $x \in S$, $X(x) = X_1(x) + X_2(x)$ (again, the notation treats $X$, $X_1$, and $X_2$ as if they were real numbers).
Independent random variables The intuition behind independent random variables is just like that of events: the value of one random variable should not affect the value of another independent random variable.
Definition A.8. A sequence of random variables $X_1, X_2, \dots, X_n$ are (mutually) independent if for every subset $\{X_{i_1}, \dots, X_{i_k}\}$ and for any real numbers $a_1, a_2, \dots, a_k$, the events $X_{i_1} = a_1$, $X_{i_2} = a_2$, $\dots$, $X_{i_k} = a_k$ are (mutually) independent.
In the case of two random variables $X$ and $Y$, they are independent if and only if for all real values $a$ and $b$, $\Pr[X = a \wedge Y = b] = \Pr[X = a] \Pr[Y = b]$.
A common use of independence is to model the outcome of consecutive coin tosses: Consider a biased coin that comes up heads with probability $p$. Define $X = 1$ if the coin comes up heads and $X = 0$ if the coin comes up tails; then $X$ is called a Bernoulli random variable with probability $p$. Suppose now we toss this biased coin $n$ times, and let $X$ be the random variable that denotes the total number of occurrences of heads. We can view $X$ as a sum of independent random variables, $X = X_1 + X_2 + \cdots + X_n$, where $X_i$ is a Bernoulli random variable with probability $p$ that represents the outcome of the $i$th toss. We leave it as an exercise to show that the random variables $X_1, \dots, X_n$ are indeed independent.
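A quick simulation sketch in Python (our own, not from the text) makes the model concrete: each toss is an independent Bernoulli trial, and their sum counts the heads.

```python
import random

def binomial_sample(n, p):
    """X = X_1 + ... + X_n, a sum of n independent Bernoulli(p) tosses."""
    return sum(1 for _ in range(n) if random.random() < p)

random.seed(0)
samples = [binomial_sample(100, 0.3) for _ in range(10_000)]
print(sum(samples) / len(samples))   # empirical average, close to n*p = 30
```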
A.5 Expectation
Given a random variable defined on a probability space, what is its “average” value? Naturally, we need to weigh things according to the probability that the random variable takes on each value.
Definition A.9. Given a random variable $X$ defined over a probability space $(S, f)$, we define the expectation of $X$ to be
$$\mathbb{E}[X] = \sum_{x \in S} X(x) f(x).$$
An alternative but equivalent definition is
$$\mathbb{E}[X] = \sum_{y \in S_X} y \cdot \Pr[X = y].$$
These definitions are equivalent because:
$$\sum_{x \in S} X(x) f(x) = \sum_{y \in S_X} \sum_{\substack{x \in S \\ X(x) = y}} X(x) f(x) = \sum_{y \in S_X} y \sum_{\substack{x \in S \\ X(x) = y}} f(x) = \sum_{y \in S_X} y \cdot \Pr[X = y].$$
The following fact can be shown with a similar argument:
Claim A.5. Given a random variable $X$ and a function $g : \mathbb{R} \to \mathbb{R}$,
$$\mathbb{E}[g(X)] = \sum_{y \in S_X} g(y) \Pr[X = y].$$
Proof.
$$\mathbb{E}[g(X)] = \sum_{x \in S} g(X(x)) f(x) = \sum_{y \in S_X} \sum_{\substack{x \in S \\ X(x) = y}} g(X(x)) f(x) = \sum_{y \in S_X} g(y) \Pr[X = y],$$
which concludes the proof.
■
Example A.3. Suppose in a game, with probability 1/10 we are paid $10, and with probability 9/10 we are paid $2. What is our expected payment? The answer is
$$\frac{1}{10} \cdot 10 + \frac{9}{10} \cdot 2 = 2.8.$$
An application to decision theory In decision theory, we assign a real number, called the utility, to each outcome in the sample space of a probabilistic game. We then assume that rational players make decisions that maximize their expected utility. For example, should we be willing to pay $2 to participate in the game in Example A.3? If we assume that our utility is exactly the amount of money that we earn, then
with probability 1/10 we get paid $10 and get utility 8
with probability 9/10 we get paid $2 and get utility 0
This gives a positive expected utility of 0.8, so we should play the game!
This reasoning about utility does not always explain human behavior, though. Suppose there is a game that costs a thousand dollars to play. With one chance in a million, the reward is two billion dollars (!), but otherwise there is no reward. The expected utility is
$$\frac{1}{10^6} \cdot (2 \times 10^9 - 10^3) + \left(1 - \frac{1}{10^6}\right) \cdot (-10^3) = 10^3.$$
One expects to earn a thousand dollars from the game on average. Would you play it? Turns out many people are risk-averse and would turn down the game. After all, except with one chance in a million, you simply lose a thousand dollars. This example shows how expectation does not capture all the important features of a random variable, such as how likely the random variable is to end up close to its expectation (in this case, the utility is either $-10^3$ dollars or roughly $2 \times 10^9$ dollars, not close to the expectation of $10^3$ dollars at all).
In other instances, people are risk-seeking. Take yet another game that takes a dollar to play. This time, with one chance in a billion, the reward is a million dollars; otherwise there is no reward. The expected utility is
$$\frac{1}{10^9} \cdot (10^6 - 1) + \left(1 - \frac{1}{10^9}\right) \cdot (-1) \approx -0.999.$$
Essentially, to play the game is to throw a dollar away. Would you play the game? Turns out many people do; this is called a lottery!
How can we justify these behaviors in the expected utility framework? The point is that utility may not always be linear in the money received or spent. Non-linear utility functions may be used to reconcile observed behavior with expected utility theory (but doing so is outside the scope of this book).
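To illustrate how a non-linear utility can rationalize turning down the two-billion-dollar game, here is a sketch (entirely our own; the log utility and the $10,000 starting wealth are assumptions made for illustration) comparing the expected utility of playing versus not playing.

```python
from math import log

def gain_from_playing(wealth, cost, p_win, prize, u=log):
    """Expected utility of playing minus utility of not playing,
    under a concave (risk-averse) utility u of total wealth."""
    play = p_win * u(wealth - cost + prize) + (1 - p_win) * u(wealth - cost)
    return play - u(wealth)

# The two-billion-dollar game, for a player with $10,000 and log utility:
print(gain_from_playing(10_000, 1_000, 1e-6, 2_000_000_000))  # < 0: decline
```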
Linearity of expectation One nice property of expectation is that the expectation of the sum of random variables is the sum of the expectations. This can often simplify the calculation of expectations.
Theorem A.2. Let $X_1, X_2, \dots, X_n$ be random variables, and $a_1, a_2, \dots, a_n$ be real constants. Then
$$\mathbb{E}\left[\sum_{i=1}^{n} a_i X_i\right] = \sum_{i=1}^{n} a_i \mathbb{E}[X_i].$$
Proof.
$$\mathbb{E}\left[\sum_{i=1}^{n} a_i X_i\right] = \sum_{x \in S} f(x) \sum_{i=1}^{n} a_i X_i(x) = \sum_{i=1}^{n} a_i \sum_{x \in S} f(x) X_i(x) = \sum_{i=1}^{n} a_i \mathbb{E}[X_i],$$
which concludes the proof.
■
Example A.4. If we make $n$ tosses of a biased coin that ends up heads with probability $p$, what is the expected number of heads? Let $X_i = 1$ if the $i$th toss is heads, and $X_i = 0$ otherwise. Then $X_i$ is a Bernoulli random variable with probability $p$, and it has expectation
$$\mathbb{E}[X_i] = p \cdot 1 + (1 - p) \cdot 0 = p.$$
The expected number of heads is then
$$\mathbb{E}[X] = \mathbb{E}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathbb{E}[X_i] = np.$$
Thus if the coin were fair, we would expect $n/2$, half of the tosses, to be heads. (Note that linearity of expectation does not require the $X_i$ to be independent.)
Conditional expectations We may also define a notion of the expectation of a random variable $X$ conditioned on some event $E$ by simply conditioning the probability space on $E$:
Definition A.10. Given a random variable $X$ defined over a probability space $(S, f)$, and some event $E$ on $S$ with $\Pr[E] > 0$, we define the expectation of $X$ conditioned on $E$ to be
$$\mathbb{E}[X \mid E] = \sum_{y \in S_X} y \cdot \Pr[X = y \mid E].$$
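For instance, here is a small Python sketch (our own illustration) computing the conditional expectation of a die roll given that the roll is even.

```python
from fractions import Fraction

S = range(1, 7)                     # one fair die; X is the roll itself
f = {x: Fraction(1, 6) for x in S}

E = {2, 4, 6}                       # conditioning event: the roll is even
pE = sum(f[x] for x in E)

# Definition A.10 on the conditioned space: outcomes outside E get
# probability 0, and probabilities inside E are rescaled by 1/Pr[E].
cond_exp = sum(x * f[x] / pE for x in E)
print(cond_exp)                     # 4, versus the unconditional E[X] = 7/2
```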
We end this section by showing an analog of Claim A.4 for the case of expectations.
Claim A.6. Let $E_1, E_2, \dots, E_n$ be disjoint events with non-zero probability such that $\bigcup_{i=1}^{n} E_i = S$. Let $X$ be a random variable over $(S, f)$. Then
$$\mathbb{E}[X] = \sum_{i=1}^{n} \mathbb{E}[X \mid E_i] \Pr[E_i].$$
Proof. By definition $\mathbb{E}[X \mid E_i] = \sum_{y \in S_X} y \cdot \Pr[X = y \mid E_i]$, and so the right-hand side evaluates to
$$\sum_{i=1}^{n} \Pr[E_i] \sum_{y \in S_X} y \cdot \Pr[X = y \mid E_i] = \sum_{y \in S_X} y \sum_{i=1}^{n} \Pr[X = y \mid E_i] \Pr[E_i] = \sum_{y \in S_X} y \cdot \Pr[X = y] = \mathbb{E}[X],$$
where the second-to-last equality follows from Claim A.4. This equals the left-hand side.
■