Observe that we can view any indicator random variable as a flip of a biased coin. For instance, in our random graph example, we determine inclusion of each edge independently with probability $p$. But this is really the same as taking a biased coin with probability $p$ of landing on tails, flipping the coin, and deciding whether to include the edge on that basis.
What does it mean for random variables to be independent? It turns out not to be all that different from how we view independence of events.
We say the random variables $X$ and $Y$ are independent if for all $r,s \in \mathbb R$, $\Pr(X = r \wedge Y = s) = \Pr(X = r) \cdot \Pr(Y = s)$.
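As a quick sanity check of this definition, we can verify the product condition directly on a small sample space. The following Python sketch is ours for illustration (the helper name `prob` and the choice of two fair coin flips are assumptions, not from the text): it takes two independent fair flips and checks that the coordinate random variables satisfy $\Pr(X = r \wedge Y = s) = \Pr(X = r) \cdot \Pr(Y = s)$ for every pair of values.

```python
from itertools import product

# Sample space: two independent fair coin flips, each outcome has probability 1/4.
pr = {(a, b): 0.25 for a, b in product([0, 1], repeat=2)}
X = lambda w: w[0]  # value of the first flip
Y = lambda w: w[1]  # value of the second flip

def prob(event):
    """Probability of the event {w : event(w) is true}."""
    return sum(p for w, p in pr.items() if event(w))

# Check Pr(X = r and Y = s) = Pr(X = r) * Pr(Y = s) for all r, s.
for r, s in product([0, 1], repeat=2):
    joint = prob(lambda w: X(w) == r and Y(w) == s)
    assert abs(joint - prob(lambda w: X(w) == r) * prob(lambda w: Y(w) == s)) < 1e-12
```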
Then any time we consider an object built out of $n$ such coin flips, we really have a repetition of $n$ mutually independent events. And whenever we have $n$ coin flips, we can view such an object as a binary string of length $n$. We can then interpret our sample space and probability distribution accordingly.
Suppose we have a probability space $(\Omega,\Pr)$ and a single indicator random variable over this space, and we want to repeat this random variable over $n$ independent trials. We do exactly what we did before and view it as a string, but now we will make it explicit.
We define the probability space $(\Omega^n, \Pr_n)$, where $\Omega^n$ is the set of $n$-tuples $(a_1, a_2, \dots, a_n)$ with $a_i \in \Omega$ and $$\Pr_n((a_1, a_2, \dots, a_n)) = \Pr(a_1) \Pr(a_2) \cdots \Pr(a_n).$$ Here, each position $i$ corresponds to one of our trials and we can define the corresponding Bernoulli r.v. by $X_i((a_1, \dots, a_n)) = X(a_i)$. Since we're taking the product, none of the trials in our tuple depend on each other, so they're independent. We can use this to prove the following.
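For small $n$, the product construction can be computed explicitly. The Python sketch below (the function name `product_distribution` is ours, not from the text) enumerates all $n$-tuples over $\Omega$ and assigns each the product of its coordinate probabilities, exactly as in the definition of $\Pr_n$:

```python
from itertools import product

def product_distribution(pr, n):
    """Build Pr_n over n-tuples: Pr_n(a_1,...,a_n) = Pr(a_1) * ... * Pr(a_n)."""
    dist = {}
    for tup in product(pr.keys(), repeat=n):
        prob = 1.0
        for a in tup:
            prob *= pr[a]
        dist[tup] = prob
    return dist

# A two-outcome space with Pr(S) = 0.3, repeated over 3 independent trials.
pr = {"S": 0.3, "F": 0.7}
pr3 = product_distribution(pr, 3)
assert abs(sum(pr3.values()) - 1.0) < 1e-12  # probabilities over all 8 tuples sum to 1
```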
The probability of $k$ successes in $n$ independent Bernoulli trials with probability of success $p$ is $$\binom n k p^k (1-p)^{n-k}.$$
For each $n$-tuple $A = (a_1, \dots, a_n)$ of trials with Bernoulli random variable $X$, we can construct a string $w_A = b_1 b_2 \cdots b_n$ over the alphabet $\{S,F\}$ by $$b_i = \begin{cases} S & \text{if $X(a_i) = 1$,} \\ F & \text{otherwise.} \end{cases}$$ Then the number of $n$-tuples with $k$ successes is the same as the number of strings of length $n$ over $\{S,F\}$ with exactly $k$ $S$'s. Recall that there are exactly $\binom n k$ of these.
Then for each $n$-tuple with $k$ successes, we have $\Pr_n(A) = p^k (1-p)^{n-k}$. Putting this together, the probability of $k$ successes is \[\binom n k p^k (1-p)^{n-k}.\]
This is called the binomial distribution. It is appropriately named, since this is what you would get by substituting $x = p$ and $y = 1-p$ into the Binomial Theorem. We can restate our theorem from above as follows.
If $X$ is distributed according to the binomial distribution on $n$ trials with probability of success $p$, then \[\Pr(X = k) = \binom n k p^k (1-p)^{n-k}\] for $k \in \{0, \dots, n\}$, and $\Pr(X = k) = 0$ otherwise.
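The formula is straightforward to compute directly. Here is a short Python sketch (the function name `binomial_pmf` is ours for illustration); summing the formula over all $k$ should give 1, as the Binomial Theorem with $x = p$ and $y = 1-p$ guarantees:

```python
from math import comb

def binomial_pmf(n, k, p):
    """Pr(k successes in n independent Bernoulli trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The probabilities over k = 0, ..., n sum to (p + (1-p))^n = 1.
assert abs(sum(binomial_pmf(10, k, 0.3) for k in range(11)) - 1.0) < 1e-12
```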
We have moved from talking about the probability of events to thinking about the probability of a property of events, in the form of random variables. Now, if we know these probabilities, we can ask the question of what kinds of values we might expect more often.
Let $(\Omega,\Pr)$ be a probability space and let $X$ be a random variable. The expected value of $X$ is \[E(X) = \sum_{\omega \in \Omega} X(\omega) \Pr(\omega).\]
Roughly speaking, the expectation of a random variable is the (weighted) average value for the random variable.
Consider a fair 6-sided die and let $X$ be the random variable for the number that was rolled. We have \[E(X) = 1 \cdot \frac 1 6 + 2 \cdot \frac 1 6 + 3 \cdot \frac 1 6 + 4 \cdot \frac 1 6 + 5 \cdot \frac 1 6 + 6 \cdot \frac 1 6 = \frac{21}{6} = 3.5.\] Now, consider a biased 6-sided die with \[\Pr(\omega) = \begin{cases} \frac 3 4 & \text{if $\omega = 1$,} \\ \frac{1}{20} & \text{otherwise}. \end{cases}\] Let $Y$ be the random variable for the number that was rolled with our biased die. Then we get \[E(Y) = 1 \cdot \frac 3 4 + 2 \cdot \frac{1}{20} + 3 \cdot \frac{1}{20} + 4 \cdot \frac{1}{20} + 5 \cdot \frac{1}{20} + 6 \cdot \frac{1}{20} = \frac 3 4 + \frac{20}{20} = 1.75.\] This indicates to us that the biased die will give us a 1 more often and therefore the average value that we should expect from a roll is much closer to 1.
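Both computations can be checked with a few lines of Python. The helper `expectation` below (the name is ours) is a direct transcription of the definition $E(X) = \sum_{\omega} X(\omega) \Pr(\omega)$:

```python
def expectation(pr, X):
    """E(X) = sum over outcomes omega of X(omega) * Pr(omega)."""
    return sum(X(w) * pr[w] for w in pr)

fair = {i: 1/6 for i in range(1, 7)}
biased = {1: 3/4, **{i: 1/20 for i in range(2, 7)}}
roll = lambda w: w  # X is just the number rolled

assert abs(expectation(fair, roll) - 3.5) < 1e-9
assert abs(expectation(biased, roll) - 1.75) < 1e-9
```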
It's important to note that what this tells us is not that we should expect a fair die roll to be 3.5 or 1.75 when we roll it, because obviously that will never happen. Rather, what expectation tells us is that, over time, the more we roll the die, the closer the value of the average roll will be to 3.5 or 1.75.
Recall that the motivation for defining random variables was so we could get away from considering the probabilities of elementary events. The definition of expectation is not very helpful in this regard. Luckily, it is not too difficult to reformulate expectation in terms of the values that the random variable takes on.
Let $\Pr(X = r) = \Pr(\{\omega \mid X(\omega) = r\})$ and let $\{r_1, \dots, r_k\} = \operatorname{range}(X)$. Then \[E(X) = \sum_{i=1}^k r_i \cdot \Pr(X = r_i).\]
Recall that the range of a function $X$ is all of the possible values that $X(\omega)$ can take over every $\omega \in \Omega$. Then we have \begin{align*} \sum_{i=1}^k r_i \cdot \Pr(X = r_i) &= \sum_{i=1}^k r_i \cdot \Pr(\{\omega \in \Omega \mid X(\omega) = r_i\}) \\ &= \sum_{i=1}^k r_i \cdot \sum_{\omega \in \Omega, X(\omega) = r_i} \Pr(\omega) \\ &= \sum_{i=1}^k \sum_{\omega \in \Omega, X(\omega) = r_i} X(\omega) \Pr(\omega) \\ &= \sum_{\omega \in \Omega} X(\omega) \Pr(\omega) \\ &= E(X) \end{align*}
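The two formulations can also be compared numerically. In the Python sketch below (helper names are ours), `expectation_by_value` groups outcomes by the value $X$ takes, mirroring the grouped sum in the theorem, and agrees with the direct sum over outcomes:

```python
def expectation(pr, X):
    """Direct definition: sum over outcomes omega."""
    return sum(X(w) * pr[w] for w in pr)

def expectation_by_value(pr, X):
    """Grouped form: sum over values r in range(X) of r * Pr(X = r)."""
    pr_x = {}
    for w, p in pr.items():
        pr_x[X(w)] = pr_x.get(X(w), 0.0) + p  # accumulate Pr(X = r)
    return sum(r * q for r, q in pr_x.items())

# X = parity of a fair die roll groups six outcomes into just two values.
fair = {i: 1/6 for i in range(1, 7)}
parity = lambda w: w % 2
assert abs(expectation(fair, parity) - expectation_by_value(fair, parity)) < 1e-12
```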
This is a much more useful way of thinking about expectation. Consider the following example.
Let $X$ be a Bernoulli random variable with probability of success $p$. What is $E(X)$? Based on the original definition, we need to know about our sample space and probability distribution. But if we focus only on random variables, the problem becomes quite simple: \[E(X) = 1 \cdot \Pr(X = 1) + 0 \cdot \Pr(X = 0) = 1 \cdot p + 0 \cdot (1-p) = p.\] What this says is that we have two possible values, success and failure. If $p$, the probability of success, is higher, then the expectation naturally rises. Over a large number of trials, we would expect the average outcome to be close to $p$.
Sometimes, we will need to deal with more than one random variable. For instance, suppose we want to consider the expected value of a roll of three dice. We would like to be able to use the expected value for a single die roll, which we already worked out above. The following result gives us a way to do this.
Let $(\Omega, \Pr)$ be a probability space, $c_1, c_2, \dots, c_n$ be real numbers, and $X_1, X_2, \dots, X_n$ be random variables over $\Omega$. Then \[E\left( \sum_{i=1}^n c_i X_i \right) = \sum_{i=1}^n c_i E(X_i).\]
Notice that this result holds for all random variables defined over the same sample space, and that there is no requirement that the random variables be independent. That is, we can always add the expectations of two random variables and multiply an expectation by a constant factor. However, linearity says nothing about products: in general, $E(XY)$ need not equal $E(X) \cdot E(Y)$ unless $X$ and $Y$ are independent.
Suppose we have three of our biased dice from the previous example, each with $\Pr(1) = \frac 3 4$ and probability $\frac{1}{20}$ for the other outcomes. What is the expected value of a roll of these three dice? Without the linearity property, we would need to consider the expected value for every 3-tuple of rolls $(\omega_1, \omega_2, \omega_3)$. However, we can apply the previous theorem to get a more direct answer. Let $X_1, X_2, X_3$ be random variables for the roll of each die. Then we have \[E(X_1 + X_2 + X_3) = E(X_1) + E(X_2) + E(X_3) = 1.75 + 1.75 + 1.75 = 5.25.\]
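We can confirm that linearity gives the same answer as the brute-force computation over all $6^3$ tuples. The Python sketch below (variable names are ours) computes the expectation of the sum directly from the product distribution and compares it to $3 \cdot 1.75$:

```python
from itertools import product

biased = {1: 3/4, **{i: 1/20 for i in range(2, 7)}}

# Brute force: E(X_1 + X_2 + X_3) summed over every 3-tuple of rolls,
# weighting each tuple by the product of its coordinate probabilities.
total = sum((w1 + w2 + w3) * biased[w1] * biased[w2] * biased[w3]
            for w1, w2, w3 in product(biased, repeat=3))

assert abs(total - 5.25) < 1e-9  # matches 3 * 1.75 from linearity
```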
Let $X$ be a binomial random variable. We have \[E(X) = \sum_{k=0}^n k \cdot \Pr(X = k) = \sum_{k=0}^n k \binom n k p^k (1-p)^{n-k}.\] At this point, we have a fairly complicated formula. While it's possible to continue to work this out and simplify it using algebra, linearity of expectation gives us another, arguably more insightful, approach.
Recall that $X$ is a binomial random variable, so $X = k$ means we have $k$ successes on $n$ independent Bernoulli trials with probability of success $p$. Let $I_1, \dots, I_n$ be our Bernoulli trials with probability $p$. Then we can write \[X = I_1 + \cdots + I_n.\] So $X = k$ when exactly $k$ of the $I_j$'s is 1. Since each $I_j$ is a Bernoulli random variable, $E(I_j) = p$ by the earlier example, and we can apply linearity of expectation: \[E(X) = E(I_1 + \cdots + I_n) = E(I_1) + \cdots + E(I_n) = np.\]
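As a final numerical check, the complicated sum $\sum_{k=0}^n k \binom n k p^k (1-p)^{n-k}$ really does collapse to $np$. A short Python sketch (parameter values are arbitrary choices for illustration):

```python
from math import comb

n, p = 10, 0.3

# Evaluate the expectation sum term by term from the binomial pmf.
direct = sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

assert abs(direct - n * p) < 1e-9  # agrees with the linearity argument: E(X) = np
```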