CMSC 27100 — Lecture 22

Probability

Now that we have spent some time on learning how to enumerate outcomes, we can start to think about the likelihood of particular outcomes. This is where probability theory comes into play. In particular, we are interested in discrete probability, which deals with countably many outcomes and is attacked using many of the tools that we've been using in this class. The "non-discrete" branch of probability theory deals with uncountably many outcomes and is approached using methods from calculus and analysis. For instance, where we can get away with summation, you will see a lot of integration when working with probability distributions that are continuous.

At first glance, even though we're primarily concerned with discrete probability, it seems a bit strange that we'd try to stuff probability theory into this course which is already chock full of other things to cover. But probability and randomness play an important role in computer science and particularly in theoretical computer science. Probabilistic methods are an important part of analysis of algorithms and algorithm design. There are some problems for which we may not know of an efficient deterministic algorithm, but we can come up with a provably efficient randomized algorithm. To tackle these problems, we need a basic understanding of discrete probability. The nice thing about this is that we don't really need too much fancy probability theory for this purpose.

A probability space $(\Omega,\Pr)$ is a pair of a finite set $\Omega \neq \emptyset$ together with a function $\Pr : \Omega \to \mathbb R$ such that

$\forall \omega \in \Omega, \Pr(\omega) \gt 0$,
$\sum\limits_{\omega \in \Omega} \Pr(\omega) = 1$.

We say that $\Omega$ is the sample space and $\Pr$ is the probability distribution.

The above two conditions are known as the probability axioms and is due to Andrey Kolmogorov in the 1930s. Kolmogorov was a Soviet mathematician and was particularly influential in computer science by developing notions of algorithmic information theory in the 1960s. It might be a bit surprising to learn (as I did) that the formalization of basic probability concepts didn't come about until the 20th century, despite there being analysis of games of chance dating back to 17th century France.

We can define more set theoretic notions for probability.

An event is a subset $A \subseteq \Omega$ of the sample space.

We extend the probability distribution $\Pr : \mathcal P(\Omega) \to \mathbb R$ to events, and define the probability of an event $A$ by $\Pr(A) = \sum\limits_{\omega \in A} \Pr(\omega)$.
The atomic events are singleton sets $\{\omega\}$ with $\Pr(\{\omega\}) = \Pr(\omega)$.
The trivial events are events with probability 0 ($\emptyset$) or 1 ($\Omega$).

Because the sample space and events are just sets, we can often give "schematic" visual representations of them in the same way as sets. For example, the following is the sample space $\Omega$, which we treat in the same way as a universe, and events $A$ and $B$.

Intuitively, if we "throw a dart" at $\Omega$, then the probability that it lands in $A$ should be the $\Pr(A)$.

A sample space for flipping a coin would be whether it lands on heads and tails $\{H,T\}$. A sample space for the outcome of a die roll would be $\{1,2,3,4,5,6\}$. An event can be something simple like flipping a coin and it landing on heads, in which case the event would be the set $\{H\}$. It can also be more complicated, like a die roll landing on a square, which would be the set $\{1,4\}$, or perhaps the positive divisors of 6, which would be the set $\{1,2,3,6\}$.

For each sample space, we define a probability distribution, which tells us the probability of a particular event $\omega \in \Omega$. The simplest one is the uniform distribution.

The uniform distribution over a sample space $\Omega$ defines $\Pr(\omega) = \frac 1 {|\Omega|}$ for all $\omega \in \Omega$. Thus, $\Pr(A) = \frac{|A|}{|\Omega|}$.

Typically, this is the distribution we assume when we have an arbitrary finite set and say pick an element at random from our set—every item has an equal probability of being chosen:$\frac 1 n$ for $n$ elements, or $\frac{1}{|\Omega|}$.

In some sense, our picture of $\Omega$ and events $A$ and $B$ above is a bit inaccurate, because it depicts a continuous sample space. It might be better to make an explicitly discrete sample space. We can visualize this as a sample space that is broken up into discrete chunks.

Here, we have the same idea, except we have discretized the space into 12 blocks, each of equal size. So now, we have $\Pr(A) = \frac 6{12} = \frac 1 2$ and $\Pr(B) = \frac{5}{12}$.

To determine the probability of an event (i.e. solve a problem), the textbook outlines a helpful process.

Find the sample space $\Omega$: what is the set of outcomes that we're considering?
Define events of interest. Here, we may need to define more than one event, but the process is the same: what are the outcomes associated with our event?
Determine outcome probabilities. Here, we need to determine what the probability of our outcomes are. If our distribution is uniform, this is pretty easy. If not, there's a bit more work involved.
Compute event probabilities. This is putting everything together.

It's clear from above that the sample space for rolling one die is $\{1,2,3,4,5,6\}$, but what about two dice? Suppose we want to know the probability that our roll of two dice sums to 8.

First, we need to figure out what our sample space is. One possibility is to take $\Omega = \{2, 3, \dots, 12\}$, the sum of the two rolls. One problem is that the outcomes aren't uniform. This is not the end of the world, but let's try to stick with uniform probability for now. Assuming the rolls of the two dice are distinguishable, another possibility is the space \[\{(i,j) \mid i,j \in \{1,2,3,4,5,6\}\}.\] This is just $\{1,2,3,4,5,6\} \times \{1,2,3,4,5,6\}$. Another way to interpret this is that instead of rolling two distinguishable dice, we may think of it as two rolls of one die, one after the other.
Now we want to figure out the probability that our two rolls sum up to 8. This would be the set \[E_8 = \{(i,j) \in \{1,\dots,6\}^2 \mid i+j = 8\}.\] In fact, we can list these out explicitly: \[E_8 = \{(2,6), (3,5), (4,4), (5,3), (6,2)\}.\]
Now, we need to compute the probabilities of the outcomes. Since we assume uniform probability (that is, the dice are fair), this is just the number of elements in $\Omega$. We get this from the product rule: $\{1, \dots, 6\}$ has six elements so the size of $\{1,\dots, 6\}^2$ is $6 \times 6 = 36$. So each individual outcome $\omega \in \Omega$ has probability $\frac{1}{36}$.
Finally, we can compute the probability of our event. We have have five outcomes in $E_8$, so this gives us \[\Pr(E_8) = \frac{|E_8|}{|\{1,\dots,6\}^2|} = \frac{5}{36}.\]

What about coins? It turns out that flipping a bunch of coins is probably the most important sample space for theoretical comptuer science. Why? Let's try to do the same exercise as with dice. What if we want to represent a series of three coin flips? We can play the same game as with the dice: we have $\{H,T\} \times \{H,T\} \times \{H,T\}$ for our sample space. Now, if we were to generalize this to $n$ coin flips, we would have something like $\{H,T\}^n$. This space looks a lot more familiar if we rename $T = 0$ and $H = 1$, giving us the space $\{0,1\}^n$. That's right—the sample space of $n$ coin flips can be represented as the set of length $n$ binary strings!

Suppose we'd like to know for a series of 12 coin flips what the probability that we have exactly 8 heads is.

First, what's our sample space? If we interpret this as a binary string over $\{H,T\}$, then we get that the sample space is $\{H,T\}^{12}$.
Next, what's our event? Plainly stated, this is \[E_8 = \{x \in \{H,T\}^{12} \mid |x|_H = 8\}.\]
Now, what are the outcome probabilities? Assuming we have a fair coin, we have uniform probability—every one of the sequences of length 12 is equally likely, so this is $\frac{1}{2^{12}}$ for each string.
Finally what is our event probability? Since we're dealing with uniform probability, we just need the number of strings in our event. How many elements are in this set? Well, this is just the number of binary strings of length 12 with exactly 8 H's, which we know is $\binom{12}{8}$. So we have \[\Pr(E_8) = \frac{|E_8|}{|\{H,T\}^{12}|} = \frac{\binom{12}{8}}{2^{12}} = \frac{495}{4096}.\]

The view of a series of coin flips as binary strings is nice because it gives a bit of insight into how we might think of generating a random number. Typically, we think of a random number as something that we just "choose" "at random", but most people do not make choices that are truly uniformly at random. This suggests a better way of thinking about random numbers. First, figure out how large of a random number you want. Then flip a coin $n$ times, where $n$ is the size of your number. Then you have a binary string that you can interpret as a number!

This also suggests that creating a larger random number requires a "more" randomness and it gives us a way to think about randomness as a resource. If we need more coin flips, we are using more bits of randomness. We can then treat this quantity as a resource just like we do with time or space and ask how much randomness we might need for an algorithm and talk about the amount of randomness as a size of the input to the algorithm.

Recall that a standard deck of playing cards consists of four suits, $\spadesuit, \heartsuit, \clubsuit, \diamondsuit$ with cards numbered from 2 through 10, and a Jack, Queen, King, and Ace. This gives us 52 cards in total ($4 \times 13$). Suppose we want to know when drawing five cards what the probability that we can draw a flush is—that is, five cards of the same suit. (Technically a flush will exclude things like straight flushes, but we'll ignore that at this point)

First, what is our sample space? We want to consider all possible draws of five cards. We'll additionally note that we do not consider order to matter in these draws.
Next, our event will be drawing a flush—specifically, a draw of five cards containing the same suit.
Then what are our outcome probabilities? Since the order in which we arrange these cards doesn't matter, our sample space is the set of combinations of five cards drawn from a set of 52, all equally likely. There are $\binom{52}5$ such possible draws, so the probability of one particular hand is $\frac{1}{\binom{52}{2}}$.
Finally, we need to count how many flushes there are. Since each suit has 13 cards, we have $\binom{13}5$ possible ways to get a flush from a single suit. Since there are four suits, we have $\binom 4 1 \binom{13}5$ flushes in total. So the probability that we have a flush is \[\frac{|F|}{|\Omega|} = \frac{\binom 4 1 \binom{13}{5}}{\binom{52}5}.\]

Since we are dealing with sets, we can combine our sets in the same way as we've been doing to consider the probability of multiple events together. First, the following is a simple, but important bound.

$\Pr(A \cup B) \leq \Pr(A) + \Pr(B)$.

This follows directly from our understanding of sets. We have \begin{align*} \Pr(A \cup B) &= \frac{|A \cup B|}{|\Omega|} \\ &= \frac{|A| + |B| - |A \cap B|}{|\Omega|} \\ &= \frac{|A|}{|\Omega|} + \frac{|B|}{|\Omega|} - \frac{|A \cap B|}{|\Omega|} \\ &= \Pr(A) + \Pr(B) - \Pr(A \cap B) \\ &\leq \Pr(A) + \Pr(B) \end{align*}

Just like with sets, we can define the notion of disjointness for events.

Events $A$ and $B$ are disjoint if $A \cap B = \emptyset$.

$\Pr \left( \bigcup\limits_{i=1}^k A_i \right) \leq \sum_{i=1}^k \Pr(A_i)$. Furthermore, this is an equality if and only if the $A_i$ are pairwise disjoint.

This is just a generalization of the union bound and the proof comes straight from applying it. Alternatively, you can use induction to prove this. If you consider the events to be disjoint, this is really pretty much the sum rule. As you might have noticed, this result is due to George Boole, who is the same Boole that gave us Boolean algebra. Often times, this is also called the union bound.

There are a number of other useful properties that we can derive from set theory.

Let $A$ and $B$ be events. Then

$\Pr(\overline A) = 1 - \Pr(A)$.
$\Pr(B \setminus A) = \Pr(B) - \Pr(A\cap B)$.
$\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$.
If $A \subseteq B$, then $\Pr(A) \leq \Pr(B)$.

Conditional probability

Something that doesn't quite have a direct analogue from our discussions on set theory is the notion of conditional events. Arguably, this is what makes probability theory more than just fancy combinatorics.

If $A$ and $B$ are events, then the conditional probability of $A$ given $B$ is $$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}.$$ Note that by this definition, $\Pr(A \cap B) = \Pr(A \mid B) \cdot \Pr(B)$.

One way of looking at this is if we restrict our attention to the event $B$, treat it like a sample space, and ask, what is the probability of $A$ in this space? In other words, what we're saying here is that if we know that $B$ already happened, then what is the probability that $A$ will happen? If we assume that our outcomes are uniformly distributed, then we can think about this in terms of sets: $$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\frac{|A \cap B|}{|\Omega|}}{\frac{|B|}{|\Omega|}} = \frac{|A \cap B|}{|B|}.$$

If we look back at our diagram, we have the following.

We can view this "globally" with respect to $\Omega$ and compute the probabilities: \[\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\frac{2}{12}}{\frac{5}{12}} = \frac 2 5.\] Or, if we jump straight into our observation above about "recentering" our space on $B$, we get that $\Pr(A \mid B) = \frac{|A \cap B|}{|B|} = \frac 2 5$ all the same.