MPCS 50103 — Lecture 6

Probability

Now that we have spent some time on learning how to enumerate outcomes, we can start to think about the likelihood of particular outcomes. This is where probability theory comes into play. In particular, we are interested in discrete probability, which deals with countably many outcomes and is attacked using many of the tools that we've been using. The other branch of probability theory deals with uncountably many outcomes and is approached using methods from calculus and analysis. For instance, where we can get away with summation, you will see a lot of integration when working in continuous probability theory.

At first glance, even though we're primarily concerned with discrete probability, it seems a bit strange that we'd try to stuff probability theory into this course which is already chock full of other things to cover. But probability and randomness play an important role in computer science and particularly in theoretical computer science. Probabilistic methods are an important part of analysis of algorithms and algorithm design. There are some problems for which we may not know of an efficient deterministic algorithm, but we can come up with a provably efficient randomized algorithm. To tackle these problems, we need a basic understanding of discrete probability.

A probability space $(\Omega,\Pr)$ is a finite set $\Omega \neq \emptyset$ together with a function $\Pr : \Omega \to \mathbb R^+$ such that

  1. $\forall \omega \in \Omega, \Pr(\omega) \gt 0$,
  2. $\sum_{\omega \in \Omega} \Pr(\omega) = 1$.

We say that $\Omega$ is the sample space and $\Pr$ is the probability distribution.

The above two conditions are known as the probability axioms and are due to Andrey Kolmogorov in the 1930s. Kolmogorov was a Soviet mathematician who was particularly influential in computer science through his development of algorithmic information theory in the 1960s. It might be a bit surprising to learn (as I did) that the formalization of basic probability concepts didn't come about until the 20th century, despite there being analysis of games of chance dating back to 17th century France.

We can define more set theoretic notions for probability.

An event is a subset $A \subseteq \Omega$ and $\Pr : \mathcal P(\Omega) \to \mathbb R$ defined by $\Pr(A) = \sum_{\omega \in A} \Pr(\omega)$ is the probability of $A$. The atomic events are singleton sets $\{\omega\}$ with $\Pr(\{\omega\}) = \Pr(\omega)$ and the trivial events are events with probability 0 ($\emptyset$) or 1 ($\Omega$).

A sample space for flipping a coin would be whether it lands on heads or tails: $\{H,T\}$. A sample space for the outcome of a die roll would be $\{1,2,3,4,5,6\}$. An event can be something simple like flipping a coin and it landing on heads, in which case the event would be the set $\{H\}$. It can also be more complicated, like a die roll landing on a square, which would be the set $\{1,4\}$, or perhaps the positive divisors of 6, which would be the set $\{1,2,3,6\}$.

For each sample space, we define a probability distribution, which tells us the probability of a particular outcome $\omega \in \Omega$. The simplest one is the uniform distribution.

The uniform distribution over a sample space $\Omega$ defines $\Pr(\omega) = \frac 1 {|\Omega|}$ for all $\omega \in \Omega$. Thus, $\Pr(A) = \frac{|A|}{|\Omega|}$.

Typically, this is the distribution we assume when we have an arbitrary finite set and say pick an element at random from our set: every item has an equal probability of being chosen, $\frac 1 n$ for $n$ elements, or $\frac{1}{|\Omega|}$.
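We can sketch this in a few lines of Python; the helper name `uniform_pr` is ours, not from any library, and exact fractions stand in for the probabilities.

```python
from fractions import Fraction

def uniform_pr(event, omega):
    """Pr(A) = |A| / |Omega| under the uniform distribution."""
    return Fraction(len(event & omega), len(omega))

omega = {1, 2, 3, 4, 5, 6}  # sample space for one die roll
assert uniform_pr({1, 4}, omega) == Fraction(1, 3)        # roll lands on a square
assert uniform_pr({1, 2, 3, 6}, omega) == Fraction(2, 3)  # roll divides 6
```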

It's clear from above that the sample space for rolling one die is $\{1,2,3,4,5,6\}$, but what about two dice? Assuming they're distinguishable, we have the space \[\{(i,j) \mid i,j \in \{1,2,3,4,5,6\}\}.\] Of course, we can go a bit further than this: this is just $\{1,2,3,4,5,6\} \times \{1,2,3,4,5,6\}$.

What about coins? It turns out that flipping a bunch of coins is probably the most important sample space for theoretical computer science. Why? Let's try to do the same exercise as with dice. What if we want to represent a series of three coin flips? We can play the same game as with the dice: we have $\{H,T\} \times \{H,T\} \times \{H,T\}$ for our sample space. Now, if we were to generalize this to $n$ coin flips, we would have something like $\{H,T\}^n$. This space looks a lot more familiar if we rename $T = 0$ and $H = 1$, giving us the space $\{0,1\}^n$. That's right—the sample space of $n$ coin flips can be represented as the set of length $n$ binary strings!
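This correspondence is easy to see concretely: a minimal sketch using `itertools.product` to build $\{0,1\}^n$.

```python
from itertools import product

n = 3
omega = list(product([0, 1], repeat=n))  # the sample space {0,1}^n
assert len(omega) == 2 ** n              # one outcome per length-n binary string

# events are just subsets, e.g. "at least two heads" with H = 1
at_least_two_heads = [w for w in omega if sum(w) >= 2]
assert len(at_least_two_heads) == 4      # HHH, HHT, HTH, THH
```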

Recall that a standard deck of playing cards consists of four suits, $\spadesuit, \heartsuit, \clubsuit, \diamondsuit$ with cards numbered from 2 through 10, and a Jack, Queen, King, and Ace. This gives us 52 cards in total ($4 \times 13$). Suppose we deal a hand of 13 cards from a standard 52 card deck to four players: North, East, South, and West. What is the probability that North holds all the aces?

First, we need to think about what our sample space should be. Since we are concerned with hands, this will be the result of the deal: all of the possible ways to distribute 13 card hands to four people. Combinatorially, this is distributing 52 distinguishable objects into four distinguishable boxes with 13 objects each. Then the size of our sample space is $$|\Omega| = \binom{52}{13,13,13,13}.$$ Now, we want to count the number of such hands where North holds all the aces. In this case, North holds four cards that are fixed already, so we need to count how the other 48 cards are distributed. Let's call this event $N_{4A}$. This gives us $$|N_{4A}| = \binom{48}{9,13,13,13}.$$ Then the probability of our event is just \begin{align*} \Pr(N_{4A}) &= \frac{|N_{4A}|}{|\Omega|} \\ &= \frac{\binom{48}{9,13,13,13}}{\binom{52}{13,13,13,13}} \\ &= \frac{13 \cdot 12 \cdot 11 \cdot 10}{52 \cdot 51 \cdot 50 \cdot 49} \\ &\approx 0.00264 \end{align*}
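We can check this computation directly. The sketch below defines a `multinomial` helper (our own name, not a library function) and verifies the simplification:

```python
from fractions import Fraction
from math import factorial

def multinomial(n, parts):
    """The multinomial coefficient n! / (k_1! k_2! ... k_m!)."""
    assert sum(parts) == n
    result = factorial(n)
    for k in parts:
        result //= factorial(k)
    return result

size_omega = multinomial(52, [13, 13, 13, 13])     # all possible deals
north_all_aces = multinomial(48, [9, 13, 13, 13])  # North's four aces are fixed
p = Fraction(north_all_aces, size_omega)

assert p == Fraction(13 * 12 * 11 * 10, 52 * 51 * 50 * 49)
assert abs(float(p) - 0.00264) < 1e-4
```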

However, sometimes, a uniform distribution may not cut it. These are situations in which the outcomes are not all equally likely.

Suppose we wanted to discuss the probability of a roll of a single die. For a fair die, this would be quite simple: our sample space would be $\{1,2,3,4,5,6\}$ and each outcome would be equally likely, $\frac 1 6$. However, if we wanted to model a fixed or biased die, or one in which one outcome is more likely, then our probability distribution for such a die would no longer use the uniform distribution. We would have to define our own probability distribution, ensuring that $\Pr(\Omega) = 1$. One such possibility would be to define the following distribution: \[\Pr(\omega) = \begin{cases} \frac 3 4 & \text{if $\omega = 1$}, \\ \frac 1{20} & \text{if $\omega = 2,3,4,5,6$}. \end{cases}\]
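A minimal sketch checking that this biased distribution satisfies the probability axioms:

```python
from fractions import Fraction

# the biased die from above: 1 comes up with probability 3/4
biased = {1: Fraction(3, 4)}
for face in (2, 3, 4, 5, 6):
    biased[face] = Fraction(1, 20)

assert all(p > 0 for p in biased.values())  # axiom 1: every outcome is positive
assert sum(biased.values()) == 1            # axiom 2: total mass is 1
```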

Since we are dealing with sets, we can combine our sets in the same way as we've been doing to consider the probability of multiple events together. First, the following is a simple, but important bound.

$\Pr(A \cup B) \leq \Pr(A) + \Pr(B)$.

This follows from the inclusion-exclusion principle. By definition, \begin{align*} \Pr(A \cup B) &= \sum_{\omega \in A \cup B} \Pr(\omega) \\ &= \sum_{\omega \in A} \Pr(\omega) + \sum_{\omega \in B} \Pr(\omega) - \sum_{\omega \in A \cap B} \Pr(\omega) \\ &= \Pr(A) + \Pr(B) - \Pr(A \cap B) \\ &\leq \Pr(A) + \Pr(B) \end{align*} Note that this argument holds for any probability distribution, not just the uniform distribution.

 

Just like with sets, we can define the notion of disjointness for events.

Events $A$ and $B$ are disjoint if $A \cap B = \emptyset$.

$\Pr \left( \bigcup_{i=1}^k A_i \right) \leq \sum_{i=1}^k \Pr(A_i)$. Furthermore, this is an equality if and only if the $A_i$ are pairwise disjoint.

This is just a generalization of the union bound and the proof comes straight from applying it; alternatively, you can prove it by induction. If the events are pairwise disjoint, this is essentially the sum rule from counting. As you might have noticed, this result is due to George Boole, the same Boole that gave us Boolean algebra.

Conditional probability

Something that doesn't quite have a direct analogue from our discussions on set theory is the notion of conditional events. Arguably, this is what makes probability theory more than just fancy combinatorics.

If $A$ and $B$ are events with $\Pr(B) \gt 0$, then the conditional probability of $A$ given $B$ is $$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}.$$ Note that by this definition, $\Pr(A \cap B) = \Pr(A \mid B) \cdot \Pr(B)$.

One way of looking at this is if we restrict our attention to the event $B$, treat it like a sample space, and ask, what is the probability of $A$ in this space? For the uniform distribution, this is reflected in the following observation: $$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\frac{|A \cap B|}{|\Omega|}}{\frac{|B|}{|\Omega|}} = \frac{|A \cap B|}{|B|}.$$
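For uniform spaces, this last expression is easy to compute by counting. A quick sketch (the helper name `cond_pr` is our own):

```python
from fractions import Fraction

def cond_pr(a, b):
    """Pr(A | B) = |A ∩ B| / |B| under the uniform distribution."""
    return Fraction(len(a & b), len(b))

# one die roll: probability the roll is odd, given that it is at least 4
odd = {1, 3, 5}
at_least_4 = {4, 5, 6}
assert cond_pr(odd, at_least_4) == Fraction(1, 3)  # only 5 qualifies
```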

Consider a roll of three dice.

  1. What is the probability that the sum of the dice is 9?
  2. What is the probability that the first die shows 5?
  3. What is the probability that the first die shows 5, given that the sum of the dice is 9?

Here, if we want to talk about the "first" die, we need our dice to be distinguishable. This means that our sample space is an arrangement of the outcome of the three dice after a roll. Each of the three dice can take on a value from 1 through 6, giving us $6^3 = 216$ possible outcomes.

  1. What we want here is the number of solutions to $x_1 + x_2 + x_3 = 9$, with $1 \leq x_1, x_2, x_3 \leq 6$. This is similar to a problem we saw about the number of solutions to the equation $x+y+z=9$. However, in that problem, we are allowed to set $x,y,z = 0$, which we can't do with dice. To solve this, we substitute $x = x_1 - 1$, $y = x_2 - 1$, $z = x_3 - 1$ and rephrase the problem as finding solutions to the equation $x+y+z=6$ where $x,y,z$ take on values from 0 through 5. This gets us most of the way there: there are $\binom 8 2$ solutions without the upper bound. However, this includes a few solutions that we don't want: $(6,0,0), (0,6,0), (0,0,6)$. So we need to subtract these from our total. This gives us $\binom 8 2 - 3 = 25$ and then the probability is $\frac{25}{216} \approx 0.116$.
  2. Here, we fix one of the dice, leaving the other two free. This gives exactly $1 \cdot 6 \cdot 6$ outcomes, so the probability of this event is $\frac{6^2}{6^3} = \frac 1 6$.
  3. This is a conditional probability question. For convenience, let $A$ be the event "the sum of the dice is 9" and let $B$ be the event "the first die is 5". We already know that there are 25 instances where the dice sum up to 9. What's left is to compute the number of outcomes where the dice add up to 9 and the first die is 5. Fixing the first die to 5, we have that the other two dice must add up to 4, giving us the following outcomes: $(5,1,3),(5,2,2),(5,3,1)$. Then the probability is computed by the following: $$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)} = \frac{|A \cap B|}{|A|} = \frac{3}{25}.$$
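All three answers can be verified by brute-force enumeration over the 216 outcomes:

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=3))  # ordered outcomes of three dice
assert len(omega) == 216

A = [w for w in omega if sum(w) == 9]                # sum of the dice is 9
B = [w for w in omega if w[0] == 5]                  # first die shows 5
AB = [w for w in omega if sum(w) == 9 and w[0] == 5]

assert Fraction(len(A), len(omega)) == Fraction(25, 216)
assert Fraction(len(B), len(omega)) == Fraction(1, 6)
assert Fraction(len(AB), len(A)) == Fraction(3, 25)  # Pr(B | A)
```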

Suppose I know $\Pr(A \mid B)$ and I know $\Pr(A \mid \overline B)$. Putting them together with $\Pr(B)$ and $\Pr(\overline B)$ should give me $\Pr(A)$. The Law of Total Probability generalizes this idea.

A partition of $\Omega$ is a family of pairwise disjoint events $H_1, \dots, H_m$ covering $\Omega$. That is, $\bigcup_{i=1}^m H_i = \Omega$ and $H_i \cap H_j = \emptyset$ for all $i \neq j$.

Given a partition $H_1, \dots, H_m$ of $\Omega$ and an event $A$, $$\Pr(A) = \sum_{i=1}^m \Pr(A \mid H_i) \cdot \Pr(H_i).$$

Since we have a partition $H_1, \dots, H_m$, we have $A = A \cap \Omega = A \cap \bigcup_{i = 1}^m H_i$. Then, \begin{align*} \Pr(A) &= \Pr\left( A \cap \bigcup_{i=1}^m H_i \right) \\ &= \Pr \left( \bigcup_{i=1}^m (A \cap H_i) \right) \\ &= \sum_{i=1}^m \Pr(A \cap H_i) &\text{since $H_i$ are pairwise disjoint} \\ &= \sum_{i=1}^m \Pr(A \mid H_i) \Pr(H_i) \end{align*}

 

Suppose that two delivery services $S$ and $T$ hold a duopoly and perform all deliveries in the same city. If $S$ has an on-time rate of 94% and carries out 70% of deliveries, while $T$ has an on-time rate of 98% and carries out 30% of deliveries, what is the overall on-time rate for deliveries in the city?

First, we need to determine our events. Let $O$ be the event that a delivery is on time, and let $D_S$ and $D_T$ be the events that a delivery is carried out by $S$ and $T$, respectively. Since $S$ and $T$ perform all deliveries, $D_S$ and $D_T$ partition our sample space, with $\Pr(D_S) = 0.7$ and $\Pr(D_T) = 0.3$.

Based on our information above, we also know the on-time rates for each delivery company: $\Pr(O \mid D_S) = 0.94$ and $\Pr(O \mid D_T) = 0.98$.

Then the Law of Total Probability gives us the following: \begin{align*} \Pr(O) &= \Pr(O \mid D_S) \cdot \Pr(D_S) + \Pr(O \mid D_T) \cdot \Pr(D_T) \\ &= \frac{94}{100} \cdot \frac{7}{10} + \frac{98}{100} \cdot \frac{3}{10} \\ &= 95.2\% \end{align*}
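The same computation in exact arithmetic, as a sanity check:

```python
from fractions import Fraction

pr_ds = Fraction(7, 10)            # delivery carried out by S
pr_dt = Fraction(3, 10)            # delivery carried out by T
pr_o_given_ds = Fraction(94, 100)  # S's on-time rate
pr_o_given_dt = Fraction(98, 100)  # T's on-time rate

# Law of Total Probability over the partition {D_S, D_T}
pr_o = pr_o_given_ds * pr_ds + pr_o_given_dt * pr_dt
assert pr_o == Fraction(952, 1000)  # 95.2%
```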

A common scenario that we may run into is that we have calculated the probabilities of some events based on some data we have. However, we will often receive new information and we would like to incorporate that new information to compute the probability based on the information we already have together with the new information. This is where the result of Thomas Bayes, Presbyterian minister and statistician, comes into play.

If $A$ and $B$ are events with $\Pr(A), \Pr(B) \gt 0$, then \[\Pr(A \mid B) = \frac{\Pr(B\mid A) \cdot \Pr(A)}{\Pr(B)}.\]

By definition, we have \[\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}.\] Rearranging this equation, we get $\Pr(A \cap B) = \Pr(A \mid B) \Pr(B)$. We can do something similar with $\Pr(B \mid A)$ to get $\Pr(A \cap B) = \Pr(B \mid A) \Pr(A)$. Then this gives us the equality $\Pr(A \mid B) \Pr(B) = \Pr(B \mid A) \Pr(A)$. Then we can rearrange this equation to get \[\Pr(A \mid B) = \frac{\Pr(B \mid A) \Pr(A)}{\Pr(B)}.\]

 

Consider again our biased die with $\Pr(1) = \frac 3 4$ and $\Pr(\omega) = \frac1{20}$ for $\omega = 2,3,4,5,6$. Suppose you go to a dice thrift store to pick up some second-hand dice. Unfortunately, there is a tendency for about 5% of the dice available to be biased in the way we described above and the remaining 95% of the dice are regular old fair dice. Because of this, the store will let you test-roll some dice. If you pick a die at random and roll two 1s, what is the probability that you picked up a biased die?

What we essentially want to do is compute the probability that, given that we rolled two 1s, we picked up a biased die. If we let $B$ denote the event that we picked up a biased die and we denote by $R_{1,1}$ the event that we rolled two 1s, then we want to compute $\Pr(B \mid R_{1,1})$. Bayes' Theorem tells us that $$\Pr(B \mid R_{1,1}) = \frac{\Pr(R_{1,1} \mid B) \cdot \Pr(B)}{\Pr(R_{1,1})}.$$ We know $\Pr(B) = 5\% = \frac 1 {20}$ and we know that $\Pr(R_{1,1} \mid B) = \left( \frac 3 4 \right)^2 = \frac{9}{16}$. All we need to do is to calculate $\Pr(R_{1,1})$.

Recall that the Law of Total Probability tells us that $$\Pr(R_{1,1}) = \Pr(R_{1,1} \mid B) \cdot \Pr(B) + \Pr(R_{1,1} \mid \overline B) \cdot \Pr(\overline B)$$ where $\overline B$ is the non-biased die event, or in other words, the fair dice. This gives us \begin{align*} \Pr(R_{1,1}) &= \Pr(R_{1,1} \mid B) \cdot \Pr(B) + \Pr(R_{1,1} \mid \overline B) \cdot \Pr(\overline B) \\ &= \frac{9}{16} \cdot \frac 1{20} + \frac 1{36} \cdot \frac{19}{20} \\ &= \frac{157}{2880} \end{align*} Then plugging these into the formula for Bayes' Theorem gives us $$\Pr(B \mid R_{1,1}) = \frac{\frac{9}{320}}{\frac{157}{2880}} = \frac{81}{157} \approx 0.516.$$
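We can replay this entire calculation with exact fractions:

```python
from fractions import Fraction

pr_b = Fraction(1, 20)                 # picked up a biased die
pr_r_given_b = Fraction(3, 4) ** 2     # two 1s with the biased die
pr_r_given_fair = Fraction(1, 6) ** 2  # two 1s with a fair die

# Law of Total Probability for the denominator
pr_r = pr_r_given_b * pr_b + pr_r_given_fair * (1 - pr_b)
assert pr_r == Fraction(157, 2880)

# Bayes' Theorem
pr_b_given_r = pr_r_given_b * pr_b / pr_r
assert pr_b_given_r == Fraction(81, 157)
```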

Independence

What we might notice with the introduction of conditional probability is that there are some events that don't affect each other. That is, we can take two events, condition one of them against the other and the outcome doesn't change.

Events $A$ and $B$ are independent if $\Pr(A \cap B) = \Pr(A) \cdot \Pr(B)$.

From the definitions we've seen, by rearranging the equation for conditional probability to get $\Pr(A \cap B) = \Pr(A \mid B) \cdot \Pr(B)$, we can make the following observation: if $A$ and $B$ are independent and $\Pr(B) \gt 0$, then $\Pr(A \mid B) = \Pr(A)$, and symmetrically, if $\Pr(A) \gt 0$, then $\Pr(B \mid A) = \Pr(B)$. In other words, conditioning on one event tells us nothing new about the other.

It's worth pointing out here the difference between independence and disjointness. The key lies in correctly interpreting the definition of disjointness: if $A \cap B = \emptyset$, then $\Pr(A \cap B) = \Pr(\emptyset) = 0$. In other words, events $A$ and $B$ cannot and do not ever occur at the same time. On the other hand, independence tells us that $A$ and $B$ can occur at the same time, but they do not have any effect on each other.

We say events $A$ and $B$ are positively correlated if $\Pr(A \cap B) \gt \Pr(A) \cdot \Pr(B)$ and they are negatively correlated if $\Pr(A \cap B) \lt \Pr(A) \cdot \Pr(B)$.

Consider a roll of a six-sided die and the events $A$, that the roll is odd, and $B$, that the roll is prime.

Are these events independent? If not, how are they related? First, our sample space is $\Omega = \{1,2,3,4,5,6\}$, so our events are going to be $A = \{1,3,5\}$ and $B = \{2,3,5\}$. Then $\Pr(A) = \frac 3 6 = \frac 1 2$ and $\Pr(B) = \frac 3 6 = \frac 1 2$. We want to consider the probability of the intersection of our events. This gives us $$\Pr(A \cap B) = \Pr(\{1,3,5\} \cap \{2,3,5\}) = \Pr(\{3,5\}) = \frac 2 6 = \frac 1 3.$$ But $\Pr(A) \cdot \Pr(B) = \frac 1 2 \cdot \frac 1 2 = \frac 1 4$, which is not $\frac 1 3$, so $A$ and $B$ are not independent. Furthermore, we have $$\Pr(A \cap B) = \frac 1 3 \gt \frac 1 4 = \Pr(A) \cdot \Pr(B),$$ so $A$ and $B$ are positively correlated. Of course, this makes sense intuitively, because we know the only even prime number is 2.
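Checking independence and correlation mechanically, again by counting in the uniform space:

```python
from fractions import Fraction

omega = frozenset(range(1, 7))
A = frozenset({1, 3, 5})  # the roll is odd
B = frozenset({2, 3, 5})  # the roll is prime

def pr(event):
    return Fraction(len(event), len(omega))

assert pr(A & B) == Fraction(1, 3)
assert pr(A) * pr(B) == Fraction(1, 4)
assert pr(A & B) > pr(A) * pr(B)  # positively correlated, so not independent
```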

Events $A_1, A_2, \dots, A_k$ are pairwise independent if for all $i \neq j$, $A_i$ and $A_j$ are independent. The events are mutually independent if for every $I \subseteq \{1, \dots, k\}$, $$\Pr \left( \bigcap_{i \in I} A_i \right) = \prod_{i \in I} \Pr(A_i).$$

This definition suggests that there can be events $A_1, A_2, \dots, A_k$ which are pairwise independent, but not mutually independent. Coming up with an example of three events $A, B, C$ which are pairwise but not mutually independent makes for a nice exercise.

The Probabilistic Method

Recall that the Ramsey number $R(m,n)$ is the smallest number of people required to guarantee either a group of $m$ mutually friendly people or $n$ mutually unfriendly people. Computing Ramsey numbers is actually quite difficult, but we can use probability theory to get some nice bounds. The following proof technique is called the Probabilistic Method, due to Paul Erdős in 1947.

Very often, we want to know that some object exists. We've seen some ways to deal with this. The most straightforward way is to simply show how to find or construct it via a direct, constructive proof. But we've also seen indirect methods, like contradiction, where we can argue that the object must exist, but we don't provide a way to actually find it.

The probabilistic method is another indirect method. Here's the idea: we want to consider a set of objects and show that there's some object in this set with the property that we want. If we consider a probability space of these objects and consider one at random, we can think about the probability that this object may or may not have our property. If we can argue that our property appears with non-zero probability, then it must exist!

For any $k \geq 3$, $R(k,k) \gt 2^{\frac k 2 - 1}$.

We consider an arbitrary group of $n$ people and assume that the probability whether each pair is friendly or not is $\frac 1 2$, independent of any other pairs. One way to think about this is to take each pair of people and flip a coin and assign whether they are friendly or not based on the outcome.

Then whether there are $k$ people who are all mutually friendly or unfriendly has a total probability of $\frac 1 {2^{\binom k 2}}$. To see this, we consider an arbitrary group of $k$ people. Since their relationship status is assigned independently of each other with $\frac 1 2$, this becomes $\left(\frac 1 2\right)^p$ for $p$ pairs. But since there are $k$ people, there are $\binom k 2$ possible pairs, so we get $p = \binom k 2$.

Now, we note that there are two possible configurations of interest: all $\binom k 2$ pairs are friendly or they are all unfriendly. This means that we can double our probability to $2 \cdot \frac 1 {2^{\binom k 2}}$.

But this is for an arbitrary group of $k$ people out of $n$. We must now consider all possible such groups. There are $\binom n k$ of these. If we enumerate these groups, we can let $K_i$ be the event that the $i$th group of $k$ people are mutually friendly or unfriendly. We have $\Pr(K_i) = 2 \cdot \frac 1 {2^{\binom k 2}}$ for $1 \leq i \leq \binom n k$. Then by the union bound, we have \[\Pr\left( \bigcup_{i=1}^{\binom n k} K_i \right) \leq \sum_{i=1}^{\binom n k} \Pr(K_i) = \binom n k \cdot 2 \cdot \frac 1 {2^{\binom k 2}}.\]

Note that if the probability on the left hand side is strictly less than 1, then there is a nonzero probability that no group of $k$ people is mutually friendly or mutually unfriendly, so a configuration with no such group must exist, and therefore $R(k,k) \gt n$. So what we would like is to show that choosing $n \leq 2^{\frac k 2 - 1}$ forces the right hand side, and hence the probability, to be less than 1. This would prove our theorem.

First, we observe that $\binom n k \leq n^k$, which is a fairly rough estimate, and recall that $\binom k 2 = \frac 1 2 k(k-1)$. Now suppose $n \leq 2^{\frac k 2 - 1}$. Then \[\binom n k \cdot 2 \cdot \frac 1 {2^{\binom k 2}} \leq 2n^k \cdot 2^{-\frac 1 2 k(k-1)} \leq 2 \cdot 2^{k \left( \frac k 2 - 1 \right)} \cdot 2^{-\frac 1 2 k(k-1)} = 2^{1 - \frac k 2},\] which is strictly less than 1 for $k \geq 3$. This says that when we have a group of size at most $2^{\frac k 2 - 1}$, there is a nonzero probability that it does not contain $k$ mutually friendly or unfriendly people. In other words, to guarantee that our group contains such a subgroup, we need strictly more than this.
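A quick numeric check of the union bound for a few values of $k$ (this verifies the inequality numerically, not the proof itself; the helper name `union_bound` is our own):

```python
from math import comb

def union_bound(n, k):
    """binom(n, k) * 2 / 2^binom(k, 2): an upper bound on the probability
    that some group of k people is mutually friendly or unfriendly."""
    return comb(n, k) * 2 / 2 ** comb(k, 2)

# with n = 2^(k/2 - 1) people, the bound stays below 1, so R(k, k) > n
for k in (8, 10, 12, 14):
    n = int(2 ** (k / 2 - 1))
    assert union_bound(n, k) < 1
```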

Random variables

Recall at the beginning of the course, when we were first discussing logic, that we could define propositional statements, but at some point we hit a wall in terms of being able to express properties of objects. At that point, we had to introduce predicates into our language to be able to express such notions.

We have a very similar problem at this point, where we have been discussing and defining events explicitly whenever we want to discuss some particular set of outcomes. What would be nice is if there were a way to take outcomes from our sample space and assign them values based on their properties. Of course, this describes what a function would do, and so, we will introduce a particular kind of function for this purpose.

A random variable is a function $f : \Omega \to \mathbb R$ where $\Omega$ is the sample space of a probability space.

Note that the choice of co-domain $\mathbb R$ is more permissive than prescriptive. A lot of the time, we'll be mapping objects to a particular subset of $\mathbb R$, like $\mathbb N$ or even $\{0,1\}$. And while the standard definition of a random variable is for $\mathbb R$, this can be extended to more exotic co-domains than just $\mathbb R$. In our case, since we're primarily concerned with discrete sample spaces, we are likely to consider random variables that map to discrete spaces.

Now, you may have noticed from our discussion that random variables are not random and are not variables (a random variable is a function, as we can see from the definition above). This naming is due to Francesco Cantelli, from the early 1900s.

The simplest example of a random variable is one that simply indicates whether an event has occurred. These are called indicator random variables or just indicator variables.

For an event $A$, the indicator variable $I_A$ is a function $I_A : \Omega \to \{0,1\}$ which is defined by \[I_A(\omega) = \begin{cases} 1 & \text{if $\omega \in A$}, \\ 0 & \text{if $\omega \not \in A$}. \end{cases} \]

Having set up this mechanism, we can define the probability of a particular outcome of a random variable.

If $X$ is a random variable, we define $\Pr(X = r) = \Pr(\{\omega \in \Omega \mid X(\omega) = r\})$.

Suppose we flip a coin three times. Let $X(\omega)$ be the random variable that equals the number of heads that appear when $\omega$ is the outcome. So we have \begin{align*} X(HHH) &= 3, \\ X(HHT) = X(HTH) = X(THH) &= 2, \\ X(TTH) = X(THT) = X(HTT) &= 1, \\ X(TTT) &= 0. \end{align*} Now, suppose that the coin is fair. Since we assume that each of the coin flips is mutually independent, this gives us a probability of $\frac 1 2 \frac 1 2 \frac 1 2 = \frac 1 8$ for any one of the above outcomes. Now, let's consider $\Pr(X = k)$:

| $k$ | $\{\omega \in \Omega \mid X(\omega) = k\}$ | $\Pr(X = k)$ |
|---|---|---|
| 3 | $\{HHH\}$ | $\frac 1 8$ |
| 2 | $\{HHT,HTH,THH\}$ | $\frac 3 8$ |
| 1 | $\{TTH,THT,HTT\}$ | $\frac 3 8$ |
| 0 | $\{TTT\}$ | $\frac 1 8$ |
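The table above can be reproduced by enumerating the sample space directly:

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=3))  # outcomes of three coin flips

def X(w):
    return w.count("H")  # number of heads in outcome w

def pr_X_equals(k):
    return Fraction(sum(1 for w in omega if X(w) == k), len(omega))

assert [pr_X_equals(k) for k in (3, 2, 1, 0)] == \
       [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]
```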

This looks suspiciously like something we've seen before. Here, we'll introduce the notion of Bernoulli trials.

A Bernoulli Trial is a random variable $B$ whose range is restricted to $\{0,1\}$, where the event $\{\omega \in \Omega \mid B(\omega) = 1\}$ is the success event, and $\{\omega \in \Omega \mid B(\omega) = 0\}$ is the failure event.

Bernoulli trials are named after Jacob Bernoulli (also sometimes referred to as James Bernoulli), who worked in the late 1600s. There are lots of other mathematicians in the Bernoulli family, but the lion's share of mathematical results named after Bernoulli are really named after this particular Bernoulli.

Bernoulli trials are a repetition of $n$ mutually independent events. How should we interpret this in our probability framework? In other words, what do the sample space and the probability distribution look like?

If we have a probability space $(\Omega,\Pr)$ and we want to conduct $n$ independent trials, then what we really want is to take the product of a bunch of these outcomes. So we define the space $(\Omega^n, \Pr_n)$, where $\Omega^n$ is the set of $n$-tuples $(a_1, a_2, \dots, a_n)$ with $a_i \in \Omega$ and $$\Pr_n((a_1, a_2, \dots, a_n)) = \Pr(a_1) \Pr(a_2) \cdots \Pr(a_n).$$ Here, each position $i$ corresponds to one of our trials and we can define the corresponding Bernoulli trial by $X_i((a_1, \dots, a_n)) = X(a_i)$. Since we're taking the product, none of the trials in our tuple depend on each other, so they're independent. We can use this idea to prove the following.

The probability of $k$ successes in $n$ independent Bernoulli trials with probability of success $p$ and probability of failure $q = 1-p$ is $$\binom n k p^k q^{n-k} = \binom n k p^k (1-p)^{n-k}.$$

For each $n$-tuple $A = (a_1, \dots, a_n)$ of trials with Bernoulli random variable $X$, we can construct a string $w_A = b_1 b_2 \cdots b_n$ over the alphabet $\{S,F\}$ by $$b_i = \begin{cases} S & \text{if $X(a_i) = 1$,} \\ F & \text{otherwise.} \end{cases}$$ Then the number of $n$-tuples with $k$ successes is the same as the number of strings of length $n$ over $\{S,F\}$ with $k$ $S$'s. Recall that there are exactly $\binom n k$ of these.

Then for each $n$-tuple with $k$ successes, we have $\Pr_n(A) = p^k (1-p)^{n-k}$. Putting this together, the probability of $k$ successes is \[\binom n k p^k (1-p)^{n-k}.\]

 

This is called the binomial distribution, for the reason that this is what you'd get if you plugged $x = p$ and $y = 1-p$ into the Binomial Theorem. It is from this that we get the definition $$b(k;n,p) = \binom n k p^k (1-p)^{n-k}.$$ We can then reinterpret our results from the coin-flipping example as Bernoulli trials. Since the probability for a fair coin flip to land on heads is $p = \frac 1 2$, we have

| $k$ | $b(k;n,p)$ | value |
|---|---|---|
| 3 | $\binom 3 3 \left( \frac 1 2 \right)^3 \left( \frac 1 2 \right)^0$ | $\frac 1 8$ |
| 2 | $\binom 3 2 \left( \frac 1 2 \right)^2 \left( \frac 1 2 \right)^1$ | $\frac 3 8$ |
| 1 | $\binom 3 1 \left( \frac 1 2 \right)^1 \left( \frac 1 2 \right)^2$ | $\frac 3 8$ |
| 0 | $\binom 3 0 \left( \frac 1 2 \right)^0 \left( \frac 1 2 \right)^3$ | $\frac 1 8$ |
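Finally, the binomial distribution itself is a one-liner, and we can confirm these values as well as the fact that the probabilities sum to 1 for any $p$:

```python
from fractions import Fraction
from math import comb

def b(k, n, p):
    """The binomial distribution b(k; n, p) = C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

half = Fraction(1, 2)
assert [b(k, 3, half) for k in (3, 2, 1, 0)] == \
       [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]

# any choice of p gives a probability distribution over k = 0, ..., n
assert sum(b(k, 10, Fraction(3, 4)) for k in range(11)) == 1
```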