It should have been pretty obvious from the themes of the course that we are eventually going to run into languages that we can't recognize using a finite automaton. From the perspective of computation, this means that there are problems that we can't compute using a DFA. Of course, this is obvious: after all, DFAs have a finite amount of memory.
However, before trying to figure out how we can modify finite automata to give them more power in order to recognize such non-regular languges, we're going to study what non-regular languages might look like. After all, while reaching for a fancier model of computation might solve our problems, it doesn't help us identify and understand what makes a non-regular language impossible for a finite automaton to recognize.
We'll describe a condition that all regular languages satisfy: the Pumping Lemma. First, I'll give you the statement of the lemma, and then reassure you that it's not really as bad as it looks.
Lemma (Pumping lemma for regular languages). Let $L \subseteq \Sigma^*$ be a regular language. Then there exists a positive integer $n$ such that for every string $w \in L$ with $|w| \geq n$, there exists a factorization $w = xyz$ with $x,y,z \in \Sigma^*$ such that
- $y \neq \varepsilon$,
- $|xy| \leq n$, and
- $xy^iz \in L$ for all $i \in \mathbb N$.
To anyone who isn't familiar with it already, the statement of the pumping lemma can be very intimidating, but the idea behind it is quite simple. Here, $n$ is often referred to as the pumping length of $L$ and it is a property of the regular language $L$. What the pumping lemma intends to say is that for a long enough string, there are some things that we know about it and other similar strings in $L$ if $L$ is regular.
Specifically, if $L$ is regular, then it is recognized by a DFA and the DFA only has so many states, say $n$. If we read a long enough string, with more than $n$ symbols, then it has to reach at least one of the $n$ states more than once some point. This means that there's a loop in the DFA and we should be able to repeat the part of the word that goes through the loop as much as we like, if it really is recognized by a DFA. So let's walk through this in more detail as we try to prove the lemma.
Proof. Let $A = (Q,\Sigma,\delta,q_0,F)$ be the DFA that recognizes $L$ and let $n = |Q|$ be the number of states of $A$. If there are no strings that are of length at least $n$, then we're done (But what does this say about the language $L$?).
So, we assume that there is a string in $L$ with length at least $n$, say $w = a_1 a_2 \cdots a_\ell$, with $a_i \in \Sigma$ and $1 \leq i \leq \ell = |w| \geq n$. Since $w \in L$, we know that $w$ is accepted by $A$, meaning that there is a sequence of states $p_0, p_1, \dots, p_\ell \in Q$ where $p_0 = q_0$, $p_\ell \in F$, and $\delta(p_i, a_{i+1}) = p_{i+1}$ for $0 \leq i \leq \ell - 1$.
Note that since $\ell \geq n$, we have a sequence of states $p_0, \dots, p_n$ with $n+1$ states. However, by our definition of $A$, the state set $Q$ only contains $n$ distinct elements. Therefore, at least one of the states is repeated in the sequence. In other words, we have $p_s = p_t$ for some $0 \leq s < t \leq n$.
Now factor $w$ into three words
Here, we see that $y$ is non-empty, as required by condition (1) of the lemma, since $|y| = t-s$ by our requirement that $s < t$. Then $xy$ satisfies condition (2) since $|xy| = t$ and we set $t \leq n$.
To see that this factorization of $w$ also satisfies condition (3), we recall the sequence of states $p_0, \dots, p_\ell$. Then it is clear that by reading $x$, $y$, and $z$, we have the following path through states of $A$, $$q_0 = p_0 \xrightarrow{x} p_s \xrightarrow{y} p_t \xrightarrow{z} p_\ell \in F.$$ But since $p_s = p_t$, it is clear that we have $$q_0 = p_0 \xrightarrow{x} p_s \xrightarrow{y^i} p_t \xrightarrow{z} p_\ell \in F$$ for all $i \in \mathbb N$.
Thus, we have shown that any string $w \in L$ of length greater than $n$ satisfies conditions (1-3). $$\tag*{$\Box$}$$
It's important to note that what the Pumping lemma says is that all regular languages will satisfy these conditions. What it doesn't say is that if a language satisfies these conditions, it must be regular. Basically, the Pumping lemma does not give a characterization for the regular languages.
The pumping lemma is another result that was shown by Rabin and Scott in 1959 (along with the definition of NFAs and the subset construction that we saw earlier). As we'll see, next, this result seems to be a fairly important tool, since it has a name and everything. But it turns out that in the paper, this was a fairly minor characterization, sandwiched in between two other theorems. Hence, why this seemingly important result is only a lemma.
Now, we're going to take a look at some examples of languages that are not regular and use the pumping lemma to prove that they're not regular. The pumping lemma is a property that regular languages must satisfy. We can show a language isn't regular by making use of this fact.
Suppose we have a language $L$ that we suspect may not be regular and we would like to confirm our suspicion. Well, if it is regular, then it must satisfy the pumping lemma. What we will do is assume that $L$ is regular and show that it can't satisfy the pumping lemma, obtaining a contradiction to our assumption that $L$ was regular.
Proposition. Let $L = \{a^k b^k \mid k \geq 0\}$. Then $L$ is not regular.
Proof. Suppose $L$ is regular and let $n > 0$ be the pumping length for $L$. Let $w = a^n b^n$. Since $|w| = 2n \geq n$, we can factor $w = xyz$ with $x = a^s$ and $y = a^t$ with $t > 0$ so $y \neq \varepsilon$ and $s+t \leq n$ to satisfy the conditions of the pumping lemma. Then we can write $w = a^s a^t a^{n-s-t} b^n$. To satisfy the conditions of the pumping lemma, we must have $a^s a^{i \cdot t} a^{n-s-t} b^n \in L$ for all $i \in \mathbb N$. However, $$s + i \cdot t + n - s - t = n + (i-1)t$$ which is not equal to $n$ when $i \neq 1$. Therefore $xy^iz \not\in L$ for $i \neq 1$ and $L$ does not satisfy the conditions of the pumping lemma. Thus, $L$ is not regular. $$\tag*{$\Box$}$$
Here, we note that the choice of $w$ makes a difference in how easy the rest of the proof goes. It's important to go over what the pumping lemma says again: the conditions apply for any string with length greater than $n$. In other words, we can choose any string we wish as long as it has length greater than $n$. An easy mistake to make is thinking that $w$ must have length exactly $n$.
The reason your choice of $w$ matters is because of the conditions on the factorization of $w$ into $xyz$. Notice that the pumping lemma says that there exists a factorization such that the conditions hold. In other words, in order to defeat the pumping lemma, you must show that for your chosen word $w$, every factorization of $w$ cannot satisfy the conditions of the pumping lemma.
To see this, suppose we have the language $L = (ab)^*$, which is regular. For any $n > 0$, choose $w = (ab)^n \in L$. Now choose the factorization $x = a, y = b, z = (ab)^{n-1}$. Then $xy^i z = ab^i (ab)^{n-1}$, which is not in $L$ and we might mistakenly conclude that $L$ is not regular. However, choosing $x = \varepsilon, y = ab, z = (ab)^{n-1}$ gives us a factorization that satisfies the pumping lemma.
In the proof of the proposition above, by choosing $w = a^n b^n$, we simplified the rest of the proof, since all possible factorizations $xyz$ had $y = a^t$ for some $t > 0$. On the other hand, if we had chosen, say, $w = a^{\frac n 2} b^{\frac n 2}$, then we would have had to show that the pumping lemma would not be satisfied for factorizations with $y = a^t$, $y = a^s b^t$, and $y = b^t$ for $s,t > 0$. Of course, it is possible to prove that the pumping lemma doesn't hold via $w = a^{\frac n 2} b^{\frac n 2}$, but it is a lot more work.
Proposition. Let $L = \{a^p \mid \text{$p$ is a prime number} \}$. Then $L$ is not regular.
Proof. Suppose $L$ is regular and let $n > 0$ be the pumping length for $L$. Since there are infinitely many primes, for any possible pumping length $n$, we can always choose a prime $p \geq n$. Let $w = a^p$.
Since $|w| = p \geq n$, we can factor $w = xyz$ with $|xy| \leq n$ and $y \neq \varepsilon$ and we have $y = a^k$ with $1 \leq k \leq n$, satisfying the conditions of the pumping lemma. Then if $L$ is regular, we must have $a^{p+(i-1)k} \in L$ for every $i \in \mathbb N$. However, if $i = p+1$, we have $p+pk = (k+1)p$, which is not prime. Therefore, $a^{p+(i-1)k} \not \in L$ for $i = p+1$ and therefore $w$ does not satisfy the conditions of the pumping lemma. Thus, $L$ is not regular. $$\tag*{$\Box$}$$
The non-regularity of the language of binary strings representing prime numbers was shown by Hartmanis and Shank in 1967. While they mention the pumping lemma, they actually go on to show something even stronger, that this language is not even context-free.
Proposition. Let $L = \{ w \in \{a,b\}^* \mid w = w^R \}$. Then $L$ is not regular.
Proof. Suppose $L$ is regular and let $n > 0$ be the pumping length for $L$. Let $w = a^n b a^n$. We have $|w| = 2n+1 \geq n$, so we factor $w = xyz$ such that $x = a^s$ and $y = a^t$ with $t > 0$ and $s+t \leq n$ as required by the pumping lemma. Then we have $w = a^s a^t a^{n-s-t} b a^n$. To satisfy the conditions of the pumping lemma, we must have $w = a^s a^{i \cdot t} a^{n-s-t} b a^n \in L$ for all $i \in \mathbb N$. However, for $i = 2$, we have $$xy^2z = a^{n+t} b a^n.$$ Since $t \geq 1$, it is clear that $xy^2 z \neq (xy^2 z)^R$ and therefore $xy^2 z \not \in L$. This contradicts the pumping lemma, since there exists $i \in \mathbb N$ for which $xy^i z \not \in L$. Thus, $L$ is not regular. $$\tag*{$\Box$}$$
We've already seen that there are some language operations for which, when performed on a regular language also gives us a regular language. Specifically, we showed that regular languages are closed under the regular operations of union, concatenation, and Kleene star.
We say that a class of languages is closed under an operation if the operation on languages of in the class produces a language that's also in the class. Showing that a class is closed under various operations is helpful because it gives us a way to construct new languages within the class using languages that we already know belong to the class. We call these closure properties of the language class.
Recall the complement operation discussed at the beginning of the course. We will show that the regular languages are closed under complement.
Proposition. Let $L \subseteq \Sigma^*$ be a regular language. Then $\overline L = \Sigma^* \setminus L$ is a regular language.
Proof. Since $L$ is a regular language, it is recognized by some DFA $A = (Q,\Sigma,\delta,q_0,F)$. We will construct a DFA $\overline A$ which recognizes $\overline L$ as follows: let $\overline A = (Q,\Sigma,\delta,q_0,Q \setminus F)$. In other words, we take the DFA $A$, and swap the set of non-final and final states. Then $\delta(q,w) \in F$ is an accepting state in $A$ and non-accepting in $\overline A$, while $\delta(q,w) \not \in F$ is an accepting state in $\overline A$ and is non-accepting in $A$. $$\tag*{$\Box$}$$
Proposition. Let $L \subseteq \Sigma^*$ be a regular language. Then $L^R$ is a regular language.
Proof. There are two ways to show this, either via construction of an NFA or by working through the definition of a regular language. We'll outline the construction of the NFA first but won't formally prove its correctness. Since $L$ is a regular language, it is recognized by a DFA $A = (Q,\Sigma,\delta,q_0,F)$. We will construct an $\varepsilon$-NFA $A' = (Q \cup \{q_0'\}, \Sigma, \eta, q_0', \{q_0\}$, where $q_0'$ is a new state not in $Q$ and the transition function $\eta$ is defined as follows:
Conceptually, what $Q'$ does is it begins reading words at the final states of $A$, moving backwards along the transitions of $A$, and accepting a word if it reaches the initial state of $A$.
Now, we'll see how to show this formally via regular expressions. We can define the reverse of a regular expression $E$ by $E^R$ as follows:
Most of this looks the same, but the key to the reversal is in the concatenation step. From this, it is clear that $L(E^R) = L(E)^R$. Since $E$ is a regular expression, and we have shown that $E^R$ is a regular expression, $L(E)^R$ must be a regular language. $$\tag*{$\Box$}$$
Finally, consider the intersection of two languages $L_1$ and $L_2$. While we can certainly construct an $\varepsilon$-NFA similar to what we did to show that regular languages are closed under union, we can make use of closure properties we've already proved.
Proposition. Let $L_1,L_2 \subseteq \Sigma^*$ be regular languages. Then $L_1 \cap L_2$ is a regular language.
Proof. By De Morgan's laws, we have $$L_1 \cap L_2 = \overline{\overline{L_1} \cup \overline{L_2}}.$$ If $L_1$ and $L_2$ are regular, then $\overline{L_1}$ and $\overline{L_2}$ are also regular, as we've shown earlier. Then if $\overline{L_1}$ and $\overline{L_2}$ are regular, then $\overline{L_1} \cup \overline{L_2}$ is regular. And if this is regular, then $\overline{\overline{L_1} \cup \overline{L_2}}$ is regular and therefore $L_1 \cap L_2$ is regular. $$\tag*{$\Box$}$$
Alternatively, we can show this directly by constructing a DFA and proving that it accepts the intersection of the two languages. Let $A_1 = (Q_1,\Sigma,\delta_1,q_{0,1},F_1)$ and $A_2 = (Q_2,\Sigma,\delta_2,q_{0,2},F_2)$ be two DFAs. Then we can construct a DFA $A' = (Q',\Sigma,\delta',q_0',F')$, where
Conceptually, this machine processes words through both machines $A_1$ and $A_2$ simultaneously. States are pairs of states of $A_1$ and $A_2$ and move via both transition functions. Then $A'$ accepts a word $w$ if it reaches a pair $\langle p,q \rangle$ where $p \in F_1$ and $q \in F_2$ and this happens only when there is a path in both machines on $w$ from their respective initial states to the final states.
So now that we've seen a few ways to use the pumping lemma, we'll try something a bit different in the following examples. Here, we'll make use of the fact that we know that certain operations, when applied to regular languages, guarantee that the result is also regular. By assuming that our languages are regular, we can obtain a contradiction.
Proposition. Let $L = \{ w \in \{a,b\}^* \mid w \neq w^R \}$. Then $L$ is not regular.
Proof. Suppose $L$ is regular. Since the regular languages are closed under complementation, $\overline L$ is also regular. However, $$\overline L = \{ w \in \{a,b\} \mid w = w^R \}.$$ We showed that this language is not a regular language above. Thus, this contradicts our assumption that $L$ is regular and $L$ must be non-regular. $$\tag*{$\Box$}$$
In this example, you can see how it may have been challenging to come up with a good string to use with the pumping lemma. Sometimes, taking advantage of closure properties allows us to avoid the trouble of going through an entire pumping argument again and lets us make use of results we've already proved.
Proposition. Let $L = \{ a^i b^j \mid i \neq j \}$. Then $L$ is not regular.
Proof. Suppose $L$ is regular. Then $\overline L$ is regular, since regular languages are closed under complement. Consider the language $a^* b^*$. We have $$\overline L \cap a^* b^* = \{a^k b^k \mid k \geq 0\},$$ which we showed was non-regular above. However, $a^* b^*$ is a regular language and the class of regular languages is closed under intersection, a contradiction. Therefore, $L$ is not regular. $$\tag*{$\Box$}$$
In this example, it is important to note that simply taking the complement would not have given us the result we wanted.