There is a theme emerging here: finite automata are characterized by the fact that they can only remember a fixed, finite amount of information, which are represented in its states. This suggests that if there are problems that require an unbounded amount of memory, finite automata will be unable to accept them. What might such problems look like?
Consider the following languages over $\{a,b\}$. Some of them are recognizable and some of them are not.
It turns out that $L_1$ and $L_4$ are recognizable but $L_2$ and $L_3$ are not. How can we tell which is which?
The way to show that a language is recognizable is straightforward: construct a finite automaton that recognizes it. But what about showing that a language isn’t recognizable? We need to show that there is no finite automaton that recognizes it. So far, we only have some intuition about “needing to remember too much”. Rather than tackle this directly, we’ll try to describe a condition that all recognizable languages must satisfy based around this notion of “finite recognizability”.
Let $L \subseteq \Sigma^*$ be a recognizable language. Then there exists a positive integer $n$ such that for every string $w \in L$ with $|w| \geq n$, there exists a factorization $w = xyz$ with $x,y,z \in \Sigma^*$ such that
On first contact, the statement of the pumping lemma can be very intimidating since it really layers on the quantifiers, but the idea it tries to capture is fairly straightforward.
If $L$ is recognizable, then it is recognized by a DFA and the DFA only has so many states, say $n$. If the DFA accepts a long enough string, longer than the number of states $n$, then it has to enter at least one of the states more than once some point. This means that there’s a loop in the DFA and we should be able to repeat the part of the string that goes through the loop as much as we like before continuing with the rest of the string. Any such string should also be accepted.
Let’s make this argument more formal.
Since $L$ is recognizable, there exists a DFA $A = (Q,\Sigma,\delta,q_0,F)$ that recognizes $L$. Let $n = |Q|$ be the number of states of $A$. If there are no strings that are of length at least $n$, then we’re done (But what does this say about the language $L$?).
So, we assume that there is a string in $L$ with length at least $n$, say $w = a_1 a_2 \cdots a_\ell$, where $\ell \geq n$ and $a_i \in \Sigma$ for $1 \leq i \leq \ell$. Since $w \in L$, we know that $w$ is accepted by $A$, meaning that there is a sequence of states $p_0, p_1, \dots, p_\ell \in Q$ where $p_0 = q_0$, $p_\ell \in F$, and $\delta(p_i, a_{i+1}) = p_{i+1}$ for $0 \leq i \leq \ell - 1$.
Note that since $\ell \geq n$, we have at least a sequence of states $p_0, \dots, p_n$ at the start of our sequence and there are exactly $n+1$ states in this sequence. Recall that the state set $Q$ only contains $n$ distinct elements, so by the pigeonhole principle, at least one of the states in this sequence is repeated. In other words, we have $p_s = p_t$ for some $0 \leq s \lt t \leq n$.
Now factor $w$ into three words
Here, we see that $y$ is non-empty, as required by condition (1) of the lemma, since $|y| = t-s$ by our observation that $s \lt t$. Then $xy$ satisfies condition (2) since $|xy| = t$ and $t \leq n$ since we found it in the sequence $p_0, \dots, p_n$.
To see that this factorization of $w$ also satisfies condition (3), we recall the sequence of states $p_0, \dots, p_\ell$. Then it is clear that by reading $x$, $y$, and $z$, we have the following path through states of $A$, $$q_0 = p_0 \xrightarrow{x} p_s \xrightarrow{y} p_t \xrightarrow{z} p_\ell \in F.$$ But since $p_s = p_t$, it is clear that we have $$q_0 = p_0 \xrightarrow{x} \overbrace{p_s \xrightarrow{y} p_t \xrightarrow{y} p_t \xrightarrow{y} \cdots \xrightarrow{y} p_t}^{\text{$i$ times}} \xrightarrow{z} p_\ell \in F$$ for all $i \in \mathbb N$.
Thus, we have shown that any string $w \in L$ of length greater than $n$ satisfies conditions (1–3).
The Pumping Lemma is named so because of the analogy of “pumping” with repeating a portion of the string. This is the case in most other languages, except for French, in which the lemma is referred to as the lemme de l’étoile, the star lemma.
It’s important to note that what the Pumping lemma says is that all recognizable languages will satisfy these conditions. What it doesn’t say is that if a language satisfies these conditions, it must be recognizable—in fact, there are languages out there that do satisfy the pumping lemma, but aren’t recognizable.
The pumping lemma is another result that was shown by Rabin and Scott in 1959 (along with the definition of NFAs, the product construction, and the subset construction that we saw earlier). As we’ll see, next, this result seems to be a fairly important tool, since it has a name and everything. But it turns out that in the paper, this was a fairly minor characterization, sandwiched in between two other theorems. Hence, why this seemingly important result is considered only a lemma.
Now, we’re going to take a look at some examples of languages that are not regular and use the pumping lemma to prove that they’re not regular. We will be using contradiction to do this, which means that we’ll need to negate that very heavily quantified statement. First, an example. Let’s use the pumping lemma to show that our earlier example is not recognizable.
Let $L = \{a^k b^k \mid k \geq 0\}$. Then $L$ is not recognizable.
Suppose $L$ is recognizable and let $n > 0$ be the pumping constant for $L$. Let $w = a^n b^n$. Since $|w| = 2n \geq n$, we can factor $w = xyz$ with $x = a^s$ and $y = a^t$ with $t > 0$ so $y \neq \varepsilon$ and $s+t \leq n$ to satisfy the conditions of the pumping lemma. Then we can write $w = a^s a^t a^{n-s-t} b^n$.
To satisfy the final condition of the pumping lemma, we must have $a^s a^{i \cdot t} a^{n-s-t} b^n \in L$ for all $i \in \mathbb N$. However, taking $i = 2$, we have \[xy^2z = a^s a^{2t} a^{n-s-t} b^n = a^{n+t}b^n.\] Since $n+t \neq n$, $xy^2z \not \in L$. Therefore $xy^iz \not\in L$ for all $i \in \mathbb N$ and $L$ does not satisfy the conditions of the pumping lemma. Thus, $L$ is not recognizable.
Let’s use this to break down what choices were being made in this proof. This involves looking carefully at the quantification of the statement and its negation.
| Pumping lemma | First-order sentence | Negation |
|---|---|---|
| There exists a positive integer $n$ | $\exists n, n \gt 0$ | $\forall n, n \gt 0$ |
| for every string $w \in L$ with $|w| \geq n$ | $\forall w, w \in L \wedge |w| \geq n$ | $\exists w, w \in L \wedge |w| \geq n$, |
| there exists a factorization $w = xyz$ such that $y \neq \varepsilon$, $|xy| \leq n$, | $\exists xyz, w = xyz \wedge y \neq \varepsilon \wedge |xy| \leq n$ | $\forall xyz, w = xyz \wedge y \neq \varepsilon \wedge |xy| \leq n$ |
| and $xy^iz \in L$ for all $n \in \mathbb N$. | $\forall i, xy^iz \in L$ | $\exists i, xy^iz \not \in L$. |
The negated form of the pumping lemma gives us a proof template. As is usual with statements that involve multiple levels of quantifiers, we can interpret this as a quantifier game, where your opponent (classically, “the demon”) chooses the $\forall$s and you choose the $\exists$s. We then view a proof in the following way:
Then we win if $xy^iz \not \in L$.
Here, we note that because we get to choose $w$, the choice of $w$ can make a difference in how easy the rest of the proof goes. Notice that we can choose any string in the language that we wish as long as it has length greater than $n$. An easy mistake to make is thinking that $w$ must have length exactly $n$.
The reason the choice of $w$ matters is because it sets the conditions on the factorization of $w$ into $xyz$. In order to defeat the pumping lemma, you must show that for your chosen word $w$, every factorization of $w$ cannot satisfy the conditions of the pumping lemma.
Consider the language $L = \{(ab)^k \mid k \geq 0 \}$. For any $n \gt 0$, choose $w = (ab)^n \in L$, which has length $2n$. Now choose the factorization $x = a, y = b, z = (ab)^{n-1}$. Then $xy^i z = ab^i (ab)^{n-1}$, which is not in $L$ for any $i$.
At this point, we might mistakenly conclude that $L$ is not recognizable. However, choosing $x = \varepsilon, y = ab, z = (ab)^{n-1}$ gives us a factorization that satisfies the pumping lemma. It turns out that $L$ is actually recognizable.
The mistake that’s made here is that we chose a specific $xyz$ for $w$ but that is not our choice to make. Rather, we have to consider all of the possible ways that the opponent can split up $w$ and show that none of them can work in their favour.
In the proof of Proposition 7.3, by choosing $w = a^n b^n$, we simplified the rest of the proof, since all possible factorizations $xyz$ had $y = a^t$ for some $t \gt 0$. On the other hand, if we had chosen, say, $w = a^{\frac n 2} b^{\frac n 2}$, then we would have had to show that the pumping lemma would not be satisfied for factorizations with $y = a^t$, $y = a^s b^t$, and $y = b^t$ for $s,t \gt 0$. Of course, it is possible to prove that the pumping lemma doesn’t hold via $w = a^{\frac n 2} b^{\frac n 2}$, but it is a lot more work.
What other kinds of problems are too difficult for finite automata? It turns out even some unary languages are too difficult for finite automata to solve.
Let $L = \{a^p \mid \text{$p$ is a prime number} \}$. Then $L$ is not recognizable.
Suppose $L$ is recognizable and let $n > 0$ be the pumping constant for $L$. Since there are infinitely many primes, for any possible pumping constant $n$, we can always choose a prime $p \geq n$. Let $w = a^p$.
Since $|w| = p \geq n$, we can factor $w = xyz$ with $|xy| \leq n$ and $y \neq \varepsilon$ and we have $y = a^k$ with $1 \leq k \leq n$, satisfying the conditions of the pumping lemma. Then if $L$ is recognizable, we must have $a^{p+(i-1)k} \in L$ for every $i \in \mathbb N$. However, if $i = p+1$, we have $p+pk = (k+1)p$, which is not prime. Therefore, $a^{p+(i-1)k} \not \in L$ for $i = p+1$ and therefore $w$ does not satisfy the conditions of the pumping lemma. Thus, $L$ is not recognizable.
What can we take away from this? One way of viewing unary languages is simply as a set of natural numbers, where $a^n$ is the unary encoding of the natural number $n$. So in fact, we can quite naturally describe decision problems on natural numbers in this way.
The non-recognizability of the language of binary strings representing prime numbers was shown by Hartmanis and Shank in 1967.
We can also make use of closure properties. Recall the work we spent in showing that certain operations, when applied to recognizable languages, guarantee that the result is also recognizable. Though these results are stated positively, we can use these to derive contradictions.
Let $L = \{ a^i b^j \mid i \neq j \}$. Then $L$ is not recognizable.
Suppose $L$ is recognizable. Then $\overline L$ is recognizable, since recognizable languages are closed under complement. Consider the language $L_{ab} = \{a^i b^j \mid i,j \geq 0\}$, the language of strings of $a$’s followed by $b$’s. We have $$\overline L \cap L_{ab} = \{a^k b^k \mid k \geq 0\},$$ which we showed was non-recognizable above. However, $L_{ab}$ is a recognizable language (try coming up with the DFA for it) and the class of recognizable languages is closed under intersection, which this contradicts. Therefore, $L$ is not recognizable.
In this example, it is important to note that simply taking the complement would not have given us the result we wanted—$\overline L$ is not the strings $a^k b^k$ because there are many other strings that are not of the form $a^i b^j$ with $i \neq j$. However, we can “filter” out some strings by taking advantage of the fact that recognizable languages are closed under intersection.
Admittedly, the use of recognizable languages in this way is a bit cumbersome. In our case, we could see pretty quickly that the language was recognizable because the DFA isn’t too complicated. But if we really did need a more complicated language, then constructing the DFA and showing that it works may turn out to be far more work than it’s worth. If only there were a better way to describe such languages...