CMSC 28000 — Lecture 10

Using the Pumping lemma

Now, we're going to take a look at some examples of languages that are not regular and use the pumping lemma to prove that they're not regular. We will be using contradiction to do this, which means that we'll need to negate that very heavily quantified statement. First, an example.

Let $L = \{a^k b^k \mid k \geq 0\}$. Then $L$ is not regular.

Suppose $L$ is regular and let $n > 0$ be the pumping length for $L$. Let $w = a^n b^n$. Since $|w| = 2n \geq n$, we can factor $w = xyz$ with $x = a^s$ and $y = a^t$ with $t > 0$ so $y \neq \varepsilon$ and $s+t \leq n$ to satisfy the conditions of the pumping lemma. Then we can write $w = a^s a^t a^{n-s-t} b^n$.

To satisfy the final condition of the pumping lemma, we must have $a^s a^{i \cdot t} a^{n-s-t} b^n \in L$ for all $i \in \mathbb N$. However, taking $i = 2$, we have \[xy^2z = a^s a^{2t} a^{n-s-t} b^n = a^{n+t}b^n.\] Since $n+t \neq n$, $xy^2z \not \in L$. Therefore $xy^iz \not\in L$ for all $i \in \mathbb N$ and $L$ does not satisfy the conditions of the pumping lemma. Thus, $L$ is not regular.

Let's use this to break down what choices were being made in this proof. This involves looking carefully at the quantification of the statement and its negation.

Pumping lemma First-order sentence Negation
There exists a positive integer $n$ $\exists n, n \gt 0$ $\forall n, n \gt 0$
for every string $w \in L$ with $|w| \geq n$ $\forall w, w \in L \wedge |w| \geq n$ $\exists w, w \in L \wedge |w| \geq n$,
there exists a factorization $w = xyz$ such that $y \neq \varepsilon$, $|xy| \leq n$, $\exists xyz, w = xyz \wedge y \neq \varepsilon \wedge |xy| \leq n$ $\forall xyz, w = xyz \wedge y \neq \varepsilon \wedge |xy| \leq n$
and $xy^iz \in L$ for all $n \in \mathbb N$. $\forall i, xy^iz \in L$ $\exists i, xy^iz \not \in L$.

The negated form of the pumping lemma gives us a proof template in the form of a quantifier game, where your opponent chooses the $\forall$s and you choose the $\exists$s. We then view a proof in the following way:

  1. The opponent chooses a pumping constant $n \gt 0$.
  2. We choose a word $w \in L$ with $|w| \geq n$.
  3. The opponent chooses a factorization $w = xyz$ with $|xy| \leq n$ and $y \neq \varepsilon$.
  4. We choose $i \in \mathbb N$.

Then we win if $xy^iz \not \in L$.

Here, we note that because we get to choose $w$, the choice of $w$ can make a difference in how easy the rest of the proof goes. Notice that we can choose any string in the language that we wish as long as it has length greater than $n$. An easy mistake to make is thinking that $w$ must have length exactly $n$.

The reason the choice of $w$ matters is because it sets the conditions on the factorization of $w$ into $xyz$. In order to defeat the pumping lemma, you must show that for your chosen word $w$, every factorization of $w$ cannot satisfy the conditions of the pumping lemma.

Suppose we have the language $L = (ab)^*$, which is regular. For any $n \gt 0$, choose $w = (ab)^n \in L$. Now choose the factorization $x = a, y = b, z = (ab)^{n-1}$. Then $xy^i z = ab^i (ab)^{n-1}$, which is not in $L$ and we might mistakenly conclude that $L$ is not regular. However, choosing $x = \varepsilon, y = ab, z = (ab)^{n-1}$ gives us a factorization that satisfies the pumping lemma.

In the proof of Proposition 10.1, by choosing $w = a^n b^n$, we simplified the rest of the proof, since all possible factorizations $xyz$ had $y = a^t$ for some $t \gt 0$. On the other hand, if we had chosen, say, $w = a^{\frac n 2} b^{\frac n 2}$, then we would have had to show that the pumping lemma would not be satisfied for factorizations with $y = a^t$, $y = a^s b^t$, and $y = b^t$ for $s,t \gt 0$. Of course, it is possible to prove that the pumping lemma doesn't hold via $w = a^{\frac n 2} b^{\frac n 2}$, but it is a lot more work.

Let $L = \{a^p \mid \text{$p$ is a prime number} \}$. Then $L$ is not regular.

Suppose $L$ is regular and let $n > 0$ be the pumping length for $L$. Since there are infinitely many primes, for any possible pumping length $n$, we can always choose a prime $p \geq n$. Let $w = a^p$.

Since $|w| = p \geq n$, we can factor $w = xyz$ with $|xy| \leq n$ and $y \neq \varepsilon$ and we have $y = a^k$ with $1 \leq k \leq n$, satisfying the conditions of the pumping lemma. Then if $L$ is regular, we must have $a^{p+(i-1)k} \in L$ for every $i \in \mathbb N$. However, if $i = p+1$, we have $p+pk = (k+1)p$, which is not prime. Therefore, $a^{p+(i-1)k} \not \in L$ for $i = p+1$ and therefore $w$ does not satisfy the conditions of the pumping lemma. Thus, $L$ is not regular.

The non-regularity of the language of binary strings representing prime numbers was shown by Hartmanis and Shank in 1967. While they mention the pumping lemma, they actually go on to show something even stronger, that this language is not even context-free.

Finally, we can also make use of the fact that we know that certain operations, when applied to regular languages, guarantee that the result is also regular. By assuming that our languages are regular, we can obtain a contradiction.

Let $L = \{ a^i b^j \mid i \neq j \}$. Then $L$ is not regular.

Suppose $L$ is regular. Then $\overline L$ is regular, since regular languages are closed under complement. Consider the language $a^* b^*$. We have $$\overline L \cap a^* b^* = \{a^k b^k \mid k \geq 0\},$$ which we showed was non-regular above. However, $a^* b^*$ is a regular language and the class of regular languages is closed under intersection, a contradiction. Therefore, $L$ is not regular.

In this example, it is important to note that simply taking the complement would not have given us the result we wanted—$\overline L$ is not the strings $a^k b^k$ because there are many other strings that are not of the form $a^i b^j$ with $i \neq j$. However, we can "filter" out some strings by taking advantage of the fact that regular languages are closed under intersection.

The Myhill–Nerode Theorem

We've seen now that regular languages have some finiteness property that we've waved our hands around. This property is clearly exhibited by the states of a finite automaton. We see that this property is present in the statement of the pumping lemma and that if a language doesn't have this property then we can conclude it is not regular.

What is this property? It's time to try to clearly articulate it. Recall that when we wanted to formally prove the correctness of a finite automaton, we had to argue that its states try to "group" strings together and every string belongs to one of these groups. Suppose $L$ is a regular language. Then there's a DFA $A$ that accepts $L$ and we know for any string $w \in \Sigma^*$, we always have that $\delta(q_0, w) \in Q$. This suggests that we should start thinking of some kind of equivalence between these strings that is related to our language $L$.

Let's recall some definitions surrounding equivalence.

A relation $R$ over a set $S$ is a subset $R \subseteq S \times S$ and we write $x \mathrel R y$ if $(x,y) \in R$. A relation $R$ is an equivalence relation if it satisfies the following:

  1. Reflexivity. $\forall x, x \mathrel R x$.
  2. Symmetry. $\forall x,y, x \mathrel R y \rightarrow y \mathrel R x$.
  3. Transitivity. $\forall x,y,z, (x \mathrel R y) \wedge (y \mathrel R z) \rightarrow x \mathrel R z$.

An equivalence relation partitions $S$ into equivalence classes. An equivalence class with representative $x$ is $$[x] = \{y \in S \mid x \mathrel R y\}.$$

The equivalence relation that most of you should be familiar with is $\equiv_m$, or equivalence modulo $m$ for the integers. For instance, we know that $\equiv_2$ has two equivalence classes: the odd and even integers. This is a partition because every integer must be odd or even.

Now, let's define an equivalence relation based on a DFA.

Let $A$ be a DFA. We define the relation $\sim_A$ for words $u,v \in \Sigma^*$ by $u \sim_A v$ if and only if $\delta(q_0,u) = \delta(q_0,v)$.

In other words, we can consider two words to be equivalent under $\sim_A$ if they reach the same state. Notice that a DFA naturally partitions all strings—every string gets taken to some state of the DFA. And the notion that the DFA considers two strings equivalent if they reach the same state makes sense: all it knows is that the two strings reached a particular state, but once a string enters the state, it doesn't know anything more about it, like how it got there.

The relation $\sim_A$ is an equivalence relation on $\Sigma^*$.

  1. $\sim_A$ is reflexive, since $x \sim_A x$ if and only if $\delta(q_0,x) = \delta(q_0,x)$.
  2. $\sim_A$ is symmetric: $$x \sim_A y \iff \delta(q_0,x) = \delta(q_0,y) \iff \delta(q_0,y) = \delta(q_0,x) \iff y \sim_A x.$$
  3. $\sim_A$ is transitive: if $x \sim_A y$ and $y \sim_A z$, then $\delta(q_0,x) = \delta(q_0,y)$ and $\delta(q_0,y) = \delta(q_0,z)$, and therefore, $\delta(q_0,x) = \delta(q_0,z)$ and $x \sim_A z$.

Therefore, $\sim_A$ is an equivalence relation.