CMSC 28000 — Lecture 19

Since we showed that the pushdown languages (language recognized by pushdown automata) are exactly the same as the context-free languages, we can also make use of PDAs to exhibit closure properties for context-free languages. Just like with regular expressions and finite automata, one model may be more convenient to work with than the other for certain operations. Here, we’ll show that CFLs are closed under intersection with a regular language. This might seem strange at first, but it turns out to be quite handy.

Let $L \subseteq \Sigma$ be a pushdown language and let $R \subseteq \Sigma^*$ be a regular language. Then $L \cap R$ is a pushdown language.

Let $A = (Q_A,\Sigma,\Gamma,\delta_A,s_A,Z_0,F_A)$ be a PDA that accepts $L$ and let $B = (Q_B, \Sigma, \delta_B, s_B, F_B)$ be a DFA that recognizes $R$. We will construct a PDA $C = (Q_C, \Sigma, \Gamma, \delta_C, s_C, Z_0, F_C)$ that recognizes $L \cap R$. We have

$Q_C = Q_A \times Q_B$,
$s_C = \langle s_A, s_B \rangle$,
$F_C = F_A \times F_B$,
and the transition function $\delta_C$ is defined by $$\delta_C(\langle p,q \rangle, a, X) = \begin{cases} \{ (\langle p',q \rangle, \gamma) \mid (p',\gamma) \in \delta_A(p,a,X) \} & \text{if $a = \varepsilon$,} \\ \{ (\langle p',\delta_B(q,a) \rangle, \gamma) \mid (p',\gamma) \in \delta_A(p,a,X) \} & \text{if $a \neq \varepsilon$.} \end{cases}$$

The PDA $C$ works by simultaneously moving through a computation of $A$ and $B$. When a transition is made, $C$ will keep track of the current state of $A$ via the first component of each state of $C$ and by using the stack as normal. The current state of $B$ is kept track of as the second component of the states of $C$. In addition, when $A$ makes $\varepsilon$-transitions, the state of $B$ does not change.

It is not too difficult to see that in this way, $C$ can move simultaneously through $A$ and $B$. Then, the set of accepting states of $C$ is defined to be pairs of states in $F_A$ and $F_B$. In other words, a word accepted by $C$ must have a computation that ends in a state $\langle p,q \rangle$ with $p \in F_A$ and $q \in F_B$. In other words, there is a computation for $w$ on $A$ that ends in a final state and there is a computation for $w$ on $B$ that ends in a final state. Thus, $C$ accepts $w$ if and only if $w$ is accepted by both $A$ and $B$ and this is only the case if and only if $w \in L \cap R$. Thus, $C$ is a PDA that accepts $L \cap R$ and $L \cap R$ is a pushdown language.

Why would you want to take an intersection with a regular language? One way to think about intersection is as kind of a filter. Regular languages happen to be quite handy for this. Using regular expressions, they’re simple enough to describe, a lot of common “patterns” can be expressed as regular languages, and it’s very obvious that the language is regular. Let’s consider the following result.

The language of squares $L_{\mathsf{sq}} = \{xx \mid x \in \{a,b\}^*\}$ is not context-free.

We’ve seen in examples that the language on non-squares is context-free. However, the language of squares is not. Intuitively, what’s the difference between something like $xx$ and $xx^R$? One way to think about it is that squares are not “balanced” in the way that palindromes are. Similarly, the “history” that we need for a square is relatively deep—we need to know the very beginning of the string at the halfway point compared to the most recent things we saw.

At first glance, this appears to be a language for which you can apply the pumping lemma directly. For instance, take $w = a^n b^n a^n b^n$. If we proceed with the proof, one of the cases will maybe look like $uv^2xy^2z = a^{n+s} b^n a^n b^n$. A common claim is that this is not a square since we have $w = w_1 w_2$ where $w_1 = a^{n+s} b^n$ and $w_2 = a^n b^n$, but this is not the actual argument you need to make. What you need to show is that $w_1 \neq w_2$ where $|w_1| = |w_2| = \frac{4n+s}2$. This is not so bad, but imagine working through the cases where $vy = a^s b^t$ or $b^s a^t$.

Using an intersection with a regular language allows us to impose additional structure to a language that makes it easier to “break”.

Suppose $L_{\mathsf{sq}}$ is context-free. Then \[L' = L_{\mathsf{sq}} \cap L(\mathtt{a^*b^*a^*b^*}) = \{a^k b^\ell a^k b^\ell \mid k, \ell \geq 0 \}.\] Since $L(\mathtt{a^*b^*a^*b^*})$ is regular (it’s defined by a regular expression), $L'$ must be context-free. Since $L'$ is context-free, let $n$ be the pumping constant for $L'$ and choose $w = a^nb^na^nb^n$. We have that $|w| = 4n$ and $w \in L'$.

Now consider factorizations of $w = uvxyz$ such that $vy \neq \varepsilon$ and $|vxy| \leq n$. We have a total of seven different cases to consider. However, we will find that we do not need to detail all seven.

Let $vy = a^k b^\ell$ such that $0 < k+\ell \leq n$ and $vxy$ lies in the first half of $w$. We have $uv^0xy^0z = a^{n-k} b^{n-\ell} a^n b^n \not \in L'$, since $n-k, n-\ell \neq n$.
Let $vy = a^k$ such that $0 < k \leq n$ and $vxy$ lies in the first half of $w$. This is just the above case with $\ell = 0$.
Let $vy = b^\ell$ such that $0 < \ell \leq n$ and $vxy$ lies in the first half of $w$. This is just Case 1 with $k = 0$.
Let $vy = a^k b^\ell$ such that $0 < k+\ell \leq n$ and lies in the second half of $w$. We apply the same argument as Case 1.
Let $vy = a^k$ such that $0 < k \leq n$ and $vxy$ lies in the second half of $w$. This is just the above case with $\ell = 0$.
Let $vy = b^\ell$ such that $0 < \ell \leq n$ and $vxy$ lies in the second half of $w$. This is just Case 4 with $k = 0$.
Let $vy = b^\ell a^k$ such that $0 < k+\ell \leq n$. We apply a similar argument as Case 1.

So in each case, there exists $i \in \mathbb N$ such that $uv^ixy^iz \not \in L'$, so $L'$ is not context-free. This is a contradiction, so $L_{\mathsf{sq}}$ is not context-free.

Notice how the use of the intersection with regular languages changed the argument we could employ to one that just involves comparing the numbers of $a$’s and $b$’s.

A corollary of this result is that context-free languages are not closed under complement: we’ve already seen that the language of non-squares is context-free, so if the context-free languages were closed under complement, the language of squares must also be context-free, but it isn’t.

So we have seen that context-free languages are closed under intersection with regular languages, but what about intersection of two context-free languages? That poses a bit more of a challenge. For one thing, our product construction approach will fail—there’s no obvious way to “share” a stack.

There exist context-free languages $L_1, L_2 \subseteq \Sigma$ such that $L_1 \cap L_2$ is not a context-free language.

Let $L_1 = \{a^i b^i c^j \mid i,j \geq 0\}$ and let $L_2 = \{a^i b^j c^j \mid i,j \geq 0\}$. It is easy to see that $L_1$ and $L_2$ are both context-free. Consider for $\sigma_1, \sigma_2 \in \Sigma$ the languages \begin{align*} L_{\sigma_1,\sigma_2} = \{\sigma_1^m \sigma_2^m \mid m \geq 0 \}, \\ K_{\sigma_1} = \{\sigma_1^i \mid i \geq 0\}. \end{align*} We have already seen that $L_{\sigma_1,\sigma_2}$ is context-free and $K_{\sigma_1}$ is just $\sigma_1^*$, which is regular (and therefore, context-free). Since context-free languages are closed under concatenation, we have $L_1 = L_{a,b} K_c$ and $L_2 = K_a L_{b,c}$. Thus, $L_1$ and $L_2$ are context-free languages. Then, we have $$L_1 \cap L_2 = \{a^i b^i c^i \mid i \geq 0\}.$$ To see this, we note that a string in $L_1 \cap L_2$ must be a string of the form $a^i b^j c^k$ that satisfies $i = j$ (i.e. is in $L_1$) and $j = k$ (i.e. is in $L_2$). This necessarily means that $i = j = k$.

We have already shown that this language is not context-free.

Thus, the class of context-free languages is not closed under intersection. Note that if we had proved this earlier, we could use this to show that the context-free languages are not closed under complementation either, via De Morgan’s laws—since CFLs are closed under union, if they were also closed under complement, that would give us closure under intersection, which we just showed isn’t the case.