CMSC 28000 — Lecture 24

Emptiness

Another common problem you can ask is whether your language device actually describes any strings at all. This is the emptiness problem: Given a device $A$, is $L(A) = \emptyset$?

We will consider the following language $$ \mathsf E_{\mathsf{DFA}} = \{ \llbracket A \rrbracket \mid \text{$A$ is a DFA and $L(A) = \emptyset$}\}. $$

$\mathsf E_{\mathsf{DFA}}$ is a decidable language.

To solve this problem, we simply treat it like a graph problem. A DFA accepts a word if there is an accepting state that is reachable from the initial state. Then we can use a similar idea to the graph connectedness algorithm from last week for this problem. The TM looks like this:

  1. Mark the initial state of $A$.
  2. Repeat until no new states get marked:
    Mark any state that has a transition coming into it from any state that is already marked.
  3. If no accepting state is marked, accept. Otherwise, reject.
        \begin{algorithmic}
        \PROCEDURE{dfa-empty}{$\llbracket A \rrbracket$}
            \STATE Mark $q_0$
            \REPEAT
                \FOR{$q \in Q$}
                    \IF{$q$ is unmarked}
                        \FOR{$\delta(p,a) = q$}
                            \IF{$p$ is marked}
                                \STATE mark $q$
                            \ENDIF
                        \ENDFOR
                    \ENDIF
                \ENDFOR
            \UNTIL{no new states have been marked}
            \FOR{$q \in F$}
                \IF{$q$ is marked}
                    \RETURN Reject
                \ENDIF
            \ENDFOR
            \RETURN Accept
        \ENDPROCEDURE
        \end{algorithmic}
    

We can also show that the emptiness problem for CFGs is also decidable. Again, it’ll look a bit different from what we’ve done with DFAs. First, here’s the language $$\mathsf E_{\mathsf{CFG}} = \{\llbracket G \rrbracket \mid \text{$G$ is a CFG and $L(G) = \emptyset$} \}.$$

$\mathsf E_{\mathsf{CFG}}$ is a decidable language.

In a way, we can apply the same kind of idea from the DFA case. Of course, a grammar isn’t a graph, so it’s not exactly the same. However, in essence, what we want to do is check that there’s a derivation that will generate some terminal beginning from the start variable. To do this, we check each variable to see if it can generate a string of terminals. We begin by marking terminals and determining whether each variable can generate a string of variables and terminals that are all marked. Once the algorithm is finished, we can simply check whether the start variable is marked or not.

  1. On input $\llbracket G \rrbracket$ where $G$ is a CFG, mark all terminal symbols in $G$.
  2. Repeat until no new variables get marked:
    Mark any variable $A$ where $G$ has a rule $A \to U_1 U_2 \cdots U_k$ and each symbol $U_1, \dots, U_k$ has already been marked.
  3. If the start variable is not marked, accept; otherwise reject.
        \begin{algorithmic}
        \PROCEDURE{cfg-empty}{$\llbracket G \rrbracket$}
            \FOR{$a \in \Sigma$}
                \STATE Mark $a$
            \ENDFOR
            \REPEAT
                \FOR{$A \in V$}
                    \IF{$A$ is unmarked}
                        \FOR{$A \to X_1 X_2 \cdots X_k \in P$}
                            \IF{$X_1, X_2, \dots, X_k$ are all marked}
                                \STATE mark $A$
                            \ENDIF
                        \ENDFOR
                    \ENDIF
                \ENDFOR
            \UNTIL{no new variables have been marked}
            \IF{$S$ is marked}
                \RETURN Reject
            \ELSE
                \RETURN Accept
            \ENDIF
        \ENDPROCEDURE
        \end{algorithmic}
    

Containment

Another common problem is if we have two language devices, whether we can compare them somehow. The most basic variant of these is asking whether one is contained in the other. This is the containment problem: Given two devices $A$ and $B$, is $L(A) \subseteq L(B)$?

Consider the following language, $$ \mathsf C_{\mathsf{DFA}} = \{ \llbracket A,B \rrbracket \mid \text{$A$ and $B$ are DFAs and $L(A) \subseteq L(B)$} \}.$$ Of course, this is another decidable property, so we have the following theorem.

$\mathsf C_{\mathsf{DFA}}$ is a decidable language.

Again, we make use of the machine that we constructed in the previous theorem. First, let’s consider what it means for $L(A)$ to be contained in $L(B)$. We want to check whether every word in $L(A)$ is also in $L(B)$. However, an equivalent way of checking this property is to check whether there are any words in $L(A)$ that are not in $L(B)$. So we construct a DFA $C$ such that $$ L(C) = L(A) \cap \overline{L(B)}$$ and check whether or not $L(C) = \emptyset$. Luckily, we’ve just shown how to check the emptiness of a DFA.

  1. On input $\llbracket A,B \rrbracket$, construct the DFA $C$ as described above.
  2. Run the TM from Theorem 24.5 on the input $\llbracket C \rrbracket$ to check if $L(C) = \emptyset$.
  3. If the machine accepts, accept. Otherwise, reject.
        \begin{algorithmic}
        \PROCEDURE{dfa-contain}{$\llbracket A, B\rrbracket$}
            \STATE Construct complement DFA $\overline B$
            \STATE Construct the product automaton $C = A \times \overline B$
            \RETURN \CALL{dfa-empty}{$\llbracket C \rrbracket$}
        \ENDPROCEDURE
        \end{algorithmic}
    

Solving containment lets us solve other similar problems. For instance, we can solve the equality problem: Given two devices $A$ and $B$, is $L(A) = L(B)$?

The language $$\mathsf{EQ}_{\mathsf{DFA}} = \{ \llbracket A,B \rrbracket \mid \text{$A$ and $B$ are DFAs and $L(A) = L(B)$} \}.$$ is a decidable language.

Of course, we have the opportunity to make use of the machine that we constructed in the previous theorem. Recall that for two sets $S$ and $T$, $S = T$ if and only if $S \subseteq T$ and $T \subseteq S$. This fact combined with the containment machine we constructed gives us a rather simple algorithm for checking equality.

  1. On input $\llbracket A,B \rrbracket$, run the TM from the previous theorem on the input $\llbracket A,B \rrbracket$ to check if $L(A) \subseteq L(B)$ and run it again on $\llbracket B,A \rrbracket$ to check if $L(B) \subseteq L(A)$.
  2. If both machines accept, accept. Otherwise, reject.
        \begin{algorithmic}
        \PROCEDURE{dfa-equal}{$\llbracket A, B\rrbracket$}
            \RETURN \CALL{dfa-contain}{$\llbracket A, B \rrbracket$} and \CALL{dfa-contain}{$\llbracket B, A \rrbracket$}
        \ENDPROCEDURE
        \end{algorithmic}
    

Then this allows us to solve the universality problem: Given a device $A$, does $L(A) = \Sigma^*$?

The language \[\mathsf U_{\mathsf{DFA}} = \{\llbracket A \rrbracket \mid \text{$A$ is a DFA and $L(A) = \Sigma^*$} \} \] is a decidable language.

Again, this is fairly straightforward based on the results we’ve already shown. One approach is to construct a simple DFA that recognizes $\Sigma^*$ and test whether $\Sigma^* \subseteq L(A)$. Another approach is the check whether $\overline{L(A)} = \emptyset$.

But what about context-free languages? Suppose we have two context-free grammars $G_1$ and $G_2$. Then $L(G_1) \subseteq L(G_2)$ if and only if $L(G_1) \cap \overline{L(G_2)} = \emptyset$, just as above. But there’s a problem with this: context-free languages aren’t closed under intersection and complement, so we’re not guaranteed to end up with a context-free language after this process.

Maybe we can try something more clever? Unfortunately, even if we tried harder, we won’t be able to solve this problem, as it turns out to be our first encounter with an undecidable problem. And since containment for CFGs is undecidable, this means that equality and universality are also going to be undecidable.

Well, that’s the claim at least, but can we prove it?

Some problems about context-free languages are undecidable

There is a similar result characterizing undecidable properties of languages for subclasses of decidable languages (i.e. context-free languages, context-sensitive languages, etc.) called Greibach’s theorem, due to Sheila Greibach in 1963 (you may remember Greibach being mentioned when were talking about normal forms for grammars).

We won’t prove Greibach’s theorem, but we will show that some problems about context-free languages are undecidable. The proofs for these kinds of problems are going to be much trickier, since we don’t have the benefit of conjuring up Turing machines. However, we can do something that may be surprising, which is make use of computations of Turing machines.

Let $M$ be a Turing machine and $w$ and input string. An accepting computation path for $M$ on $w$ is a sequence of configurations $C_1, \dots, C_\ell$, where $C_1$ is the start configuration of $M$ on $w$ and $C_\ell$ is an accepting configuration of $M$ on $w$ and each $C_i$ follows directly from $C_{i-1}$ (that is, $C_{i-1} \vdash C_i$) as defined by the transition function of $M$. (We can define a rejecting computation path similarly.)

This gives us a very strict format for the computation of a word $w$ when run on a Turing machine $M$. We can consider the language of encodings of all such computation paths. For a fixed Turing machine $M$ and input string $w$ that we would like to run on $M$, we have the language of accepting configurations \[\mathsf A(M,w) = \{\llbracket C \rrbracket \mid \text{$C$ is an accepting computation path for $M$ on $w$}\}.\] What does such a string look like? It must satisfy the following conditions:

  1. $\llbracket C \rrbracket$ is a string of the form $\llbracket C \rrbracket = \# C_1 \# C_2^R \# C_3 \# C_4^R \# \cdots \# C_\ell \#$, where $C_i$ is a configuration,
  2. $C_1$ is a starting configuration,
  3. $C_\ell$ is an accepting configuration,
  4. For each $C_i$, $C_i \vdash C_{i+1}$ according to the definition of the transition function $\delta$ of $M$.

Notice that our encoding of the computation path reverses every other configuration. This is a deliberate choice—we will get to it shortly. However, for each of the items in the previous list, we can define a language $A_i$ for them. Then the language of valid encodings of computation paths can be expressed as the intersection $$\mathsf A(M,w) = A_1 \cap A_2 \cap A_3 \cap A_4,$$ which is all strings which satisfy all of our conditions.

At this point, the approach seems straightforward: if $M$ accepts $w$, then the language $\mathsf A(M,w)$ should not be empty. We saw that we can do this if $\mathsf A(M,w)$ is context-free. Unfortunately, this is not true: the set of strings described by Condition 4 is not actually context-free.

However, we can take a more indirect route: we’ll show instead that the strings that don’t satisfy all of these conditions—the strings that aren’t valid accepting computation paths for $M$ on $w$—are a context-free language.

The language $\overline{\mathsf A(M,w)}$ is a context-free language.

We note the language of valid encodings of computation paths expressed as the intersection $$\mathsf A(M,w) = A_1 \cap A_2 \cap A_3 \cap A_4,$$ where $A_i$ is the language of strings that satisfy condition (i) above. Then by De Morgan’s laws, we have $$\overline{\mathsf A(M,w)} = \overline{A_1} \cup \overline{A_2} \cup \overline{A_3} \cup \overline{A_4}.$$ Then, we must show that each $\overline{A_i}$ are context-free languages. The first three are easy: $A_1, A_2, A_3$ are actually regular. Let $\Sigma = \Gamma \cup Q \cup \{\#\}$.

It’s not hard to see that we can come up with regular expressions for each of these languages. Once we have that, then we get that their complements are also regular and therefore they are context-free.

The not so obvious part is showing that $\overline{A_4}$ is context-free. Many strings that are obviously not valid accepting computation paths are going to be in this language. The main case we have to watch out for is when we have a string that is very close to being or almost a valid computation path.

Suppose that we have such a string that is “almost” correct—in other words, it satisfies conditions 1 through 3. To fail condition 4, we must find a substring $\#C_i\#C_{i+1}^R\#$ such that $C_i \not\vdash C_{i+1}$. Suppose we are using a PDA to do this. When reading $C_i$, we push every symbol of $C_i$ onto the stack. Then when reading $C_{i+1}^R$, we pop every symbol of the stack to compare.

What do we look for? The crucial thing to notice here is that the change to the configuration by following a single transition of a Turing machine is very localized around the position of the tape head. This means that most of the configuration is going to be the same and we can simply make sure the symbols we see are the same.

However, we need to be careful around the location of the tape head. There are two possibilities:

Notice that we can have a difference that is valid around the tape head, since this indicates that a new symbol was written and the tape head has moved. But since the tape head can only move left or right one cell, we know that the tape head can only be at most two cells away from any difference. So when comparing configurations, we simply need to check that every substring of length 3 is valid—either it’s the same or it is a valid change according to the transition function.

Since our language is specific to $M$, we can compare the two configurations against the transition function of $M$ and there are only finitely many possible changes that one can make.

Then since each of $\overline{A_i}$ are context-free, their union, $\overline{\mathsf A(M,w)}$, is also context-free.

Why does this help? Recall that our original approach was to ask whether there were any valid accepting computation paths—whether $\mathsf A(M,w)$ was non-empty. But since we know $\overline{\mathsf A(M,w)}$ is context-free, we ask the opposite: Is $\overline{\mathsf A(M,w)}$ every string? If it is, then $\mathsf A(M,w)$ is going to be empty.

But if we can answer this question, then we can determine whether $M$ accepts $w$. But this is exactly the universality problem for context-free languages: $$\mathsf U_{\mathsf{CFG}} = \{\llbracket G \rrbracket \mid \text{$G$ is a CFG and $L(G) = \Sigma^*$} \}$$

So undecidability follows immediately.

$\mathsf U_{\mathsf{CFG}}$ is undecidable.

We will assume that $\mathsf U_{\mathsf{CFG}}$ is decidable and show that doing so will allow us to decide $A_{\mathsf{TM}}$. Consider the language $\overline{\mathsf A(M,w)}$ of strings that are not accepting computation paths for $M$ when run on $w$. Observe that $M$ accepts $w$ if and only if there is an accepting computation path $C$ for $M$ when run on $w$. So $C \in \mathsf A(M,w)$ and therefore $C \not \in \overline{\mathsf A(M,w)}$. Then $\mathsf A(M,w) = \emptyset$ if and only if $M$ does not accept $w$, or equivalently, $M$ does not accept $w$ if and only if $\overline{\mathsf A(M,w)} = \Sigma^*$.

Now, suppose we have a Turing machine, say $T$ that decides $\mathsf U_{\mathsf{CFG}}$. Then we can construct a Turing machine $S$ that decides $\mathsf A_{\mathsf{TM}}$ that does the following:

  1. On input $\llbracket M,w \rrbracket$, construct a context-free grammar $G$ for $\overline{\mathsf A(M,w)}$, which we have shown is context-free by Lemma 25.1.
  2. Run $T$ on $\llbracket G \rrbracket$ to decide whether $L(G) = \Sigma^*$.
  3. If $T$ accepts, then reject. If $T$ rejects, then accept.
        \begin{algorithmic}
        \PROCEDURE{S}{$\llbracket M, w\rrbracket$}
            \STATE Construct CFG $G$ for $\overline{\mathsf A(M,w)}$
            \STATE result $\gets$ \CALL{T}{$\llbracket G \rrbracket$}
            \IF{result $=$ Accept}
                \RETURN Reject
            \ELSE
                \Return Accept
            \ENDIF
        \ENDPROCEDURE
        \end{algorithmic}
    

Since $\mathsf A_{\mathsf{TM}}$ is undecidable, we can conclude that $\mathsf U_{\mathsf{CFG}}$ is undecidable.

We can then use this to show that containment for CFGs is undecidable.

$C_{\mathsf{CFG}}$ is undecidable.

Suppose $C_{\mathsf{CFG}}$ is decidable and let $T$ be the Turing machine that decides it. We will define a TM $S$ that will decide $\mathsf U_{\mathsf{CFG}}$ that does the following:

  1. On input $\llbracket $G$ \rrbracket$, where $G$ is a CFG, construct the CFG $G'$ \[S \to a_1 S \mid \cdots a_n S \mid \varepsilon\] where $\Sigma = \{a_1, \dots, a_n\}$. The grammar $G'$ generates every string in $\Sigma^*$.
  2. Run the TM $T$ on $\llbracket G', G \rrbracket$.
  3. If $T$ accepts, then accept. If $T$ rejects, then reject.

Our TM $S$ will accept if and only if $L(G') \subseteq L(G)$. But $L(G') = \Sigma^*$, so our TM accepts if and only if $\Sigma^* \subseteq L(G)$. Since $L(G) \subseteq \Sigma^*$, $S$ accepts $\llbracket G \rrbracket$ if and only if $L(G) = \Sigma^*$. But $\mathsf U_{\mathsf{CFG}}$ is undecidable, so this is a contradiction and therefore $C_{\mathsf{CFG}}$ is not decidable.

We can use almost the same proof to show that $\mathsf{EQ}_{\mathsf{CFG}}$ is undecidable.

$\mathsf{EQ}_{\mathsf{CFG}}$ is undecidable.

This technique of defining CFLs to generate Turing machine computations (or not Turing machine computations, as it turns out) was shown by Hartmanis in 1967. He shows that several other decision problems regarding CFGs are undecidable using the same approach.

The situation for CFLs is a bit surprising—they seem to be far less powerful than even decidable languages, but are complex enough to encode behaviours of these more powerful devices. As a result, a surprising number of problems about CFLs are undecidable. In fact, it’s not until we descend all the way down to a class of (non-regular) languages called the visibly pushdown languages, a subclass of deterministic context-free languages, that we get decidability for all of our decision problems again.

Class Membership Emptiness Inclusion Equivalence Universality
Regular Decidable Decidable Decidable Decidable Decidable
Visibly pushdown Decidable Decidable Decidable Decidable Decidable
Deterministic context-free Decidable Decidable Undecidable Decidable Decidable
Linear Decidable Decidable Undecidable Undecidable Undecidable
Context-free Decidable Decidable Undecidable Undecidable Undecidable