CMSC 28000 — Lecture 24

Decision problems on formal languages

We've seen that we can ask questions about Turing machines, like membership, emptiness, and equality. We can also ask these same questions about less powerful models, like DFAs and CFGs and get corresponding decision problems for them. Let's consider such problems.

While we'll see that a lot of problems are decidable, there are some very surprising examples of problems which are undecidable. Many decidability properties that we'll be discussing were shown by Rabin and Scott in 1959 and Bar-Hillel, Perles, and Shamir in 1961.

Here are the problems we're interested in:

Membership

The simplest problem we can ask about a language device and a word is whether the word belongs in the language. Again, this is the language membership problem: Given a device $A$ and a word $w$, is $w \in L(A)$?

We define the following language for the DFA acceptance problem: $$ A_{\mathsf{DFA}} = \{\llbracket B, w \rrbracket \mid \text{$B$ is a DFA that accepts $w$}\}.$$

$A_{\mathsf{DFA}}$ is a decidable language.

We can do the same thing we did for $A_{\mathsf{TM}}$: simulate the DFA. Recall that such a machine sets up three tapes:

  1. One tape containing the description of the DFA $\llbracket B \rrbracket$.
  2. One tape containing the input string $w$.
  3. One tape containing the current state.

In fact, this is even easier since we only need to keep track of how far we've read on the second tape.

We can ask the same question for NFAs: $$ A_{\mathsf{NFA}} = \{\llbracket B, w \rrbracket \mid \text{$B$ is an NFA that accepts $w$}\}.$$

$A_{\mathsf{NFA}}$ is a decidable language.

There are a few ways to approach this. One can consider simulating the NFA directly and it's not hard to come up with a scheme to do this. But we can also transform our NFA into a DFA and just run the Turing machine for $A_{\mathsf{DFA}}$ on it.

Now, we'll turn our attention to context-free languages. Again, from our Turing machien simulation, we can very easily get a Turing machine for \[A_{\mathsf{PDA}} = \{\llbracket P, w \rrbracket \mid \text{$P$ is a PDA, $w \in L(P)$}\}.\]

$A_{\mathsf{PDA}}$ is a decidable language.

What about grammars? Consider the following language. $$ A_{\mathsf{CFG}} = \{\llbracket G, w \rrbracket \mid \text{$G$ is a CFG that generates $w$}\}.$$

$A_{\mathsf{CFG}}$ is a decidable language.

This is also decidable; we even gave an efficient algorithm for it (CYK)! So we can pretend that we have a CYK Turing machine.

Emptiness

Another common problem you can ask is whether your language device actually describes any strings at all. This is the emptiness problem: Given a device $A$, is $L(A) = \emptyset$?

We will consider the following language $$ E_{\mathsf{DFA}} = \{ \llbracket A \rrbracket \mid \text{$A$ is a DFA and $L(A) = \emptyset$}\}. $$

$E_{\mathsf{DFA}}$ is a decidable language.

To solve this problem, we simply treat it like a graph problem. A DFA accepts a word if there is an accepting state that is reachable from the initial state. Then we can use a similar idea to the graph connectedness algorithm from last week for this problem. The TM looks like this:

  1. Mark the initial state of $A$.
  2. Repeat until no new states get marked:
    Mark any state that has a transition coming into it from any state that is already marked.
  3. If no accepting state is marked, accept. Otherwise, reject.

We can also show that the emptiness problem for CFGs is also decidable. Again, it'll look a bit different from what we've done with DFAs. First, here's the language $$E_{\mathsf{CFG}} = \{\llbracket G \rrbracket \mid \text{$G$ is a CFG and $L(G) = \emptyset$} \}.$$

$E_{\mathsf{CFG}}$ is a decidable language.

In a way, we can apply the same kind of idea from the DFA case. Of course, a grammar isn't a graph, so it's not exactly the same. However, in essence, what we want to do is check that there's a derivation that will generate some terminal beginning from the start variable. To do this, we check each variable to see if it can generate a string of terminals. We begin by marking terminals and determining whether each variable can generate a string of variables and terminals that are all marked. Once the algorithm is finished, we can simply check whether the start variable is marked or not.

  1. On input $\llbracket G \rrbracket$ where $G$ is a CFG, mark all terminal symbols in $G$.
  2. Repeat until no new variables get marked:
    Mark any variable $A$ where $G$ has a rule $A \to U_1 U_2 \cdots U_k$ and each symbol $U_1, \dots, U_k$ has already been marked.
  3. If the start variable is not marked, accept; otherwise reject.

Containment

Another common problem is if we have two language devices, whether we can compare them somehow. The most basic variant of these is asking whether one is contained in the other. This is the containment problem: Given two devices $A$ and $B$, is $L(A) \subseteq L(B)$?

Consider the following language, $$ C_{\mathsf{DFA}} = \{ \llbracket A,B \rrbracket \mid \text{$A$ and $B$ are DFAs and $L(A) \subseteq L(B)$} \}.$$ Of course, this is another decidable property, so we have the following theorem.

$C_{\mathsf{DFA}}$ is a decidable language.

Again, we make use of the machine that we constructed in the previous theorem. First, let's consider what it means for $L(A)$ to be contained in $L(B)$. We want to check whether every word in $L(A)$ is also in $L(B)$. However, an equivalent way of checking this property is to check whether there are any words in $L(A)$ that are not in $L(B)$. So we construct a DFA $C$ such that $$ L(C) = L(A) \cap \overline{L(B)}$$ and check whether or not $L(C) = \emptyset$. Luckily, we've just shown how to check the emptiness of a DFA.

  1. On input $\llbracket A,B \rrbracket$, construct the DFA $C$ as described above.
  2. Run the TM from Theorem 24.5 on the input $\llbracket C \rrbracket$ to check if $L(C) = \emptyset$.
  3. If the machine accepts, accept. Otherwise, reject.

Solving containment lets us solve other similar problems. For instance, we can solve the equality problem: Given two devices $A$ and $B$, is $L(A) = L(B)$?

The language $$\mathsf{EQ}_{\mathsf{DFA}} = \{ \llbracket A,B \rrbracket \mid \text{$A$ and $B$ are DFAs and $L(A) = L(B)$} \}.$$ is a decidable language.

Of course, we have the opportunity to make use of the machine that we constructed in the previous theorem. Recall that for two sets $S$ and $T$, $S = T$ if and only if $S \subseteq T$ and $T \subseteq S$. This fact combined with the containment machine we constructed gives us a rather simple algorithm for checking equality.

  1. On input $\llbracket A,B \rrbracket$, run the TM from the previous theorem on the input $\llbracket A,B \rrbracket$ to check if $L(A) \subseteq L(B)$ and run it again on $\llbracket B,A \rrbracket$ to check if $L(B) \subseteq L(A)$.
  2. If both machines accept, accept. Otherwise, reject.

Then this allows us to solve the universality problem: Given a device $A$, does $L(A) = \Sigma^*$?

The language \[U_{\mathsf{DFA}} = \{\llbracket A \rrbracket \mid \text{$A$ is a DFA and $L(A) = \Sigma^*$} \} \] is a decidable language.

Again, this is fairly straightforward based on the results we've already shown. One approach is to construct a simple DFA that recognizes $\Sigma^*$ and test whether $\Sigma^* \subseteq L(A)$. Another approach is the check whether $\overline{L(A)} = \emptyset$.

But what about context-free languages? Suppose we have two context-free grammars $G_1$ and $G_2$. Then $L(G_1) \subseteq L(G_2)$ if and only if $L(G_1) \cap \overline{L(G_2)} = \emptyset$, just as above. But there's a problem with this: context-free languages aren't closed under intersection and complement, so we're not guaranteed to end up with a context-free language after this process.

Maybe we can try something more clever? Unfortunately, even if we tried harder, we won't be able to solve this problem, as it turns out to be our first encounter with an undecidable problem. And since containment for CFGs is undecidable, this means that equality and universality are also going to be undecidable.

Well, that's the claim at least, but can we prove it?

Undecidable problems about context-free languages

There is a similar result characterizing undecidable properties of languages for subclasses of decidable languages (i.e. context-free languages, context-sensitive languages, etc.) called Greibach's theorem, due to Sheila Greibach in 1963 (you may remember Greibach being mentioned when were talking about normal forms for grammars).

We won't prove Greibach's theorem, but we will show that some problems about context-free languages are undecidable. The proofs for these kinds of problems are going to be much trickier, since we don't have the benefit of conjuring up Turing machines. However, we can do something that may be surprising, which is make use of computations of Turing machines.

Let $M$ be a Turing machine and $w$ and input string. An accepting computation path for $M$ on $w$ is a sequence of configurations $C_1, \dots, C_\ell$, where $C_1$ is the start configuration of $M$ on $w$ and $C_\ell$ is an accepting configuration of $M$ on $w$ and each $C_i$ follows directly from $C_{i-1}$ (that is, $C_{i-1} \vdash C_i$) as defined by the transition function of $M$. (We can define a rejecting computation path similarly.)

This gives us a very strict format for the computation of a word $w$ when run on a Turing machine $M$. We can consider the language of encodings of all such computation paths. For a fixed Turing machine $M$ and input string $w$ that we would like to run on $M$, we have the language of accepting configurations \[A(M,w) = \{\llbracket C \rrbracket \mid \text{$C$ is an accepting computation path for $M$ on $w$}\}.\] What does such a string look like? It must satisfy the following conditions:

  1. $\llbracket C \rrbracket$ is a string of the form $\llbracket C \rrbracket = \# C_1 \# C_2^R \# C_3 \# C_4^R \# \cdots \# C_\ell \#$, where $C_i$ is a configuration,
  2. $C_1$ is a starting configuration,
  3. $C_\ell$ is an accepting configuration,
  4. For each $C_i$, $C_i \vdash C_{i+1}$ according to $M$.

Notice that our encoding of the computation path reverses every other configuration. This is a deliberate choice—we will get to it shortly. However, for each of the items in the previous list, we can define a language $A_i$ for them. Then the language of valid encodings of computation paths can be expressed as the intersection $$A(M,w) = A_1 \cap A_2 \cap A_3 \cap A_4,$$ which is all strings which satisfy all of our conditions.

We will show that we can compute all of the strings that aren't accepting computations.