The fact that this is described as a \emph{normal form} hints that every context-free grammar has an equivalent grammar in CNF, and therefore, that every context-free language is generated by some grammar in CNF.
As an aside, Chomsky Normal Form is not the only normal form for grammars. Another one covered in the textbook is Greibach normal form, defined by Sheila Greibach in 1965. Production rules in Greibach normal form must be of the form $A \to aB_1 \cdots B_m$, where $a \in \Sigma$ and $A, B_1, \dots, B_m \in V$. We will not be covering Greibach normal form.
Let $L \subseteq \Sigma^*$ be a context-free language. Then there exists a CFG $G$ in Chomsky normal form such that $L(G) = L$.
We will begin with an arbitrary CFG $H$ and perform the following algorithm to construct an equivalent CFG $G$ in CNF.
This completes the construction.
Most of the procedures described above are fairly straightforward. The only two that pose some real questions are the final two: the elimination of $\varepsilon$ and unit productions. A constructive proof that these two transformations work is given by Kozen in Chapter 21.
We also note that these transformations can be applied in different orders, but some orders create more work than others. For example, eliminating $\varepsilon$-productions (DEL) before ensuring that all productions have at most two variables on the right-hand side (BIN) can create exponentially many productions. Observe also that some rules can create unit productions, so eliminating unit productions (UNIT) is best left as the final transformation.
Consider the grammar $G$, defined by $$ S \to \mathtt ( S \mathtt ) S \mid \varepsilon $$ Here, $L(G)$ is the language of well-balanced parentheses.
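To get a feel for what the construction produces, here is one possible CNF grammar for this language (a sketch: the variable names are ours, and the exact result depends on the order in which the transformations are applied). Writing $S_0$ for a fresh start variable and $L$ and $R$ for variables generating $\mathtt ($ and $\mathtt )$ respectively, we obtain \begin{align*} S_0 &\rightarrow LA \mid \varepsilon \\ S &\rightarrow LA \\ A &\rightarrow SB \mid RS \mid \mathtt ) \\ B &\rightarrow RS \mid \mathtt ) \\ L &\rightarrow \mathtt ( \\ R &\rightarrow \mathtt ) \end{align*} Here $A$ stands for the suffix $S \mathtt ) S$ of the original production and $B$ for the suffix $\mathtt ) S$; the production $S_0 \rightarrow \varepsilon$ is retained because $\varepsilon \in L(G)$, which is the usual allowance made for the start variable.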
As you might guess from the construction, grammars in CNF can be quite large. One point of interest is that the order in which the CNF transformations are applied can affect the blowup in the number of rules. The classical presentation due to Hopcroft and Ullman that is usually taught can lead to exponentially many rules, while the order we have presented constructs an equivalent grammar in CNF of size $O(n^2)$, where $n$ is the size of the original grammar. This observation was discussed as part of a larger investigation into the complexity of the CNF procedure by Lange and Leiß in 2009.
As presented, it's not clear that there's a fast way to determine whether a string is generated by a given grammar $G$. For an arbitrary CFG, it's not clear where to even begin. If our grammar is in CNF, however, we know the exact length of any derivation of a string $w$: it uses $|w| - 1$ applications of productions of the form $A \to BC$ and $|w|$ applications of productions of the form $A \to a$, for $2|w| - 1$ steps in total. This gives us an easy but bad algorithm: check every derivation of length $2|w| - 1$ and see whether one of them derives $w$. Of course, this gives us exponentially many derivations to check.
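To make the easy but bad algorithm concrete, here is a sketch in Python that enumerates leftmost derivations breadth-first for $2|w| - 1$ steps. The grammar encoding, with productions as a dictionary from variables to right-hand-side tuples, is our own choice; we prune sentential forms longer than $|w|$, which is sound because CNF productions never shrink a string.

```python
def naive_member(prods, variables, start, w):
    """Search leftmost derivations of length at most 2|w| - 1 for one
    deriving w. Exponential in the worst case; assumes w is nonempty."""
    n = len(w)
    target = tuple(w)
    frontier = {(start,)}
    for _ in range(2 * n - 1):
        nxt = set()
        for form in frontier:
            # Find the leftmost variable in the sentential form.
            i = next((i for i, s in enumerate(form) if s in variables), None)
            if i is None:
                continue  # all terminals but not equal to w: a dead end
            for rhs in prods[form[i]]:
                new = form[:i] + rhs + form[i + 1:]
                if len(new) <= n:  # CNF derivations never shrink
                    nxt.add(new)
        frontier = nxt
        if target in frontier:
            return True
    return False

# For concreteness, the CNF grammar from the CYK example below:
# S -> BC | b, A -> BC | b, B -> DC | BB | a, C -> BA | b, D -> CA | c
prods = {"S": [("B", "C"), ("b",)], "A": [("B", "C"), ("b",)],
         "B": [("D", "C"), ("B", "B"), ("a",)],
         "C": [("B", "A"), ("b",)], "D": [("C", "A"), ("c",)]}
variables = set(prods)
```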
However, we can exploit the predictable structure of a CNF grammar to come up with a systematic way of building a derivation. We know that every symbol will have a production that generates it directly, and we know that every non-terminal production will have exactly two variables on its right-hand side. This means that every derivation of a string longer than a single symbol can be broken up into derivations of two shorter substrings.
This naturally gives us some recursive structure to work with. Suppose we have a grammar $G = (V, \Sigma, P, S)$ and a string $w$, and we would like to determine whether $w \in L(G)$. Recall that this means that $S \Rightarrow^* w$, but as usual, we'll ask the broader question of whether there exists a variable $A$ such that $A \Rightarrow^* w$. We consider this based on the length of $w$.
For each string, what we'll do is remember the set of variables $A$ such that $A \Rightarrow^* w$. Let's call this set $T(w)$. Our discussion above gives us the following recurrence. \[T(w) = \begin{cases} \{A \in V \mid A \to w \in P\} & \text{if $|w| = 1$}, \\ \bigcup\limits_{\substack{w = uv \\ u, v \neq \varepsilon}} \{A \mid A \to BC \in P \wedge B\in T(u) \wedge C \in T(v)\} & \text{if $|w| > 1$}. \end{cases}\]
We can make this more explicit, for the algorithm, by referring to $u$ and $v$ as indexed substrings of $w = a_1 \cdots a_n$. \[T(i,j) = \begin{cases} \{A \in V \mid A \to a_i \in P \} & \text{if $i=j$}, \\ \bigcup\limits_{k=i}^{j-1} \{A \mid A \to BC \in P \wedge B\in T(i,k) \wedge C \in T(k+1,j)\} & \text{if $i < j$}. \end{cases}\]
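This indexed recurrence can be transcribed almost literally into a memoized recursive function. The following Python sketch uses a grammar encoding of our own choosing: terminal productions as a dictionary from terminals to sets of variables, and binary productions as a list of triples $(A, B, C)$ standing for $A \to BC$.

```python
from functools import lru_cache

def derivers(unit, binary, w):
    """Return T(0, n-1): the set of variables deriving all of w (0-indexed)."""
    @lru_cache(maxsize=None)
    def T(i, j):
        if i == j:
            # Base case: variables A with a production A -> w[i].
            return frozenset(unit.get(w[i], ()))
        out = set()
        for k in range(i, j):  # split w[i..j] into w[i..k] and w[k+1..j]
            left, right = T(i, k), T(k + 1, j)
            out |= {A for (A, B, C) in binary if B in left and C in right}
        return frozenset(out)
    return T(0, len(w) - 1)

# Terminal and binary productions of the CYK example grammar below:
unit = {"b": {"S", "A", "C"}, "a": {"B"}, "c": {"D"}}
binary = [("S", "B", "C"), ("A", "B", "C"), ("B", "D", "C"),
          ("B", "B", "B"), ("C", "B", "A"), ("D", "C", "A")]
```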
This recurrence is a Bellman equation, so what we have is a dynamic programming algorithm. This algorithm is called the Cocke–Younger–Kasami (CYK) algorithm, named after John Cocke, Daniel Younger, and Tadao Kasami, who each proposed it independently.
\begin{algorithmic}
\PROCEDURE{cyk}{$G = (V,\Sigma,P,S), w=a_1 a_2 \cdots a_n$}
\FOR{$i$ from 1 to $n$}
\STATE $T[i,i] = \{A \in V \mid A \to a_i \in P\}$
\ENDFOR
\FOR{$m$ from 1 to $n-1$}
\FOR{$i$ from 1 to $n-m$}
\STATE $j = i+m$
\STATE $T[i,j] = \emptyset$
\FOR{$k$ from $i$ to $j-1$}
\STATE $T[i,j] = T[i,j] \cup \{A \in V \mid A \to BC \in P \wedge B \in T[i,k] \wedge C \in T[k+1,j]\}$
\ENDFOR
\ENDFOR
\ENDFOR
\RETURN $S \in T[1,n]$
\ENDPROCEDURE
\end{algorithmic}
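For completeness, here is a direct Python transcription of this pseudocode (a sketch: we shift to 0-based indexing, and the grammar encoding, with terminal productions as (variable, terminal) pairs and binary productions as $(A, B, C)$ triples standing for $A \to BC$, is our own choice).

```python
def cyk(unit, binary, start, w):
    """Bottom-up CYK: T[i][j] holds the variables deriving w[i..j] (0-indexed)."""
    n = len(w)
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):                       # substrings of length 1
        T[i][i] = {A for (A, a) in unit if a == w[i]}
    for m in range(1, n):                    # substring length minus one
        for i in range(n - m):
            j = i + m
            for k in range(i, j):            # split point
                T[i][j] |= {A for (A, B, C) in binary
                            if B in T[i][k] and C in T[k + 1][j]}
    return start in T[0][n - 1]

# The grammar from the worked example that follows:
unit = [("S", "b"), ("A", "b"), ("B", "a"), ("C", "b"), ("D", "c")]
binary = [("S", "B", "C"), ("A", "B", "C"), ("B", "D", "C"),
          ("B", "B", "B"), ("C", "B", "A"), ("D", "C", "A")]
```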
Let's walk through an example of the CYK algorithm.
Consider the following grammar $G$ in CNF. \begin{align*} S &\rightarrow BC \mid b \\ A &\rightarrow BC \mid b \\ B &\rightarrow DC \mid BB \mid a \\ C &\rightarrow BA \mid b \\ D &\rightarrow CA \mid c \end{align*}
We want to determine whether the word $cabab$ is generated by $G$. We will store our results in a table. First, we consider every substring of length 1 and the productions that could have generated each substring. We record which variables are on the left hand side of these productions.
\[
\begin{array}{c|ccccc}
  & c & a & b & a & b \\
  & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & D &   &       &   &       \\
2 &   & B &       &   &       \\
3 &   &   & S,A,C &   &       \\
4 &   &   &       & B &       \\
5 &   &   &       &   & S,A,C
\end{array}
\]
Next, we consider substrings of length 2. A variable can generate a substring of length 2 if it has a production $X \rightarrow YZ$ where $Y$ generates the first symbol and $Z$ generates the second. For example, entry $(1,2)$ is empty: the substring $ca$ would require a production of the form $X \rightarrow DB$, and there is none. However, the substring $ab$ requires a production whose right-hand side is $B$ followed by one of $S$, $A$, or $C$, and there are such productions: $S \rightarrow BC$, $A \rightarrow BC$, and $C \rightarrow BA$.
\[
\begin{array}{c|ccccc}
  & c & a & b & a & b \\
  & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & D & \emptyset &       &           &       \\
2 &   & B         & S,A,C &           &       \\
3 &   &           & S,A,C & \emptyset &       \\
4 &   &           &       & B         & S,A,C \\
5 &   &           &       &           & S,A,C
\end{array}
\]
Next time, we will continue and consider strings of length 3.