One of the problems with having so much flexibility in the production rules and how they may be applied is that it makes grammars difficult to reason about. Much of the time, when we run into situations like these, we want to find some standard form whose structure we can rely on. For CFGs, this idea is captured by the Chomsky normal form. Unsurprisingly, this normal form and many of its properties are due to Chomsky (1959). That this one is named after Chomsky also suggests that there are other normal forms (e.g., the Greibach and Kuroda normal forms).
Definition 15.1. A context-free grammar $G$ is in Chomsky normal form (CNF) if every production of $G$ has one of the following forms:

- $A \to BC$, where $A$ is a variable and $B$ and $C$ are variables other than the start variable,
- $A \to a$, where $A$ is a variable and $a$ is a terminal,
- $S \to \varepsilon$, where $S$ is the start variable.
There are a few things to note about grammars in CNF. The first is the very useful property that every parse tree for a non-empty string $w \in L(G)$ has exactly $2|w| - 1$ nodes labelled by variables and exactly $|w|$ leaf nodes. In other words, every derivation of a non-empty string $w$ via a grammar in CNF has exactly $2|w| - 1$ substitutions.
Theorem 15.2. Let $G$ be a context-free grammar in Chomsky normal form. Then for any non-empty word $w \in L(G)$, a derivation of $w$ in $G$ has $2|w| - 1$ steps.
Proof. We will show something more general: for any variable $A$ in $G$ and any non-empty word $w$ with $A \Rightarrow^* w$, the derivation has $2|w| - 1$ steps. We proceed by induction on $|w|$. For the base case, $|w| = 1$, the derivation must consist of the single step $A \to w$: a first step of the form $A \to BC$ would lead to a word of length at least two, since neither $B$ nor $C$ is the start variable and so neither has an $\varepsilon$-rule. Thus we have a derivation of length 1, and $1 = 2 \cdot 1 - 1$, so the base case holds.
Now we consider $|w| > 1$. Our induction hypothesis is that for all words $w'$ with $A \Rightarrow^* w'$ and $|w'| < |w|$, the derivation of $w'$ has length $2|w'| - 1$. Since $|w| > 1$, the first step of the derivation must apply a rule of the form $A \to BC$. This means we can write $w = w_1 w_2$ with $B \Rightarrow^* w_1$ and $C \Rightarrow^* w_2$, where $w_1$ and $w_2$ are both non-empty. Since both $w_1$ and $w_2$ are shorter than $w$, the induction hypothesis gives them derivations of lengths $2|w_1| - 1$ and $2|w_2| - 1$. Then a derivation of $w$ has length $$(2|w_1| - 1) + (2|w_2| - 1) + 1 = 2(|w_1|+|w_2|) - 1 = 2|w| - 1.$$ $$\tag*{$\Box$}$$
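As a quick sanity check, take the (hypothetical) toy CNF grammar with productions $S \to AB$, $A \to a$, and $B \to b$, which generates only the word $ab$. Its derivation $$S \Rightarrow AB \Rightarrow aB \Rightarrow ab$$ has $3 = 2 \cdot 2 - 1$ steps, exactly as the theorem predicts.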
The second fact is that every context-free language has a grammar in CNF.
Theorem 15.3. Let $L \subseteq \Sigma^*$ be a context-free language. Then there exists a CFG $G$ in Chomsky normal form such that $L(G) = L$.
Proof. We will begin with an arbitrary CFG $H$ and perform the following algorithm to construct an equivalent CFG $G$ in CNF.

1. Add a new start variable $S_0$ together with the production $S_0 \to S$, where $S$ is the start variable of $H$. This ensures that the start variable never appears on the right-hand side of a production.
2. Eliminate $\varepsilon$-productions: remove each production $A \to \varepsilon$ where $A$ is not the start variable, and for every production with $A$ on its right-hand side, add a copy of that production for every way of deleting some of the occurrences of $A$.
3. Eliminate unit productions: remove each production of the form $A \to B$, and for every production $B \to \alpha$, add the production $A \to \alpha$ (unless it is a unit production that was already removed).
4. For each terminal $a$ appearing in a production whose right-hand side has length at least two, add a new variable $A_a$ with the production $A_a \to a$, and replace $a$ by $A_a$ in those productions.
5. Replace each production $A \to B_1 B_2 \cdots B_k$ with $k \geq 3$ by the productions $A \to B_1 U_1, U_1 \to B_2 U_2, \ldots, U_{k-2} \to B_{k-1} B_k$, where $U_1, \ldots, U_{k-2}$ are new variables.

One can check that each step preserves the language generated by the grammar.
This completes the construction. $$\tag*{$\Box$}$$
So what does this algorithm look like in action? Let's try it with an example.
Example 15.4. Consider the grammar $G$, defined by $$ S \to ( S ) S \mid \varepsilon $$ Here, $L(G)$ is the language of well-balanced parentheses, the Dyck language, again but with a slightly different grammar than we saw before. Applying the first step, adding a new start variable, we obtain \begin{align*} S_0 &\to S \\ S &\to (S)S \mid \varepsilon \end{align*} Then, eliminating the $\varepsilon$-production, we have \begin{align*} S_0 &\to S \mid \varepsilon \\ S &\to (S)S \mid () S \mid (S) \mid () \end{align*} Then, we eliminate the unit production $S_0 \to S$ to obtain \begin{align*} S_0 &\to (S)S \mid () S \mid (S) \mid () \mid \varepsilon \\ S &\to (S)S \mid () S \mid (S) \mid () \end{align*} Then, we introduce a variable for each terminal, \begin{align*} S_0 &\to A_LSA_RS \mid A_LA_R S \mid A_LSA_R \mid A_LA_R \mid \varepsilon \\ S &\to A_LSA_RS \mid A_LA_R S \mid A_LSA_R \mid A_LA_R \\ A_L &\to ( \\ A_R &\to ) \end{align*} The final step is to break up every longer production into productions of the form $A \to BC$ by introducing new variables: \begin{align*} S_0 &\to A_L U \mid A_L V \mid A_L W \mid A_LA_R \mid \varepsilon \\ S &\to A_L U \mid A_L V \mid A_L W \mid A_LA_R \\ U &\to S V \\ V &\to A_R S \\ W &\to S A_R \\ A_L &\to ( \\ A_R &\to ) \end{align*} The resulting grammar is in Chomsky normal form.
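The last step of the construction, breaking long right-hand sides into chains of binary productions, is mechanical enough to sketch in code. The following is a minimal sketch, not the construction from any particular textbook; the dictionary representation and the fresh-variable names $U_1, U_2, \ldots$ are assumptions made for illustration.

```python
def binarize(productions):
    """Break each production A -> X1 X2 ... Xk (k > 2) into a chain of
    binary productions A -> X1 U1, U1 -> X2 U2, ..., introducing fresh
    variables.  `productions` maps a variable to a list of right-hand
    sides, each given as a list of symbols."""
    result = {}
    count = 0
    for head, bodies in productions.items():
        result.setdefault(head, [])
        for body in bodies:
            current, rest = head, list(body)
            while len(rest) > 2:
                count += 1
                fresh = f"U{count}"
                result.setdefault(current, []).append([rest[0], fresh])
                current, rest = fresh, rest[1:]
            result.setdefault(current, []).append(rest)
    return result

# S -> A L R S becomes S -> A U1, U1 -> L U2, U2 -> R S.
g = binarize({'S': [['A', 'L', 'R', 'S'], ['a']]})
```

Every right-hand side in the output has length at most two, and expanding the fresh variables recovers the original productions.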
A few things to note. As you might guess from the construction, grammars in CNF can be quite large: starting from a CFG of size $n$, the equivalent grammar in CNF can have size as large as $O(n^2)$. This means that while CNF grammars have some nice properties for proofs, they don't seem so great for implementation. Secondly, grammars in CNF can be ambiguous. Since inherently ambiguous context-free languages exist, any grammar in CNF that generates such a language must itself be ambiguous.
As presented, it's not clear that there's a fast way to parse a string with a given grammar $G$. If we have a word $w$ and we want to know whether it's generated by a grammar $G$ in CNF, then we know that such a word will have a derivation of length exactly $2|w| - 1$. This gives us an easy but bad algorithm: check every derivation of length $2|w| - 1$ and see whether any of them yields $w$. Of course, this gives us exponentially many derivations to check.
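To make this brute-force idea concrete, here is a minimal sketch, assuming a hypothetical grammar representation (a dictionary mapping each variable to a set of right-hand-side tuples). It explores all leftmost derivations for exactly $2|w| - 1$ steps, pruning sentential forms longer than $|w|$; both restrictions are safe for CNF grammars but do not rescue the algorithm from its exponential blow-up.

```python
def brute_force_member(grammar, start, w):
    """Check membership of a non-empty word w by exploring all leftmost
    derivations of length 2|w| - 1, which is exactly the length a CNF
    grammar requires."""
    target = tuple(w)
    frontier = {(start,)}
    for _ in range(2 * len(w) - 1):
        next_frontier = set()
        for form in frontier:
            for idx, sym in enumerate(form):
                if sym in grammar:  # expand the leftmost variable only
                    for body in grammar[sym]:
                        new_form = form[:idx] + body + form[idx + 1:]
                        if len(new_form) <= len(w):  # CNF never shrinks a form
                            next_frontier.add(new_form)
                    break
        frontier = next_frontier
    return target in frontier

# Hypothetical CNF grammar for { a^n b^n : n >= 1 }.
G = {
    'S': {('A', 'B'), ('A', 'X')},
    'X': {('S', 'B')},
    'A': {('a',)},
    'B': {('b',)},
}
```

Even with the pruning, the frontier of sentential forms can grow exponentially with $|w|$ for less forgiving grammars.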
However, we can exploit the predictable structure of a CNF grammar by considering each substring. For instance, we know that every terminal symbol is generated directly by some production of the form $A \to a$, and that every other production has exactly two variables on its right-hand side.
This lends itself to a dynamic programming algorithm. Such an algorithm was proposed independently by three different people, which gives it its name: the Cocke-Younger-Kasami (CYK) algorithm.
Theorem 15.5 (Cocke-Younger-Kasami). Given a CFG $G$ in Chomsky Normal Form and string $w$ of length $n$, we can determine whether $w \in L(G)$ in $O(n^3 |G|)$ time.
We'll walk through an example.
Example 15.6. Consider the following grammar $G$ in CNF. \begin{align*} S &\rightarrow AB \mid b \\ A &\rightarrow CB \mid AA \mid a \\ B &\rightarrow AS \mid b \\ C &\rightarrow BS \mid c \end{align*}
We want to determine whether the word $cabab$ is generated by $G$. We will store our results in a table. First, we consider every substring of length 1 and the productions that could have generated each substring. We record which variables are on the left hand side of these productions.
|   | 1: $c$ | 2: $a$ | 3: $b$ | 4: $a$ | 5: $b$ |
|---|--------|--------|--------|--------|--------|
| 1 | $C$    |        |        |        |        |
| 2 |        | $A$    |        |        |        |
| 3 |        |        | $S,B$  |        |        |
| 4 |        |        |        | $A$    |        |
| 5 |        |        |        |        | $S,B$  |
Next, we consider substrings of length 2. A variable can generate a substring of length 2 if it has a production leading to two variables that each generate one of the two symbols. For example, entry $1,2$ has no variables that can generate it, because there is no production of the form $X \rightarrow CA$. However, the substring $ab$ can be generated by any production whose right-hand side is $AB$ or $AS$, and there are two such productions: $S \rightarrow AB$ and $B \rightarrow AS$.
|   | 1: $c$ | 2: $a$ | 3: $b$ | 4: $a$ | 5: $b$ |
|---|--------|--------|--------|--------|--------|
| 1 | $C$    | $\emptyset$ |   |        |        |
| 2 |        | $A$    | $S,B$  |        |        |
| 3 |        |        | $S,B$  | $\emptyset$ |   |
| 4 |        |        |        | $A$    | $S,B$  |
| 5 |        |        |        |        | $S,B$  |
Next, we consider strings of length 3. Here, we have to do a bit more work: a string of length 3, say $xyz$, can be split either as $xy$ and $z$ or as $x$ and $yz$, and we must check both possibilities. For example, the string $cab$ can be split into $ca$ and $b$ or into $c$ and $ab$. In the first case, $ca$ is generated by no variable, but the second case gives us a valid derivation: $ab$ can be generated by $S$ or $B$, so via the production $A \rightarrow CB$ we have $A \Rightarrow CB \Rightarrow^* cab$.
|   | 1: $c$ | 2: $a$ | 3: $b$ | 4: $a$ | 5: $b$ |
|---|--------|--------|--------|--------|--------|
| 1 | $C$    | $\emptyset$ | $A$ |       |        |
| 2 |        | $A$    | $S,B$  | $\emptyset$ |   |
| 3 |        |        | $S,B$  | $\emptyset$ | $C$ |
| 4 |        |        |        | $A$    | $S,B$  |
| 5 |        |        |        |        | $S,B$  |
We continue with strings of length 4. Again, we have to consider all of the ways to split up a string of length 4 into two substrings.
|   | 1: $c$ | 2: $a$ | 3: $b$ | 4: $a$ | 5: $b$ |
|---|--------|--------|--------|--------|--------|
| 1 | $C$    | $\emptyset$ | $A$ | $A$   |        |
| 2 |        | $A$    | $S,B$  | $\emptyset$ | $C$ |
| 3 |        |        | $S,B$  | $\emptyset$ | $C$ |
| 4 |        |        |        | $A$    | $S,B$  |
| 5 |        |        |        |        | $S,B$  |
Finally, we finish with the string of length 5, our original string $w$. We have constructed a table that tells us which variables can generate each substring of $w$. To determine whether $w \in L(G)$, we simply check whether the entry corresponding to $w$ itself (i.e., $w[1..5]$) contains the start variable $S$. We see that it does, so $w \in L(G)$.
|   | 1: $c$ | 2: $a$ | 3: $b$ | 4: $a$ | 5: $b$ |
|---|--------|--------|--------|--------|--------|
| 1 | $C$    | $\emptyset$ | $A$ | $A$   | $S,B$  |
| 2 |        | $A$    | $S,B$  | $\emptyset$ | $C$ |
| 3 |        |        | $S,B$  | $\emptyset$ | $C$ |
| 4 |        |        |        | $A$    | $S,B$  |
| 5 |        |        |        |        | $S,B$  |
This algorithm takes $O(n^3)$ time since it has three nested loops: for each substring length $\ell$ from 1 to $n$, we consider each substring of length $\ell$ (of which there are $O(n)$) and all of the ways to split that substring into two (again, $O(n)$). This puts us in polynomial time, which is much better than our "try every derivation" approach. Of course, there are still some issues with this.
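The three nested loops translate almost line for line into a short dynamic-programming routine. Here is a minimal sketch for non-empty input words; the dictionary representation of the grammar (terminal rules as 1-tuples, binary rules as 2-tuples) is an assumption made for illustration.

```python
def cyk(grammar, start, w):
    """Decide whether the non-empty word w is in L(G) for a CNF grammar.
    `grammar` maps each variable to a set of right-hand sides:
    ('a',) for a terminal rule, ('B', 'C') for a binary rule."""
    n = len(w)
    # table[i][j] holds the variables generating the substring w[i..j]
    # (0-indexed, inclusive).
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):  # substrings of length 1
        for head, bodies in grammar.items():
            if (w[i],) in bodies:
                table[i][i].add(head)
    for length in range(2, n + 1):          # substring length
        for i in range(n - length + 1):     # start position
            j = i + length - 1
            for k in range(i, j):           # split into w[i..k] and w[k+1..j]
                for head, bodies in grammar.items():
                    for body in bodies:
                        if (len(body) == 2 and body[0] in table[i][k]
                                and body[1] in table[k + 1][j]):
                            table[i][j].add(head)
    return start in table[0][n - 1]

# The grammar from Example 15.6.
G = {
    'S': {('A', 'B'), ('b',)},
    'A': {('C', 'B'), ('A', 'A'), ('a',)},
    'B': {('A', 'S'), ('b',)},
    'C': {('B', 'S'), ('c',)},
}
```

Running `cyk(G, 'S', 'cabab')` fills in exactly the table from the example above and returns `True`.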
First, as a side note, Leslie Valiant showed in 1975 that computing the CYK table can be reduced to matrix multiplication. At the time most current formal languages books were printed, the best known matrix multiplication algorithm ran in roughly $O(n^{2.376})$ time, due to Coppersmith and Winograd in 1990. This stood for about 20 years, until a series of results by Stothers, Vassilevska Williams, and Le Gall around 2010 and after improved the exponent slightly, to about $O(n^{2.373})$. These algorithms are not actually practical, but Valiant's reduction does give a better theoretical upper bound for parsing. The upshot is that we don't really know whether parsing can be done faster than $O(n^2)$.
The more immediate problem is that $O(n^3)$ is still quite expensive, if you think about a typical large software codebase, the amount of time it takes to build a project, and how mad everyone gets when you break the build. Ideally, we would like something that runs in linear time; in other words, something that can essentially read the input once and parse it correctly without having to go back over it. This can't work for all context-free languages, in part because some of them are inherently ambiguous, so instead attention is restricted to various subclasses of context-free languages.