CMSC 28000 — Lecture 16

One kind of problem you will likely be asked to solve is of the form “here’s a language, give us a grammar for it”. Here is an example of what a proof of this looks like.

Let $G$ be the context-free grammar \[S \to aSb \mid \varepsilon.\] Then $L(G) = \{a^k b^k \mid k \geq 0\}$.

Let $L = \{a^k b^k \mid k \geq 0\}$. First, we will show that $L \subseteq L(G)$. Let $w \in L$. We will show that $w$ can be generated by $G$ by induction on $|w|$.

The base case is $|w| = 0$, so $w = \varepsilon$. Then since $S \to \varepsilon$ is a production in $G$, we have $S \Rightarrow w$ and therefore $w \in L(G)$.

Now consider $|w| \geq 1$ and thus $w \neq \varepsilon$. For our inductive hypothesis, we assume that for any $x \in L$ with $|x| < |w|$, we have $x \in L(G)$. That is, $S \Rightarrow^* x$.

Since $w \in L$ and $|w| > 0$, we have $w = a^n b^n$ for some $n \geq 1$. We can write $w = a (a^{n-1} b^{n-1}) b$. Let $x = a^{n-1} b^{n-1}$ and we note that $x \in L$. Since $|x| < |w|$, we have $x \in L(G)$ by our induction hypothesis and thus there is a derivation $S \Rightarrow^* x$. Then we can use this to construct a derivation for $w$ by $$S \Rightarrow aSb \Rightarrow^* axb = a(a^{n-1} b^{n-1})b = a^n b^n = w.$$ Thus, we have $w \in L(G)$ and $L \subseteq L(G)$ as desired.

Now, we will show that $L(G) \subseteq L$. Let $w \in L(G)$. Then there exists a derivation $S \Rightarrow^* w$. We will show $w \in L$ by induction on $k$, the number of steps in the derivation of $w$.

For our base case, we have $k = 1$. The only derivation of length one is by applying the rule $S \to \varepsilon$, so $w = \varepsilon$. Since $\varepsilon = a^0 b^0 \in L$, the base case holds.

Now consider $k \geq 2$ and assume that every word $x \in L(G)$ with fewer than $k$ steps in its derivation is in $L$. Since $k \geq 2$, the first step in the derivation of $w$ must be $S \Rightarrow aSb$. So our derivation of $w$ is $S \Rightarrow aSb \Rightarrow^* axb = w$, where $x$ is a string generated by $S$. In other words, $S \Rightarrow^* x$.

Since $S \Rightarrow^* w$ is a derivation of length $k$, we have that $S \Rightarrow^* x$ is of length $k-1$. Therefore, $x \in L$ by our inductive hypothesis. So $x = a^j b^j$ for some $j \in \mathbb N$. Then we have \[S \Rightarrow aSb \Rightarrow axb = a(a^j b^j)b = a^{j+1} b^{j+1} = w.\] Since $j+1 \in \mathbb N$, we have that $w \in L$.

Thus, $L = L(G)$.

Notice that the structure of this proof mirrors our proofs for DFAs. In one direction, we start with a string $w$ from $L$ and try to construct a derivation based on $w$. In the other direction, we begin with a string $w$ generated by $G$ and we try to show that it has the property that it belongs to $L$. Kozen Chapter 20 gives another example of this kind of proof involving a grammar that generates strings of balanced brackets.

Though we will not ask for formal proofs of correctness for “give a grammar” type problems, the general structure will be important for other kinds of proofs involving grammars. In particular, you will need to be comfortable with proofs involving derivations.

Chomsky Normal Form

One challenging aspect of reasoning about grammars is that the only constraint on production rules $A \to \alpha$ is only that a single variable can be on the left hand side, while $\alpha$ can be any finite string over $V \cup \Sigma$. Compounding the issue is that productions can be applied nondeterministically. Thinking back to finite automata or derivatives, we were able clearly say something about the result of reading a particular word. Here, there seems to be many different ways of getting to the same word.

When we run into situations like these, we typically want to find some sort of standard form which has some structure that we can rely on. For CFGs, this idea is captured in the Chomsky normal form. Unsurprisingly, this normal form and many of its properties are due to Chomsky (1959).

Ironically, Chomsky normal form has a number of variants. The one here allows for $\varepsilon$ to be generated by the grammar. Kozen’s presentation disallows $\varepsilon$ from being in the language. This is not a huge issue, but it does mean that the rules can be slightly different across presentations.

Chomsky normal form seems a bit arbitrary, but think about the parse tree that results from a grammar in CNF. If every variable corresponds to an internal node of the tree and every terminal corresponds to a leaf, then grammars in CNF will give us parse trees that are binary trees. From this point of view, CNF gives us very clear base and inductive cases: productions of the form $A \to a$ correspond to base cases and productions of the form $A \to BC$ are our inductive cases.

One very useful property of CNF grammars is that every derivation of a non-empty string $w$ via a grammar in CNF has exactly $2|w| - 1$ steps.

Let $G$ be a context-free grammar in Chomsky normal form. Then for any non-empty word $w \in L(G)$, a derivation of $w$ in $G$ has $2|w| - 1$ steps.

We will show something more general: that for any variable $A$ in $G$, a derivation $A \Rightarrow^* w$ has $2|w| - 1$ steps. We will show this by induction on $|w|$.

For the base case, we have $|w| = 1$. Then the only step in the derivation is $A \Rightarrow w$, for some production $A \to w$. Otherwise, we must apply a production rule of the form $A \to BC$, but this leads to a word of length at least two since there are no rules of the form $B \to \varepsilon$. Thus, we have a derivation of length 1 and $1 = 2 \cdot 1 - 1 = 2|w| - 1$ so the base case holds.

Now consider $|w| > 1$. Our induction hypothesis is that for all words $x$ with derivation $A \Rightarrow^* x$ and $|x| < |w|$, the derivation of $x$ has length $2|x| - 1$. Since $|w| > 1$, the first step of the derivation must be via a production $A \to BC$. So we must have a derivation \[A \Rightarrow BC \Rightarrow^* uC \Rightarrow^* uv = w,\] for some nonempty strings $u$ and $v$ where $B \Rightarrow^* u$ and $C \Rightarrow^* v$. As there are no $\varepsilon$-rules, both $u$ and $v$ are shorter than $w$ and by the induction hypothesis, they have derivations of length $2|u| - 1$ and $2|v| - 1$. Then a derivation of $w$ has length \[2|u| - 1 + 2|v| - 1 + 1 = 2(|u|+|v|) - 1 = 2|w| - 1.\]

The fact that this is described as a normal form hints at the fact that every context-free grammar has an equivalent grammar in CNF, and therefore, every context-free language has a grammar in CNF.

Let $L \subseteq \Sigma^*$ be a context-free language. Then there exists a CFG $G$ in Chomsky normal form such that $L(G) = L$.

We will begin with an arbitrary CFG $H$ and perform the following algorithm to construct an equivalent CFG $G$ in CNF.

START: Add a new start variable $S_0$ and production $S_0 \to S$. This allows us to guarantee that the starting variable of $G$ never occurs on the right hand side of a production.
TERM: Add variables $A_a$ for each terminal $a \in \Sigma$ and add the production $A_a \to a$.
BIN: Split up productions of the form $A \to B_1 \cdots B_m$. We do this by replacing these productions with productions $A \to B_1 A_1$, $A_1 \to B_2 A_2, \dots, A_{m-2} \to B_{m-1} B_m$, where $A_i$ for $1 \leq i \leq m-2$ are new variables.
DEL: Eliminate productions of the form $A \to \varepsilon$ for $A \neq S_0$. For each occurrence of $A$ in a production, we add a new production rule with $A$ removed.
UNIT: Eliminate productions of the form $A \to B$. Then for all productions $B \to \alpha$, we add the production $A \to \alpha$.

This completes the construction.

Most of the procedures described above are fairly straightforward. The only two that pose some real questions are the final two: the elimination of $\varepsilon$ and unit productions. A constructive proof that these two transformations work is given by Kozen in Chapter 21.

Consider the grammar $G$, defined by $$ S \to \mathtt ( S \mathtt ) S \mid \varepsilon $$ Here, $L(G)$ is the language of well-balanced parentheses.

Applying the first step (START), adding a new start variable, we obtain \begin{align*} S_0 &\to S \\ S &\to \mathtt (S\mathtt )S \mid \varepsilon \end{align*}
Next, we create replace terminals with new productions (TERM). \begin{align*} S_0 &\to S \\ S &\to ASBS \mid \varepsilon \\ A & \to \mathtt ( \\ B & \to \mathtt ) \end{align*}
Then, we split up productions with more than two variables on the right hand side (BIN). \begin{align*} S_0 &\to S \\ S &\to AC_1 \mid \varepsilon \\ C_1 &\to SC_2 \\ C_2 &\to BS \\ A & \to \mathtt ( \\ B & \to \mathtt ) \end{align*}
Now, we remove $\varepsilon$-rules (DEL). \begin{align*} S_0 &\to S \mid \varepsilon \\ S &\to AC_1 \\ C_1 &\to SC_2 \mid C_2 \\ C_2 &\to BS \mid B \\ A & \to \mathtt ( \\ B & \to \mathtt ) \end{align*}
Finally, we remove unit productions (UNIT). \begin{align*} S_0 &\to AC_1 \mid \varepsilon \\ S &\to AC_1 \\ C_1 &\to SC_2 \mid BS \mid \mathtt ) \\ C_2 &\to BS \mid \mathtt ) \\ A & \to \mathtt ( \\ B & \to \mathtt ) \end{align*}

As you might guess from the construction, grammars in CNF can be quite large. One point of interest is that the order in which the CNF transformations are applied can affect the blowup of the number of rules. For example, eliminating $\varepsilon$-productions (DEL) before ensuring that all productions have at most two variables on the right hand side (BIN) can create exponentially many productions. Observe also that some rules can create unit productions, so eliminating unit productions (UNIT) is best left as the final transformation.

It turns out that the classical presentation due to Hopcroft and Ullman that is usually taught can lead to exponentially many rules. The current order which we’ve presented now constructs an equivalent grammar in CNF as large as $O(|G|^2)$. Here, $|G|$, the size of the grammar $G$ is taken to be the sum of the length of all productions in $G$, where the length of a production $A \to \alpha$ is $|A\alpha|$. This observation was discussed as part of a larger investigation into the complexity of the CNF procedure by Lange and Leiß in 2009.