CS 360 — Lecture 7

Context-free grammars

Now that we've seen some examples of non-regular languages, we'll begin looking beyond the class of regular languages in more detail. We'll be focusing on the class of context-free languages, which are characterized by context-free grammars. You'll have seen this already in CS 241 in the context of parsing.

Like finite automata, context-free grammars were developed in a different context than computation. As you might guess, formal grammars were designed as a way to specify the grammar of natural languages formally, and were developed by Noam Chomsky in 1956. But while the modern treatment of formal grammars was developed relatively recently, the notion of a formal grammar can be traced back to the Indian subcontinent, as far back as the 4th century BCE. Here's a simple grammar for a sentence in English: \begin{align*} S & \to NP \enspace VP \\ NP & \to N \mid Art \enspace N \mid Art \enspace Adj \enspace N \\ VP & \to V \mid V \enspace NP \end{align*}
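Although grammars like this one were invented for natural language, they are mechanical enough to run. The following Python sketch enumerates every sentence the toy grammar generates; the dictionary encoding, the function name, and the tiny lexicon (dog, cat, the, big, sees) are our own illustrative additions, not part of the grammar above.

```python
# A sketch of how the toy English grammar generates sentences. The
# lexicon entries (dog, cat, the, big, sees) are illustrative additions.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["N"], ["Art", "N"], ["Art", "Adj", "N"]],
    "VP":  [["V"], ["V", "NP"]],
    "N":   [["dog"], ["cat"]],
    "Art": [["the"]],
    "Adj": [["big"]],
    "V":   [["sees"]],
}

def sentences(symbols):
    """Yield every fully expanded terminal string (as a list of words)
    reachable from the given sequence of grammar symbols."""
    if not symbols:
        yield []
        return
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:                      # variable: try each production
        for rhs in GRAMMAR[head]:
            yield from sentences(list(rhs) + list(rest))
    else:                                    # terminal: keep it and go on
        for tail in sentences(rest):
            yield [head] + tail

all_sentences = {" ".join(s) for s in sentences(["S"])}
```

With this lexicon the grammar generates 42 sentences, among them "dog sees" and "the big dog sees the cat".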

Formal definitions

As always, we will define the notion of a context-free grammar formally.

A context-free grammar (CFG) is a 4-tuple $G = (V, \Sigma, P, S)$, where
- $V$ is a finite set of variables,
- $\Sigma$ is a finite set of terminal symbols (the alphabet), disjoint from $V$,
- $P$ is a finite set of productions of the form $A \to \alpha$, where $A \in V$ and $\alpha \in (V \cup \Sigma)^*$, and
- $S \in V$ is the start variable.

Usually, it's enough when specifying the grammar to give the set of productions, since any variables and terminals that are used will appear as part of a production.

Let's take a look at an example. Let $G = (V,\Sigma,P,S)$, where $V = \{S\}$, $\Sigma = \{a,b\}$, $S$ is the start variable, and $P$ contains the following productions: \begin{align*} S &\to aSb \\ S &\to \varepsilon \end{align*}

It's not too hard to see that the grammar $G$ generates the language $\{a^k b^k \mid k \geq 0\}$. We probably already have an intuitive understanding of how grammars are supposed to generate a language: the idea is that we simply keep on applying as many rules as we like, transforming the word until we end up with a word that contains no variables. Of course, we would like to formalize such a notion.
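This rule-application intuition is easy to act out in code. Here's a small sketch (the function name is our own) that applies $S \to aSb$ a chosen number of times and then finishes with $S \to \varepsilon$, recording each intermediate string.

```python
def derive_anbn(k):
    """Derive a^k b^k from S by applying S -> aSb exactly k times and
    then S -> epsilon, returning every sentential form along the way."""
    form = "S"
    steps = [form]
    for _ in range(k):
        form = form.replace("S", "aSb", 1)   # apply S -> aSb
        steps.append(form)
    form = form.replace("S", "", 1)          # apply S -> epsilon
    steps.append(form)
    return steps
```

For example, `derive_anbn(2)` returns the derivation `['S', 'aSb', 'aaSbb', 'aabb']`.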

Let $G = (V,\Sigma,P,S)$ be a context-free grammar and let $\beta, \gamma \in (V \cup \Sigma)^*$. We say that $\beta$ yields $\gamma$, denoted $\beta \Rightarrow_G \gamma$, if there exist $x, y \in (V \cup \Sigma)^*$ and a production $A \to \alpha \in P$ such that $\beta = x A y$ and $\gamma = x \alpha y$.

We can define the $k$-step version of this relation, $\beta \Rightarrow_G^k \gamma$, which holds if there is a sequence of $k$ productions in $P$ that can be applied to transform $\beta$ into $\gamma$. We define this recursively by
- $\beta \Rightarrow_G^0 \beta$ for every $\beta \in (V \cup \Sigma)^*$, and
- $\beta \Rightarrow_G^k \gamma$ if there exists $\delta \in (V \cup \Sigma)^*$ such that $\beta \Rightarrow_G^{k-1} \delta$ and $\delta \Rightarrow_G \gamma$.

Then we define $\beta \Rightarrow_G^* \gamma$ if $\beta \Rightarrow_G^k \gamma$ for some finite $k$. Formally, for $\beta, \gamma \in (V \cup \Sigma)^*$, $\beta \Rightarrow_G^* \gamma$ if there exists a sequence of strings $\alpha_1, \alpha_2, \dots, \alpha_k \in (V \cup \Sigma)^*$ for some $k \geq 1$ such that $\beta = \alpha_1$, $\gamma = \alpha_k$, and $\alpha_{i-1} \Rightarrow_G \alpha_i$ for $2 \leq i \leq k$. Usually, we don't write $G$ if it is clear from context which grammar we're talking about.

In other words, we have $\beta \Rightarrow^* \gamma$ if we can transform $\beta$ into $\gamma$ by applying a finite number of production rules of $G$.
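The one-step yields relation is concrete enough to compute directly. Here's a sketch (names are our own) that, given a string $\beta$, returns every $\gamma$ with $\beta \Rightarrow \gamma$, under the simplifying assumption that each variable and terminal is a single character.

```python
def one_step(beta, productions, variables):
    """Compute every gamma with beta => gamma: choose one occurrence of
    a variable A in beta and one production A -> alpha, and rewrite
    that occurrence with alpha."""
    out = set()
    for i, sym in enumerate(beta):
        if sym in variables:
            for lhs, alpha in productions:
                if lhs == sym:
                    out.add(beta[:i] + alpha + beta[i+1:])
    return out

# The grammar S -> aSb | eps from earlier:
PRODUCTIONS = [("S", "aSb"), ("S", "")]
```

For this grammar, `one_step("aSb", PRODUCTIONS, {"S"})` returns `{"aaSbb", "ab"}`: the single $S$ can be rewritten with either production.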

Let $G = (V,\Sigma,P,S)$ be a context-free grammar. The language generated by $G$ is $$L(G) = \{w \in \Sigma^* \mid S \Rightarrow_G^* w\}.$$

For a context-free grammar $G = (V,\Sigma,P,S)$, if we have $w \in L(G)$ and a sequence of strings $\alpha_1, \dots, \alpha_k \in (V \cup \Sigma)^*$ such that $\alpha_1 = S$, $\alpha_k = w$, and $\alpha_i \Rightarrow_G \alpha_{i+1}$ for $1 \leq i \leq k-1$, then we say that the sequence $\alpha_1, \dots, \alpha_k$ is a derivation of $w$. This definition implies that there is at least one derivation for every word $w \in L(G)$.
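We can also compute a finite slice of $L(G)$ directly from this definition by searching over derivations. The sketch below (names are our own) does a breadth-first search over $\Rightarrow$ for the grammar $S \to aSb \mid \varepsilon$; the pruning bound relies on the fact that, in this particular grammar, every sentential form contains at most one variable.

```python
from collections import deque

# Productions of the grammar S -> aSb | eps, with the variable uppercase.
PRODUCTIONS = [("S", "aSb"), ("S", "")]

def language_up_to(max_len):
    """Collect every all-terminal word w with S =>* w and |w| <= max_len,
    by breadth-first search over the yields relation."""
    words, seen = set(), {"S"}
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        if "S" not in form:              # no variables left: a word of L(G)
            words.add(form)
            continue
        for i, sym in enumerate(form):
            for lhs, rhs in PRODUCTIONS:
                if sym == lhs:
                    nxt = form[:i] + rhs + form[i+1:]
                    # In this grammar a sentential form holds at most one
                    # variable, which can only vanish, so anything longer
                    # than max_len + 1 can never shrink to length max_len.
                    if len(nxt) <= max_len + 1 and nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return words
```

For instance, `language_up_to(4)` returns `{"", "ab", "aabb"}`, the words of $\{a^k b^k \mid k \geq 0\}$ of length at most 4.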

A language $L$ is a context-free language if there exists a context-free grammar $G$ such that $L(G) = L$.

Unlike regular languages, which we characterized with several equivalent models (DFAs, NFAs, and regular expressions), for now this grammar-based definition is our only characterization of the context-free languages. One might ask why these are called context-free languages and grammars. The name has to do with how the grammar works: for any production $A \to \alpha$ and any string $\beta A \gamma$, we can always apply the production $A \to \alpha$. The context of $A$ (whatever surrounds it) has no bearing on whether the rule can be applied; only the presence of the variable matters.

A familiar example

Now, we'll have a look at some more context-free languages. First, we'll have a look at something that might be a bit familiar. Let $G = (V,\Sigma,P,S)$, where $V = \{S,A\}$, $\Sigma = \{0,1\}$, and $P$ is given by the following productions, \begin{align*} S &\to 0A \mid 1S \\ A &\to 0A \mid 1S \mid \varepsilon \end{align*}

This grammar generates all binary strings which represent even numbers.
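We can check this claim empirically. The sketch below (names are our own) exploits the fact that every sentential form of this grammar has exactly one variable, sitting at its right end, and enumerates the generated words up to a length bound; they are exactly the binary strings ending in 0.

```python
# The grammar S -> 0A | 1S, A -> 0A | 1S | eps, encoded as a dictionary.
PRODS = {"S": ["0A", "1S"], "A": ["0A", "1S", ""]}

def generated(max_len):
    """Enumerate the words of length <= max_len that the grammar generates.
    Every sentential form here has exactly one variable, at the right end,
    so we expand that variable directly."""
    words, frontier = set(), {"S"}
    while frontier:
        new = set()
        for form in frontier:
            var = form[-1]                        # the single variable
            for rhs in PRODS[var]:
                nxt = form[:-1] + rhs
                if not nxt or nxt[-1] not in PRODS:   # all terminals: a word
                    if len(nxt) <= max_len:
                        words.add(nxt)
                elif len(nxt) <= max_len + 1:         # still has a variable
                    new.add(nxt)
        frontier = new
    return words
```

Every word it produces ends in 0 and hence names an even number in binary.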

A context-free grammar $G = (V,\Sigma,P,S)$ is a (right) regular grammar if every production in $P$ is of one of the forms
- $A \to aB$,
- $A \to a$, or
- $A \to \varepsilon$,

where $A, B \in V$ and $a \in \Sigma$.

In other words, each right-hand side is restricted to at most a single terminal symbol and at most a single variable, and the variable, if present, must appear at the right end of the right-hand side.

As you might guess from the name, these grammars are special. We can show that regular grammars generate exactly the class of regular languages. This was shown by Chomsky in 1959, among many other fundamental results on grammars. As a quick informal proof sketch: the variables correspond to the states of a finite automaton, a production $A \to aB$ corresponds to a transition from $A$ to $B$ on $a$, and a production $A \to \varepsilon$ marks $A$ as an accepting state.
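This correspondence can be made concrete. The sketch below (function names are our own) converts a right regular grammar whose productions all have the form $A \to aB$ or $A \to \varepsilon$, like the even-numbers grammar above, into an NFA and runs it.

```python
# Variables become NFA states; A -> aB becomes a transition A --a--> B;
# A -> eps marks A as an accepting state. This sketch assumes every
# production has one of those two forms.
PRODS = {"S": ["0A", "1S"], "A": ["0A", "1S", ""]}

def make_nfa(prods):
    delta, accept = {}, set()         # (state, symbol) -> set of states
    for var, rhss in prods.items():
        for rhs in rhss:
            if rhs == "":
                accept.add(var)                    # A -> eps
            else:
                sym, nxt = rhs[0], rhs[1]          # A -> aB
                delta.setdefault((var, sym), set()).add(nxt)
    return delta, accept

def accepts(delta, accept, start, w):
    """Run the NFA on w, tracking the set of reachable states."""
    states = {start}
    for c in w:
        states = set().union(*(delta.get((q, c), set()) for q in states))
    return bool(states & accept)
```

Running it, `accepts(delta, accept, "S", "110")` is true while `accepts(delta, accept, "S", "11")` is false, matching the grammar's language of binary strings ending in 0.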

It's important to note that this doesn't mean that a non-regular grammar can't generate a regular language. Here's an example: \begin{align*} S & \to AB \\ A & \to 0A \mid \varepsilon \\ B & \to B1 \mid \varepsilon \end{align*}

This grammar generates the language $0^* 1^*$. So we can conclude that while a regular grammar must generate a regular language, regular languages don't have to be generated by regular grammars.
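As a sanity check on this claim, we can enumerate the words this grammar generates up to a length bound and confirm they all have the form $0^i 1^j$. A sketch, with names of our own choosing:

```python
from collections import deque

# The grammar S -> AB, A -> 0A | eps, B -> B1 | eps.
PRODS = [("S", "AB"), ("A", "0A"), ("A", ""), ("B", "B1"), ("B", "")]
VARS = {"S", "A", "B"}

def words_up_to(max_len):
    """All words of length <= max_len generated by the grammar, found by
    breadth-first search over the yields relation."""
    words, seen = set(), {"S"}
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        if not any(c in VARS for c in form):
            words.add(form)
            continue
        for i, c in enumerate(form):
            for lhs, rhs in PRODS:
                if c == lhs:
                    nxt = form[:i] + rhs + form[i+1:]
                    # a form carries at most two variables, each of which
                    # can still vanish, so allow a little slack in length
                    if len(nxt) <= max_len + 3 and nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return {w for w in words if len(w) <= max_len}
```

Every word returned consists of a (possibly empty) block of 0s followed by a (possibly empty) block of 1s, as the claim states.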

Right regular grammars are also called right linear grammars. One can show that left linear grammars, which restrict the single variable to the left end of the right-hand side, also generate exactly the regular languages. However, if we lift the restriction on the placement of the single variable in our productions, we get linear grammars, which can generate non-regular languages, as we'll see in the following example.

A non-regular example

We'll return to our example from above, the grammar $G$, defined as follows, $$S \to aSb \mid \varepsilon.$$ We claimed that this grammar generates the language $L = \{a^k b^k \mid k \geq 0\}$. So now, we'll prove it.

Proposition. $L(G) = L$.

Proof. First, we show that $L \subseteq L(G)$. Consider a word $x \in L$; we will show that $x$ can be generated by $G$ by induction on $|x|$.

We begin with the base case, $|x| = 0$, which means that $x = \varepsilon$. Since $S \to \varepsilon$ is a production in $G$, we have $S \Rightarrow x$, and so $x \in L(G)$.

Now for our inductive step, we consider $|x| \geq 1$ and thus $x \neq \varepsilon$. For our induction hypothesis, we suppose that for any $w \in L$ with $|w| < |x|$, we have $w \in L(G)$. That is, $S \Rightarrow^* w$.

Since $x \in L$ and $|x| > 0$, we have $x = a^n b^n$ for some $n \geq 1$. Let $w = a^{n-1} b^{n-1}$; then $w \in L$. Since $|w| < |x|$, we have $w \in L(G)$ by our induction hypothesis, and thus there is a derivation $S \Rightarrow^* w$. Then we can use this to derive $x$ by $$S \Rightarrow aSb \Rightarrow^* awb = x.$$ Thus, we have $x \in L(G)$ and $L \subseteq L(G)$ as desired.

Now, we'll show $L(G) \subseteq L$. Let $x \in L(G)$. We will show that $x \in L$ by induction on $k$, the number of steps in a derivation of $x$.

For our base case, we have $k = 1$. The only production whose right-hand side contains no variables is $S \to \varepsilon$, so the only word with a one-step derivation is $x = \varepsilon$. Since $\varepsilon \in L$, the base case holds.

Now we consider $k \geq 2$ and assume as our induction hypothesis that every word of $L(G)$ with a derivation of fewer than $k$ steps is in $L$. Since $k \geq 2$, the first step in the derivation of $x$ must apply $S \to aSb$, so the derivation has the form $S \Rightarrow aSb \Rightarrow^* awb = x$, where $S \Rightarrow^* w$ in $k-1$ steps. Then $w \in L(G)$ has a derivation of fewer than $k$ steps, so $w \in L$ by our induction hypothesis, and thus $w = a^n b^n$ for some $n \geq 0$. Then $x = awb = a(a^n b^n)b = a^{n+1} b^{n+1} \in L$, and $L(G) \subseteq L$ as desired.

Thus, $L = L(G)$. $$\tag*{$\Box$}$$

Balanced words

Let $\Sigma = \{0,1\}$. We say that a word $w \in \Sigma^*$ is balanced if $|w|_0 = |w|_1$. Let $L = \{w \in \Sigma^* \mid |w|_0 = |w|_1 \}$ be the language of binary balanced words. We claim that the grammar $G$, defined by $$S \to 0S1S \mid 1S0S \mid \varepsilon$$ generates $L$.
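As an empirical check of this claim (not a proof), we can enumerate the words the grammar generates up to a length bound and compare them against the balanced words. The sketch below (names are our own) expands only the leftmost variable, which is safe because every generated word has a leftmost derivation.

```python
from collections import deque

# The grammar S -> 0S1S | 1S0S | eps; right-hand sides only.
RHSS = ["0S1S", "1S0S", ""]

def balanced_generated(max_len):
    """Enumerate the words of length <= max_len generated by the grammar,
    expanding only the leftmost S in each sentential form."""
    words, seen = set(), {"S"}
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        if "S" not in form:
            words.add(form)
            continue
        i = form.index("S")                  # leftmost variable
        for rhs in RHSS:
            nxt = form[:i] + rhs + form[i+1:]
            # bound the number of *terminals*; the S's can still vanish
            if len(nxt) - nxt.count("S") <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return words
```

Up to length 4, for instance, the grammar produces exactly the nine balanced words $\varepsilon$, 01, 10, 0011, 0101, 0110, 1001, 1010, and 1100.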