At last, we're ready to step beyond the comfortable world of regular languages and finite automata. Despite having sold this course to you as a course about computation and machines, we're going to be taking a different tack at first and saving the machine model for later. Instead, we'll be looking at grammars.
Like finite automata, grammars were developed in a context other than computation. As you might guess, formal grammars were designed as a way to formally specify the syntax of natural languages, and were developed by Noam Chomsky in 1956 as phrase-structure grammars. But while the modern treatment of formal grammars is relatively recent, the notion of a formal grammar can be traced back to the Indian subcontinent, as far back as the 4th century BCE.
Here's a simple grammar for English taken from Jurafsky and Martin (2020): \begin{align*} S & \to NP \enspace VP \\ NP & \to Pronoun \mid ProperNoun \mid Det \enspace Nominal \\ Nominal & \to Nominal \enspace Noun \mid Noun \\ VP & \to Verb \mid Verb \enspace NP \mid Verb \enspace NP \enspace PP \mid Verb \enspace PP \\ PP &\to Preposition \enspace NP \end{align*}
However, despite their origins as a tool to define the grammatical structure of natural language, where they're most commonly found now is in defining the grammatical structure of programming languages. For example, the Python Language Reference describes the language using snippets that look like
for_stmt ::= "for" target_list "in" starred_list ":" suite
["else" ":" suite]
This says that a for_stmt is a statement that begins with the literal string for, followed by a string that follows the rules of a target_list, followed by the literal string in, followed by a string that follows the rules of starred_list, followed by the literal string :, followed by a block of code defined by suite; the square brackets on the second line mark an optional else clause. Here are the rules for suite and stmt_list:
suite ::= stmt_list NEWLINE | NEWLINE INDENT statement+ DEDENT
stmt_list ::= simple_stmt (";" simple_stmt)* [";"]
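If you have a Python interpreter handy, you can watch the for_stmt rule in action: the standard library's `ast` module parses source text according to the full grammar, so we can check that a for statement with an else clause is accepted and see which grammar rule it matched. (The variable names in the snippet are arbitrary choices of mine.)

```python
import ast

# A for statement exercising the optional ["else" ":" suite] part
# of the for_stmt rule.
source = "for x in items:\n    total = x\nelse:\n    total = 0\n"
tree = ast.parse(source)

stmt = tree.body[0]
print(type(stmt).__name__)  # For: the statement matched the for_stmt rule
print(len(stmt.orelse))     # 1: the optional else suite was matched
```

Note that `ast.parse` only checks grammatical validity; the names `items` and `total` never need to be defined for parsing to succeed.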
Section 10 of the reference contains the full grammar specification, the formally specified grammar used by the parser.
I mentioned before that in the process of lexical analysis, regular languages can only tell us what kinds of tokens appear, not whether they are in the correct order. The phase which enforces the correct order of tokens, determining whether a string of tokens is "grammatically correct", is called parsing. As you might expect, parsers parse strings and determine their grammatical validity, and as you might guess, parsers are built on context-free grammars.
Now is a good time to have a look at the Chomsky hierarchy, due to Chomsky, which categorizes languages by the complexity of their grammars. An important point about the hierarchy is that each class is contained in the next, more powerful class. This makes sense: every language that can be generated by a right linear grammar can also be generated by a more powerful grammar.
| Grammar | Language class | Machine model |
|---|---|---|
| Type-0 (Unrestricted) | Recursively enumerable | Turing machines |
| Type-1 (Context-sensitive) | Context-sensitive | Linear-bounded automata |
| Type-2 (Context-free) | Context-free | Pushdown automata |
| Type-3 (Right/left linear) | Regular | Finite automata |
Interestingly, the hierarchy, although intended for grammars, happens to map very nicely onto automata models.
A context-free grammar (CFG) is a 4-tuple $G = (V, \Sigma, P, S)$, where

- $V$ is a finite set of variables,
- $\Sigma$ is a finite alphabet of terminal symbols, disjoint from $V$,
- $P \subseteq V \times (V \cup \Sigma)^*$ is a finite set of productions, and
- $S \in V$ is the start variable.
Let's take a look at an example.
Let $G = (V,\Sigma,P,S)$, where $V = \{S\}$, $\Sigma = \{a,b\}$, $S$ is the start variable, and $P = \{(S, aSb), (S, \varepsilon)\}$.
However, we generally do not write grammars, even formally, like above. More typically, we write each production with the variable on the left hand side of an arrow and the replacement string on the right, like so: \begin{align*} S &\to aSb \\ S &\to \varepsilon \end{align*} We also combine productions that share a left hand side onto one line, with the right hand sides separated by $\mid$, like: \begin{align*} S &\to aSb \mid \varepsilon \end{align*} This is sufficient because it contains all of the information of the formal definition: every variable appears on the left hand side of some production and every alphabet symbol appears in some right hand side of a production. If a variable or symbol doesn't show up, it is never used and can be safely omitted.
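In code, one lightweight way to record such a grammar is a mapping from each variable to its list of right hand sides, with $\varepsilon$ written as the empty string. This representation is a sketch of my own, not a standard library API, but it mirrors the compact notation: the variables and alphabet can be read off the productions.

```python
# The grammar S -> aSb | ε, with variables as uppercase letters,
# terminals as lowercase letters, and ε as the empty string "".
grammar = {
    "S": ["aSb", ""],
}
start = "S"

# Every variable appears as a key; every terminal appears in some
# right-hand side, so V and Σ can be recovered from the productions.
variables = set(grammar)
alphabet = {c for rhss in grammar.values() for rhs in rhss
            for c in rhs if c not in variables}
print(variables, alphabet)  # {'S'} and {'a', 'b'}
```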
Of course, when doing this, we are implicitly relying on a number of stylistic conventions to be able to interpret the grammar without explicitly specifying its parts:

- uppercase letters denote variables, while lowercase letters and other symbols denote terminals;
- the start variable is the left hand side of the first production listed;
- $V$, $\Sigma$, and $P$ consist of exactly the variables, terminals, and productions that appear in the rules.
If your set of variables or alphabet symbols is not obvious, then you should explicitly state what they are or some stylistic convention you plan to use. For instance, grammars used in linguistics or programming languages will use more descriptive variable names and a much larger set of terminals.
How are grammars used? We have an intuitive understanding of how grammars are supposed to generate a language: start with the start variable $S$ and keep on applying as many rules as we like, transforming the string until we end up with a string that contains no variables.
Consider the grammar above. We start with the start variable $S$. We can replace $S$ with a right hand side of a production that has $S$ on its left hand side. So we can replace it with $\varepsilon$: \[S \Rightarrow \varepsilon.\] However, we could alternatively replace it with $aSb$: \[S \Rightarrow aSb.\] In this case, our string contains a variable that we can replace. We can apply another rule. \[S \Rightarrow aSb \Rightarrow aaSbb.\] We apply rules until there are no variables left in our string to replace: \[S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aaaSbbb \Rightarrow aaaaSbbbb \Rightarrow aaaabbbb.\]
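This replacement process is easy to mimic in code. The following sketch (using the dictionary-of-strings representation, an informal choice of mine) derives $aaaabbbb$ by applying $S \to aSb$ four times and then $S \to \varepsilon$:

```python
def apply_rule(string, variable, rhs):
    """Replace the leftmost occurrence of `variable` with `rhs`."""
    return string.replace(variable, rhs, 1)

# Derive aaaabbbb: apply S -> aSb four times, then S -> ε.
s = "S"
steps = [s]
for _ in range(4):
    s = apply_rule(s, "S", "aSb")
    steps.append(s)
s = apply_rule(s, "S", "")
steps.append(s)

print(" => ".join(steps))
# S => aSb => aaSbb => aaaSbbb => aaaaSbbbb => aaaabbbb
```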
Just like we think of states in finite automata as describing some property of the strings that reach them, we should think of variables as describing some property of the strings that they generate.
Our grammar in this case is quite simple: it consists of a single variable $S$, so the language of the grammar $G$ consists of strings that are generated via the variable $S$. We also notice that the production rule $S \to aSb$ refers to itself—that is, it's recursive. A plain reading of this is that strings generated by $S$ are strings generated by $S$ but with an $a$ on the left and a $b$ on the right.
As we've seen plenty of times, such a definition can go on forever, so we have another rule, that $S \to \varepsilon$. So we have two cases: one recursive and one that's not. And so we can arrive at the conclusion that strings generated by $S$ are either empty (our base case) or $awb$, where $w$ results from $S$. It turns out that this grammar generates the language $\{a^k b^k \mid k \geq 0\}$.
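The recursive reading of the grammar translates directly into a recursive membership test for $\{a^k b^k \mid k \geq 0\}$. This is a sketch (the function name is mine), but it makes the correspondence concrete: the base case mirrors $S \to \varepsilon$ and the recursive case mirrors $S \to aSb$.

```python
def generated_by_S(w):
    """True iff w can be derived from S in the grammar S -> aSb | ε."""
    if w == "":                        # base case: S -> ε
        return True
    if w.startswith("a") and w.endswith("b"):
        return generated_by_S(w[1:-1])  # peel off the outer a...b
    return False

print(generated_by_S("aaabbb"))  # True: this is a^3 b^3
print(generated_by_S("aabbb"))   # False: unbalanced
```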
One might ask why these are called context-free languages and grammars. The name has to do with how the grammar works: for any production $A \to \alpha$, and any string $\beta A \gamma$, we can always apply the production $A \to \alpha$. The context of $A$, whatever is around $A$, has no bearing on what rule gets applied, only that the variable is present. Consider, for example, the following definition for context-sensitive grammars.
A grammar $G = (V, \Sigma, P, S)$ is said to be context-sensitive if every production in $P$ is of the form $\alpha B \gamma \to \alpha \beta \gamma$, for $\alpha, \gamma \in (V \cup \Sigma)^*$, $\beta \in (V \cup \Sigma)^+$, and $B \in V$.
A consequence of this definition is that a rule of the form $\alpha B \gamma \to \alpha \beta \gamma$ means that replacing $B$ with $\beta$ is only allowed within the context $(\alpha,\gamma)$.
Recall that palindromes are strings $w$ such that $w = w^R$. We claim that the following grammar generates the language of palindromes over the alphabet $\{a,b\}$. \[S \to aSa \mid bSb \mid a \mid b \mid \varepsilon\] To see why that is, we can view palindromes recursively. A palindrome is either

- the empty string $\varepsilon$,
- a single symbol $a$ or $b$, or
- a string $cwc$, where $c \in \{a,b\}$ and $w$ is a palindrome.
Does this definition work? Consider a string $cwc$, where $w$ is a palindrome, so that $w = w^R$. If we apply the definition of reversal, we have \[(cwc)^R = (wc)^R c = c^R w^R c = cw^Rc = cwc,\] so $cwc$ is its own reverse and is indeed a palindrome.
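The identity $(cwc)^R = cw^Rc = cwc$ is easy to spot-check in code, with string slicing standing in for reversal:

```python
def reverse(w):
    """Reverse a string: w^R."""
    return w[::-1]

# If w is a palindrome, then so is cwc, for either choice of c.
w = "abba"
assert reverse(w) == w  # w is a palindrome
for c in ("a", "b"):
    s = c + w + c
    # (cwc)^R = c w^R c = c w c, since w^R = w
    print(s, reverse(s) == s)
```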
One thing you might have noticed with some of these examples is that they implicitly use the regular operations:

- writing multiple right hand sides separated by $\mid$ offers a choice, like a union;
- writing symbols and variables in sequence on a right hand side concatenates the strings they generate;
- a recursive rule lets a pattern repeat, like the Kleene star.
We can use these observations to conclude that the context-free languages are closed under the regular operations.
The class of context-free languages is closed under the regular operations.
Let $G_1 = (V_1,\Sigma,P_1,S_1)$ and $G_2 = (V_2, \Sigma,P_2,S_2)$ be context-free grammars, renaming variables if necessary so that $V_1 \cap V_2 = \emptyset$. We can construct a new grammar for each regular operation by adding a fresh start variable $S_0$:

- union: take the productions $P_1 \cup P_2$ together with $S_0 \to S_1 \mid S_2$;
- concatenation: take $P_1 \cup P_2$ together with $S_0 \to S_1 S_2$;
- Kleene star: take $P_1$ together with $S_0 \to S_0 S_1 \mid \varepsilon$.
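In the dictionary representation used earlier (an informal sketch of mine, with single uppercase letters as variables), the three closure constructions amount to merging the productions and adding a fresh start variable: one with a choice of the two starts for union, one concatenating them for concatenation, and one recursive rule for star.

```python
def union(g1, s1, g2, s2, fresh="T"):
    """Grammar for L(G1) ∪ L(G2); assumes disjoint variable sets
    and that `fresh` is a variable used by neither grammar."""
    return {**g1, **g2, fresh: [s1, s2]}, fresh    # fresh -> S1 | S2

def concat(g1, s1, g2, s2, fresh="T"):
    """Grammar for L(G1) L(G2), under the same assumptions."""
    return {**g1, **g2, fresh: [s1 + s2]}, fresh   # fresh -> S1 S2

def star(g1, s1, fresh="T"):
    """Grammar for L(G1)*."""
    return {**g1, fresh: [fresh + s1, ""]}, fresh  # fresh -> fresh S1 | ε

# Example: the union of {a^k b^k} and the palindromes over {a, b}.
g_anbn = {"S": ["aSb", ""]}
g_pal = {"P": ["aPa", "bPb", "a", "b", ""]}
g, start = union(g_anbn, "S", g_pal, "P")
print(start, g[start])  # T ['S', 'P']
```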
To prove this formally, we would need to go through an inductive argument; this makes a nice exercise.
Eventually, we will need to discuss how a grammar generates a particular language and show that this language coincides with a language we would like to describe. In order to do this, we need to formalize some aspects of the generative process, the sequence of applications of production rules that result in a string. We require the following definitions.
Let $G = (V,\Sigma,P,S)$ be a context-free grammar.