We have seen that grammars are typically used to define the structure of programming languages. However, we do not really have a clear way to determine whether a string is generated by a given grammar $G$—that is, the string is “valid” with respect to the grammar. For an aribtrary CFG, it’s not clear where to even begin with this.
One way to think about this is to think back to the correspondence between regular expressions and finite automata. There, we had a convenient descriptive formalism in the regular expressions which we can transform into a convenient recognizer in finite automata.
Let $G$ be a context-free grammar. Then there exists a pushdown automaton $M$ such that $N(M) = L(G)$.
Let $G = (V,\Sigma,P,S)$ be a CFG. We will create a PDA $M = (Q,\Sigma,\Gamma, \delta, q_0, Z_0)$ which accepts by empty stack defined as follows:
Here is what the machine looks like.
This machine essentially acts as a parser for $G$, albeit a nondeterministic one. Like the one-state machine we saw earlier, the computation in this machine is driven primarily by the stack:
This process essentially performs a leftmost derivation of our input word. As usual with a one-state machine, we can elide many of the tricky details of how we can ensure the correct production rule is chosen when we see $A$ through the magic of nondeterminism.
To formally prove this, we need to show that we have a derivation $S \Rightarrow^* w$ over $G$ if and only if there is a computation $(q,w,S) \vdash^* (q,\varepsilon,\varepsilon)$ in $M$.
What is the idea behind this? We can think of this as a recursive descent parser. Generally, recursive descent parsers start with the start variable and try to construct a derivation matching the input string. If it is unsuccessful, it will backtrack and try another option. In essence, we can think of this as a deterministic implementation of a PDA, where we backtrack on the computation tree and try each “nondeterministic choice”. We can also see that the job of the PDA stack ends up being mirrored by the call stack of the parser.
Here are the details for the proof that $S \Rightarrow^* w$ over $G$ if and only if $(q,w,S) \vdash^* (q,\varepsilon,\varepsilon)$ in $M$.
Note that in Chapter 24, Kozen proves this with the helpful assumption that all context-free grammars can be transformed into an equivalent grammar in Greibach Normal Form. We did not do this, so our job here is harder.
We will first show that $L(G) \subseteq N(M)$. Let $w \in L(G)$. There exists a leftmost derivation for $w$ $$S = \gamma_1 \Rightarrow \gamma_2 \Rightarrow \cdots \Rightarrow \gamma_n = w,$$ where $\gamma_i \in (V \cup \Sigma)^*$ for $1 \leq i \leq n$. Since this is a leftmost derivation, we can write $\gamma_i = x_i \alpha_i$, where $x_i \in \Sigma^*$ and $\alpha_i \in V(V \cup \Sigma)^*$ for $1 \leq i \leq n-1$. In other words, $\alpha_i$ begins with the leftmost variable of $\gamma_i$. Furthermore, $x_i$ must be a prefix of $w$ and for each $x_i$, we can write $w = x_i y_i$ for some $y_i \in \Sigma^*$.
We will show by induction on $i$, that if $S \Rightarrow^* \gamma_i = x_i \alpha_i$, then $(q,w,S) \vdash^* (q,y_i,\alpha_i)$ on the PDA $M$ for all $1 \leq i \leq n$.
For our base case, we take $i = 1$. We have $x_1 = \varepsilon$, $y_1 = w$, and $\alpha_1 = S$. Thus, we have $(q,w,S) \vdash^* (q,y_i,\alpha_i)$ trivially.
For our inductive hypothesis, we assume that $(q,w,S) \vdash^* (q,y_j,\alpha_j)$ for all $j < i$. So suppose we have $S \Rightarrow^* \gamma_{i-1} = x_{i-1} \alpha_{i-1}$. By the inductive hypothesis, there exists a computation $(q,w,S) \vdash^* (q,y_{i-1},\alpha_{i-1})$, where
At this point, there is a variable $A_{i-1}$ at the top of the stack. Since we want a sentential form $x_i \alpha_i$, where the first symbol of $\alpha_i$ is a variable, this means we need to do the following:
So we pop $A_{i-1}$ off the stack. Since we have a leftmost derivation of $w$, there must exist a production $A_{i-1} \to \beta_{i-1}$ in $G$ for some $\beta_{i-1} \in (V \cup \Sigma)^*$. By our definition of $M$, there is a transition with $(q,\beta_{i-1}) \in \delta(q,\varepsilon,A_{i-1})$. Therefore, we have $(q,y_{i-1},\alpha_{i-1}) \vdash (q,y_{i-1},\beta_{i-1}\eta_{i-1})$. That is, the machine pops $A_{i-1}$ off the stack and applies the rule $A_{i-1} \to \beta_{i-1}$ by putting $\beta_{i-1}$ on the stack.
At this point, $\beta_{i-1}$ is on the top of the stack. However, $\beta_{i-1}$ is an arbitrary sentential form made up of terminals and non-terminals. To consume terminals on the stack, we must have the same symbols in our input string and apply the rule $\delta(q,a,a) = \{(q,\varepsilon)\}$ for $a \in \Sigma$.
Let $z_{i-1}$ denote the symbols on the top of the stack to be consumed.
Reading $z_{i-1}$ then gives us $$(q,y_{i-1},\alpha_{i-1}) = (q,y_{i-1}, A_{i-1} \eta_{i-1}) \vdash (q, y_{i-1}, \beta_{i-1} \eta_{i-1}) = (q, z_{i-1} y_i, z_{i-1} \alpha_i) \vdash^* (q,y_i,\alpha_i).$$ Thus, we have shown that for each $i$, we have $(q,w,S) \vdash^* (q,y_i,\alpha_i)$. Then taking $i = n$, we have $$(q,w,S) \vdash^* (q,y_n,\alpha_n) = (q,\varepsilon,\varepsilon)$$ and thus $w \in N(M)$ as desired.
Next, we will show that $N(M) \subseteq L(G)$. Let $w \in N(M)$. We will show that for any variable $A \in V$, if $(q,w,A) \vdash^* (q,\varepsilon,\varepsilon)$, then $A \Rightarrow^* w$. We will show this by induction on $n$, then length of an accepting computation $(q,w,A) \vdash^* (q,\varepsilon,\varepsilon)$.
For our base case, we have $n = 1$, since $A \neq \varepsilon$. Since $A$ is at the top of the stack, we must choose a transition of the form $\delta(q,\varepsilon,A)$. However, this means that $w = \varepsilon$ and $(q,\varepsilon) \in \delta(q,\varepsilon,A)$ and this implies that there is a rule $A \to \varepsilon$ in $G$. Thus, we have $A \Rightarrow^* w$.
Now, suppose that $n > 1$ and for our inductive hypothesis, we assume that for all variables $A \in V$, if the length of a computation $(q,w,A) \vdash^* (q,\varepsilon,\varepsilon)$ of $M$ is fewer than $n$ steps, then $A \Rightarrow^* w$. Consider a computation $(q,w,A) \vdash^* (q,\varepsilon,\varepsilon)$ of length $n$.
Again, we must first take a transition $\delta(q,\varepsilon,A)$ since $A$ is on the top of the stack. This time, we take the transition corresponding to a rule of the form $A \to Y_1 \cdots Y_k$, where $Y_1, \dots, Y_k \in (V \cup \Sigma)$, and we have $(q,w,A) \vdash (q,w,Y_1 \cdots Y_k)$. From here, consider the following computation, $$(q,w,A) \vdash (q,w,Y_1 \cdots Y_k) \vdash^* (q,x_1,Y_2 \cdots Y_k)$$ where $x_1$ is a suffix of $w$ with $w = w_1 x_1$.
In other words, we have $(q,w_1 x_1,Y_1 \cdots Y_k) \vdash^* (q,x_1,Y_2 \cdots Y_k)$. Now, we make use of the following observation: the computation $$(q,w_1,Y_1 \cdots Y_k) \vdash^* (q,\varepsilon,Y_2 \cdots Y_k)$$ is also a valid computation in $M$. The reasoning is simple: the computation clearly does not depend on $x_1$, since none of $x_1$ was read. If our input word were simply $w_1$, we would have exactly the above computation. However, this leads us to another similarly useful observation we can make: the computation $$(q,w_1,Y_1) \vdash^* (q,\varepsilon,\varepsilon)$$ is also a valid computation in $M$. The reasoning is similar: if we began our computation of $w_1$ with only $Y_1$ on the stack, we would arrive at the same configuration since $Y_2 \cdots Y_k$ were inconsequential to the computation.
Now, we observe that we have one of two possibilities.
We can now return to the computation $$(q,w,A) \vdash (q,w,Y_1 \cdots Y_k) \vdash^* (q,x_1,Y_2 \cdots Y_k)$$ and observe that we have $$(q,x_{i-2},Y_{i-1} \cdots Y_k) = (q,w_{i-1} x_{i-1},Y_{i-1} \cdots Y_k) \vdash^* (q,x_{i-1},Y_i \cdots Y_k)$$ for every $1 \leq i \leq k$. So we can apply a similar argument to obtain derivations $Y_i \Rightarrow^* w_i$ for $w = w_1 \cdots w_k$. This gives us our derivation, $$A \Rightarrow Y_1 \cdots Y_k \Rightarrow^* w_1 Y_2 \cdots Y_k \Rightarrow^* w_1 w_2 Y_3 \cdots Y_k \Rightarrow^* w_1 \cdots w_k = w$$ and thus $A \Rightarrow^* w$. Taking $A = S$, we have $S \Rightarrow^* w$ and $w \in L(G)$ as desired.
Consider the grammar for palindromes, $$S \rightarrow aSa \mid bSb \mid a \mid b \mid \varepsilon,$$ and consider a PDA $P$ for this grammar, which we obtain by following the construction of Theorem 17.1. Then a computation of $P$ on the word $aababaa$ looks like \begin{align*} (q,aababaa,S) & \vdash (q,aababaa,aSa) \\ & \vdash (q,ababaa,Sa) \\ & \vdash (q,ababaa,aSaa) \\ & \vdash (q,babaa,Saa) \\ & \vdash (q,babaa,bSbaa) \\ & \vdash (q,abaa,Sbaa) \\ & \vdash (q,abaa,abaa) \\ & \vdash (q,baa,baa) \\ & \vdash (q,aa,aa) \\ & \vdash (q,a,a) \\ & \vdash (q,\varepsilon, \varepsilon) \end{align*}
Here is an implementation.
from automata.pda.npda import NPDA
A = NPDA(
states={0},
input_symbols=set('ab'),
stack_symbols=set('Sab'),
transitions={
0: {
'a': {'a': {(0,'')}},
'b': {'b': {(0,'')}},
'': {'S': {(0, r) for r in {'aSa', 'bSb', 'a', 'b', ''}}}
}
},
initial_state=0,
initial_stack_symbol="S",
final_states=set(),
acceptance_mode="empty_stack"
)
And here is a run of the machine that contains all of the configurations that are encountered (something that we would not want to do by hand).
>>> for i, c in enumerate(A.read_input_stepwise('abba')):
... print(f'Step {i}: {c}')
Step 0: {PDAConfiguration(0, 'abba', PDAStack(('S',)))}
Step 1: {PDAConfiguration(0, 'abba', PDAStack(('b',))),
PDAConfiguration(0, 'abba', PDAStack(('a',))),
PDAConfiguration(0, 'abba', PDAStack(('b', 'S', 'b'))),
PDAConfiguration(0, 'abba', PDAStack(())),
PDAConfiguration(0, 'abba', PDAStack(('a', 'S', 'a')))}
Step 2: {PDAConfiguration(0, 'bba', PDAStack(())),
PDAConfiguration(0, 'bba', PDAStack(('a', 'S')))}
Step 3: {PDAConfiguration(0, 'bba', PDAStack(('a', 'a', 'S', 'a'))),
PDAConfiguration(0, 'bba', PDAStack(('a', 'b', 'S', 'b'))),
PDAConfiguration(0, 'bba', PDAStack(('a',))),
PDAConfiguration(0, 'bba', PDAStack(('a', 'b'))),
PDAConfiguration(0, 'bba', PDAStack(('a', 'a')))}
Step 4: {PDAConfiguration(0, 'ba', PDAStack(('a', 'b', 'S'))),
PDAConfiguration(0, 'ba', PDAStack(('a',)))}
Step 5: {PDAConfiguration(0, 'ba', PDAStack(('a', 'b'))),
PDAConfiguration(0, 'ba', PDAStack(('a', 'b', 'b', 'S', 'b'))),
PDAConfiguration(0, 'ba', PDAStack(('a', 'b', 'b'))),
PDAConfiguration(0, 'ba', PDAStack(('a', 'b', 'a'))),
PDAConfiguration(0, 'ba', PDAStack(('a', 'b', 'a', 'S', 'a')))}
Step 6: {PDAConfiguration(0, 'a', PDAStack(('a',))),
PDAConfiguration(0, 'a', PDAStack(('a', 'b', 'b', 'S'))),
PDAConfiguration(0, 'a', PDAStack(('a', 'b')))}
Step 7: {PDAConfiguration(0, 'a', PDAStack(('a', 'b', 'b', 'b', 'S', 'b'))),
PDAConfiguration(0, '', PDAStack(())),
PDAConfiguration(0, 'a', PDAStack(('a', 'b', 'b', 'a', 'S', 'a'))),
PDAConfiguration(0, 'a', PDAStack(('a', 'b', 'b'))),
PDAConfiguration(0, 'a', PDAStack(('a', 'b', 'b', 'b'))),
PDAConfiguration(0, 'a', PDAStack(('a', 'b', 'b', 'a')))}
When given in this form, it’s not clear which configuration in each step leads to the next step. However, we recall that as long as we encounter a configuration with the input consumed and the stack empty, we know we can accept.
Now, we’ll consider the other direction. It turns out we've already done a lot of the work. Recall that we showed that every PDA can be transformed into an equivalent PDA with only one state that accepts by empty stack. Notice also that we happened to create a PDA with one state that accepts by empty stack to parse any given grammar. It turns out we can simply reverse this construction.
Let $P$ be a PDA with one state that accepts by empty stack. Then there exists a grammar $G$ such that $L(G) = N(P)$.
Let $P = ({0}, \Sigma, \Gamma, \delta, 0, Z_0)$. We define $G = (V, \Sigma, P, S)$ by
and we define the set of productions $P$ by considering transitions from $\delta$ in the following way: for each transition $(0, Y_1 Y_2 \cdots Y_k) \in \delta(0, a, X)$, we define the production \[X \to a Y_1 Y_2 \cdots Y_k,\] where $X, Y_1, \dots, Y_k \in \Gamma$, $a \in \Sigma \cup \{\varepsilon\}$.
An interesting consequence of this reverse construction is that the productions are in "almost" Greibach normal form. From here, we can apply the argument in Kozen Chapter 24 in reverse (this construction, found in Chapter 25, basically says this).
So we have shown that the class of context-free languages (those languages generated by context-free grammars) and the class of pushdown languages (those languages recognized by pushdown automata) are exactly the same. This result is due to Chomsky and Schützenberger and Evey around the same time in 1963.
This is a similar situation as with Kleene’s theorem, where we showed the equivalence of recognizable and regular languages via transformations of finite automata to regular expressions and vice versa. This of course doesn’t say anything directly about the structure of the language. For that, we looked to Myhill–Nerode for regular languages.
For context-free languages, we can turn to, who else, Chomsky and Schützenberger, who, also in 1963, they prove what is now called the Chomsky–Schützenberger theorem.
Every context-free language is a homomorphic image of the intersection of a Dyck language and a regular language.
A proof of this result can be found in Kozen Chapter G.
The Chomsky–Schützenberger theorem tells us that every context-free language can be described by a regular language, a Dyck language, and a homomorphism. The Dyck languages are the family of languages of balanced brackets of $k$ types, named after the combinatorialist Walther von Dyck. Since regular languages are closed under homomorphisms, what this tells us is that context-free languages are “like” regular languages, but what really separates the two is the ability to express “nestedness”, in the sense of brackets or trees.
This aligns with our observations of context-free languages: grammars generate strings in a balanced way, from the inside out, while pushdown automata exhibit nestedness because of the constraints of the stack—it is otherwise just an NFA.
At this point, you’ll notice that Chomsky and Schützenberger have appeared together a lot. They were very close collaborators and foundational the the theory of context-free languages. Chomsky is fairly well known for his contributions as a linguist, even outside of linguistics and computer science. Schützenberger, on the other hand, was a mathematician, and can be considered the godfather of French formal language theory. In addition to his work with Chomsky, he was the doctoral advisor for many of the leading French researchers in the area.
Though the PDA construction that we saw leads to the recursive descent parsing algorithm, for general context-free grammars, the possibility of backtracking is still too expensive to use in practice. But this is commonly the case for a lot of problems that we approach recursively at first from the “top-down” perspective. One way to get around this is to consider the problem “bottom-up”.
The top-down approach is natural: start with the “start” variable and try to construct a derivation for the input string. The bottom-up approach is to try to build derivations for each substring of our input string and attempt to piece together those derivations. We can think of this as working “backwards” starting with small pieces and putting them together to get the preceding pieces.
To do this, we can exploit the predictable structure of a CNF grammar to come up with a systematic way of building a derivation. We know that every symbol will have a production that generates it directly and we know that every non-terminal production will have exactly two variables on its right hand side. This means that every derivation of a string longer than a single symbol can be broken up into derivations of exactly two strings.
This naturally gives us some recursive structure to work with. Suppose we have a grammar $G = (V, \Sigma, P, S)$ and a string $w$ that we would like to figure out whether $w \in L(G)$. Recall that this means that $S \Rightarrow^* w$, but as usual, we’ll ask the broader question of whether there exists a variable $A$ such that $A \Rightarrow^* w$. We consider this based on the length of $w$.
For each string, what we’ll do is remember the set of variables $A$ such that $A \Rightarrow^* w$. Let’s call this set $T(w)$. Our discussion above gives us the folowing recurrence. \[T(w) = \begin{cases} \{A \in V \mid A \to w \in P\} & \text{if $|w| = 1$}, \\ \bigcup\limits_{w = uv} \{A \mid A \to BC \in P \wedge B\in T(u) \wedge C \in T(v)\} & \text{if $|w| > 1$}. \end{cases}\]
We can make this more explicit, for the algorithm, by referring to $u$ and $v$ as indexed substrings of $w = a_1 \cdots a_n$. \[T(i,j) = \begin{cases} \{A \in V \mid A \to a_i \in P \} & \text{if $i=j$}, \\ \bigcup\limits_{k=i}^{j-1} \{A \mid A \to BC \in P \wedge B\in T(i,k) \wedge C \in T(k+1,j)\} & \text{if $i < j$}. \end{cases}\]
This recurrence is a Bellman equation—i.e., a dynamic programming recurrence. This means that what we’ve described is really a dynamic programming algorithm. This algorithm is called the Cocke–Younger–Kasami (CYK) algorithm, named after the three individuals who independently proposed it.
\begin{algorithmic}
\PROCEDURE{cyk}{$G = (V,\Sigma,P,S), w=a_1 a_2 \cdots a_n$}
\FOR{$i$ from 1 to $n$}
\STATE $T[i,i] = \{A \in V \mid A \to a_i \in P\}$
\ENDFOR
\FOR{$m$ from 1 to $n-1$}
\FOR{$i$ from 1 to $n-m$}
\STATE $j = i+m$
\STATE $T[i,j] = \emptyset$
\FOR{$k$ from 1 to $j-1$}
\STATE $T[i,j] = T[i,j] \cup \{A \in V \mid A \to BC \in P \wedge B \in T[i,k] \wedge C \in T[k+1,j]\}$
\ENDFOR
\ENDFOR
\ENDFOR
\RETURN $S \in T[1,n]$
\ENDPROCEDURE
\end{algorithmic}
An interesting feature of this algorithm is how it fills out its “dynamic programming table”. You may be used to algorithms that fill out the table row by row, but here, we’ll see that because we are iterating on the length of the substrings, we end up filling out the table diagonally. This also means that where we need to look for our relevant subproblems is a bit different from the usual dynamic programming algorithms.
Another example of a dynamic programming algorithm that operates on “intervals” is the RNA secondary structure prediction algorithm due to Nussinov and Jacobson in 1980. Interestingly enough, probabilistic context-free grammars were later used for RNA secondary structure prediction, first shown by Sakakibara et al. in 1994. There, they give a modification to CYK that allows it to generate the most likely (with respect to the probabilistic CFG) parse tree for an input string.
Let’s walk through an example of the CYK algorithm.
Consider the following grammar $G$ in CNF. \begin{align*} S &\rightarrow BC \mid b \\ A &\rightarrow BC \mid b \\ B &\rightarrow DC \mid BB \mid a \\ C &\rightarrow BA \mid b \\ D &\rightarrow CA \mid c \end{align*}
We want to determine whether the word $cabab$ is generated by $G$. We will store our results in a table. First, we consider every substring of length 1 and the productions that could have generated each substring. We record which variables are on the left hand side of these productions.
| $c$ | $a$ | $b$ | $a$ | $b$ | |
| 1 | 2 | 3 | 4 | 5 | |
| 1 | $D$ | ||||
| 2 | $B$ | ||||
| 3 | $S,A,C$ | ||||
| 4 | $B$ | ||||
| 5 | $S,A,C$ |
Next, we consider substrings of length 2. A production can generate substrings of length 2 if they lead to two variables that each generate substrings of length 1. For example, we have that entry $1,2$ doesn’t have any variables that can generate it, because there is no production of the form $X \rightarrow DB$. However, we can determine that $ab$ is generated by a production that has either $BA$ or $BC$ on the right hand side. There are such productions: $S \rightarrow BC$, $A \rightarrow BC$, and $C \rightarrow BA$.
| $c$ | $a$ | $b$ | $a$ | $b$ | |
| 1 | 2 | 3 | 4 | 5 | |
| 1 | $D$ | $\emptyset$ | |||
| 2 | $B$ | $S,A,C$ | |||
| 3 | $S,A,C$ | $\emptyset$ | |||
| 4 | $B$ | $S,A,C$ | |||
| 5 | $S,A,C$ |
We will now consider substrings of length 3. Here, we have to do a bit more work, because strings of length 3, say $xyz$ can be generated by two variables by $xy$ and $z$ or $x$ and $yz$. We must check both possibilities. For example, for the string $cab$, we can generate it either by $ca$ and $b$ or by $c$ and $ab$. In the first case, $ca$ has no derivation, but the second case gives us a valid derivation, since $ab$ can be generated by $S$, $A$, or $C$ and we have $B \Rightarrow DC \Rightarrow cab$.
| $c$ | $a$ | $b$ | $a$ | $b$ | |
| 1 | 2 | 3 | 4 | 5 | |
| 1 | $D$ | $\emptyset$ | $B$ | ||
| 2 | $B$ | $S,A,C$ | $\emptyset$ | ||
| 3 | $S,A,C$ | $\emptyset$ | $D$ | ||
| 4 | $B$ | $S,A,C$ | |||
| 5 | $S,A,C$ |
We continue with strings of length 4. Again, we have to consider all of the ways to split up a string of length 4 into two substrings.
| $c$ | $a$ | $b$ | $a$ | $b$ | |
| 1 | 2 | 3 | 4 | 5 | |
| 1 | $D$ | $\emptyset$ | $B$ | $B$ | |
| 2 | $B$ | $S,A,C$ | $\emptyset$ | $D$ | |
| 3 | $S,A,C$ | $\emptyset$ | $D$ | ||
| 4 | $B$ | $S,A,C$ | |||
| 5 | $S,A,C$ |
Finally, we finish with the string of length 5, our original string $w$. We have constructed a table that tells us which variables can generate each substring $w$. To determine whether $w \in L(G)$, we simply need to see whether the substring corresponding to $w$ (i.e. $w[1..5]$ has a derivation that begins with $S$. We see that it does, so $w \in L(G)$.
| $c$ | $a$ | $b$ | $a$ | $b$ | |
| 1 | 2 | 3 | 4 | 5 | |
| 1 | $D$ | $\emptyset$ | $B$ | $B$ | $S,A,C$ |
| 2 | $B$ | $S,A,C$ | $\emptyset$ | $D$ | |
| 3 | $S,A,C$ | $\emptyset$ | $D$ | ||
| 4 | $B$ | $S,A,C$ | |||
| 5 | $S,A,C$ |
Since CYK is a dynamic programming algorithm, the running time is not complicated to figure out.
Given a CFG $G$ in Chomsky Normal Form and string $w$ of length $n$, we can determine whether $w \in L(G)$ in $O(n^3 |G|)$ time.
CYK takes $O(n^3)$ time since it has three loops: for each substring length $\ell$ from 1 to $n$, consider each substring of length $\ell$ (of which there are $O(n)$), and consider all of the ways to split this substring into two (again, $O(n)$). Alternatively, we can see that we are filling out an $n \times n$ table (about $O(n^2)$), where each entry requires the traversal of the associated row and column ($O(n)$), which gives us our $O(n^3)$ bound.
Here is a big discussion about the complexity of parsing and some of the consequences.
Leslie Valiant showed in 1975 that computing the CYK table can be reduced to boolean matrix multiplication, which at the time that most current formal languages books were printed (they do not get updated very often) had a time complexity of $O(n^{2.376})$-ish, due to Coppersmith and Winograd in 1990. This was true until the early 2010s (i.e. just over 20 years later), when there was a series of small improvements, the most recent of which was due to Alman, Duan, Vassilevska Williams, Y. Xu, Z. Xu, and Zhou in 2025, clocking in at $O(n^{2.371339})$.
Perhaps surprisingly, the same goes the other way around: matrix multiplication can’t do any better than whatever algorithm we muster up for parsing context-free grammars, because Lillian Lee showed in 2002 that you can solve matrix multiplication by reducing it to context-free grammar parsing. The consequence of this is that, just like with matrix multiplication, we don’t really know if we get parsing down to $O(n^2)$.
The more immediate problem is that $O(n^3)$ is still considered big, if you think about your typical large software codebase and the amount of time it takes to build a project and how mad everyone gets when you break the build. Ideally, we would like something that runs in linear time, or in other words, something where we can essentially read our input once and parse it correctly without having to go back over it.
This seems to lead to consideration of the languages accepted by deterministic pushdown automata. These happen to characterize the class of languages called deterministic context-free languages. If we think about it, the fact that these are deterministic suggests that we don’t run into some of the complexity issues with nondeterministic choice. The problem with this is that this is a class defined based on the machine model, rather than the grammar.
However, Donald Knuth, in 1965, defined the LR(k) grammars. Here, L stands for left-to-right scan, R stands for rightmost derivation, and $k$ is the number of characters we're allowed to “look ahead” in order to decide what to do. Knuth links LR(k) grammars with DPDAs:
If $G$ is an $\mathrm{LR}(k)$ grammar, then there is a deterministic pushdown automaton that accepts $L(G)$. If $L$ is a deterministic context-free language, then there is an $\mathrm{LR}(1)$ grammar that generates $L$.
Practically speaking, most programming languages are really languages that are $\mathrm{LR}(1)$ or so. This makes sense—it would not be a good idea for a program to have multiple possible valid structural interpretations.