CSCI 341 Theory of Computation

Fall 2025, with Schmid

A Pumping Lemma for Context-free Languages

Let's start with a simple exercise.

(Knowing your ABCs) Let \(A = \{a,b,c\}\). Try to design a context-free grammar with a variable that derives the language \[ L_{abc} = \{a^nb^nc^n \mid n \in \mathbb N\} \] No really, give it a good go. What got in your way?

Something got in your way. What was it? Well, as you might be able to glean from the tone of that exercise, it is not possible to derive \(\{a^nb^nc^n \mid n \in \mathbb N\}\) with a context-free grammar (or equivalently, to recognize it with a stack automaton). This language lies outside of \(\mathsf{CFL}\)!

But don't take my word for it. We are going to spend today developing a technique for proving that a given language is not context-free. Taking some inspiration from the last time we showed that a language was not in a particular family, we are going to establish a property that every context-free language satisfies but \(L_{abc}\) does not. In fact, the property itself looks a lot like the previous one, too. Not every context-free language has a pumping length (e.g., \(\{a^nb^n \mid n \in \mathbb N\}\) does not)... but what about a double pumping length?

(Double Pumping Length) Let \(L \subseteq A^*\) be any language. A double pumping length for \(L\) is a number \(\ell \in \mathbb N\) such that for any word \(w \in L\) with \(|w| \ge \ell\), there exist five words \(u_1, u_2, u_3, p_1, p_2 \in A^*\) that satisfy the following properties:

\(w = u_1 p_1 u_2 p_2 u_3\) (\(w\) breaks up into \(u_1, u_2, u_3, p_1, p_2\))
either \(p_1 \neq \varepsilon\) or \(p_2 \neq \varepsilon\) (at least one of the pumps is nonempty)
\(|p_1 u_2 p_2| \le \ell\) (the middle section has at most \(\ell\) letters)
for any \(k \in \mathbb N\), \(u_1p_1^k u_2p_2^k u_3 \in L\) (\(p_1\) and \(p_2\) can be simultaneously pumped)
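
The four properties can be turned into a small checker. What follows is a minimal sketch, not part of the course material: the names `check_split` and `in_L_ab` are ours, the language is passed in as a membership function, and Property 4 is only sampled for finitely many \(k\), so a `True` answer is evidence rather than a proof.

```python
def in_L_ab(word):
    """Membership test for L_ab = { a^n b^n : n in N }."""
    n = len(word) // 2
    return word == "a" * n + "b" * n


def check_split(w, parts, ell, in_language, max_k=6):
    """Test Properties 1-4 for a candidate split (u1, p1, u2, p2, u3) of w.

    Property 4 quantifies over every k in N, so it is only sampled here
    for k = 0, ..., max_k: a True result is evidence, not a proof.
    """
    u1, p1, u2, p2, u3 = parts
    if u1 + p1 + u2 + p2 + u3 != w:       # Property 1
        return False
    if p1 == "" and p2 == "":             # Property 2
        return False
    if len(p1 + u2 + p2) > ell:           # Property 3
        return False
    return all(                           # Property 4 (sampled)
        in_language(u1 + p1 * k + u2 + p2 * k + u3)
        for k in range(max_k + 1)
    )


print(check_split("aabb", ("a", "a", "", "b", "b"), 2, in_L_ab))   # True
print(check_split("aabb", ("", "aa", "", "b", "b"), 3, in_L_ab))   # False
```

The second split fails already at \(k = 0\): deleting \(p_1 = aa\) and \(p_2 = b\) leaves the word \(b\), which is not of the form \(a^nb^n\).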

There are a lot of moving parts here, so let's take a look at an example.

(Pumped ABs) Let's return to considering the language \(L_{ab} = \{a^nb^n \mid n \in \mathbb N\}\). This language has a double pumping length, but we are going to come back to that number later. Instead, let's consider a concrete word \(aabb \in L_{ab}\). We can break this word into five words like the ones described in the definition of double pumping length: \[\begin{gathered} aabb = u_1p_1u_2p_2u_3 \\ \text{where above, } \quad u_1 = a \quad p_1 = a \quad u_2 = \varepsilon \quad p_2 = b \quad u_3 = b \end{gathered}\] Property 1 holds, clearly. Property 2 asks us to check that either \(p_1 \neq \varepsilon\) or \(p_2 \neq \varepsilon\). This is of course true, because \(p_1 = a \neq \varepsilon\). Property 4 asks us to check that \(p_1\) and \(p_2\) can be simultaneously pumped. This is true! Watch: for any \(k \in \mathbb N\), \[ u_1 p_1^k u_2 p_2^k u_3 = a~a^k~\varepsilon~b^k~b = a^{k+1}b^{k+1} \in L_{ab} \] The property we left out, Property 3, has to do with the double pumping length itself: the number \(\ell\)!

If you pay close attention, you should notice that \(u_1 = a\) and \(u_3 = b\) could be replaced by \(a^j\) and \(b^j\), respectively, without affecting the rest of the properties. This is a huge hint: the number \(\ell = 2\) is a double pumping length for this language! To see why, let \(w = a^nb^n\) be any word in \(L_{ab}\) such that \(|a^nb^n| \ge 2\). Then \(n \ge 1\), so we can find a number \(j \in \mathbb N\) (namely \(j = n - 1\)) such that \[ w = a^j~a~b~b^j \] Now we can choose \(u_1,p_1,u_2,p_2,u_3\) like before: \[\begin{gathered} a^jabb^j = u_1p_1u_2p_2u_3 \\ \text{where above, } \quad u_1 = a^j \quad p_1 = a \quad u_2 = \varepsilon \quad p_2 = b \quad u_3 = b^j \end{gathered}\] Properties 1, 2, and 4 hold for the same reason as before, and Property 3 holds because \[ |p_1u_2p_2| = |a \varepsilon b| = |ab| = 2 \le 2 = \ell \] This shows that \(2\) is a double pumping length for \(L_{ab}\).
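
If you would like to see the argument for \(\ell = 2\) in action, here is a small sanity check, assuming the split \(u_1 = a^{n-1}\), \(p_1 = a\), \(u_2 = \varepsilon\), \(p_2 = b\), \(u_3 = b^{n-1}\) described above. The function names are ours, and the check only covers finitely many \(n\) and \(k\), so it illustrates the proof rather than replacing it.

```python
def in_L_ab(word):
    """Membership test for L_ab = { a^n b^n : n in N }."""
    n = len(word) // 2
    return word == "a" * n + "b" * n


def split_ab(n):
    """The split used for w = a^n b^n with n >= 1, taking j = n - 1."""
    j = n - 1
    return ("a" * j, "a", "", "b", "b" * j)


def ell_two_works(max_n=10, max_k=5, ell=2):
    """Check Properties 1-4 for every a^n b^n with 1 <= n <= max_n."""
    for n in range(1, max_n + 1):
        w = "a" * n + "b" * n
        u1, p1, u2, p2, u3 = split_ab(n)
        if u1 + p1 + u2 + p2 + u3 != w:          # Property 1
            return False
        if p1 == "" and p2 == "":                # Property 2
            return False
        if len(p1 + u2 + p2) > ell:              # Property 3
            return False
        for k in range(max_k + 1):               # Property 4 (sampled)
            if not in_L_ab(u1 + p1 * k + u2 + p2 * k + u3):
                return False
    return True


print(ell_two_works())   # True
```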

(Finding Double Pumps) Find double pumping lengths for the following three languages.

\(L_1 = \{a^nb^{2n} \mid n \in \mathbb N\}\)
\(L_2 = \{a^nb^{m}c^{n+m} \mid n,m \in \mathbb N\}\)
\(L_3 = \{ww^{op} \mid w \in A^*\}\) (these are the even length palindromes)

As you might have expected, every context-free language has a double pumping length.

(Double Pumping) Every context-free language has a double pumping length. More precisely, if \(L \in \mathsf{CFL}\), then there is a number \(\ell \in \mathbb{N}\) such that \(\ell\) is a double pumping length for \(L\).

We will talk about the proof a little bit later---the proof makes use of parse trees, actually. For now let's get a bit of practice using the Double Pumping Theorem to show that a language is not context-free.

(ABCs Ain't Context-free) We are going to use the Double Pumping Theorem to show that \(L_{abc} = \{a^nb^nc^n \mid n \in \mathbb N\}\) is not context-free. Given an arbitrary \(\ell \in \mathbb N\), let \(w = a^\ell b^\ell c^\ell\). Observe that \(|w| = 3\ell \ge \ell\), and let \(u_1,p_1,u_2,p_2,u_3\) be five words that satisfy the first three properties required by the definition of a double pumping length:

\(w = u_1 p_1 u_2 p_2 u_3\) (\(w\) breaks up into \(u_1, u_2, u_3, p_1, p_2\))
either \(p_1 \neq \varepsilon\) or \(p_2 \neq \varepsilon\) (at least one of the pumps is nonempty)
\(|p_1 u_2 p_2| \le \ell\) (the middle section has at most \(\ell\) letters)

We are going to argue that the fourth condition, that for any \(k \in \mathbb N\), \(u_1p_1^k u_2p_2^k u_3 \in L_{abc}\), is false, regardless of the choice of \(\ell\) and of the five words.

To that end, let \(k = 2\). Then, since \(|p_1u_2p_2| \le \ell\), there are 5 different possible scenarios going forward:

In the first scenario (scenario A), \(p_1u_2p_2\) entirely consists of \(a\)s: \[ \overbrace{aa}^{u_1}~ \overbrace{a\cdots a}^{p_1 u_2 p_2}~ \overbrace{aaaabbb\cdots bbbccc\cdots ccc}^{u_3} \] In this case, we can find \(i,j,r,s \in \mathbb N\) such that \(u_1 = a^i\), \(p_1 = a^j\), \(u_2 = a^r\), \(p_2 = a^s\). Then \[ w = a^i~a^j~a^r~a^s~a^{\ell - (i + j + r + s)} ~b^\ell~c^\ell \] and therefore, \[\begin{aligned} u_1~p_1^2~u_2~p_2^2~u_3 = a^i~(a^j)^2~a^r~(a^s)^2~a^{\ell - (i + j + r + s)} ~b^\ell~c^\ell && \text{(*)} \end{aligned}\] The number of \(a\)s in this word is \[ i + 2j + r + 2s + \ell - (i + j + r + s) = j + s + \ell \] Since either \(p_1 \neq \varepsilon\) or \(p_2 \neq \varepsilon\), \(j + s = |p_1| + |p_2| > 0\). This means that the number of \(a\)s in (*) is \[ j + s + \ell > \ell \] while the number of \(b\)s (and of \(c\)s) in (*) is still exactly \(\ell\). Therefore, \(u_1~p_1^2~u_2~p_2^2~u_3 \notin L_{abc}\).
In the second scenario (scenario B), \(p_1u_2p_2\) entirely consists of \(b\)s: \[ \overbrace{aaa\cdots aaabb}^{u_1}~ \overbrace{b \cdots b}^{p_1 u_2 p_2}~ \overbrace{bb\cdots bbbccc\cdots ccc}^{u_3} \] This scenario is the same as scenario A (the first scenario) with \(b\)s instead of \(a\)s.
In the third scenario (scenario C), \(p_1u_2p_2\) entirely consists of \(c\)s: \[ \overbrace{aaa\cdots abbb \cdots bbbcc}^{u_1}~ \overbrace{c \cdots c}^{p_1 u_2 p_2}~ \overbrace{cc}^{u_3} \] This scenario is the same as scenario A (the first scenario) with \(c\)s instead of \(a\)s.
In the fourth scenario (scenario D), \(p_1u_2p_2\) consists of \(a\)s and \(b\)s but no \(c\)s: \[ \overbrace{aa}^{u_1}~ \overbrace{a\cdots aaabbb \cdots b}^{p_1 u_2 p_2}~ \overbrace{bbccc\cdots ccc}^{u_3} \] You will explain why this scenario is also doomed to fail the fourth property in a moment.

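The fifth scenario (scenario E), in which \(p_1u_2p_2\) consists of \(b\)s and \(c\)s but no \(a\)s, is similar to scenario D. There is no scenario in which \(p_1u_2p_2\) contains both \(a\)s and \(c\)s: it would also have to contain all \(\ell\) of the \(b\)s in between, which would force \(|p_1u_2p_2| \ge \ell + 2 > \ell\). The general idea is that no "middle section" \(p_1 u_2 p_2\) can pump all three of the letters \(a\), \(b\), and \(c\) simultaneously.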

In all five scenarios, the \(p_1\) and \(p_2\) in \(u_1p_1u_2p_2u_3\) cannot simultaneously be pumped. This contradicts the fourth condition required by the definition of a double pumping length. Therefore, \(\ell\) is not a double pumping length for \(L_{abc}\). Since \(\ell\) was arbitrary, \(L_{abc}\) has no double pumping length at all. By the Double Pumping Theorem, \(L_{abc}\) is not a context-free language.
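
For small values of \(\ell\), the case analysis above can be double-checked by brute force. The sketch below (with our own helper names `splits` and `survivors`) enumerates every split of \(a^\ell b^\ell c^\ell\) that satisfies Properties 1 through 3 and collects those whose \(k = 2\) pump stays in \(L_{abc}\). It only covers the few values of \(\ell\) we try, so it is a sanity check and not a substitute for the proof.

```python
def in_L_abc(word):
    """Membership test for L_abc = { a^n b^n c^n : n in N }."""
    n = len(word) // 3
    return word == "a" * n + "b" * n + "c" * n


def splits(w, ell):
    """Yield every (u1, p1, u2, p2, u3) satisfying Properties 1-3 for w."""
    for i in range(len(w) + 1):                 # u1 = w[:i]
        hi = min(len(w), i + ell)               # Property 3: p1 u2 p2 ends by hi
        for j in range(i, hi + 1):              # p1 = w[i:j]
            for r in range(j, hi + 1):          # u2 = w[j:r]
                for s in range(r, hi + 1):      # p2 = w[r:s]
                    if j > i or s > r:          # Property 2
                        yield w[:i], w[i:j], w[j:r], w[r:s], w[s:]


def survivors(ell, k=2):
    """Splits of a^ell b^ell c^ell whose k-fold pump stays in L_abc."""
    w = "a" * ell + "b" * ell + "c" * ell
    return [
        (u1, p1, u2, p2, u3)
        for u1, p1, u2, p2, u3 in splits(w, ell)
        if in_L_abc(u1 + p1 * k + u2 + p2 * k + u3)
    ]


for ell in range(1, 5):
    print(ell, survivors(ell))   # every list printed here is empty
```

An empty list for a given \(\ell\) says exactly what the proof says: no admissible split of \(a^\ell b^\ell c^\ell\) survives being pumped twice.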

(Finishing the ABCs) Explain why in Scenario D, where \(p_1u_2p_2\) consists of \(a\)s and \(b\)s but no \(c\)s: \[ \overbrace{aa}^{u_1}~ \overbrace{a\cdots aaabbb \cdots b}^{p_1 u_2 p_2}~ \overbrace{bbccc\cdots ccc}^{u_3} \] we necessarily have \(u_1p_1^2u_2p_2^2u_3 \notin L_{abc}\).

There are a couple of cases to consider: what if one of \(p_1,p_2\) contains both \(a\)s and \(b\)s? What if \(p_1\) is just \(a\)s and \(p_2\) is just \(b\)s?

(Reruns) Use the Double Pumping Theorem to show that the following languages are not context-free.

\(L_1 = \{w \in \{a,b,c\}^* \mid \text{\(w\) has the same number of \(a\)s, \(b\)s, and \(c\)s}\}\)
\(L_2 = \{a^n b a^n b a^n \mid n \in \mathbb N\}\)

(Three's a Crowd) Prove that the following language over the alphabet \(\{0,1\}\) is not context-free: \[ L = \{w 0 w 0 w \mid w \in \{0,1\}^*\} \]

Proof of the Double Pumping Theorem

When we proved the Pumping Lemma for regular languages, we went through automata: finding a loop in an automaton that corresponded to a subword that can be pumped. Proving the Double Pumping Theorem is a bit different: this time we are going to look for subtrees of a given parse tree that can be "repeated" to obtain a new, taller parse tree.

(of the Double Pumping Theorem) Let \(L\) be a context-free language, and let \(\mathcal G = (X, A, R)\) be a context-free grammar with a variable \(x\) that derives \(L\). Recall from the Parsing is Deriving Theorem that \(L\) is also the language of all words yielded from parse trees rooted at \(x\). We are going to find a double pumping length for \(L\) using \(\mathcal G\) and the Parsing is Deriving Theorem.

To determine a double pumping length for \(L\), find the longest monomial expression \(\mu\) such that there is a derivation rule \(y \to \mu\) for some variable \(y \in X\), and let \(m = |\mu|\). Then every node in a parse tree rooted at some variable of \(\mathcal G\) has to have at most \(m\) children (otherwise, there would be a rule \(y \to \mu'\) with \(\mu'\) longer than \(\mu\)). It follows that a parse tree in \(\mathcal G\) of height \(h > 1\) can have at most \(m^h\) leaves (count the number of leaves in an \(m\)-branching tree with height \(h\)). This little bit of counting is how we are going to produce a double pumping length for the languages derived from \(\mathcal G\).

Write \(n = |X|\) for the number of variables in \(X\). Suppose you were given a word \(w \in L\) of length \(|w| \ge m^{n + 2}\), where \(m\) is the length of the longest monomial expression appearing in a derivation rule of \(\mathcal G\). Now let \(T\) be a smallest parse tree for \(w\) in \(\mathcal G\) rooted at \(x\) (the "smallest" part will come up later; it just means that no rewrite steps went to waste). Consider the tree \(T'\), which we obtain by deleting every subtree of \(T\) that only has leaves labelled with \(\varepsilon\). Then \(T'\) must have exactly \(|w|\) leaves, because its leaves are labelled with the letters of \(w\). But \(|w| \ge m^{n+2}\), so this means that the height of \(T'\) has to be at least \(n + 2\), which is two more than the number of variables in \(X\)! It follows that there is a branch of \(T'\), starting from the root node \(x\) and ending at a leaf of \(T'\), that has at least \(n+1\) nodes labelled with variables. Since there are only \(n\) variables to go around, this means that along this branch, one of the variables of \(\mathcal G\) must appear twice in \(T'\). This motivates us to choose the number \[ \ell = m^{n + 2} \] as our prospective double pumping length, since we are now in the business of using this branch to find a pair of words to simultaneously pump in \(w\).
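
To get a feel for the size of \(\ell = m^{n+2}\), here is a minimal sketch that computes it from a grammar. The encoding is our own assumption: a grammar is a dictionary sending each variable to a list of right-hand sides, each written as a tuple of symbols, and the sample grammar \(x \to a x b \mid \varepsilon\) for \(L_{ab}\) is just an illustration.

```python
# Hypothetical encoding of the grammar x -> a x b | epsilon, which derives L_ab.
# Each right-hand side is a tuple of symbols; () stands for epsilon.
GRAMMAR = {"x": [("a", "x", "b"), ()]}


def double_pumping_bound(grammar):
    """Return (m, n, ell) with ell = m ** (n + 2), as in the proof.

    m is the length of the longest right-hand side of a rule and
    n is the number of variables.
    """
    m = max(len(rhs) for rules in grammar.values() for rhs in rules)
    n = len(grammar)
    return m, n, m ** (n + 2)


print(double_pumping_bound(GRAMMAR))   # (3, 1, 27)
```

The proof hands us \(27\) for this grammar, even though we saw earlier that \(2\) is already a double pumping length for \(L_{ab}\). The number \(m^{n+2}\) is a safe choice, not the smallest one.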

Now, along the branch of length \(\ge n + 1\), one of the variables of \(\mathcal G\) must appear twice in \(T'\). We can therefore break \(w\) up into five segments, as illustrated in the figure below.

In this diagram, the outer triangle is a parse tree rooted at the variable \(x\) in \(\mathcal G\). The lightning strike in the middle is the path (originally through \(T'\)) starting from the root node where the variable \(y\) appears twice. The pink triangle represents the subtree \(S\) of \(T\) that is rooted at the first \(y\) encountered in the path, and the blue triangle represents the subtree \(V\) rooted at the second instance of \(y\) in the path. The leaves of the tree can now be organized into five groups: from left to right,

the leaves of \(T\) that are strictly to the left of the leaves of \(S\), which we concatenate to form \(u_1\),
the leaves of \(S\) that are strictly to the left of the leaves of \(V\), which we concatenate to form \(p_1\),
the leaves of \(V\), which we concatenate to form the word \(u_2\),
the leaves of \(S\) that are strictly to the right of the leaves of \(V\), which we concatenate to form \(p_2\),
the leaves of \(T\) that are strictly to the right of the leaves of \(S\), which we concatenate to form \(u_3\).

These are all of the leaves of \(T\), so \(w = u_1p_1u_2p_2u_3\). We also know that at least one of \(p_1\) or \(p_2\) is nonempty: if both were empty, then we could replace \(S\) (the pink triangle) with \(V\) (the blue triangle) to find a smaller parse tree for \(w\), but we assumed that \(T\) can't get any smaller! It's also true that \(|p_1u_2p_2| \le \ell\), because the number of leaves in \(S\) that are labelled with letters is at most \(m^h\) where \(h\) is the height of \(S\) (ignoring \(\varepsilon\)-leaves), and we can always pick the two occurrences of \(y\) among the last \(n + 1\) variables on a longest branch of \(T'\), which makes \(h \le n + 2\) and therefore \(m^h \le m^{n+2} = \ell\). Finally, \(p_1\) and \(p_2\) can be simultaneously pumped in \(w\): to pump \(p_1\) and \(p_2\) some number \(k \ge 1\) of times, replace \(V\) with a copy of \(S\), doing this \(k - 1\) times in total, to produce a parse tree for \(u_1p_1^ku_2p_2^ku_3\). To pump them \(0\) times, replace \(S\) with \(V\) instead.

The tree \(T\) with \(p_1\) and \(p_2\) simultaneously pumped once. This gives a parse tree for \[ u_1p_1^2u_2p_2^2u_3 = u_1p_1p_1u_2p_2p_2u_3 \]

The tree \(T\) with \(p_1\) and \(p_2\) simultaneously down-pumped. This gives a parse tree for \[ u_1p_1^0u_2p_2^0u_3 = u_1u_2u_3 \]

We have successfully verified that \(\ell\) is a double pumping length for \(L\).
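
The tree surgery in this proof can be played with directly. Below is a minimal sketch under our own assumptions: a parse tree is a pair `(variable, children)`, a letter leaf is a plain string, a rule \(y \to \varepsilon\) is a node with no children, and nodes are located by paths of child indices. The helper names (`word_of`, `subtree`, `replace`, `pump`) are ours. The sample tree is a parse tree for \(aabb\) in the grammar \(x \to a x b \mid \varepsilon\), with \(S\) rooted at the second \(x\) and \(V\) at the third.

```python
def word_of(tree):
    """Concatenate the leaves of a parse tree from left to right."""
    if isinstance(tree, str):
        return tree
    _, children = tree
    return "".join(word_of(child) for child in children)


def subtree(tree, path):
    """Follow a list of child indices down from the root."""
    for i in path:
        tree = tree[1][i]
    return tree


def replace(tree, path, new):
    """Return a copy of tree with the subtree at path swapped for new."""
    if not path:
        return new
    label, children = tree
    i = path[0]
    swapped = replace(children[i], path[1:], new)
    return (label, children[:i] + [swapped] + children[i + 1:])


def pump(tree, path_to_S, path_S_to_V, k):
    """Parse tree for u1 p1^k u2 p2^k u3, where S and V share a root label."""
    S = subtree(tree, path_to_S)
    middle = subtree(S, path_S_to_V)      # start from V, which yields u2
    for _ in range(k):                    # wrap it in k copies of S-minus-V
        middle = replace(S, path_S_to_V, middle)
    return replace(tree, path_to_S, middle)


# Parse tree for aabb using x -> a x b | epsilon.
T = ("x", ["a", ("x", ["a", ("x", []), "b"]), "b"])

for k in range(4):
    print(k, word_of(pump(T, [1], [1], k)))
# 0 ab
# 1 aabb
# 2 aaabbb
# 3 aaaabbbb
```

With \(k = 0\) the subtree \(S\) is replaced by \(V\), with \(k = 1\) we get \(T\) back, and each further step nests one more copy of \(S\) where \(V\) used to be.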
