The Algebra of Regular Languages
So far, we have seen that every regular language is finitely presentable, formally \(\mathsf{Reg} \subseteq \mathsf{Fin}\). Continuing our journey towards the proof of Kleene's Theorem, which states that \(\mathsf{Reg} = \mathsf{Fin}\), we need to gain a bit more proficiency with regular expressions. The most important step in the proof of the reverse containment is to show that certain systems of equations involving regular expressions can be solved. This process of reasoning with equations between regular expressions is called the algebra of regular expressions, and can be really fun once you get used to it. If solving systems of equations doesn't sound like algebra to you, then I'm not sure what will.
Let's get a bit more formal about what all this is about.
It may come as a surprise to you that different regular expressions can be language equivalent. It helps to think of arithmetic: \(5 + 2 = 2 + 5\), even though those two arithmetic expressions are different. One might call these two arithmetic expressions number equivalent. In a similar fashion, the regular expressions \(a + b\) and \(b + a\) (over an alphabet containing \(a\) and \(b\)) are language equivalent. Indeed, \[ \mathcal L(a + b) = \{a, b\} = \{b, a\} = \mathcal L(b + a) \] Not too crazy, right?
Exercise: which of the following regular expressions are language equivalent to one another?
- \((ab + b) + c\)
- \(c + (a + \varepsilon) b\)
- \(a(b + c)\)
- \((b + c)a\)
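None of these expressions uses the Kleene star, so each denotes a finite language, and we can compare those languages directly. Here is a small Python sketch of that check (the helper `cat` and all variable names are my own, not from the lecture):

```python
def cat(X, Y):
    """Concatenation of languages: {xy | x in X, y in Y}."""
    return {x + y for x in X for y in Y}

EPS = {""}  # the language of the expression ε

e1 = (cat({"a"}, {"b"}) | {"b"}) | {"c"}   # L((ab + b) + c)
e2 = {"c"} | cat({"a"} | EPS, {"b"})       # L(c + (a + ε)b)
e3 = cat({"a"}, {"b"} | {"c"})             # L(a(b + c))
e4 = cat({"b"} | {"c"}, {"a"})             # L((b + c)a)

assert e1 == e2            # the first two are language equivalent
assert e3 != e4            # the last two are not
```

The first two expressions look quite different on paper, yet both denote \(\{ab, b, c\}\).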
Unions
Here is another example of language equivalent regular expressions: \[ a + (b + c) =_{\mathcal L} (a + b) + c \] Again, remember that regular expressions are just sequences of symbols, so the two regular expressions that appear above are not exactly the same. However, it's not hard to prove that they are language equivalent: on the one hand, we have \[\begin{aligned} \mathcal L(a + (b + c)) &= \mathcal L(a) \cup \mathcal L(b + c) \\ &= \mathcal L(a) \cup (\mathcal L(b) \cup \mathcal L(c)) \\ &= \{a\} \cup (\{b\} \cup \{c\}) \\ &= \{a\} \cup \{b,c\} \\ &= \{a,b,c\} \end{aligned}\] On the other hand, a similar calculation gives \[ \mathcal L((a + b) + c) = \{a, b, c\} \hspace{5em} (\star) \] Therefore, \(\mathcal L((a + b) + c) = \mathcal L(a + (b + c))\), and we can write \((a + b) + c =_{\mathcal L} a + (b + c)\).
More generally, the following equations hold for any regular expressions \(r, r_1, r_2, r_3 \in \mathit{RExp}\):
- \(r_1 + r_2 =_{\mathcal L} r_2 + r_1\)
- \(r + \emptyset =_{\mathcal L} r\)
- \(\emptyset + r =_{\mathcal L} r\)
- \(r + r =_{\mathcal L} r\)
- \(r_1 + (r_2 + r_3) =_{\mathcal L} (r_1 + r_2) + r_3\)
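Since the union of languages is just set union, all five equations are inherited directly from set theory. A quick Python sanity check on sample languages (the sample sets are arbitrary choices of mine):

```python
r1, r2, r3 = {"a", "ab"}, {"b"}, {"ab", "c"}  # arbitrary sample languages
empty = set()                                  # the language of ∅

assert r1 | r2 == r2 | r1                # r1 + r2 = r2 + r1
assert r1 | empty == r1                  # r + ∅ = r
assert empty | r1 == r1                  # ∅ + r = r
assert r1 | r1 == r1                     # r + r = r
assert r1 | (r2 | r3) == (r1 | r2) | r3  # r1 + (r2 + r3) = (r1 + r2) + r3
```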
Sequential Composition
Here is another example of a pair of language equivalent regular expressions: \(a(bc) =_{\mathcal L} (ab)c\). Here, \[\begin{aligned} \mathcal L(a (bc)) &= \mathcal L(a) \cdot \mathcal L(bc) \\ &= \mathcal L(a) \cdot (\mathcal L(b) \cdot \mathcal L(c)) \\ &= \{a\} \cdot (\{b\} \cdot \{c\}) \\ &= \{a\} \cdot (\{bc\}) \\ &= \{abc\} \\ \end{aligned}\] This is what we should expect; the language semantics of a word, represented as a regular expression, should be the set containing only that word. But there is also the other way of forming the word "\(abc\)", namely \((ab)c\), which also has the language semantics \[\mathcal L((a b)c) = \{abc\}\] Therefore, \(a(bc) =_{\mathcal L} (ab)c\).
There are a few more equations to do with sequential composition that are going to be useful later.
- \(r \cdot \varepsilon =_{\mathcal L} r\) and \(\varepsilon \cdot r =_{\mathcal L} r\)
- \(r\cdot \emptyset =_{\mathcal L} \emptyset\) and \(\emptyset \cdot r =_{\mathcal L} \emptyset\)
- \(r_1 \cdot (r_2 \cdot r_3) =_{\mathcal L} (r_1 \cdot r_2) \cdot r_3\)
- \(r_1 \cdot (r_2 + r_3) =_{\mathcal L} (r_1 \cdot r_2) + (r_1 \cdot r_3)\)
- \((r_1 + r_2) \cdot r_3 =_{\mathcal L} (r_1 \cdot r_3) + (r_2 \cdot r_3)\)
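Each of these can likewise be checked on concrete finite languages. In the sketch below, `cat` models the concatenation of languages (the helper name and sample sets are my own):

```python
def cat(X, Y):
    """Concatenation of languages: {xy | x in X, y in Y}."""
    return {x + y for x in X for y in Y}

r1, r2, r3 = {"a", "ab"}, {"b", ""}, {"c"}  # arbitrary sample languages
eps, empty = {""}, set()                     # languages of ε and ∅

assert cat(r1, eps) == r1 and cat(eps, r1) == r1      # ε is a unit
assert cat(r1, empty) == empty == cat(empty, r1)      # ∅ annihilates
assert cat(r1, cat(r2, r3)) == cat(cat(r1, r2), r3)   # associativity
assert cat(r1, r2 | r3) == cat(r1, r2) | cat(r1, r3)  # left distributivity
assert cat(r1 | r2, r3) == cat(r1, r3) | cat(r2, r3)  # right distributivity
```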
Kleene Star
So far, we have dealt with unions and sequential composition. The last operation on our list to deal with is the Kleene star, which is... let's just say, a lot less familiar. The gist of the equations we are about to see is this: The Kleene star of a language consists of the empty word (if it is not already there), as well as any concatenation of the words in the language (including repetitions).
For example, for a letter \(a \in A\), unraveling the definition of the language semantics of \(a^*\) gives \[\mathcal L(a^*) = \{a^n \mid n \in \mathbb N\}\] For \(n = 0\), \(a^n = a^0 = \varepsilon\). For \(n > 0\), \(a^n = a a^{n-1}\). Unraveling the equation above, we can write down the following calculation: \[\begin{aligned} \mathcal L(a^*) &= \{a^0\} \cup \{a^n \mid n \in \mathbb N \text{ and } n > 0\} \\ &= \{a^0\} \cup \{aa^{n-1} \mid n \in \mathbb N \text{ and } n > 0\} \\ &= \{\varepsilon\} \cup \{aa^n \mid n \in \mathbb N\} \\ &= \{\varepsilon\} \cup (\{a\} \cdot \{a^n \mid n \in \mathbb N\}) \\ &= \mathcal L(\varepsilon) \cup (\mathcal L(a) \cdot \mathcal L(a^*)) \\ &= \mathcal L(\varepsilon + (a \cdot a^*)) \\ \end{aligned}\] In other words, \( a^* =_{\mathcal L} \varepsilon + aa^* \).
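We cannot compare the infinite languages \(\mathcal L(a^*)\) and \(\mathcal L(\varepsilon + aa^*)\) exhaustively in code, but we can compare them up to a length bound. A Python sketch (the bounded `star` helper is my own device, not part of the lecture):

```python
def cat(X, Y):
    """Concatenation of languages: {xy | x in X, y in Y}."""
    return {x + y for x in X for y in Y}

def star(X, n):
    """All words of X* of length at most n."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {w for w in cat(frontier, X) if len(w) <= n} - result
        result |= frontier
    return result

n = 6
a_star = star({"a"}, n)                           # L(a*) up to length n
unrolled = {""} | cat({"a"}, star({"a"}, n - 1))  # L(ε + a·a*) up to length n
assert a_star == unrolled
assert a_star == {"a" * k for k in range(n + 1)}  # {ε, a, aa, ..., a^6}
```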
The exercise and example above point to the first two of the following equations; the remaining ones make for good practice, and can be verified in a similar way.
- \(\varepsilon + rr^* =_{\mathcal L} r^*\)
- \(\varepsilon + r^*r =_{\mathcal L} r^*\)
- \((\varepsilon + r)^* =_{\mathcal L} r^*\)
- \((\varepsilon + a)^* =_{\mathcal L} (\emptyset + a)^*\)
- \(\varepsilon + a^* =_{\mathcal L} a^*\)
- \(\emptyset^* =_{\mathcal L} \varepsilon\)
- \((a + b)^* =_{\mathcal L} a(a + b + \varepsilon)^* + b(\emptyset + b + a)^* + \varepsilon\)
- for any \(r_i,s_i \in \mathit{RExp}\), \((r_1 + r_2)(s_1 + s_2) =_{\mathcal L} r_1s_1 + r_1s_2 + r_2s_1 + r_2s_2\)
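The starred equations can be spot-checked the same way, comparing both sides up to a length bound. In the sketch below (helpers and sample languages are my own), `r` plays the role of a sample regular expression \(ab + b\):

```python
def cat(X, Y):
    """Concatenation of languages: {xy | x in X, y in Y}."""
    return {x + y for x in X for y in Y}

def star(X, n):
    """All words of X* of length at most n."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {w for w in cat(frontier, X) if len(w) <= n} - result
        result |= frontier
    return result

n = 5
r = {"ab", "b"}                        # sample language, L(ab + b)
assert star({""} | r, n) == star(r, n)  # (ε + r)* = r*
assert star(set(), n) == {""}           # ∅* = ε
assert {""} | star(r, n) == star(r, n)  # ε + r* = r*

ab = {"a", "b"}                         # checking the (a + b)* equation
rhs = cat({"a"}, star(ab | {""}, n - 1)) | cat({"b"}, star(ab, n - 1)) | {""}
assert {w for w in rhs if len(w) <= n} == star(ab, n)
```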
Arden's Rule
There is one equation we are missing from our toolset so far. It's not quite an equation so much as it is a rule, since it only applies in certain situations. The situations it applies to have to do with the empty word property.
A regular expression \(r\) has the empty word property if \(\varepsilon \in \mathcal L(r)\). So, for example, the regular expression \(\varepsilon\) does have the empty word property, while \(\emptyset\) does not. And for any \(a \in A\), we have \(\mathcal L(a) = \{a\}\), so \(a\) does not have the empty word property. On the other hand, for every regular expression \(r \in \mathit{RExp}\), \(r^*\) does have the empty word property, since \(\varepsilon \in \mathcal L(r)^* = \mathcal L(r^*)\).
- (Left Rule) if \(r\) does not have the empty word property and \(s =_{\mathcal L} t + r\cdot s\), then \(s =_{\mathcal L} r^* \cdot t\)
- (Right Rule) if \(r\) does not have the empty word property and \(s =_{\mathcal L} t + s\cdot r\), then \(s =_{\mathcal L} t \cdot r^*\)
Typically, Arden's Left Rule is just called "Arden's Rule". Arden's Left Rule is all that's needed for the next lecture, but the Right Rule is going to make things a lot easier for you in the exercises!
Remember the equation \(r^* =_{\mathcal L} \varepsilon + rr^*\) from the Basics of the Kleene Star. This equation reveals to us that \(r^*\) solves the following equation for an unknown variable \(x\): \[ x =_{\mathcal L} \varepsilon + r \cdot x \] Arden's Rule tells us more: it says that if \(r\) does not have the empty word property, then \(r^*\) is the only solution to the equation above.
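As a small worked example (my own, not from the text): consider the equation \[ x =_{\mathcal L} b + a \cdot x \] in the unknown \(x\). The regular expression \(a\) does not have the empty word property, so Arden's Left Rule applies with \(r = a\) and \(t = b\), giving the unique solution \[ x =_{\mathcal L} a^* \cdot b \] That is, the solution is the language \(\{a^n b \mid n \in \mathbb N\}\) of words consisting of some number of \(a\)'s followed by a single \(b\).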
- \(aa^* =_{\mathcal L} a^* a\)
- \(a^* =_{\mathcal L} (aa)^*(a + \varepsilon)\).
- \(a^*a^* =_{\mathcal L} a^*\)
- \((a + b)^* =_{\mathcal L} b^*(ab^*)^*\)
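The last equivalence can also be spot-checked up to a bounded word length, which is a good way to build confidence before attempting the algebraic proof (the helpers below are my own):

```python
def cat(X, Y):
    """Concatenation of languages: {xy | x in X, y in Y}."""
    return {x + y for x in X for y in Y}

def star(X, n):
    """All words of X* of length at most n."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {w for w in cat(frontier, X) if len(w) <= n} - result
        result |= frontier
    return result

n = 6
# aa* = a*a, both sides bounded at length n
assert cat({"a"}, star({"a"}, n - 1)) == cat(star({"a"}, n - 1), {"a"})

# (a + b)* = b*(ab*)*, both sides bounded at length n
lhs = star({"a", "b"}, n)
b_star = star({"b"}, n)
rhs = {w for w in cat(b_star, star(cat({"a"}, b_star), n)) if len(w) <= n}
assert lhs == rhs
```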
Proof of Arden's Rule
We are now ready to prove Arden's Rule, but it is worth mentioning that this is an important place where strong induction comes up.
Remember that the principle of induction states that if a subset \(S \subseteq \mathbb N\) contains \(0\) and is upwards-closed (meaning that \(n \in S\) implies \(n + 1 \in S\)), then \(S = \mathbb{N}\). There are actually two ways of establishing that such a set \(S\) is upwards-closed and contains \(0\):
- The first way is what you are used to: showing that \(0 \in S\) and also that if \(n \in S\), then \(n + 1 \in S\) as well. This way is ordinary induction.
- The second way is maybe more convoluted-feeling, but equivalent: the second way is to show that for any \(n \in \mathbb N\), if \(m \in S\) for all \(m < n\), then \(n \in S\). To see why this works, first note that there are no natural numbers \(m < 0\), so vacuously \(0 \in S\). Next, \(m \in S\) for all \(m < 1\), because \(0\) is the only natural number less than \(1\), so \(1 \in S\). Likewise, if \(0, 1, 2, \dots, n \in S\), then \(n + 1 \in S\) as well. This shows that \(S\) is upwards-closed, so by induction, \(S = \mathbb N\). This way is strong induction.
Let's set up some notation first: suppose \(r, s, t \in \mathit{RExp}\) satisfy \(s =_{\mathcal L} t + r \cdot s\), where \(r\) does not have the empty word property, and write \(L = \mathcal L(s)\), \(U = \mathcal L(r)\), and \(V = \mathcal L(t)\), so that \(L = V \cup U \cdot L\). We are going to show that (1) \(L \subseteq U^* \cdot V\) and (2) \(L \supseteq U^* \cdot V\).

Let's start by proving (1). Let \(w \in L\). We need to prove that \(w \in U^*\cdot V\). We are going to do this by strong induction on the length of \(w\).
A proof by strong induction starts with the induction hypothesis: suppose that for any word \(u \in L\) with \(|u| < |w|\), we have \(u \in U^* \cdot V\). We now need to show that it follows from this supposition that \(w \in U^* \cdot V\). Since \(L = V \cup U\cdot L\), \(w\) is of one of two forms: either \(w \in V\) or \(w \in U\cdot L\). This means there are two cases to consider.
- In the first case, \(w \in V\), and we need to show that \(w \in U^*\cdot V\). This follows from the definition of the Kleene star for a language: \(\varepsilon \in U^*\), so \(\varepsilon w \in U^* \cdot V\). Since \(\varepsilon w = w\), \(w \in U^*\cdot V\) and we are done with this case.
- In the second case, \(w \in U \cdot L\). In this case, \(w\) is of the form \(w = uv\) for some \(u \in U\) and \(v \in L\). Again, we need to argue that \(w \in U^*\cdot V\). But \(u \in U\) implies that \(u \neq \varepsilon\), because \(r\) does not have the empty word property and \(U = \mathcal L(r)\). This means that \(|u| > 0\), so \(|v| < |u| + |v| = |uv| = |w|\). By the induction hypothesis, \(v \in L\) and \(|v| < |w|\) together imply that \(v \in U^* \cdot V\). Therefore, there is a \(u' \in U^*\) and a \(v' \in V\) such that \(v = u'v'\). Then \(w = uu'v'\). Since \(u \in U \subseteq U^*\) and \(u' \in U^*\), the concatenation \(uu' \in U^*\) as well. It follows that \(w = uu'v' \in U^*\cdot V\), and we are done with this case.
Now we show (2), that \(L \supseteq U^* \cdot V\). Let \(w \in U^* \cdot V\). Then \(w = uv\) for some \(u\in U^*\) and \(v \in V\). We are going to show that \(uv \in L\) by strong induction on the length of \(u\). This means that our induction hypothesis is that \(u'v \in L\) for all \(u' \in U^*\) with \(|u'| < |u|\), and we must show that \(uv \in L\).
If \(u = \varepsilon\), then \(uv = v \in V \cup U\cdot L = L\), so \(w = uv \in L\). Otherwise, \(u = u_1u_2\) for some \(u_1 \in U\) and \(u_2 \in U^*\), where \(|u_1| > 0\) because \(r\) does not have the empty word property. This means that \(u_2v \in U^*\cdot V\) and \(|u_2| < |u|\), so the induction hypothesis tells us that \(u_2v \in L\). But concatenating with \(u_1\), we get \[ w = uv = u_1u_2v \in U \cdot L \subseteq V \cup U \cdot L = L \] so that \(w \in L\). This shows that \(L \supseteq U^* \cdot V\).
Since (1) \(L \subseteq U^* \cdot V\) and (2) \(L \supseteq U^* \cdot V\), we conclude that \(L = U^* \cdot V\). Therefore, \(s =_{\mathcal L} r^* \cdot t\).
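The fixed-point flavour of this proof can be mirrored computationally, at least up to a length bound: starting from \(\emptyset\) and repeatedly applying \(x \mapsto V \cup U \cdot x\) stabilises on \(U^* \cdot V\). A sketch with sample languages of my own choosing, where \(U = \mathcal L(ab + b)\) has no empty word and \(V = \mathcal L(c)\):

```python
def cat(X, Y):
    """Concatenation of languages: {xy | x in X, y in Y}."""
    return {x + y for x in X for y in Y}

def star(X, n):
    """All words of X* of length at most n."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {w for w in cat(frontier, X) if len(w) <= n} - result
        result |= frontier
    return result

n = 8
U, V = {"ab", "b"}, {"c"}  # ε not in U: r = ab + b has no empty word property

# Iterate L ↦ V ∪ U·L starting from ∅ until it stabilises (on words ≤ n).
L = set()
while True:
    step = {w for w in V | cat(U, L) if len(w) <= n}
    if step == L:
        break
    L = step

expected = {w for w in cat(star(U, n), V) if len(w) <= n}  # U*·V up to length n
assert L == expected
```

Because every word of \(U\) is non-empty, each iteration only needs strictly shorter members of \(L\), which is exactly the role the length-based strong induction plays in the proof.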