Regular Expressions and Languages
Before, we mentioned that automata are like machines, and states are like programs. What we are missing so far in this course is a coding language for these programs. That's what we are going to talk about for the next few lectures: a coding language for states in automata. Most of you have seen regular expressions at this point in your career. The regular expressions that theoretical computer scientists study are quite a bit different: they are like the regular expressions you know, except that they lack extras like counting instances of a string.
It turns out that if you remove all the bells and whistles from the regular expressions you know and love, you get a coding language that is perfectly suited to programming states of finite automata. Today we're going to step away from finite automata for a moment and just get familiar with regular expressions on their own.
Regular Expression Syntax
The set of regular expressions has a recursive definition that tells us how they can be formed. This involves a bunch of rules for how to form new regular expressions from old ones. Formally, the definition looks like this:
- The symbol \(\varepsilon\) is a regular expression, \(\varepsilon \in \mathit{RExp}\).
- The symbol \(\emptyset\) is a regular expression, \(\emptyset \in \mathit{RExp}\).
- For any input letter \(a \in A\), the symbol \(a\) is a regular expression, \(a \in \mathit{RExp}\).
- If \(r_1, r_2 \in \mathit{RExp}\) are regular expressions, then so is \((r_1 + r_2) \in \mathit{RExp}\).
- If \(r_1, r_2 \in \mathit{RExp}\) are regular expressions, then so is \((r_1 \cdot r_2) \in \mathit{RExp}\).
- If \(r \in \mathit{RExp}\) is a regular expression, then so is \((r)^* \in \mathit{RExp}\).
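For example, assuming \(a, b \in A\), here is how the rules build up a larger regular expression step by step:
- \(a\), \(\emptyset\), \(\varepsilon\), and \(b\) are regular expressions.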
- Since \(a\) and \(\emptyset\) are regular expressions, \((a + \emptyset)\) is a regular expression.
- Since \((a + \emptyset)\) is a regular expression, \(((a + \emptyset))^*\) is a regular expression.
- Since \(((a + \emptyset))^*\) and \(\varepsilon\) are regular expressions, \((((a + \emptyset))^* \cdot \varepsilon)\) is a regular expression.
- Since \((((a + \emptyset))^* \cdot \varepsilon)\) and \(b\) are regular expressions, \(((((a + \emptyset))^* \cdot \varepsilon) + b)\) is a regular expression.
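The recursive definition can be mirrored directly as a small data type, one constructor per rule. This is only a sketch: the class names `Empty`, `Eps`, `Letter`, `Plus`, `Dot`, `Star` and the function `show` are made up for illustration and are not notation from this course.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Empty:          # the symbol ∅
    pass


@dataclass(frozen=True)
class Eps:            # the symbol ε
    pass


@dataclass(frozen=True)
class Letter:         # an input letter a ∈ A
    name: str


@dataclass(frozen=True)
class Plus:           # (r1 + r2)
    left: object
    right: object


@dataclass(frozen=True)
class Dot:            # (r1 · r2)
    left: object
    right: object


@dataclass(frozen=True)
class Star:           # (r)*
    body: object


def show(r):
    """Write r out with every bracket the formal definition asks for."""
    if isinstance(r, Empty):
        return "∅"
    if isinstance(r, Eps):
        return "ε"
    if isinstance(r, Letter):
        return r.name
    if isinstance(r, Plus):
        return f"({show(r.left)} + {show(r.right)})"
    if isinstance(r, Dot):
        return f"({show(r.left)} · {show(r.right)})"
    if isinstance(r, Star):
        return f"({show(r.body)})*"
    raise TypeError(f"not a regular expression: {r!r}")


# The same four construction steps as in the example above.
step1 = Plus(Letter("a"), Empty())
step2 = Star(step1)
step3 = Dot(step2, Eps())
step4 = Plus(step3, Letter("b"))
print(show(step4))  # ((((a + ∅))* · ε) + b)
```

Every value you can build from these constructors is a regular expression, and every regular expression can be built this way. That is all the recursive definition says.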
There are so many brackets! We can get rid of them, in much the same way as we know how to get rid of brackets when we're doing arithmetic. Like, what is \[ 7 + 8 \times 4^2 = \text{?} \] I would use BEDMAS to write in the brackets that should be there, if I were being really careful: \[ 7 + 8 \times 4^2 = (7 + (8 \times (4)^2)) \] The order of precedence for the arithmetic operations is \((-) > {-}^2 > \times > +\), so we bracket them in that order. We can do the same thing with regular expressions: the order of precedence is \[ (-) > {-}^* > {\cdot} > + \] This allows us to remove brackets in the expressions: \[ (a + \emptyset)^* \cdot \varepsilon + b = ((((a + \emptyset))^* \cdot \varepsilon) + b) \] Much better!
- \(a + b \cdot c ^* \)
- \((a + b)^* \cdot c ^* + b \)
- \(a \cdot b^* + c ^* + b \)
The last convention we are going to assume here is that when the brackets are ambiguous, we always "associate to the right". For example, \(a + b + c = a + (b + c)\), and \(a\cdot b \cdot c = a \cdot (b \cdot c)\).
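To see the precedence and association conventions at work, here is a rough sketch of a parser that reads an expression with brackets left out and writes back the fully bracketed version. The function names `parse` and `brackets`, and the choice to encode expressions as nested tuples, are assumptions made for this illustration. The sketch expects single-character letters and an explicitly written `·`.

```python
def parse(text):
    """Turn a regular expression with omitted brackets into a nested tuple."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():                     # lowest precedence: +
        left = term()
        if peek() == "+":
            take()
            return ("+", left, expr())   # recursing on the right: a + (b + c)
        return left

    def term():                     # next: ·
        left = factor()
        if peek() == "·":
            take()
            return ("·", left, term())   # recursing on the right: a · (b · c)
        return left

    def factor():                   # next: *
        node = atom()
        while peek() == "*":
            take()
            node = ("*", node)
        return node

    def atom():                     # highest: brackets and single symbols
        tok = take()
        if tok == "(":
            node = expr()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        if tok is not None and (tok.isalpha() or tok == "∅"):
            return tok              # a letter, ε, or ∅
        raise ValueError(f"unexpected token {tok!r}")

    tree = expr()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree


def brackets(tree):
    """Write a parsed expression back out with all brackets restored."""
    if isinstance(tree, str):
        return tree
    if tree[0] == "*":
        return f"({brackets(tree[1])})*"
    op, left, right = tree
    return f"({brackets(left)} {op} {brackets(right)})"


print(brackets(parse("(a + ∅)* · ε + b")))  # ((((a + ∅))* · ε) + b)
print(brackets(parse("a + b + c")))         # (a + (b + c))
print(brackets(parse("a · b · c")))         # (a · (b · c))
```

Each level of the precedence order gets its own function, and a tighter-binding operation is parsed by a function called deeper in the chain. Association to the right comes from recursing on the right-hand side of `+` and `·`.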
Regular Languages
So far, regular expressions are just meaningless symbols. We need to give them a semantics. The semantics of a regular expression is the language it represents.
- \(\mathcal L(\emptyset) = \{\}\)
- \(\mathcal L(\varepsilon) = \{\varepsilon\}\)
- for each \(a \in A\), \(\mathcal L(a) = \{a\}\)
- Given regular expressions \(r_1, r_2 \in \mathit{RExp}\), \(\mathcal L(r_1 + r_2) = \mathcal L(r_1) \cup \mathcal L(r_2)\).
- Given regular expressions \(r_1, r_2 \in \mathit{RExp}\), \[ \mathcal L(r_1 \cdot r_2) = \mathcal L(r_1) \cdot \mathcal L(r_2) = \{wu \in A^* \mid w \in \mathcal L(r_1) \text{ and } u \in \mathcal L(r_2)\} \]
- Given a regular expression \(r \in \mathit{RExp}\), \[ \mathcal L(r^*) = \mathcal L(r)^* = \{w_1w_2\cdots w_n \in A^* \mid n \in \mathbb N \text{ and } w_1,w_2,\dots, w_n \in \mathcal L(r)\} \] (taking \(n = 0\) gives the empty word \(\varepsilon\)).
These operations have names:
- \(+\) is called union (as usual),
- \(\cdot\) is called sequential composition (or just composition),
- \(-^*\) is called Kleene star (or just star).
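The semantics is itself a recursive definition, so it translates almost line by line into a recursive function. Languages can be infinite, so the sketch below only collects the words up to a chosen length. The names `language`, `concat`, and `by_length`, the cut-off `max_len`, and the nested-tuple encoding of expressions are all assumptions made for this illustration.

```python
def concat(l1, l2, max_len):
    """All words wu with w in l1 and u in l2, keeping only the short ones."""
    return {w + u for w in l1 for u in l2 if len(w + u) <= max_len}


def language(r, max_len):
    """The words of length at most max_len in L(r).

    r is "∅", "ε", a single letter, or a tuple ("+", r1, r2),
    ("·", r1, r2), or ("*", r1).
    """
    if r == "∅":
        return set()
    if r == "ε":
        return {""}                       # "" plays the role of the empty word
    if isinstance(r, str):
        return {r} if len(r) <= max_len else set()
    if r[0] == "+":
        return language(r[1], max_len) | language(r[2], max_len)
    if r[0] == "·":
        return concat(language(r[1], max_len), language(r[2], max_len), max_len)
    if r[0] == "*":
        base = language(r[1], max_len)
        result = {""}                     # zero pieces: the empty word
        frontier = {""}
        while frontier:                   # keep appending one more piece
            frontier = concat(frontier, base, max_len) - result
            result |= frontier
        return result
    raise ValueError(f"not a regular expression: {r!r}")


def by_length(words):
    return sorted(words, key=lambda w: (len(w), w))


print(by_length(language(("*", "a"), 3)))              # ['', 'a', 'aa', 'aaa']
print(by_length(language(("*", ("+", "a", "b")), 2)))  # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
print(by_length(language(("·", "a", "∅"), 3)))         # []
```

The last line is worth a second look: composing with \(\emptyset\) gives the empty language, because there is no word \(u \in \mathcal L(\emptyset)\) to put after the \(a\).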
Now let's take a look at the star. The star takes whatever words are in the language and forms all of the different concatenations of them (including the empty word). So, since \(\mathcal L(a) = \{a\}\) has only one word, \[ \mathcal L(a^*) = \{\varepsilon, a, aa, aaa, aaaa, a^5, a^6, \dots\} = \{a^n \mid n \in \mathbb N\} \] It gets more complicated when there are multiple words in the language. For example, \[ \mathcal L((a + b)^*) = \mathcal L(a + b)^* = \{a, b\}^* = \{\varepsilon, a, b, aa, ab, ba, bb, aaa, aab, aba, \dots\} \]
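One way to organize the star is to group words by how many pieces they are made of. The power notation \(L^n\) below is a standard shorthand, but it is an assumption here, not something we have introduced in these notes. For a language \(L\), \[ L^0 = \{\varepsilon\}, \qquad L^{n+1} = L \cdot L^n, \qquad L^* = \bigcup_{n \in \mathbb N} L^n \] For \(L = \{a, b\}\), the first few layers are \[ L^0 = \{\varepsilon\}, \qquad L^1 = \{a, b\}, \qquad L^2 = \{aa, ab, ba, bb\} \] and taking the union of all the layers gives back the list of words in \(\{a, b\}^*\) above.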
One more thing: kind of like how we never actually write \(\times\) when we're multiplying, we never really write \(\cdot\) when we are sequentially composing. So you'll often see expressions like \(a^*b\), which if formally written would be \(a^* \cdot b\). We are only going to write \(\cdot\) when it makes something a lot easier to read.
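As a quick worked example, here is the language of \(a^*b\), unfolded one rule of the semantics at a time: \[ \mathcal L(a^*b) = \mathcal L(a^* \cdot b) = \mathcal L(a^*) \cdot \mathcal L(b) = \{a^n \mid n \in \mathbb N\} \cdot \{b\} = \{a^nb \mid n \in \mathbb N\} = \{b, ab, aab, aaab, \dots\} \]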
- \(a + \varepsilon\)
- \(b(a + \varepsilon)\)
- \((ab)^*\)
- \(ab^*\)
- \(\varepsilon^*\)
- \((\varepsilon + a)^*\)
So far, we have shown how to turn regular expressions into languages directly. But our original goal was a coding language for automata, not a notation for languages. This discrepancy introduces yet another family of languages into the picture: the family \(\mathsf{Reg}\) of regular languages, meaning the languages of the form \(\mathcal L(r)\) for some regular expression \(r\).
At this point, we have no idea how automata and regular languages are related. Is \(\mathsf{Fin} \subseteq \mathsf{Reg}\)? What about \(\mathsf{Fin} \supseteq \mathsf{Reg}\)? The first main result we will prove in this course is that, in fact, these two families are the same. This is called Kleene's Theorem, but it will take several days before we are ready to write down an actual proof.
- \(L_1 = \{a, \varepsilon\}\)
- \(L_2 = \{ba, ab\}\)
- \(L_3 = \{ba^n \mid n \in \mathbb N\}\)
- \(L_4 = A^*\)
- \(L_5 = \{w \in A^*\mid w \text{ contains the word } abba\}\)
- \(L_6 = \mathcal L(b^*a(a + b)^*) \cap \mathcal L(a^*b(a + b)^*)\)
- \(L_7 = A^* \setminus L_6\)