CSCI 341 Theory of Computation

Fall 2025, with Schmid


Regular Expressions and Languages

Before, we mentioned that automata are like machines, and states are like programs. What we are missing so far in this course is a coding language for these programs. That is what the next few lectures are about: a coding language for states in automata. Most of you have seen regular expressions at this point in your career. The regular expressions that theoretical computer scientists study are quite a bit different: they are like the regular expressions you know, except they can't do anything like count instances of a string.

It turns out that if you remove all the bells and whistles from the regular expressions you know and love, you get a coding language that is perfectly suited to programming states of finite automata. Today we're going to step away from finite automata for a moment and just get familiar with regular expressions on their own.

Regular Expression Syntax

The set of regular expressions has a recursive definition that tells us how they can be formed. This involves a bunch of rules for how to form new regular expressions from old ones. Formally, the definition looks like this:

(Regular Expressions) Let \(A\) be an alphabet of input symbols. The set of regular expressions over \(A\) is written \(\mathit{RExp}\) and is defined by the following formation rules:

The symbol \(\varepsilon\) is a regular expression, \(\varepsilon \in \mathit{RExp}\).
The symbol \(\emptyset\) is a regular expression, \(\emptyset \in \mathit{RExp}\).
For any input letter \(a \in A\), the symbol \(a\) is a regular expression, \(a \in \mathit{RExp}\).

The next few formation rules are the recursive ones: each rule builds a new regular expression from old ones.

If \(r_1, r_2 \in \mathit{RExp}\) are regular expressions, then so is \((r_1 + r_2) \in \mathit{RExp}\).
If \(r_1, r_2 \in \mathit{RExp}\) are regular expressions, then so is \((r_1 \cdot r_2) \in \mathit{RExp}\).
If \(r \in \mathit{RExp}\) is a regular expression, then so is \((r)^* \in \mathit{RExp}\).

The elements of the set \(\mathit{RExp}\) are called regular expressions.

(Some Regular Expressions) The string of symbols \( ((((a + \emptyset))^* \cdot \varepsilon) + b)\) is a regular expression (over the alphabet \(A = \{a, b\}\)). You can form this expression as follows:

\(a\), \(\emptyset\), \(\varepsilon\), and \(b\) are regular expressions.
Since \(a\) and \(\emptyset\) are regular expressions, \((a + \emptyset)\) is a regular expression.
Since \((a + \emptyset)\) is a regular expression, \(((a + \emptyset))^*\) is a regular expression.
Since \(((a + \emptyset))^*\) and \(\varepsilon\) are regular expressions, \((((a + \emptyset))^* \cdot \varepsilon)\) is a regular expression.
Since \((((a + \emptyset))^* \cdot \varepsilon)\) and \(b\) are regular expressions, \(((((a + \emptyset))^* \cdot \varepsilon) + b)\) is a regular expression.

(Practicing the Rules) Is \((a + b\) a regular expression? What about \(b + (\emptyset)\)? Form five more regular expressions using the formation rules for regular expressions.

There are so many brackets! We can get rid of them, in much the same way as we know how to get rid of brackets when we're doing arithmetic. Like, what is \[ 7 + 8 \times 4^2 = \text{?} \] I would use BEDMAS to write in the brackets that should be there, if I were being really careful: \[ 7 + 8 \times 4^2 = (7 + (8 \times (4)^2)) \] The order of precedence for the arithmetic operations is \((-) > {-}^2 > \times > +\), so we bracket them in that order. We can do the same thing with regular expressions: the order of precedence is \[ (-) > {-}^* > {\cdot} > + \] This allows us to remove brackets in the expressions: \[ (a + \emptyset)^* \cdot \varepsilon + b = ((((a + \emptyset))^* \cdot \varepsilon) + b) \] Much better!

(Getting in Formation) Put the brackets back into the following expressions.

\(a + b \cdot c ^* \)
\((a + b)^* \cdot c ^* + b \)
\(a \cdot b^* + c ^* + b \)

The last convention we are going to assume here is that when the order of precedence alone does not determine where the brackets go, we always "associate to the right". For example, \(a + b + c = a + (b + c)\), and \(a\cdot b \cdot c = a \cdot (b \cdot c)\).
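
The precedence and right-association conventions can be sketched as a small printer that leaves out brackets the conventions make redundant. This is only an illustration: the nested-tuple encoding and the names `PREC`, `ATOM`, `wrap`, and `pretty` are assumptions made for the example.

```python
# Assumed encoding: ("eps",), ("empty",), ("letter", a),
# ("plus", r1, r2), ("dot", r1, r2), ("star", r).
PREC = {"plus": 1, "dot": 2, "star": 3}   # larger number binds tighter
ATOM = 4                                  # ε, ∅, and letters never need brackets

def wrap(r, needed):
    """Print r, adding brackets only if r binds more loosely than `needed`."""
    text = pretty(r)
    return f"({text})" if PREC.get(r[0], ATOM) < needed else text

def pretty(r):
    tag = r[0]
    if tag == "eps":
        return "ε"
    if tag == "empty":
        return "∅"
    if tag == "letter":
        return r[1]
    if tag == "plus":
        # Right association: a nested + on the right needs no brackets,
        # but a nested + on the left keeps them.
        return f"{wrap(r[1], 2)} + {wrap(r[2], 1)}"
    if tag == "dot":
        # Same idea for composition, one precedence level higher.
        return f"{wrap(r[1], 3)} · {wrap(r[2], 2)}"
    if tag == "star":
        # Anything other than a single symbol is bracketed under a star.
        return f"{wrap(r[1], ATOM)}*"
    raise ValueError(f"not a regular expression: {r!r}")

a, b, c = ("letter", "a"), ("letter", "b"), ("letter", "c")
example = ("plus", ("dot", ("star", ("plus", a, ("empty",))), ("eps",)), b)
right = ("plus", a, ("plus", b, c))       # a + (b + c)
left = ("plus", ("plus", a, b), c)        # (a + b) + c
```

Here `pretty(example)` gives `(a + ∅)* · ε + b`, `pretty(right)` gives `a + b + c`, and `pretty(left)` keeps its brackets as `(a + b) + c`, because dropping them would change which expression is meant.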

Regular Languages

So far, regular expressions are just meaningless symbols. We need to give them a semantics. The semantics of a regular expression is the language it represents.

(Language Semantics) Recall that \(\mathit{RExp}\) is the set of regular expressions and \(2^{A^*}\) is the set of languages over an alphabet \(A\). We define the language semantics \(\mathcal L(r) \in 2^{A^*}\) of a regular expression \(r \in \mathit{RExp}\) recursively using the formation rules.

\(\mathcal L(\emptyset) = \{\}\)
\(\mathcal L(\varepsilon) = \{\varepsilon\}\)
for each \(a \in A\), \(\mathcal L(a) = \{a\}\)

The next few rules are the recursive rules.

Given regular expressions \(r_1, r_2 \in \mathit{RExp}\), \(\mathcal L(r_1 + r_2) = \mathcal L(r_1) \cup \mathcal L(r_2)\).
Given regular expressions \(r_1, r_2 \in \mathit{RExp}\), \[ \mathcal L(r_1 \cdot r_2) = \mathcal L(r_1) \cdot \mathcal L(r_2) = \{wu \in A^* \mid w \in \mathcal L(r_1) \text{ and } u \in \mathcal L(r_2)\} \]
Given a regular expression \(r \in \mathit{RExp}\), \[ \mathcal L(r^*) = \mathcal L(r)^* = \{w_1w_2\cdots w_n \in A^* \mid n \in \mathbb N \text{ and } w_1,w_2,\dots, w_n \in \mathcal L(r)\} \] where the case \(n = 0\) contributes the empty word \(\varepsilon\).

These operations have names:

\(+\) is called union (as usual),
\(\cdot\) is called sequential composition (or just composition),
\(-^*\) is called Kleene star (or just star).

The Kleene star is named after Stephen Cole Kleene, who discovered regular expressions in the 1950s.

(Some Small Concatenations and Stars) Let's do some small calculations using the definition of language semantics. First of all, let's start with letters \(a,b \in A\). Then the definition of the language semantics of \(a\) (as a regular expression) is \(\mathcal L(a) = \{a\}\), and similarly for \(b\). Therefore, \[ \mathcal L(a + b) = \mathcal L(a) \cup \mathcal L(b) = \{a\} \cup \{b\} = \{a, b\} \] We can compute in the same manner the language \[ \mathcal L(a \cdot b) = \mathcal L(a)\cdot\mathcal L(b) = \{a\} \cdot \{b\} = \{ab\} \] That last equation comes from the definition: the words in \(\mathcal L(a) \cdot \mathcal L(b)\) are all the words \(wu\) such that \(w \in \mathcal L(a)\) and \(u \in \mathcal L(b)\). These languages have one word each, \(a\) and \(b\) respectively, so there is only the concatenation \(wu = ab\) in \(\mathcal L(a)\cdot \mathcal L(b)\).

Now let's take a look at the star. The star takes whatever words are in the language and forms all of the different concatenations of them (including the empty word). So, since \(\mathcal L(a) = \{a\}\) has only one word, \[ \mathcal L(a^*) = \{\varepsilon, a, aa, aaa, aaaa, a^5, a^6, \dots\} = \{a^n \mid n \in \mathbb N\} \] It gets more complicated when there are multiple words in the language. For example, \[ \mathcal L((a + b)^*) = \mathcal L(a + b)^* = \{a, b\}^* = \{\varepsilon, a, b, aa, ab, ba, bb, aaa, aab, aba, \dots\} \]
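
The calculations above can be checked mechanically. Languages like \(\mathcal L(a^*)\) are infinite, so the sketch below only lists the words of length at most `n`. It follows the recursive definition of \(\mathcal L\) rule by rule, but the nested-tuple encoding, the function name `lang`, the length bound, and the use of the Python string `""` for \(\varepsilon\) are all assumptions made for the illustration. Letters are assumed to be single characters.

```python
# Assumed encoding: ("eps",), ("empty",), ("letter", a),
# ("plus", r1, r2), ("dot", r1, r2), ("star", r).
def lang(r, n):
    """Return the set of words of length at most n in L(r)."""
    tag = r[0]
    if tag == "empty":
        return set()
    if tag == "eps":
        return {""}                        # "" plays the role of ε
    if tag == "letter":
        return {r[1]} if len(r[1]) <= n else set()
    if tag == "plus":
        return lang(r[1], n) | lang(r[2], n)
    if tag == "dot":
        # All concatenations wu with w from the left and u from the right.
        return {w + u
                for w in lang(r[1], n)
                for u in lang(r[2], n)
                if len(w + u) <= n}
    if tag == "star":
        base = lang(r[1], n)
        words = {""}                       # the concatenation of zero words
        frontier = {""}
        while frontier:
            # Extend the newest words by one more word from the base language.
            frontier = {w + u
                        for w in frontier
                        for u in base
                        if len(w + u) <= n} - words
            words |= frontier
        return words
    raise ValueError(f"not a regular expression: {r!r}")

a, b = ("letter", "a"), ("letter", "b")
union = lang(("plus", a, b), 3)            # {"a", "b"}
composition = lang(("dot", a, b), 3)       # {"ab"}
a_star = lang(("star", a), 3)              # {"", "a", "aa", "aaa"}
ab_star = lang(("star", ("plus", a, b)), 2)
```

The last value, `ab_star`, contains the seven words of length at most 2 over \(\{a, b\}\): the empty word, `a`, `b`, `aa`, `ab`, `ba`, and `bb`.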

One more thing: kind of like how we never actually write \(\times\) when we're multiplying, we never really write \(\cdot\) when we are sequentially composing. So you'll often see expressions like \(a^*b\), which if formally written would be \(a^* \cdot b\). We are only going to write \(\cdot\) when it makes something a lot easier to read.
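
For comparison with the regular expressions you may already know from programming, here is a rough sketch of how the six constructs could be written in the syntax of Python's `re` module, where union is `|` and composition is also written by juxtaposition. The translation and the names `to_python_re` and `pattern` are assumptions made for this illustration. Python's `re` has many features beyond these six constructs, and none of them are used here.

```python
import re

# Assumed encoding: ("eps",), ("empty",), ("letter", a),
# ("plus", r1, r2), ("dot", r1, r2), ("star", r).
def to_python_re(r):
    """Translate a regular expression into a pattern string for `re`."""
    tag = r[0]
    if tag == "empty":
        return "(?!)"                      # a lookahead that always fails
    if tag == "eps":
        return "(?:)"                      # an empty group: matches only ""
    if tag == "letter":
        return re.escape(r[1])
    if tag == "plus":
        return f"(?:{to_python_re(r[1])}|{to_python_re(r[2])})"
    if tag == "dot":
        return f"(?:{to_python_re(r[1])}{to_python_re(r[2])})"
    if tag == "star":
        return f"(?:{to_python_re(r[1])})*"
    raise ValueError(f"not a regular expression: {r!r}")

a, b = ("letter", "a"), ("letter", "b")
pattern = to_python_re(("dot", ("star", a), b))   # the expression a*b

def accepts(pattern, word):
    """Check whether the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None
```

With this sketch, `accepts(pattern, "aaab")` and `accepts(pattern, "b")` are true, while `accepts(pattern, "aba")` is false. Note the use of `re.fullmatch`: the whole word has to match, not just a part of it.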

(Let's Compute some Languages!) Compute the language semantics of the following regular expressions over the alphabet \(A = \{a,b,c\}\). In other words, describe the language semantics explicitly or using set comprehension.

\(a + \varepsilon\)
\(b(a + \varepsilon)\)
\((ab)^*\)
\(ab^*\)
\(\varepsilon^*\)
\((\varepsilon + a)^*\)

So far, we have shown how to turn regular expressions into languages directly. But our original goal was a coding language for automata, not for languages. This discrepancy introduces yet another family of languages into the picture.

(Regular Language) Let \(L \subseteq A^*\) be a language. Then \(L\) is regular if there is a regular expression \(r \in \mathit{RExp}\) such that \(L = \mathcal L(r)\). The family of regular languages is \(\mathsf{Reg}\).

At this point, we have no idea how automata and regular languages are related. Is \(\mathsf{Fin} \subseteq \mathsf{Reg}\)? What about \(\mathsf{Fin} \supseteq \mathsf{Reg}\)? The first main result we will prove in this course is that, in fact, these two families are the same. This is called Kleene's Theorem, but it will take several days before we are ready to write down an actual proof.

(Finding Expressions) Show that the following languages are regular over the alphabet \(A = \{a, b, c\}\).

\(L_1 = \{a, \varepsilon\}\)
\(L_2 = \{ba, ab\}\)
\(L_3 = \{ba^n \mid n \in \mathbb N\}\)
\(L_4 = A^*\)
\(L_5 = \{w \in A^*\mid w \text{ contains the word } abba\}\)

(Intersections and Complements) Show that the following two languages are regular over \(A = \{a, b\}\).

\(L_6 = \mathcal L(b^*a(a + b)^*) \cap \mathcal L(a^*b(a + b)^*)\)
\(L_7 = A^* \setminus L_6\)

Start by describing these languages explicitly. The first one has something to do with the numbers of \(a\)s and \(b\)s in the word.
