Antimirov Derivatives
We are currently working toward using regular expressions as a coding language for automata, specifically finite automata. Today, we are going to show that every regular expression represents a state in a finite automaton. We are going to start by defining an infinite automaton, called the Antimirov automaton \(\mathcal A_{Ant}\), whose states are regular expressions and such that \(\mathcal L(\mathcal A_{Ant}, r) = \mathcal L(r)\) for every regular expression \(r \in \mathit{RExp}\). This is analogous to the Brzozowski automaton, whose states were languages. Except in this case, the states are expressions!
Antimirov Derivatives
When we built the Brzozowski automaton, we needed to define what was called the Brzozowski derivative of a language, \(a^{-1}L\) for each \(a \in A\) and \(L \in 2^{A^*}\). To define the Antimirov automaton, we have to do something similar---we are going to define a notion of derivative for regular expressions. However, there is an important difference between the Brzozowski automaton and the Antimirov automaton: the Brzozowski automaton is total and deterministic, but the Antimirov automaton is neither. In other words, a regular expression can have multiple Antimirov derivatives.
Here is the formal definition:
Starting with \(F\), the set of accepting states, we define
- \(\emptyset \notin F\) and \(a \notin F\) for any \(a \in A\)
- \(\varepsilon \in F\)
- for any regular expression \(r\), we have \(r^* \in F\)
- \((r_1 + r_2) \in F\) if and only if \(r_1 \in F\) or \(r_2 \in F\)
- \((r_1 \cdot r_2) \in F\) if and only if \(r_1 \in F\) and \(r_2 \in F\)
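The rules for \(F\) above translate directly into a recursive check. The following is a minimal sketch, assuming a hypothetical encoding of regular expressions as nested Python tuples; the names `EMPTY`, `EPS`, `sym`, `alt`, `cat`, `star`, and `accepting` are illustrative and not part of these notes.

```python
# Hypothetical tuple encoding of regular expressions (an assumption, not from the notes).
EMPTY = ("empty",)                       # the expression for the empty language
EPS = ("eps",)                           # the expression for the empty word
def sym(a): return ("sym", a)            # a single letter a
def alt(r1, r2): return ("alt", r1, r2)  # r1 + r2
def cat(r1, r2): return ("cat", r1, r2)  # r1 . r2
def star(r): return ("star", r)          # r*

def accepting(r):
    """Return True exactly when r is in F, following the rules listed above."""
    tag = r[0]
    if tag in ("empty", "sym"):   # the empty language and single letters are never in F
        return False
    if tag in ("eps", "star"):    # the empty word and every star are in F
        return True
    if tag == "alt":              # a sum is in F iff one of its summands is
        return accepting(r[1]) or accepting(r[2])
    if tag == "cat":              # a product is in F iff both factors are
        return accepting(r[1]) and accepting(r[2])
    raise ValueError(f"unknown expression: {r!r}")
```

For example, `accepting(cat(star(sym("a")), alt(EPS, sym("b"))))` holds because both factors are in \(F\), while `accepting(cat(sym("a"), star(sym("b"))))` fails because the letter \(a\) is not.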
Now let's define \(\delta\). We are going to use the reading notation from before to define \(\delta\), where we wrote \(\delta(x, w)\) to denote the set of states that are active after reading a word \(w\) from \(x\). In the Antimirov automaton, states are regular expressions, so in each of the equations below describing \(\delta\), you will see \(\delta(r, a)\). Here, \(r\) is a regular expression and \(a\) is a letter.
Let's get into it: for any \(a \in A\), we define
- \(\delta(\emptyset, a) = \{\}\)
- \(\delta(\varepsilon, a) = \{\}\)
- \(\delta(a, a) = \{\varepsilon\}\)
- \(\delta(a, b) = \{\}\) where \(b \in A\) and \(b \neq a\)
- \(\delta(r_1 + r_2, a) = \delta(r_1, a) \cup \delta(r_2, a)\)
- \(\delta(r_1 \cdot r_2, a) = \{s \cdot r_2 \mid s \in \delta(r_1, a)\} \cup \{s \mid r_1 \in F \text{ and } s \in \delta(r_2, a)\}\)
- \(\delta(r^*, a) = \{s \cdot r^* \mid s \in \delta(r, a)\}\)
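The equations for \(\delta\) can be read as a recursive function returning a set of expressions. Here is a sketch under the same assumptions as before: the tuple encoding and the function names `accepting` and `delta` are hypothetical choices made for illustration, and expressions are compared syntactically.

```python
# Hypothetical tuple encoding of regular expressions (an assumption, not from the notes).
EMPTY, EPS = ("empty",), ("eps",)
def sym(a): return ("sym", a)
def alt(r1, r2): return ("alt", r1, r2)
def cat(r1, r2): return ("cat", r1, r2)
def star(r): return ("star", r)

def accepting(r):
    """Membership in F, as defined earlier."""
    tag = r[0]
    if tag == "alt":
        return accepting(r[1]) or accepting(r[2])
    if tag == "cat":
        return accepting(r[1]) and accepting(r[2])
    return tag in ("eps", "star")

def delta(r, a):
    """Return the set of Antimirov derivatives of r with respect to the letter a."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return set()
    if tag == "sym":
        return {EPS} if r[1] == a else set()
    if tag == "alt":
        return delta(r[1], a) | delta(r[2], a)
    if tag == "cat":
        out = {cat(s, r[2]) for s in delta(r[1], a)}
        if accepting(r[1]):          # r1 in F: derivatives of r2 are also allowed
            out |= delta(r[2], a)
        return out
    if tag == "star":
        return {cat(s, r) for s in delta(r[1], a)}
    raise ValueError(f"unknown expression: {r!r}")
```

With letters \(a, b, c\), the call `delta(alt(cat(sym("a"), sym("b")), cat(sym("a"), sym("c"))), "a")` returns a set with two elements, `cat(EPS, sym("b"))` and `cat(EPS, sym("c"))`, which is one way to see that a regular expression can have more than one derivative.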
Antimirov derivatives are named after Valentin M. Antimirov, who discovered them in the 1990s (although they implicitly appear in a paper of Robin Milner's from the 1980s).
To reiterate: states of \(\mathcal A_{Ant}\) are regular expressions. The transitions of \(\mathcal A_{Ant}\) are therefore transitions between regular expressions. And furthermore, like the Brzozowski automaton, transitions are given by derivatives. Except, in this case, derivatives look a bit different: regular expressions can have multiple derivatives. Before we get to the multiple-derivatives situation, let's look at a small example.
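For instance (the expression \(a \cdot b\) is chosen here purely for illustration, assuming \(a, b \in A\) are distinct letters), unravelling the definition of \(\delta\) gives
\[\begin{aligned}
\delta(a \cdot b, a) &= \{s \cdot b \mid s \in \delta(a, a)\} \cup \{s \mid a \in F \text{ and } s \in \delta(b, a)\} = \{\varepsilon \cdot b\} \\
\delta(\varepsilon \cdot b, b) &= \{s \cdot b \mid s \in \delta(\varepsilon, b)\} \cup \{s \mid \varepsilon \in F \text{ and } s \in \delta(b, b)\} = \{\varepsilon\}
\end{aligned}\]
In the first line the second set is empty because \(a \notin F\); in the second line the first set is empty because \(\delta(\varepsilon, b) = \{\}\). So \(a \cdot b \xrightarrow{a} \varepsilon \cdot b \xrightarrow{b} \varepsilon\), and since \(\varepsilon \in F\), the word \(ab\) is accepted from \(a \cdot b\). By contrast, if \(c \in A\) is a third letter, the same calculation gives \(\delta(a \cdot b + a \cdot c, a) = \{\varepsilon \cdot b, \varepsilon \cdot c\}\), a set with two distinct derivatives.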
- \(r_0 = abb + bab + \emptyset\)
- \(r_1 = a(b + ab) + \varepsilon\)
- \(r_2 = (a(b + ab) + \varepsilon a)b\)
- \(r_3 = a^*\) (just unravel the definition of \(\delta(a^*, a)\) here, but be careful! There's a hidden \(\varepsilon\) somewhere...)
- \(r_4 = (ab)^*\)
- \(r_5 = (ab^*)^*\)
The Antimirov Fixed-point Theorem and \(\mathsf{Fin}\)
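The significance of the Antimirov automaton is twofold: first, it is yet again an automaton that witnesses languages "accepting themselves" (more precisely, each regular expression accepts its own language semantics); second, each regular expression reaches only finitely many states in it, with the number of states bounded linearly in the size of the expression.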
The proofs of the Antimirov Fixed-point Theorem and the Linear Bound on Antimirov Derivatives Lemma both proceed by induction on regular expressions. This is similar to vanilla induction and induction on words, but instead deals with regular expressions.
- (Base Case 1) \(\emptyset \in S\)
- (Base Case 2) \(\varepsilon \in S\)
- (Base Case 3) \(a \in S\) for all \(a \in A\)
- (Induction Step 1) If \(r_1,r_2 \in S\), then \(r_1 + r_2\in S\)
- (Induction Step 2) If \(r_1,r_2 \in S\), then \(r_1 \cdot r_2\in S\)
- (Induction Step 3) If \(r \in S\), then \(r^* \in S\)
The base cases are by calculation: since \(\emptyset\) has no outgoing transitions and \(\emptyset\) is not an accepting state of \(\mathcal A_{Ant}\), then \[ \mathcal L(\mathcal A_{Ant}, \emptyset) = \{\} = \mathcal L(\emptyset) \] Since \(\varepsilon\) has no outgoing transitions and \(\varepsilon\) is an accepting state of \(\mathcal A_{Ant}\), then \[ \mathcal L(\mathcal A_{Ant}, \varepsilon) = \{\varepsilon\} = \mathcal L(\varepsilon) \] For each \(a \in A\), the automaton generated by the regular expression \(a \in \mathit{RExp}\) in \(\mathcal A_{Ant}\) is \[ \framebox{\(a\)} \xrightarrow{a} \framebox{\(\framebox{\(\varepsilon\)}\)} \] The language accepted by \(a\) in \(\mathcal A_{Ant}\) is therefore \[ \mathcal L(\mathcal A_{Ant}, a) = \{a\} = \mathcal L(a) \] This concludes the base case.
Let \(r,r_1,r_2 \in \mathit{RExp}\), and assume for an induction hypothesis that \[ \mathcal L(\mathcal A_{Ant}, r) = \mathcal L(r) \quad \mathcal L(\mathcal A_{Ant}, r_1) = \mathcal L(r_1) \quad \mathcal L(\mathcal A_{Ant}, r_2) = \mathcal L(r_2) \] There are three induction steps.
- In the first induction step, we consider the regular expression \(r_1 + r_2\).
Since \(r_1 + r_2\) is an accepting state of \(\mathcal A_{Ant}\) if and only if either \(r_1\) is an accepting state or \(r_2\) is an accepting state,
\[\begin{aligned}
&\varepsilon \in \mathcal L(\mathcal A_{Ant}, r_1 + r_2) \\
&\text{ iff }
\varepsilon \in \mathcal L(\mathcal A_{Ant}, r_1) \cup \mathcal L(\mathcal A_{Ant}, r_2) \\
&\text{ iff }
\varepsilon \in \mathcal L(r_1) \cup \mathcal L(r_2) = \mathcal L(r_1 + r_2) &&\text{(IH)}
\end{aligned}\]
The last \(\text{iff}\) is the induction hypothesis.
This shows that the empty word is in \(\mathcal L(\mathcal A_{Ant}, r_1 + r_2)\) if and only if it is in \(\mathcal L(r_1 + r_2)\).
It now suffices to show that a nonempty word is in \(\mathcal L(\mathcal A_{Ant}, r_1 + r_2)\) if and only if it is in \(\mathcal L(r_1 + r_2)\). Every nonempty word is of the form \(aw\) for some \(a \in A\) and \(w \in A^*\). So, we can reason as follows: \[\begin{aligned} &aw \in \mathcal L(\mathcal A_{Ant}, r_1 + r_2) \\ &\text{ iff } r_1 + r_2 \xrightarrow{a} s \text{ and } w \in \mathcal L(\mathcal A_{Ant}, s) \text{ for some } s\\ &\text{ iff either } r_1 \xrightarrow{a} s \text{ or } r_2 \xrightarrow{a} s\text{, and } w \in \mathcal L(\mathcal A_{Ant}, s)\text{, for some } s\\ &\text{ iff } aw \in \mathcal L(\mathcal A_{Ant}, r_1) \text{ or } aw \in \mathcal L(\mathcal A_{Ant}, r_2) \\ &\text{ iff } aw \in \mathcal L(r_1) \cup \mathcal L(r_2) = \mathcal L(r_1 + r_2) &&\text{(IH)} \end{aligned}\] The last \(\text{iff}\) is the induction hypothesis.
- In the second induction step, we consider the regular expression \(r_1 \cdot r_2\).
We want to show that
\[
\mathcal L(\mathcal A_{Ant}, r_1 \cdot r_2)
= \mathcal L(r_1 \cdot r_2)
= \mathcal L(r_1) \cdot \mathcal L(r_2)
\]
We are going to prove the two inclusions separately.
In the forward inclusion, we want \[ \mathcal L(\mathcal A_{Ant}, r_1 \cdot r_2) \subseteq \mathcal L(r_1) \cdot \mathcal L(r_2) \qquad\text{(want!)} \] So, let \(w \in \mathcal L(\mathcal A_{Ant}, r_1 \cdot r_2)\) and write \(w = a_1 \cdots a_n\). Then there is a path \[ r_1 \cdot r_2 \xrightarrow{a_1} t_1 \xrightarrow{a_2} t_2 \xrightarrow{a_3} \cdots \xrightarrow{a_n} t_n \qquad \text{(*)} \] where \(t_n\) is an accepting state of \(\mathcal A_{Ant}\). There are two cases to consider: either (1) \(t_i = s_i \cdot r_2\) for all \(i=1,\dots,n\), and both \(s_n\) and \(r_2\) are accepting, or (2) there is a \(j \le n\) such that \(t_i = s_i \cdot r_2\) for \(i < j\), \(s_{j-1}\) is accepting (with the convention \(s_0 = r_1\)), and the rest of the path is a path \(r_2 \xrightarrow{a_j} t_j \xrightarrow{a_{j+1}} \cdots \xrightarrow{a_n} t_n\) out of \(r_2\).
In case (1), the path in \(\text{(*)}\) looks like this: \[ r_1 \cdot r_2 \xrightarrow{a_1} s_1 \cdot r_2 \xrightarrow{a_2} s_2 \cdot r_2 \xrightarrow{a_3} \cdots \xrightarrow{a_n} s_n \cdot r_2 \] In particular, this path tells us that the path below exists in \(\mathcal A_{Ant}\): \[ r_1 \xrightarrow{a_1} s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} \cdots \xrightarrow{a_n} s_n \] Since \(s_n\) is accepting, \(w \in \mathcal L(\mathcal A_{Ant}, r_1)\). By the induction hypothesis, \(\mathcal L(\mathcal A_{Ant}, r_1) = \mathcal L(r_1)\), so \(w \in \mathcal L(r_1)\). In case (1), \(r_2\) is also accepting, so \(\varepsilon \in \mathcal L(\mathcal A_{Ant}, r_2)\). Again, the induction hypothesis tells us that \(\varepsilon \in \mathcal L(r_2)\). Therefore, \(w = w\varepsilon \in \mathcal L(r_1) \cdot \mathcal L(r_2) = \mathcal L(r_1 \cdot r_2)\).
In case (2), similar to case (1), we find \(a_1\cdots a_{j-1} \in \mathcal L(\mathcal A_{Ant}, r_1)\) and \(a_j \cdots a_n \in \mathcal L(\mathcal A_{Ant}, r_2)\). By the induction hypothesis, we therefore find \[ w = (a_1\cdots a_{j-1})(a_{j} \cdots a_n) \in \mathcal L(r_1) \cdot \mathcal L(r_2) = \mathcal L(r_1\cdot r_2) \]
We have just shown that \(\mathcal L(\mathcal A_{Ant}, r_1 \cdot r_2) \subseteq \mathcal L(r_1) \cdot \mathcal L(r_2)\).
For the opposite inclusion, let \(w \in \mathcal L(r_1) \cdot \mathcal L(r_2)\). Then there are words \(u_1,u_2 \in A^*\) such that \(w = u_1u_2\), \(u_1 \in \mathcal L(r_1)\), and \(u_2 \in \mathcal L(r_2)\). By the induction hypothesis, \(u_1 \in \mathcal L(\mathcal A_{Ant}, r_1)\) and \(u_2 \in \mathcal L(\mathcal A_{Ant}, r_2)\). So, write \[ u_1 = a_1 \cdots a_n \quad u_2 = b_1 \cdots b_m \] Then by definition, there are paths \[\begin{aligned} &r_1 \xrightarrow{a_1} s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} \cdots \xrightarrow{a_n} s_n \\ &r_2 \xrightarrow{b_1} t_1 \xrightarrow{b_2} t_2 \xrightarrow{b_3} \cdots \xrightarrow{b_m} t_m \end{aligned}\] in \(\mathcal A_{Ant}\) such that \(s_n,t_m\) are accepting. By definition of \(\mathcal A_{Ant}\), and because \(s_n\) is accepting, we obtain the path \[\begin{aligned} &r_1 \cdot r_2 \xrightarrow{a_1} s_1\cdot r_2 \xrightarrow{a_2} \cdots \xrightarrow{a_n} s_n \cdot r_2 \xrightarrow{b_1} t_1 \xrightarrow{b_2} \cdots \xrightarrow{b_m} t_m \end{aligned}\] in \(\mathcal A_{Ant}\). This means that \[w = (a_1\cdots a_n)(b_1\cdots b_m) \in \mathcal L(\mathcal A_{Ant}, r_1 \cdot r_2)\]
We have just shown that \(\mathcal L(\mathcal A_{Ant}, r_1 \cdot r_2) \supseteq \mathcal L(r_1\cdot r_2)\), so this induction step is done.
- In the final induction step, we need to show that
\(
\mathcal L(\mathcal A_{Ant}, r^*) = \mathcal L(r)^*
\).
Let \(w = a_1 \cdots a_n\).
Every path of the form
\[
r^* \xrightarrow{a_1} t_1 \xrightarrow{a_2} \cdots \xrightarrow{a_n} t_n
\]
is actually of the form
\[
r^* \xrightarrow{a_1} s_1 \cdot r^* \xrightarrow{a_2} \cdots \xrightarrow{a_n} s_n \cdot r^*
\qquad \text{(**)}
\]
in \(\mathcal A_{Ant}\), by definition, because every Antimirov derivative of \(r^*\) is of the form \(s \cdot r^*\) for some \(s\).
Now, it's not necessarily true that we have a path
\[
r \xrightarrow{a_1} s_1 \xrightarrow{a_2} \cdots \xrightarrow{a_n} s_n
\]
because one might have that the transition \(s_{i - 1} \cdot r^* \xrightarrow{a_i} s_i \cdot r^*\) exists because \(r \xrightarrow{a_i} s_i\) and \(s_{i-1}\) is accepting.
But this is the only pathology here, so we can find indices \(1 < i_1 < i_2 < \cdots < i_k \le n\), marking the steps where this happens, such that
\[
s_{i_j-1} \text{ is accepting and }
r \xrightarrow{a_{i_j}} s_{i_j}
\qquad \text{(***)}
\]
Therefore, if the word \(w = a_1\cdots a_n\) is accepted by \(r^*\) in \(\mathcal A_{Ant}\) because of the path drawn above, then we can split \(w\) into
\[
w = (a_1 \cdots a_{i_1-1})(a_{i_1} \cdots a_{i_2-1}) \cdots (a_{i_k} \cdots a_n)
\]
such that \(\text{(***)}\) holds for each \(j\). (The empty word needs no splitting, since it lies in both languages.)
Set \(i_0 = 1\) and \(i_{k+1} = n + 1\). If we let \(u_j = (a_{i_{j-1}} \cdots a_{i_j-1})\) for each \(j = 1, \dots, k+1\), then \(w = u_1 \cdots u_{k+1}\) and \(u_j \in \mathcal L(\mathcal A_{Ant}, r)\) for each \(j\): the first \(k\) blocks end in an accepting \(s_{i_j - 1}\) by \(\text{(***)}\), and the last block ends in \(s_n\), which is accepting because \(s_n \cdot r^*\) is.
By the induction hypothesis, \(u_j \in \mathcal L(r)\) for each \(j\), which means that
\[
w = u_1 \cdots u_{k+1} \in \mathcal L(r)^* = \mathcal L(r^*)
\]
This proves that \(\mathcal L(\mathcal A_{Ant}, r^*) \subseteq \mathcal L(r^*)\).
For the opposite inclusion, let \(w \in \mathcal L(r^*)\). Then by definition, there are \(u_1,\dots, u_k \in \mathcal L(r)\) such that \(w = u_1\cdots u_k\). By the induction hypothesis, each \(u_j\) is in \(\mathcal L(\mathcal A_{Ant}, r)\), and chaining the corresponding accepting paths gives a path of the form \(\text{(**)}\) satisfying \(\text{(***)}\) above, so by definition of \(\mathcal A_{Ant}\), we find \(w \in \mathcal L(\mathcal A_{Ant}, r^*)\). This shows that \(\mathcal L(\mathcal A_{Ant}, r^*) \supseteq \mathcal L(r^*)\).
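The statement \(\mathcal L(\mathcal A_{Ant}, r) = \mathcal L(r)\) can also be spot-checked mechanically on short words. The sketch below is only a sanity check, not a proof. It assumes a hypothetical tuple encoding of regular expressions, and the names `accepts` (running a word through \(\mathcal A_{Ant}\)), `member` (the language semantics computed directly from its definition), and `agree` are all illustrative.

```python
from itertools import product

# Hypothetical tuple encoding of regular expressions (an assumption, not from the notes).
EMPTY, EPS = ("empty",), ("eps",)
def sym(a): return ("sym", a)
def alt(r1, r2): return ("alt", r1, r2)
def cat(r1, r2): return ("cat", r1, r2)
def star(r): return ("star", r)

def accepting(r):
    """Membership in F."""
    tag = r[0]
    if tag == "alt":
        return accepting(r[1]) or accepting(r[2])
    if tag == "cat":
        return accepting(r[1]) and accepting(r[2])
    return tag in ("eps", "star")

def delta(r, a):
    """Set of Antimirov derivatives of r with respect to the letter a."""
    tag = r[0]
    if tag == "sym":
        return {EPS} if r[1] == a else set()
    if tag == "alt":
        return delta(r[1], a) | delta(r[2], a)
    if tag == "cat":
        extra = delta(r[2], a) if accepting(r[1]) else set()
        return {cat(s, r[2]) for s in delta(r[1], a)} | extra
    if tag == "star":
        return {cat(s, r) for s in delta(r[1], a)}
    return set()  # the empty-language and empty-word expressions have no transitions

def accepts(r, word):
    """Is word accepted from state r? Tracks the set of active states."""
    states = {r}
    for a in word:
        states = {s for q in states for s in delta(q, a)}
    return any(accepting(q) for q in states)

def member(r, w):
    """Is w in the language semantics of r? Computed directly from the definition."""
    tag = r[0]
    if tag == "empty":
        return False
    if tag == "eps":
        return w == ""
    if tag == "sym":
        return w == r[1]
    if tag == "alt":
        return member(r[1], w) or member(r[2], w)
    if tag == "cat":
        return any(member(r[1], w[:i]) and member(r[2], w[i:]) for i in range(len(w) + 1))
    # star: the empty word, or a nonempty prefix in L(r) followed by a word in L(r*)
    return w == "" or any(member(r[1], w[:i]) and member(r, w[i:]) for i in range(1, len(w) + 1))

def agree(r, alphabet="ab", max_len=5):
    """Compare both notions of acceptance on every word of length at most max_len."""
    words = ("".join(p) for n in range(max_len + 1) for p in product(alphabet, repeat=n))
    return all(accepts(r, w) == member(r, w) for w in words)
```

Calling `agree(cat(star(alt(sym("a"), sym("b"))), cat(sym("a"), sym("b"))))` compares the two on all words over \(\{a, b\}\) of length at most 5. The choice of alphabet and length bound is arbitrary.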
Now, for the induction steps, suppose that \(r, r_1,r_2 \in S\).
- (Induction Step 1) Every regular expression reachable from \(r_1 + r_2\) in \(\mathcal A_{Ant}\) is reachable from either \(r_1\) or \(r_2\). This means that \[\begin{aligned} \#(r_1 + r_2) &\le |\{r_1 + r_2\} \cup \{s \mid r_1 \to \cdots \to s \} \cup \{s \mid r_2 \to \cdots \to s \}| \\ &\le 1 + \#(r_1) + \#(r_2) \end{aligned}\] Since \(r_1,r_2 \in S\), \(\#(r_1) + \#(r_2)\) is finite. Therefore, \(\#(r_1 + r_2)\) is finite, so \(r_1 + r_2 \in S\). So far, \(\#(r)\) has risen to the order of \(O(\frac12 n) + O(\frac12 n) = O(n)\).
- (Induction Step 2) Every regular expression \(p \in \mathit{RExp}\) reachable from \(r_1 \cdot r_2\) in \(\mathcal A_{Ant}\) is of one of two forms: either \(p = s \cdot r_2\) where \(s\) is reachable from \(r_1\), or \(p = s\) where \(s\) is reachable from \(r_2\) (the second form arises once an accepting state has been reached from \(r_1\)). Counting these up, we again see that \[\begin{aligned} \#(r_1 \cdot r_2) &\le |\{r_1 \cdot r_2\} \cup \{s \cdot r_2 \mid r_1 \to \cdots \to s \} \\ &\hspace{4em}\cup \{s \mid r_1 \to \cdots \to q \in F \text{ and } r_2 \to \cdots \to s \}| \\ &\le 1 + |\{s \mid r_1 \to \cdots \to s \}| + |\{s \mid r_2 \to \cdots \to s \}| \\ &\le 1 + \#(r_1) + \#(r_2) \end{aligned}\] Since \(r_1,r_2 \in S\), \(\#(r_1) + \#(r_2)\) is finite. Therefore, \(\#(r_1 \cdot r_2)\) is finite, so \(r_1 \cdot r_2 \in S\). So far, \(\#(r)\) has risen to the order of \(O(\frac12 n) + O(\frac12 n) = O(n)\).
- (Induction Step 3) Every regular expression \(p \in \mathit{RExp}\) reachable from \(r^*\) is of the form \(p = s \cdot r^*\) for some \(s\) reachable from \(r\). Therefore, \[\begin{aligned} \#(r^*) &\le |\{r^*\} \cup \{s \cdot r^* \mid r \to \cdots \to s \}| \\ &\le 1 + |\{s \mid r \to \cdots \to s \}| \\ &\le 1 + \#(r) \end{aligned}\] Since \(r \in S\), \(\#(r)\) is finite. Therefore, \(\#(r^*)\) is finite, so \(r^* \in S\). So far, \(\#(r)\) has risen to the order of \(O(n)\).
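The three inequalities above can be turned into a recursive upper bound and compared against an explicit count of reachable states. This is a sketch under stated assumptions: the tuple encoding and the names `reachable` and `bound` are hypothetical, \(\#(r)\) is taken to count \(r\) itself, and the base values 1, 1, 2 for \(\emptyset\), \(\varepsilon\), and a letter are read off from the definition of \(\delta\) rather than quoted from the proof.

```python
# Hypothetical tuple encoding of regular expressions (an assumption, not from the notes).
EMPTY, EPS = ("empty",), ("eps",)
def sym(a): return ("sym", a)
def alt(r1, r2): return ("alt", r1, r2)
def cat(r1, r2): return ("cat", r1, r2)
def star(r): return ("star", r)

def accepting(r):
    """Membership in F."""
    tag = r[0]
    if tag == "alt":
        return accepting(r[1]) or accepting(r[2])
    if tag == "cat":
        return accepting(r[1]) and accepting(r[2])
    return tag in ("eps", "star")

def delta(r, a):
    """Set of Antimirov derivatives of r with respect to the letter a."""
    tag = r[0]
    if tag == "sym":
        return {EPS} if r[1] == a else set()
    if tag == "alt":
        return delta(r[1], a) | delta(r[2], a)
    if tag == "cat":
        extra = delta(r[2], a) if accepting(r[1]) else set()
        return {cat(s, r[2]) for s in delta(r[1], a)} | extra
    if tag == "star":
        return {cat(s, r) for s in delta(r[1], a)}
    return set()  # the empty-language and empty-word expressions have no transitions

def reachable(r, alphabet):
    """All states reachable from r (including r itself), found by graph search."""
    seen, todo = {r}, [r]
    while todo:
        q = todo.pop()
        for a in alphabet:
            for s in delta(q, a):
                if s not in seen:
                    seen.add(s)
                    todo.append(s)
    return seen

def bound(r):
    """Upper bound on the number of reachable states, from the induction steps above."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return 1                      # only the state itself (assumed base value)
    if tag == "sym":
        return 2                      # the letter and the empty-word state (assumed base value)
    if tag in ("alt", "cat"):
        return 1 + bound(r[1]) + bound(r[2])
    return 1 + bound(r[1])            # star
```

For any expression `r` built with these constructors over the alphabet \(\{a, b\}\), `len(reachable(r, "ab"))` should never exceed `bound(r)`, and `bound(r)` grows by at most 2 per formation rule, which is where the linear bound comes from.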
Part of what is really nice about the Antimirov automaton is that the \(O(n)\) bound on the number of states shows us that these automata are relatively small.
- Name three different words \(w_0,w_1,w_2 \in A^*\) that are not in \(\mathcal L(r)\), i.e., \(w_0,w_1,w_2 \notin \mathcal L(r)\).
- Use the inequalities in the proof of the Linear Bound on Antimirov Derivatives Lemma to determine an upper bound on the number of states in the automaton \(\langle r\rangle_{\mathcal A_{Ant}}\) generated by \(r\) in \(\mathcal A_{Ant}\), i.e., \(\#(r)\).
- Now draw a state diagram of \(\langle r\rangle_{\mathcal A_{Ant}}\).
- How many formation rules were used to form the regular expression \(r\)? How does this number of formation rules compare to the number of states in \(\langle r\rangle_{\mathcal A_{Ant}}\)?