Finite automata are one of the most fundamental models of computation and are taught in almost all undergraduate computer-science curricula. Although automata are typically presented as a theoretical model of computation, they have found their place in a variety of practical applications, such as natural language processing, networking, program verification, and regular-expression matching. In this article, we argue that the classical definition of finite automata is not well suited for practical implementation and present symbolic automata, a model that addresses this limitation.
Key Insights
- Gap: Real implementations of finite automata use the structure of the alphabet, for example, character classes, to enable efficient data structures and algorithms. These aspects are missing in the way automata are taught and defined in the literature.
- Innovation: Symbolic automata match how automata are implemented in practice by representing the alphabet and automaton structures separately. This separation yields a more general model that can handle complex and infinite alphabets while retaining all the desirable properties of finite automata and enabling new practical applications.
- Opportunity: Separating the alphabet and automaton representations opens opportunities for redesigning how we teach automata and design algorithms for them.
In most of the literature, finite automata are intuitively presented as follows. A finite automaton is a graph that describes a set of strings over a finite alphabet Σ. In the graph, nodes are called states, some states are initial, some states are final, and each edge “reads” a symbol from Σ. Every path from an initial to a final state describes a string accepted by the automaton. For example, the automaton in Figure 1(a), given Σ = {0, 1}, describes the set of all binary strings starting with 0. Researchers have designed a number of decision procedures and extensions for this simple but very powerful model.
Figure 1. (a) classical automaton over the alphabet {0, 1} accepting all strings that start with 0; (b-g) symbolic automata accepting all ASCII strings starting with a decimal digit with predicates represented by: (b) hashsets; (c) intervals; (d) binary decision diagrams where the dashed arrow is the case bi = 0; (e) bitvectors; (f) Boolean formulas; and (g) finite automata over the alphabet {0, 1} restricted to {0, 1}^7.
As discussed in textbooks, for example Hopcroft and Ullman,22 the aforementioned way of defining automata is simple and easy to implement.
Any automaton can be represented by a set of edges, where an edge (p, a, q) goes from state p to q reading symbol a.
For certain applications, such as natural language processing, this type of implementation is appropriate. However, in many applications (particularly in software verification), the alphabet over which the automaton has to operate is so big (sometimes infinite) that an explicit representation of all its symbols is infeasible.
The following two applications of finite automata are examples of this problem.
Regular expressions. Automata are used in most implementations of regular-expression matching and analysis algorithms. In this domain, the alphabet ΣR is typically very large, ranging between hundreds and millions of characters; for example, extended ASCII has 256 characters and Unicode has over one million characters. A quick Web search of finite-automata implementations for regular expressions shows that in all the performant implementations, transitions (that is, the edges of the graph) do not “read” a single character in ΣR as done in the classical definition.a Instead, practical implementations allow transitions to carry (that is, “read”) sets of characters; that is, an edge in the graph denotes a transition of the form (p, S, q) where S ⊆ ΣR, see Figure 1(b). Moreover, in most implementations, S is represented using a dedicated data structure, for example, an interval [‘0’, ‘9’] representing the set of all ASCII decimal digits {‘0’, ‘1’, …, ‘9’}, see Figure 1(c). Use of intervals assumes that all characters have a numeric code that implies a total order over ΣR—for example, the ith decimal ASCII digit has code 48 + i.
Model checking. Automata are the backbone of many model checking algorithms,35 where these models are used to describe properties a system must obey. In model checking, the alphabet is typically defined with respect to a finite set AP = {b0, …, bk−1} of Boolean properties of interest, often called atomic propositions. The alphabet is ΣMC = 2^AP, where a character α ∈ ΣMC represents the assignment {b = true}b∈α ∪ {b = false}b∈AP∖α. In practical implementations of automata for model checking, transitions do not read individual characters and instead read sets of characters.18 The tools assume a particular representation of the sets, for example as binary decision diagrams, see Figure 1(d), or as Boolean formulas φ over the variables AP where each satisfying assignment defines a character α, denoted by α ⊨ φ. For example, given k = 7, the predicate in Figure 1(f) has exactly 10 satisfying assignments, such as {b0, b4, b5} ⊨ φ.
The above applications clarify the following point.
Practical implementations of automata allow transitions to carry sets of characters instead of individual characters and take advantage of the structure of the input alphabet.
These examples show that the classical theory of automata does not do a good job in capturing the ways in which automata are implemented in practice. Therefore, all the theoretical results—for example, decision procedures and closure properties—do not directly apply to such implementations and cannot “take advantage” of the structure present in how automata are implemented. Symbolic automata and their variants (the models presented in this article) address this problem.
Symbolic automata explicitly model how the alphabet is represented/implemented in practical applications and allow us to design automata decision procedures that take advantage of the alphabet representation.
Symbolic automata extend finite automata by allowing transitions to carry predicates over rich alphabet theories (for example, intervals or binary decision diagrams). Unlike existing implementations of automata, where the alphabet representation is chosen a priori and hard-coded into the model, symbolic automata explicitly separate the representation of the alphabet structure from that of the finite-state graph structure. Concretely, symbolic automata provide a unifying approach for different alphabet representations by representing the alphabet using an effective Boolean algebra A, also called the alphabet theory. The automata algorithms are then designed modulo A. This separation of concerns allows one to seamlessly change the alphabet representation (for example, choosing any of the representations in Figure 1) without changing any of the underlying automata algorithms (that is, each decision procedure is implemented in a way that is agnostic to the choice of the alphabet theory).
Not only do symbolic automata better reflect how automata are implemented in practice, but they also allow one to represent strings over infinite alphabets, for example, the set of rational numbers. Despite this increase in expressiveness, symbolic automata enjoy many of the desirable closure and decidability properties of finite automata and have in fact been used in a variety of applications: verification of functional programs operating over lists and trees,13 analysis of complex implementations of BASE64 and UTF encoders,10 automatic synthesis of inverses of complex string-manipulating programs,23 analysis of programs involving regular expressions over large alphabets,8,9,36 automatic parallelization of list-processing code,31 solving constraints over sequences,33 vulnerability detection in Web applications,4 analysis of program binaries,5 and fast execution of string transformations.1,28
Symbolic automata extend finite automata by allowing transitions to carry predicates over rich alphabet theories.
The goal of this article is to give an overview of what is currently known about symbolic automata and their variants, and what applications these models have enabled. The concept of automata with predicates instead of concrete symbols was first mentioned in Watson.39 This article focuses on work done following the definition of symbolic finite automata presented in Veanes et al.,36 where predicates have to be drawn from a decidable Boolean algebra. The term symbolic automata is sometimes used to refer to automata over finite alphabets where the state space is represented using binary decision diagrams. This meaning is different from the one described in this article.
It is hard to describe all the works related to symbolic automata in one article, and the authors curate an updated list of papers on symbolic automata and transducers (http://pages.cs.wisc.edu/~loris/symbolic-automata.html). Moreover, the algorithms discussed in this article are implemented in the open source libraries AutomataDotNet (C#, https://github.com/AutomataDotNet/) and symbolic automata (Java, https://github.com/lorisdanto/symbolicautomata/).
Symbolic Automata
Structured alphabets and Boolean algebras. As we illustrated in Figure 1, in symbolic automata, transitions carry predicates, and to formally describe predicates, we need to define Boolean algebras. Formally, an effective Boolean algebra A is a tuple (D, Ψ, ⟦_⟧, ⊥, ⊤, ∨, ∧, ¬), where D is a set of domain elements; Ψ is a set of predicates closed under the Boolean connectives, with ⊥, ⊤ ∈ Ψ; and the component ⟦_⟧: Ψ → 2^D is a denotation function such that (i) ⟦⊥⟧ = ∅, (ii) ⟦⊤⟧ = D, and (iii) for all φ, ψ ∈ Ψ, ⟦φ∨ψ⟧ = ⟦φ⟧ ∪ ⟦ψ⟧, ⟦φ∧ψ⟧ = ⟦φ⟧ ∩ ⟦ψ⟧, and ⟦¬φ⟧ = D ∖ ⟦φ⟧. We also require that checking satisfiability of φ—that is, whether ⟦φ⟧ ≠ ∅—is decidable. A is extensional when ⟦φ⟧ = ⟦ψ⟧ implies that φ = ψ—that is, semantic equivalence coincides with syntactic equivalence.
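To make the algebra interface concrete, the following sketch shows one way it could be written down in code. It is a minimal illustration, not code from the libraries cited in this article; all names are illustrative.

```python
# A hypothetical rendering of the effective-Boolean-algebra interface
# (D, Psi, [[_]], bot, top, or, and, not). Names are illustrative.
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

P = TypeVar("P")   # predicate representation: BDD, bitvector, formula, ...
D = TypeVar("D")   # domain element: a character

class BooleanAlgebra(ABC, Generic[P, D]):
    @abstractmethod
    def true(self) -> P: ...                 # top: denotes all of D
    @abstractmethod
    def false(self) -> P: ...                # bot: denotes the empty set
    @abstractmethod
    def or_(self, p: P, q: P) -> P: ...      # [[or_(p, q)]] = [[p]] union [[q]]
    @abstractmethod
    def and_(self, p: P, q: P) -> P: ...     # [[and_(p, q)]] = [[p]] intersect [[q]]
    @abstractmethod
    def not_(self, p: P) -> P: ...           # [[not_(p)]] = D minus [[p]]
    @abstractmethod
    def is_sat(self, p: P) -> bool: ...      # decidable: is [[p]] non-empty?
    @abstractmethod
    def denotes(self, p: P, a: D) -> bool: ...   # membership test: a in [[p]]
```

Any concrete algebra (hashsets, intervals, BDDs, bitvectors, SMT formulas) only has to implement these operations; the automata algorithms described below can then be written against this interface alone.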
The following example shows how some of the character representations described in Figure 1 are Boolean algebras.
Example 2.1 (ASCII Algebras). Regular expressions in modern programming languages use character classes as basic building blocks. Character classes can be unioned and complemented. For example, the character classes [A-Z] and [a-z] denote the sets of all ASCII uppercase and lowercase letters, respectively, [A-Za-z] denotes their union, and [^A-Za-z] denotes the complement of [A-Za-z]. A natural choice for the domain D is the set of all ASCII codes {n | 0 ≤ n ≤ 127}. Figure 1(b-g) shows several different options for representing Ψ. Consider the following representations of φ ∈ Ψ:
- (a) φ is a binary decision diagram (BDD) over AP7 (Figure 1(d)); ∧ is the AND-product of BDDs; and ⊥ is the false-leaf BDD;
- (b) φ is a 128-bit bitvector whose nth bit is 1 iff n ∈ ⟦φ⟧ (Figure 1(e)); ∧ is bitwise-AND; and ⊥ is 0^128—that is, a sequence of 128 zeroes;
- (c) φ is a Boolean formula over AP7 (Figure 1(f)); ∧ is syntactic conjunction; and ⊤ is the true predicate;
- (d) φ is an automaton over Σ = {0, 1} restricted to Σ^7 (Figure 1(g)); ∧ is the product of automata; and ⊤ is the fixed automaton accepting all strings in Σ^7.
We identify the underlying semantic domain D in this example with character codes. This reflects the fact that implementations that use ASCII often use expressions such as ‘A’ and ‘\x41’ as equivalent notations of the same underlying character (code). In other words, ⟦φ⟧ ⊆ D is a set of natural numbers, namely the set of all character codes denoted by φ. In particular, in (a) and (c), ⟦φ⟧ = {Σ_{bi∈α} 2^i | α ⊨ φ}, and in (d), ⟦φ⟧ = {Σ_{1≤i≤7, vi=1} 2^(7−i) | v ∈ L(φ)}. Observe that the algebras in (a) and (b) are extensional and satisfiability is trivial (nonequality to ⊥). For example, in (b), the disjunction 0^64 1^64 ∨ 1^64 0^64 (where ∨ is bitwise-OR) is the predicate 1^128, that is, ⊤. The algebra in (c) is nonextensional and satisfiability can be decided using a satisfiability (SAT) solver. The algebra in (d) is also nonextensional but its satisfiability is trivial, assuming unreachable states and dead ends are always removed from (the automaton) φ. ⊠
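As a concrete instance, here is a minimal sketch of the 128-bit bitvector representation of Figure 1(e), with predicates stored as Python integers; the class and helper names are illustrative, not taken from the cited libraries.

```python
# The bitvector algebra of Figure 1(e): a predicate is a 128-bit Python integer
# whose n-th bit is 1 iff ASCII code n belongs to its denotation.
MASK = (1 << 128) - 1                         # 1^128, i.e., top

class AsciiBitvectorAlgebra:
    def true(self):  return MASK
    def false(self): return 0                 # 0^128, i.e., bot
    def or_(self, p, q):  return p | q        # union is bitwise OR
    def and_(self, p, q): return p & q        # intersection is bitwise AND
    def not_(self, p):    return ~p & MASK
    def is_sat(self, p):  return p != 0       # extensional: satisfiable iff != bot
    def denotes(self, p, code): return ((p >> code) & 1) == 1

    @staticmethod
    def from_range(lo, hi):
        """Predicate for a character class such as [0-9]."""
        bits = 0
        for c in range(ord(lo), ord(hi) + 1):
            bits |= 1 << c
        return bits

alg = AsciiBitvectorAlgebra()
digits = AsciiBitvectorAlgebra.from_range("0", "9")   # decimal digits, as in Figure 1(e)
assert alg.denotes(digits, ord("7")) and not alg.denotes(digits, ord("a"))
```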
The following Boolean algebra mimics how finite alphabets are represented in traditional automata implementations.b
Example 2.2 (Equality Algebra). The equality algebra over an arbitrary set D has an atomic predicate φa for every a ∈ D such that ⟦φa⟧ = {a}, as well as the predicates ⊥ and ⊤. There are no formal requirements on D, but one natural instance is D = ΣMC, the powerset of a finite set of Boolean properties as discussed above. The set of predicates Ψ is the Boolean closure generated from the atomic predicates—for example, φa ∨ φb and ¬φa, where a, b ∈ D, are predicates in Ψ. Intuitively, one can think of each predicate in this algebra as being of the form λx. x = a1 ∨ … ∨ x = an, where each ai is an element of D. Hence the name equality algebra. ⊠
The following example shows how the predicate representations employed by modern SMT solvers14 can be used to design Boolean algebras for arbitrarily complex domains.
Example 2.3 (SMT Algebra). Let Ψ be the set of all quantifier-free formulas with one fixed free variable x of a fixed type τ. Formally, SMTτ = (D, Ψ, ⟦_⟧, ⊥, ⊤, ∨, ∧, ¬), where D is the domain of τ, and the Boolean operations are the corresponding connectives of SMT formulas. Intuitively, SMTτ represents a restricted use of an SMT solver. The interpretation function ⟦φ⟧ is defined using the satisfiability-checking and witness-generation operations of an SMT solver. For example, in SMTZ, elements are integers and predicates are linear arithmetic formulas, such as the predicates φodd and φ>0 used in Figure 3, which denote the odd and the positive integers, respectively. ⊠
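The following sketch shows how SMTZ could be realized on top of the Z3 solver’s Python bindings (the `z3-solver` package); it illustrates the idea and is not code from the cited tools.

```python
# A sketch of the SMT_Z algebra backed by Z3. Predicates are Z3 formulas over
# one integer variable x; satisfiability is delegated to the solver.
from z3 import Int, IntVal, And, Or, Not, BoolVal, Solver, sat, substitute

x = Int("x")

class SmtIntAlgebra:
    def true(self):  return BoolVal(True)
    def false(self): return BoolVal(False)
    def or_(self, p, q):  return Or(p, q)
    def and_(self, p, q): return And(p, q)
    def not_(self, p):    return Not(p)
    def is_sat(self, p):
        s = Solver()
        s.add(p)
        return s.check() == sat
    def denotes(self, p, a):
        # a in [[p]]: substitute the concrete integer a for x and check.
        s = Solver()
        s.add(substitute(p, (x, IntVal(a))))
        return s.check() == sat

alg = SmtIntAlgebra()
phi_odd = x % 2 == 1      # "x is odd"
phi_pos = x > 0           # "x is positive"
assert alg.is_sat(alg.and_(phi_odd, phi_pos))          # witness: x = 1
assert not alg.denotes(alg.and_(phi_odd, phi_pos), -3)
```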
Symbolic finite automata. We can now define symbolic finite automata, which are finite automata where edge labels are replaced by predicates in a Boolean algebra.
Definition 2.4. A symbolic finite automaton (s-FA) is a tuple M = (A, Q, q0, F, Δ), where A is an effective Boolean algebra, Q is a finite set of states, q0 ∈ Q is the initial state, F ⊆ Q is the set of final states, and Δ ⊆ Q × ΨA × Q is a finite set of transitions.
Elements of the domain D of A are called characters, and elements of D* are called strings. A transition ρ = (q1, φ, q2) ∈ Δ, also denoted q1 −φ→ q2, has source state q1, target state q2, and guard φ. For a ∈ D, the concrete a-transition q1 −a→ q2 denotes that q1 −φ→ q2 and a ∈ ⟦φ⟧ for some φ; M is deterministic when there is at most one concrete a-transition from any source state q1 and character a.
A string w = a1a2…ak is accepted at state q iff, for 1 ≤ i ≤ k, there exist concrete ai-transitions qi−1 −ai→ qi such that q0 = q and qk ∈ F. The set of strings accepted at q is denoted by Lq(M) and the language of M is L(M) = Lq0(M).
It is convenient to work with s-FAs that are normalized by treating Δ as a function from Q × Q to Ψ with Δ(p, q) = ⊥ when there is no transition from p to q. To this end, let Δ(p, q) = ∨{φ | (p, φ, q) ∈ Δ}, where ∨∅ = ⊥. We also define dom(p) = ∨q∈Q Δ(p, q) as the domain-predicate of p, and say that p is complete if ⟦dom(p)⟧ = DA; p is partial otherwise. Observe that p is partial iff ¬dom(p) is satisfiable. M is complete if all states of M are complete; M is partial otherwise.
Example 2.5. Examples of s-FAs are Mpos and Mev/odd in Figure 2(a) and (b), respectively. These two s-FAs have 1 and 2 states, respectively, and they both operate over the Boolean algebra SMTZ from Example 2.3. The s-FA Mpos accepts all strings consisting only of positive numbers, whereas the s-FA Mev/odd accepts all strings of even length consisting only of odd numbers. For example, Mev/odd accepts the string [1, 3, 5, 3] and rejects the strings [1, 3, 5] and [51, 26]. The product automaton of Mpos and Mev/odd, Mev/odd × Mpos, accepts the language L(Mpos) ∩ L(Mev/odd) (Figure 2(c)). Both s-FAs are partial—for example, neither of them has transitions for the character 0. Finally, Mcpos accepts the complement of the language of Mpos (Figure 2(d)). ⊠
Figure 2. Symbolic automata: (a) Mpos; (b) Mev/odd; (c) Mev/odd × Mpos; and (d) Mcpos.
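The definition translates almost directly into code. The sketch below implements a nondeterministic s-FA and its acceptance check, using plain Python callables in place of algebra predicates, since running a string only needs the membership test a ∈ ⟦φ⟧; all names are illustrative.

```python
# A minimal s-FA in the sense of Definition 2.4, with a nondeterministic
# acceptance check. Guards are callables standing in for algebra predicates.
class SFA:
    def __init__(self, states, initial, finals, transitions):
        self.states = set(states)
        self.initial = initial
        self.finals = set(finals)
        # transitions: list of (source, guard, target) with guard: D -> bool
        self.transitions = list(transitions)

    def accepts(self, word):
        current = {self.initial}                    # set of reachable states
        for a in word:
            current = {q2 for (q1, guard, q2) in self.transitions
                       if q1 in current and guard(a)}
            if not current:
                return False
        return bool(current & self.finals)

# A rendering of M_ev/odd from Example 2.5: even-length strings of odd integers.
odd = lambda a: a % 2 == 1
m_evodd = SFA({0, 1}, 0, {0}, [(0, odd, 1), (1, odd, 0)])
assert m_evodd.accepts([1, 3, 5, 3])
assert not m_evodd.accepts([1, 3, 5]) and not m_evodd.accepts([51, 26])
```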
Properties of symbolic automata. In this section, we illustrate some basic properties of s-FAs and show how, although these models are more expressive (they can model infinite alphabets) and more succinct (they allow multiple characters on individual transitions) than finite automata, they still enjoy many desirable decidability and closure properties. A key characteristic of all s-FA algorithms is that there is no explicit use of characters, because D may be infinite and the interface to the Boolean algebra does not directly support the use of individual characters. This aspect is in sharp contrast to traditional finite automata algorithms.
Closure properties. First, as for finite automata, nondeterminism does not add expressiveness to s-FAs.
THEOREM 2.6 (DETERMINIZABILITY36). Given an s-FA M one can effectively construct a deterministic s-FA Mdet such that L(M) = L(Mdet).
The determinization algorithm is similar to the subset construction for automata over finite alphabets, but it also requires combining predicates appearing in different transitions. If M contains k inequivalent predicates and n states, then the number of distinct predicates (other than ⊤) in Mdet is at most 2^k and the number of states is at most 2^n. In other words, in addition to the classical state-space explosion risk, there is also a predicate-space explosion risk. Figure 3 illustrates such an explosion for k = 2 and n = 2.
Figure 3. (a) Nondeterministic partial s-FA with predicates φodd and φ>0 from Example 2.3. (b) Equivalent complete s-FA after determinization where φ∅ denotes the predicate ¬φodd∧¬φ>0 on transitions to the “sink” state ∅.
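The following sketch illustrates the symbolic subset construction behind Theorem 2.6. To keep it self-contained, predicates are finite sets over a toy integer domain, so that conjunction, negation, and satisfiability are ordinary set operations; the algorithm itself only uses the algebra interface and would work unchanged with any effective Boolean algebra. Names are illustrative.

```python
# Symbolic subset construction: from a macro-state, every subset of outgoing
# guards taken positively (with the others negated) yields one candidate
# transition; unsatisfiable combinations are dropped.
from itertools import combinations

DOMAIN = frozenset(range(-5, 6))
AND = frozenset.intersection
NOT = lambda p: DOMAIN - p
SAT = lambda p: len(p) > 0

def determinize(initial, finals, transitions):
    """transitions: list of (source, predicate, target); returns a deterministic s-FA."""
    start = frozenset([initial])
    states, worklist, dtrans = {start}, [start], []
    while worklist:
        S = worklist.pop()
        out = [(g, q2) for (q1, g, q2) in transitions if q1 in S]
        for k in range(1, len(out) + 1):
            for chosen in combinations(range(len(out)), k):
                phi, target = DOMAIN, set()
                for i, (g, q2) in enumerate(out):
                    phi = AND(phi, g if i in chosen else NOT(g))
                    if i in chosen:
                        target.add(q2)
                if SAT(phi):                        # keep only satisfiable guards
                    T = frozenset(target)
                    dtrans.append((S, phi, T))
                    if T not in states:
                        states.add(T)
                        worklist.append(T)
    dfinals = {S for S in states if S & set(finals)}
    return start, dfinals, dtrans

# A small nondeterministic s-FA with the two overlapping guards "odd" and
# "positive" (in the spirit of Figure 3, not a reproduction of it).
phi_odd = frozenset(d for d in DOMAIN if d % 2 != 0)
phi_pos = frozenset(d for d in DOMAIN if d > 0)
start, dfinals, dtrans = determinize(0, {1}, [(0, phi_odd, 0), (0, phi_pos, 1)])
```

The inner loop over subsets of outgoing guards is exactly where the predicate-space explosion discussed above can occur.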
Because s-FAs can be determinized, we can show that s-FAs are closed under Boolean operations using variations of traditional automata constructions.
THEOREM 2.7 (BOOLEAN OPERATIONS36). Given s-FAs M1 and M2 one can effectively construct s-FAs Mc1 and M1 × M2 such that L(Mc1) = D* ∖ L(M1) and L(M1 × M2) = L(M1) ∩ L(M2).
The intersection of two s-FAs is computed using a variation of the classical product construction in which transitions are “synchronized” using conjunction. For example, the intersection of Mpos and Mev/odd from Example 2.5 is shown in Figure 2(c). To complement a deterministic partial s-FA M, M is first completed by adding a new nonfinal sink state s with a self-loop s −⊤→ s and, for each partial state p, a transition p −¬dom(p)→ s. Then, the final states and the nonfinal states are swapped to obtain Mc. Following this procedure, the complement of Mpos from Example 2.5 is shown in Figure 2(d).
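A sketch of the two constructions of Theorem 2.7, again over a toy set-based algebra so that conjunction is intersection and negation is complement with respect to the domain; names and the restriction to a small integer domain are illustrative only.

```python
# Product and complement constructions; an s-FA is a triple (initial, finals, transitions).
DOMAIN = frozenset(range(-5, 6))

def product(sfa1, sfa2):
    """Synchronized product: guards are combined by conjunction (here, set
    intersection) and unsatisfiable combined guards are dropped."""
    (i1, f1, t1), (i2, f2, t2) = sfa1, sfa2
    trans = [((p1, p2), g1 & g2, (q1, q2))
             for (p1, g1, q1) in t1 for (p2, g2, q2) in t2 if g1 & g2]
    return (i1, i2), {(a, b) for a in f1 for b in f2}, trans

def complement(sfa, states):
    """Complement a *deterministic* s-FA: complete it with a fresh non-final
    sink reached on not(dom(p)) from every partial state p, then swap final
    and non-final states."""
    initial, finals, trans = sfa
    sink = "sink"
    completed = list(trans) + [(sink, DOMAIN, sink)]
    for p in states:
        dom_p = frozenset().union(*[g for (q, g, _) in trans if q == p])
        if DOMAIN - dom_p:                           # p is partial
            completed.append((p, DOMAIN - dom_p, sink))
    new_finals = (set(states) | {sink}) - set(finals)
    return initial, new_finals, completed

odd = frozenset(d for d in DOMAIN if d % 2 != 0)
pos = frozenset(d for d in DOMAIN if d > 0)
m_pos = (0, {0}, [(0, pos, 0)])                            # M_pos on the toy domain
m_evodd = ("e", {"e"}, [("e", odd, "o"), ("o", odd, "e")])
m_both = product(m_evodd, m_pos)                           # cf. Figure 2(c)
m_cpos = complement(m_pos, states={0})                     # cf. Figure 2(d)
```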
Decision procedures. s-FAs enjoy the same decidability properties of finite automata:
THEOREM 2.8 (DECIDABILITY36). Given s-FAs M1 and M2 it is decidable to check if M1 is empty—that is, whether L(M1) = ∅—and if M1 and M2 are language-equivalent—that is, whether L(M1) = L(M2).
Checking emptiness requires checking which transitions are satisfiable and, once unsatisfiable transitions are removed, any path reaching a final state from an initial state represents at least one accepting string. Equivalence can be reduced to emptiness using closure under Boolean operations—that is, L(M1) = L(M2) iff both M1 × Mc2 and Mc1 × M2 have empty languages.
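A sketch of the emptiness check: ignore unsatisfiable guards and test whether a final state is reachable from the initial state. Equivalence then amounts to running this check on the two product automata mentioned above; the satisfiability test is whatever the alphabet theory provides.

```python
# Emptiness: reachability over the satisfiable transitions only.
def is_empty(initial, finals, transitions, is_sat):
    reached, frontier = {initial}, [initial]
    while frontier:
        p = frontier.pop()
        for (q1, guard, q2) in transitions:
            if q1 == p and q2 not in reached and is_sat(guard):
                reached.add(q2)
                frontier.append(q2)
    return not (reached & set(finals))

# For the toy set-based guards above, satisfiability is just non-emptiness.
assert not is_empty(0, {0}, [(0, frozenset({1, 3}), 0)], is_sat=bool)
```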
Algorithms have also been proposed for minimizing deterministic s-FAs,9 for checking language inclusion,25 and for learning s-FAs from membership and equivalence queries.3,26,27
Parametric complexities. Because s-FAs are parametric in an underlying alphabet theory A, the complexities of the aforementioned algorithms must in some way depend on the complexities of performing certain operations in A. For example, the cost of checking emptiness of an s-FA depends on the cost of checking satisfiability in A. Another issue is the representation of Ψ, how to measure the size |φ| of a predicate φ ∈ Ψ, and the cost of Boolean operations, whose repeated application may cause predicates to grow in size and thus increase the related costs, in particular when A is not extensional. This peculiar aspect of s-FAs opens up a new set of complexity questions that have not been studied in automata theory.
We summarize the known complexity results for the procedures we described in this section. Let fsat(|φ|) denote the cost of checking satisfiability of a predicate φ ∈ Ψ (of size |φ|) and let M and M′ be two deterministic s-FAs such that M has n states and m transitions, M′ has n′ states and m′ transitions, and the largest predicate in M and M′ has size ℓ. Then, (i) checking whether M is empty has complexity O(n + m·fsat(ℓ)), and (ii) checking whether M and M′ are language-equivalent has complexity O(m·m′·fsat(ℓ)·(n + n′) + α(n + m)), where α(·) is related to a functional inverse of the Ackermann function. This last complexity is obtained by a straightforward adaptation of Hopcroft and Karp’s algorithm for DFA equivalence.21
For certain classes of problems, different algorithms can have different, incomparable complexities. Consider, for example, the problem of minimizing a deterministic s-FA M and let fsat(|φ|) denote the cost of checking satisfiability of a predicate φ ∈ Ψ. If M has n states and m transitions, and the largest predicate in M has size ℓ, then the symbolic adaptation of Moore’s algorithm for minimizing DFAs has complexity O(mm·fsat(ℓ)), whereas the symbolic adaptation of Hopcroft’s algorithm for minimizing DFAs has complexity O(m log n·fsat(nℓ)), if the cost of Boolean operations is linear.9 Although in the world of DFAs, Hopcroft’s algorithm20 is provably better than Moore’s,29 in the world of s-FAs, the two algorithms have orthogonal complexities: Hopcroft’s algorithm saves a logarithmic factor in terms of n, but this saving comes at the increased cost of satisfiability queries.
Reduction to classical automata. Although the predicates S appearing in a given s-FA (or in finitely many s-FAs over the same alphabet algebra) generally denote subsets of an infinite domain, the set of minterms Minterms(S)—the maximal satisfiable Boolean combinations of predicates in S—induces a finite set of equivalence classes over the domain. In general, every s-FA M can be compiled into a symbolically equivalent finite automaton over the alphabet Minterms(S), where Predicates(M) ⊆ S and S is a finite subset of Ψ. This idea, also called predicate abstraction, is often used in program verification.16 However, note that the size of Minterms(S) is in general exponential in the size of S.
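The sketch below computes Minterms(S) for a finite set of predicates, once more over a toy set-based algebra; with |S| predicates there are 2^|S| candidate sign combinations, which makes the exponential blow-up mentioned above directly visible. Names are illustrative.

```python
# Minterm generation: every satisfiable conjunction in which each predicate of S
# occurs either positively or negated.
from itertools import product as sign_choices

def minterms(domain, S):
    result = []
    for signs in sign_choices([True, False], repeat=len(S)):
        m = frozenset(domain)
        for pred, positive in zip(S, signs):
            m &= pred if positive else (domain - pred)
        if m:                                   # keep only satisfiable combinations
            result.append(m)
    return result

domain = frozenset(range(10))
lt5 = frozenset(range(5))                       # "less than 5"
even = frozenset(range(0, 10, 2))               # "even"
assert len(minterms(domain, [lt5, even])) == 4  # the four classes partition the domain
```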
Extensions of the Basic s-FA Model
Symbolic automata have been extended in various ways, as summarized in D’Antoni and Veanes12:
- Symbolic alternating automata (s-AFA)8 are equivalent in expressiveness to s-FAs but achieve succinctness by extending s-FAs with alternation; despite the high theoretical complexity, this model can at times be more practical than s-FAs. For example, s-AFAs can efficiently check whether the intersection of multiple ASCII regular expressions is empty in cases where traditional automata suffer from state explosion.
- Symbolic extended finite automata (s-EFA)10 allow s-FAs to read multiple input characters in a single transition and can be used to model relations between adjacent input symbols.
- Symbolic tree automata (s-TA)13 operate over trees instead of strings. s-FAs are a special case of s-TAs in which all nonleaf nodes in the tree have one child. s-TAs have the same closure and decidability properties as s-FAs and can also be minimized efficiently.
- Symbolic visibly pushdown automata (s-VPA)6 operate over nested words,2 which are used to model data with both linear and hierarchical structure—for example, XML documents and recursive program traces. s-VPAs can be determinized and have the same closure and decidability properties as s-FAs.
Symbolic transducers. Automata that also emit string outputs are generally called transducers. Perhaps the most practical extension of s-FAs is Symbolic Finite (state) Transducers (s-FTs), which extend finite transducers by allowing outputs to functionally depend on inputs, in addition to allowing predicates over those inputs.
Example 3.1. Consider an HTML encoder that is often used as a string sanitizer in a Web browser. At a high level, its purpose is to replace all nonwhitelisted characters by their equivalent HTML-encoded substrings in an HTML document; for example, the character < can be encoded as the string “&lt;” (or as the string “&#60;” using its character code 60). It is prohibitively expensive to view such an encoder as a classical Finite (state) Transducer (FT) because the standard Unicode alphabetc has 1,112,064 characters. So the FT would in general need over a million transitions, one for each character c: namely, a transition that outputs c itself if c is whitelisted (safe), and a transition that outputs encode(c) otherwise, where encode(c) denotes the HTML encoding of c.
Instead, the corresponding s-FT would need only two transitions: one guarded by the predicate describing the safe characters (this includes all ASCII uppercase and lowercase letters and digits), whose output is the input character itself, and one for all remaining characters, whose output spells out the numeric encoding “&#…;” digit by digit using output functions of the form πk = λx.((x ÷ 10^k) mod 10) + 48, where πk yields the ASCII code of the kth decimal digit of the character x (recall that the standard character code of a decimal digit d is d + 48). Note that the output of an s-FT transition is a sequence of functions—in the safe case, the sequence contains a single identity function (outputting the input character). ⊠
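The character-level arithmetic of Example 3.1 is easy to prototype. The following sketch applies πk to build the numeric “&#…;” encoding of unsafe characters; the particular whitelist is an illustrative choice, smaller than what a real sanitizer would ship with.

```python
# pi(k, code) is the ASCII code of the k-th decimal digit of the character code.
def pi(k, code):
    return (code // 10**k) % 10 + 48             # k-th decimal digit, as ASCII

def html_encode_char(code):
    safe = (48 <= code <= 57) or (65 <= code <= 90) or (97 <= code <= 122)
    if safe:
        return chr(code)                          # identity output on the safe branch
    digits = [pi(k, code) for k in range(len(str(code)) - 1, -1, -1)]
    return "&#" + "".join(chr(d) for d in digits) + ";"

assert html_encode_char(ord("<")) == "&#60;"
assert html_encode_char(ord("a")) == "a"
```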
We discuss next some of the main properties of s-FTs and the kinds of analysis they enable. A more formal treatment is covered in D’Antoni and Veanes.12
A transducer T denotes a relation ⟦T⟧ ⊆ D* × D* from input to output sequences. When T is deterministic (at most one transition can be triggered by an input symbol), this relation is a function. If T is nondeterministic but each input sequence is mapped to only one output sequence, the relation is also a function. In these last two cases, we call the transducer functional or single-valued. We call the first and second projections of the relation describing the semantics of the transducer T its domain (dom(T)) and range (ran(T)), respectively. We write ⟦T⟧(u) for {v | (u, v) ∈ ⟦T⟧}.
Although both the domain and the range of a finite-state transducer are regular, this is not true for s-FTs. By a regular language, here we mean a language accepted by an s-FA. Given an s-FT T, one can compute an s-FA DOM(T) such that L(DOM(T)) = dom(T). The range of an s-FT is in general not regular: take an s-FT T with a single transition that duplicates its input character whenever that character is odd. The range ran(T) of T then contains only sequences of numbers in which each number at position 0 (respectively, 2, 4, …) is the same as the number at position 1 (respectively, 3, 5, …). Then, ran(T) is not regular because an s-FA cannot enforce the equality constraint between two symbols at different positions in the sequence. Also, because the alphabet is infinite, states cannot be used to remember what symbol is at each position, as is normally done for finite automata over finite alphabets.
The most important property of s-FTs is that they are closed under sequential composition: given two s-FTs S and T, one can compute an s-FT S(T) such that, for every u ∈ D*, ⟦S(T)⟧(u) = {w | there exists v ∈ ⟦T⟧(u) such that w ∈ ⟦S⟧(v)}—that is, S(T) first applies T and then applies S to the result.
This enables several interesting program analyses19 and optimizations, such as (symbolic regular) type-checking that is decidable when combined with closure properties of s-FAs:
Given an s-FT T and s-FAs MI and MO, decide if for all v ∈ L(MI): ⟦T⟧(v) ⊆ L(MO).
Treat any s-FA M implicitly as an s-FT by treating each transition q −φ→ q′ as a transducer transition that outputs its input character unchanged. Then, M(T) restricts the outputs of T to L(M) and T(M) restricts the inputs of T to L(M). So type-checking reduces to checking emptiness of McO(T(MI)), where McO is the complement of MO.
Example 3.2. By using the type-checking algorithm one can prove that, for every input sequence v ∈ D*, an HTML encoder T always produces a valid output: ⟦T⟧(v) ⊆ L(MO), where MO disallows all unsafe characters besides ‘&’, and allows ‘&’ only as an encoding prefix—for example, as described by the regular expression “([#0-~]|&(amp|lt|gt|#[0-9]+);)*”. Observe that McO accepts all invalid outputs, and thus McO(T) represents the restriction of T to unwanted behaviors. ⊠
Although equivalence of finite transducers is in general undecidable,17 equivalence of single-valued s-FTs is decidable and single-valuedness is itself a decidable property.37 When combined with closure under composition, decidable equivalence is a powerful verification tool. For example, one can check whether composing two s-FTs E and D, representing a string encoder and decoder, respectively, yields the identity function I—that is, whether ⟦D(E)⟧ = ⟦I⟧.
A transducer T is injective if for all distinct u, v ∈ D*, we have ⟦T⟧(u) ∩ ⟦T⟧(v) = ∅. Although injectivity of finite transducers is decidable, injectivity is undecidable for s-FTs.23
Extensions of symbolic transducers. Similarly to how s-EFAs extend s-FAs, Symbolic Extended Finite Transducers (s-EFTs)10 are symbolic transducers in which each transition can read more than a single character. A similar extension, called s-RTs, incorporates the notion of bounded look-back and roll-back, in the form of roll-back transitions not present in any other transducer formalism, to accommodate default or exceptional behavior.
s-FTs have also been extended with registers and are called symbolic transducers.10, 38 The key motivation is to support loop-carried data state, such as the maximal number seen so far. This model is closed under composition, but most decision problems for it are undecidable, even emptiness.
Sanitizers provide a first line of defense against cross site scripting attacks.
Symbolic tree transducers (s-TT)13 operate over trees instead of strings. s-FTs are a special case of s-TTs in which all nodes in the tree have one child or are leaves. s-TTs are only closed under composition when certain assumptions hold.
Applications of Symbolic Automata and Their Variants
Analysis of regular expressions. The connection between automata and regular expressions has been studied for more than 50 years. However, real-world regular expressions are much more complex than the simple model described in a typical theory of computation course. In particular, in practical regular expressions, the alphabet can contain upward of 2^16 characters due to the widely adopted UTF16 standard of Unicode characters. s-FAs can model characters as bitvectors and use various representations for predicates—for example, Binary Decision Diagrams (BDDs) or bitvector arithmetic. These representations turned out to be a viable way to model regular expression constraints in parameterized unit tests36 and for random password generation.9
Analysis of string encoders and sanitizers. The original motivation for s-FTs was to enable static analysis for string sanitizers.19 String sanitizers are particular string-to-string functions over Unicode (a very large alphabet that cannot be handled by traditional automata models) designed to encode special characters in text that may otherwise trigger malicious code execution in certain sensitive contexts, primarily in HTML pages. Thus, sanitizers provide a first line of defense against cross site scripting (XSS) attacks. When sanitizers can be represented as s-FTs, one can, for example, decide if two sanitizers A and B commute—that is, if ⟦A(B)⟧ = ⟦B(A)⟧—if a sanitizer A is idempotent—that is, if ⟦A(A)⟧ = ⟦A⟧—or if A cannot be compromised with an input attack vector—that is, if ran(A) ⊆ SafeSet. Checking such properties can help ensure the correct usage of sanitizers. s-EFTs have been used to prove that efficient implementations of BASE64 or UTF encoders and decoders correctly invert each other.10 Recently, s-EFTs have been used to automatically synthesize inverses of encoders that are correct by construction.23
Analysis of functional programs. Symbolic transducers have been used to perform static analysis of functional programs that operate over lists and trees.13 In particular, symbolic tree transducers were used to verify HTML sanitizers and to perform deforestation, a technique used to speed up function composition in functional language compilation. These programs cannot be modeled using traditional automata models because they operate over infinite alphabets—for example, lists of integers.
Code generation and parallelization. Symbolic transducers can be used to expose data parallelism in computations that may otherwise seem inherently sequential. A DFA transition function can be viewed as a particular kind of matrix multiplication. Therefore, DFA-based pattern matching can be executed in parallel by virtue of associativity of multiplication. This idea can be lifted to symbolic transducers and applied to many common string transformations. Symbolic transducers can also be extended with registers and branching rules, which are transitions with multiple target states in the form of if-then-else statements. The main purpose of a branching rule is to support built-in determinism and to enable a way to control evaluation order of transition guards for serial code generation, so that hot paths can be optimized. Moreover, symbolic transducers can be composed in a manner that is similar to loop fusion in order to avoid intermediate data structures. The main context where these ideas have been evaluated is in log/data processing pipelines.31
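The associativity argument can be made concrete in a few lines: each input symbol induces a state-to-state map (a 0/1 transition matrix, stored below as a tuple), maps compose associatively, so chunks of the input can be reduced independently and the partial results combined afterward. The DFA and the chunking scheme below are illustrative only.

```python
# Parallelizable DFA execution via associative composition of per-symbol maps.
from functools import reduce

NSTATES = 2

def symbol_map(delta, a):
    return tuple(delta(q, a) for q in range(NSTATES))

def compose(f, g):
    """First f, then g (the product of the corresponding transition matrices)."""
    return tuple(g[f[q]] for q in range(NSTATES))

def run_in_chunks(delta, word, chunks):
    identity = tuple(range(NSTATES))
    maps = [symbol_map(delta, a) for a in word]
    size = max(1, len(maps) // chunks)
    partial = [reduce(compose, maps[i:i + size], identity)      # one reduction per chunk
               for i in range(0, len(maps), size)]
    return reduce(compose, partial, identity)[0]                 # state reached from state 0

# Illustrative DFA over {0, 1}: state 1 iff the number of 1s read so far is odd.
delta = lambda q, a: q ^ a
assert run_in_chunks(delta, [1, 0, 1, 1, 0, 1], chunks=3) == 0   # four 1s: even parity
```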
Symbolic regex matcher (SRM). Analogously to s-FAs, regular expressions can also be defined modulo an alphabet theory A, instead of using a finite alphabet. One can develop a corresponding theory of symbolic derivatives of such symbolic regular expressions. Symbolic derivatives enable lazy unfolding and on-the-fly creation of s-FAs that leads to a new class of more predictable matching algorithms that avoid backtracking. Such algorithms and a tool called SRM32 have been developed at Microsoft Research and are being deployed daily in Azure for scanning credentials and other sensitive content in cloud service software. The input to SRM is a .NET regular expression over a restricted set of features. SRM’s linear matching complexity has helped avoid unpredictable performance in the built-in .NET regular expression engine that was susceptible to catastrophic backtracking on files with long lines, such as minified JavaScript and SQL server seeding files.
Open Problems and Future Directions
We conclude this article with a list of open questions that are unique to symbolic automata and transducers, as well as a summary of what unexplored applications could benefit from these models.
Theoretical directions. Adapting DFA algorithms to s-FAs. Several algorithms for finite automata are based on efficient data structures that take advantage of the alphabet being finite. For example, Hopcroft’s algorithm for DFA minimization iterates over the alphabet to refine state partitions through splitting. This iteration can be avoided in s-FAs using satisfiability checks on certain carefully crafted predicates.9 Paige-Tarjan’s algorithm for computing forward bisimulations of NFAs is similar to Hopcroft’s algorithm.30 The algorithm can compute the partition of forward-bisimilar states in time O(km log n). However, unlike Hopcroft’s algorithm, Paige-Tarjan’s algorithm is hard to adapt to the symbolic setting and the current adaptation has complexity O(2^m log n + 2^m·fsat(nℓ)).11 By contrast, the simpler O(km^2) algorithm for forward bisimulations can be easily turned into a symbolic O(m^2·fsat(ℓ)) algorithm.11 Another example of this difficulty of adaptation arises in checking equivalence of two unambiguous NFAs.34
The problem of learning symbolic automata has only received limited attention.3,15,26,27 Classical learning algorithms require querying an oracle for all characters in the alphabet, which is impossible when the alphabet is infinite. On the other hand, the learner only needs to learn the predicates on each transition of the s-FA, which might require just a finite number of queries to the oracle. This is a common problem in computational learning theory, and there is an opportunity to apply concepts from this domain to the problem of learning symbolic automata.
Properties of new models. Some symbolic models are still not well understood because they do not have finite automata counterparts. In particular, s-EFAs10 do not enjoy many good properties, but it is possible that they have practical subclasses—for example deterministic, unambiguous, etc.—with good properties.
A new model, called Symbolic Register Automata (s-RA),7 combines the symbolic aspect of s-FAs with the ability to compare elements at different positions in a string—for example, for equality. This model is strictly more expressive than its components and it enjoys the same decidability properties as its nonsymbolic counterpart24—for example, equivalence is decidable for deterministic s-RAs. s-RAs could be used to represent complex list properties and improve static-analysis procedures. However, their theoretical treatment is so far limited, making them an exciting avenue for those interested in symbolic automata.
Complexity and expressiveness. In classical automata theory, the complexities of the algorithms are given with respect to the number of states and transitions in the automaton. We discussed previously how the algorithm complexities in the symbolic setting depend on trade-offs made in the alphabet theory. Exactly understanding these trade-offs is an interesting research question.
New potential applications. Extracting automata from recurrent neural networks. Due to the widespread adoption of machine learning models, there has been renewed interest in the problem of extracting “explanations” from opaque black-box models. The work that is most relevant to ours is that of using automata learning to extract finite automata from recurrent neural networks.40 Existing works have used automata learning for extracting automata from very simple synthetic RNNs. In particular, these techniques can only handle small-valued features (2–4 values) and cannot handle complex real-valued feature sets. s-FAs can be used to address this limitation. In particular, recent work on learning s-FAs3 could potentially be used for extracting automata from complex real-valued neural networks.
SMT solving with sequences. SMT solvers14 have drastically changed the world of programming languages and turned previously unsolvable problems into feasible ones. The recent interest in verifying programs operating over sequences has created a need for extending existing SMT solving techniques to handle sequences over complex theories. Solvers that are able to handle strings typically use automata. Most solvers only handle strings over finite small alphabets and s-FAs have the potential to impact the way in which such solvers for SMT are built. Recently, some SMT solvers such as Z3 have started incorporating s-FAs in the context of supporting regular expressions in the sequence theory.33
Static analysis. Dalla Preda et al. recently investigated how to use s-FAs to model program binaries.5 s-FAs can use their state space to capture the control flow of a program and their predicates to abstract the I/O semantics of the basic blocks appearing in the program. This approach unifies existing syntactic and semantic techniques for binary similarity and has the promise of leading to a better understanding of techniques for malware detection in low-level code. The same authors recently started investigating whether, using s-FTs, the same techniques could be extended to the analysis of reflective code—that is, code that can modify itself at runtime.
Argyros et al.4 have used s-FA learning to extract models of Web applications and to detect inconsistencies among different implementations of the same logical application. Extending this approach to the more powerful extensions of s-FAs could lead to the ability to detect complex bugs in various kinds of software.
Conclusion
Symbolic automata and their variants have proven to be a versatile and powerful model to reason about practical applications that were beyond the reach of finite-alphabet automata models. In this article, we summarized what theoretical results are known for symbolic models, described the numerous extensions of symbolic automata, and clarified why these models are different from their finite-alphabet counterparts. We also presented the following list of open problems we hope that the research community will help us solve: Can we provide theoretical treatments of the complexities of the algorithms for symbolic models? Can we extend classical algorithms for automata over finite alphabets to the symbolic setting? Can we use symbolic automata algorithms to design decision procedures for the SMT theory of sequences?