
Enhancing Symbolic Execution with Veritesting

1. Introduction

Symbolic execution is a popular automatic approach for testing software and finding bugs. Over the past decade, numerous symbolic execution tools have appeared—both in academia and industry—demonstrating the effectiveness of the technique in finding crashing inputs, generating test cases with high coverage, and exposing software vulnerabilities.5 Microsoft’s symbolic executor SAGE is responsible for finding one-third of all bugs discovered by file fuzzing during the development of Windows 7.12

Symbolic execution is attractive because of two salient features. First, it generates real test cases; every bug report is accompanied by a concrete input that reproduces the problem (thus eliminating false reports). Second, symbolic execution systematically checks each program path exactly once—unlike typical testing techniques such as random fuzzing, no work is repeated.

Symbolic execution works by automatically translating program fragments to logical formulas. The logical formulas are satisfied by inputs that have a desired property, for example, inputs that execute a specific path or violate a safety property. Thus, with symbolic execution, finding crashing test cases effectively reduces to finding satisfying variable assignments in logical formulas, a process typically automated by Satisfiability Modulo Theories (SMT) solvers.9

At a high level, there are two main approaches for generating formulas: dynamic symbolic execution (DSE) and static symbolic execution (SSE). DSE executes the analyzed program fragment and generates formulas on a per-path basis. SSE translates program fragments into formulas, where each formula represents the desired property over any path within the selected fragment. The path-based nature of DSE introduces significant overhead when generating formulas, but the formulas themselves are easy to solve. The statement-based nature of SSE has less overhead and produces more succinct formulas that cover more paths, but the formulas are harder to solve. Is there a way to get the best of both worlds?

In this article, we present a new technique for generating formulas called veritesting that alternates between SSE and DSE. The alternation mitigates the difficulty of solving formulas, while alleviating the high overhead associated with a path-based DSE approach. In addition, DSE systems replicate the path-based nature of concrete execution, allowing them to handle cases such as system calls and indirect jumps where static approaches would need summaries or additional analysis. Alternating allows veritesting to switch to DSE-based methods when such cases are encountered.

We implemented veritesting in MergePoint, a system for automatically checking all programs in a Linux distribution.

MergePoint operates on 32-bit Linux binaries and does not require any source information (e.g., debugging symbols). We have systematically used MergePoint to test and evaluate veritesting on 33,248 binaries from Debian Linux. The binaries were collected by downloading all available packages from the Debian main repository and mining them for executable programs. We did not pick particular binaries or a dataset that would highlight specific aspects of our system; instead, we evaluate our system as it would be experienced in the general case. The large dataset allows us to explore questions with high fidelity and with a smaller chance of per-program sample bias. The binaries are exactly what runs on millions of systems throughout the world.

We demonstrate that MergePoint with veritesting beats previous techniques in the three main metrics: bugs found, node coverage, and path coverage. In particular, MergePoint has found 11,687 distinct bugs in 4379 different programs. Overall, MergePoint has generated over 15 billion SMT queries and created over 200 million test cases. Out of the 1043 bugs we have reported so far to the developers, 229 have been fixed.

Our main contributions are as follows. First, we propose a new technique for symbolic execution called veritesting. Second, we provide and study in depth the first system for testing every binary in an OS distribution using symbolic execution. Our experiments reduce the chance of per-program or per-dataset bias. We evaluate MergePoint with and without veritesting and show that veritesting outperforms previous work on all three major metrics. Finally, we improve open source software by finding over 10,000 bugs and generating millions of test cases. Debian maintainers have already incorporated 229 patches due to our bug reports. We have made our data available on our website.20 For more experiments and details we refer the reader to the original paper.2


2. Symbolic Execution Background

Symbolic execution14 is similar to normal program execution with one main twist: instead of using concrete input values, symbolic execution uses variables (symbols). During execution, all program values are expressed in terms of input variables. To keep track of the currently executing path, symbolic execution stores all conditions required to follow the same path (e.g., assertions, conditions on branch statements, etc.) in a logical formula called the path predicate.

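As a minimal illustration (our own sketch, with a hypothetical function classify and a symbolic input x, not the article's original listing), branch conditions accumulate in the path predicate as follows:

/* With x symbolic, executing the true branch adds the condition x > 5 to
 * the path predicate, and the false branch adds !(x > 5).  Any concrete
 * input whose value of x satisfies the accumulated predicate follows the
 * same execution path. */
int classify(int x) {
    int y;
    if (x > 5)
        y = 1;
    else
        y = 2;
    return y;
}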

Inputs that make the path predicate true are guaranteed to follow the exact same execution path. If there is no input satisfying the path predicate, the current execution path is infeasible.

In the following sections, we give a brief overview of the two main symbolic execution approaches: dynamic (DSE) and static (SSE). We refer the reader to symbolic execution surveys for more details and examples.5, 21

*  2.1. Dynamic symbolic execution (DSE)

Algorithm 1 presents the core steps in DSE. The algorithm operates on a representative low-level language with assignments, assertions and conditional jumps (simplified from the original Avgerinos et al.2).

Similar to an interpreter, a symbolic executor consists of a main instruction fetch-decode-execute loop (Lines 14-17). On each iteration, the removeOne function selects the next state to execute from Worklist, decodes the instruction, executes it and inserts the new execution states in Worklist. Each execution state is a triple (pc, Π, Δ) where pc is the current program counter, Π is the path predicate (the condition under which the current path will be executed), and Δ is a dictionary that maps each variable to its current value.

Unlike concrete interpreters, symbolic executors need to maintain a list of execution states (not only one). The reason is conditional branches. Line 10 (highlighted in red) demonstrates why: the executed branch condition could be true or false—depending on the program input—and the symbolic executor needs to execute both paths in order to check correctness. The process of generating two new execution states out of a single state (one for the true branch and one for the false) is typically called “forking.” Due to forking, every branch encountered during execution doubles the number of states that need to be analyzed, a problem known in DSE as path (or state) explosion.5

Example. We now give a short example demonstrating DSE in action. Consider the following program:

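The original listing is not reproduced here; a minimal sketch consistent with the walkthrough below (line 1 branches on input_char, and the exact statements at lines 2 and 3 are our assumption) is:

1:  if (input_char == 'B')
2:      bug();      /* reached only when input_char is 'B' */
3:  return;         /* the non-taken branch continues here */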

Before execution starts, DSE initializes the worklist with a state pointing to the start of the program (Line 12): (1, true, {}). After it fetches the conditional branch instruction, DSE will have to fork two new states in ExecuteInstruction: (2, input_char = ‘B’, {}) for the taken branch and (3, input_char ≠ ‘B’, {}) for the non-taken branch. Generating a test case for each execution path is straightforward: we send each path predicate to an SMT solver, and any satisfying assignment will execute the same path, for example, input_char → ‘B’ to reach the buggy line of code. An unsatisfiable path predicate means the selected path is infeasible.

Advantages/Disadvantages. Forking executors and analyzing a single path at a time has benefits: the analysis code is simple, solving the generated path predicates is typically fast (e.g., in SAGE4 99% of all queries take less than 1 s) since we only reason about a single path, and the concrete path-specific state resolves several practical problems. For example, executors can execute hard-to-model functionality concretely (e.g., system calls), side effects such as allocating memory in each DSE path are reasoned about independently without extra work, and loops are unrolled as the code executes. The disadvantage is path explosion: the number of executors can grow exponentially in the number of branches. The path explosion problem is the main motivation for our veritesting algorithm (see Section 3).

*  2.2. Static symbolic execution (SSE)

SSE is a verification technique for representing a program as a logical formula. Safety checks are encoded as logical assertions that will falsify the formula if safety is violated. Because SSE checks programs, not paths, it is typically employed to verify the absence of bugs. As we will see, veritesting repurposes SSE techniques for summarizing program fragments instead of verifying complete programs.

Modern SSE algorithms summarize the effects of both branches at path confluence points. In contrast, DSE traditionally forks off two executors at the same line, which subsequently remain forever independent. Due to space, we do not repeat complete SSE algorithms here and refer the reader to previous work.3, 15, 23

Advantages/Disadvantages. Unlike DSE, SSE does not suffer from path explosion. All paths are encoded in a single formula that is then passed to the solver.a For acyclic programs, existing techniques allow generating compact formulas of size O(n^2),10, 18 where n is the number of program statements. Despite these advantages over DSE, state-of-the-art tools still have trouble scaling to very large programs.13, 16 Problems include the presence of loops (how many times should they be unrolled?), formula complexity (are the formulas solvable if we encode loops and recursion?), the absence of concrete state (what is the concrete environment the program is running in?), as well as unmodeled behavior (a kernel model is required to emulate system calls). Another hurdle is completeness: for the verifier to prove absence of bugs, all program paths must be checked.


3. Veritesting

DSE has proven to be effective in analyzing real world programs.6, 12 However, the path explosion problem can severely reduce the effectiveness of the technique. For example, consider the following 7-line program that counts the occurrences of the character ‘B’ in an input string:

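The original listing is not reproduced here; a seven-line sketch consistent with the description (the exact statements, variable names, and the 100-character input length are assumptions inferred from the 2^100-path count and the 75-'B' test case) is:

int count = 0;
for (int i = 0; i < 100; i++) {
    if (input[i] == 'B')
        count++;
}
if (count == 75)
    bug();    /* triggered by any input containing exactly 75 'B's */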

The program above has 2^100 possible execution paths. Each path must be analyzed separately by DSE, thus making full path coverage unattainable for practical purposes. In contrast, two test cases suffice for obtaining full code coverage: a string of 75 ‘B’s and a string with no ‘B’s. However, finding such test cases in the 2^100-path state space is challenging.b We ran the above program with several state-of-the-art symbolic executors, including KLEE,6 S2E,8 Mayhem,7 and Cloud9 with state merging.16 None of the above systems was able to find the bug within a 1-h time limit (they ran out of memory or kept running). Veritesting allows us to find the bug and obtain full path coverage in 47 s on the same hardware.

Veritesting starts with DSE, but switches to an SSE-style approach when we encounter code that—similar to the example above—does not contain system calls, indirect jumps, or other statements that are difficult to precisely reason about statically. Once in SSE mode, veritesting performs analysis on a dynamically recovered control flow graph (CFG) and identifies a core of statements that are easy for SSE, and a frontier of hard-to-analyze statements. The SSE algorithm summarizes the effects of all paths through the easy nodes up to the hard frontier. Veritesting then switches back to DSE to handle the cases that are hard to treat statically.

In the rest of this section, we present the main algorithm and the details of the technique.

*  3.1. The algorithm

In default mode, MergePoint behaves as a typical dynamic symbolic executor. It starts exploration with a concrete seed and explores paths in the neighborhood of the original seed following a generational search strategy.12 MergePoint does not always fork when it encounters a symbolic branch. Instead, MergePoint intercepts the forking process of DSE—as shown in Line 10 (highlighted in green) of Algorithm 1—and performs veritesting.

Veritesting consists of four main steps:

  1. CFG Recovery. Obtains the CFG reachable from the address of the symbolic branch (Section 3.2).
  2. Transition Point Identification & Unrolling. Takes in a CFG, and outputs candidate transition points and a CFGe, an acyclic CFG with edges annotated with the control flow conditions (Section 3.3). Transition points indicate CFG locations with hard-to-model constructs where DSE may continue.
  3. SSE. Takes the acyclic CFGe and current execution state, and uses SSE to build formulas that encompass all feasible paths in the CFGe. The output is a mapping from CFGe nodes to SSE states (Section 3.4).
  4. Switch to DSE. Given the transition points and SSE states, returns the DSE executors to be forked (Section 3.5).

*  3.2. CFG recovery

The goal of the CFG recovery phase is to obtain a partial CFG of the program, where the entry point is the current symbolic branch. We now define the notion of underapproximate and overapproximate CFG recovery.

A recovered CFG is an underapproximation if all edges of the CFG represent feasible paths. A recovered CFG is an overapproximation if all feasible paths in the program are represented by edges in the CFG (statically recovering a perfect—that is, non-approximate—CFG on binary code can be non-trivial). A recovered CFG might be an underapproximation or an overapproximation, or even both in practice.

Veritesting was designed to handle both underapproximated and overapproximated CFGs without losing paths or precision (see Section 3.4). MergePoint uses a customized CFG recovery mechanism designed to stop recovery at function boundaries, system calls and unknown instructions.

The output of this step is a partial (possibly approximate) intraprocedural CFG. Unresolved jump targets (e.g., ret, call, etc.) are forwarded to a generic Exit node in the CFG. Figure 1a shows the form of an example CFG after the recovery phase.

*  3.3. Transition point identification and unrolling

Once the CFG is obtained, MergePoint proceeds to identify a set of transition points. Transition points define the boundary of the SSE algorithm (where DSE will continue exploration). Note that every possible execution path from the entry of the CFG needs to end in a transition point (our implementation uses domination analysis2).

For a fully recovered CFG, a single transition point may be sufficient, for example, the bottom node in Figure 1a. However, for CFGs with unresolved jumps or system calls, any predecessor of the Exit node will be a possible transition point (e.g., the ret node in Figure 1b). Transition points represent the frontier of the visible CFG, which stops at unresolved jumps, function boundaries and system calls. The number of transition points gives an upper-bound on the number of executors that may be forked.

Unrolling Loops. Loop unrolling represents a challenge for static verification tools. However, MergePoint is dynamic and can concretely execute the CFG to identify how many times each loop will execute. The number of concrete loop iterations determines the number of loop unrolls. MergePoint also allows the user to extend loops beyond the concrete iteration limit, by providing a minimum number of unrolls.

To make the CFG acyclic, back edges are removed and forwarded to a newly created node for each loop, for example, the “Incomplete Loop” node in Figure 1b, which is a new transition point that will be explored if executing the loop more times is feasible. In a final pass, the edges of the CFG are annotated with the conditions required to follow the edge.

The end result of this step is a CFGe and a set of transition points. Figure 1b shows an example CFG—without edge conditions—after transition point identification and loop unrolling.
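
As a schematic illustration of the unrolling step (our own sketch, not MergePoint output), a loop observed to execute twice concretely would appear in the CFGe roughly as follows, with the innermost check standing in for the "Incomplete Loop" transition point:

#include <stdio.h>

/* Hypothetical example: a summation loop that ran twice in the concrete
 * execution, unrolled twice for SSE. */
int sum_unrolled(const int *a, int n) {
    int i = 0, sum = 0;
    if (i < n) {                    /* first unrolled iteration */
        sum += a[i]; i++;
        if (i < n) {                /* second unrolled iteration */
            sum += a[i]; i++;
            if (i < n) {
                /* incomplete loop: a transition point where DSE
                 * resumes if a third iteration is feasible */
            }
        }
    }
    return sum;
}

int main(void) {
    int a[] = { 1, 2 };
    printf("%d\n", sum_unrolled(a, 2));
    return 0;
}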

*  3.4. Static symbolic execution

Given the CFGe, MergePoint applies SSE to summarize the execution of multiple paths. Previous work3 first converted the program to Gated Single Assignment (GSA)22 and then performed symbolic execution. In MergePoint, we encode SSE as a single-pass dataflow analysis in which GSA is computed on the fly—more details can be found in the full paper.2

To illustrate the algorithm, we run SSE on the following program:

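The original listing is not reproduced here; a reconstruction consistent with the variable state in Figure 2 and the merged result below (the wrapper function f is our own addition, and y0 denotes the value of y on entry) is:

int f(int x, int y) {       /* y holds y0 on entry */
    if (x > 1)
        y = 42;
    else if (x < 42)
        y = 17;
    return y;               /* ite(x > 1, 42, ite(x < 42, 17, y0)) */
}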

Figure 2 shows the progress of the variable state as SSE iterates through the blocks. SSE starts from the entry of the CFGe and executes basic blocks in topological order. SSE uses conditional ite (if-then-else) expressions—ite is a ternary operator similar to ?: in C—to encode the behavior of multiple paths. For example, every variable assignment following the true branch after the condition (x > 1) in Figure 2 will be guarded as ite(x > 1, value, ⊥), where value denotes the assigned value and ⊥ is a don’t care term. Thus, for the edge from B3 to B6 in Figure 2, Δ is updated to {y → ite(x > 1, 42, ⊥)}.

When distinct paths (with distinct Δ’s) merge to the same confluence point on the CFG, a merge operator is needed to “combine” the side effects from all incoming edges. To do so, we apply the following recursive merge operation M to each symbolic value:

M(v, ⊥) = v
M(⊥, v) = v
M(ite(e, v1, v2), ite(e, v1′, v2′)) = ite(e, M(v1, v1′), M(v2, v2′))

This way, at the last node of Figure 2, the value of y will be M(ite(x > 1, 42, ⊥), ite(x > 1, ⊥, ite(x < 42, 17, y0))), which is merged to ite(x > 1, 42, ite(x < 42, 17, y0)), capturing all possible paths.c Note that this transformation inlines multiple statements into a single one using ite operators. Also, note that values from unmerged paths (⊥ values) can be immediately simplified, for example, ite(e, x, ⊥) = x. During SSE, MergePoint keeps a mapping from each traversed node to the corresponding variable state.
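
To make the merge concrete, the following toy C program (our own illustration, not MergePoint's implementation) implements M over a small expression type and prints the merged term for the example above:

#include <stdio.h>
#include <stdlib.h>

/* Toy symbolic expressions: bottom (don't care), a leaf value, or ite(cond, t, f). */
typedef struct expr expr;
struct expr {
    enum { BOT, LEAF, ITE } kind;
    const char *text;   /* leaf value (LEAF) or guard condition (ITE) */
    expr *t, *f;        /* ITE branches */
};

static expr *mk(int kind, const char *text, expr *t, expr *f) {
    expr *e = malloc(sizeof *e);
    e->kind = kind; e->text = text; e->t = t; e->f = f;
    return e;
}
static expr *bot(void)           { return mk(BOT, NULL, NULL, NULL); }
static expr *leaf(const char *v) { return mk(LEAF, v, NULL, NULL); }
static expr *ite(const char *c, expr *t, expr *f) { return mk(ITE, c, t, f); }

/* Recursive merge M: a bottom side gives way to the other side (so
 * ite(e, x, bottom) collapses to x); two ite terms over the same guard are
 * merged branch-wise.  This sketch assumes those are the only cases. */
static expr *merge(expr *a, expr *b) {
    if (a->kind == BOT) return b;
    if (b->kind == BOT) return a;
    return ite(a->text, merge(a->t, b->t), merge(a->f, b->f));
}

static void show(const expr *e) {
    if (e->kind == BOT)       printf("_|_");
    else if (e->kind == LEAF) printf("%s", e->text);
    else { printf("ite(%s, ", e->text); show(e->t); printf(", "); show(e->f); printf(")"); }
}

int main(void) {
    /* The two incoming values of y at the confluence point of Figure 2. */
    expr *from_then = ite("x > 1", leaf("42"), bot());
    expr *from_else = ite("x > 1", bot(), ite("x < 42", leaf("17"), leaf("y0")));
    show(merge(from_then, from_else));   /* prints ite(x > 1, 42, ite(x < 42, 17, y0)) */
    printf("\n");
    return 0;
}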

Handling Overapproximated CFGs. At any point during SSE, the path predicate is computed as the conjunction of the DSE predicate Π_DSE and the SSE predicate Π_SSE obtained by substitution. MergePoint uses the resulting predicate to perform path pruning, which offers two advantages: any infeasible edges introduced by CFG recovery are eliminated, and our formulas only consider feasible paths (e.g., the shaded nodes in Figure 1b can be ignored).

*  3.5. Switch to DSE

After the SSE pass is complete, we check which states need to be forked. We first gather the transition points and check whether they were reached by SSE. For each distinct, reachable transition point, MergePoint forks a new DSE executor in a final step, constructing its state (pc, Π, Δ) from the SSE state at that transition point.

Generating Test Cases. Though MergePoint can generate an input for each covered path, that would result in an exponential number of test cases in the size of the CFGe. By default, we only output one test per CFG node explored by SSE. (Note that for branch coverage the algorithm can be modified to generate a test case for every edge of the CFG.) The number of test cases can alternatively be minimized by generating test cases only for nodes that have not been covered by previous test cases.

Underapproximated CFGs. Finally, before proceeding with DSE, veritesting checks whether any paths were missed due to an underapproximated CFG. To do so, it queries the negation of the path predicate at the Exit node, that is, the negation of the disjunction of the path predicates of the forked states. If the query is satisfiable, an extra state is forked to explore the missed paths.


4. Evaluation

In this section we evaluate our techniques using multiple benchmarks with respect to three main questions:

  1. Does Veritesting find more bugs than previous approaches? We show that MergePoint with veritesting finds twice as many bugs as without.
  2. Does Veritesting improve node coverage? We show MergePoint with veritesting improves node coverage over DSE.
  3. Does Veritesting improve path coverage? Previous work showed dynamic state merging outperforms vanilla DSE.16 We show MergePoint with veritesting improves path coverage and outperforms both approaches.

We detail our large-scale experiment on 33,248 programs from Debian Linux. MergePoint generated billions of SMT queries, hundreds of millions of test cases, millions of crashes, and found 11,687 distinct bugs.

Overall, our results show MergePoint with veritesting improves performance on all three metrics. We also show that MergePoint is effective at checking a large number of programs. Before proceeding to the evaluation, we present our setup and benchmark sets. All experimental data from MergePoint are publicly available online.20

Experiment Setup. We ran all distributed MergePoint experiments on a private cluster of 100 virtual nodes, each running Debian Squeeze on a single Intel 2.68 GHz Xeon core with 1 GB of RAM. All comparison tests against previous systems were run on a single node with an Intel Core i7 CPU and 16 GB of RAM, since these systems could not run on our distributed infrastructure.

We created three benchmarks: coreutils, BIN, and Debian. Coreutils and BIN were compiled so that coverage information could be collected via gcov. The Debian benchmark consists of binaries used by millions of users worldwide.

Benchmark 1: GNU coreutils (86 programs). We use the coreutils benchmark to compare to previous work since: (1) the coreutils suite was originally used by KLEE6 and other researchers6, 7, 16 to evaluate their systems, and (2) the configuration parameters used by other tools for these programs are publicly available.6 Numbers reported with respect to coreutils do not include library code, to remain consistent with the compared work. Unless otherwise specified, we ran each program in this suite for 1 h.

Benchmark 2: The BIN suite (1023 programs). We obtained all the binaries located under the /bin, /usr/bin, and /sbin directories of a default Debian Squeeze installation.d We kept binaries that read from /dev/stdin or from a file specified on the command line. In a final processing step, we filtered out programs that require user interaction (e.g., GUIs). BIN consists of 1023 binary programs and comprises 2,181,735 executable lines of source code (as reported by gcov). The BIN benchmark includes library code packaged with the applications in the dataset, making coverage measurements more conservative than for coreutils. For example, an application may include an entire library of which only one function is reachable; we nonetheless include all uncovered lines from the library source file in our coverage computation. Unless otherwise specified, we ran each program in this suite for 30 min.

Benchmark 3: Debian (33,248 programs). This benchmark consists of all binaries from Debian Wheezy and Sid. We extracted binaries and shared libraries from every package available from the main Debian repository. We downloaded 23,944 binaries from Debian Wheezy, and 27,564 binaries from Debian Sid. After discarding duplicate binaries in the two distributions, we are left with a benchmark comprising 33,248 binaries. This represents an order of magnitude more applications than have been tested by prior symbolic execution research. We analyzed each application for less than 15 min per experiment.

*  4.1. Bug finding

Table 1 shows the number of bugs found by MergePoint with and without veritesting. Overall, on BIN, veritesting finds twice as many bugs as MergePoint without it. Veritesting finds 63 (83%) of the bugs found without veritesting, as well as 85 additional distinct bugs that traditional DSE could not detect.

Veritesting also found two previously unknown crashes in coreutils, even though these applications have been thoroughly tested with symbolic execution.6, 7, 16 Further investigation showed that the coreutils crashes originate from a library bug that had been undetected for 9 years. The bug is in the time zone parser of the GNU portability library Gnulib, which dynamically deallocates a statically allocated memory buffer. It can be triggered by running touch -d 'TZ=""" ' or date -d 'TZ=""" '. Furthermore, Gnulib is used by several popular projects, and we have confirmed that the bug affects other programs, for example, find, patch, tar.

*  4.2. Node coverage

We evaluated MergePoint both with and without Veritesting on node coverage. Table 2 shows our overall results. Veritesting improves node coverage on average in all cases. Note that any positive increase in coverage is important. In particular, Kuznetsov et al. showed both dynamic state merging and SSE reduced node coverage when compared to vanilla DSE (Figure 8 in Ref.16).

Figures 3 and 4 break down the improvement per program. For coreutils, enabling veritesting decreased coverage in only three programs (md5sum, printf, and pr). Manual investigation of these programs showed that veritesting generated much harder formulas and spent more than 90% of its time in the SMT solver, resulting in timeouts. For readability, Figure 4 omits the BIN programs where node coverage was unchanged. Overall, node coverage on BIN improved for 446 programs and decreased for 206.

Figure 5 shows the average coverage over time achieved by MergePoint with and without veritesting for the BIN suite. After 30 min, MergePoint without veritesting reached 34.45% code coverage. Veritesting achieved the same coverage in less than half the original time (12 min 48 s). Veritesting’s coverage improvement becomes more substantial as analysis time goes on. Veritesting achieved higher coverage velocity, that is, the rate at which new coverage is obtained, than standard symbolic execution. Over a longer period of time, the difference in velocity means that the coverage difference between the two techniques is likely to increase further, showing that the longer MergePoint runs, the more essential veritesting becomes for high code coverage.

The above tests demonstrate the improvements of veritesting within MergePoint. We also ran both S2E and MergePoint (with veritesting) on coreutils using the same configuration, for 1 h on each utility, excluding 11 programs where S2E emits assertion errors. Figure 6 compares the increase in coverage obtained by MergePoint with veritesting over S2E. MergePoint achieved 27% more code coverage on average than S2E. We investigated the programs where S2E outperforms MergePoint. For instance, on pinky—the main outlier in the distribution—S2E achieves 50% more coverage. The main reason for this difference is that pinky uses a system call not handled by the current MergePoint implementation (netlink socket).

*  4.3. Path coverage

We evaluated the path coverage of MergePoint both with and without veritesting using two metrics: time to complete exploration and multiplicity.

Time to complete exploration. The metric reports the amount of time required to completely explore a program, in those cases where exploration finished.

The number of paths checked by an exhaustive DSE run is also the total number of paths possible. In such cases we can measure (a) whether veritesting also completed, and (b) if so, how long it took relative to DSE. MergePoint without veritesting was able to exhaust all paths for 46 programs. MergePoint with veritesting completes all paths 73% faster than without veritesting. This result shows that veritesting is faster when reaching the same end goal.

Multiplicity. Multiplicity was proposed by Kuznetsov et al.16 as a metric correlated with path coverage. The initial multiplicity of a state is 1. When a state forks, both children inherit the state multiplicity. When combining two states, the multiplicity of the resulting state is the sum of their multiplicities. A higher multiplicity indicates higher path coverage.

We also evaluated the multiplicity for veritesting. Figure 7 shows the state multiplicity probability distribution function for BIN. The average multiplicity over all programs was 1.4 × 10^290 and the median was 1.8 × 10^12 (recall, higher is better). The distribution resembles a lognormal with a spike for programs with multiplicity of 4096 (2^12). The multiplicity average and median for coreutils were 1.4 × 10^199 and 4.4 × 10^11, respectively. Multiplicity had high variance; thus the median is likely a better performance estimator.

*  4.4. Checking Debian

In this section, we evaluate veritesting’s bug finding ability on every program available in Debian Wheezy and Sid. We show that veritesting enables large-scale bug finding.

Since we test 33,248 binaries, any type of per-program manual labor is impractical. We used a single input specification for our experiments: -sym-arg 1 10 -sym-arg 2 2 -sym-arg 3 2 -sym-anon-file 24 -sym-stdin 24 (3 symbolic arguments up to 10, 2, and 2 bytes, respectively, and symbolic files/stdin up to 24 bytes). MergePoint encountered at least one symbolic branch in 23,731 binaries. We analyzed Wheezy binaries once, and Sid binaries twice (one experiment with a 24-byte symbolic file, the other with 2100 bytes to find buffer overflows). Including data processing, the experiments took 18 CPU-months.

Our overall results are shown in Table 3. Veritesting found 11,687 distinct bugs that crash programs. The bugs appear in 4379 of the 33,248 programs. Veritesting also finds bugs that are potential security threats. Two hundred and twenty-four crashes have a corrupt stack, that is, a saved instruction pointer has been overwritten by user input. As an interesting data point, it would have cost $0.28 per unique crash had we run our experiments on the Amazon Elastic Compute Cloud, assuming that our cluster nodes are equivalent to large instances.

The volume of bugs makes it difficult to report all bugs in a usable manner. Note that each bug report includes a crashing test case, thus reproducing the bug is easy. Instead, practical problems such as identifying the correct developer and ensuring responsible disclosure of potential vulnerabilities dominate our time. As of this writing, we have reported 1043 crashes in total.19 Not a single report was marked as unreproducible on the Debian bug tracking system. Two hundred and twenty-nine bugs have already been fixed in the Debian repositories, demonstrating the real-world impact of our work. Additionally, the patches gave an opportunity to the package maintainers to harden at least 29 programs, enabling modern defenses like stack canaries and DEP.

*  4.5. Discussion

Our experiments so far show that veritesting can effectively increase multiplicity, achieve higher code coverage, and find more bugs. In this section, we discuss why it works well according to our collected data.

Each run takes longer with veritesting because multi-path SMT formulas tend to be harder. The coverage improvement demonstrates that the additional SMT cost is amortized over the increased number of paths represented in each run. At its core, veritesting is pushing the SMT engine harder instead of brute-forcing paths by forking new DSE executors. This result confirms that the benefits of veritesting outweigh its cost. The distribution of path times (Figure 8b) shows that the majority (56%) of paths explored take less than 1 s for standard symbolic execution. With veritesting, fewer paths complete that quickly (21%), and we get more timeouts (6.4% vs. 1.2%). The same differences are also reflected in the component breakdown. With veritesting, most of the time (63%) is spent in the solver, while with standard DSE most of the time (60%) is spent re-executing similar paths that could be merged and explored in a single execution.

Of course there is no free lunch, and some programs do perform worse. We emphasize that on average over a fairly large dataset our results indicate the tradeoff is beneficial.


5. Related Work

Symbolic execution was discovered in 1975,14 with the volume of academic research and commercial systems exploding in the last decade. Notable symbolic executors include SAGE and KLEE. SAGE4 is responsible for finding one third of all bugs discovered by file fuzzing during the development of Windows 7.4 KLEE6 was the first tool to show that symbolic execution can generate test cases that achieve high coverage on real programs by demonstrating it on the UNIX utilities. There is a multitude of symbolic execution systems—for more details, we refer the reader to recent surveys.5, 21

Merging execution paths is not new. Koelbl and Pixley15 pioneered path merging in SSE. Concurrently and independently, Xie and Aiken23 developed Saturn, a verification tool capable of encoding multiple paths before converting the problem to SAT. Hansen et al.13 follow an approach similar to Koelbl et al. at the binary level. Babic and Hu3 improved their static algorithm to produce smaller and faster-to-solve formulas by leveraging GSA.22 The static portion of our veritesting algorithm is built on top of their ideas. In our approach, we alternate between SSE and DSE. Our approach amplifies the effect of DSE and takes advantage of the strengths of both techniques.

The efficiency of the static algorithms mentioned above typically stems from various types of if-conversion,1 a technique for converting code with branches into predicated straight-line statements. The technique is also known as φ-folding,17 a compiler optimization technique that collapses simple diamond-shaped structures in the CFG. Godefroid11 introduced function summaries to test code compositionally. The main idea is to record the output of an analyzed function and reuse it whenever the function is called again. Veritesting generates context-sensitive, on-demand summaries of code fragments as the program executes—extending to compositional summaries is possible future work.
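
As a simple illustration of if-conversion (our own example, not drawn from Ref.1), a branch can be rewritten as a predicated, straight-line computation:

#include <stdio.h>

/* Branchy version: two paths through the code. */
int max_branch(int a, int b) {
    int m;
    if (a > b) m = a; else m = b;
    return m;
}

/* If-converted version: a single straight-line path; the branch
 * condition becomes a predicate selecting the result. */
int max_predicated(int a, int b) {
    return (a > b) ? a : b;
}

int main(void) {
    printf("%d %d\n", max_branch(3, 7), max_predicated(3, 7));
    return 0;
}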


6. Conclusion

In this article we proposed MergePoint and veritesting, a new technique to enhance symbolic execution with verification-based algorithms. We evaluated MergePoint on 1023 programs and showed that veritesting increases the number of bugs found, node coverage, and path coverage. We showed that veritesting enables large-scale bug finding by testing 33,248 Debian binaries, and finding 11,687 bugs. Our results have had real world impact with 229 bug fixes already present in the latest version of Debian.


Acknowledgments

We would like to thank Samantha Gottlieb, Tiffany Bao, and our anonymous reviewers for their comments and suggestions. We also thank Mitch Franzos and PDL for the support they provided during our experiments. This research was supported in part by grants from DARPA and the NSF, as well as the Prabhu and Poonam Goel Fellowship.


Figures

F1 Figure 1. Veritesting on a program fragment with loops and system calls. (a) Recovered CFG. (b) CFG after transition point identification & loop unrolling. Unreachable nodes are shaded.

F2 Figure 2. SSE running on an unrolled CFG—the variable state (Δ) is shown within brackets.

F3 Figure 3. Code coverage difference on coreutils before and after veritesting.

F4 Figure 4. Code coverage difference on BIN before and after veritesting, where it made a difference.

F5 Figure 5. Coverage over time (BIN suite).

F6 Figure 6. Code coverage difference on coreutils obtained by MergePoint versus S2E.

F7 Figure 7. Multiplicity distribution (BIN suite).

F8 Figure 8. MergePoint performance before and after veritesting for BIN. The above figures show: (a) Performance breakdown for each component; (b) Analysis time distribution.


Tables

T1 Table 1. Veritesting finds 2× more bugs.

T2 Table 2. Veritesting improves node coverage.

T3 Table 3. Overall numbers for checking Debian.

References

    1. Allen, J.R., Kennedy, K., Porterfield, C., Warren, J. Conversion of control dependence to data dependence. In Proceedings of the 10th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (Austin, Texas, 1983). ACM Press, New York, NY, 177–189.

    2. Avgerinos, T., Rebert, A., Cha, S.K., Brumley, D. Enhancing symbolic execution with veritesting. In Proceedings of the 36th International Conference on Software Engineering, ICSE 2014 (Hyderabad, India, 2014). ACM, New York, NY, 1083–1094. DOI: 10.1145/2568225.2568293. URL http://doi.acm.org/10.1145/2568225.2568293.

    3. Babic, D., Hu, A.J. Calysto: Scalable and precise extended static checking. In Proceedings of the 30th International Conference on Software Engineering (Leipzig, Germany, 2008). ACM, New York, NY, 211–220.

    4. Bounimova, E., Godefroid, P., Molnar, D. Billions and billions of constraints: Whitebox Fuzz testing in production. In Proceedings of the 35th IEEE International Conference on Software Engineering (San Francisco, CA, 2013). IEEE Press, Piscataway, NJ, 122–131.

    5. Cadar, C., Sen, K. Symbolic execution for software testing: three decades later. Commun. ACM 56, 2 (2013), 82–90.

    6. Cadar, C., Dunbar, D., Engler, D. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Symposium on Operating System Design and Implementation (San Diego, CA, 2008). USENIX Association, Berkeley, CA, 209–224.

    7. Cha, S.K., Avgerinos, T., Rebert, A., Brumley, D. Unleashing mayhem on binary code. In Proceedings of the 33rd IEEE Symposium on Security and Privacy (2012). IEEE Computer Society, Washington, DC, 380–394.

    8. Chipounov, V., Kuznetsov, V., Candea, G. S2E: A platform for in vivo multi-path analysis of software systems. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (Newport Beach, CA, 2011). ACM, New York, NY, 265–278.

    9. de Moura, L., Bjørner, N. Satisfiability modulo theories: Introduction and applications. Commun. ACM 54, 9 (Sept. 2011), 69. ISSN 00010782. doi: 10.1145/1995376.1995394. URL http://dl.acm.org/citation.cfm?doid=1995376.1995394.

    10. Flanagan, C., Saxe, J. Avoiding exponential explosion: Generating compact verification conditions. In Proceedings of the 28th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (London, United Kingdom, 2001). ACM, New York, NY, 193–205.

    11. Godefroid, P. Compositional dynamic test generation. In Proceedings of the 34th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Nice, France, 2007). ACM, New York, NY, 47–54.

    12. Godefroid, P., Levin, M.Y., Molnar, D. SAGE: Whitebox fuzzing for security testing. Commun. ACM 55, 3 (2012), 40–44.

    13. Hansen, T., Schachte, P., Søndergaard, H. State joining and splitting for the symbolic execution of binaries. Runtime Verif. (2009), 76–92.

    14. King, J.C. Symbolic execution and program testing. Commun. ACM 19, 7 (1976), 385–394.

    15. Koelbl, A., Pixley, C. Constructing efficient formal models from high-level descriptions using symbolic simulation. Int. J. Parallel Program. 33, 6 (Dec. 2005), 645–666.

    16. Kuznetsov, V., Kinder, J., Bucur, S., Candea, G. Efficient state merging in symbolic execution. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (Beijing, China, 2012). ACM, New York, NY, 193–204.

    17. Lattner, C., Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (Palo Alto, CA, 2004). IEEE Computer Society, Washington, DC, 75–86.

    18. Leino, K.R.M. Efficient weakest preconditions. Inform. Process. Lett. 93, 6 (2005), 281–288.

    19. Mayhem. 1.2K Crashes in Debian, 2013. URL http://lists.debian.org/debian-devel/2013/06/msg00720.html.

    20. Mayhem. Open Source Statistics & Analysis, 2013. URL http://www.forallsecure.com/summaries.

    21. Schwartz, E.J., Avgerinos, T., Brumley, D. All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask). In Proceedings of the 31st IEEE Symposium on Security and Privacy (2010). IEEE Computer Society, Washington, DC, 317–331.

    22. Tu, P., Padua, D. Efficient building and placing of gating functions. In Proceedings of the 16th ACM Conference on Programming Language Design and Implementation (La Jolla, CA, 1995). ACM, New York, NY, 47–55.

    23. Xie, Y., Aiken, A. Scalable error detection using boolean satisfiability. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Long Beach, CA, 2005). ACM, New York, NY, 351–363.

Footnotes

    a. Note the solver may still have to reason internally about an exponential number of paths—finding a satisfying assignment to a logical formula is an NP-hard problem.

    b. For example, approximately 2^78 paths reach the buggy line of code. The probability of finding one of those paths by random selection is approximately 2^78/2^100 = 2^-22.

    c. To efficiently handle deeply nested and potentially duplicated expressions, MergePoint utilizes hash-consing at the expression level.2

    d. What better source of benchmark programs than the ones you use everyday?

    The original version of this paper was published in the Proceedings of the 36th International Conference on Software Engineering (Hyderabad, India, May 31–June 7, 2014). ACM, New York, NY, 1083–1094.
