Communications of the ACM, August 2010 (Vol. 53, No. 8)

Research highlights
# Reasoning About the Unknown in Static Analysis

Static program analysis techniques cannot know certain values, such as the value of user input or network state, at analysis time. While such unknown values need to be treated as nondeterministic choices made by the program's execution environment, it is still possible to glean very useful information about how such statically unknown values may or must influence computation. We give a method for integrating such nondeterministic choices with an expressive static analysis. Interestingly, we cannot solve the resulting recursive constraints directly, but we give an exact method for answering all may and must queries. We show experimentally that the resulting solved forms are concise in practice, enabling us to apply the technique to very large programs, including an entire operating system.

Preventing software errors is a central challenge in software engineering. The many tool-based approaches to the problem can be grouped roughly into two categories. *Dynamic analysis* techniques discover properties by monitoring program executions for *particular* inputs; standard testing is the most commonly used form of dynamic analysis. In contrast, a *static analysis* discovers properties that hold for *all* possible inputs; a *sound* static analysis concludes a program is error-free only if the program indeed has no errors.

Unlike dynamic analyses, sound static analyses have the advantage of never missing any potential errors, but, unfortunately, there is no free lunch: Soundness usually comes at the cost of reporting *false positives* (i.e., spurious warnings about error-free code) because static analyses must approximate some aspects of program behavior. This approximation is inevitable as analyzing even very simple properties of programs' behavior is undecidable. Hence, a key challenge for static analysis techniques is achieving a satisfactory combination of precision, soundness, and scalability by reporting as few false positives as possible while still being sound and scaling to real systems.

This goal of obtaining satisfactory precision is further complicated by the fact that certain values are simply unknown statically: For example, if a program queries the user for an input, this input appears as a nondeterministic environment choice to the static analysis. Similarly, the result of receiving arbitrary data from the network or the result of reading operating system state are all unknowns that need to be treated as nondeterministic environment choices by the analysis.

Even in the special case where all program inputs are known, static analyses still need to deal with unknowns that arise from approximating program behavior. A static analysis cannot simply carry out an exact program simulation; if nothing else, we usually want to guarantee the analysis terminates even if the program does not. Thus, static analysis always has some imprecision built in. For example, since lists, sets, and trees may have an unbounded number of elements, many static techniques do not precisely model the data structure's contents. Reading an element from a data structure is modeled as a nondeterministic choice that returns any element of the data structure. Similarly, if the chosen program abstraction cannot express nonlinear arithmetic, the value of a "complicated" expression, such as `coef*a*b+size`, may also need to be treated as an unknown by the static analysis.

The question of what, if any, useful information can be garnered from such unknown values is not much discussed in the literature. It is our impression that if the question is considered at all, it is left as an engineering detail in the implementation; at least, this is the approach we have taken ourselves in the past. But two observations have changed our minds: First, unknown values are astonishingly pervasive when statically analyzing programs; there are always calls to external functions not modeled by the analysis as well as approximations that lose information. Second, in our experience, analyses that do a poor job handling unknown values either end up being unscalable or too imprecise. For these reasons, we now believe a systematic approach for dealing with unknown values is a problem of the first order in the design of an expressive static analysis.

We begin by informally sketching a very simple, but imprecise, approach to dealing with unknown values in static analysis. Consider the following code snippet:
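The snippet itself did not survive in this version of the article; the following is a plausible reconstruction consistent with the discussion that follows (the file name, the `get_user_input` stub, and the numbered comments are ours):

```c
#include <stdio.h>

/* Hypothetical stub: to a static analysis, this return value is an
 * unknown environment choice. It is fixed here so the sketch can run. */
static char get_user_input(void) { return 'y'; }

/* Returns 0 iff the fopen on line (2) is matched by the fclose on (4). */
static int process(void) {
    FILE *fp = NULL;
    char input = get_user_input();           /* (1) unknown statically   */
    if (input == 'y')                        /* (2) */
        fp = fopen("/tmp/example.txt", "w");
    /* ... use fp ...                           (3) */
    if (input == 'y') {                      /* (4) same unknown as (2)  */
        if (fp != NULL) fclose(fp);
        fp = NULL;
    }
    return fp == NULL ? 0 : -1;
}
```

Because the conditions on lines (2) and (4) test the same unknown value, every execution either opens and closes the file or does neither.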

Suppose we want to prove that for every call to `fopen`, there is exactly one matching call to `fclose`. For the matching property to be violated, it must be the case that the value of `input` is `'y'` on line (2), but the value of `input` is not `'y'` on line (4). Since the value of the input is unknown, one simple approach is to represent the unknown value using a special abstract constant *. Now, programs may have multiple sources of unknown values, all of which are represented by *. Thus, * is not a particular unknown but the set of all unknowns in the program. Hence, the predicates * = `'y'` (which should be read as: `'y'` is equal to some element of the values represented by *) and * ≠ `'y'` (which should be read as: `'y'` is not equal to some element of the values represented by *) are simultaneously satisfiable. As a result, program paths where `input` is equal to `'y'` at line (2), but not equal to `'y'` at line (4) (or vice versa) cannot be ruled out, and the analysis would erroneously report an error.
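The imprecision of the * abstraction can be checked by brute force over a tiny value domain (the two-value domain and the variable names below are ours):

```c
#include <stdbool.h>

/* Under '*', the two reads of `input` become independent unknowns u1 and
 * u2 drawn from the same pool of unknowns, so the error-path condition
 * u1 == 'y' && u2 != 'y' is satisfiable. */
static bool star_error_feasible(void) {
    const char vals[2] = {'y', 'n'};
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            if (vals[i] == 'y' && vals[j] != 'y')
                return true;
    return false;
}

/* With a single choice variable b naming the one call to
 * get_user_input, the same condition b == 'y' && b != 'y'
 * is unsatisfiable. */
static bool choice_error_feasible(void) {
    const char vals[2] = {'y', 'n'};
    for (int i = 0; i < 2; i++)
        if (vals[i] == 'y' && vals[i] != 'y')
            return true;
    return false;
}
```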

A more precise alternative for reasoning about unknown values is to name them using variables (called *choice variables*) that stand for a single, but unknown, value. Observe that this strategy of introducing choice variables is a refinement over the previous approach because two distinct environment choices are modeled by two distinct choice variables, β and β′. Thus, while a choice variable may represent any value, it cannot represent two distinct values at the same time. For instance, if we introduce the choice variable β for the unknown value of the result of the call to `get_user_input` on line 1, the constraint characterizing the failure condition is β = `'y'` ∧ β ≠ `'y'`, which is unsatisfiable, establishing that the call to `fopen` is matched by a call to `fclose`. The insight is that the use of choice variables allows the analysis to identify when two values arise from the same environment choice without imposing any restrictions on their values.

While this latter strategy allows for more precise reasoning, it leads to two difficulties, one theoretical and one practical, that the simpler, but less precise, strategy does not suffer from. Consider the following function:^{a}
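The body of `query_user` was lost in extraction; the following is a plausible reconstruction consistent with the line labels A–G referenced below (the prompt text and the scripted `get_user_input` stub are ours, added so the sketch can run):

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stub: a scripted sequence of user inputs. */
static const char *script = "?n";
static char get_user_input(void) { return *script ? *script++ : 'n'; }

static bool query_user(bool feature_enabled) {
    if (!feature_enabled) return false;        /* A */
    char input = get_user_input();             /* B */
    if (input == 'y') return true;             /* C */
    if (input != 'n') {                        /* D */
        printf("Please answer y or n.\n");     /* E */
        return query_user(true);               /* G */
    }
    return false;
}
```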

Suppose we want to know when `query_user` returns `true`. The return value of `get_user_input` is statically unknown; hence it is identified by a choice variable β. The variable `feature_enabled`, however, is definitely not a nondeterministic choice, as its value is determined by the function's caller. We represent `feature_enabled` by an *observable variable*, α, provided by callers of this function. The condition, π, under which `query_user` returns `true` (abbreviated `T`) in any calling context, is then given by the constraint:

⟨β⟩. π = (α = `T`) ∧ ((β = `'y'`) ∨ (¬(β = `'n'`) ∧ π[`T`/α]))

This formula is read as follows. The term α = `T` captures that the function returns `true` only if `feature_enabled` is `true` (line A). Furthermore, the user input must either be `'y'` (term β = `'y'` and line C), or it must not be `'n'` (term ¬(β = `'n'`) and line D) and the recursive call on line G must return `true` (term π[`T`/α]). Observe that because the function is recursive, so is the formula. In the term π[`T`/α], the substitution [`T`/α] models that on the recursive call, the formal parameter α is replaced by the actual parameter `true`. Finally, the binding ⟨β⟩ reminds us that β is a choice variable. When the equation is unfolded to perform the substitution [`T`/α], we must also make the environment choice for β. The most general choice we can make is to replace β with a fresh variable β′, indicating that we do not know what choice is made, but it is potentially different from any other choice on subsequent recursive calls. Thus, π[`T`/α] unfolds to (β′ = `'y'`) ∨ (¬(β′ = `'n'`) ∧ π[`T`/α]), and the equation becomes:

π = (α = `T`) ∧ ((β = `'y'`) ∨ (¬(β = `'n'`) ∧ ((β′ = `'y'`) ∨ (¬(β′ = `'n'`) ∧ π[`T`/α]))))   (*)

While the equation (*) expresses the condition under which `query_user` returns `true`, the recursive definition means it is not immediately useful. Furthermore, it is easy to see that there is no finite nonrecursive formula that is a solution of the recursive equation (*) because repeated unfolding of π[`T`/α] introduces an infinite sequence of fresh choice variables β′, β″, β‴, .... Hence, it is not always possible to give a finite closed-form formula describing the exact condition under which a program property holds.

On the practical side, real programs have many sources of unknowns; for example, assuming we do not reason about the internal state of the memory management system, every call to `malloc` in a C program appears as a nondeterministic choice returning either `NULL` or newly allocated memory. In practice, the number of choice variables grows rapidly with the size of the program, overwhelming the constraint solver and resulting in poor analysis scalability. Therefore, it is important to avoid tracking choice variables whenever they are unnecessary for proving a property.

Our solution to both the theoretical and the practical problems can be understood only in the larger context of why we want to perform static analysis in the first place. Choice variables allow us to create precise models of how programs interact with their environment, which is good because we never know a priori which parts of the program are important to analyze precisely and so introducing unnecessary imprecision anywhere in the model is potentially disastrous. But the model has more information than needed to answer most individual questions we care about; in fact, we are usually interested in only two kinds of 1-bit decision problems, *may* and *must* queries. If one is interested in proving that a program does not do something "bad" (so-called *safety properties*), then the analysis needs to ask may questions, such as "May this program dereference NULL?" or "May this program raise an exception?". On the other hand, if one is interested in proving that a program eventually does something good (so-called *liveness properties*), then the analysis needs to ask must questions, such as "Must this memory be eventually freed?".

May questions can be formulated as satisfiability queries; if a formula representing the condition under which the bad event happens is satisfiable, then the program is not guaranteed to be error-free. Conversely, must questions are naturally formulated as validity queries: If a formula representing the condition under which something good happens is not valid, then the program may violate the desired property. Hence, to answer may and must questions about programs precisely, we do not necessarily need to solve the exact formula characterizing a property, but only formulas that preserve satisfiability (for may queries) or validity (for must queries).
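Over a finite abstraction both query kinds can be brute-forced; the toy condition below is our own example, not one from the paper:

```c
#include <stdbool.h>

/* A toy "bad event" condition over two Boolean facts about the
 * environment, e.g. "a holds and b does not". */
static bool bad(bool a, bool b) { return a && !b; }

/* May-query: is the condition satisfiable for some environment? */
static bool may_happen(void) {
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            if (bad(a, b)) return true;
    return false;
}

/* Must-query: is the condition valid, i.e. true for every environment? */
static bool must_happen(void) {
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            if (!bad(a, b)) return false;
    return true;
}
```

Here the condition may happen (witness a = true, b = false) but is not guaranteed to happen, which is exactly the satisfiable-but-not-valid case.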

The key idea underlying our technique is that while choice variables add useful precision within the function invocation in which they arise, the aggregate behavior of the function can be precisely summarized in terms of only observable variables for answering may and must queries. Given a finite abstraction of the program, our technique first generates a recursive system of equations, which is precise with respect to the initial abstraction but contains choice variables. We then eliminate choice variables from this recursive system to obtain a pair of equisatisfiable and equivalid systems over only observable variables. After ensuring that satisfiability and validity are preserved under syntactic substitution, we then solve the two recursive systems via standard fixed-point computation. The final result is a *bracketing constraint* ⟨φ_{NC}, φ_{SC}⟩ for each initial equation, corresponding to closed-form strongest necessary and weakest sufficient conditions.

We demonstrate experimentally that the resulting bracketing constraints are small in practice and, most surprisingly, do not grow in the size of the program, allowing our technique to scale to analyzing programs as large as the entire Linux kernel. We also apply this technique for finding null dereference errors in large open source C applications and show that this technique is useful for reducing the number of false positives by an order of magnitude.

As mentioned in Section 1, static analyses operate on a model or abstraction of the program rather than the program itself. In this paper, we consider a family of finite abstractions where each variable has one of the abstract values *C*_{1}, ..., *C*_{k}. These abstract values can be any fixed set of predicates, typestates, dataflow values, or any other chosen finite domain. The following language, while simple, is sufficiently expressive to illustrate the main ideas of our techniques:

Expressions are `true`, `false`, *abstract values* *C*_{i}, variables *x*, function calls, conditional expressions, let bindings, and comparisons between two expressions. Boolean-valued expressions can be composed using the standard Boolean connectives ∧, ∨, and ¬. In this language, we model unknown values by references to unbound variables, which are by convention taken to have a nondeterministic value chosen on function invocation. Thus, any free variables occurring in a function body are choice variables. Observe that this language has an expressive set of predicates used in conditionals, so the condition under which some program property holds may be nontrivial.

To be specific, in the remainder of this paper, we consider the program properties "*May* a given function return constant (i.e., abstract value) *C*_{i}?" and "*Must* a given function return constant *C*_{i}?".

DEFINITION 1 (Constraints).

Symbols *s* in the constraint language are abstract values *C*_{i}, choice variables β whose corresponding abstract values are unknown, and observable variables α representing function inputs provided by callers. Because the values of inputs to each function *f* are represented by variables α, the constraints generated by the analysis are polymorphic, i.e., can be used in any calling context of *f*. Constraints are equalities between symbols (*s*_{1} = *s*_{2}), constraint variables π with a substitution [*C*_{i}/α], or Boolean combinations of constraints. The substitutions [*C*_{i}/α] on constraint variables are used for the substitution of formals by actuals, and recall that the vector of choice variables β named with the variable π is replaced by a vector of fresh choice variables β′ in each unfolding of the equation. More formally, if ⟨β⟩.π = φ, then:

π[*C*_{i}/α] = φ[*C*_{i}/α][β′/β]   (β′ fresh)

This renaming is necessary both to avoid naming collisions and to model that a different environment choice may be made on different recursive invocations. Constraints express the condition under which a function *f* with input α returns a particular abstract value *C*_{i}; we usually index the corresponding constraint variable π_{f, α, C_i} for clarity. So, for example, if there are only two abstract values *C*_{1} and *C*_{2}, the equation

⟨⟩. π_{f, α, C_1} = *true*,   ⟨⟩. π_{f, α, C_2} = *false*

describes the function *f* that always returns *C*_{1}, and

⟨⟩. π_{f, α, C_1} = (α = *C*_{2}),   ⟨⟩. π_{f, α, C_2} = (α = *C*_{1})

describes the function *f* that returns *C*_{1} if its input has abstract value *C*_{2} and vice versa. As a final example, the function whose body is simply an unbound variable `y` (modeling a nondeterministic choice) is described by the equation:

⟨β⟩. π_{f, α, C_1} = (β = *C*_{1}),   ⟨β⟩. π_{f, α, C_2} = (β = *C*_{2})

Note that β is shared by the two constraints; in particular, β in any solution must be either *C*_{1} or *C*_{2}, capturing that a function call returns only one value.

Our goal is to generate constraints characterizing the condition under which a given function returns an abstract value *C*_{i}. Figure 1 presents most of the constraint inference rules for the language given above; the remaining rules are omitted for lack of space but are all straightforward analogs of the rules shown. In these inference rules, an environment *A* maps each program variable to the symbol representing its value.

We briefly explain a subset of the rules in more detail. Rule 3 states that two expressions *e*_{1} and *e*_{2} are equal whenever both have the same abstract value. Rule 8 says that if, under environment *A*, the abstract value of variable *x* is represented by α, then *x* has abstract value *C*_{i} only if α = *C*_{i}.

EXAMPLE 1. *Suppose we analyze the following function:*

`f(x) = if (x = C1) then C1 else (if (y = C1) then C1 else f(C1))`

*where* `y` *models an environment choice and the only abstract values are C*_{1} *and C*_{2}. *Then*

⟨β⟩. π_{f, α, C_1} = (α = *C*_{1}) ∨ (¬(α = *C*_{1}) ∧ ((β = *C*_{1}) ∨ (¬(β = *C*_{1}) ∧ π_{f, α, C_1}[*C*_{1}/α])))

*is the equation computed by the inference rules. Note that the substitution* [*C*_{1}/α] *in the formula expresses that the argument of the recursive call to* **f** *is C*_{1}.

We briefly sketch the semantics of constraints. Constraints are interpreted over the standard four-point lattice with least element ⊥, greatest element ⊤, and incomparable elements *true* and *false*, where ∧ is meet, ∨ is join, ¬⊥ = ⊥, ¬⊤ = ⊤, ¬*true* = *false*, and ¬*false* = *true*. Given an assignment σ for the choice variables β, the meaning of a system of equations *E* is a standard limit of a series of approximations σ(*E*^{0}), σ(*E*^{1}), ... generated by repeatedly unfolding *E*. We are interested in both the least fixed point (where the first approximation of all variables is ⊥) and greatest fixed point (where the first approximation is ⊤) semantics. The value ⊥ in the least fixed point semantics (resp. ⊤ in the greatest fixed point) represents nontermination of the analyzed program.

**2.1. Reduction to Boolean constraints**

Our main technical result is a sound and complete method for answering satisfiability (may) and validity (must) queries for the constraints of Definition 1. As outlined in Section 1, the algorithm has four major steps:

- Eliminate choice variables by extracting strongest necessary and weakest sufficient conditions
- Rewrite the equations to preserve satisfiability/validity under substitution
- Eliminate recursion by a fixed point computation
- Finally, apply a decision procedure to the closed-form equations

Because our abstraction is finite, constraints from Definition 1 can be encoded using Boolean logic, and thus our target decision procedure for the last step is Boolean SAT. We must at some point translate the constraints from Figure 1 into equivalent Boolean constraints; we perform this translation first, before performing any of the steps above.

For every variable γ_{i} (γ ∈ {α, β}) in the constraint language, we introduce Boolean variables γ_{i1}, ..., γ_{in} such that γ_{ij} is *true* if and only if γ_{i} = *C*_{j}. We map the equation variables π to Boolean variables of the same name. A variable π represents the condition under which *f* returns *C*_{i}, hence we refer to π's as *return variables.* We also translate each *s*_{1} = *s*_{2} occurring in the constraints as follows: *C*_{i} = *C*_{j} becomes *true* if *i* = *j* and *false* otherwise, and γ_{i} = *C*_{j} (or *C*_{j} = γ_{i}) becomes the Boolean variable γ_{ij}.

Note that subexpressions of the form α_{i} = β_{j} never appear in the constraints generated by the system of Figure 1. We replace every substitution [*C*_{j}/α_{i}] by the Boolean substitution [*true*/α_{ij}] and [*false*/α_{ik}] for *j* ≠ *k*.

EXAMPLE 2. *The first row of Example 1 results in the following Boolean constraints (here Boolean variable* β_{1} *represents the equation* β = *C*_{1} *and* β_{2} *represents* β = *C*_{2}):

π_{1} = α_{1} ∨ (¬α_{1} ∧ (β_{1} ∨ (¬β_{1} ∧ π_{1}[*true*/α_{1}, *false*/α_{2}])))

In the general case, the constraints from Figure 1 result in a recursive system of Boolean constraints of the following form:

SYSTEM OF EQUATIONS 1.

where 1 ≤ *i* ≤ *n*, *b*_{i} ∈ {*true*, *false*}, and the φ_{i}'s are quantifier-free formulas over the variables α, β, and π.

Observe that any solution to the constraints generated according to the rules from Figure 1 must assign exactly one abstract value to each variable. More specifically, in the original semantics, α = *C*_{i} ∧ α = *C*_{j} is unsatisfiable for any *i*, *j* such that *i* ≠ *j*, and α = *C*_{1} ∨ ... ∨ α = *C*_{n} is valid; however, in the Boolean encoding, α_{i} ∧ α_{j} and ¬α_{1} ∧ ... ∧ ¬α_{n} are both still satisfiable. Hence, to encode these implicit uniqueness and existence axioms of the original constraints, we define satisfiability and validity in the following modified way:

SAT*(φ) ⇔ SAT(φ ∧ E_{exist} ∧ E_{unique})   VALID*(φ) ⇔ VALID((E_{exist} ∧ E_{unique}) → φ)

where E_{exist} and E_{unique} are defined as:

E_{exist} = ⋀_{γ} (γ_{1} ∨ ... ∨ γ_{n})   E_{unique} = ⋀_{γ} ⋀_{j ≠ k} ¬(γ_{j} ∧ γ_{k})
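For a two-value domain the existence and uniqueness axioms can be checked by brute force; the encoding below is a sketch of this idea with our own example formulas:

```c
#include <stdbool.h>

/* Boolean encoding of one variable over two abstract values C1, C2:
 * a1 stands for "alpha = C1", a2 for "alpha = C2". */
static bool exist_ax (bool a1, bool a2) { return a1 || a2; }
static bool unique_ax(bool a1, bool a2) { return !(a1 && a2); }

/* SAT*(phi) = SAT(phi && E_exist && E_unique), by brute force. */
static bool sat_star(bool (*phi)(bool, bool)) {
    for (int a1 = 0; a1 < 2; a1++)
        for (int a2 = 0; a2 < 2; a2++)
            if (phi(a1, a2) && exist_ax(a1, a2) && unique_ax(a1, a2))
                return true;
    return false;
}

/* "alpha = C1 and alpha = C2": satisfiable as raw Booleans, but
 * ruled out once the axioms are conjoined. */
static bool both_values(bool a1, bool a2) { return a1 && a2; }
static bool is_c1      (bool a1, bool a2) { (void)a2; return a1; }
```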

As discussed in previous sections, a key step in our algorithm is extracting necessary/sufficient conditions from a system of constraints *E*. The necessary (resp. sufficient) conditions should be satisfiable (resp. valid) if and only if *E* is satisfiable (resp. valid). This section makes precise exactly what necessary/sufficient conditions we need; in particular, there are two technical requirements:

- The necessary (resp. sufficient) conditions should be as *strong* (resp. *weak*) as possible.
- The necessary/sufficient conditions should be only over observable variables.

In the following, we use V^{+}(φ) to denote the set of observable variables in φ, and V^{-}(φ) to denote the set of choice variables in φ.

DEFINITION 2. *Let* φ *be a quantifier-free formula. We say* φ_{NC} *is the* strongest observable necessary condition *for* φ *if:*

1. φ ⇒ φ_{NC} and V^{-}(φ_{NC}) = ∅
2. *for any formula* φ′ *such that* V^{-}(φ′) = ∅ *and* φ ⇒ φ′, *we have* φ_{NC} ⇒ φ′

The first condition says φ_{NC} is necessary for φ, and the second condition ensures φ_{NC} is stronger than any other necessary condition with respect to φ's observable variables V^{+}(φ). The additional restriction V^{-}(φ_{NC}) = ∅ enforces that the strongest necessary condition for a formula has no choice variables.

DEFINITION 3. *Let* φ *be a quantifier-free formula. We say* φ_{SC} *is the* weakest observable sufficient condition *for* φ *if:*

1. φ_{SC} ⇒ φ and V^{-}(φ_{SC}) = ∅
2. *for any formula* φ′ *such that* V^{-}(φ′) = ∅ *and* φ′ ⇒ φ, *we have* φ′ ⇒ φ_{SC}

Let φ be the condition under which some program property *P* holds. Then, by virtue of being a strongest necessary condition, querying the satisfiability of φ_{NC} is equivalent to querying the satisfiability of the original constraint φ for deciding if property *P may* hold. Since φ_{NC} is a necessary condition for φ, the satisfiability of φ implies the satisfiability of φ_{NC}. More interestingly, because φ_{NC} is the strongest such necessary condition, the satisfiability of φ_{NC} also implies the satisfiability of φ; otherwise, *false* would be a necessary condition strictly stronger than φ_{NC}. Analogously, querying the validity of φ_{SC} is equivalent to querying the validity of the original constraint for deciding if property *P must* hold.

One can think of strongest necessary and weakest sufficient conditions of φ as defining a tight observable bound on φ. If φ has only observable variables, then the strongest necessary and weakest sufficient conditions of φ are equivalent to φ. If φ has only choice variables and is not equivalent to *true* or *false*, then the best possible bounds are φ_{NC} = *true* and φ_{SC} = *false*. Intuitively, the "difference" between strongest necessary and weakest sufficient conditions defines the amount of unknown information present in the original formula.

We now continue with an informal example illustrating the usefulness of strongest observable necessary and weakest sufficient conditions for statically analyzing programs.

EXAMPLE 3. *Consider the implementation of* `f` *given in Figure 2, and suppose we want to determine the condition under which pointer* `p` *is dereferenced in* `f`. *Since the return value of* `malloc` (*i.e.*, `buf`) *and the user input* (*i.e.*, `*buf`) *are statically unknown, the exact condition for* `p`*'s dereference refers to a choice variable for each of them. The strongest observable necessary condition for* `f` *to dereference* `p` *is given by the much simpler condition* `p != NULL && flag != 0`. *On the other hand, the weakest observable sufficient condition for the dereference is* `false`, *which makes sense because no restriction on the arguments to* `f` *can guarantee that* `p` *is dereferenced. Observe that these strongest necessary and weakest sufficient conditions are as precise as the original formula for deciding whether* `p` *is dereferenced by* `f` *at any call site of* `f`, *and furthermore, these formulas are much more concise than the original formula.*
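Figure 2 is not reproduced in this version; the sketch below is a hypothetical implementation of `f` consistent with the conditions discussed (the buffer size, the `stub_input` stand-in for user input, and the return-value instrumentation are ours):

```c
#include <stdbool.h>
#include <stdlib.h>

static char stub_input = 'n';  /* hypothetical stand-in for user input */
static int  cell;              /* a target for p to point at in tests  */

/* A sketch of Figure 2's f: p is dereferenced only on the path where
 * malloc succeeds, the input is not 'y', p is non-NULL, and flag is
 * set; the function reports whether the dereference happened. */
static bool f(int *p, int flag) {
    bool dereferenced = false;
    char *buf = malloc(64);
    if (buf == NULL) return false;     /* choice: allocation failure  */
    buf[0] = stub_input;               /* choice: the user's input    */
    if (buf[0] != 'y' && p != NULL && flag != 0) {
        *p = 1;                        /* the dereference of p        */
        dereferenced = true;
    }
    free(buf);
    return dereferenced;
}
```

The observable necessary condition `p != NULL && flag != 0` and the sufficient condition `false` from the example can be read off directly: the two unknowns (`buf` and `buf[0]`) can always prevent the dereference but never force it.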

In this section, we return to the problem of computing strongest necessary and weakest sufficient conditions containing only observable variables for each π_{i} from System of Equations 1. Our algorithm first eliminates the choice variables from every formula (Section 4.1). We then manipulate the system to preserve strongest necessary (weakest sufficient) conditions under substitution (Section 4.2). Finally, we solve the equations to eliminate recursive constraints (Section 4.3), yielding a system of (nonrecursive) formulas over observable variables. Each step preserves the satisfiability/validity of the original equations, and thus the original may/must query can be decided using a standard SAT solver on the final formulas.

**4.1. Eliminating choice variables**

To eliminate the choice variables from the formulas in Figure 1, we use the following well-known result for computing strongest necessary and weakest sufficient conditions for Boolean formulas^{4}:

LEMMA 1. *The strongest necessary and weakest sufficient conditions of Boolean formula* φ *not containing variable* β *are given by:*

φ_{NC} = φ[*true*/β] ∨ φ[*false*/β]   φ_{SC} = φ[*true*/β] ∧ φ[*false*/β]

Since our definition of satisfiability and validity must also take into account the implicit existence and uniqueness conditions, this standard way of computing strongest necessary and weakest sufficient conditions of Boolean formulas must be slightly modified. In particular, let β be a choice variable to be eliminated, and let φ_{exist} and φ_{unique} represent the existence and uniqueness conditions involving β. Then, we compute strongest necessary and weakest sufficient conditions as follows:

φ_{NC} = ∃β. (φ ∧ φ_{exist} ∧ φ_{unique})   φ_{SC} = ∀β. ((φ_{exist} ∧ φ_{unique}) → φ)

where the quantifiers are expanded by substitution as in Lemma 1.
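The substitution rule of Lemma 1 can be sketched directly; the formula below is our own example, with β playing the role of the choice variable being eliminated:

```c
#include <stdbool.h>

/* Example formula: phi(alpha, beta) = (alpha && beta) || !beta. */
static bool phi(bool alpha, bool beta) { return (alpha && beta) || !beta; }

/* Lemma 1, eliminating beta:
 *   NC(phi) = phi[true/beta] || phi[false/beta]
 *   SC(phi) = phi[true/beta] && phi[false/beta] */
static bool nc(bool alpha) { return phi(alpha, true) || phi(alpha, false); }
static bool sc(bool alpha) { return phi(alpha, true) && phi(alpha, false); }
```

Here NC collapses to the constant *true* (some choice of β satisfies φ whatever α is) and SC collapses to α (α = *true* guarantees φ for every β), illustrating the tight observable bound SC ⇒ φ ⇒ NC.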

After applying these elimination procedures to the constraint system from Figure 1, we obtain two distinct sets of equations of the form:

SYSTEM OF EQUATIONS 2.

*E*_{SC} is analogous to *E*_{NC}.

EXAMPLE 4. *Consider the function given in Example 1, for which Boolean constraints are given in Example 2. We compute the weakest sufficient condition for* π_{1}:

WSC(π_{1}) = α_{1} ∨ (¬α_{1} ∧ π_{1}[*true*/α_{1}, *false*/α_{2}])

*The reader can verify that the strongest necessary condition for* π_{1} *is* true. *The existence and uniqueness constraints are omitted since they are redundant.*

**4.2. Preservation under substitution**

Our goal is to solve the recursive system given in System of Equations 2 by an iterative, fixed point computation. However, there is a problem: as it stands, System of Equations 2 may not preserve strongest necessary and weakest sufficient conditions under substitution for two reasons:

- Strongest necessary and weakest sufficient conditions are not preserved under negation (i.e., in general NC(¬φ) ≢ ¬NC(φ) and SC(¬φ) ≢ ¬SC(φ)), and the formulas from System of Equations 2 contain negated return (π) variables. Therefore, substituting ¬π by ¬NC(π) and ¬SC(π) would yield incorrect necessary and sufficient conditions, respectively.
- The formulas from System of Equations 2 may contain contradictions and tautologies involving return variables, causing the formula to be weakened (for necessary conditions) and strengthened (for sufficient conditions) as a result of substituting the return variables with their respective necessary and sufficient conditions. As a result, the obtained necessary (resp. sufficient) conditions may not be as strong (resp. as weak) as possible.

Fortunately, both of these problems can be remedied. For the first problem, observe that while NC(¬φ) ≢ ¬NC(φ) and SC(¬φ) ≢ ¬SC(φ), the following equivalences do hold:

NC(¬φ) ≡ ¬SC(φ)   SC(¬φ) ≡ ¬NC(φ)

In other words, the strongest necessary condition of ¬φ is the negation of the weakest sufficient condition of φ, and similarly, the weakest sufficient condition of ¬φ is the negation of the strongest necessary condition of φ. Hence, by simultaneously computing strongest necessary and weakest sufficient conditions, one can solve the first problem using the above equivalences.
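The duality can be checked mechanically on any concrete formula; the one below is ours:

```c
#include <stdbool.h>

/* Example formula: phi(alpha, beta) = (alpha == beta). */
static bool phi(bool alpha, bool beta) { return alpha == beta; }

/* NC of the negation, computed by the Lemma 1 rule... */
static bool nc_of_not_phi(bool a) { return !phi(a, true) || !phi(a, false); }

/* ...equals the negation of the SC of phi itself. */
static bool sc_of_phi(bool a) { return phi(a, true) && phi(a, false); }
```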

To overcome the second problem, an obvious solution is to convert the formula to disjunctive normal form and drop contradictions before applying a substitution in the case of strongest necessary conditions. Similarly, for weakest sufficient conditions, the formula may be converted to conjunctive normal form and tautologies can be removed. This rewrite explicitly enforces any contradictions and tautologies present in the original formula such that substituting the variables with their necessary (resp. sufficient) conditions cannot weaken (resp. strengthen) the solution.

**4.3. Eliminating recursion**

Since we now have a way of preserving strongest necessary and weakest sufficient conditions under substitution, it is possible to obtain a closed-form solution containing only observable variables to System of Equations 2 using a standard fixed point computation. To compute a least fixed point, we use the lattice *L* of Boolean formulas over the observable variables, ordered by implication, with ⊥ = *false* and ⊤ = *true*. The lattice *L* is finite (up to logical equivalence) since there are only a finite number of variables α_{ij} and hence only a finite number of logically distinct formulas. This results in a system of bracketing constraints of the form:

SYSTEM OF EQUATIONS 3.

Recall from Section 2 that the original constraints have four possible meanings, namely ⊥, *true*, *false*, and ⊤, while the resulting closed-form strongest necessary and weakest sufficient conditions evaluate to either *true* or *false*. This means that in some cases involving nonterminating program paths, the original system of equations may have meaning ⊥ in least fixed-point semantics (or ⊤ in greatest fixed-point semantics), but the algorithm presented in this paper may return either *true* or *false*, depending on whether a greatest or least fixed point is computed. Hence, our results are qualified by the assumption that the program terminates.
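The fixed point computation itself is a plain iterate-to-stability loop over the finite lattice; the recursive equation below is a hypothetical example of ours, with the condition represented by its truth table over one observable variable:

```c
#include <stdbool.h>

/* One unfolding of a hypothetical recursive necessary condition
 *   pi(alpha) = alpha || (!alpha && pi[true/alpha])
 * where pi is represented by its truth table pi[alpha]. */
static void step(const bool pi[2], bool out[2]) {
    for (int a = 0; a < 2; a++)
        out[a] = a || (!a && pi[1]);   /* pi[1] is pi with alpha = true */
}

/* Least fixed point: start from false everywhere and iterate until
 * stable; termination is guaranteed because the lattice is finite. */
static bool lfp_at(bool alpha) {
    bool pi[2] = {false, false};
    for (;;) {
        bool out[2];
        step(pi, out);
        if (out[0] == pi[0] && out[1] == pi[1]) break;
        pi[0] = out[0];
        pi[1] = out[1];
    }
    return pi[alpha ? 1 : 0];
}
```

For this equation the iteration takes two steps ({false, false}, then {false, true}, then {true, true}) before stabilizing at the constant *true*.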

EXAMPLE 5. *Recall that in Example 4 we computed the weakest sufficient condition* WSC(π_{1}) *for the function* `f` *defined in Example 1. To find its fixed point, we first substitute* true *for* π_{1}. *This yields the formula* α_{1} ∨ ¬α_{1}, *a tautology. As a result, our algorithm finds the fixed point solution true for the weakest sufficient condition of* π_{1}. *Since* `f` *is always guaranteed to return C*_{1}, *the weakest sufficient condition computed using our algorithm is the most precise solution possible.*

While the technique proposed in this paper yields the strongest necessary and weakest sufficient conditions for a property *P* with respect to a finite abstraction, it is not precise for separately tracking the conditions for two distinct properties *P*_{1} and *P*_{2} and then combining the individual results. In particular, if φ_{1} and φ_{2} are the strongest necessary conditions for *P*_{1} and *P*_{2} respectively, then φ_{1} ∧ φ_{2} does *not* yield the strongest necessary condition for *P*_{1} and *P*_{2} to hold together because strongest necessary conditions do not distribute over conjunctions, and weakest sufficient conditions do not distribute over disjunctions. Hence, if one is interested in combining reasoning about two distinct properties, it is necessary to compute strongest necessary and weakest sufficient conditions for the combined property.
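A minimal instance of this non-distributivity, using the Lemma 1 elimination rule on our own formulas: for φ₁ = β and φ₂ = ¬β, each necessary condition is *true*, yet the necessary condition of the conjunction is *false*:

```c
#include <stdbool.h>

/* Two properties' conditions over one shared choice variable beta. */
static bool phi1(bool beta) { return beta; }
static bool phi2(bool beta) { return !beta; }
static bool conj(bool beta) { return phi1(beta) && phi2(beta); }

/* Strongest necessary condition over beta, per Lemma 1. */
static bool nc_of(bool (*phi)(bool)) { return phi(true) || phi(false); }
```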

While it is important in our technique that the set of possible values can be exhaustively enumerated (to guarantee the convergence of the fixed point computation and to be able to convert the constraints to Boolean logic), it is not necessary that the set be finite, but only finitary, that is, finite for a given program. Furthermore, while it is clear that the technique can be applied to finite-state properties or enumerated types, it can also be extended to any property where a finite number of equivalence classes can be derived to describe the possible outcomes. However, the proposed technique is not complete for arbitrary nonfinite domains.

We have implemented our method in Saturn, a static analysis framework designed for checking properties of C programs.^{1} As mentioned in Section 1, sources of imprecision in the analysis appear as nondeterministic choices; in Saturn, sources of imprecision include, but are not limited to, reads from unbounded data structures, arithmetic, imprecise function pointer targets, imprecise loop invariants, and in-line assembly; all of these sources of imprecision in the analysis are treated as choice variables.

We conducted two sets of experiments to evaluate our technique on OpenSSH, Samba, and the Linux kernel. In the first set of experiments we compute necessary and sufficient conditions for pointer dereferences. Pointer dereferences are ubiquitous in C programs and computing the necessary and sufficient conditions for each and every syntactic pointer dereference to execute is a good stress test for our approach. As a second experiment, we incorporate our technique into a null dereference analysis and demonstrate that our technique reduces the number of false positives by close to an order of magnitude without resorting to ad-hoc heuristics or compromising soundness.

In our first set of experiments, we measure the size of necessary and sufficient conditions for pointer dereferences both at *sinks*, where pointers are dereferenced, and at *sources*, where pointers are first allocated or read from the heap. In Figure 2, consider the pointer dereference (sink) at line 7. For the sink experiments, we would, for example, compute the necessary and sufficient conditions for `p`'s dereference as `p != NULL ∧ flag != 0` and `false`, respectively. To illustrate the source experiment, consider a call site of function `f` from Figure 2 in which the line marked `/*source*/` is the source of pointer `p`. The necessary condition at `p`'s source for `p` to be ultimately dereferenced is `x > MAX ∨ (x <= MAX ∧ p != NULL ∧ flag != 0)`, and the sufficient condition is `x > MAX`.

The results of the sink experiments for Linux are presented in Figure 3, and the table in Figure 4 summarizes the results of both the source and sink experiments for OpenSSH, Samba, and Linux. The histogram in Figure 3 plots the size of necessary (resp. sufficient) conditions against the number of constraints that have a necessary (resp. sufficient) condition of the given size; red bars indicate necessary conditions, green bars indicate sufficient conditions, and the *y*-axis is drawn on a log-scale. Observe that 95% of all necessary and sufficient conditions have fewer than five subclauses, and 99% have fewer than ten, showing that necessary and sufficient conditions are small in practice. Figure 4 gives average necessary and sufficient condition sizes at sinks (rows 2 and 3) for all three applications we analyzed, confirming that these sizes are consistently small across our benchmarks.

Our second experiment applies these techniques to finding null dereference errors. We chose null dereferences as an application because checking for null dereference errors with sufficient precision often requires tracking complex path conditions. In the results presented in Figure 5, we compare two different setups: In the *interprocedurally path-sensitive* analysis, we use the technique described in the paper, computing strongest necessary conditions for a null pointer to be dereferenced. In the second setup (i.e., the *intraprocedurally path-sensitive* case), for each function, we only compute which pointers may be dereferenced in that function, but we do not track the condition under which pointers are dereferenced across functions. We believe this comparison is useful in quantifying the benefit of the technique proposed in the paper because, without the elimination of choice variables, (i) the interprocedurally path-sensitive analysis may not even terminate, and (ii) the number of choice variables grows linearly in the size of the program, overwhelming the constraint solver. In fact, for this reason, all previous analyses written in Saturn were either interprocedurally path-insensitive or adopted incomplete heuristics to decide which conditions to track across function boundaries.^{1}

The first three columns of Figure 5 give the results of the experiments for the first setup, and the last three columns of the same figure present the results of the second setup. One important caveat is that the numbers reported here exclude error reports arising from array elements and recursive fields of data structures. Saturn does not have a sophisticated shape analysis; hence, the overwhelming majority (>95%) of errors reported for elements of unbounded data structures are false positives. However, shape analysis is an orthogonal problem which we neither address nor evaluate in this work.

A comparison of the results of the intraprocedurally and interprocedurally path-sensitive analyses shows that our technique reduces the number of false positives by close to an order of magnitude without resorting to heuristics or compromising soundness in order to eliminate errors arising from interprocedural dependencies. Note that the existence of false positives does not contradict our previous claim that our technique is complete. First, even for finite domains, our technique can provide only *relative completeness*; false positives can still arise from orthogonal sources of imprecision in the analysis. Second, while our results are complete for finite domains, we cannot guarantee completeness for arbitrary domains.

We have presented a method for systematically reasoning about unknown values in static analysis. We argued that, while representing unknown values by choice variables adds useful precision by correlating multiple uses of the same unknown value, eliminating these choice variables at function boundaries is necessary to avoid both scalability and termination problems. We presented a technique for eliminating choice variables with no loss of information for answering may and must queries about program properties, and we experimentally demonstrated that analyzing unknown values in this way yields substantially better precision and scalability.

1. Aiken, A., Bugrara, S., Dillig, I., Dillig, T., Hackett, B., Hawkins, P. An overview of the SATURN project. In *PASTE* (2007), 43–48.

2. Ball, T., Rajamani, S. Bebop: A symbolic model checker for Boolean programs. In *SPIN* (2000), 113–130.

3. Ball, T., Rajamani, S. Automatically validating temporal safety properties of interfaces. *LNCS 2057* (2001), 103–122.

4. Boole, G. *An Investigation of the Laws of Thought.* Dover Publications, Incorporated, 1858.

5. Cook, B., Gotsman, A., Podelski, A., Rybalchenko, A., Vardi, M. Proving that programs eventually do something good. In *POPL* (2007), 265–276.

6. Das, M., Lerner, S., Seigle, M. ESP: Path-sensitive program verification in polynomial time. In *PLDI* (2002), 57–68.

7. Henglein, F. Type inference and semi-unification. In *Conference on LISP and Functional Programming* (1988), 184–197.

8. Henzinger, T., Jhala, R., Majumdar, R., McMillan, K. Abstractions from proofs. In *POPL* (2004), 232–244.

9. Mycroft, A. Polymorphic type schemes and recursive definitions. In *International Symposium on Programming* (1984), 217–228.

10. Reps, T., Horwitz, S., Sagiv, M. Precise interprocedural dataflow analysis via graph reachability. In *POPL* (1995), 49–61.

11. Schmidt, D. A calculus of logical relations for over- and underapproximating static analyses. *Science of Computer Programming 64*, 1 (2007), 29–53.

a. While this function would typically be written using a loop, the same problem arises both for loops and recursive functions, and we use a recursive function because it is easier to explain.

b. Note that rules 3, 10, 11, and 12 implicitly quantify over multiple hypotheses; we have omitted explicit quantifiers to avoid cluttering the rules.

The original version of this paper is entitled "Sound, Complete, and Scalable Path-Sensitive Analysis" and was published in the *Proceedings of Programming Language Design and Implementation (PLDI)* 2008, ACM.

DOI: http://doi.acm.org/10.1145/1787234.1787259

Figure 3. Frequency of necessary and sufficient condition sizes (in terms of the number of Boolean connectives) at sinks for Linux.

Figure 4. Necessary and sufficient condition sizes (in terms of number of Boolean connectives in the formula) for pointer dereferences.

**©2010 ACM 0001-0782/10/0800 $10.00**
