Sign In

Communications of the ACM

Research highlights

Reasoning About the Unknown in Static Analysis


View as: Print Mobile App ACM Digital Library Full Text (PDF) In the Digital Edition Share: Send by email Share on reddit Share on StumbleUpon Share on Hacker News Share on Tweeter Share on Facebook
'Unknown' by Andy Cunningham

"Unknown" by Andy Cunningham

Credit: Andy Cunningham

Static program analysis techniques cannot know certain values, such as the value of user input or network state, at analysis time. While such unknown values need to be treated as nondeterministic choices made by the program's execution environment, it is still possible to glean very useful information about how such statically unknown values may or must influence computation. We give a method for integrating such nondeterministic choices with an expressive static analysis. Interestingly, we cannot solve the resulting recursive constraints directly, but we give an exact method for answering all may and must queries. We show experimentally that the resulting solved forms are concise in practice, enabling us to apply the technique to very large programs, including an entire operating system.

Back to Top

1. Introduction

Preventing software errors is a central challenge in software engineering. The many tool-based approaches to the problem can be grouped roughly into two categories. Dynamic analysis techniques discover properties by monitoring program executions for particular inputs; standard testing is the most commonly used form of dynamic analysis. In contrast, a static analysis discovers properties that hold for all possible inputs; a sound static analysis concludes a program is error-free only if the program indeed has no errors.

Unlike dynamic analyses, sound static analyses have the advantage of never missing any potential errors, but, unfortunately, there is no free lunch: Soundness usually comes at the cost of reporting false positives (i.e., spurious warnings about error-free code) because static analyses must approximate some aspects of program behavior. This approximation is inevitable as analyzing even very simple properties of programs' behavior is undecidable. Hence, a key challenge for static analysis techniques is achieving a satisfactory combination of precision, soundness, and scalability by reporting as few false positives as possible while still being sound and scaling to real systems.

This goal of obtaining satisfactory precision is further complicated by the fact that certain values are simply unknown statically: For example, if a program queries the user for an input, this input appears as a nondeterministic environment choice to the static analysis. Similarly, the result of receiving arbitrary data from the network or the result of reading operating system state are all unknowns that need to be treated as nondeterministic environment choices by the analysis.

Even in the special case where all program inputs are known, static analyses still need to deal with unknowns that arise from approximating program behavior. A static analysis cannot simply carry out an exact program simulation; if nothing else, we usually want to guarantee the analysis terminates even if the program does not. Thus, static analysis always has some imprecision built in. For example, since lists, sets, and trees may have an unbounded number of elements, many static techniques do not precisely model the data structure's contents. Reading an element from a data structure is modeled as a nondeterministic choice that returns any element of the data structure. Similarly, if the chosen program abstraction cannot express nonlinear arithmetic, the value of a "complicated" expression, such as coef*a*b+size, may also need to treated as an unknown by the static analysis.

The question of what, if any, useful information can be garnered from such unknown values is not much discussed in the literature. It is our impression that if the question is considered at all, it is left as an engineering detail in the implementation; at least, this is the approach we have taken ourselves in the past. But two observations have changed our minds: First, unknown values are astonishingly pervasive when statically analyzing programs; there are always calls to external functions not modeled by the analysis as well as approximations that lose information. Second, in our experience, analyses that do a poor job handling unknown values either end up being unscalable or too imprecise. For these reasons, we now believe a systematic approach for dealing with unknown values is a problem of the first order in the design of an expressive static analysis.

We begin by informally sketching a very simple, but imprecise, approach to dealing with unknown values in static analysis. Consider the following code snippet:

ins01.gif

Suppose we want to prove that for every call to fopen, there is exactly one matching call to fclose. For the matching property to be violated, it must be the case that the value of input is 'y' on line 2, but the value of input is not 'y' on line 4. Since the value of the input is unknown, one simple approach is to represent the unknown value using a special abstract constant *. Now, programs may have multiple sources of unknown values, all of which are represented by *. Thus, * is not a particular unknown but the set of all unknowns in the program. Hence, the predicates * = 'y' (which should be read as: 'y' is equal to some element of values represented by *) and * 'y' (which should be read as: 'y' is not equal to some element of values represented by *) are simultaneously satisfiable. As a result, program paths where input is equal to 'y' at line (2), but not equal to 'y' at line (4) (or vice versa) cannot be ruled out, and the analysis would erroneously report an error.

A more precise alternative for reasoning about unknown values is to name them using variables (called choice variables) that stand for a single, but unknown, value. Observe that this strategy of introducing choice variables is a refinement over the previous approach because two distinct environment choices are modeled by two distinct choice variables, and '. Thus, while a choice variable may represent any value, it cannot represent two distinct values at the same time. For instance, if we introduce the choice variable for the unknown value of the result of the call to get_user_input on line 1, the constraint characterizing the failure condition is = y and.gif y, which is unsatisfiable, establishing that the call to fopen is matched by a call to fclose. The insight is that the use of choice variables allows the analysis to identify when two values arise from the same environment choice without imposing any restrictions on their values.

While this latter strategy allows for more precise reasoning, it leads to two difficultiesone theoretical and one

ins02.gif

practicalthat the simpler, but less precise, strategy does not suffer from. Consider the following function:a

Suppose we want to know when query_user returns true. The return value of get_user_input is statically unknown; hence it is identified by a choice variable . The variable feature_enabled, however, is definitely not a nondeterministic choice, as its value is determined by the function's caller. We represent feature_enabled by an observable variable, , provided by callers of this function. The condition, , under which query_user returns true (abbreviated T) in any calling context, is then given by the constraint:

ueq01.gif

This formula is read as follows. The term = T captures that the function returns true only if feature_enabled is true (line A). Furthermore, the user input must either be 'y' (term = 'y' and line C) or it must not be 'n' (term (= 'n') and line D) and the recursive call on line G must return true (term [T/]). Observe that because the function is recursive, so is the formula. In the term [T/], the substitution [T/] models that on the recursive call, the formal parameter is replaced by actual parameter true. Finally, the binding . reminds us that is a choice variable. When the equation is unfolded to perform the substitution [T/] we must also make the environment choice for . The most general choice we can make is to replace with a fresh variable ', indicating that we do not know what choice is made, but it is potentially different from any other choice on subsequent recursive calls. Thus, [T/] unfolds to:

ueq02.gif

While the equation (*) expresses the condition under which query_user returns true, the recursive definition means it is not immediately useful. Furthermore, it is easy to see that there is no finite nonrecursive formula that is a solution of the recursive equation (*) because repeated unfolding of [T/] introduces an infinite sequence of fresh choice variables ', ", '", .... Hence, it is not always possible to give a finite closed-form formula describing the exact condition under which a program property holds.

On the practical side, real programs have many sources of unknowns; for example, assuming we do not reason about the internal state of the memory management system, every call to malloc in a C program appears as a nondeterministic choice returning either NULL or newly allocated memory. In practice, the number of choice variables grows rapidly with the size of the program, overwhelming the constraint solver and resulting in poor analysis scalability. Therefore, it is important to avoid tracking choice variables whenever they are unnecessary for proving a property.

Our solution to both the theoretical and the practical problems can be understood only in the larger context of why we want to perform static analysis in the first place. Choice variables allow us to create precise models of how programs interact with their environment, which is good because we never know a priori which parts of the program are important to analyze precisely and so introducing unnecessary imprecision anywhere in the model is potentially disastrous. But the model has more information than needed to answer most individual questions we care about; in fact, we are usually interested in only two kinds of 1-bit decision problems, may and must queries. If one is interested in proving that a program does not do something "bad" (so-called safety properties), then the analysis needs to ask may questions, such as "May this program dereference NULL?" or "May this program raise an exception?". On the other hand, if one is interested in proving that a program eventually does something good (so called liveness properties), then the analysis needs to ask must questions, such as "Must this memory be eventually freed?".

May questions can be formulated as satisfiability queries; if a formula representing the condition under which the bad event happens is satisfiable, then the program is not guaranteed to be error-free. Conversely, must questions are naturally formulated as validity queries: If a formula representing the condition under which something good happens is not valid, then the program may violate the desired property. Hence, to answer may and must questions about programs precisely, we do not necessarily need to solve the exact formula characterizing a property, but only formulas that preserve satisfiability (for may queries) or validity (for must queries).

The key idea underlying our technique is that while choice variables add useful precision within the function invocation in which they arise, the aggregate behavior of the function can be precisely summarized in terms of only observable variables for answering may and must queries. Given a finite abstraction of the program, our technique first generates a recursive system of equations, which is precise with respect to the initial abstraction but contains choice variables. We then eliminate choice variables from this recursive system to obtain a pair of equisatisfiable and equivalid systems over only observable variables. After ensuring that satisfiability and validity are preserved under syntactic substitution, we then solve the two recursive systems via standard fixed-point computation. The final result is a bracketing constraint NC, SC for each initial equation, corresponding to closed-form strongest necessary and weakest sufficient conditions.

We demonstrate experimentally that the resulting bracketing constraints are small in practice and, most surprisingly, do not grow in the size of the program, allowing our technique to scale to analyzing programs as large as the entire Linux kernel. We also apply this technique for finding null dereference errors in large open source C applications and show that this technique is useful for reducing the number of false positives by an order of magnitude.

Back to Top

2. From Programs to Constraints

As mentioned in Section 1, static analyses operate on a model or abstraction of the program rather than the program itself. In this paper, we consider a family of finite abstractions where each variable has one of abstract values C1, ..., Ck. These abstract values can be any fixed set of predicates, typestates, dataflow values, or any chosen finite domain. We consider a language with abstract values C1, ..., Ck; while simple, this language is sufficiently expressive to illustrate the main ideas of our techniques:

ueq03.gif

Expressions are true, false, abstract values Ci, variables x, function calls, conditional expressions, let bindings and comparisons between two expressions. Boolean-valued expressions can be composed using the standard Boolean connectives, and.gif, or.gif, and ¬. In this language, we model unknown values by references to unbound variables, which are by convention taken to have a nondeterministic value chosen on function invocation. Thus, any free variables occurring in a function body are choice variables. Observe that this language has an expressive set of predicates used in conditionals, so the condition under which some program property holds may be nontrivial.

To be specific, in the remainder of this paper, we consider the program properties "May a given function return constant (i.e., abstract value) Ci?" and "Must a given function return constant Ci?". Hence, our goal is to compute the constraint under which each function returns constant Ci. These constraints are of the following form:

DEFINITION 1 (Constraints).

ueq04.gif

Symbols s in the constraint language are abstract values Ci, choice variables whose corresponding abstract values are unknown, and observable variables representing function inputs provided by callers. Because the values of inputs to each function f are represented by variables , the constraints generated by the analysis are polymorphic, i.e., can be used in any calling context of f. Constraints cacm5308_c.gif are equalities between symbols (s1 = s2), constraint variables with a substitution [Ci /], or Boolean combinations of constraints. The substitutions [Ci /] on constraint variables are used for the substitution of formals by actuals, and recall that the vector of choice variables cacm5308_b.gif named with the variable is replaced by a vector of fresh choice variables cacm5308_b.gif' in each unfolding of the equation. More formally, if .cacm5308_b.gif = cacm5308_c.gif, then:

ueq05.gif

This renaming is necessary both to avoid naming collisions and to model that a different environment choice may be made on different recursive invocations. Constraints express the condition under which a function f with input returns a particular abstract value Ci; we usually index the corresponding constraint variable f, , c for clarity. So, for example, if there are only two abstract values C1 and C2, the equation

ueq06.gif

describes the function f that always returns C1, and

ueq07.gif

describes the function f that returns C1 if its input has abstract value C2 and vice versa. As a final example, the function

ueq08.gif

where the unbound variable y models a nondeterministic choice is described by the equation:

ueq09.gif

Note that is shared by the two constraints; in particular, in any solution must be either C1 or C2, capturing that a function call returns only one value.

Our goal is to generate constraints characterizing the condition under which a given function returns an abstract value Ci. Figure 1 presents most of the constraint inference rules for the language given above; the remaining rules are omitted for lack of space but are all straightforward analogs of the rules shown. In these inference rules, an environment A maps program variables to variables , in the constraint language. Rules 15 prove judgments cacm5308_d.gif where b isin.gif {true, false}, describing the constraints cacm5308_c.gif under which an expression e evaluates to true or false in environment A. Rules 6ndash;11 prove judgments cacm5308_f.gif that give the constraint under which expression e evaluates to Ci. Finally, rule 12 constructs systems of equations, giving the (possibly) mutually recursive conditions under which a function returns each abstract value.b

We briefly explain a subset of the rules in more detail. In Rule 3, two expressions e1 and e2 are equal whenever both have the same abstract value. Rule 8 says that if under environment A, the abstract value of variable x is represented by constraint variable , then x has abstract value Ci only if = Ci. Rule 11 presents the rule for function calls: If the input to function f has the abstract value Ck under constraint cacm5308_c.gifk, and the constraint under which f returns Ci is cacm5308_e.gif, then f (e) evaluates to Ci under the constraint cacm5308_g.gif.

EXAMPLE 1. Suppose we analyze the following function:

ueq10.gif

where y models an environment choice and the only abstract values are C1 and C2. Then

ueq11.gif

is the equation computed by the inference rules. Note that the substitution [C1/] in the formula expresses that the argument of the recursive call to f is C1.

We briefly sketch the semantics of constraints. Constraints are interpreted over the standard four-point lattice with cacm5308_i.gif true, false, cacm5308_h.gif and cacm5308_i.gif true, false cacm5308_h.gif, where and.gif is meet, or.gif is join, and cacm5308_j.gif, ¬true = false, and ¬ false = true. Given an assignment for the choice variables , the meaning of a system of equations E is a standard limit of a series of approximations (E0), (E1), ... generated by repeatedly unfolding E. We are interested in both the least fixed point (where the first approximation of all variables is cacm5308_i.gif) and greatest fixed point (where the first approximation is cacm5308_h.gif) semantics. The value cacm5308_i.gif in the least fixed point semantics (resp. cacm5308_h.gif in the greatest fixed point) represents nontermination of the analyzed program.

* 2.1. Reduction to Boolean constraints

Our main technical result is a sound and complete method for answering satisfiability (may) and validity (must) queries for the constraints of Definition 1. As outlined in Section 1, the algorithm has four major steps:

  • Eliminate choice variables by extracting strongest necessary and weakest sufficient conditions
  • Rewrite the equations to preserve satisfiability/validity under substitution
  • Eliminate recursion by a fixed point computation
  • Finally, apply a decision procedure to the closed-form equations

Because our abstraction is finite, constraints from Definition 1 can be encoded using Boolean logic, and thus our target decision procedure for the last step is Boolean SAT. We must at some point translate the constraints from Figure 1 into equivalent Boolean constraints; we perform this translation first, before performing any of the steps above.

For every variable ( isin.gif{, }) in the constraint language, we introduce Boolean variables i1, ..., in such that ij is true if and only if i = Cj. We map the equation variables cacm5308_c.gif to Boolean variables of the same name. A variable cacm5308_c.gif represents the condition under which f returns Ci, hence we refer to cacm5308_c.gif's as return variables. We also translate each s1 = s2 occurring in the constraints as:

ueq12.gif

Note that subexpressions of the form i = j never appear in the constraints generated by the system of Figure 1. We replace every substitution [Cj/i] by the Boolean substitution [true/ij] and [false/ik] for j k.

EXAMPLE 2. The first row of Example 1 results in the following Boolean constraints (here Boolean variable 1 represents the equation = C1 and 2 represents = C2):

ueq13.gif

In the general case, the constraints from Figure 1 result in a recursive system of Boolean constraints of the following form:

SYSTEM OF EQUATIONS 1.

ueq14.gif

where cacm5308_l.gif and bi isin.gif {true, false} and the 's are quantifier-free formulas over cacm5308_b.gif, cacm5308_m.gif, and cacm5308_n.gif.

Observe that any solution to the constraints generated according to the rules from Figure 1 must assign exactly one abstract value to each variable. More specifically, in the original semantics, = Ci and.gif = Cj is unsatisfiable for any i, j such that i j, and cacm5308_o.gif is valid; however, in the Boolean encoding i and.gif j and cacm5308_p.gif are both still satisfiable. Hence, to encode these implicit uniqueness and existence axioms of the original constraints, we define satisfiability and validity in the following modified way:

ueq15.gif

where exist and unique are defined as:

ueq16.gif

Back to Top

3. Strongest Necessary and Weakest Sufficient Conditions

As discussed in previous sections, a key step in our algorithm is extracting necessary/sufficient conditions from a system of constraints E. The necessary (resp. sufficient) conditions should be satisfiable (resp. valid) if and only if E is satisfiable (resp. valid). This section makes precise exactly what necessary/sufficient conditions we need; in particular, there are two technical requirements:

  • The necessary (resp. sufficient) conditions should be as strong (resp. weak) as possible.
  • The necessary/sufficient conditions should be only over observable variables.

In the following, we use + () to denote the set of observable variables in , and - () to denote the set of choice variables in .

DEFINITION 2. Let be a quantifier-free formula. We say lceil.gif is the strongest observable necessary condition for if:

ueq17.gif

The first condition says lceil.gif is necessary for , and the second condition ensures lceil.gif is stronger than any other necessary condition with respect to 's observable variables + (). The additional restriction - (lceil.gif) = enforces that the strongest necessary condition for a formula has no choice variables.

DEFINITION 3. Let be a quantifier-free formula. We say lfloor.gifrfloor.gif is the weakest observable sufficient condition for if:

ueq18.gif

Let be the condition under which some program property P holds. Then, by virtue of lceil.gif being a strongest necessary condition, querying the satisfiability of lceil.gif is equivalent to querying the satisfiability of the original constraint for deciding if property P may hold. Since lceil.gif is a necessary condition for , the satisfiability of implies the satisfiability of lceil.gif. More interestingly, because lceil.gif is the strongest such necessary condition, the satisfiability of lceil.gif also implies the satisfiability of ; otherwise, a stronger necessary condition would be false. Analogously, querying the validity of lfloor.gifrfloor.gif is equivalent to querying the validity of the original constraint for deciding if property P must hold.

One can think of strongest necessary and weakest sufficient conditions of as defining a tight observable bound on . If has only observable variables, then the strongest necessary and weakest sufficient conditions of are equivalent to . If has only choice variables and is not equivalent to true or false, then the best possible bounds are lceil.gif = true and lfloor.gifrfloor.gif = false. Intuitively, the "difference" between strongest necessary and weakest sufficient conditions defines the amount of unknown information present in the original formula.

We now continue with an informal example illustrating the usefulness of strongest observable necessary and weakest sufficient conditions for statically analyzing programs.

EXAMPLE 3. Consider the implementation of f given in Figure 2, and suppose we want to determine the condition under which pointer p is dereferenced in f. It is easy to see that the exact condition for p's dereference is given by the constraint:

ueq19.gif

Since the return value of malloc (i.e., buf) and the user input (i.e., *buf) are statically unknown, the strongest observable necessary condition for f to dereference p is given by the simpler condition:

ueq20.gif

On the other hand, the weakest observable sufficient condition for the dereference is false, which makes sense because no restriction on the arguments to f can guarantee that p is dereferenced. Observe that these strongest necessary and weakest sufficient conditions are as precise as the original formula for deciding whether p is dereferenced by f at any call site of f, and furthermore, these formulas are much more concise than the original formula.

Back to Top

4. Solving the Constraints

In this section, we now return to the problem of computing strongest necessary and weakest sufficient conditions containing only observable variables for each cacm5308_q.gif from System of Equations 1. Our algorithm first eliminates the choice variables from every formula. We then manipulate the system to preserve strongest necessary (weakest sufficient) conditions under substitution (Section 4.2). Finally, we solve the equations to eliminate recursive constraints (Section 4.3), yielding a system of (nonrecursive) formulas over observable variables. Each step preserves the satisfiability/validity of the original equations, and thus the original may/must query can be decided using a standard SAT solver on the final formulas.

* 4.1. Eliminating choice variables

To eliminate the choice variables from the formulas in Figure 1, we use the following well-known result for computing strongest necessary and weakest sufficient conditions for Boolean formulas4:

LEMMA 1. The strongest necessary and weakest sufficient conditions of Boolean formula not containing variable are given by:

ueq21.gif

Since our definition of satisfiability and validity must also take into account the implicit existence and uniqueness conditions, this standard way of computing strongest necessary and weakest sufficient conditions of Boolean formulas must be slightly modified. In particular, let be a choice variable to be eliminated, and let exist and unique represent the existence and uniqueness conditions involving . Then, we compute strongest necessary and weakest sufficient conditions as follows:

ueq22.gif

After applying these elimination procedures to the constraint system from Figure 1, we obtain two distinct sets of equations of the form:

SYSTEM OF EQUATIONS 2.

ueq23.gif

ESC is analogous to ENC.

EXAMPLE 4. Consider the function given in Example 1, for which Boolean constraints are given in Example 2. We compute the weakest sufficient condition for cacm5308_r.gif:

ueq24.gif

The reader can verify that the strongest necessary condition for cacm5308_r.gif is true. The existence and uniquencess constraints are omitted since they are redundant.

* 4.2. Preservation under substitution

Our goal is to solve the recursive system given in System of Equations 2 by an iterative, fixed point computation. However, there is a problem: as it stands, System of Equations 2 may not preserve strongest necessary and weakest sufficient conditions under substitution for two reasons:

  • Strongest necessary and weakest sufficient conditions are not preserved under negation (i.e., cacm5308_t.gif and cacm5308_u.gif), and the formulas from System of Equations 2 contain negated return () variables. Therefore, substituting ¬ by ¬lceil.gif and ¬lfloor.gifrfloor.gif would yield incorrect necessary and sufficient conditions, respectively.
  • The formulas from System of Equations 2 may contain contradictions and tautologies involving return variables, causing the formula to be weakened (for necessary conditions) and strengthened (for sufficient conditions) as a result of substituting the return variables with their respective necessary and sufficient conditions. As a result, the obtained necessary (resp. sufficient) conditions may not be as strong (resp. as weak) as possible.

Fortunately, both of these problems can be remedied. For the first problem, observe that while cacm5308_t.gif and cacm5308_u.gif, the following equivalences do hold:

ueq25.gif

In other words, the strongest necessary condition of ¬ is the negation of the weakest sufficient condition of , and similarly, the weakest sufficient condition of ¬ is the negation of the strongest necessary condition of . Hence, by simultaneously computing strongest necessary and weakest sufficient conditions, one can solve the first problem using the above equivalences.

To overcome the second problem, an obvious solution is to convert the formula to disjunctive normal form and drop contradictions before applying a substitution in the case of strongest necessary conditions. Similarly, for weakest sufficient conditions, the formula may be converted to conjunctive normal form and tautologies can be removed. This rewrite explicitly enforces any contradictions and tautologies present in the original formula such that substituting the variables with their necessary (resp. sufficient) conditions cannot weaken (resp. strengthen) the solution.

* 4.3. Eliminating recursion

Since we now have a way of preserving strongest necessary and weakest sufficient conditions under substitution, it is possible to obtain a closed form solution containing only observable variables to System of Equations 2 using a standard fixed point computation technique. To compute a least fixed point, we use the following lattice:

ueq26.gif

The lattice L is finite (up to logical equivalence) since there are only a finite number of variables ij and hence only a finite number of logically distinct formulas. This results in a system of bracketing constraints of the form:

SYSTEM OF EQUATIONS 3.

ueq27.gif

Recall from Section 2 that the original constraints have four possible meanings, namely cacm5308_i.gif, true, false, and cacm5308_h.gif, while the resulting closed-form strong necessary and weakest sufficient conditions evaluate to either true or false. This means that in some cases involving nonterminating program paths, the original system of equations may have meaning cacm5308_i.gif in least fixed-point semantics (or cacm5308_h.gif in greatest fixed-point semantics), but the algorithm presented in this paper may return either true or false, depending on whether a greatest or least fixed point is computed. Hence, our results are qualified by the assumption that the program terminates.

EXAMPLE 5. Recall that in Example 4 we computed lfloor.gifcacm5308_r.gifrfloor.gif for the function f defined in Example 1 as:

ueq28.gif

To find the weakest sufficient condition for cacm5308_r.gif, we first substitute true for lfloor.gifcacm5308_r.gifrfloor.gif. This yields the formula 1 or.gif ¬1, a tautology. As a result, our algorithm finds the fixed point solution true for the weakest sufficient condition of cacm5308_r.gif. Since f is always guaranteed to return C1, the weakest sufficient condition computed using our algorithm is the most precise solution possible.

Back to Top

5. Limitations

While the technique proposed in this paper yields the strongest necessary and weakest sufficient conditions for a property P with respect to a finite abstraction, it is not precise for separately tracking the conditions for two distinct properties P1 and P2 and then combining the individual results. In particular, if 1 and 2 are the strongest necessary conditions for P1 and P2 respectively, then 1 and.gif 2 does not yield the strongest necessary condition for P1 and P2 to hold together because strongest necessary conditions do not distribute over conjunctions, and weakest sufficient conditions do not distribute over disjunctions. Hence, if one is interested in combining reasoning about two distinct properties, it is necessary to compute strongest necessary and weakest sufficient conditions for the combined property.

While it is important in our technique that the set of possible values can be exhaustively enumerated (to guarantee the convergence of the fixed point computation and to be able to convert the constraints to Boolean logic), it is not necessary that the set be finite, but only finitary, that is, finite for a given program. Furthermore, while it is clear that the technique can be applied to finite-state properties or enumerated types, it can also be extended to any property where a finite number of equivalence classes can be derived to describe the possible outcomes. However, the proposed technique is not complete for arbitrary nonfinite domains.

Back to Top

6. Experimental Results

We have implemented our method in Saturn, a static analysis framework designed for checking properties of C programs.1 As mentioned in Section 1, sources of imprecision in the analysis appear as nondeterministic choices; in Saturn, sources of imprecision include, but are not limited to, reads from unbounded data structures, arithmetic, imprecise function pointer targets, imprecise loop invariants, and in-line assembly; all of these sources of imprecision in the analysis are treated as choice variables.

We conducted two sets of experiments to evaluate our technique on OpenSSH, Samba, and the Linux kernel. In the first set of experiments we compute necessary and sufficient conditions for pointer dereferences. Pointer dereferences are ubiquitous in C programs and computing the necessary and sufficient conditions for each and every syntactic pointer dereference to execute is a good stress test for our approach. As a second experiment, we incorporate our technique into a null dereference analysis and demonstrate that our technique reduces the number of false positives by close to an order of magnitude without resorting to ad-hoc heuristics or compromising soundness.

In our first set of experiments, we measure the size of necessary and sufficient conditions for pointer dereferences both at sinks, where pointers are dereferenced, and at sources, where

ins03.gif

pointers are first allocated or read from the heap. In Figure 2, consider the pointer dereference (sink) at line 7. For the sink experiments, we would, for example, compute the necessary and sufficient conditions for p's dereference as p! = NULL and.gif flag!=0 and false respectively. To illustrate the source experiment, consider the following call site of function f from Figure 2:

The line marked /*source*/ is the source of pointer p; the necessary condition at p's source for p to be ultimately dereferenced is x > MAX or.gif (x < = MAX and.gif p! = NULL and.gif flag! = 0) and the sufficient condition is x > MAX.

The results of the sink experiments for Linux are presented in Figure 3. The table in Figure 4 presents a summary of the results of both the source and sink experiments for OpenSSH, Samba, and Linux. The histogram in Figure 3 plots the size of necessary (resp. sufficient) conditions against the number of constraints that have a necessary (resp. sufficient) condition of the given size. In this figure, red bars indicate necessary conditions, green bars indicate sufficient conditions, and note that the y-axis is drawn on a log-scale. Observe that 95% of all necessary and sufficient conditions have fewer than five subclauses, and 99% have fewer than ten subclauses, showing that necessary and sufficient conditions are small in practice. Figure 4 presents average necessary and sufficient condition sizes at sinks (rows 2 and 3) for all three applications we analyzed, confirming that average necessary and sufficient condition sizes are consistently small across all of our benchmarks.

Our second experiment applies these techniques to finding null dereference errors. We chose null dereferences as an application because checking for null dereference errors with sufficient precision often requires tracking complex path conditions. In the results presented in Figure 5, we compare two different setups: In the interprocedurally path-sensitive analysis, we use the technique described in the paper, computing strongest necessary conditions for a null pointer to be dereferenced. In the second setup (i.e., the intraprocedurally path-sensitive case), for each function, we only compute which pointers may be dereferenced in that function, but we do not track the condition under which pointers are dereferenced across functions. We believe this comparison is useful in quantifying the benefit of the technique proposed in the paper because, without the elimination of choice variables, (i) the interprocedurally path-sensitive analysis may not even terminate, and (ii) the number of choice variables grows linearly in the size of the program, overwhelming the constraint solver. In fact, for this reason, all previous analyses written in Saturn were either interprocedurally path-insensitive or adopted incomplete heuristics to decide which conditions to track across function boundaries.1

The first three columns of Figure 5 give the results of the experiments for the first setup, and the last three columns of the same figure present the results of the second setup. One important caveat is that the numbers reported here exclude error reports arising from array elements and recursive fields of data structures. Saturn does not have a sophisticated shape analysis; hence, the overwhelming majority (>95%) of errors reported for elements of unbounded data structures are false positives. However, shape analysis is an orthogonal problem which we neither address nor evaluate in this work.

A comparison of the results of the intraprocedurally and interprocedurally path-sensitive analyses shows that our technique reduces the number of false positives by close to an order of magnitude without resorting to heuristics or compromising soundness in order to eliminate errors arising from interprocedural dependencies. Note that the existence of false positives does not contradict our previous claim that our technique is complete. First, even for finite domains, our technique can provide only relative completeness; false positives can still arise from orthogonal sources of imprecision in the analysis. Second, while our results are complete for finite domains, we cannot guarantee completeness for arbitrary domains.

Back to Top

7. Conclusion

We have presented a method for systematically reasoning about unknown values in static analysis systems. We argued that, while representing unknown values by choice variables adds useful precision by correlating multiple uses of the same unknown value, eliminating these choice variables at function boundaries is necessary to avoid both scalability as well as termination problems. We have presented a technique to eliminate these choice variables with no loss of information for answering may and must queries about program properties. We have also experimentally demonstrated that analyzing unknown values in this way leads to much better precision and better scalability.

Back to Top

References

1. Aiken, A., Bugrara, S., Dillig, I., Dillig, T., Hackett, B., Hawkins, P. An overview of the SATURN project. In PASTE (2007), 4348.

2. Ball, T. Rajamani, S. Bebop: A symbolic model checker for Boolean programs. In SPIN (2000), 113130.

3. Ball, T., Rajamani, S. Automatically validating temporal safety properties of interfaces. LNCS 2057 (2001), 103122.

4. Boole, G. An Investigation of the Laws of Thought. Dover Publications, Incorporated, 1858.

5. Cook, B., Gotsman, A., Podelski, A., Rybalchenko, A., Vardi, M. Proving that programs eventually do something good. In POPL (2007), 265276.

6. Das, M., Lerner, S., Seigle, M. ESP: Path-sensitive program verification in polynomial time. In PLDI (2002), 5768.

7. Henglein, F. Type inference and semi-unification. In Conference on LISP and Functional Programming (1988), 184197.

8. Henzinger, T., Jhala, R., Majumdar, R., McMillan, K. Abstractions from proofs. In POPL (2004), 232244.

9. Mycroft, A. Polymorphic type schemes and recursive definitions. In International Symposium on Programming (1984), 217228.

10. Reps, T., Horwitz, S., Sagiv, M. Precise interprocedural dataflow analysis via graph reachability. In POPL (1995), 4961.

11. Schmidt, D. A calculus of logical relations for over-and underapproximating static analyses. Science of Computer Programming 64, 1 (2007), 2953.

Back to Top

Authors

Isil Dillig {isil@cs.stanford.edu}, Computer Science Department, Stanford University.

Thomas Dillig {tdillig@cs.stanford.edu}, Computer Science Department, Stanford University.

Alex Aiken {aiken@cs.stanford.edu}, Computer Science Department, Stanford University.

Back to Top

Footnotes

a. While this function would typically be written using a loop, the same problem arises both for loops and recursive functions, and we use a recursive function because it is easier to explain.

b. Note that rules 3, 10, 11, and 12 implicitly quantify over multiple hypotheses; we have omitted explicit quantifiers to avoid cluttering the rules.

The original version of this paper is entitled "Sound, Complete, and Scalable Path-Sensitive Analysis" and was published in the Proceedings of Programming Language Design and Implementation (PLDI) 2008, ACM.

DOI: http://doi.acm.org/10.1145/1787234.1787259

Back to Top

Figures

F1Figure 1. Inference rules.

F2Figure 2. Example code.

F3Figure 3. Frequency of necessary and sufficient condition sizes (in terms of the number of Boolean connectives) at sinks for Linux.

F4Figure 4. Necessary and sufficient condition sizes (in terms of number of Boolean connectives in the formula) for pointer dereferences.

F5Figure 5. Results of null dereference experiments.

Back to top


©2010 ACM  0001-0782/10/0800  $10.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2010 ACM, Inc.