Abstract
In this paper, we take the fundamental perspective of fuzzing as a learning process. Suppose before fuzzing, we know nothing about the behaviors of a program P: What does it do? Executing the first test input, we learn how P behaves for this input. Executing the next input, we either observe the same or discover a new behavior. As such, each execution reveals “some amount” of information about P‘s behaviors. A classic measure of information is Shannon’s entropy. Measuring entropy allows us to quantify how much is learned from each generated test input about the behaviors of the program. Within a probabilistic model of fuzzing, we show how entropy also measures fuzzer efficiency. Specifically, it measures the general rate at which the fuzzer discovers new behaviors. Intuitively, efficient fuzzers maximize information. From this information theoretic perspective, we develop ENTROPIC, an entropy-based power schedule for greybox fuzzing that assigns more energy to seeds that maximize information. We implemented ENTROPIC into the popular greybox fuzzer LIBFUZZER. Our experiments with more than 250 open-source programs (60 million LoC) demonstrate a substantially improved efficiency and confirm our hypothesis that an efficient fuzzer maximizes information. ENTROPIC has been independently evaluated and integrated into the main-line LIBFUZZER as the default power schedule. ENTROPIC now runs on more than 25,000 machines fuzzing hundreds of security-critical software systems simultaneously and continuously.
1. Introduction
Fuzzing is an automatic software testing technique where the test inputs are generated in a random manner. Due to its efficiency, fuzzing has become one of the most successful vulnerability discovery techniques. For instance, in the three years since its launch, the ClusterFuzz project alone has found about 16,000 bugs in the Chrome browser and about 11,000 bugs in over 160 open-source projects—only by fuzzing. A fuzzer typically generates random inputs for the program and reports those inputs that crash the program. But what is fuzzer efficiency?
In this paper, we take an information-theoretic perspective and understand fuzzing as a learning process. We argue that a fuzzer's efficiency is determined by the average information that each generated input reveals about the program's behaviors. A classic measure of information is Shannon's entropy.22 If the fuzzer exercises mostly the same few program behaviors, then Shannon's entropy is small, the information content of each input is low, and the fuzzer is not efficient at discovering new behaviors. If, however, most fuzzer-generated inputs exercise previously unseen program behaviors, then Shannon's entropy is high and the fuzzer performs much better at discovering new behaviors.
We leverage this insight to develop the first entropy-based power schedule for greybox fuzzing. ENTROPIC assigns more energy to seeds revealing more information about the program behaviors. The schedule's objective is to maximize the efficiency of the fuzzer by maximizing entropy. A greybox fuzzer generates new inputs by slightly mutating so-called seed inputs. Generated inputs that increase code coverage are added to the corpus of seed inputs. The energy of a seed determines the probability with which the seed is chosen. A seed with more energy is fuzzed more often. A power schedule implements a policy to assign energy to the seeds in the seed corpus. Ideally, we want to assign the most energy to those seeds that promise to increase coverage at a maximal rate.
We implemented our entropy-based power schedule into the popular greybox fuzzer LIBFUZZER16 and call our extension ENTROPIC. LIBFUZZER is a widely-used greybox fuzzer that is responsible for the discovery of several thousand security-critical vulnerabilities in open-source programs. Our experiments with more than 250 open-source programs (60 million LoC) demonstrate a substantially improved efficiency and confirm our hypothesis that an efficient fuzzer maximizes information.
Since the publication of the conference article,6 upon which this CACM Research Highlight is based, ENTROPIC has become the default power schedule in LIBFUZZER, which powers the largest fuzzing platforms at Google and Microsoft and fuzzes hundreds of security-critical projects, including the Chrome browser, on 100k machines around the clock.
In order to put our information-theoretic perspective on solid foundations, we explore a probabilistic model of the fuzzing process. We show how non-deterministic greybox fuzzing can be reduced to a series of non-deterministic blackbox fuzzing campaigns, which allows us to derive the local entropy for each seed and the current global entropy for the fuzzer. To efficiently approximate entropy, we introduce several statistical estimators.
Our information-theoretic model explains the performance gains of our entropy-based power schedule for greybox fuzzing. Our schedule assigns more energy to seeds that have a greater local entropy, that is, that promise to reveal more information. However, we also note that this information-theoretic model does not immediately apply to deterministic fuzzers, which systematically enumerate all inputs they can generate, or whitebox fuzzers, which systematically enumerate all (interesting) paths they can explore. For our probabilistic model to apply, the fuzzer should generate inputs in a random manner (with replacement). If a deterministic phase is followed by a non-deterministic phase (for example, in the default configuration of the greybox fuzzer AFL, American Fuzzy Lop), we can compute Shannon's entropy only for the non-deterministic phase.
In summary, our work makes the following contributions:
- We develop an information-theoretic foundation for non-deterministic fuzzing which studies the average information each test reveals about a program’s behaviors. We formally link Shannon’s entropy to a fuzzer’s behavior discovery rate, that is, we establish efficiency as an information-theoretic quantity.
- We present the first entropy-based power schedule to boost the efficiency of greybox fuzzers. We provide an open-source implementation, called ENTROPIC, and present a substantial empirical evaluation on over 250 widely-used, open-source C/C++ programs producing over 2 CPU years worth of data (scripts and data available at https://doi.org/10.6084/m9.figshare.12415622.v2).
2. Probabilistic Blackbox Fuzzing
Fuzzing is an automatic software testing technique where the test inputs are generated in a random manner. Based on the granularity of the runtime information that is available to the fuzzer, we can distinguish three fuzzing approaches. A blackbox fuzzer does not observe or react to any runtime information. A greybox fuzzer leverages coverage or other feedback from the program’s execution to dynamically steer the fuzzer. A whitebox fuzzer has a perfect view of the execution of an input. For instance, symbolic execution enumerates interesting program paths.
Non-deterministic blackbox fuzzing lends itself to probabilistic modeling because of the small number of assumptions about the fuzzing process. Unlike greybox fuzzing, blackbox fuzzing is not subject to adaptive bias. We adopt the recently proposed Software Testing As Species Discovery (STADS) probabilistic model for non-deterministic blackbox fuzzing3: Each generated input can belong to one or more species. Beyond the probabilistic model, STADS provides biostatistical estimators, for example, to estimate, after several hours of fuzzing, the probability of discovering a new species or the total number of species.
2.1. Software testing as discovery of species
Let P be the program that we wish to fuzz. We call P's input space D the set of all inputs that P can take. The fuzzing of P is a stochastic process F = {X1, X2, …, XN} of sampling N inputs with replacement from the program's input space. We call F a fuzzing campaign and a tool that performs F a non-deterministic blackbox fuzzer.
Suppose we can subdivide the search space D into S individual subdomains called species.3 An input Xn ∈ F is said to discover species Di if Xn ∈ Di and there does not exist a previously sampled input Xm ∈ F such that m < n and Xm ∈ Di (that is, Di is sampled for the first time). An input's species is defined based on the dynamic program properties of the input's execution. For instance, each branch that is exercised by input Xn ∈ D can be identified as a species. The discovery of a new species then corresponds to an increase in branch coverage.
Global species discovery. We let pi be the probability that the n-th generated input Xn belongs to species Di, that is, pi = P[Xn ∈ Di] for 1 ≤ i ≤ S and 1 ≤ n ≤ N. We call {pi}i=1..S the fuzzer's global species distribution. The expected number of species discovered by the first n inputs is S(n) = Σi=1..S [1 − (1 − pi)^n]. The discovery rate Δ(n),5 that is, the expected number of species discovered with the (n+1)-th generated test input, is defined as Δ(n) = S(n + 1) − S(n).
Mutation-based blackbox fuzzing. We extend the STADS framework with a model for mutation-based fuzzing. Let C be a set of seed inputs, called the seed corpus, and let qt be the probability that the fuzzer chooses the seed t ∈ C. For each seed t, let Dt be the set of all inputs that can be generated by applying the available mutation operators to t. The mutational fuzzing of t is a stochastic process Ft = {X1, X2, …, XNt} of sampling Nt inputs with replacement by random mutation of the seed t. We call the species that can be discovered by fuzzing a seed t the species in t's neighborhood.
Local species discovery. We let pt,i be the probability that the n-th input generated by mutating the seed t ∈ C belongs to species Di, for 1 ≤ i ≤ S and 1 ≤ n ≤ Nt. We call {pt,i}i=1..S the local species distribution in the neighborhood of the seed t. For a locally unreachable species Dj, we have pt,j = 0. The local discovery rate Δt(n), that is, the expected number of species discovered with the (n+1)-th input generated from t, is defined as Δt(n) = St(n + 1) − St(n).
3. Information Theory of Fuzzing
We provide an information-theoretic foundation for non-deterministic blackbox fuzzing. In the context of fuzzing, Shannon's entropy H has several interpretations. It quantifies the average information a generated test input reveals about the behaviors (that is, species) of the program. Alternatively, we say a generated test input reduces our uncertainty by H information units (for example, nats or bits), on average. Entropy also gives the minimum number of information units needed to reliably store the entire set of behaviors the fuzzer is capable of testing. Moreover, entropy is a measure of diversity. Low entropy means that the program either does not exhibit many behaviors or most generated inputs test the same behaviors (that is, belong to an abundant species).
3.1. Information theory in a nutshell
Shannon's entropy H22 measures the average amount of information in each sample Xn ∈ F about the species that can be observed by executing the program P. When there are S distinct species with probabilities p1, …, pS, the entropy H is

H = −Σi=1..S pi · log(pi).  (3)
Figure 1 illustrates the concept informally. Each color represents a different species. We learn about the colors in each urn by sampling. Just how much we learn from each draw differs from urn to urn. For instance, in Urn 1 it is three times more likely to draw a white ball than a black one. It takes more attempts to learn about black balls in Urn 1 than in Urn 2. Hence, we expect less information about the urn's colors in a draw from Urn 1. In fact, given the same number of colors S, the entropy is maximal when all colors are equiprobable, p1 = … = pS. Among the three urns, we expect to get the maximal amount of information about the urn's colors by drawing from Urn 3. Even though there is still a dominating color (black), there is now an additional color (blue) that can be discovered.
Figure 1. Learning colors by sampling with replacement.
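For a concrete check of these numbers, the following minimal C++ sketch evaluates Shannon's formula (Equation (3)) on hypothetical urn compositions; the exact proportions are our own illustrative assumptions, chosen only to be consistent with Figure 1's description.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Shannon's entropy H = -sum_i p_i * log(p_i), in nats (natural logarithm).
double Entropy(const std::vector<double> &P) {
  double H = 0.0;
  for (double p : P)
    if (p > 0) H -= p * std::log(p);
  return H;
}

int main() {
  // Hypothetical color distributions for the three urns in Figure 1.
  std::printf("Urn 1: %.3f nats\n", Entropy({0.75, 0.25}));     // white 3x black
  std::printf("Urn 2: %.3f nats\n", Entropy({0.5, 0.5}));       // equiprobable
  std::printf("Urn 3: %.3f nats\n", Entropy({0.6, 0.3, 0.1}));  // black dominates, rare blue
}
```

As expected, the equiprobable Urn 2 (0.693 nats) beats the skewed Urn 1 (0.562 nats), and Urn 3 with its additional rare color yields the highest entropy (0.898 nats).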
3.2. If each input belongs to multiple species
Shannon's entropy is defined for the multinomial distribution where each input belongs to exactly one species (for example, exercises exactly one path), so that Σi=1..S pi = 1. However, an input can belong to several species so that Σi=1..S pi > 1. For instance, considering a branch in P as a species, each input exercises multiple branches. The top-level branch is exercised with probability one. When it is possible that Σi=1..S pi > 1, Chao et al.11 and Yoo et al.25 suggest normalizing the probabilities, that is, computing πi = pi / Σj=1..S pj such that Σi=1..S πi = 1. This normalization maintains the fundamental properties of information based on which Shannon developed his formula, that is, that information due to independent events is additive, that information is a non-negative quantity, etcetera. The normalized entropy H is computed as

H = −Σi=1..S πi · log(πi).  (4)
We note that Equation (4) reduces to Equation (3) for the special case where Σi=1..S pi = 1. We also note that the resulting quantity is technically not the average information per input.
3.3. The local entropy of a seed
Recall from Section 2.1 that we call the probabilities {pt,i}i=1..S, where pt,i is the probability that fuzzing a seed t ∈ C generates an input that belongs to species Di, the local species distribution of t. Moreover, we call the set of species {Di | pt,i > 0} the neighborhood of the seed t. From the local species distribution of t, we can compute the local entropy Ht of t as a straightforward application of Equation (4),

Ht = −Σi=1..S πt,i · log(πt,i), where πt,i = pt,i / Σj=1..S pt,j.  (5)
The local entropy Ht of t quantifies the information that fuzzing t reveals about the species in t's neighborhood.
3.4. Information-theoretic efficiency measure
Intuitively, the rate at which we learn about program behaviors, that is, the species in the program, also quantifies a blackbox fuzzer’s efficiency. We formally demonstrate how Shannon’s entropy H characterizes the general discovery rate Δ(n) as follows.
THEOREM 1. Let Shannon's entropy H be defined as in Equation (4). Let Δ(n) be the expected number of new species the fuzzer discovers with the (n + 1)-th generated test input. Then

H = (1/c) · Σn=1..∞ Δ(n)/n + log(c)  (6)

characterizes the rate at which species are discovered in an infinitely long-running campaign, where c = Σi=1..S pi is a normalizing constant.
PROOF. We refer to the extended version.6
According to Theorem 1, entropy measures the species discovery rate Δ(n) over an infinitely long-running fuzzing campaign where discovery is gradually discounted as the number of executed tests n goes to infinity. Notice that Δ(n) ≥ 0 for all n ≥ 0. If we simply took the sum of Δ(n) over all n, we would compute the total number of species S. However, S provides no insight into the efficiency of the discovery process. Instead, the diminishing factor 1/n in Equation (6) reduces the contribution of species discovery as the testing effort n increases. The number of species discovered at the beginning of the campaign has a higher contribution to H than the number of species discovered later. In other words, a shorter “ramp up” time yields a higher entropy.
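For intuition, consider the special case where the pi sum to one, so that c = 1 and Equation (4) reduces to Equation (3). A species Di is discovered by the (n+1)-th input exactly when the first n inputs miss it and the (n+1)-th hits it, so Δ(n) = Σi=1..S pi (1 − pi)^n by linearity of expectation. The 1/n discount then recovers Shannon's formula:

```latex
\begin{align*}
\sum_{n=1}^{\infty} \frac{\Delta(n)}{n}
  &= \sum_{i=1}^{S} p_i \sum_{n=1}^{\infty} \frac{(1-p_i)^n}{n}
     && \text{(swap the order of summation)} \\
  &= \sum_{i=1}^{S} p_i \,\bigl(-\log p_i\bigr)
     && \text{(Mercator series: } -\log x = \textstyle\sum_{n \ge 1} (1-x)^n/n \text{)} \\
  &= H. && \text{(Equation (3))}
\end{align*}
```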
Estimation. We estimate Shannon's entropy H based on how often we have seen each observed species. The incidence frequency Yi for species Di is the number of generated test inputs that belong to Di. Undetected species yield Yi = 0. An unbiased estimator of the discovery probability pi is p̂i = Yi/n, where n is the total number of generated test inputs. By plugging the normalized estimates p̂i into Equation (4), we can estimate the entropy H using maximum likelihood estimation (MLE). In our model, the estimated entropy ĤMLE of H is

ĤMLE = −Σi=1..S π̂i · log(π̂i), where π̂i = Yi / Σj=1..S Yj.  (7)
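As an illustration (not LIBFUZZER's actual code), the estimator in Equation (7) can be computed directly from the incidence frequencies; undetected species (Yi = 0) are simply skipped since they contribute nothing to the estimate.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch of the maximum likelihood entropy estimate (Equation (7)).
// Freqs[i] holds the incidence frequency Y_i of species i.
double MLEEntropy(const std::vector<uint64_t> &Freqs) {
  uint64_t Total = 0;                        // = sum_j Y_j
  for (uint64_t Y : Freqs) Total += Y;
  if (Total == 0) return 0.0;                // nothing observed yet
  double H = 0.0;
  for (uint64_t Y : Freqs) {
    if (Y == 0) continue;                    // undetected species contribute 0
    double Pi = double(Y) / double(Total);   // normalized estimate of p_i
    H -= Pi * std::log(Pi);
  }
  return H;
}
```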
Greybox. Now, non-deterministic blackbox fuzzing satisfies the assumption that the global species distribution is invariant during the campaign. However, this assumption does not hold for greybox fuzzing which leverages program feedback. Generated inputs that have discovered new species (that is, increased coverage) are added to the corpus. The availability of added seeds changes the global species distribution (but not the local distributions for each seed) and thus the global entropy for a greybox fuzzer. In other words, a greybox fuzzer becomes more efficient over time. We refer to the longer conference version6 for a discussion of the required extension of the probabilistic model to estimate global entropy for greybox fuzzing.
Algorithm 1 ENTROPIC Algorithm.
Input: Program P, Initial Seed Corpus C
while ¬TIMEOUT() do
    for all t ∈ C do t.energy = ASSIGNENERGY(t)    // power schedule
    total = Σt∈C t.energy                          // normalizing constant
    for all t ∈ C do t.energy = t.energy / total   // normalized energy
    t = sample t from C with probability t.energy
    t' = MUTATE(t)                                 // fuzzing
    if P(t') crashes then return crashing seed t'
    else if P(t') increases coverage then add t' to C
    for all covered elements i ∈ P exercised by t' do
        Yt,i = Yt,i + 1                            // local incidence freq.
    end for
end while
return Augmented Seed Corpus C
4. Information Theoretic Boosting
We present an entropy-based boosting strategy for greybox fuzzing that maximizes the information each generated input reveals about the species (that is, behaviors) in a program. Our technique ENTROPIC is implemented into the popular greybox fuzzer LIBFUZZER,16 which is responsible for at least 12,000 bugs reported in security-critical open-source projects and more than 16,000 bugs reported in the Chrome browser.20 After a successful independent evaluation of ENTROPIC by Google, our entropy-based power schedule has been integrated into the main-line LIBFUZZER and made the default power schedule.
A greybox fuzzer starts with a corpus of seed inputs and continuously fuzzes these by applying random mutations. Generated inputs that increase coverage are added to the corpus. The probability (that is, frequency) with which a seed is chosen for fuzzing is called the seed’s energy. The procedure that assigns energy to a seed is called the fuzzer’s power schedule. For instance, LIBFUZZER’s standard schedule assigns more energy to seeds that were found later in the fuzzing campaign. It is this power schedule that we modify.
Algorithm 1 shows how greybox fuzzing is implemented in LIBFUZZER; our changes for ENTROPIC are shown as green boxes. In a continuous loop, the fuzzer samples a seed t ∈ C from a distribution that is given by the seeds' normalized energy. This energy is computed using ASSIGNENERGY, which implements one of our information-theoretic power schedules. The seed t is then mutated using random bit flips and other mutation operators to generate an input t'. If the execution crashes or terminates unexpectedly, for example, due to limits on execution time, memory usage, or sanitizers,21 t' is returned as crashing input, and LIBFUZZER stops. If the execution increases coverage, t' is added to the corpus. We call the number of inputs that are generated by fuzzing a seed t ∈ C and that belong to species Di the local incidence frequency Yt,i.
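For illustration, the sampling step in Algorithm 1 is a weighted random choice over the corpus. The sketch below uses the C++ standard library rather than LIBFUZZER's internal data structures; the Seed type and function names are stand-ins.

```cpp
#include <random>
#include <vector>

// Placeholder for a corpus entry; the real corpus entry carries the input
// data, coverage information, and incidence frequencies.
struct Seed { double Energy; };

// Sample the index of the next seed to fuzz, with probability proportional
// to its (schedule-assigned) energy, as in Algorithm 1.
size_t SampleSeedIndex(const std::vector<Seed> &Corpus, std::mt19937 &Rng) {
  std::vector<double> Weights;
  for (const Seed &S : Corpus)
    Weights.push_back(S.Energy);   // e.g., the seed's estimated local entropy
  std::discrete_distribution<size_t> Dist(Weights.begin(), Weights.end());
  return Dist(Rng);
}
```

Since std::discrete_distribution normalizes its weights internally, the explicit normalization step of Algorithm 1 is folded into the distribution here.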
4.2. Entropy-based power schedule
Our entropy-based schedule assigns more energy to seeds that elicit more information about the program's species. In other words, the fuzzer spends more time fuzzing seeds that lead to a more efficient discovery of new behaviors. The amount of information about the species in the neighborhood of a seed t that we expect for each generated test input is measured using the seed's local entropy Ht.
The entropy-based power schedule is inspired by Active Simultaneous Localization And Mapping (SLAM),10 a problem in robot mapping: An autonomous robot is placed in an unknown terrain; the objective is to learn the map of the terrain as quickly as possible. General approaches approximate Shannon’s entropy of the map under hypothetical actions.8,10 The next move is chosen such that the reduction in uncertainty is maximized. Similarly, our schedule chooses the next seed such that the information about the program’s species is maximized.
Improved estimator. During our experiments, we quickly noticed that the maximum likelihood estimator ĤMLE in Equation (7) cannot be used directly. A new seed t that has never been fuzzed will always be assigned zero energy, since all of its incidence frequencies are zero. Hence, it would never be chosen for fuzzing and would forever remain at zero energy. We experimented with a screening phase to compute a rough estimate, where each new seed was first fuzzed a fixed number of times. However, we found that too much energy was wasted gaining statistical power that could otherwise have been spent discovering more species.
To overcome this challenge, we took a Bayesian approach. We know that entropy is maximal when all probabilities are equal. For a new seed t, we assume an uninformative prior over the probabilities pt,i, where pt,i is the probability that fuzzing t generates an input that belongs to species Di. With each input that is generated by fuzzing t, the probabilities are incrementally updated. The posterior is a Beta distribution over pt,i. The estimate of pt,i is thus the mean of this Beta distribution, which is also known as the Laplace estimator (LAP) or add-one smoothing,

p̂t,iLAP = (Yt,i + 1) / (nt + Sg),  (8)
where nt is the number of inputs generated by fuzzing t and Sg = S(n) is the number of globally discovered species.
We define the improved entropy estimator by plugging the Laplace estimates into Equation (4),

ĤtLAP = −Σi=1..Sg π̂t,i · log(π̂t,i), where π̂t,i = p̂t,iLAP / Σj=1..Sg p̂t,jLAP.  (9)
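Combining Equations (8) and (9), a seed's local entropy estimate takes only a few lines. The sketch below is a simplification under assumed names, not ENTROPIC's actual implementation; note that for a never-fuzzed seed (all frequencies zero) all smoothed estimates are equal and the entropy estimate is maximal, which is exactly why new seeds receive high energy.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch of the Laplace-smoothed local entropy estimate (Equations (8), (9)).
// Freqs[i] is the local incidence frequency Y_t,i of each tracked species
// (including zero entries), NumExecs is n_t, the number of inputs generated
// from this seed, and GlobalNumSpecies is S_g. Names are illustrative.
double LaplaceEntropy(const std::vector<uint64_t> &Freqs,
                      uint64_t NumExecs, size_t GlobalNumSpecies) {
  std::vector<double> P;
  double Sum = 0.0;
  for (uint64_t Y : Freqs) {                 // Equation (8): add-one smoothing
    double Est = double(Y + 1) / double(NumExecs + GlobalNumSpecies);
    P.push_back(Est);
    Sum += Est;
  }
  double H = 0.0;
  for (double p : P) {                       // Equation (9): normalized entropy
    double q = p / Sum;
    H -= q * std::log(q);
  }
  return H;
}
```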
Figure 2 illustrates the main idea. Both estimators are nearly unbiased from two hours onwards; in other words, they are within 1% of the true value Ht. In the beginning, the MLE is negatively biased and approaches the true value from below, while the LAP is positively biased and approaches the true value from above. Both estimators robustly estimate the same quantity, but only LAP assigns high energy when seed t has not been fuzzed enough for an accurate estimate of the seed's information Ht.
Figure 2. Mean estimator bias over time. We monitored estimates for the same seed t over 6h across 20 runs. Estimator bias is the difference between the mean estimate and the true value Ht, divided by the true value Ht, where the true value Ht is approximated as the average of both mean estimates at 6 hours.
Measuring information only about rare species. During our initial experiments, we also noticed that the entropy estimates for different seeds were almost the same. We found that the reason is a small number of very abundant species which have a huge impact on the entropy estimate. There are some abundant species to which each and every generated input belongs. Hence, we defined a global abundance threshold ϑ and only maintain local incidence frequencies of globally rare species Di that have a global incidence frequency Yi ≤ ϑ. Intuitively, rare species are the ones creating new behaviors. We refer to the extended version6 for an evaluation of the sensitivity of the boosting technique to the abundance threshold ϑ.
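One way to realize this filter is sketched below; the structure, names, and threshold value are illustrative assumptions rather than ENTROPIC's actual code.

```cpp
#include <cstdint>
#include <unordered_map>

// Sketch of the rare-species filter: local incidence frequencies are only
// maintained while a species' *global* incidence frequency stays at or below
// the abundance threshold.
struct RareSpeciesFilter {
  uint64_t AbundanceThreshold = 0xFF;            // the threshold θ (assumed value)
  std::unordered_map<uint32_t, uint64_t> GlobalFreqs;

  // Record one observation; returns true iff species Di is still rare and
  // should keep counting toward the local entropy estimates.
  bool AddSample(uint32_t SpeciesId) {
    uint64_t &Y = GlobalFreqs[SpeciesId];
    if (Y > AbundanceThreshold) return false;    // abundant: stop tracking
    ++Y;
    return true;
  }
};
```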
5. Experimental Evaluation
Our main hypothesis is that increasing information per generated input increases fuzzer efficiency. To evaluate our hypothesis, we ask the following research questions.
RQ.1 What is the coverage improvement over the baseline?
RQ.2 How much faster are bugs detected compared to the baseline?
RQ.3 What is the cost of maintaining incidence frequencies?
Implementation and baseline. We implemented our entropy-based power schedule into LIBFUZZER (363 lines of change) and call our extension ENTROPIC. LIBFUZZER is a state-of-the-art vulnerability discovery tool developed at Google which has found almost 30k bugs in hundreds of closed- and open-source projects.
As a coverage-based greybox fuzzer (see Alg. 1), LIBFUZZER seeks to maximize code coverage. Hence, our species is a coverage element, called a feature. A feature is a combination of the branch covered and hit count. For instance, two inputs (exercising the same branches) have a different feature set if one exercises a branch more often. Hence, feature coverage subsumes branch coverage. In contrast to LIBFUZZER, ENTROPIC also maintains the local and global incidence frequencies for each feature. We study the performance hit in RQ3.
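To illustrate, a feature can be derived as one identifier per (branch, bucketized hit count) pair. The 8-bucket counter scheme below is the bucketing commonly used by coverage-guided fuzzers and is an assumption here; LIBFUZZER's precise mapping may differ.

```cpp
#include <cstdint>

// Map a raw 8-bit edge counter to one of 8 buckets (assumes Counter >= 1,
// that is, the branch was actually exercised).
static uint32_t CounterToBucket(uint8_t Counter) {
  if (Counter >= 128) return 7;
  if (Counter >= 32)  return 6;
  if (Counter >= 16)  return 5;
  if (Counter >= 8)   return 4;
  if (Counter >= 4)   return 3;
  return Counter - 1;               // counts 1, 2, 3 map to buckets 0, 1, 2
}

// One feature per (edge, bucket) pair: the same branch exercised much more
// often yields a different feature, so feature coverage subsumes branch
// coverage.
uint32_t Feature(uint32_t EdgeId, uint8_t HitCount) {
  return EdgeId * 8 + CounterToBucket(HitCount);
}
```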
* Our extension ENTROPIC has been independently evaluated by the company that is developing LIBFUZZER and was found to improve on LIBFUZZER with statistical significance. ENTROPIC was integrated into the main-line LIBFUZZER and is currently subject to public code review. Once integrated, ENTROPIC is poised to run on more than 25,000 machines fuzzing hundreds of security-critical software systems simultaneously and continuously.
Benchmark subjects. We compare ENTROPIC with LIBFUZZER on two benchmarks containing 250+ open-source programs used in many different domains, including browsers. We conducted almost 1,000 one-hour fuzzing campaigns and 2,000 six-hour campaigns to generate almost two CPU years' worth of data.
FTS (12 programs, 1.2M LoC, 1 hour, 40 repetitions). The Fuzzer Test Suite (FTS) is a standard set of real-world programs to evaluate fuzzer performance. The subjects are widely-used implementations of file parsers, protocols, and databases (for example, libpng, openssl, and sqlite), amongst others. Each subject contains at least one known vulnerability (CVE), some of which require weeks to be found. FTS allows us to compare the coverage achieved as well as the time to find the first crash on the provided subjects. There are originally 25 subjects, but we removed those programs where more than 15% of runs crashed (leaving 12 programs with 1.2M LoC): as LIBFUZZER aborts when the first crash is found, the coverage results for those subjects would be unreliable. We set an 8GB memory limit and ran LIBFUZZER for 1 hour. To gain statistical power, we repeated each experiment 40 times. This required 40 CPU days.
OSS-Fuzz (263 programs, 58.3M LoC, 6 hours, 4 repetitions). OSS-Fuzz is an open-source fuzzing platform developed by Google for the large-scale continuous fuzzing of security-critical software. At the time of writing, OSS-Fuzz featured 1,326 executable programs in 176 open-source projects. We selected 263 programs totaling 58.3 million lines of code by choosing subjects that did not crash or reach the saturation point in the first few minutes and that generated more than 1,000 executions per second. Even for the chosen subjects, we noticed that the initial seed corpora provided by the projects often lead to saturation: Feature discovery effectively stops shortly after the beginning of the campaign, leaving little room for further discovery. Hence, we removed all initial seed corpora. We ran LIBFUZZER on all programs for 6 hours and, given the large number of subjects, repeated each experiment four times. This required 526 CPU days.
Computational resources. All experiments for FTS were conducted on a machine with Intel(R) Xeon(R) Platinum 8170 2.10GHz CPUs with 104 cores and 126GB of main memory. All experiments for OSS-Fuzz were conducted on a machine with Intel(R) Xeon(R) CPU E5-2699 v4 2.20GHz with a total of 88 cores and 504GB of main memory. To ensure a fair comparison, we always ran all schedules simultaneously (same workload), each schedule was bound to one (hyperthread) core, and 20% of cores were left unused to avoid interference. In total, our experiments took more than 2 CPU years which amounts to more than two weeks of wall clock time.
Setup for Figure 2. Throughout the paper, we reported on the results of small experiments. Unless otherwise specified, we used LIBFUZZER with the original power schedule to fuzz the LibPNG project from the Fuzzer Test Suite (FTS), started with a single seed input. For Figure 2, we conducted 20 runs of 6 hours and monitored the single seed input that we started LIBFUZZER with, printing the estimates at regular intervals.
* Empirical results confirm our hypothesis that increasing the average information each generated input reveals about the program’s species increases the rate at which new species are discovered. By choosing the seed that reveals more information, efficiency is improved.
Figure 3 shows the mean coverage over time and the Vargha-Delaney2 effect size Â12. Values above 0.56, 0.63, and 0.71 indicate a small, medium, and large effect size, respectively. More intuitively, the values indicate how much more likely it is for an ENTROPIC run to cover more features than a LIBFUZZER run (or less likely if under 0.5). Values in stars (*Â12*) indicate statistical significance (Wilcoxon rank-sum test; p < 0.05). For instance, for vorbis-2017-12-11 ENTROPIC (with our entropy-based schedule) covers about 700 features in under 15 min while LIBFUZZER (with the original schedule) takes one hour.
Figure 3. Mean coverage in a 60-min fuzzing campaign (12 subjects × 2 schedules × 40 runs × 1 hour ≈ 40 CPU days). The dashed, vertical lines show when ENTROPIC achieves the same coverage as LIBFUZZER in 1 hour. The values at the bottom right give the Vargha-Delaney effect size Â12.
ENTROPIC substantially outperforms LIBFUZZER within the one-hour time budget for 9 of 12 subjects. For two out of three cases where the Â12 effect size is considered medium, the mean difference in feature coverage is substantial (30% and 80% increase for libpng and openssl-1.1.0c, resp.). In almost all cases, ENTROPIC is more than twice as fast (2x) as LIBFUZZER. The same coverage that LIBFUZZER achieves in one hour, ENTROPIC can achieve in less than 30 min. All differences are statistically significant. The coverage trajectories seem to indicate that the benefit of our entropy schedule becomes even more pronounced for longer campaigns. We increase the campaign length to 6h for our experiments with OSS-Fuzz (RQ2).
RQ1.2. Large Scale Validation on OSS-Fuzz
* Results for 263 open-source C/C++ projects validate our empirical findings. ENTROPIC generally achieves more coverage than LIBFUZZER. The coverage increase is larger than 10% for a quarter of programs. ENTROPIC is more than twice as fast as LIBFUZZER for half the programs. The efficiency boost increases with the length of the campaign.
Figure 4(a) shows the mean coverage increase of ENTROPIC over LIBFUZZER on a logarithmic scale over all 263 subjects. The dashed line represents the coverage increase of LIBFUZZER over ENTROPIC for the cases when LIBFUZZER achieves more coverage. We can see that ENTROPIC achieves more coverage than LIBFUZZER after six hours of fuzzing for about 77% of subjects. ENTROPIC covers at least 10% more features than LIBFUZZER for about 25% of subjects. We investigated more closely the 23% of subjects where ENTROPIC achieves less coverage. First, for half of them, the coverage difference was marginal (less than 2%). Second, these subjects were much larger (twice the number of branches on average); as we will see in RQ3, the performance overhead incurred by ENTROPIC grows linearly with the number of branches.
Figure 4. OSS-Fuzz coverage results (263 subjects × 2 schedules × 4 runs × 6 hours ≈ 1.5 CPU years).
Figure 4(b) shows how much faster ENTROPIC is in achieving the coverage that LIBFUZZER achieves in six hours. Again, the dashed line shows the inverse when LIBFUZZER achieves more coverage at the six-hour mark. We can see that ENTROPIC achieves the same coverage twice as fast for about 50% of subjects and four times as fast for 25% of subjects. More specifically, ENTROPIC achieves the same coverage in 1.5h as LIBFUZZER achieves in 6h for 66 subjects.
Figure 4(c) shows the proportion of subjects where ENTROPIC achieves more coverage than LIBFUZZER (that is, wins) over time. Both fuzzers break even at about 10 min. After 30 min, ENTROPIC already wins for 64% of subjects, until at 6 hours, ENTROPIC wins for about 77% of subjects. We interpret this result as ENTROPIC becoming more effective at boosting LIBFUZZER as saturation is being approached. At the beginning of the campaign, almost every new input leads to species discovery. Later in the fuzzing campaign, it becomes more important to choose high-entropic seeds. Moreover, estimator bias is reduced when more inputs have been generated.
* In the OSS-Fuzz benchmark, ENTROPIC found most crashes faster than LIBFUZZER. Some crashes were found only by ENTROPIC. These crashes are potential zero-day vulnerabilities in widely-used security-critical open-source libraries.
Figure 5 shows the time it takes to find each crash as an aggregate statistic, in ascending order over all crashes that have been discovered in any of the four runs of all subjects, for both fuzzers. For instance, for 2.5% of the 263 · 4 runs, ENTROPIC finds a crash in 1.5 hours or less while LIBFUZZER takes two hours or less. The crashes are real and potentially exploitable. All subjects are security-critical and widely used. Google maintains a responsible disclosure policy for bugs found by OSS-Fuzz, which gives maintainers some time to patch the crash before the bug report is made public. Three bugs were discovered only by ENTROPIC.
Figure 5. OSS-Fuzz crash Time-To-Error results (≈ 1.5 CPU years). X% of runs crashed in Y hours or less. ENTROPIC (dashed) and LIBFUZZER (solid). Lower is better.
* In the FTS benchmark, there is a 2% median overhead for maintaining incidence frequencies compared to the entire fuzzing process. There is a 12% median overhead compared to the time spent only in the fuzzer (not in the subject).
Figure 6 shows the proportion of the time that ENTROPIC spends in the different phases of the fuzzing process. In all cases, most time is spent executing the subject (light gray). ENTROPIC executes the subject between 10,000 and 100,000 times per second. The remainder of the time is spent in the fuzzer, where the darker gray bars represent functions that LIBFUZZER normally performs while the black bars represent the overhead brought by ENTROPIC.
Figure 6. ENTROPIC instrumentation overhead.
The maintenance of incidence frequencies takes more time away from the fuzzing process than we expected, given the substantial performance gains discussed in RQ1 and RQ2. Note that ENTROPIC outperforms LIBFUZZER despite this additional overhead. ENTROPIC is a prototype. We are confident that there are plenty of opportunities to reduce this overhead to further boost ENTROPIC’s efficiency.
6. Threats to Validity
Like for any empirical study, there are threats to the validity of our results. We adopted several strategies to increase internal validity. In order to put no fuzzer at a disadvantage, we used default configurations, provided the exact same starting conditions, and executed each technique several times and under the same workload. The time when the fuzzer crashes identifies unambiguously when a bug is found. To define species in our experiments, we use the natural measure of progress for LIBFUZZER and its extension ENTROPIC. To mitigate threats to construct validity, such as bugs in ENTROPIC or observed performance differences that are not due to our discussed improvements, we extended the baseline LIBFUZZER using a readily comprehensible 363 lines of code. We adopted several strategies to increase external validity. We repeated all experiments from which we derive empirical statements (RQ1, RQ2, RQ3) at least 40 times. To increase the generality of our results, we conducted experiments on OSS-Fuzz totaling 263 C/C++ programs and 58.3 million LoC.
For a sound statistical analysis, we followed the recommendations of Arcuri et al.2 and Klees et al.15 to the extent to which our computational resources permitted.
7. Related Work
Information theory has previously found application in software test selection. Given a test suite T and the probability p(t) that test case t ∈ T fails, one may seek to minimize the number of test cases t ∈ T to execute while maximizing the information that executing t would reveal about the program's correctness. Yang et al.23,24 give several strategies to select a test case t' ∈ T (or a size-limited set of test cases T' ⊆ T) such that—if we were to execute t' (or T')—the uncertainty about test case failure (that is, entropy) is minimized. Unlike our model, the model of Yang et al. requires specifying for each input the expected output as well as the probability of failure (that is, the observed output not matching the expected output). Hence, Yang's model is practical only in the context of test selection, but not in the context of automated test generation. Similarly, Feldt et al.12 propose an information-theoretic approach to measure the distance between test cases based on Kolmogorov complexity and use it to maximize the diversity of selected tests. Although their idea is complementary to ours, it is computationally too expensive to be directly applied to test generation. Finally, Ankou18 considers fuzzing as a random process in a multidimensional space, which enables the detection of different combinations of species in the fuzzer's fitness function.
Information theory has also found application in software fault localization. Given a failing test suite T, suppose we want to localize the faulty statement as quickly as possible. Yoo, Harman, and Clark25 discuss an approach to execute test cases in the order of how much information they reveal about the fault location. Specifically, test cases that most reduce the uncertainty about which statement is the fault location are executed first. Campos et al.9 propose a search-based test generation technique with a fitness function that maximizes the information about the fault location. In contrast, our objective is to quantify and maximize the efficiency of the test generation process in learning about the program's behaviors (including whether or not it contains faults).
Bug finding efficiency and scalability are important properties of a fuzzing campaign. Böhme and Paul7 conduct a probabilistic analysis of the efficiency of blackbox versus whitebox fuzzing, and provide concrete bounds on the time a whitebox fuzzer can take per test input in order to remain more efficient than a blackbox fuzzer. Böhme and Falk4 empirically investigate the scalability of non-deterministic black- and greybox fuzzing and postulate an exponential cost of vulnerability discovery. Specifically, they make the following counter-intuitive observation: Finding the same bugs linearly faster requires linearly more machines. Yet, finding linearly more bugs at the same time requires exponentially more machines. For recent improvements to fuzzing, we refer to Manès et al.17
Alshahwan and Harman1 introduced the concept of “output uniqueness” as (blackbox) coverage criterion, where one test suite is considered as more effective than another if it elicits a larger number of unique program outputs. This blackbox criterion turns out to be similarly effective as whitebox criteria (such as code coverage) in assessing test suite effectiveness. In our conceptual framework, a unique output might be considered as a species.
8. Conclusion
In this paper, we presented ENTROPIC, the first greybox fuzzer that leverages Shannon’s entropy for scheduling seeds. The key intuition behind our approach is to prefer seeds that reveal more information about the program under test. Our extensive empirical study confirms that our information-theoretic approach indeed helps in boosting fuzzing performance in terms of both code coverage and bug-finding ability.
Information theory. We formally link entropy (as a measure of information) to fuzzer efficiency, develop estimators, and boosting techniques for greybox fuzzing that maximize information, and empirically investigate the resulting improvement of fuzzer efficiency. We extend the STADS statistical framework3 to incorporate mutation-based blackbox fuzzing where a new input is generated by modifying a seed input. We hope that our information-theoretic perspective provides a general framework to think about efficiency in software testing irrespective of the chosen measure of effectiveness (that is, independent of the coverage criterion).
Practical impact. Our implementation of ENTROPIC has been incorporated into LIBFUZZER, one of the most popular industrial fuzzers. At the time of writing, ENTROPIC was enabled for 50% of fuzzing campaigns that are run on more than 25,000 machines for finding bugs and security vulnerabilities in over 350 open-source projects, including Google Chrome. After several additional improvements, ENTROPIC now outperforms all other fuzzers available on FuzzBench,19 Google's fuzzer benchmarking platform. This result highlights the practical impact of our approach.
Open science and reproducibility. The practical impact of ENTROPIC is a testament to the effectiveness of open science, open source, and open discourse. A growing number of authors publicly release their tools and artifacts.13 Conferences are adopting artifact evaluation committees to support reproducibility,14 but, as always, more can be done to accommodate reproducibility as a first-class citizen in our peer-reviewing process. We strongly believe that openness is a reasonable pathway to foster rapid and sound advances in the field and to enable a meaningful engagement between industry and academia.
Acknowledgments
We thank the anonymous nominator(s) and the ACM SIGSOFT as well as the CACM Research Highlight selection committees for allowing us to present our work to such a broad and distinguished readership. This work was partly funded by the Australian Research Council (ARC) through a Discovery Early Career Researcher Award (DE190100046). This research was supported by the use of the Nectar Research Cloud, a collaborative Australian research platform supported by the NCRIS-funded Australian Research Data Commons (ARDC). This work was partly supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2019-0-01697, Development of Automated Vulnerability Discovery Technologies for Block-chain Platform Security).