Research for Practice: The Fun in Fuzzing

three flat targets with either a dart or swordfish in bullseye, illustration

For this edition of Research for Practice (RfP), we enlisted the help of Stefan Nagy, an assistant professor in the Kahlert School of Computing at the University of Utah. We thank John Regehr—who has written for RfP before—for making this introduction.

Nagy takes us on a tour of recent research in software fuzzing, or the systematic testing of programs via the generation of novel or unexpected inputs. The first paper he discusses extends the state of the art in coverage-guided fuzzing (which measures the testing progress in terms of program syntax) with the semantic notion of “likely invariants,” inferred via techniques from property-based testing. The second explores encoding domain-specific knowledge about certain bug classes (for example, use-after-free errors) into test-case generation. His last selection takes us through the looking glass, randomly generating entire C programs and using differential analysis to compare traces of optimized and unoptimized executions, in order to find bugs in the compilers themselves.

I learned a lot from reading these papers and hope you will too.
—Peter Alvaro

Peter Alvaro is an associate professor of computer science at the University of California Santa Cruz, where he leads the Disorderly Labs research group (disorderlylabs.github.io).

Fuzzing (shorthand for fuzz testing) represents perhaps the most historically successful technique for automated software bug and vulnerability discovery. Fuzzing’s core workflow is simple: Generate lots of test cases; run each on your target program and observe what happens; and then group test cases by their observed execution behaviors (for example, crashes, code coverage, and timeouts), keeping those that exhibit some notion of “interesting” behavior (although what “interesting” means here is entirely up to you).

Over the past several decades, fuzzing has seen a significant evolution: from its origins as a crude, brute-force testing strategy to becoming one of today’s most sophisticated—and essential—techniques for uncovering critical security vulnerabilities at scale. While there are, by now, a countless number of amazing fuzzing papers—with more and more being published each year—this article looks at three particularly innovative and inspiring papers from the past two years.

Moving on from Coverage-Only Fuzzing
A. Fioraldi, D.C. D’Elia, and D. Balzarotti.
The use of likely invariants as feedback for fuzzers. In Proceedings of the 30^th Usenix Security Symposium, 2021; https://www.usenix.org/conference/usenixsecurity21/presentation/fioraldi

Most real-world fuzzing today adopts a code-coverage-maximizing feedback strategy: saving just the inputs that cover previously unseen target code and mutating them, ideally to create new coverage-increasing inputs. You can think of this like horse racing, where the goal is to breed future generations of winners—coverage-increasing inputs—from prior winners. Yet, because coverage-guided fuzzers’ only focus is expanding their frontier of code coverage, they will often miss the many bugs hidden in already-covered code.

To overcome coverage-only fuzzing’s exploration limitations, this 2021 Usenix paper augments fuzzing with property testing’s semantic invariants: a set of “rules” about what states a program’s variables are expected to take on. For example, the following assertion statement

assert (x < 5)

mandates that variable x be less than 5; and should this invariant be violated, the program will signal that this variable has taken on an unusual state. Automatically mining invariants, however, often brings significant false positives and noise. Naively tracking every possible invariant will both overwhelm a fuzzer with too much feedback and deteriorate its throughput from all of the per-invariant instrumentation overhead.

To mitigate this, the authors replay previously generated fuzzing inputs—a great way to leverage ordinarily discarded fuzzing data—and introduce a set of heuristics to prune impossible-to-violate, redundant, and otherwise erroneous invariants. These final invariants are then each instrumented with a check that signals any violations to the fuzzer.

By combining invariant and coverage feedback, the authors’ approach helps fuzzers uncover many more program states than code coverage alone—revealing 34%–56% more bugs at a negligible 8% overhead over coverage-only fuzzing.

With coverage-only fuzzing missing so many bugs, invariant-guided fuzzing is likely soon to become a first-class mode in popular fuzzing frameworks such as AFL++ and libFuzzer. Although the authors’ implementation supports only source-available targets, advancements in binary lifting, translation, and rewriting leave me hopeful that invariant-guided fuzzing techniques will soon be extended to harder-to-fuzz closed-source software and systems.

Tailoring Fuzzing to Specific Vulnerabilities
M-D. Nguyen, S. Bardin, R. Bonichon, R. Groz, and M. Lemerre.
Binary-level directed fuzzing for use-after-free vulnerabilities. In Proceedings of the 23^rd International Symposium on Research in Attacks, Intrusions and Defenses, 2020; https://www.usenix.org/conference/raid2020/presentation/nguyen

A major challenge in fuzzer design is balancing bug-finding speed and breadth. Although coverage-guided fuzzing is quick to expose low-hanging-fruit bugs such as buffer overflows, its coverage-maximizing search is ineffective at triggering bugs that require complex program states (for example, temporal memory-safety violations). Prior attempts to improve general-purpose bug discovery explored finer-grained levels of code coverage (for example, n-gram control-flow edges) but often deteriorated fuzzing speed because of their higher instrumentation and coverage bookkeeping costs. Thus, improving software security requires not only strong general-purpose fuzzers, but also those specialized to certain classes of vulnerabilities.

This paper introduces a directed fuzzing technique tailored to use-after-free vulnerabilities—a dangerous class of temporal memory safety errors that occurs when a program attempts to access a previously freed region of memory. General-purpose fuzzers often struggle to synthesize the series of program events needed to trigger use-after-frees: an allocation; a free; and an access. To overcome this, the authors devise a directing fuzzing strategy to hit these function calls in sequence—effectively incorporating a temporal awareness to guide fuzzing’s exploration toward these vulnerabilities.

Unsurprisingly, the authors’ approach sees great success on use-after-free vulnerabilities: finding known use-after-frees two times faster than existing temporal-agnostic directed fuzzers, while also revealing 11 new use-after-frees in well-fuzzed programs such as Perl and MuPDF.

Although the authors’ prototype targets binary executables, sequence-directed fuzzing easily scales to source-available fuzzing targets, as well as other temporal bugs such as double-frees. I anticipate there will be many more opportunities to improve fuzzing by combining general-purpose and bug-tailored methodologies.

Applying Fuzzing to Other Problems
G.A. Di Luna, D. Italiano, L. Massarelli, S. Osterlund, C. Giuffrida, and L. Querzoni.
Who’s debugging the debuggers? Exposing debug information bugs in optimized binaries. In Proceedings of the 26^th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021; https://dl.acm.org/doi/10.1145/3445814.3446695

Many post-deployment quality-assurance tasks rely on debug information (for example, DWARF): a compiler-emitted mapping of an executable’s code to its higher-level source representation (for example, variables, types, and instructions). Discrepancies in debug information—missing or irrelevant artifacts—are often caused by subtle bugs in development tool chains and exacerbated by compilers’ aggressive optimizations. Preventing developers from being led astray by mishandled debug information demands effective, automated approaches for uncovering these hidden bugs in modern development toolchains.

Improving software security requires not only strong general-purpose fuzzers, but also those specialized to certain classes of vulnerabilities.

This paper introduces a debug-information fuzzing framework based on differential analysis: randomly generating C programs and comparing their optimized and unoptimized versions’ debugging traces for discrepancies. For their differential analysis, the authors formalize a set of trace invariants, which stipulate that an optimized binary maintain consistency with the debugging information (for example, variables, functions, and source lines) present in its unoptimized version’s debugging trace at any point during execution.

The authors then triage discrepancy-inducing test cases to identify where bugs occur, either in the compiler’s emitting of debug information or in the debugger’s parsing and interpretation of it. The authors apply their automated fuzzing technique to a variety of widely used software development toolchains.

Overall, the authors’ differential fuzzing approach demonstrates impressive effectiveness: uncovering 34 debug information errors across the LLVM, GNU, and Rust compiler/debugger toolchains—with more than a dozen reported issues already confirmed as being fixed. As vetting the correctness of debug information is still a fairly new application of fuzzing, I expect future fuzzers will improve upon the techniques in this paper toward even greater scrutiny of development toolchains. Moreover, I am optimistic the power of differential fuzzing will yield useful strategies for other nontrivial classes of bugs, software, and systems.

Happy fuzzing!