Opinion
Software Engineering and Programming Languages

It’s the Specification, Stupid!

Seeking a significant shift in the traditional software development and verification paradigm.


Key components of the software supply chain can and should be designed with reasonable confidence that they will not fail. But this requires a shift from the prevailing test-fix-test coding cycles to a better paradigm where software is generated from rigorously validated “big” specifications.

Software is the primary driver of the modern digital society. But software is also a major source of failure due to countless bugs and vulnerabilities waiting to be triggered or maliciously exploited. The consequences of software failure are often severe:

  • The estimated engineering cost of fixing poor-quality software exceeds $1 trillion annually in the U.S. alone,a with failure to patch known vulnerabilities being the largest contributor to these costs, and

  • Cybercrime, which thrives on software vulnerabilities, is estimated to be another $8 trillion a year business and growing;b that is nearly $1 billion every hour.

Given this track record, traditional software development and verification techniques that rely heavily on testing and manual inspection have proven both costly and largely ineffective in dealing with the complexity of today’s software supply chain. In fact, the prevailing approach of piling band-aids on top of band-aids on top of poor-quality software only seems to fuel a futile and unwinnable race to the bottom. The code generated by large language models (LLMs) is likely to make things worse, because prompting LLMs is an art in itself, LLMs can respond with unintended and buggy software, and reliable self-checking capabilities of LLMs are currently more wishful thinking than reality.4

Formal Verification

A viable, if not the only, alternative to traditional bug hunting is formal verification, which is a set of mathematically based verification techniques designed to support the principled development of software-intensive systems. There has been impressive progress in integrating formal verification with industrial software development (for example, Chong, N. et al.3). Moreover, complex microprocessors, separation kernels, real-time operating systems, fault-tolerant algorithms, and cryptographic libraries are nowadays formally verified almost routinely, and billions of small theorems are proved by machines every day (for example, Rungta8). The key driving force behind these success stories is the ongoing satisfiability revolution that is making Vannevar Bush’s prophecy, “We may someday click off arguments on a machine with the same assurance that we now enter sales on a cash register,” come true.2 However, despite all the progress, formal verification is not widely used and there is still a large gap between promise and practice.

The Specification Gap

Specification is undoubtedly the Achilles’ heel of formal verification, since specifications tend to be imprecise, ambiguous, inconsistent, even outright wrong, and constantly changing. There is also a lack of agreed-upon, ready-to-use specifications of basic software building blocks, since modeling real-world software with all its idiosyncrasies is a very difficult task to solve satisfactorily. This is exemplified by the messy realities of programming languages, programming interfaces with emergent behavior that is often largely undocumented, and the myriad poorly defined standards for data and exchange protocols. The specification gap is also fundamental. Even the most precise specification is at best a shadow on the wall,7 in the sense that it can never fully capture the realities of both the underlying technical computational system and the uncertainties of the physical world in which, and on which, it operates. As a direct result of the specification gap, it is not uncommon for even formally verified software to still contain alarming bugs (for example, Monniaux6), which is why formal software verification is often regarded as an academic Glasperlenspiel.

Verify First

We are certainly not alone in arguing that debugging is most effective and efficient in the earlier stages of system development, because the cost of finding and fixing bugs increases dramatically as the system development life cycle progresses. But verification, formal or otherwise, currently still comes rather late in this life cycle. By then, the most serious design errors have already been made, and errors are to be found at least as often in specifications as in their implementations. In this way, verification is like locking the barn door after most of the horses have bolted, and it is mainly reduced to finding and fixing remaining bugs, often against flawed specifications.

There are more efficient (and often more effective) late life-cycle verification methods than formal verification, such as fuzzing. These simulation-based techniques do not need to rely on a full set of formal specifications; they are accurate because they exercise the actual, bare-metal computational environment; and they are needed anyway to complement formal verification. For all these reasons, simply trying to replace traditional test-based software verification with formal verification is not a winning proposition. But in the early requirements and specification stages of design, when the most egregious and costly errors can still be avoided, formal verification is likely to be most effective.
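To make the simulation-based idea concrete, the sketch below shows a minimal property-based fuzzing loop. No formal specification of the implementation is needed, only an executable property to check against random inputs; the function, property, and input ranges are invented for illustration.

```python
import random

def clamp(value, lo, hi):
    """Example function under test: restrict value to the interval [lo, hi]."""
    return max(lo, min(value, hi))

def property_holds(value, lo, hi):
    """Executable property: the result must always lie within [lo, hi]."""
    result = clamp(value, lo, hi)
    return lo <= result <= hi

def fuzz(trials=10_000, seed=42):
    """Throw random inputs at the property; report the first counterexample."""
    rng = random.Random(seed)
    for _ in range(trials):
        lo = rng.randint(-1000, 1000)
        hi = rng.randint(lo, 1000)            # keep the interval well-formed
        value = rng.randint(-10_000, 10_000)
        if not property_holds(value, lo, hi):
            return (value, lo, hi)            # counterexample found
    return None                               # no counterexample in this run

print(fuzz())  # None: no counterexample in 10,000 trials
```

Such a loop explores only finitely many behaviors and so can never prove correctness, which is exactly why it complements rather than replaces formal verification.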

Generate Later

The formal verification of software is often excruciatingly difficult. One reason is that algorithms are expressed in real-world programming languages with their machine-oriented data types and numerous side effects (state updates, interrupts, and so on). As a result, many fundamental laws, such as replacing equals with equals, no longer apply. In addition, the underlying program verification calculi, usually variants of the Hoare calculus, are intuitive and downright elegant for toy programming languages but become unwieldy for the full range of programming constructs and the overwhelming complexity of the state spaces found in modern software systems. This is especially true of shared-variable concurrency. Given these limitations and complexities, formal verification at the level of real-world programming languages can feel like driving a car with the handbrake on.

Worse, verification is often performed on software whose functionality is embedded in complex structures of operating systems, communication buses, and middleware. The correctness of the software therefore depends not only on its functional behavior but also on the underlying technical infrastructure, which must be verified over and over again, instead of verifying the correctness of abstract code once and for all and then using verified translations to generate concrete realizations. Such autocoding of production-quality software from more abstract executable models is standard industry practice because it has the huge advantage of reusing and adapting golden reference models across different applications, operational contexts, and technical platforms. Because performant code can now also be generated from declarative models written in languages such as PVS and Lean, traditional code verification will increasingly be decoupled into the verification of correct functional behavior and the X-by-construction generation of technical realizations (where X is functionally correct, secure, resilient, and so on).
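The loss of equational reasoning mentioned above is easy to demonstrate. In the toy Python sketch below (not tied to any particular verification calculus), a function with a hidden side effect breaks the law that f(1) + f(1) may be replaced by 2 * f(1):

```python
calls = 0

def f(x):
    """A function with a side effect: it mutates a global counter."""
    global calls
    calls += 1
    return x + calls

# In a pure language, f(1) + f(1) and 2 * f(1) would be interchangeable.
a = f(1) + f(1)   # evaluates f twice: (1 + 1) + (1 + 2) = 5
calls = 0         # reset the hidden state for a fair comparison
b = 2 * f(1)      # evaluates f once:  2 * (1 + 1) = 4

print(a, b)       # 5 4 -- "equals" are no longer equal
```

Any verification calculus for such a language must track the hidden counter as part of the program state, which is one small instance of the state-space blowup described above.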

Big Specification

A non-negligible but often neglected side effect of large-scale verification efforts is the formal specification of core elements of the software and hardware infrastructure. Therefore, building on the achievements of formal verification, and as a next logical step, it should be possible to construct and maintain a set of “big” interface specifications for essential components of the software supply chain. This includes instruction set architectures, separation kernels, cryptographic libraries, real-time operating systems, and databases, as well as high-level development steps such as architectural refinement, deployment optimization, and compilation.

Structuring big specifications in terms of precise interface behavior and modular composition is well understood,1 and ongoing big-specification efforts between practitioners and academics include the formalization of WebAssembly, the RISC-V instruction set architecture, and the seL4 hypervisor. But the recent Big Specification meeting at the Isaac Newton Institute5 confirmed that a sufficiently large set of big specifications that can be composed to create real-world applications is still lacking.

The development of big specifications is largely a social process. It involves a variety of stakeholders with different backgrounds and interests. The challenge is therefore to understand, integrate, and build on each other’s proposals, change requests, and findings in order to develop and maintain coherent big specifications. SRI’s semantic browser supports this goal by structured, hyperlinked browsing of stakeholder-specific generated views of evolving standards documents (for example, the PDF data format), and by semantic parsing to generate formal models from standards documents, which are typically expressed in stylized natural language. Standards documents can therefore be formally analyzed with all kinds of verification techniques, including static analysis, model checking, and theorem proving. Moreover, machine-executable reference implementations are generated from the formal model, and the effect of proposed changes to the standard is highlighted and traced down to the implementation. For example, there may be confirmation that a proposed change to the standard will have no effect on the correctness of a particular implementation. Note that there is a need for specialized semantic parsers for formalizing standards documents, since current LLMs produce seemingly convincing but ultimately incorrect results, hallucinate about the contents of standards documents, and seem to rely more on reproducing memorized content than actually parsing the provided input specifications. Nevertheless, LLMs are often helpful, at least in the early stages of exploration and specification writing.
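To make the idea of semantic parsing concrete, the hypothetical sketch below turns one stylized requirement sentence, of the kind found in standards prose, into a machine-checkable constraint. The sentence pattern and the field name are invented for illustration and do not reflect SRI's actual tooling.

```python
import re

# One stylized requirement form often found in standards documents,
# e.g. "The Length field MUST be between 1 and 255."
REQUIREMENT = re.compile(
    r"The (?P<field>\w+) field MUST be between (?P<lo>\d+) and (?P<hi>\d+)\."
)

def parse_requirement(sentence):
    """Translate a stylized requirement sentence into an executable predicate."""
    m = REQUIREMENT.fullmatch(sentence.strip())
    if m is None:
        raise ValueError("sentence does not match the supported pattern")
    field, lo, hi = m["field"], int(m["lo"]), int(m["hi"])
    return field, lambda value: lo <= value <= hi

field, check = parse_requirement("The Length field MUST be between 1 and 255.")
print(field, check(0), check(100))  # Length False True
```

The predicate returned here is the seed of a machine-executable reference implementation: once requirements are in this form, they can be fed to static analyzers, model checkers, or test generators.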

Front-Loading

Key components of the software supply chain can and should be designed, with reasonable confidence, to be free of failure in operation. But this requires a shift away from the dominant test-fix-test cycles of von Neumann-style programming and its generally poor support for separating small-scale from large-scale programming. Untrusted software should not be the input to a formal process; trusted software should be its output. A better paradigm therefore focuses on early life-cycle design, assurance, and verification activities, followed by the generation of correct and resilient technical realizations from big specifications. This shift in effort can simultaneously unlock several orders of magnitude in productivity and reuse potential while achieving extremely high levels of resilience and security.

Resilient Software Infrastructure

All of this effort amounts to a tiny fraction of the resources we currently spend on the Sisyphean tasks of continually patching software and repairing the damage caused by largely preventable software failures. Based on our experience with large-scale verification, we estimate a coordinated community effort of only 100 to 120 person-years to develop a core product line of eight to 10 composable big specifications of the global software supply stack. It is also the right time to be doing this, as we have entered an era where generating code is cheap. But, ultimately, societal impetus must create the right incentives to redirect available resources toward a maintainable and resilient software infrastructure, with the promise of greater innovation, competitiveness, and prosperity.

    • 1. Broy, M. and Selic, B. Specifying and composing layered architectures. J. Object Technology 23, 1 (Jan. 2024).
    • 2. Bush, V. As we may think. The Atlantic (1945); https://bit.ly/47sa3GV
    • 3. Chong, N. et al. Code-level model checking in the software development workflow at Amazon Web Services. Software: Practice and Experience 51, 4 (2021).
    • 4. Hampson, M. ChatGPT is terrible at checking its own code. IEEE Spectrum (Dec. 2024); https://bit.ly/4qaL2Yt
    • 5. Krishnaswami, N., Sewell, P., and Shankar, N. Big specification: Specification, proof, and testing at scale. Isaac Newton Institute (2024); https://bit.ly/4qavsMp
    • 6. Monniaux, D. and Boulmé, S. The trusted computing base of the CompCert verified compiler. In Proceedings of the European Symposium on Programming (2022).
    • 7. Plato. Allegory of the cave. Republic, Book VII (∼380 BCE).
    • 8. Rungta, N. A billion SMT queries a day. In Proceedings of the International Conference on Computer Aided Verification. Springer (2022).
