Dear KV,
By now, I am sure you have seen the CrowdStrike news, and I cannot help but imagine you have opinions on what really went wrong to cause so many problems. The news, tech and nontech alike, is full of explanations of what went wrong: poor testing, relying too much on one company for critical infrastructure, letting a failure in a critical module block the very remote updates that could have fixed it … the list goes on. With all that has already been said about this topic, much of it seems like finger-pointing, and I wonder whether anyone has gotten to the heart of the matter, or whether no one ever will and we will simply have to live with these sorts of outages—like mini-Y2Ks, only worse.
Missing the Heart
Dear Heartless,
Like anyone else who was not living under a rock, I did see the CrowdStrike issue hit the news, and I am glad my flight made it out before the systems shut down because it was a lot easier to point and laugh after the flight than it was for others who were stuck on the ground. The flight interruptions were only the most visible component of the failure, because huge queues at airports make for great news coverage. But CrowdStrike didn’t just disrupt travel for days: it also affected hospitals and doctors’ offices, banks and ATMs, as well as many other systems that people use daily.
For people who work in computer security, it was actually the day we had all been waiting for: a clear example of "I told you so!" that is explainable to nontechnical as well as technical folk. You do not need to understand NULL pointer exceptions to get that when your computer does not work, the world doesn't either. And this is probably the best part (if there is a best part) of this disaster: As far as we know, no one died from this, which is good, and everyone now knows that the world we've built rests upon pretty shaky ground. It is the wake-up call a lot of us have been waiting for. The question now is: Will we answer the phone?
A lot of ink has been spilled about how all this came about, from the low-level explanation of the NULL pointer exception, to the way testing missed the issue, to the suggestion that perhaps a remotely pushed software update should never be able to leave a system unable to boot without manual intervention.
The questions to ask now are not: “How do we better lock things down in the current state?” or “How do we have better development practices so we do not push a NULL pointer bug in the lowest level of OS code?” These are good questions, but they aren’t the heart of the matter.
The heart of the matter is how systems are built—systems software in particular—with unsafe languages, on unsafe hardware, connected to a network on which nobody can be trusted.
Let’s start at the hardware layer and work our way up. Current computer hardware is a wildly complex beast, composed of a diverse set of interconnected elements that, overall, trust each other.
What do I mean by trust? Take the issue of computer memory, whence our NULL pointer errors arrive. The majority of memory-related software vulnerabilities come from the fact that anyone can do pointer arithmetic. Take a memory address, add or subtract a number, and—voilà!—you have another possibly valid memory address.
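Here is a minimal sketch of that in plain C; the arrays and names are mine, and the out-of-bounds read is formally undefined behavior, yet a typical compiler accepts it without complaint and the hardware cheerfully executes it.

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    char secret[8];
    char buffer[8];

    strcpy(secret, "s3cr3t");
    strcpy(buffer, "public");

    char *p = buffer;
    /* Pointer arithmetic: p + 8 is "another possibly valid memory
     * address" -- quite possibly the neighboring array, depending on
     * how the compiler laid out the stack. Nothing stops us from
     * reading through it. */
    char *q = p + 8;
    printf("out-of-bounds read: %s\n", q);
    return 0;
}
```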
It is game-playing of this type that is at the heart of most computer viruses and has inspired many types of security protections that have been attempted over the years, such as ASLR (address space layout randomization), no-execute bits in hardware, W^X (write xor execute) permissions, and many others. These protections have a checkered history, sometimes working for short periods only to be overcome in the course of the arms race that is computer security. The heart of the matter for hardware is that we continue to pretend we are working with a minicomputer from the 1970s and that transistors are at a premium, which they are not.
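As a small illustration of what one of those protections means in practice, here is a hedged sketch of W^X from a program's point of view: it asks the operating system for a page that is both writable and executable, which is exactly what classic code-injection attacks need. On a system that enforces write-xor-execute (OpenBSD, for example), the request is refused; on a permissive one, it simply succeeds.

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    /* Request one page that is simultaneously writable and executable. */
    void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        /* W^X enforced: the kernel refuses to hand out a W+X mapping. */
        perror("mmap(RWX)");
        return 1;
    }
    printf("got a writable+executable page at %p (no W^X enforcement here)\n",
           page);
    munmap(page, 4096);
    return 0;
}
```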
There are solutions to the pointer-arithmetic problem, but they require dedicating more resources to building a safer computer architecture. I am referring here to capability machines, such as the one built at the University of Cambridge in the 1970s. In a capability machine, programs hold capabilities rather than bare addresses, and those capabilities are protected by the hardware so that they cannot be forged. You cannot simply add an integer to a capability to conjure up a different one, because the hardware will not treat the result as a valid capability. But capabilities require more bits per pointer, doubling the size of a native pointer in some cases, which has an effect on memory use, the TLB (translation lookaside buffer), and other parts of the computer. In the 1970s, this cost was prohibitive, but now it should not be, certainly not if the benefit is a more reliable and secure system.
Prototypes of such systems have been developed and are part of active research. It is time now to get these into production, especially around critical infrastructure. But then, what isn't critical infrastructure these days? Who knew that you could destroy the world's check-in kiosks with one bad push? When capabilities were first proposed, the world was not run on computers. Today it is a different story, and it is time to change our calculus. There surely are also other ways, even beyond capability machines, to use the embarrassment of riches that Moore's Law has bestowed upon us to upset the balance in favor of security. It is high time we discovered what those ways are. One option with an emphasis on security is CHERI (Capability Hardware Enhanced RISC Instructions). For an overview, see "An Introduction to CHERI," by Robert N.M. Watson et al. (https://bit.ly/41iwls0).
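To make the contrast with the earlier pointer-arithmetic sketch concrete, here is a hedged sketch of the same out-of-bounds read as it would look on a CHERI system. It assumes a CHERI-aware compiler in pure-capability mode and the cheriintrin.h intrinsics described in the CHERI documentation; it is illustrative and will not run on conventional hardware.

```c
#include <stdio.h>
#include <cheriintrin.h>

int main(void) {
    char buffer[8] = "public";

    /* In pure-capability code, 'p' is not a bare 64-bit address: it is a
     * capability carrying base, length, and permissions, plus a validity
     * tag maintained by the hardware. */
    char *p = buffer;
    printf("capability length: %zu, tag valid: %d\n",
           (size_t)cheri_length_get(p), (int)cheri_tag_get(p));

    /* Arithmetic still yields a capability, but not a license to roam:
     * dereferencing past the object's bounds traps in hardware (SIGPROT
     * on CheriBSD) instead of quietly reading a neighboring object. */
    char *q = p + 8;
    printf("%c\n", *q);
    return 0;
}
```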
Moving up to software, we confront two main problems. The first is that there is now a ton of low-level software, written in C, that is based on a very old understanding of the hardware—a model of computing that was appropriate in the 1970s. C is an unsafe language, basically assembler with for loops. It has the advantage of producing efficient code for modern processors, and that fact, along with its long history of use first in Unix and then in other systems software, has meant that the lowest levels of nearly every computing system—Windows, Linux and Android, macOS, the BSDs, and nearly every real-time or embedded operating system—are written in C. When a packet arrives at a computer from the Internet, it is almost always processed first by code written in C.
And this, coupled with the aforementioned hardware problems, poses a huge security challenge. When C was created, there was no Internet, and all the nodes of the ARPANET (for those who even remember that name) could be written down on a dinner napkin. It simply is not appropriate to write code that will be connected to the Internet in an unsafe language such as C. We have tried to make this work, and we can see the results.
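To see why the combination is so dangerous, consider a hedged sketch of the kind of code that greets a packet at the bottom of a network stack. The message format and names here are invented for illustration, but the bug pattern is the classic one: the code trusts a length field that arrived off the wire.

```c
#include <stdint.h>
#include <string.h>

struct msg_header {
    uint16_t payload_len;   /* attacker-controlled length field */
    uint8_t  payload[];
};

void handle_packet(const uint8_t *pkt, size_t pkt_len) {
    uint8_t local[256];
    const struct msg_header *hdr = (const struct msg_header *)pkt;

    if (pkt_len < sizeof(*hdr))
        return;

    /* Nothing in C checks that payload_len fits in 'local', or even that
     * it lies within the received packet; memcpy will happily run off the
     * end of both, overwriting whatever sits beyond them. */
    memcpy(local, hdr->payload, hdr->payload_len);

    /* ... act on the copied payload ... */
    (void)local;
}
```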
The second problem is the systems software itself. Unix was considered a great win over Multics because Unix was simpler, introducing only two different domains: the kernel and user space. User-space programs are protected from each other by virtual memory. But the kernel—any kernel—is a huge blob of shared state with millions of lines of code in it. This is true for any operating system now in day-to-day use—Windows, Unix, or whatever tiny embedded OS is running on your Wi-Fi-connected light switches.
What makes up this huge blob? Device drivers! Device drivers of varying—and some might say questionable—quality, any one of which can poke and prod any part of the system for which it can manufacture a valid memory address. Once something breaches the operating system’s kernel boundary, the game is over, because the operating system is “shared everything.”
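As a hedged illustration of what that means, here is a schematic (the structures and names are invented, not any vendor's actual code) of a driver walking a table it built from a content update it was handed. The dereference happens in kernel mode, where there is no second protection domain to absorb the fault.

```c
#include <stddef.h>
#include <stdint.h>

struct rule_entry {
    uint32_t id;
    int (*match)(const void *event);   /* filled in while parsing the update */
};

struct rule_table {
    size_t             count;
    struct rule_entry **entries;       /* built from the pushed content file */
};

int apply_rules(const struct rule_table *tbl, const void *event) {
    for (size_t i = 0; i < tbl->count; i++) {
        /* If the update was malformed and entries[i] is NULL or garbage,
         * this dereference happens inside the kernel's single shared
         * address space. The whole machine stops -- and if the driver is
         * marked boot-critical, it stops again on every subsequent boot. */
        if (tbl->entries[i]->match(event))
            return 1;
    }
    return 0;
}
```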
A modern approach to systems software suggests that we not only write all new systems in type-safe languages, such as Rust, but also rewrite what we already have in the same way. But this is not economically practical. Imagine how much it would cost to rewrite any major OS in a new language, test it, deploy it, and so forth.
A multipronged approach is the only way out of the current morass, one in which we leverage type-safe languages such as Rust when possible and decide which hardware is actually critical and must be replaced.
The whole CrowdStrike catastrophe exists because of architectural issues in hardware and in systems software.
We should be building systems that make writing a virus difficult, not child’s play. But that is an expensive proposition now.
We conned humanity into using computers for everything. Now we owe it to the world to make those systems work safely and reliably.
KV
Related articles
The Time I Stole $10,000 from Bell Labs
Thomas Limoncelli
https://queue.acm.org/detail.cfm?id=3434773
The Reliability of Enterprise Applications
Sanjay Sha
https://queue.acm.org/detail.cfm?id=3374665
The Calculus of Service Availability
Ben Treynor, Mike Dahlin, Vivek Rau, and Betsy Beyer
https://queue.acm.org/detail.cfm?id=3096459