Dear KV,
I recently developed an unhealthy interest in learning how operating systems and systems software work because I had reached the end of an application debugging session that seemed to point to a bug not in the application but in the code it was calling, which resided in the operating system. Luckily, the OS I am working with is open source, so I hoped to be able to continue debugging my problem, as I was told many years ago as an undergraduate that an operating system is just another program, albeit one with special powers. When I attempted to debug the problem, I found that, unlike the tools I am used to in application development, the ones used to debug an OS are primitive at best. In comparison to my IDE and its tooling, the tools I had on hand to continue debugging had more in common with stone knives and bear skins than with modern software. Since I know from your bio that you work on operating systems, I thought I would write and ask: “Is that all there is?” Or perhaps the people who write operating systems are simply so much better at software they do not feel they lack good tools for their work. I feel like the cobbler’s child who has no shoes.
Cobb
Dear Cobb,
A venture capitalist once told me, “There is no money in tools.” Since this person was pretty smart at investing in companies, I was willing to take their word for it. If you look at the software tooling landscape, you see the majority of developers work with either open source tools (LLVM and gcc for compilers, gdb for debugger, vi/vim or Emacs for editor); or tools from the recently reformed home of proprietary software, Microsoft, which has figured out its Visual Studio Code system is a good way to get people to work with its platforms; or finally Apple, whose tools are meant only for its platform. In specialized markets, such as deeply embedded, military, and aerospace, there are proprietary tools that are often far worse than their open source cousins, because the market for such tools is small but lucrative.
Let me first dispense with a myth you bring up in your letter: that those who write operating systems are somehow better developers than those who write applications or any other type of software. Writing code in a difficult environment, such as directly on top of hardware, can definitely improve your coding skills. It will certainly make you more careful, because a failure in your code can have dire side effects, such as crashing the whole computer. But learning to be careful in this way makes you no more of a software genius than any other attempt to understand and extend a large corpus of software.
The difficulties of programming an operating system come from two major places: Hardware does not allow for certain types of illusions, and there is a lack of good tooling, as you point out.
Many of the conveniences available for application programming exist because of software illusions created by the operating system on top of the hardware. Consider what happens when your application program hits a fault: It crashes, but it does not crash anything else on the system, and it often leaves a record of what went wrong in the form of a core dump. The fact that the program cannot crash others on the system is due to the illusion, provided by the virtual memory system, that each program has all the memory to itself and cannot affect memory owned by other programs. An OS could be built to behave this way, and, indeed, microkernel OS designs, which are common enough in research, can exploit this feature to make more of the code in the OS restartable. But this feature comes at a cost in terms of overhead that OS designers have been loath to pay, and so each operating system remains “one large program” that, when there is an error, dies.
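The isolation described above can be observed from user space. What follows is a minimal sketch, assuming a POSIX system with fork() and waitpid(); the function name run_faulting_child is made up for this illustration. The child writes through a null pointer on purpose, and the parent merely reads the cause of death.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that deliberately faults. Returns the number of the signal
 * that killed the child, 0 if the child exited normally, or -1 on error.
 * (Hypothetical helper, named for this illustration only.)
 */
int run_faulting_child(void)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: an invalid write, confined to the child's address space. */
        volatile int *bad = NULL;
        *bad = 42;
        _exit(0);               /* not expected to be reached */
    }

    /* Parent: still running, and able to ask how the child ended. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    /* Without WUNTRACED, the child has either exited or been signaled. */
    assert(WIFEXITED(status) || WIFSIGNALED(status));
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On most systems the return value is SIGSEGV. The same stray write inside a monolithic kernel has no parent to report to, which is the whole problem.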
Hardware limitations are not the major roadblock to better tooling for operating systems, since application writers are provided with plenty of conveniences using software alone. In fact, an operating system’s major purpose is to be a software library to aid writing applications, since no one actually cares about the OS except those who work on it.
Systems could be built so they were more amenable to good tooling, and better tooling could be built, if we wanted to pause long enough to think about what that might mean. In systems software, the pressure is always on to “just make the machine work!” This means hacking up hardware drivers and other bits to make the box work—not even work well—just work at all. People are so pleased the OS works and the applications do not crash, they never go back to consider whether the design of the system they are using is amenable to the application or the hardware. Making a system work does not mean the design was the right design, just that you actually made the machine go without the magic smoke escaping.
In my more philosophical moments, I think of OS software as being like the child in the Ursula Le Guin short story, “The Ones Who Walk Away from Omelas.” A child in the story is locked in a basement, barely fed, and suffering greatly, but the child’s existence ensures a happy life for the rest of the town. The reader is informed that if the child were ever let out and treated properly, everyone in the town would suffer. The child exists so that others can have happy lives, much as an operating system exists so that applications can have happy lives/runtimes.
Were we to step back and think about how to make systems software better, we might have principles to bring to the table when designing such systems. They might be something like this: All large pieces of software in an operating system should be designed to be extended, measured, and debugged. These principles would also govern how tools interact with the system overall.
Extending a system is easiest to do when it is built around a set of well-known and well-documented patterns; I need a thing that looks like X, so I will make a Y that looks mostly like X but with changes. The only place any such patterns exist is often in the device drivers for an operating system, and even there, the hardware usually dictates the form, and the driver has to twist itself into knots to provide data in the right form and format for the rest of the system.
The computing industry has spent untold amounts of money trying to solve this problem for applications, from the original introduction of software libraries to decades of work with object-oriented languages and tooling. Very little of this work has made it into a major operating system in the past 50 years. The code used to build operating systems has only the most primitive of data types (lists, hashes, the occasional tree), while the libraries used in applications are a veritable cornucopia of modern data types. The original argument against complex data types in the operating system was size, but this argument holds no water in a world where a gigabyte or more of RAM is the starting point for a watch or a phone, let alone a modern server.
The idea of extending something complex and intrinsic to the system, such as how memory is handled, or the scheduler, is nearly anathema because the interfaces are poorly documented and brittle.
Therefore, the first principle to follow when designing systems software is extensibility. Every subsystem that makes up the operating system must be designed by default to be replaceable, with clearly documented APIs, unit tests, and all the other attributes demanded of application software.
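One way to picture a replaceable subsystem is a documented table of operations plus a contract check that any replacement must pass. The sketch below is only an illustration: struct sched_ops, fifo_sched, prio_sched, and sched_ops_check are all invented names, and no real kernel's scheduler interface looks exactly like this.

```c
#include <assert.h>
#include <stddef.h>

struct task {
    int id;
    int priority;               /* larger value means more urgent */
};

/*
 * Hypothetical contract for a scheduling policy:
 *   pick_next returns the index of the task to run next,
 *   or -1 when there are no tasks (n == 0).
 */
struct sched_ops {
    const char *name;
    int (*pick_next)(const struct task *tasks, size_t n);
};

/* Policy X: always run the task at the head of the queue. */
static int fifo_pick(const struct task *tasks, size_t n)
{
    (void)tasks;
    return n ? 0 : -1;
}

/* Policy Y: "mostly like X, but with changes"; run the most urgent task. */
static int prio_pick(const struct task *tasks, size_t n)
{
    size_t best = 0;

    if (n == 0)
        return -1;
    for (size_t i = 1; i < n; i++)
        if (tasks[i].priority > tasks[best].priority)
            best = i;
    return (int)best;
}

const struct sched_ops fifo_sched = { "fifo", fifo_pick };
const struct sched_ops prio_sched = { "priority", prio_pick };

/* Contract test: any policy, present or future, must pass this. */
void sched_ops_check(const struct sched_ops *ops)
{
    const struct task t[3] = { {1, 5}, {2, 9}, {3, 1} };
    int i;

    assert(ops && ops->name && ops->pick_next);
    assert(ops->pick_next(t, 0) == -1);
    i = ops->pick_next(t, 3);
    assert(i >= 0 && i < 3);
}
```

The point is not the two toy policies but the check: a new policy is written against a stated contract and verified against it, rather than against whatever the rest of the system happens to tolerate.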
The second principle, measurement of software, has improved over the past 20 years with systems such as DTrace and its child, bpftrace, now available for both application and systems code, but the systems being measured were not themselves designed for measurement. Current measurement tools were created long after the software they were meant to measure, twisting themselves into knots to unscramble the underlying system and providing a useful, if primitive, method for looking at what the system is doing. A system designed for measurement would already have built-in trace points that call out important transitions in system state, so that the tooling (or worse yet, the humble programmer) does not have to hunt around trying to figure out what is going on in the system. Most software, not just operating systems, is created without an idea of measurement, which is brought in only later when people say, “The system is slow,” a common and infuriating bug report message.
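A built-in trace point need not be elaborate. Here is a minimal sketch of the idea, with every name (trace_point, trace_attach, switch_task, and the event list) invented for illustration; it is not the mechanism DTrace or bpftrace uses, and a real kernel would also need to worry about concurrency and overhead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list of state transitions the system promises to announce. */
enum trace_event { TRACE_TASK_SWITCH, TRACE_PAGE_FAULT, TRACE_EVENT_COUNT };

typedef void (*trace_hook)(enum trace_event ev, unsigned long arg);

static trace_hook trace_hooks[TRACE_EVENT_COUNT];
static unsigned long trace_hits[TRACE_EVENT_COUNT];
static unsigned long trace_last;

/* Attach a hook to an event, or detach by passing NULL. */
void trace_attach(enum trace_event ev, trace_hook hook)
{
    assert((int)ev >= 0 && (int)ev < TRACE_EVENT_COUNT);
    trace_hooks[ev] = hook;
}

/* The trace point: always counted, and reported to a hook if one is attached. */
static inline void trace_point(enum trace_event ev, unsigned long arg)
{
    trace_hits[ev]++;
    if (trace_hooks[ev])
        trace_hooks[ev](ev, arg);
}

unsigned long trace_count(enum trace_event ev)
{
    return trace_hits[ev];
}

/* A sample hook that remembers the most recent argument it saw. */
void trace_record_last(enum trace_event ev, unsigned long arg)
{
    (void)ev;
    trace_last = arg;
}

unsigned long trace_last_arg(void)
{
    return trace_last;
}

/* An instrumented code path: the transition is announced where it happens. */
void switch_task(unsigned long next_task_id)
{
    trace_point(TRACE_TASK_SWITCH, next_task_id);
    /* ... the real context switch would happen here ... */
}
```

Because the code that performs the transition also announces it, a measurement tool attaches to a named event instead of guessing which function, on which line, happens to mean "a task switch occurred."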
Lastly, but not leastly, we come to the tool you probably needed, and the reason you wrote in the first place: the debugger. An awful lot goes into making it possible to debug applications, not the least of which is OS support for debugging via special system calls. OS designers know that not being able to debug an application is a nonstarter, but for some reason, they still often think it is perfectly fine to debug the operating system itself with print statements (printk() or printf(); take your pick).
When bringing up a system on new hardware, maybe “this is all you have,” but a properly designed system would start with debugging hooks, not with something as complex as a variable-argument call into a complex console output system. In fact, all that is required to do something smart with a debugger is a small monitor program that exposes direct memory reads and writes (gdb supports this via gdb stubs). Generally, though, when someone brings up systems software on new hardware, the race is on to make printf() work, because that is familiar, and humans seem to love the familiar, even when it leads to poorer outcomes.
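To give a sense of how small such a monitor can be, here is a sketch of a handler for memory reads and writes. The command shapes are modeled on the `m addr,length` and `M addr,length:XX...` packets of gdb's remote serial protocol, but this is not a working gdb stub: packet framing, checksums, and register access are left out, and the "target memory" is just an array (mon_mem) so the sketch can run anywhere. The names monitor_handle and mon_mem are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MON_MEM_SIZE 256
static unsigned char mon_mem[MON_MEM_SIZE];  /* stand-in for target memory */

static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Handle one command payload and write a NUL-terminated reply to out:
 *   "m<addr>,<len>"             -> the bytes as hex digits
 *   "M<addr>,<len>:<hex bytes>" -> "OK"
 * Malformed or out-of-range requests get "E01"; unknown commands get "".
 */
void monitor_handle(const char *cmd, char *out, size_t outsz)
{
    static const char digits[] = "0123456789abcdef";
    unsigned long addr, len, i;
    char *p;

    assert(cmd && out && outsz >= 4);

    if (cmd[0] != 'm' && cmd[0] != 'M') {
        out[0] = '\0';                       /* unsupported command */
        return;
    }

    addr = strtoul(cmd + 1, &p, 16);
    if (*p != ',')
        goto err;
    len = strtoul(p + 1, &p, 16);
    if (addr > MON_MEM_SIZE || len > MON_MEM_SIZE - addr)
        goto err;

    if (cmd[0] == 'm') {                     /* read memory */
        if (*p != '\0' || 2 * len + 1 > outsz)
            goto err;
        for (i = 0; i < len; i++) {
            out[2 * i]     = digits[mon_mem[addr + i] >> 4];
            out[2 * i + 1] = digits[mon_mem[addr + i] & 0xf];
        }
        out[2 * len] = '\0';
        return;
    }

    /* write memory: validate every digit before touching anything */
    if (*p != ':' || strlen(p + 1) != 2 * len)
        goto err;
    p++;
    for (i = 0; i < 2 * len; i++)
        if (hex_val(p[i]) < 0)
            goto err;
    for (i = 0; i < len; i++)
        mon_mem[addr + i] =
            (unsigned char)(hex_val(p[2 * i]) << 4 | hex_val(p[2 * i + 1]));
    strcpy(out, "OK");
    return;

err:
    strcpy(out, "E01");
}
```

Nothing here needs a console, a format-string parser, or variable arguments. On real hardware, the array would be replaced by direct loads and stores, and the payloads would arrive over whatever byte channel the board offers.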
If systems were designed with these questions in mind (How do I extend this? How do I measure this? How do I debug this?), it would also be easier to build better tools. The tooling would have something to hang its hat on, rather than guessing what might be the meaning of some random bytes in memory. Is that a buffer? Is it an important buffer? Who knows, it’s all memory!
Someday I hope to have tooling that is as good for systems software as what exists for applications, but first we will all have to walk away from Omelas.
KV
Related articles on queue.acm.org
The Flame Graph
Brendan Gregg
https://queue.acm.org/detail.cfm?id=2927301
Postmortem Debugging in Dynamic Environments
David Pacheco
https://queue.acm.org/detail.cfm?id=2039361
Black Box Debugging
James A. Whittaker, Herbert H. Thompson
https://queue.acm.org/detail.cfm?id=966807