Sign In

Communications of the ACM

Practice

The Most Expensive One-Byte Mistake


The Most Expensive One-Byte Mistake, illustration

back to top 

Information technology (IT) both drives and implements the modern Western-style economy. Thus, we regularly see headlines about staggeringly large amounts of money connected with IT mistakes. Which IT or CS decision has resulted in the most expensive mistake?

Not long ago, a fair number of pundits were doing a lot of hand waving about the financial implications of Sony's troubles with its PlayStation Network, but an event like that does not count here. In my school days, I talked with an inspector from The Guinness Book of World Records who explained that for something to be "a true record," it could not be a mere accident; there had to be direct causation starting with human intent (such as, we stuffed 26 high-school students into our music teacher's Volkswagon Beetle and closed the doors).

Sony (probably) did not intend to see how big a mess it could make with the least attention to security, so this and other such examples of false economy will not qualify. Another candidate could be IBM's choice of Bill Gates over Gary Kildall to supply the operating system for its personal computer. The damage from this decision is still accumulating at breakneck speed, with StuxNet and the OOXML perversion of the ISO standardization process being exemplary bookends for how far and wide the damage spreads. But that was not really an IT or CS decision. It was a business decision that, as far as history has been able to uncover, centered on Kildall's decision not to accept IBM's nondisclosure demands.

A better example would be the decision for MS-DOS to invent its own directory/filename separator, using the backslash (\) rather than the forward slash (/) that Unix used or the period that DEC used in its operating systems. Apart from the actual damage being relatively modest, however, this does not qualify as a good example either because it was not a real decision selecting a true preference. IBM had decided to use the slash for command flags, eliminating Unix as a precedent, and the period was used between filename and filename extension, making it impossible to follow DEC's example.

Space exploration history offers a pool of well-publicized and expensive mistakes, but interestingly, I did not find any valid candidates there. Fortran syntax errors and space shuttle computer synchronization mistakes do not qualify for lack of intent. Running one part of a project in imperial units and the other in metric is a "random act of management" that has nothing to do with CS or IT.

The best candidate I have been able to come up with is the C/Unix/Posix use of NUL-terminated text strings. The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end? This is a decision that the dynamic trio of Ken Thompson, Dennis Ritchie, and Brian Kernighan must have made one day in the early 1970s, and they had full freedom to choose either way. I have not found any record of the decision, which I admit is a weak point in its candidacy: I do not have proof that it was a conscious decision.

As far as I can determine from my research, however, the address + length format was preferred by the majority of programming languages at the time, whereas the address + magic _ marker format was used mostly in assembly programs. As the C language was a development from assembly to a portable high-level language, I have a difficult time believing Ken, Dennis, and Brian gave it no thought.

Using an address + length format would cost one more byte of overhead than an address + magic _ marker format, and their PDP computer had limited core memory. In other words, this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day; but this one had quite atypical economic consequences.

Hardware development costs. Initially, Unix had little impact on hardware and instruction set design. The CPUs that offered string manipulation instructions—for example, Z-80 and DEC VAX—did so in terms of the far more widespread adr+len model. Once Unix and C gained traction, however, the terminated string appeared on the radar as an optimization target, and CPU designers started to add instructions to deal with them. One example is the Logical String Assist instructions IBM added to the ES/9000 520-based processors in 1992.1

Adding instructions to a CPU is not cheap, and it happens only when there are tangible and quantifiable monetary reasons to do so.

Performance costs. IBM added instructions to operate on NUL-terminated strings because its customers spent expensive CPU cycles handling such strings. That bit of information, however, does not tell us if fewer CPU cycles would have been required if a ptr+len format had been used.

Thinking a bit about virtual memory (VM) systems settles that question for us. Optimizing the movement of a known-length string of bytes can take advantage of the full width of memory buses and cache lines, without ever touching a memory location that is not part of the source or destination string.

One example is FreeBSD's libc, where the bcopy(3)/memcpy(3) implementation will move as much data as possible in chunks of "unsigned long," typically 32 bits or 64 bits, and then "mop up any trailing bytes" as the comment describes it, with byte-wide operations.2

If the source string is NUL terminated, however, attempting to access it in units larger than bytes risks attempting to read characters after the NUL. If the NUL character is the last byte of a VM page and the next VM page is not defined, this would cause the process to die from an unwarranted "page not present" fault.

Of course, it is possible to write code to detect that corner case before engaging the optimized code path, but this adds a relatively high fixed cost to all string moves just to catch this unlikely corner case—not a profitable trade-off by any means.

If we have out-of-band knowledge of the strings, things are different.

Compiler development cost. One thing a compiler often knows about a string is its length, particularly if it is a constant string. This allows the compiler to emit a call to the faster memcpy(3) even though the programmer used strcpy(3) in the source code.

Deeper code inspection by the compiler allows more advanced optimizations, some of them very clever, but only if somebody has written the code for the compiler to do it. The development of compiler optimizations has historically been neither easy nor cheap, but obviously Apple is hoping this will change with Low-level Virtual Machine (LLVM), where optimizers seem to come en gros.

The downside of heavy-duty compiler optimization—in particular, optimizations that take holistic views of the source code and rearrange it in large-scale operations—is that the programmer must be really careful that the source code specifies his or her complete intention precisely. A programmer who worked with the compilers on the Convex C3800 series supercomputers related his experience as "having to program as if the compiler was my ex-wife's lawyer."

Security costs. Even if your compiler does not have hostile intent, source code should be written to hold up to attack, and the NUL-terminated string has a dismal record in this respect. Utter security disasters such as gets(3), which "assume the buffer will be large enough," are a problem "we have relatively under control."3

Getting it under control, however, takes additions to compilers that would complain if the gets (3) function were called. Despite 15 years of attention, over- and underrunning string buffers is still a preferred attack vector for criminals, and far too often it pays off.

Mitigation of these risks has been added at all levels. Long-missed no-execute bits have been added to CPUs' memory management hardware; operating systems and compilers have added address-space randomization, often at high costs of performance; and static and dynamic analyses of programs have soaked up countless hours, trying to find out if the byzantine diagnostics were real bugs or clever programming.

Yet, absolutely nobody would be surprised if Sony's troubles were revealed to start with a buffer overflow or false NUL-termination assumption.

Back to Top

Slashdot Sensation
Prevention Section

We learn from our mistakes, so let me say for the record, before somebody comes up with a catchy but totally misleading Internet headline for this article, that there is absolutely no way Ken, Dennis, and Brian could have foreseen the full consequences of their choice some 30 years ago, and they disclaimed all warranties back then. For all I know, it took at least 15 years before anybody realized why this subtle decision was a bad idea, and few, if any, of my own IT decisions have stood up that long.

In other words, Ken, Dennis, and Brian did the right thing.

Back to Top

But That Doesn't Solve the Problem

To a lot of people, C is a dead language, and ${lang} is the language of the future, for ever-changing transient values of ${lang}. The reality of the situation is that all other languages today directly or indirectly sit on top of Posix API and the NUL-terminated string of C.

When your Java, Python, Ruby, or Haskell program opens a file, its runtime environment passes the file-name as a NUL-terminated string to open(3), and when it resolves cacm.acm.org to an IP number, it passes the host name as a NUL-terminated string to getaddrinfo(3). As long as you keep doing that, you retain all the advantages when running your programs on a PDP/11, and all of the disadvantages if you run them on anything else.

I could write a straw-man API proposal here, suggest representations, operations, and error-handling strategies, and I am quite certain it would be a perfectly good waste of a nice afternoon. Experience shows that such proposals go nowhere because the backward compatibility with the PDP/11 and the finite number of programs written are much more important than the ability to write the potentially infinite number of programs in the future in an efficient and secure way.

Thus, the costs of the Ken, Dennis, and Brian decision will keep accumulating, like the dust that over the centuries has almost buried the monuments of ancient Rome.

q stamp of ACM QueueRelated articles
on queue.acm.org

Massively Multiplayer Middleware
Michi Henning
http://queue.acm.org/detail.cfm?id=971591

The Seven Deadly Sins of Linux Security
Bob Toxen
http://queue.acm.org/detail.cfm?id=1255423

B.Y.O.C. (1,342 Times and Counting)
Poul-Henning Kamp
http://queue.acm.org/detail.cfm?id=1944489

Back to Top

References

1. Computer Business Review. Partitioning and Escon enhancements for top-end ES/9000s (1992); http://www.cbronline.com/news/ibm_announcements_71.

2. ViewVC. Contents of /head/lib/libc/string/bcopy.c (2007); http://svnweb.freebsd.org/base/head/lib/libc/string/bcopy.c?view=markup.

3. Wikipedia. Lifeboat sketch (2011); http://en.wikipedia.org/wiki/Lifeboat_sketch.

Back to Top

Author

Poul-Henning Kamp (phk@FreeBSD.org) has programmed computers for 26 years and is the inspiration behind bikeshed.org. His software has been widely adopted as "under the hood" building blocks in both open source and commercial products. His most recent project is the Varnish HTTP accelerator, which is used to speed up large Web sites such as Facebook.


©2011 ACM  0001-0782/11/0900  $10.00

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2011 ACM, Inc.


Comments


Seth Powsner

Ahh... Brings back fond memories: hot soldering irons, cold heat sink grease, and fiber glass splinters from circuit boards..

One vague recollection is that the PDP-11 MOV instruction set the conditions codes. Zero / null terminated loops were just two instructions: MOV with auto increment addressing and BNE (branch not equal). Don't recall any two instruction loops that worked with a length / count.

Another vague recollection was that Jensen & Wirth commented on the problem of end of line / string markers in their book from that era. This was probably during their description of text file io.

I would be inclined to believe Thompson, Ritchie, & Kernighan if they said they gave it little thought. Programming in PDP-11 assembler made the choice clear to me at the time, especially with strings or search lists greater than 255.

Still your point is very, very good one.


Henry Ledgard

This is a neat article. The author has a wonderful conversational style and a thoughtful set of issues to bring to our attention.

I have always thought strings in Java were odd, and wondered why they were not treated as a primitive type. Also, the so-called null character itself is odd.

But back to the author's major points. I guess I always took such things for granted, never really questioning them.

In a similar vein, I view Java and C++ as old, and wonder why we cannot find a more contemporary, more readable, and simple common programming language.


CACM Administrator

The following letter was published in the Letters to the Editor of the April 2012 CACM (http://cacm.acm.org/magazines/2012/4/147353).
--CACM Administrator

In "The Most Expensive One-Byte Mistake" (Sept. 2011), Poul-Henning Kamp addressed the decision taken by "the dynamic trio" Ken Thompson, Dennis Ritchie, and Brian Kernighan when they designed C to represent strings as null-terminated rather than measure them through a preceding length count. Kamp said he found no record of such a decision and no proof it was even a conscious decision in the first place. However, he did speculate it might have been motivated by a desire to conserve memory at a time when memory was an expensive resource.

I fully agree with Kamp that the decision, conscience or not, was a mistake, and ultimately a very costly one. In my own C programming in the 1970s I found it a frequent source of frustration but believe I understand its motivation: C was designed with the PDP-11 instruction set very much in mind. The celebrated C one-liner

while (*s++ = *t++) ;

copies the string at s to the string at t. Elegant indeed! But what may have been lost in the fog of time is the fact that it compiles into a loop of just two PDP-11 instructions, where register R0 contains the location of s and register R1 contains the location of t:

A mov (@R0)++,(@R1)++
bne A test result for nonzero and
branch

Such concise code was seductive and, as I recall, mentioned often in discussions of C. A preceding length count would have required at least three instructions.

But even at this level of coding, the economy of the code for moving a string was purchased for a high price: having to search for the terminating null in order to determine the length of a string; that price was also paid when concatenating strings. The security issues resulting from potential buffer overruns could hardly have been anticipated at the time, but, even then, such computational costs were apparent.

"X-Rays will prove to be a hoax": Lord Kelvin, 1883. Even the most brilliant scientists sometimes get it wrong.

Paul W. Abrahams
Deerfield, MA


CACM Administrator

The following letter was published in the Letters to the Editor in the December 2011 CACM (http://cacm.acm.org/magazines/2011/12/142534).
--CACM Administrator

In Poul-Henning Kamp's article "The Most Expensive One-Byte Mistake" (Sept. 2011), did Ken, Dennis, and Brian indeed choose wrong with NUL-terminated text strings? I say they chose correctly, then and now. The reason C is dying and nobody has used PL/I, Algol, or Pascal for real work for the past 30 years is that C makes it possible to accomplish a lot in a few lines of intuitive code despite requiring little memory or CPU power. Searching and comparing NUL-terminated strings can be accomplished with such short code segments; programmers hardly need a standard library, and code compiles into a few PDP-11 machine instructions. Failing to check untrusted data is fatal in any language.

C allows fast simple code written by competent programmers, and simple code tends to be less buggy and more readable than complex code. For programmers who still want to use address + length strings, such use can be accomplished in just a few lines. There is, of course, the strlen() function to measure the string's length and the fgets() function to limit how many characters to read into a string from a file.

Sure, copying large strings can run faster with newer hardware if the string lengths are known. This is a trade-off, and programmers can, if desired, use address + length strings in C and even word-align them. For others, there is always "C with Training Wheels," a.k.a. Pascal or Java, if one is in no special hurry for results.

Good programmers write secure code; bad programmers write insecure, buggy code. Good practices are more valuable than "magic" language features. The largest Java application I know is also the buggiest application I know.

Bob Toxen
Atlanta, GA


Displaying all 4 comments

Comment on this article

Signed comments submitted to this site are moderated and will appear if they are relevant to the topic and not abusive. Your comment will appear with your username if published. View our policy on comments

(Please sign in or create an ACM Web Account to access this feature.)

Create an Account

Log in to Submit a Signed Comment

Sign In »

Sign In

Signed comments submitted to this site are moderated and will appear if they are relevant to the topic and not abusive. Your comment will appear with your username if published. View our policy on comments
Forgot Password?

Create a Web Account

An email verification has been sent to youremail@email.com
ACM veriŽes that you are the owner of the email address you've provided by sending you a veriŽcation message. The email message will contain a link that you must click to validate this account.
NEXT STEP: CHECK YOUR EMAIL
You must click the link within the message in order to complete the process of creating your account. You may click on the link embedded in the message, or copy the link and paste it into your browser.