The Data-Structure Canon

Dear KV,

In most areas of science there are a few basic underlying laws that inform the rest of the study of a given subject. Physics, chemistry, and electrical engineering all have these basic equations. What are the basic equations in computer science? Or is computer science baseless?

Looking for a Base

Dear Baseless,

While I’m sure that several mathematically minded readers will take me to task for this, what pops into my mind when I read your question are the basic data structures used in programming, rather than a set of equations. It’s likely this reflects the way I look at software, which is as a set of structures, or perhaps it just reflects the neural damage I’ve suffered through reading code written by hundreds of monkeys. I’m not really in a position to say at the moment.

There are certainly plenty of joke equations, mostly having to do with the ratio between quality of code, and cost and speed of its production, but those have never been adequately specified, at least not so that they could be considered physical laws such as Ohm’s.

The idea of a software canon—a set of ideas, algorithms, or data structures that are worthy of repeated study—seems at first to be somewhat ludicrous. After all, this is not a church: churches have canons; rational people, such as software engineers, do not. So, let’s not think of these as sacred concepts, merely the basic ones to which we should return on a regular basis so as to regain our footing in the heady world of objects, methods, and algorithms.

If there were any set of basics that I wanted to hammer into software engineers and programmers, it would be the basic data structures, their features and pitfalls. Here, then, is a brief overview of the data-structure canon, which, unfortunately, is metaphorical rather than actual. I mean, no one is going to let you have a real canon (or cannon) in the office—trust me, I’ve asked.

The basic data structures are the array, list, tree, and hash table. That seems like a short list, and I’m sure some of the more experienced readers of this column are warming up their mail applications to write me nasty notes about how there are more basic data structures than just these four. To these people I say, “Go for it!”

What I find most interesting about this list is how many programmers take it for granted and how few of them understand the implications of using one data structure versus another. Try as I might, I cannot lay all the blame at the feet of the programmers using these data structures; a good deal of the blame has to go to their teachers and the legions of poorly written “how to program” books that emphasize how to use a data structure rather than the underlying implications of what happens when they are used. Abstraction is a fine thing-it makes understanding large systems easier—but it also can obscure important facts: in particular, the performance implications of certain choices.

The array is not only the most basic data structure used in computer programs, but it is also probably the most abused and misunderstood as well. There are very few languages, other than assembler, that do not have some easy-to-use form of array. The array’s salient features are ease of element insertion, retrieval, and deletion, as well as its ability to contain only elements of the same data type. Arrays are most often preal-located, which is to say that once you declare an array of a certain size, that memory is allocated to the array and the array can be no larger or smaller than its initial allocation. The static nature of arrays leads to all kinds of problems, some of which are hidden by tricks such as creating vectors, which are really glorified arrays that are reallocated behind your back when they grow.

One of the array’s hidden weaknesses is that, when used as a stack variable, it can take a toll on the performance of an application, particularly if it is used as a dumping ground for the program’s data. One programmer I worked with would allocate multiple-megabyte arrays on the stack, never thinking of the extra overhead this incurred, nor that his code would stop functioning on a system that was not a huge honking server.

Of course, you might argue that everyone now has a huge honking server and that stack or system memory just doesn’t matter; but amusingly enough for younot for those of us who worked on this codeit was eventually ported to an embedded device. The funny thing was that it crashed, pretty often, and debugging embedded devices is not very fun when they’re located in a remote environment where they’re supposed to run unattended for years, rather than crashing unattended after only a few days. Do not shove everything into an array just because you can, and don’t put big arrays on the stack just because you can. Studying how arrays are implemented on your system or in the language you use can definitely pay off in the long run, and relearning this knowledge from time to time seems to be required for just about everyone.

Lists are what people use when they think they’re too clever to use arrays. Many modern languages have built-in support for lists, and one ancient language was built seemingly just to process them, as if once you had a screwdriver (list), you could make every problem look like a screw (list to process). Lists are more expensive than arrays when you’re inserting, finding, or removing elements, but they do have the advantage that they need be no longer than the sum of their elements.

Do not shove everything into an array just because you can, and don’t put big arrays on the stack just because you can.

The most (actually least) fun thing about lists is that everyone and his brother thinks they should create their own macros, class, or library to handle them. A common assignment for burgeoning computer programmers is to have them create their own list code as a way of teaching them about pointers, when they are still taught about such things. Unfortunately, the experience of having created their own list implementation, and having it actually work, leads them to think that theirs is the best implementation of a list that ever was. They show it to all their friends, post it on the Internet, or perhaps create an open source project around it. The fact is the list classes already in existence are likely good enough for most people. If you think they’re not, then test yours with a reasonable benchmark, the creation of which will tell you more about what you know about how lists work than writing a new data structure will.

Trees are for people who have mastered, or think they have mastered, lists. A well-tended tree generally has a lower insertion, deletion, and retrieval time than a similar-size list because of how searches are performed on trees versus lists. The problem with a tree is knowing how to tend one. Some programmers seemingly have black thumbs, which means their trees are simply lists where the head is named “root.” They don’t consider how the data they’re putting in the tree will be organized once their list insertion routine returns. I would be much happier if people who decide to use or implement tree structures would consider how the data in them is organized.

Last, we come to the hash table. The purpose of a hash table is to give you a flexible data structure that is faster to access than a list or a tree. Here, the biggest challenge is selecting or designing the hashing function. Choose the wrong hash function and you’ll wind up with a list (see the discussion of trees in the previous paragraph). I would tell you that you should just use a well-known hash-table data structure and leave it at that, but, again, the quality of the hash depends on the data. Some hashing functions are good for some types of data and some are good for others, and so you have to pick your hash carefully.

Given the different complexities I’ve just mentioned, I think you’ll see that these four data structures, which form the basis of most computer programs, are worth studying time and again. A good place to start is Knuth’s The Art of Computer Programming, Volume 1, Chapter 2, which covers all of these subjects more seriously and in depth. When you’re done reading all that he has to say, please write me, though I should be retired by then.