Dear KV,
I’ve been working with some code that generates massive data sets, and while I’m perfectly happy looking at the raw textual output to find patterns, I’m finding that more and more often I have to explain my data to people who are either unwilling to or incapable of understanding the data in a raw format. I’m now being required to generate summaries, reports, and graphs for these people, and, as you can imagine, they are the ones with control over the money in the company, so I also have to be nice to them. I know this isn’t exactly a coding question, but what do you do when you have to take the bits you understand quite well and summarize them for people like this?
If It Ain’t Text…
Dear Text,
Since I often come across as some sort of hard-core, low-level, bits-and-bytes kind of guy, I gather that you’re assuming my answer will be to tell management—and from your description these people must be management—to take their fancy graphs and, well, do something that would give them paper cuts in hard-to-reach places. Much as I like to give just that kind of advice, the fact is, it’s just as important to be able to transform large data sets from columns and lines of numbers into something that is a bit more compact and still as descriptive. For the polite and well-written version of this type of advice, please see the classic work by Edward Tufte, The Visual Display of Quantitative Information. Now for the Kode Vicious Kwik Kourse on Visualization, please read on.
While I agree it is cool to be able to look at some incredibly confusing output in text and be able to pick out the needle you’re looking for, and while I’m sure this impresses many of your coder friends, this is just not a skill that’s going to take you very far. I also find that programmers who cannot understand the value in a single-page graph of their results are the same kinds of programmers who should not be allowed to code on their own.
One should approach any such problem as a science experiment, and scientists know how to represent their results in many ways, including plotting them on paper. At some point in your career you’re going to have to figure out how to get easy-to-read results that you can look at and compare side by side. A plot of your data can, when done well, give you a lot of information and tell you a lot about what might be happening with your system. Note the all-important phrase, when done well, in that previous sentence. As is the case with many tools, the plotting of data can mislead you as easily as it can lead you somewhere.
There are plenty of tools with which to plot your data, and I usually shy away from advocating particular tools in these responses, but I can say that if you were trying to plot a lot of data, where a lot is more than 32,767 elements, you would be wise to use something like gnuplot. Every time I’ve seen people try to use a certain vendor’s spreadsheet to plot data sets larger than 32,767 elements, things have gone awry; I might even say that they were brought up “short” by that particular program. The advantage of gnuplot is that as long as you have a lot of memory (and memory is inexpensive now), you can plot very large data sets. KV recently outfitted a machine with 24GB of RAM just to plot some important data. I’m a big believer in big memory for data, though not for programs, but let’s just stop that digression here.
Let’s now walk through the important points to remember when plotting data. The first is that if you intend to compare several plots, your measurement axis (the one on which you’re showing the magnitude of a value) absolutely must remain constant or be easily comparable among the total set of graphs that you generate. A plot with a y-axis that goes from 0 to 10 and another with a y-axis from 0 to 25 may look the same, but their meaning is completely different. If the data you’re plotting runs from 0 to 25, then all of your graphs should run from, for example, 0 to 30. Why would you waste those last five ticks? Because when you’re generating data from a large data set, you might have missed something, perhaps a crazy outlier that goes to 60, but only on every 1,000th sample. If you set the limits of your axes too tightly initially, then you might never find those outliers, and you would have done an awful lot of work to convince yourself, and whoever else sees your pretty little plot, that there really isn’t a problem, when in fact it was right under your nose, or more correctly, right above the limit of your graph.
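As a rough illustration of the shared-axis rule, here is a minimal Python sketch. The function names, the five-unit headroom, and the five-unit tick step are all assumptions made for the example, not anything a particular plotting tool prescribes. It computes one y-range to reuse on every plot in a comparison, and lists any values that would land outside that range so an outlier gets noticed rather than silently clipped.

```python
import math

def shared_y_limits(datasets, headroom=5, step=5):
    """Return one (low, high) y-range to reuse on every plot being compared.

    headroom and step are illustrative defaults: leave `headroom` units of
    spare room above the largest value, then round out to a multiple of `step`.
    """
    smallest = min(min(d) for d in datasets)
    largest = max(max(d) for d in datasets)
    low = min(0, math.floor(smallest / step) * step)
    high = math.ceil((largest + headroom) / step) * step
    return low, high

def beyond_limits(data, limits):
    """List (index, value) pairs that would fall off a plot drawn with `limits`."""
    low, high = limits
    return [(i, v) for i, v in enumerate(data) if v < low or v > high]
```

The resulting pair can be handed to whatever tool draws the graph (in gnuplot, for instance, through `set yrange`), and running `beyond_limits` over each new batch of data is a cheap way to learn that the range chosen earlier has stopped being big enough.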
Since you mention you are plotting large data sets, I’ll assume you mean more than 100,000 points. I have routinely plotted data that runs into the millions of individual points. When you plot the data the first time, it’s important not only to get the y-axis limits correct, but also to plot as much data as absolutely possible, given the limits of the system on which you’re plotting the data. Some problems or effects are not easily seen if you reduce the data too much. Reduce the data set by 90% (look at every 10th sample), and you might miss something subtle but important. If your data won’t all fit into main memory in one go, then break it down by chunks along the x-axis. If you have one million samples, graph them 100,000 at a time, print out the graphs, and tape them together. Yes, it’s kind of a quick-and-dirty solution but it works, trust me.
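The chunking idea, and the danger of thinning the data, can both be sketched in a few lines of Python. The helper names and the default chunk size are assumptions chosen to match the numbers in the paragraph above; the sketch only works out which index ranges to plot and leaves the drawing to your tool of choice.

```python
def x_chunks(n_samples, chunk_size=100_000):
    """Yield (start, stop) index ranges that cover 0..n_samples with no gaps."""
    for start in range(0, n_samples, chunk_size):
        yield start, min(start + chunk_size, n_samples)

def every_nth(data, n=10):
    """Keep one sample in n: the 90% reduction the text warns about."""
    return data[::n]

# A flat signal with a single spike that the reduced data never sees.
samples = [1] * 1000
samples[995] = 60
full_peak = max(samples)                # the spike is present in the full data
reduced_peak = max(every_nth(samples))  # index 995 is not a multiple of 10
```

Each `(start, stop)` pair marks one graph to print; because the ranges are contiguous, the printed pages line up edge to edge when taped together.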
Another problem occurs when you want to compare two data sets directly on the same plot. Perhaps you have data from several days and you want to see how Wednesday and Thursday compare, but you don’t have enough memory to plot both days at once, only enough for one day at a time. You could beg your IT department for more memory or, if you have a screwdriver, “borrow” some memory from a coworker, but such measures are unnecessary if you have a window. Print both data sets, making sure both axes line up, and then hold the pages up to the window. Oh, when I said “window,” I meant one that allows light from that bright yellow ball in the sky to enter your office, not one that is generated by your computer.
Thus far I have not mentioned the x-axis, but let’s remedy that now. If you’re plotting data that changes over time, then your x-axis is actually a time axis. The programmers who label this “samples,” and then do all kinds of internal mental transformations, are legion, and completely misguided. While you might know that your samples were taken at 1 kHz and therefore that every 1,000 samples is one second, and 3,600,000 samples is an hour, most of the people who see your plots are not going to know this, even if you cleverly label your x-axis “1 kHz.” If you’re plotting something against time, then your x-axis really should be time.
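The conversion from sample number to time is small enough that there is little excuse for leaving it to the reader. A minimal Python sketch follows; the function names and the H:MM:SS label format are assumptions for illustration, and the 1 kHz rate is simply the example rate used above.

```python
SAMPLE_RATE_HZ = 1_000  # the 1 kHz example rate; substitute your real rate

def sample_to_seconds(index, rate_hz=SAMPLE_RATE_HZ):
    """Convert a sample index into elapsed seconds since the first sample."""
    return index / rate_hz

def elapsed_label(seconds):
    """Format elapsed seconds as H:MM:SS, suitable for an x-axis tick label."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"
```

With these two helpers, sample 90,000 is labeled `0:01:30` instead of leaving the reader to divide by the sample rate in their head.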
This recommendation is even more important when graphing long-running data—for example, a full working day. It turns out that computers are slaves to people and while many people have predicted that the work done by computers would be far more consistent over a 24-hour day than work done by humans, all of those people have been, and continue to be, dead wrong. If you’re plotting data over a day, then it is highly likely that you will see changes when people wake up, when they go to work, take meals, go home, and sleep. It might be vitally important for you to notice that something happens every day at 4 P.M. Perhaps your systems in England are recording when people take tea, rather than an odd slowdown in the system. The system you’re watching might be underutilized because the tea trolley just arrived! If your plot has time, then use time as an axis.
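For day-long data, elapsed time is less useful than the time on the wall clock. The sketch below, in Python, assumes you recorded when sampling started; the function names, the 1 kHz default rate, and the start time used in the usage lines are all hypothetical values picked for the example.

```python
from datetime import datetime, timedelta

def sample_clock_time(index, start, rate_hz=1_000):
    """Return the wall-clock time at which sample `index` was taken.

    `start` is the datetime of sample 0; rate_hz is the sampling rate.
    """
    return start + timedelta(seconds=index / rate_hz)

def clock_label(index, start, rate_hz=1_000):
    """Format a sample's wall-clock time as HH:MM for an x-axis tick label."""
    return sample_clock_time(index, start, rate_hz).strftime("%H:%M")

# Hypothetical capture that began at 9 A.M.; seven hours of 1 kHz samples later
# the label reads 16:00, which is when the tea trolley turns up.
capture_start = datetime(2024, 1, 3, 9, 0, 0)
tea_time_label = clock_label(7 * 3600 * 1_000, capture_start)
```

Labeling ticks this way puts 4 P.M. on the graph as 4 P.M., so a daily pattern is visible to anyone who glances at the axis.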
As I wrap this up, you may have noticed that I did not mention color, fonts, font size, or anything else related to how the graph looks on paper. I didn’t leave these factors out because I’m a total nerd who can’t match any of his own clothes. I can easily match clothes, since black goes with everything. Most people I’ve seen generating graphs spend far too much time picking a color or a font. Take the defaults; just make sure the lines on the graph are consistently representing your data. Choosing a graph color or text font before getting the data correctly plotted is like spending hours twiddling your code highlighting colors in your IDE instead of doing the actual work of coding. It’s a time waster. Now, get back to work.
KV
Related articles
on queue.acm.org
Code Spelunking Redux
George V. Neville-Neil
http://queue.acm.org/detail.cfm?id=1483108
Unifying Biological Image Formats with HDF5
Matthew T. Dougherty, Michael J. Folk, Erez Zadok, Herbert J. Bernstein, Frances C. Bernstein, Kevin W. Eliceiri, Werner Benger, Christoph Best
http://queue.acm.org/detail.cfm?id=1628215
A Conversation with Jeff Heer, Martin Wattenberg, and Fernanda Viégas
http://queue.acm.org/detail.cfm?id=1744741