I was at a conference recently and talked with a science professor at another university. He made the following startling statement.
He has close to 1 petabyte (PB) of data that he uses in his research. In addition, he surveyed other scientific research groups at his university and found 19 other groups, each with more than 100 terabytes (TB) of data. In other words, 20 research groups at his university have data sets between 100 TB and 1 PB in size.
I immediately said, "Why not ask your university’s IT services to stand up a 20-petabyte cluster?"
His reply: "Nobody thinks they are ready to do this. This is research computing, very different from regular IT. The tradeoffs for research computing are quite different from corporate IT."
I then asked, "Why not put your data up on EC2?" [EC2 is Amazon’s Elastic Compute Cloud service.]
His answer: "EC2 storage is too expensive for my research budget; you essentially have to buy your storage every month. Besides, how would I move a PB to Amazon? Sneaker net [disks sent to Amazon via U.S. mail] is not very appealing."
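To put the data-movement half of his objection in perspective, here is a rough back-of-envelope sketch, written as a small Python calculation. The link speeds are illustrative assumptions, and the arithmetic ignores protocol overhead, contention, and failures, so real transfers would only be slower:

```python
# Back-of-envelope: how long does it take to push 1 PB over a network link?
# Assumes idealized, fully sustained throughput -- real transfers are slower.

PETABYTE_BITS = 1e15 * 8  # 1 PB expressed in bits

for label, gbps in [("100 Mbps", 0.1), ("1 Gbps", 1), ("10 Gbps", 10)]:
    seconds = PETABYTE_BITS / (gbps * 1e9)
    print(f"{label}: {seconds / 86400:,.0f} days")

# Approximate output:
#   100 Mbps: 926 days
#   1 Gbps:    93 days
#   10 Gbps:    9 days
```

Even on an idealized, dedicated 10 Gbps link the transfer takes more than a week, which is why shipping physical disks keeps coming up despite its lack of appeal.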
As a result, he is in the process of starting a federation of the 20 research groups that will stand up the required server. In other words, this consortium will run its own massive data server.
I am reminded of a talk given a couple of years ago by James Hamilton, then at Amazon. He claimed there are unbelievable economies of scale in running grid-oriented data centers (i.e., if you run 100,000 nodes, your cost per node is a small fraction of the per-node cost of running a 1,000-node data center). Many of these cost savings come from unexpected places. For example, designing a physical data center (raised flooring, uninterruptible power supply, etc.) is something the small guy does once and the big guy has down to a science. Also, personnel costs rise much more slowly than the number of nodes.
I assume at least 20 universities have the same characteristics as the one noted above. Also, my assumption is that these 20 x 20 = 400 research groups get their funding from a small number of government agencies. At up to 1 PB per group, that works out to roughly 400 PB of data. It would make unbelievably good sense to have a single 400-PB system that all of the researchers share.
In effect, this blog post is a "call to arms." Agencies of the U.S. government are spending boatloads of money on pushing the envelope of massive compute servers. However, they appear to be ignoring the fact that many research groups have serious data-management problems.
Why not invest a small fraction of the "massive computing" budget on "massive data management"? Start by standing up a 400-PB data server run by somebody who understands big data. Several organizations with the required expertise come readily to mind. This would be a much better solution than a whole bunch of smaller systems run by consortiums of individual science groups.
There must be a better way. After all, the problem is only going to get worse.