Q&A: The Path to Clean Data

2014 ACM A.M. Turing Award Recipient Michael Stonebraker

A serial entrepreneur long before the term became commonplace—and a data geek before the age of big data—ACM A.M. Turing Award recipient Michael Stonebraker pioneered techniques that were not just crucial to making relational databases a reality, but that continue to be used in almost all modern systems. Stonebraker spent the first 30 years of his career at the University of California at Berkeley, where he helped develop the still-popular Ingres relational database management system (DBMS) and the object-relational DBMS PostgreSQL (or Postgres). Since 2001, he has served as an adjunct professor at the Massachusetts Institute of Technology (MIT), while forming a total of six start-ups in 14 years around innovations like stream processing and column stores. Here, he looks back at a few career highlights and forward to the future of database systems.

You graduated from Princeton in 1965 and have said that you only went to graduate school to avoid being sent to Vietnam.

Have you heard of the TV show “Route 66,” about two guys who drive around the country and have interesting experiences finding themselves? That’s what I wanted to do when I graduated college, but that was not an option because of the Selective Service System. So I went to the University of Michigan to sit out the war.

From there, you went to Berkeley and began your groundbreaking work on database management systems.

I was hired by Berkeley with the idea that I was going to do urban engineering—using analytic techniques to tell people where to put fire stations and so on. But it became clear that getting clean data was agony. I had a mentor, Eugene Wong, who suggested that we look at database management, so we read Ted Codd’s pioneering paper.

You refer to the 1970 Communications paper that first formulated the relational model for database management.

It was pretty obvious from that paper that building a relational system was a good idea, so Gene and I started doing it. Neither of us had ever written a significant software system before, but we persevered and made it work. Ingres was essentially the only relational database system where you could put your hands on the code, and it was very widely run in the late 1970s.

System R, on the other hand—the other working relational database of the era—was proprietary to IBM. What was your relationship with the System R team?

We didn’t really work together, but they were 50 miles away and we would get together periodically to exchange ideas. I guess you could say both groups were disciples of Ted Codd. There was no prior art, and I think we all believed we were pathfinders on a mission. Between the two systems, most of the techniques used in relational databases were invented by one project or the other.

One of the techniques you introduced with Ingres is query rewriting—supporting views, protection, and integrity constraints with a unified mechanism.

It was clear to us that you had to support the notion of relational views. There’s how the data is stored, and if the user wants to see it differently, then you ought to present him with what’s called a virtual relation—a table that doesn’t really exist, but can be derived from the actual stored data. I wrote a paper in 1976 that showed how to do it in a very simple way. You just take the user’s query and you modify it, and then it runs on stored data, even though you’re running a query on data that’s not stored. It’s high-performance, doesn’t take a lot of code, and works.
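The idea of modifying a user's query so that it runs against stored data can be illustrated with a small sketch. This is not the Ingres algorithm (Ingres operated on QUEL, not SQL); it is a minimal Python example under stated assumptions. The `VIEWS` dictionary, the `rewrite` function, and the `emp` and `toy_emps` names are all hypothetical, and the sketch handles only a view defined as a single stored table plus a qualification.

```python
import sqlite3

# Hypothetical view catalog: each view names the stored table it draws from
# and the qualification (WHERE condition) that defines it.
VIEWS = {
    "toy_emps": {"base": "emp", "qual": "dept = 'toy'"},
}

def rewrite(table, columns, qual=None):
    """Turn a query on a view into an equivalent query on the stored table."""
    view = VIEWS.get(table)
    if view is not None:
        # Substitute the stored table and AND the view's qualification
        # onto whatever qualification the user supplied.
        table = view["base"]
        qual = f"({view['qual']}) AND ({qual})" if qual else view["qual"]
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if qual:
        sql += f" WHERE {qual}"
    return sql

# Only the stored table exists in the database; the view is never materialized.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("ann", "toy", 120), ("bob", "toy", 80), ("cy", "shoe", 150)],
)

# The user asks about the view; the rewritten query runs on stored data.
sql = rewrite("toy_emps", ["name"], "salary > 100")
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

The same substitution step is a natural place to attach other conditions, such as an access restriction, which is presumably why a single mechanism could serve views, protection, and integrity constraints.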

In 1980, you formed the Ingres Corporation to commercialize your work.

It was clear that if we wanted to make a difference, we had to form a commercial company. If you have innovative technology, the best way to have it see the light of day is to form a start-up.

A few years later, you began work on Ingres’s successor, Postgres. Postgres introduced a range of other innovations like the object-relational model, which enables users to extend the database by defining and manipulating their own objects.

In 1983 or 1984, we pushed the original academic Ingres code line off a cliff and started over. That was the genesis of Postgres.

At the time, I was also working with the commercial Ingres guys, who had just implemented the ANSI SQL standard for adding date and time as a data type. I got a call from an unhappy customer. Turned out, the standard said to implement the Gregorian calendar, so that if you have two dates, and you subtract them, then the answer is Gregorian calendar subtraction. But he was computing interest on financial bonds, and he ran on a calendar where every month has the same length, because you get the same interest on a bond regardless of how long the month is. In modern times, there are lots and lots of different semantics for time, so if you implement one of them, everyone on a different definition is out of luck. When FedEx does two-day delivery, Sunday doesn’t count, and when Wall Street says something settles in five days, they mean trading days, not calendar days. So what this guy wanted was a new data type called bond time. And he was happy to store date and time exactly the way it’s stored in the Gregorian calendar; he just wanted his own definition of subtraction. Postgres allowed him to do that, and it made a huge difference in performance and maintainability.
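The bond-time idea, a type that stores an ordinary calendar date but defines its own subtraction, can be sketched in a few lines. This is an illustration in Python rather than a Postgres type definition. The `BondDate` class is hypothetical, and the day-count rule is a simplified 30/360 convention (every month counts as 30 days), chosen as an assumption because the customer's exact rule is not given.

```python
from datetime import date

class BondDate:
    """Stores an ordinary calendar date; only subtraction is redefined."""

    def __init__(self, year, month, day):
        self.value = date(year, month, day)  # same storage as a plain date

    def __sub__(self, other):
        # Simplified 30/360 day count (an assumption): each month
        # contributes 30 days and day-of-month 31 is treated as 30.
        a, b = other.value, self.value
        return (360 * (b.year - a.year)
                + 30 * (b.month - a.month)
                + (min(b.day, 30) - min(a.day, 30)))

# Calendar subtraction sees a short February; bond subtraction does not.
calendar_days = (date(2015, 3, 15) - date(2015, 1, 15)).days
bond_days = BondDate(2015, 3, 15) - BondDate(2015, 1, 15)
print(calendar_days, bond_days)  # 59 60
```

Because only the operator changes and not the stored representation, the application no longer has to fetch two dates and redo the arithmetic in its own code.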

In 2001, you became an adjunct professor at MIT, where you’ve been incredibly productive.

When we moved here, MIT had nothing going on in database management whatsoever: no professors, courses, or students. But the nice thing about the Boston area is there are so many universities here, and every single one of them has a database person or two. So an extended research group began to form, with people at Brandeis, Brown, Worcester Polytech, and UMass Boston. The Brown guys wanted to build a stream-processing engine, so we built the prototype and it got commercialized. Then it occurred to me, in the data warehouse market, that column stores were wildly better than row stores, so we built a thing called C-Store, which turned into Vertica.

You are currently CTO at three start-ups and co-director of MIT’s Intel Science and Technology Center for Big Data. How do you organize your time?

I fight whatever fire is most pressing. Typically I’m at MIT three days a week, and at one or another start-up two days a week, but it varies. There are travel requests and customer visits, and they come on their own schedules. I’m so privileged to be able to talk to customers; they are a great source of problems. They’ll tell you why your stuff doesn’t work, and then you get to fix it.

Let’s talk about the legacy relational database market. You’ve said—and your career certainly reflects—that “one size fits none” when it comes to database applications.

The database market is in a watershed transition right now. In the 1990s, traditional relational database systems were the answer in search of a question. Right now, the legacy implementations are not good for anything. They will be replaced by new implementations that are largely the result of better ideas like column stores, on the one hand, and changing technology on the other. Very large main memories are now routine, and you can leverage that. There are also all kinds of new problem areas.

I think what’s going to happen is that there will be at least a half-dozen database architectures that will be vertical-market specific. It’s great to be a practicing professional in an industry in transition. The elephants, as I affectionately call them, will only change when they’re threatened.