DBMSs For Science Applications: A Possible Solution

MIT Adjunct Professor Michael Stonebraker

I know quite a few scientists who deal with the processing and storage of large amounts of data. All are unhappy with relational DBMSs. There are three common reasons that they cite:

Tables are not a good data model for their application. The most common requirement among earth scientists, oceanographers, high energy physicists, and astronomers is support for large arrays. It has been shown repeatedly that simulating arrays on top of tables is an unnatural act and gives very bad performance. Biologists and chemists seem equally unhappy with tables, although they want something other than arrays.

The operators in relational DBMSs do not meet scientific needs. For example, a scientist with remote sensing data often wants to regrid that data to match the co-ordinate system of some other data set. Regridding array data using SQL operations is nearly impossible. Hence, earth scientists want science-specific operations, such as regrid, as primitives.

Required features are not supported in current commercial DBMSs. For example, all scientific data has uncertainty; however, RDBMSs assume data is precise (as is typically true in business data processing). A scientist must deal with uncertainty in application logic, and the result is not pretty. An extra field with the imprecision must be added, and then all operations (such as filtering) must be recoded to take such uncertainty into account. In addition, scientists never want to update data in place, thereby losing the old value. If a data value is bad, they want to add the new value to the DBMS, preserving the old value. In this way, they can see historical states of the data and correction procedures. Such a “no overwrite” strategy is also universal in accounting systems that perform double-entry bookkeeping. However, current RDBMSs update data in place, and scientists must deal with history in painful application logic. Equally troubling in current RDBMSs is the lack of support for data lineage (how was the data derived) and named versions (so a scientist can make local changes to a data set without affecting other users).
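
To make the third complaint concrete, here is a minimal sketch of the application logic a scientist ends up writing today. Everything in it is an illustrative assumption rather than any real system's API: the `Reading` record, its `sigma` imprecision field, the `AppendOnlyStore` class, and the `possibly_greater` filter (which treats a value as a candidate if it lies within `k` standard deviations of the threshold) are all hypothetical names and choices.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Reading:
    """One observation. `sigma` is the extra imprecision field (hypothetical)."""
    key: str        # which measurement this is, e.g. a sensor id
    value: float
    sigma: float    # assumed to be one standard deviation of uncertainty
    version: int    # grows with every correction; old versions are kept


class AppendOnlyStore:
    """Toy "no overwrite" store: a correction appends a new version."""

    def __init__(self) -> None:
        self._rows: List[Reading] = []

    def put(self, key: str, value: float, sigma: float) -> Reading:
        # Never modify an existing row; add a new one with the next version.
        version = 1 + sum(1 for r in self._rows if r.key == key)
        row = Reading(key, value, sigma, version)
        self._rows.append(row)
        return row

    def history(self, key: str) -> List[Reading]:
        """All historical states of one measurement, oldest first."""
        return [r for r in self._rows if r.key == key]

    def latest(self, key: str) -> Optional[Reading]:
        rows = self.history(key)
        return rows[-1] if rows else None

    def current(self) -> List[Reading]:
        """Latest version of every key, in first-insertion order."""
        newest = {}
        for r in self._rows:
            newest[r.key] = r
        return list(newest.values())


def possibly_greater(r: Reading, threshold: float, k: float = 2.0) -> bool:
    """Uncertainty-aware filter: keep the row if value + k*sigma > threshold."""
    return r.value + k * r.sigma > threshold


store = AppendOnlyStore()
store.put("buoy-7", 14.2, 0.5)
store.put("buoy-9", 15.1, 0.1)
store.put("buoy-7", 14.9, 0.2)   # a correction: appended, not overwritten

# A plain `value > 15.0` filter would drop buoy-7; the recoded filter keeps it.
hits = [r.key for r in store.current() if possibly_greater(r, 15.0)]
```

Even in this toy form, versioning and uncertainty leak into every query the application issues, which is exactly the burden a science-oriented DBMS would take on itself.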

The net result of these problems is that scientists either don’t use commercial DBMSs or use them grudgingly.

Unfortunately, science applications are not currently a billion-dollar market. As such, science requirements have been largely ignored by the major commercial vendors. Moreover, there is no evidence that this state of affairs will change anytime soon. This leaves science users out in the cold, and often they must resort to “rolling their own” on top of the “bare metal.”

Personally, I believe that there is a collection of planet-threatening problems, such as climate change and ozone depletion, that only scientists are in a position to solve. Hence, the sorry state of DBMS support in particular (and system software support in general) for this class of users is very troubling.

Science users, of course, want a commercial-quality DBMS, i.e., one that is reliable, scalable and comes with good documentation and support. They also want something that is open source. There is no hope that such a software system can be built in a research lab or university. Such institutions are good at prototypes, but not production software. Hence, the obvious solution is a nonprofit foundation, along the lines of Apache or Mozilla, whose charter would be to build such a DBMS. It could not be financed by venture capital, because of market size issues. As such, support must come from governments and foundations.

It is high time that the United States got behind such an initiative.