Stonebraker on Data Warehouses

http://cacm.acm.org/blogs/blog-cacm/98136

Data warehouses, business intelligence, business analytics, and complex analytics are the subject of increasingly intense marketing wars, seemingly accelerated by Oracle’s introduction of the Exadata appliance. Here is my spin on the situation. Please note that I have a financial interest in several database companies, and may be biased in a number of ways. The reader should always keep this in mind.

1. Star and snowflake schemas are a good idea in the data warehouse world.

In short, data warehouses store a large collection of facts. The overwhelming majority of these facts are the “five Ws” (who, what, where, when, and why) along with a collection of attributes about the fact. For example, a typical retail organization stores facts about historical transactions. These facts include the customer (who), the retail store (where), the time of the sale (when), and the purchased product (what), along with attributes of the sale (e.g., price, sales tax, credit card, etc.).

One should organize such data as shown in the figure here. Such a schema is called a star schema, with a central fact table and surrounding dimension tables. If stores are organized into divisions, then the star schema has another table between store and fact, and becomes a snowflake schema. Star and snowflake schemas are clean, simple, easy to parallelize, and usually result in very high-performance database management system (DBMS) applications.

If you are a data warehouse designer and come up with something other than a snowflake schema, you should probably rethink your design.

However, you will often come up with a design having a large number of attributes in the fact table; 40 attributes are routine and 200 are not uncommon. Current data warehouse administrators usually stand on their heads to make “fat” fact tables perform on current relational database management systems (RDBMSs).

2. Column stores will dominate the data warehouse market over time, replacing row stores.

The problem with row stores is they store data in the fact table, row by row, on disk. A typical business intelligence query requires half-a-dozen attributes or less (e.g., find me the average price of widgets by store by month for the past two years). A row store will read all 200 attributes, even though only six are required. In contrast, a DBMS that organized data by column will read only the six required, a savings of a factor of 33.

Since fact tables are getting fatter over time as business analysts want access to more and more information, this architectural difference will become increasingly significant. Even when “skinny” fact tables occur or where many attributes are read, a column store is still likely to be advantageous because of its superior compression ability.

For these reasons, over time, column stores will clearly win.

3. The vast majority of data warehouses are not candidates for main memory or flash memory.

Data warehouses are increasing in size faster than storage is getting cheaper. Business analysts have an appetite for as much attribute data as they can get their hands on, and want to keep increasingly long periods of history. Hence, data warehouse problems are getting “net harder,” not “net easier.” Put differently, most data warehouses are measured in Gbytes today, Tbytes tomorrow, and Pbytes the next day.

4. Massively parallel processor (MPP) systems will be omnipresent in this market.

Massively parallel processor systems are the only kind of computer architecture that will scale to Pbytes. All vendors, with a very few exceptions, are or will soon support MPP. Don’t bet on anything that is not in the MPP camp.

5. “No knobs” is the only thing that makes any sense.

It is pretty clear that human operational costs dominate the cost of running a data warehouse. This is mainly the system administration and database administration that is involved in keeping a MPP system up and in managing a Pbyte-sized warehouse. Database administrator (DBA) costs include designing schemas, reprovisioning databases to add or drop resources, adding and dropping users, etc.

Almost all DBMSs have 100 or more complicated tuning “knobs.” This requires DBAs to be “4-star wizards” and drives up operating costs. The only thing that makes sense is to have a program that adjusts these knobs automatically. In other words, look for “no knobs” as the only way to cut down DBA costs.

6. Appliances should be “software only.”

In my 40 years of experience as a computer science professional in the DBMS field, I have yet to see a specialized hardware architecture—a so-called database machine—that wins.

In other words, one can buy general-purpose CPU cycles from the major chip vendors or specialized CPU cycles from a database machine vendor. Since the volume of the general-purpose vendors are 10,000 or 100,000 times the volume of the specialized vendors, their prices are an order of magnitude under those of the specialized vendor. To be a price-performance winner, the specialized vendor must be at least a factor of 2030 faster.

I have never seen a specialized hardware architecture that is faster by this amount.

Put differently, I think database appliances are a packaging exercise—i.e., preconfigure general-purpose hardware and preload the DBMS on it. This results in a software-only appliance.

7. Hybrid workloads are not optimized by “one size fits all.”

If one has a workload that is part online transaction processing (OLTP) and part data warehouse, then he or she has two options: 1) Run a general-purpose RDBMS that stores both kinds of data; and 2) Run two systems, an OLTP engine and a data warehouse engine, coupled together with a high-speed interchange to move operational data into the data warehouse.

Row stores are not good at data warehouse applications (see #1). Column stores are optimized for data warehouses and are not good at OLTP. Hence, neither is a good candidate for a one-size-fits-all implementation. Instead, there are a number of interesting new ideas to accelerate OLTP, including main-memory SQL engines, main memory caches, and flash systems. When coupled with a column store in a two-system configuration, I assert the result will be a factor of 50 or so faster than solution 1.

8. Essentially all data warehouse installations want high availability (HA).

If there is data corruption in a 10 Tbyte warehouse, recovering the database from the database log will take a very long time. Of course, the exact amount of time depends on log formats, number of disks allocated to the log, and so on. However, restoring a 10 Tbyte dump from 10 disks is likely to take hours, not minutes, and then one has to perform a roll forward from the dump. Hardly anybody is interested in taking this kind of downtime.

Instead, most installations use replication and fail over to a replica if there is data corruption. Then, the corrupted copy can be rebuilt as a background task. Hence, HA is used for recovery, not the DBMS log. Obviously, this requires the DBMS to support HA; otherwise, it is a manual DBA hassle to accomplish the same thing in user code.

9. DBMSs should support online reprovisioning.

Not always, but often, I hear a request for online reprovisioning. In other words, one initially allocates 10 nodes to accomplish warehouse processing. The load later rises, and the desire is to allocate 20 nodes to the task originally done by 10. This requires the database to be repartitioned over double the number of nodes.

Hardly anybody wants to take the required amount of downtime to dump and reload the DBMS. A much better solution is for the DBMS to support reprovisioning, without going offline.

10. Virtualization often has performance problems in a DBMS world.

I hear many users say their long-term goal is to move to the cloud, whether using the public cloud or inside the firewall on “an enterprise cloud.” Here, a collection of servers is allotted to several-to-many DBMS applications inside an enterprise firewall. Often, such systems are managed by using virtualization software to present virtual nodes to the DBMS and its applications.

My experience is that CPU resources can be virtualized with modest overhead (say, 10%). However, data warehouses entail disk-based data. In this world, all MPP DBMSs “move the query to the data.” Obviously, this requires knowing the physical data distribution. Virtualization will destroy this knowledge, and turn what were originally reads to a local disk into reads to nonlocal disks. In other words, local I/O gets replaced by remote I/O, with an obvious significant performance hit.

Until better and cheaper networking makes remote I/O as fast as local I/O at a reasonable cost, one should be very careful about virtualizing DBMS software.

Of course, the benefits of a virtualized environment are not insignificant, and they may outweigh the performance hit. My only point is to note that virtualizing I/O is not cheap.