June 16, 2011
Historically, Online Transaction Processing (OLTP) was performed by customers submitting traditional transactions (order something, withdraw money, cash a check, etc.) to a relational DBMS. Large enterprises might have dozens to hundreds of these systems. Invariably, enterprises wanted to consolidate the information in these OLTP systems for business analysis, cross selling, or some other purpose. Hence, Extract-Transform-and-Load (ETL) products were used to convert OLTP data to a common format and load it into a data warehouse. Data warehouse activity rarely shared machine resources with OLTP because of lock contention in the DBMS and because business intelligence (BI) queries were so resource-heavy that they got in the way of timely responses to transactions.
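The ETL pipeline described above can be sketched in a few lines. This is a minimal, illustrative example only: the source schemas, table names, and the "common format" are all hypothetical, standing in for the dozens of heterogeneous OLTP systems a real enterprise would consolidate.

```python
import sqlite3

# Two hypothetical OLTP sources with slightly different schemas,
# consolidated into one warehouse table. All names are illustrative.
orders = sqlite3.connect(":memory:")
orders.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
orders.execute("INSERT INTO orders VALUES (1, 1999), (2, 450)")

payments = sqlite3.connect(":memory:")
payments.execute("CREATE TABLE payments (ref TEXT, amount_dollars REAL)")
payments.execute("INSERT INTO payments VALUES ('A7', 12.50)")

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE facts (source TEXT, ref TEXT, amount_dollars REAL)")

# Extract from each source, Transform to the common format (dollars), Load.
for sid, cents in orders.execute("SELECT id, amount_cents FROM orders"):
    warehouse.execute("INSERT INTO facts VALUES ('orders', ?, ?)",
                      (str(sid), cents / 100.0))
for ref, dollars in payments.execute("SELECT ref, amount_dollars FROM payments"):
    warehouse.execute("INSERT INTO facts VALUES ('payments', ?, ?)",
                      (ref, dollars))

# A BI-style query now runs against the consolidated data, not the OLTP systems.
total = warehouse.execute("SELECT SUM(amount_dollars) FROM facts").fetchone()[0]
print(round(total, 2))  # 19.99 + 4.50 + 12.50 = 36.99
```

Note that the warehouse query touches none of the OLTP connections, which is precisely the isolation the paragraph above describes.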
This combination of a collection of OLTP systems, connected to ETL, and connected to one or more data warehouses is the gold standard in enterprise computing. I will term it "Old OLTP." By and large, this activity was supported by the traditional RDBMS vendors. In the past I have affectionately called them "the elephants"; in this posting I refer to them as "Old SQL."
As noted by most pundits, "the Web changes everything," and I have noticed a very different collection of OLTP requirements that are emerging for Web properties, which I will term "New OLTP." These sites seem to be driven by two customer requirements:
The need for far more OLTP throughput. Consider new Web-based applications such as multiplayer games, social networking sites, and online gambling networks. The aggregate number of interactions per second is skyrocketing for the successful Web properties in this category. In addition, the explosive growth of smartphones has created a market for applications that use the phone as a geographic sensor and provide location-based services. Again, successful applications are seeing explosive growth in transaction requirements. Hence, the Web and smartphones are driving the volume of interactions with a DBMS through the roof, and New OLTP developers need vastly better DBMS performance and enhanced scalability.
The need for real-time analytics. Intermixed with a tidal wave of updates is the need for a query capability. For example, a Web property wants to know the number of current users playing its game, or a smartphone user wants to know "What is around me?" These are not the typical BI requests to consolidated data, but rather real-time inquiries to current data. Hence, New OLTP requires a real-time query capability.
In my opinion, these two characteristics are shared by quite a number of enterprise non-Web applications. For example, electronic trading firms often trade securities in several locations around the world. The enterprise wants to keep track of the global position for each security. To do so, all trading actions must be recorded, creating a fire hose of updates. Furthermore, there are occasional real-time queries. Some of these are triggered by risk exposure; i.e., alert the CEO if the aggregate risk for or against a particular security exceeds a certain monetary threshold. Others come from humans, e.g., "What is the current position of the firm with respect to security X?"
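The trading example above can be made concrete: the "fire hose" is a stream of inserts, and the real-time queries are aggregates over current data. A minimal sketch (the `trades` schema and the risk threshold are hypothetical, purely for illustration):

```python
import sqlite3

# A hypothetical trades table: the "fire hose" of updates is a stream of
# INSERTs; the real-time query is an aggregate over current data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trades (security TEXT, qty INTEGER)")  # qty < 0 = sell

# Trades arriving from several locations around the world.
db.executemany("INSERT INTO trades VALUES (?, ?)",
               [("X", 500), ("X", -200), ("Y", 100), ("X", 1000)])

# "What is the current position of the firm with respect to security X?"
(pos,) = db.execute(
    "SELECT SUM(qty) FROM trades WHERE security = 'X'").fetchone()
print(pos)  # 500 - 200 + 1000 = 1300

# Risk-style alert: flag any security whose absolute position exceeds
# a (hypothetical) threshold -- the "alert the CEO" query.
THRESHOLD = 1000
alerts = db.execute(
    "SELECT security FROM trades GROUP BY security "
    "HAVING ABS(SUM(qty)) > ?", (THRESHOLD,)).fetchall()
print(alerts)  # [('X',)]
```

The point is that these are inquiries against current data intermixed with the update stream, not BI requests against a separate consolidated store.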
Hence, we expect New OLTP to be a substantial application area, driven by Web applications as the early adopters. These applications will be followed by more traditional enterprise systems. Let's look at the deployment options.
In my opinion, the NoSQL vendors have a couple of issues when presented with New OLTP. First, most New OLTP applications want real ACID. Replacing real ACID with either no ACID or "ACID lite" just pushes consistency problems into the applications, where they are far harder to solve. Second, the absence of SQL makes queries a lot of work. In summary, NoSQL translates into "lots of work for the application"; i.e., it will be the full employment act for programmers for the indefinite future.
Hence, New SQL should be considered as an alternative to NoSQL or Old SQL for New OLTP applications. If New OLTP is as big a market as I foresee, I expect we will see many more New SQL engines employing a variety of architectures in the near future.
RavenDB, a document database of the NoSQL genre, has great ETL, OLTP support, and is backed by ACID storage. See http://t.co/fVxvmSV (and it has evolved a lot since then).
I was hoping to know the characteristics of New SQL. I am actually not convinced with your arguments about NoSQL databases as you have presented no arguments. Problems with Old SQL are known, promises of NoSQL are also known. What is New SQL?
In the context of transaction processing, I would define a New SQL DBMS as one having the following five characteristics: 1) SQL as the primary mechanism for application interaction; 2) ACID support for transactions; 3) A nonlocking concurrency control mechanism so real-time reads will not conflict with writes and thereby cause them to stall; 4) An architecture providing much higher per-node performance than available from the traditional "elephants"; 5) A scale-out, shared-nothing architecture capable of running on a large number of nodes without bottlenecking.
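Point 3 above is the idea behind multiversion concurrency control (MVCC). A toy sketch of the mechanism follows; it illustrates only the core idea (readers see a snapshot and never block behind writers) and is not any engine's actual implementation.

```python
# Toy multiversion store: each write appends a new version stamped with a
# logical timestamp, and a reader sees the state as of the moment its
# snapshot was taken -- so real-time reads never stall behind writes.
class MVStore:
    def __init__(self):
        self.clock = 0
        self.versions = {}  # key -> list of (commit_ts, value), in ts order

    def write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))

    def snapshot(self):
        return self.clock  # a reader's start timestamp

    def read(self, key, ts):
        # Latest version committed at or before the reader's snapshot.
        visible = [v for (cts, v) in self.versions.get(key, []) if cts <= ts]
        return visible[-1] if visible else None

store = MVStore()
store.write("x", 1)
ts = store.snapshot()      # a long-running reader starts here
store.write("x", 2)        # a concurrent write does not disturb the reader
print(store.read("x", ts))                 # 1 (the reader's snapshot value)
print(store.read("x", store.snapshot()))   # 2 (the current value)
```

Real systems add transaction-scoped visibility rules and garbage collection of old versions, but the read path above is the essence of "reads do not conflict with writes."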
About point 3, isn't this achieved in most traditional databases by MVCC?
About point 4, most databases (Clustrix, Akiban, NimbusDB) that are New SQL candidates talk only about better query performance using distributed query or a kind of object storage. I am not sure if they have anything in better for DML performance.
VoltDB is an exception; I am not sure whether it is much better than Teradata or Greenplum, which are based on the old RDBMS architecture.
About point 5, yes, this is a new feature. If I understand correctly, it means scaling the performance by adding new nodes without interrupting existing users.
My previous comment suggested five criteria that define a New SQL DBMS. I would like to stress three points that I made previously. First, my posting focused on DBMSs for New OLTP applications. Data warehouse vendors, such as Teradata and Greenplum, are focused on a completely different market and are not designed to perform high-velocity transactions. Hence, they are not considered New SQL vendors.
Second, most of the Old SQL vendors use standard two-phase locking, although there are exceptions. Hence, there are Old SQL engines that satisfy some of my five criteria.
Third, one of the big problems with Old SQL engines is their mediocre per-node performance. One of my criteria for New SQL vendors is much better per-node performance. The proof of this is performance on standard benchmarks. Hence, whether any particular vendor satisfies this criterion would have to be determined experimentally. As such, the list of vendors who satisfy the five criteria may well change over time.
If a database has a scale-out architecture, why is very high per-node performance also necessary? If performance improves as nodes are added, the required throughput can still be achieved.
My main question is why an existing database such as PostgreSQL could not be considered an option comparable to New SQL if it were enhanced to support a scale-out architecture. Saying this doesn't mean I have a clear idea of how to achieve it.
That would benefit existing customers as well, even if performance were somewhat lower than that of New SQL engines, and it could save much of the effort of porting applications to new engines.
In round numbers, New SQL engines are a factor of 50 or more faster on New OLTP than Old SQL. That means that an Old SQL engine would require 500 nodes to do a task that can be accomplished by a New SQL engine with 10 nodes.
The downsides of 500 nodes are increased hardware cost, increased power consumption, greater system administration complexity, greater database administration complexity, harder high availability (if each node fails once a month, the 10-node New SQL cluster sees a failure every third day, while the 500-node Old SQL cluster sees one roughly every hour and a half), and less flexibility (if load doubles, you have to add 500 more nodes rather than 10).
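The arithmetic above can be worked through directly. All figures here are the round numbers from the argument itself (a claimed 50x per-node advantage, each node failing about once a month), not measurements.

```python
# Working through the round numbers in the argument above.
speedup = 50                  # claimed per-node advantage of New SQL
new_sql_nodes = 10
old_sql_nodes = new_sql_nodes * speedup
print(old_sql_nodes)          # 500 nodes to match 10 New SQL nodes

node_mtbf_days = 30           # assume each node fails about once a month
# With n independent nodes, some node fails roughly every mtbf / n.
new_failure_days = node_mtbf_days / new_sql_nodes          # 3.0 -> every third day
old_failure_hours = node_mtbf_days / old_sql_nodes * 24    # 1.44 -> ~every 90 minutes
print(new_failure_days, old_failure_hours)
```

So under these assumptions the 500-node cluster sees a node failure roughly every hour and a half, versus every third day for the 10-node cluster.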
Disclosure: Michael Stonebraker is associated with four startups that are either producers or consumers of database technology.
©2012 ACM 0001-0782/11/01 $15.00
Good to see this topic and article; I enjoyed your recent, very informative presentation at the NIST symposium on Big Data (http://www.nist.gov/itl/ssd/is/big-data.cfm).
Taking data in is important, but so is sending data out and sharing results.
Our open source project has developed code-free templates to allow rapid XML open data services to be built from SQL sources so you can quickly and easily present and share results and knowledge.
See our online video training materials for more details: http://www.youtube.com/user/TheCAMeditor
Oracle Public Sector
Meh, I'm not convinced. I think the "Old SQL" solutions will adapt and be just fine. What we are seeing in data is a blip, and the big vendors like Oracle and Microsoft will end up taking care of most of these issues within the next 1-2 years.
I think what will happen is that the best aspects of NoSQL and so-called New SQL will get integrated into their products and as a result those products will be much better than the one-off specialized products.
This is already happening with in-memory, which is largely what I assume you are talking about with "New SQL". MS and Oracle are both working on in-memory solutions and putting together "out of the box" systems with massive throughput. Yeah, they are expensive, but still. MS's new Parallel Data Warehouse can run massive queries on petabytes of data in milliseconds now, I mean...
With the implementation of writable columnstore indexes (coming soon in SQL Server) you'll be able to do blazingly fast queries against OLTP systems, without locking...
The problem with NoSQL and New SQL systems is typically that they are very specialized: very good for a narrowly defined set of tasks. The older systems have a much larger established infrastructure and ecosystem of tools, so if you can add the new capabilities to the older systems you get a much more powerful and well-rounded system than if you simply adopt some new specialized product. It will also be much cheaper for the older vendors to add those capabilities than for new players to build all of the infrastructure the older vendors already have.
I would also submit that while the VOLUME of "new forms of data" will definitely exceed the volume of older forms of transactional data, the VALUE of older forms of data will remain higher. There is a reason that systems evolved the way that they did. Essentially the most valuable problems were tackled first. Yes, all of that phone and web data is interesting, but a single row in a traditional RDBMS is worth thousands or millions of "rows" of unstructured or "New" data.
And this is the issue: yes, there is value in data mining weblogs and location data, etc., but it's like the difference between mining for gold and mining for aluminum. To make money mining aluminum you have to process a lot more of it to get the same value you get from a few ounces of gold.
So every "ton" of "new data" is worth about "1 pound" of "old data", if you see what I mean.
What this means is that the "value density" still resides with the older systems, and that's why I think that things will end up coming back to them. They are the anchors of the data world.
Which open source database engines would you categorise as "New SQL"?