Hadoop at a Crossroads?

MIT Adjunct Professor Michael Stonebraker

Since my last blog posting with Jeremy Kepner on this topic in 2012 [1], a lot of water has gone under the bridge, and I feel compelled to point out a few facts and opinions and report on a couple of announcements. I conclude with a prediction on where the "Hadoop stack" might be going.

The first announcement was Cloudera releasing a new DBMS, Impala [2], that runs on HDFS. Put simply, Impala is architected exactly like all of the shared-nothing parallel SQL DBMSs, serving the data warehouse market. Specifically, notice clearly that the MapReduce layer has been removed, and for good reason. As some of us have been pointing out for years, MapReduce is not a useful internal interface inside a SQL (or Hive) DBMS [3, 4]. Impala was architected by savvy DBMS developers, who know the above pragma. In fact, development activity similar to Impala is being done by both HortonWorks and FaceBook. This, of course, presents the Hadoop vendors with a dilemma. Historically, "Hadoop" referred to the open source version of MapReduce written by Yahoo. However, Impala has thrown this layer out of the stack. How can one be a Hadoop vendor, when Hadoop is no longer in the mainstream stack?

The answer is simple: redefine "Hadoop", and that is exactly what the Hadoop vendors have done. The word "Hadoop" is now used to mean the entire stack. In other words, HDFS is at the bottom, on top of which run Impala, MapReduce and other systems. On top of these systems run higher-level software such as Mahout. The word "Hadoop" is used to refer to the entire collection.

The second recent announcement comes from Google, who announced that MapReduce is yesterday’s news and they have moved on, building their software offerings on top of better systems such as Dremmel, Big Table, and F1/Spanner [5]. In fact, Google must be "laughing in their beer" about now. They invented MapReduce to support the web crawl for their search engine in 2004. A few years ago they replaced MapReduce in this application with BigTable, because they wanted an interactive storage system and MapReduce was batch-only. Hence, the driving application behind MapReduce moved to a better platform a while ago. Now Google is reporting that they see little-to-no future need for MapReduce.

It is indeed ironic that Hadoop is picking up support in the general community about five years after Google moved on to better things. Hence, the rest of the world followed Google into Hadoop with a delay of most of a decade. Google has long since abandoned it. I wonder how long it will take the rest of the world to follow Google’s direction and do likewise…

Notice that the Hadoop vendors are now on a collision course with the data warehouse vendors. They are now implementing (or have implemented) the same architecture supported by the data warehouse folks. Once they have a few years to solidify their implementations, they will probably offer competitive performance. Meanwhile most of the data warehouse vendors support HDFS, and many offer features to support semi-structured data. Hence, the data warehouse market and the Hadoop market will quickly converge. May the best systems win in the resulting head-to-head donneybrook!

Now I turn to HDFS, which is the only common building block left in the Hadoop stack. Notice clearly that HDFS is a file system, capable of storing bytes of data, a feature we have all come to expect on any computing platform. There are two possible world views of where HDFS might go in the future. If you take a file system view of the world, then users want a common distributed file system and HDFS is a perfectly reasonable alternative.

On the other hand, from the point of view of a parallel SQL/Hive DBMS, HDFS is a "fate worse than death". A DBMS ALWAYS wants to send the query (small kilobytes) to the data (lots of gigabytes) and never the other way around. Hence, hiding the location of data from the DBMS is death, and the DBMS will go to great lengths to circumvent this feature. All parallel DBMSs, either from the warehouse vendors or from the Hadoop vendors, will turn off location transparency, making HDFS look like a collection of Linux file systems, one per node. Likewise no DBMS wants file system replicas. See [6] for an extensive discussion of this point. In short, load-balancing, query optimization and transaction considerations favor a DBMS-supplied replication system.

If it turns out that the DBMS point of view prevails in the marketplace over time, then HDFS will atrophy as the DBMS vendors abandon its features. In such a world, there is a local file system at each node, a parallel DBMS supporting a high-level query language, and various tools on top or extensions defined via user-defined functions. In this scenario, Hadoop will morph into a standard shared-nothing DBMS with a collection of DBMS vendors competing for your software dollar.

On the other hand, if the file system point of view prevails, then HDFS will probably survive largely intact with a potpourri of tools running on top of it. Features that users take for granted in a DBMS environment, such as load-balancing, auditing, resource governors, data independence, data integrity, high availability, concurrency control, and quality-of-service will be slow to come to file system users. There will be no higher-level standard interfaces in this scenario. In other words, a DBMS view of the world offers a bunch of useful services, and users would be well advised to consider carefully if they want to run lower-level interfaces.

In either scenario, the only common piece of software is a file system, and the Hadoop vendors will be selling file-system based tools, either DBMS ones or other stuff (or maybe both). In effect, they will join the ranks of the system software vendors selling software or services. May the best products win!