A Valuable Lesson, and Whither Hadoop?

http://bit.ly/1s0eTgd
Oct. 16, 2014

I have just come home from the Grace Hopper Celebration of Women in Computing (GHC). Amazingly, I have now attended 13 of the 14 GHCs that have taken place, including the very first one in 1994. The conference was huge this year! There were 8,000 attendees, including over 2,800 students from 441 schools, and attendees came from 67 countries. The exhibition hall was a beehive of activity, and I am happy to report we had a lot of traffic at the joint ACM/ACM-W/ACM CCECC/CSTA booth, where we made many great contacts among our various constituencies.

Of course, there has been a lot of commentary online about the remarks made by Satya Nadella, CEO of Microsoft. In case you missed it, in response to a question asked by Maria Klawe (president of Harvey Mudd College and member of Microsoft’s Board of Directors), Nadella said women should not ask for pay raises (in truth, he said nobody should ask for raises, but Maria’s question was specifically about women). He said he trusted in the review system, and that if someone’s work is good it would eventually be rewarded. As you can imagine, the conference and the Web lit up, followed soon thereafter by all other forms of media as well.

Here are my thoughts on this matter:

As a result of the ruckus, the Grace Hopper Celebration and the situation for women in computing got more press and more visibility than ever before. That is a very good outcome.
The situation provided a great opportunity for people to talk about the fact that meritocracy does not work when there is implicit bias. Nadella may think the Microsoft review process is fair and unbiased, but given that women in tech in the U.S. earn only $0.86 for every $1.00 earned by men, he should seriously research what the actual numbers are for Microsoft, and then adjust salaries accordingly.
This incident provided a very valuable lesson for the students at the conference. The companies recruiting at Hopper are trying very hard to improve their diversity statistics. The conference gives them access to a lot of women job candidates, and they treat the students very well (fancy swag, food, private events, interview booths, raffles, etc.). It would be easy for the students to be deluded into thinking everything is great now in the tech world and women are always well-treated. Nadella’s comments serve as a reminder that women entering the field still have to be prepared to advocate for themselves when they negotiate starting salaries and subsequent raises.
I do not doubt for a minute that Nadella, along with many other tech CEOs right now, considers himself a strong advocate for women in computing. He is noteworthy for being the first tech CEO of that level to come to Hopper, and he spent a lot of time there. He still has some things to learn, as do many people in this field. As we know, there are many hearts and minds that need to be changed, and even some of our best allies have a lot to learn.

Michael Stonebraker “Hadoop at a Crossroads?”

http://bit.ly/1mj9e2Q
Aug. 5, 2014

Since my last blog posting with Jeremy Kepner on this topic in 2012 (http://bit.ly/1tF6vZ7), a lot of water has gone under the bridge. I feel compelled to point out a few facts and opinions and report on a couple of announcements. I conclude with a prediction on where the “Hadoop stack” might be going.

The first announcement was Cloudera releasing a new DBMS, Impala (http://bit.ly/13EAVRv), which runs on HDFS. Put simply, Impala is architected exactly like all of the shared-nothing parallel SQL DBMSs serving the data warehouse market. Specifically, notice the MapReduce layer has been removed, and for good reason. As some of us have been pointing out for years, MapReduce is not a useful internal interface inside a SQL (or Hive) DBMS (http://bit.ly/1x9MKYu and http://bit.ly/1AcBT5C). Impala was architected by savvy DBMS developers, who know this principle well. In fact, development activity similar to Impala is under way at both Hortonworks and Facebook. This, of course, presents the Hadoop vendors with a dilemma. Historically, “Hadoop” referred to the open source version of MapReduce written by Yahoo. However, Impala has thrown this layer out of the stack. How can one be a Hadoop vendor, when Hadoop is no longer in the mainstream stack?
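To make “shared-nothing parallel SQL” concrete, here is a minimal Python sketch of how such an engine might evaluate SUM ... GROUP BY without a MapReduce layer: rows are hash-partitioned across nodes, each node aggregates only its own slice, and a coordinator merges the partial results. This is an illustration only, not Impala's implementation; the function names and the in-memory lists standing in for nodes are assumptions made for the example.

```python
import zlib
from collections import Counter


def partition(rows, num_nodes):
    """Hash-partition (key, value) rows so each node owns a disjoint slice."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in rows:
        # crc32 is stable across runs, unlike Python's built-in hash() on str.
        nodes[zlib.crc32(key.encode()) % num_nodes].append((key, value))
    return nodes


def local_aggregate(rows):
    """Per-node step: SUM(value) GROUP BY key over local rows only."""
    partial = Counter()
    for key, value in rows:
        partial[key] += value
    return partial


def parallel_sum_by_key(rows, num_nodes=3):
    """Coordinator step: merge the small per-node partial aggregates."""
    total = Counter()
    for node_rows in partition(rows, num_nodes):
        total.update(local_aggregate(node_rows))
    return dict(total)


# Hypothetical sample data: (region, amount) pairs.
rows = [("east", 10), ("west", 5), ("east", 7), ("west", 3),
        ("east", 1), ("north", 4), ("east", 2), ("west", 8)]
result = parallel_sum_by_key(rows)
```

In a real system each slice would live on a different machine and the local step would run there in parallel; only the partial aggregates would cross the network.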

The answer is simple: redefine “Hadoop,” and that is exactly what the Hadoop vendors have done. The word “Hadoop” is now used to mean the entire stack. In other words, HDFS is at the bottom, on top of which run Impala, MapReduce, and other systems. On top of these systems run higher-level software such as Mahout. The word “Hadoop” is used to refer to the entire collection.

The second recent announcement comes from Google, who announced MapReduce is yesterday’s news and they have moved on, building their software offerings on top of better systems such as Dremel, Bigtable, and F1/Spanner (http://bit.ly/1pi7QVC). In fact, Google must be “laughing in their beer” about now. They invented MapReduce to support the Web crawl for their search engine in 2004. A few years ago they replaced MapReduce in this application with Bigtable, because they wanted an interactive storage system and MapReduce was batch-only. Hence, the driving application behind MapReduce moved to a better platform a while ago. Now Google is reporting they see little-to-no future need for MapReduce.

It is indeed ironic that Hadoop is picking up support in the general community about five years after Google moved on to better things. Hence, the rest of the world followed Google into Hadoop with a delay of most of a decade. Google has long since abandoned it. I wonder how long it will take the rest of the world to follow Google’s direction and do likewise…

Notice the Hadoop vendors are now on a collision course with the data warehouse vendors. They are now implementing (or have implemented) the same architecture supported by the data warehouse folks. Once they have a few years to solidify their implementations, they will probably offer competitive performance. Meanwhile, most of the data warehouse vendors support HDFS, and many offer features to support semi-structured data. Hence, the data warehouse market and the Hadoop market will quickly converge. May the best systems win in the resulting head-to-head donnybrook!

Now I turn to HDFS, the only common building block left in the Hadoop stack. Notice clearly that HDFS is a file system, capable of storing bytes of data, a feature we have all come to expect on any computing platform. There are two possible world views of where HDFS might go in the future. If you take a file system view of the world, then users want a common distributed file system and HDFS is a perfectly reasonable alternative.

On the other hand, from the point of view of a parallel SQL/Hive DBMS, HDFS is a “fate worse than death.” A DBMS always wants to send the query (a few kilobytes) to the data (many gigabytes) and never the other way around. Hence, hiding the location of data from the DBMS is death, and the DBMS will go to great lengths to circumvent this feature. All parallel DBMSs, either from the warehouse vendors or from the Hadoop vendors, will turn off location transparency, making HDFS look like a collection of Linux file systems, one per node. Likewise, no DBMS wants file system replicas. See http://bit.ly/1x4Blek for an extensive discussion of this point. In short, load balancing, query optimization, and transaction considerations favor a DBMS-supplied replication system.
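The “send the query to the data” rule can be illustrated with a small Python sketch that counts how much crosses the network under each strategy. Everything here is hypothetical: the Node class, the function names, and the three tiny partitions are stand-ins chosen for the example, and counting values and rows is only a rough proxy for bytes moved.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One hypothetical node holding a local partition of (region, amount) rows."""
    rows: list

    def run_local(self, region):
        # Filter and aggregate next to the data; return one small number.
        return sum(amount for r, amount in self.rows if r == region)


def query_shipping(nodes, region):
    """Send the predicate to every node; one partial sum comes back per node."""
    partials = [node.run_local(region) for node in nodes]
    return sum(partials), len(partials)


def data_shipping(nodes, region):
    """Pull every row to the coordinator, then filter and aggregate there."""
    pulled = [row for node in nodes for row in node.rows]
    total = sum(amount for r, amount in pulled if r == region)
    return total, len(pulled)


nodes = [
    Node([("east", 10), ("west", 5), ("east", 7)]),
    Node([("west", 3), ("east", 1)]),
    Node([("north", 4), ("east", 2), ("west", 8)]),
]

shipped_total, values_moved = query_shipping(nodes, "east")
pulled_total, rows_moved = data_shipping(nodes, "east")
```

Both strategies compute the same answer, but query shipping moves one value per node while data shipping moves every stored row. The first strategy is only possible when the DBMS knows which node holds which data, which is why location transparency works against it.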

If it turns out the DBMS point of view prevails in the marketplace over time, then HDFS will atrophy as the DBMS vendors abandon its features. In such a world, there is a local file system at each node, a parallel DBMS supporting a high-level query language, and various tools on top or extensions defined via user-defined functions. In this scenario, Hadoop will morph into a standard shared-nothing DBMS with a collection of DBMS vendors competing for your software dollar.

On the other hand, if the file system point of view prevails, then HDFS will probably survive largely intact with a potpourri of tools running on top of it. Features users take for granted in a DBMS environment, such as load balancing, auditing, resource governors, data independence, data integrity, high availability, concurrency control, and quality of service, will be slow to come to file system users. There will be no higher-level standard interfaces in this scenario. In other words, a DBMS view of the world offers a bunch of useful services, and users would be well advised to consider carefully whether they want to run on lower-level interfaces.

In either scenario, the only common piece of software is a file system, and the Hadoop vendors will be selling file-system based tools, either DBMS ones or other stuff (or maybe both). In effect, they will join the ranks of the system software vendors selling software or services. May the best products win!

Comments

Great article, one of many in this series that have proven to be prophetic.

I think there is always tension in the marketplace for ideas when there is a choice of “extending” an existing technology vs. scrapping the old in favor of something revolutionary. In this case, large portions of the market chose the revolutionary, without considering all of the features that would eventually be needed to flesh out the technology for enterprise production purposes. That is not to say Hadoop and MapReduce are not useful or productive for some applications; they are. It is, however, a reach to say Hadoop and MapReduce make existing DBMS technology/products obsolete.

The significant developments are the introduction of NoSQL databases; new forms of concurrency and transactions, for example where “eventual consistency” is “good enough”; and “big data.” The particular products that introduce these will most likely not be the products that eventually make these features enterprise capable. Like COBOL programs, the applications using existing technology will probably never disappear until all the SQL programmers die off.

—C Rofer