The leading open source system for processing big data continues to evolve, but new approaches with added features are on the rise.
When a new user visits the online music service Pandora and selects a station, the company's software immediately generates a playlist based on the preferences of its community. If the individual creates a Chopin station, for example, the string of songs would consist of the most popular renditions of the composer's music among Pandora's community. Once the new listener inputs a rating, clicking on either the "thumbs up" or the "thumbs down" icon, Pandora factors this preference into future selections. In effect, the service becomes smarter with the vote of each thumb.
Pandora will not discuss exactly how much data it churns through daily, but head of playlist engineering Eric Bieschke says the company has at least 20 billion thumb ratings. Once every 24 hours, Pandora adds the last day's data to its historical poolnot just thumbs, but information on skipped songs and moreand runs a series of machine learning, collaborative filtering, and collective intelligence tasks to ensure it makes even smarter suggestions for its users. A decade ago this would have been prohibitively expensive. Four years ago, though, Bieschke says Pandora began running these tasks in Apache Hadoop, an open source software system that processes enormous datasets across clusters of cheap computers. "Hadoop is cost efficient, but more than that, it makes it possible to do super large-scale machine learning," he says. Pandora's working dataset will only grow, and Hadoop is also designed for expansion. "It's so much easier to scale. We can literally just buy a bunch of commodity hardware and add it to the cluster."
Bieschke is hardly alone in his endorsement. In just a few years, Hadoop has grown into the system of choice for engineers analying big data in fields as diverse as finance, marketing, and bioinformatics. At the same time, the changing nature of data itself, along with a desire for faster feedback, has sparked demand for new approaches, including tools that can deliver ad hoc, real-time processing, and the ability to parse the interconnected data flooding out of social networks and mobile devices. "Hadoop is going to have to evolve," says Mike Miller, chief scientist at Cloudant, a cloud database service based in Boston, MA. "It's very clear that there is a need for other tools." Indeed, inside and outside the Hadoop ecosystem, that evolution is already well under way.
Hadoop traces back to 2004, when Google published the second of a pair of papers [see the "Further Reading" list] describing two of the key ideas behind its search success. The first detailed the Google File System, or GFS, as a way of distributing data across hundreds or thousands of inexpensive computers. To glean insights from that data, a second tool, called MapReduce, breaks a given job into smaller pieces, sends those tasks out to the different computers, then gathers the answers in one central node. The ideas were revolutionary, and soon after Google released the two papers, Yahoo! engineers and others quickly began developing open source software that would enable other companies to take advantage of the same breed of reliable, scalable, distributed computing that Google had long enjoyed.
The result, Apache Hadoop, consists of two main software modules. The Hadoop Distributed File System (HDFS) is similar to a file system on a single computer. Like GFS, it disperses enormous datasets among hundreds or thousands of pieces of inexpensive hardware. The computational layer, Hadoop MapReduce, takes advantage of the fact that those chunks of data are all sitting on independent computers, each with its own processing power. When a developer writes a program to mine that data, the task is split up. "Each computer will look at its locally available data and run a little segment of the program on that one computer," explains Todd Lipcon, an engineer at Palo Alto, CA-based Hadoop specialist Cloudera. "It analyzes its local data and then reports back the results."
Although Hadoop is open source, companies like Cloudera and MapR Technologies of San Jose, CA, have found a market in developing additional services or packages around making it easier to use and more reliable. MapR, for example, has helped ancestry.com use Hadoop to carry out pattern matches on its enormous library of DNA details. After a customer sends in a saliva sample, the company can extract the basic biological code and use a Hadoop-based program to search for DNA matchesthat is, potential mystery relativesacross its database.
For all its strengths in large-scale data processing, however, experts note MapReduce was not designed to analyze data sets threaded with connections. A social network, for example, is best represented in graph form, wherein each person becomes a vertex and an edge drawn between two individuals signifies a connection. Google's own work supports the idea that Hadoop is not set up for this breed of data: The company's Pregel system, publicly described for the first time in 2009, was developed specifically to work with graph structures, since MapReduce had fallen short.
Along with a handful of students, University of Washington network scientist Carlos Guestrin recently released a new open source processing framework, GraphLab, that uses some of the basic MapReduce principles but pays more attention to the networked structure. The data distribution phase takes the connections into account. "If I know that you and I are neighbors in the graph, there will be some computation that needs to look at your data and my data," Guestrin explains. "So GraphLab will try to look at our data on the same machine."
The trick is that GraphLab partitions this data in a novel way. The standard method would have been to split the data into groups of highly connected vertices. In social networks, however, a few persons have a disproportionate number of connections. Those people, or vertices, could not be stuffed into single machines, which would end up forcing numerous computers to communicate. To avoid this inefficiency, Guestrin says GraphLab's algorithm partitions data according to the edges, so that closely linked edges are on the same machines. A highly connected individual such as the pop star Britney Spears will still live on multiple pieces of hardware, but far fewer than with the standard technique. "It minimizes the number of places that Britney Spears is replicated," Guestrin explains.
"Hadoop is going to have to evolve," says Mike Miller. "It's very clear that there is a need for other tools."
Hadoop, on the other hand, is agnostic to the structure of the data, according to Guestrin. Two pieces of information that should be analyzed on the same computer might end up in different clusters. "What ends up happening is that they have to move data around a lot," Guestrin says. "We can be smart about how data is placed and how data is communicated between machines."
Hadoop can often complete the same tasks as GraphLab, but Guestrin says his more efficient approach makes it much faster. On several common benchmark tests, such as a name recognition task in which an algorithm analyzes text and assigns different categories to words, GraphLab has completed the job 60 times faster, using the same hardware.
In some cases, though, this is not fast enough. Today's companies often want results in real time. A hedge fund might be looking to make a snap decision based on the day's events. A global brand might need to respond quickly to a trending topic on Twitter. For those sorts of snap decision-related tasks, Hadoop is too slow, and other tools have begun to emerge.
The Hadoop community has been building real-time response capabilities into HBase, a software stack that sits atop the basic Hadoop infrastructure. Cloudera's Lipcon explains that companies will use Hadoop to generate a complicated model of, say, movie preferences based on millions of users, then store the result in HBase. When a user gives a movie a good rating, the website using the tools can factor that small bit of data into the model to offer new, up-to-date recommendations. Later, when the latest data is fed back into Hadoop, these analyses run at a deeper level, analyzing more preferences and producing a more accurate model. "This gives you the sort of best of both worldsthe better results of a complex model and the fast results of an online model," Lipcon explains.
GraphLab, a new open source processing framework, uses some of the basic MapReduce principles, but pays more attention to the networked structure.
Cloudant, another real-time engine, uses a MapReduce-based framework to query data, but the data itself is stored as documents. As a result, Miller says, Cloudant can track new and incoming information and only process the changes. "We don't require the daily extraction of data from one system into another, analysis in Hadoop, and re-injection back into a running application layer," he says. "That allows us to analyze results in real time." And this, he notes, can be a huge advantage. "Waiting until overnight to process today's data means you've missed the boat."
Miller says Cloudant's document-oriented store approach, as opposed to the column-oriented store adopted in HBase, also makes it easier to run unexpected or ad hoc queriesanother hot topic in the evolving Hadoop ecosystem. In 2009, Google publicly described its own ad hoc analysis tool, Dremel, and a project to develop an open source version, Drill, just launched this summer. "In between the real-time processing and batch computation there's this big hole in the open source world, and we're hoping to fill that with Drill," says MapR's Ted Dunning. LinkedIn's "People You May Know" functionality would be an ideal target for Drill, he notes. Currently, the results are on a 24-hour delay. "They would like to have incremental results right away," Dunning says.
Although these efforts differ in their approaches, they share the same essential goal. Whether it relates to discovering links within pools of DNA, generating better song suggestions, or monitoring trending topics on Twitter, these groups are searching for new ways to extract insights from massive, expanding stores of information. "A lot of people are talking about big data, but most people are just creating it," says Guestrin. "The real value is in the analysis."
NoSQL Tapes, http://www.nosqltapes.com.
Dean, J. and Ghemawat, S.
MapReduce: Simplified data processing on large cluters, Proceedings of the 6th Symposium on Operating Systems Design and Implementation, San Francisco, 2004.
Ghemawat, S., Gobioff, H., and Leung, S.
The Google file system, Proceedings of the 19th ACM Symposium on Operating Systems Principles, Lake George, NY, Oct. 1922, 2003.
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., and Hellerstein, J.M.
GraphLab: A new parallel framework for machine learning, The 26th Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, July 811, 2010.
Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, CA, 2009.
©2013 ACM 0001-0782/13/01
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from email@example.com or fax (212) 869-0481.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2013 ACM, Inc.
I think that Carlos Guestrin is from CMU (as the graphlab page here - http://graphlab.org/contact/ also tells us), whereas you have said he is from University of Washington.
Could you clarify this please?
Although he is still affiliated with CMU, Guestrin recently moved to the University of Washington:
Gregory, very good article!
I would like to add that there is another free and open source distributed data-intensive computing platform, which is not based on the MapReduce paradigm: the LexisNexis HPCC Systems platform (http://hpccsystems.com).
The original design for the HPCC Systems platform predates the paper on MapReduce from the Google researchers by, at least, 5 years. The processing model of the HPCC Systems platform is dataflow oriented and provides a very high level declarative and open programming language called ECL, which offers modern programming language features, including code/data encapsulation, lazy evaluation, compilation to native code and purity. This platform underpins all the data services and analytic products from LexisNexis Risk Solutions, and several other information products from Reed Elsevier, its parent company, in areas that cover machine learning, massive data warehousing, social graph analytics, recommendation systems, etc. It has also been in use by several large and medium sized Organizations for years (even before it was released under an Open Source license, back in 2011).
On the same topic, a few weeks ago, I wrote a short blog post comparing the paradigms behind the two main data-intensive open source platforms: Hadoop and HPCC, which you can read here: http://hpccsystems.com/blog/hpcc-systems-hadoop-%E2%80%93-contrast-paradigms. Some of the concepts that I expose there are relevant for this article.
Displaying all 3 comments