Matei Zaharia discusses "Apache Spark: A Unified Engine for Big Data Processing" (cacm.acm.org/magazines/2016/11/209116), a Contributed Article in the November 2016 Communications of the ACM.
---
TRANSCRIPT
00:00 There's cooking for one... and then there's cooking for a thousand. It's more than just a difference of scale. Big crowds demand completely different facilities and procedures.
00:14 The same is true for today's big data, which is intractable without such techniques as parallel processing, in-memory computation, and on-the-fly optimization.
00:25 Join us as Matei Zaharia describes how his creation helps the world gain insight from information, in Apache Spark: A Unified Engine for Big Data Processing.
00:38 [Intro graphics/music]
00:48 San Francisco's SoMa district is home to some of the world's most data-intensive companies. It's also where Databricks' Chief Technical Officer Matei Zaharia learned first-hand how tools of the time such as MapReduce just weren't enough.
01:04 DR. ZAHARIA: I worked with people at Facebook and at Yahoo and I saw that they all wanted to do more and more types of workloads on these large-scale clusters. ... So basically the goal was to design a more-general computing engine to do different types of computation with the same data on the same kind of hardware as MapReduce.
01:24 He specifically wanted to address three problems he saw in MapReduce.
01:29 DR. ZAHARIA: One thing was iterative algorithms that make many passes through the same data. ... A second thing was real-time streaming so instead of running this batch job every night to compute a report, can you just compute it incrementally as new data arrives throughout the day; ... and the third thing was interactive queries.
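The first case, iterative algorithms, is where in-memory computation pays off most. As a conceptual sketch in plain Python (not Spark's actual API): load the data once, then make many passes over the in-memory copy instead of re-reading it from disk on every pass, as a nightly MapReduce job would. The tiny gradient-descent loop below is a hypothetical stand-in for such an algorithm.

```python
# Hypothetical illustration: many passes over the same cached data,
# e.g. fitting a one-parameter model y ~ w * x by gradient descent.
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # held in memory once

w = 0.0
for _ in range(100):  # each pass reuses the cached data, no re-reading
    grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
    w -= 0.05 * grad

print(round(w, 2))  # → 1.99 (converges near the slope of the data)
```

Each iteration touches the same `points` list; in Spark the analogous step is marking a dataset as cached so repeated passes hit memory rather than storage.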
01:48 To meet these goals, Dr. Zaharia and his colleagues chose to keep Apache Spark's core small, with libraries that pass data efficiently to each other.
01:58 DR. ZAHARIA: That's actually really important for big data in particular, because moving data is expensive.
02:04 Instead, Apache Spark's core pipelines functions, for example when using a data stream to train an intelligent agent.
02:13 DR. ZAHARIA: As you parse each record from the JSON, it immediately feeds that into the machine learning function -- there's no need to save it anywhere else. And as a result, you get something faster end-to-end, with fewer data copies and less I/O.
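The pipelining he describes can be sketched in plain Python with generators (conceptual only, not Spark's API): each JSON record flows straight from the parser into the consuming function, and no intermediate collection is ever materialized. The `train` function here is a hypothetical stand-in for a machine learning update.

```python
import json

raw_lines = ['{"x": 1}', '{"x": 2}', '{"x": 3}']

def parse(lines):
    # Yields one parsed record at a time; nothing is buffered.
    for line in lines:
        yield json.loads(line)

def train(records):
    # Hypothetical "learning" step: a running sum standing in
    # for a real model update.
    total = 0
    for rec in records:  # pulls records straight from parse()
        total += rec["x"]
    return total

print(train(parse(raw_lines)))  # → 6; no intermediate list was created
```

The point of the sketch is that `parse` and `train` run interleaved over the same pass through the data, which is the property that cuts data copies and I/O.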
02:27 The pipelining process takes advantage of graph theory to make everything run even faster.
02:33 DR. ZAHARIA: Spark looks at the graph and says, O.K., "What order do I need to run things in to get the final result you want out of it?" ... And before we begin doing that, we also optimize the graph a little bit. So we look, we say, if you're chaining together some operations that we can compact down into one thing, we're going to optimize that.
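The "compacting" he mentions is essentially operator fusion. A toy sketch in plain Python (again, not Spark's internals): two chained map operations are fused into a single function, so the data is traversed once instead of twice.

```python
def fuse(f, g):
    # Replace map(g, map(f, data)) with one pass applying both.
    return lambda x: g(f(x))

add_one = lambda x: x + 1
double = lambda x: x * 2

data = [1, 2, 3]
fused = fuse(add_one, double)    # one stage instead of two
print([fused(x) for x in data])  # → [4, 6, 8]
```

In the real engine this kind of rewrite happens on the operation graph before execution, which is why chains of narrow transformations can run as a single stage.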
02:51 The results have enabled new applications.
02:54 DR. ZAHARIA: They have these transparent fish called zebrafish, and they could actually take images of all the neurons in that fish's brain and see when they were active. And, like, they saw them light up as like little specks of light on a photo. So they have this new imaging technology, and they connected this to Apache Spark to actually analyze the data in real time.
03:17 Dr. Zaharia believes that Apache Spark could eventually put such power on every desktop.
03:23 DR. ZAHARIA: Already we see a lot of research where people just take all of Wikipedia and do all kinds of stuff with it like find spammers, try to figure out bias and so on. But I think over time, it'll be possible to do that for, like, pretty much all human-generated text.
03:39 Get all the details in the November 2016 issue of Communications of the ACM, in the contributed article, "Apache Spark: A Unified Engine for Big Data Processing".
03:52 [Outro and credits]