Big Data

Clockwise from top left: David Blei, Daphne Koller, Vipin Kumar, and Michael Stonebraker.

Since its inauguration in 1966, the ACM A.M. Turing Award has recognized major contributions of lasting importance to computing. Through the years, it has become the most prestigious award in computing. To help celebrate 50 years of the ACM Turing Award and the visionaries who have received it, ACM has launched a campaign called “Panels in Print,” which takes the form of a collection of responses from Turing laureates, ACM award recipients and other ACM experts on a given topic or trend.

For our fourth and final Panel in Print, we invited 2014 ACM A.M. Turing Award recipient MICHAEL STONEBRAKER, 2013 ACM Prize recipient DAVID BLEI, 2007 ACM Prize recipient DAPHNE KOLLER, and ACM Fellow VIPIN KUMAR to discuss trends in big data.

Gartner estimates that there are currently about 4.9 billion connected devices (cars, homes, appliances, industrial equipment, among others) generating data. This is expected to reach 25 billion by 2020. What do you see as some of the primary challenges and opportunities this wave of data will create?

VIPIN KUMAR: One of the major challenges we are going to see is that the data being gathered from these connected devices and sensors is very different from other datasets that our big data community has had to deal with.

The biggest successes we have seen for big data are in applications such as Internet search, e-commerce, placement of online ads, language translation, image processing, and autonomous driving. These successes have been enabled, to a great extent, by the availability of large, relatively structured datasets that can be used to train a broad range of machine learning algorithms. But the data from multitudes of interconnected devices, in its raw state, can be highly fragmented, disparate in space and time, and very heterogeneous. Analyzing such data will be a big and new technological challenge for the machine learning and data mining communities.

DAVID BLEI: The key idea here is that just the data from something as simple as Netflix watching habits doesn’t provide the recommendation of a new movie; it’s that data alongside all the data from everybody else that helps make recommendations.

It’s an exciting world because we are personalizing our interaction with devices through the aggregate data of everybody using their devices. Of course, this all comes with a challenge around privacy and what we give up when we make our data available or the spectrum of how much we can give up against how much personalization power we get in return.

The other opportunity is an unprecedented way to learn about the world through these huge collections of data from many individuals. This is a massive dataset, and patterns of communication, interaction, and movement—including all types of other macro-level descriptions of society and people and the world—are now available to us.
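Blei's Netflix point, that one person's history only becomes a recommendation when it is set beside everyone else's, can be illustrated with a toy sketch. This is not Netflix's method or anything the panel describes in detail; it is a minimal overlap-based recommender, and the user names, film titles, and the `recommend` function are all hypothetical.

```python
from collections import Counter

# Hypothetical viewing histories: user -> set of titles watched.
histories = {
    "ana": {"Film A", "Film B"},
    "ben": {"Film A", "Film B", "Film C"},
    "cy": {"Film B", "Film D"},
    "dee": {"Film E"},
}

def recommend(target, histories, top_n=1):
    """Suggest unseen titles, weighted by how much each other user overlaps with the target."""
    seen = histories[target]
    scores = Counter()
    for user, watched in histories.items():
        if user == target:
            continue
        overlap = len(seen & watched)
        if overlap == 0:
            continue  # users with nothing in common contribute no signal
        for title in watched - seen:
            scores[title] += overlap
    return [title for title, _ in scores.most_common(top_n)]

print(recommend("ana", histories))
```

With only "ana" in the data, the function has nothing to suggest; the recommendation exists only because other users' histories are available, which is also where the privacy trade-off enters.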

As more data is collected from a growing pool of devices, has the individual lost the right to information privacy?

MICHAEL STONEBRAKER: Imagine this simple example: you show up at your doctor’s office and have an x-ray done and you want the doctor to run a query that shows who else has x-rays that look like yours, what was their diagnosis and what was the morbidity of the patients. That requires integrating essentially the country’s entire online medical databases and presumably would extend to multiple countries as well. While that is a daunting data integration challenge, because every hospital chain stores its data with different formats, different encodings for common terms, etc., the social value gained from solving it is just huge. But that also creates an incredibly difficult privacy problem, one that is not a technical issue. Because if you’re looking for an interesting medical query, you’re not looking for common events; you’re looking for rare events, and at least to my knowledge, there aren’t any technical solutions that will allow access to rare events without indirectly disclosing who the events belong to.

I view the privacy problem to be basically a legal problem. We have to have legal remedies in this area. There are tons of examples of data that can be assembled right now that will compromise privacy. Unfortunately, the social value of compromising privacy is pretty substantial. So, you can argue that technology has rendered privacy a moot question. Or you can argue that preserving privacy is a legislative issue.
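Why rare events are so hard to share safely can be seen in a small sketch. It borrows the idea of k-anonymity, which Stonebraker does not name, purely as an illustration: a record is at risk when fewer than k records share its combination of seemingly harmless attributes. The records, field names, and the `risky_groups` function are hypothetical.

```python
from collections import Counter

# Hypothetical, made-up patient records with names already removed.
records = [
    {"zip": "02139", "age_band": "40-49", "diagnosis": "fracture"},
    {"zip": "02139", "age_band": "40-49", "diagnosis": "fracture"},
    {"zip": "02139", "age_band": "40-49", "diagnosis": "pneumonia"},
    {"zip": "98101", "age_band": "70-79", "diagnosis": "rare bone disorder"},
]

def risky_groups(records, quasi_identifiers, k=2):
    """Return attribute combinations shared by fewer than k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Any combination listed here points to fewer than k people.
print(risky_groups(records, ["zip", "age_band"], k=2))
```

In this toy data the common cases hide in a group of three, while the single rare case is the only record in its group, so anyone who knows the patient's ZIP code and approximate age can attach the diagnosis to a person.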

As predictive models are increasingly used, how do we avoid biases when interpreting and using data?

DAPHNE KOLLER: Bias will always be a challenge, and there isn’t a single, magic solution. The bigger question is: “How do we disentangle correlation from causation?” The gold standard in medicine is that of randomized case control. In the case of Web data, it’s called A/B testing. Although not perfect, randomized case control, or A/B testing, is about as good a tool as we have been able to develop for addressing some of the confounders. Unfortunately, this type of control is not feasible in all cases. Then processes must be carefully scrutinized to check for different confounders and to look for any and all correlations that give rise to the phenomenon being viewed. It’s a process that requires a lot of thought and a lot of care, and its importance cannot be overstated.

For example, sometimes there are biases that are reflected in the conclusions that are drawn from the data. In searches on certain sites for example, “Steph” auto-completes to “Stephen” rather than “Stephanie” because Stephen is a more common search term. Some would say this is a gender bias and should be eliminated. As a woman in tech, I can certainly relate to and understand that perspective. Some would also say that the data is what it is, and if Stephen is a more common search term than Stephanie—then do we really want to make the algorithm do something other than what is best for user efficiency? It’s a real quandary, and one can make legitimate arguments either way.
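The randomized comparison Koller mentions can be sketched in a few lines. This is a textbook two-proportion z-test under a normal approximation, not a procedure described by the panel; the function name and the example counts are hypothetical, and the test is only meaningful if users were assigned to the two groups at random.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variants A and B; return (z, two-sided p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 100 of 1,000 users convert on A, 150 of 1,000 on B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Random assignment is what lets a small p-value be read as an effect of the variant rather than of some confounder; the same arithmetic applied to self-selected groups would only measure a correlation.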

MICHAEL STONEBRAKER: The trouble with predictive models is that they are built by humans, and humans by nature are prone to bias. If we look at the most recent presidential election, we see a spectacular failure of existing polling models. Twenty-twenty hindsight shows that nobody thought Trump could actually win, when in reality, it is far more likely the polling models were subtly biased against him.

So, the problem with predictive models is the models themselves. If they include fraud, bias, etc., they can yield very bad answers. One has to take predictive models with a grain of salt. We put way too much faith in predictive modeling.

What role can big data and machine learning play in helping scientists understand data (for example, in the Human Genome project) and bring forth some potential real-world opportunities in health and medicine?

DAPHNE KOLLER: One of the main reasons I came back to the healthcare field is because I think the opportunity here is so tremendous. As costs go down, our ability to sequence new genomes increases dramatically. And it’s not just genomes; it’s transcriptomes and proteomes and many other data modalities. When we combine that with wearable devices that allow you to see the effect of phenotypes, there is an amazing explosion of data that we could access. One reason this is beneficial is that it will improve our ability to determine the genetic factors that cause certain diseases. Yes, we could do that before, but when faced with tens of millions of variations in the genome and only a couple hundred examples to use, it’s really difficult to extract much out of that except the very strongest signals.

Are there potential technological breakthroughs on the horizon that could transform this area again in the near future?

DAVID BLEI: I think we are in the middle of a transformative time for machine learning and statistics, and it’s fueled by a few ideas. Reinforcement learning is a big one. This is the idea that we can learn how to act in the face of an uncertain environment with uncertain consequences of our actions; it’s fueling a lot of the amazing results that we’re seeing in machine learning and AI. Deep learning is another idea—a very flexible class of learners that, when given massive datasets, can identify complex and compositional structure in high-dimensional data. Another idea is 60 years old, but it’s optimization: I have some kind of function and I want the maximal value of that function, how do I do that? Well, it’s called an optimization procedure. Optimization tells us how to do that very efficiently with massive datasets.
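Blei's description of optimization, finding the maximal value of a function efficiently even with massive datasets, can be made concrete with a small sketch of stochastic gradient ascent. It is an illustration under stated assumptions rather than any specific method he refers to: the objective (negative mean squared distance to the data), the step-size schedule, the synthetic data, and the function name are all choices made for this example.

```python
import random

def stochastic_gradient_ascent(data, steps=2000, batch_size=10, seed=0):
    """Maximize f(x) = -mean((x - d)^2 for d in data) using small random batches."""
    rng = random.Random(seed)
    x = 0.0
    for t in range(1, steps + 1):
        # Look at a handful of points per step instead of the full dataset.
        batch = rng.sample(data, batch_size)
        # Gradient of the batch objective with respect to x.
        grad = -2.0 * sum(x - d for d in batch) / batch_size
        # Shrinking step size so the noisy updates settle down.
        x += (0.5 / t) * grad
    return x

# Synthetic data: 10,000 draws centered near 3.0 (made up for the example).
data_rng = random.Random(42)
data = [data_rng.gauss(3.0, 1.0) for _ in range(10_000)]
true_mean = sum(data) / len(data)

estimate = stochastic_gradient_ascent(data)
print(f"estimate = {estimate:.3f}, exact maximizer = {true_mean:.3f}")
```

The exact maximizer of this objective is the data mean, so the result is easy to check. Each update touches only 10 of the 10,000 points, which is the property that lets this family of methods scale to datasets too large to process in full at every step.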

VIPIN KUMAR: New types of sensors and communication technologies can be quite transformational. The kinds of sensors that we see today could not even have been imagined just a few decades ago. Mobile health sensors such as Fitbit and Apple Watches that can record our physiological parameters in unprecedented detail have been around only for the past decade or so. New types of sensors based on advances in electronics, nanotechnology, and biomedical sciences are already enabling deployment of small and inexpensive satellites that can monitor the earth and its environment at spatial and temporal resolutions never possible before. Without technologies such as RFID, it would be very hard for someone to imagine that you could walk into a store and purchase something just by looking at it or by being close to it—something that is now possible at Amazon Go, a grocery store in Seattle that has no checkout counter. New sensors based on quantum technology may open up entirely new applications that we are not even considering today.

Final thoughts?

MICHAEL STONEBRAKER: All of the fancy social benefits we expect from big data depend on seamless data integration. Solving the problem of how to improve data integration is going to be key in getting the most benefit from all the data being created.