Challenges of the Brave New Data World

This year's Heidelberg Laureate Forum brought together some 200 young researchers from computer science and mathematics and 26 laureates of the most important awards in those fields.

Big data, and the automated algorithmic decisions that are increasingly being taken on the basis of it, are here to stay, so scientists, citizens, governments, and enterprises have the responsibility to consider both the benefits and the dangers of big data.

This was the inspiration behind the Hot Topic session ‘Brave New Data World’ during the third Heidelberg Laureate Forum (HLF), which took place Aug. 23-28 in Heidelberg, Germany. The forum brought together 200 young researchers from computer science and mathematics and 26 laureates of the most important awards in those fields: the ACM A.M. Turing Award, the Nevanlinna Prize, the Fields Medal, and the Abel Prize.

Participants in the session made it clear there are some areas where the benefits of big data are indisputable, and dangers or pitfalls on the personal level barely exist.

The Hot Topic Panel: from left, Jeremy Gillula, Megan Price, Ciro Cattuto, session leader Michele Catanzaro, Peter Ryan, Kristin Tolle, and Alessandro Acquisti.

Kristin Tolle, director of the Data Science Initiative of Microsoft Research Outreach, discussed the U.S. National Flood Interoperability Experiment. Of all the natural disasters that occur in the U.S., floods cause the greatest number of casualties each year. By using open geographical data on water levels and river flows (data that is freely available because it has been collected with taxpayer money), the hydrological models that predict floods can be significantly improved.

Tolle demonstrated how the combined datasets “are the star of the show.” A pilot study from 2014 showed the data can literally save lives, and can also be used to generate more reliable local alerts and to aid first responders to a flood. However, Tolle pointed out, plenty of challenges remain (primarily on the computer science side), such as how best to share data collected by different institutions, how to make such data interoperable, and how to integrate the data in space and time.

Those challenges increase when collecting big data about people’s behavior, given social science and ethics considerations. From the ethical perspective, obviously privacy is the main concern.

Alessandro Acquisti, professor of Information Technology and Public Policy at the Heinz College of Carnegie Mellon University, showed examples of successful data accretion, combining different databases that are anonymized separately, but which collectively can be used to de-anonymize the underlying data. He offered the results of an experiment in which photos of individuals, taken with their consent, were combined with Facebook data to find the names and even Social Security numbers of a significant percentage of the participants.
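To make the idea of data accretion concrete, here is a minimal sketch of a linkage attack in Python. It is not Acquisti's experiment (which relied on photos and Facebook data); it only illustrates the general principle under simple assumptions. All records, names, and field choices below are hypothetical toy data: one table holds "anonymized" records without names, the other is a public list with names, and the two share a few quasi-identifiers (ZIP code, birth year, sex).

```python
# Hypothetical toy data: neither table alone pairs a name with a diagnosis.
health_records = [
    {"zip": "15213", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "15213", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "15217", "birth_year": 1980, "sex": "F", "diagnosis": "migraine"},
]
public_profiles = [
    {"name": "Alice Example", "zip": "15213", "birth_year": 1980, "sex": "F"},
    {"name": "Bob Example", "zip": "15213", "birth_year": 1975, "sex": "M"},
    {"name": "Carol Example", "zip": "15217", "birth_year": 1991, "sex": "F"},
]

# Assumed quasi-identifiers: fields that appear in both tables.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link_records(anonymous, named, keys=QUASI_IDENTIFIERS):
    """Return (name, anonymous_record) pairs whose keys match exactly one named person."""
    # Index the named table by its quasi-identifier values.
    index = {}
    for person in named:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    matches = []
    for record in anonymous:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], record))
    return matches


reidentified = link_records(health_records, public_profiles)
for name, record in reidentified:
    print(name, "->", record["diagnosis"])
```

In this toy case, two of the three nameless records are tied back to a named person, even though each table looks harmless on its own. Real datasets are larger and noisier, but the more attributes two sources share, the more likely a combination of values is unique to one individual.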

Acquisti said some often-cited assumptions in the privacy debate, such as “people do not care about privacy any more,” or that “privacy is a modern invention,” are just not true. In fact, he said, there even seems to be a hard-wired need for a certain amount of privacy in the brains of people and animals. The recent hack of the Ashley Madison extramarital dating site has shown people’s need for privacy is alive and kicking, and violating this privacy might even lead to suicide.

Examples of de-anonymizing by data accretion led to the big question of whether privacy and big data can coexist at all. Jeremy Gillula of the Electronic Frontier Foundation (EFF) answered that question with a ‘yes, but…’. According to Gillula, big data in the personal realm can only coexist with privacy if organizations do not succumb to collecting all data, if they do not try to be “sneaky,” and if users take encryption seriously. “Encrypt, encrypt and encrypt,” said Gillula, because encryption makes tracking much more difficult for what he calls the “surveillance industrial complex”: governments and the companies that build software tools to track the behavior of citizens without their consent.

The desire to collect all data that can possibly be collected, “because you never know when and how they might be useful,” is a pitfall, said Gillula. Data is often biased, and finding correlations from big data does not automatically reveal whether there is a causal (or any direct) relationship.

Ciro Cattuto, of the Institute for Scientific Interchange (ISI) Foundation in Torino, Italy, said an increasing number of companies are using “black boxes” to score their customers, based on their behaviors. For example, an insurance company may use data from a “SmartBox” installed in a vehicle to score one’s driving behavior at night; on the basis of that data, the price the driver pays for car insurance may be adjusted. “More transparency is definitely needed in what these black boxes do,” said Cattuto. “There are plenty of dangers of misclassification and algorithmic discrimination.”

Big data touches on problems from a number of different fields, including computer science, statistics, the social sciences, and ethics. It is clear there are no one-size-fits-all solutions, and it makes a huge difference whether the data is personal or not. If there was one take-home lesson from the discussion at the HLF, it is that computer scientists must recognize the potential statistical, social science, and ethical pitfalls and challenges when tackling a big data problem, and must not assume these are only somebody else’s responsibility.

Bennie Mols is a science and technology writer based in Amsterdam, the Netherlands.