Open a newspaper or a Web browser and you're certain to encounter a spate of stories about the misuse or loss of data and how it puts personal information at risk. Over the last decade, as computers and databases have grown ever more sophisticated, privacy concerns have moved to center stage. Today, government agencies worry about keeping highly sensitive financial and health data private. Corporations fret over protecting customer records. And the public grows ever more wary, and distrustful, of organizations that handle sensitive data.
"Privacy issues aren't about to go away," observes Adam Smith, an assistant professor in the computer science and engineering department at the Pennsylvania State University. "One problem we face is that 'privacy' is an overloaded term. It means different things to different people and a lot of issues hinge on context. As a result, it is extremely difficult to create effective solutions and protectionsand to gain the trust that is necessary for respondents to answer sensitive questions honestly."
Some 220 million private records have been lost or stolen in the United States since January 2005, according to the Privacy Rights Clearinghouse, a San Diego, CA-based organization that tracks privacy issues. While no worldwide statistics exist, it's entirely apparent that a tangle of regulations, laws, and best practices cannot solve the problem. Worse, increasingly sophisticated tools make it possible to piece information together and glean details and facts about people in a way that wasn't imaginable a few years ago.
Now, a handful of researchers, mathematicians, and computer scientists are hoping to alter the landscape and frame the debate in new and important ways. Introducing a concept that has been dubbed "differential privacy," these data experts are seeking to use mathematical equations and algorithms to standardize the way computers, and organizations, protect personal data while revealing overall statistical trends. The goal, says Cynthia Dwork, a principal researcher at Microsoft, is to ensure that an adversary cannot compromise data when he or she combines the released statistics with other external sources of information. "It's an extremely attractive approach," she says.
The ability to collect and analyze vast data sets offers substantial promise. Sifting through medical data, genotype and phenotype connections, epidemiological statistics, and their correlation with events such as chemical spills or dietary and exercise patterns can help dictate public policy and find preventive strategies and cures for real people with real afflictions.
Yet, protecting privacy is an increasingly tricky proposition and one that confounds a growing number of organizations. Beyond the widely publicized hacker attacks and security lapses, there's an escalating threat of a person or organization assembling enough pieces of seemingly benign data, sometimes from different sources, to create a useful snapshot of a person or group. Kobbi Nissim, an assistant professor of computer science at Ben-Gurion University, describes this approach as "connecting the dots." Oftentimes, it involves culling seemingly unrelated data from diverse and disparate sources.
It's not an abstract concept. When online movie rental firm Netflix decided to improve its recommendation system in 2007, executives emphasized that they would provide complete customer anonymity to participants. Netflix designed a system that retained the date of each movie rating along with the title and year of its release. And it assigned randomized numbers in place of customer IDs.
This seemed like a perfect system until a pair of researchers (graduate student Arvind Narayanan and professor Vitaly Shmatikov, both from the department of computer sciences at the University of Texas at Austin) proved that it was possible to identify individuals among a half-million participants by using public reviews published in the Internet Movie Database (IMDb) to identify movie ratings within Netflix's data. In fact, eight ratings along with dates were enough to provide 99% accuracy, according to the researchers.
This type of privacy violation, known as a linkage attack (attackers use innocuous data in one data set to identify a record in a second data set with both innocuous and sensitive data), has serious repercussions, Dwork says. It could identify someone who is gay or has an interest in extremely violent or pornographic films. Such information might potentially interfere with a person's employment or affect his or her ability to rent an apartment or belong to a religious organization. "It could result in public humiliation," says Dwork, who notes that "the conclusion may be wrong. Partners share accounts. People buy gifts, and they may have some other reason for renting or buying certain movies."
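The mechanics of a linkage attack can be illustrated with a small sketch. The records, titles, reviewer name, and the three-day date tolerance below are all hypothetical, and the scoring rule is a deliberately simple stand-in rather than the method Narayanan and Shmatikov actually used: it counts how many public reviews agree with an anonymized record on title, rating, and approximate date.

```python
from datetime import date

# Hypothetical "anonymized" release: random ID -> {title: (rating, date rated)}.
anonymized = {
    1001: {
        "Film A": (5, date(2005, 3, 1)),
        "Film B": (2, date(2005, 3, 9)),
        "Film C": (4, date(2005, 4, 2)),
    },
    1002: {
        "Film A": (3, date(2005, 1, 20)),
        "Film D": (5, date(2005, 2, 11)),
    },
}

# Hypothetical public reviews posted under a recognizable name.
public_reviews = {
    "reviewer_x": {
        "Film A": (5, date(2005, 3, 2)),
        "Film B": (2, date(2005, 3, 10)),
    },
}


def match_score(public, private, max_days=3):
    """Count public reviews that agree with a private record on
    title, rating, and date (within max_days, an assumed tolerance)."""
    score = 0
    for title, (rating, when) in public.items():
        if title in private:
            p_rating, p_when = private[title]
            if p_rating == rating and abs((p_when - when).days) <= max_days:
                score += 1
    return score


def best_match(public, dataset):
    """Return the anonymized ID with the highest score, and that score."""
    scores = {rid: match_score(public, rec) for rid, rec in dataset.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


candidate = best_match(public_reviews["reviewer_x"], anonymized)
print(candidate)
```

Once a public name is tied to an anonymized ID, every other rating in that record (here, "Film C") is exposed along with it, which is exactly the sensitive half of the linked data.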
It's not the first time such an event has taken place. In 2006, researchers sifted through anonymized data of 20 million searches performed by 658,000 America Online subscribers. The researchers were able to cull sensitive information (including Social Security numbers, credit card numbers, addresses, and personal habits) by looking at all the searches of a single user (each user received a single randomized number). These same identification methods can be used for social networking sites and to parse through data contained in search engines, Dwork says.
The repercussions are enormous. For example, in the 1990s, a health insurance company that provided coverage for all state employees in the Commonwealth of Massachusetts released general data about the medical histories of anonymized individuals for general research purposes. Only the date of birth, gender, and ZIP code of residence were left in the data. However, a researcher, Latanya Sweeney, now an associate professor of computer science at Carnegie Mellon University, identified the medical history for William Weld, then the governor of Massachusetts. This was possible because the database contained only six people who shared his date of birth, only three of them were men, and Weld was the only one of them in his five-digit ZIP code.
What's remarkable, Dwork says, is that each data element alone isn't a privacy risk. "Most people would probably say, 'No big deal.' Yet, putting these three elements together is enough to identify approximately two-thirds of the population," says Dwork.
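A short sketch shows why the combination matters more than any single field. The five rows below are invented for illustration, and `unique_fraction` is a hypothetical helper: it reports what share of rows are the only ones with their particular combination of values in the chosen columns.

```python
from collections import Counter

# Hypothetical population: (date of birth, gender, five-digit ZIP code).
people = [
    ("1950-01-15", "M", "02101"),
    ("1950-01-15", "F", "02101"),
    ("1962-07-04", "M", "02139"),
    ("1962-07-04", "M", "02139"),
    ("1971-11-30", "F", "02139"),
]


def unique_fraction(rows, columns):
    """Fraction of rows whose values in `columns` appear exactly once."""
    keys = [tuple(row[i] for i in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


print(unique_fraction(people, [1]))        # gender alone
print(unique_fraction(people, [2]))        # ZIP code alone
print(unique_fraction(people, [0, 1, 2]))  # all three combined
```

In this toy data, gender alone and ZIP code alone single out nobody, while the three fields together single out three of the five rows. Real populations are far larger, but the same counting is what underlies the two-thirds figure Dwork cites.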
As government agencies, research institutes, companies, and nonprofit organizations search for ways to boost the value of their data, the pressure to develop better privacy-protecting methods and systems is increasing. "Despite good intentions and software tools designed to thwart breaches, breakdowns continue to take place," says Frank McSherry, a researcher at Microsoft Research Silicon Valley.
Privacy-preserving efforts have undergone a steady evolution during the last quarter-century. Statistics, security, cryptography, and databases have all emerged as topics of interest. However, actual solutions have remained elusive, largely because there's no way to guarantee data privacy with ad-hoc tools and methods. Cryptography, for example, is fine for protecting data from a security standpoint, but it does nothing to mitigate data mining and sophisticated analysis of publicly released or anonymized data. In fact, mathematically rigorous methods have demonstrated that the 25-year-old concept of "semantic security" cannot be achieved for statistical databases.
Differential privacy, which first emerged in 2006 (though its roots go back to 2001), could provide the tipping point for real change. By introducing random noise and ensuring that a database behaves the same, independent of whether any individual or small group is included or excluded from the data set (thus making it impossible to tell which data set was used), it's possible to prevent personal data from being compromised or misused. Pennsylvania State University's Smith says that differential privacy can be applied to numerous environments and settings. "It creates a guideline for defining whether something is acceptable or not," he says.
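One common way to realize the "random noise" idea is the Laplace mechanism, sketched below for a simple counting query. The article does not name a specific mechanism, so treat this as an illustrative assumption: the records, the `private_count` helper, and the privacy parameter `epsilon` are all hypothetical. Adding or removing one person changes a count by at most 1, so noise drawn from a Laplace distribution with scale 1/epsilon makes the answers on the two neighboring data sets statistically hard to tell apart.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Noisy count of records satisfying `predicate`.

    A count has sensitivity 1 (one person changes it by at most 1),
    so the assumed noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def is_flagged(record):
    return record["flagged"]


# Hypothetical data set, and a neighbor with one individual removed.
records = [
    {"id": 0, "flagged": True},
    {"id": 1, "flagged": False},
    {"id": 2, "flagged": True},
    {"id": 3, "flagged": True},
    {"id": 4, "flagged": False},
]
neighbor = records[1:]  # same data without the person with id 0

rng = random.Random(0)  # seeded only to make the sketch repeatable
print(private_count(records, is_flagged, epsilon=0.5, rng=rng))
print(private_count(neighbor, is_flagged, epsilon=0.5, rng=rng))
```

A smaller `epsilon` means more noise and stronger protection; a larger one gives more accurate answers with weaker protection. Averaged over many queries the noise cancels out, which is why a real deployment would also need to limit how often the same question can be asked.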
Government, academic, and business leaders have shown some interest in differential privacy, although the concept is still in the early stages of development and implementation. Currently, differential privacy appears to be the only approach that offers a solid and well-defined method for achieving privacy without making any assumptions about the adversaries' strategy. "There has been a lot of positive feedback about the concept, though it is clearly on the upward slope," McSherry says. "We believe that with further analysis, testing, and tweaking, differential privacy could emerge within the next several years as the gold standard for privacy."
©2008 ACM 0001-0782/08/0900 $5.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2008 ACM, Inc.