Government agencies worldwide are required to release statistical information about population, education, health, crime, and economic activity. In the U.S., the protection of such data goes back to the 19th century, when Carroll Wright, the first head of the Bureau of Labor Statistics, which was established in 1885, argued that protecting the confidentiality of the Bureau's data was essential: if enterprises feared that the data the Bureau collected about them would be shared with competitors, investigators, or the tax authorities, data quality would suffer severely. The field of statistical disclosure limitation was born.4
Fast-forward a few decades, and Stanley Warner faced a similar conundrum. During interviews for market surveys, individuals would refuse to answer questions on sensitive or controversial issues "for reasons of modesty, fear of being thought bigoted, or merely a reluctance to confide secrets to strangers."7 His answer was a technique in which the interviewee flips a biased coin without showing the outcome to the interviewer. Depending on the outcome of the coin flip, the interviewee either answers the original yes/no question truthfully or negates her answer. This method intuitively protects the interviewee, since her answer could always have been due to the coin landing the other way.
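Warner's technique is simple enough to sketch in a few lines. The Python fragment below is only an illustration (the names and the coin bias of 0.75 are choices made here, not taken from Warner's paper): it shows the respondent's randomized answer and how a surveyor can still recover an unbiased estimate of the true "yes" proportion from the noisy reports.

```python
import random

def randomized_response(true_answer: bool, p_truth: float = 0.75) -> bool:
    """Warner-style randomized response (illustrative sketch).

    With probability p_truth the respondent answers the sensitive yes/no
    question truthfully; otherwise she reports the negation. The coin flip
    is never revealed to the interviewer.
    """
    if random.random() < p_truth:
        return true_answer
    return not true_answer

def estimate_true_proportion(reports: list, p_truth: float = 0.75) -> float:
    """Unbiased estimate of the true 'yes' proportion from noisy reports.

    If pi is the true proportion, the expected reported 'yes' rate is
    p_truth * pi + (1 - p_truth) * (1 - pi); solving for pi gives the
    estimator below (requires p_truth != 0.5).
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)
```

The estimate is noisier than a direct survey, but no single answer exposes the respondent: either response is plausibly explained by the coin.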
Tore Dalenius formulated a very strong notion of protection a decade later:2 "If the release of the statistic S makes it possible to determine the (microdata) value more accurately than without access to S, a disclosure has taken place…". This notion, akin to semantic security, implies that data publishers must reason about adversaries and their background knowledge, since the published data could give an adversary new information.
Fast-forward a few more decades to the turn of the century. Statisticians had developed many methods for limiting disclosure when publishing data, such as suppression, sampling, swapping, generalization (also called coarsening), synthetic data generation, data perturbation, and the publishing of marginals for contingency tables, to name just a few. These methods are applied in practice, but they do not provide formal privacy guarantees: they do not formally bound how much an attacker can learn, and they preserve confidentiality in part by keeping the parameters of the methods secret.
Fast-forward to 1999. In his Innovations Award talk at the annual ACM SIGKDD Conference, Rakesh Agrawal posed the challenge of privacy-preserving data mining to the community. The next year, two papers with the same title, "Privacy Preserving Data Mining" (one by Agrawal and Srikant1 and the other by Lindell and Pinkas5), were published, and the computer science community entered the picture.
Computer scientists were especially intrigued by formal models of data privacy: formal definitions of information leakage and attacker models of the kind pioneered in cryptography and computer security. The strongest formal definition of disclosure in use today is differential privacy, as pioneered by Dwork, McSherry, Nissim, and Smith.3 Differential privacy beautifully captures the intuitive notion that the published output should not reveal much about an individual, whether or not that individual's data was included in the input.
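For readers who want the formal statement (given here in the standard notation of the differential privacy literature, not as a formula from this article): a randomized mechanism $M$ satisfies $\epsilon$-differential privacy if, for every pair of datasets $D$ and $D'$ that differ in one individual's record and every set $S$ of possible outputs,

\[
\Pr[M(D) \in S] \;\le\; e^{\epsilon} \, \Pr[M(D') \in S].
\]

Smaller values of $\epsilon$ mean the output distribution barely changes when any one person's data is added or removed, which is exactly the intuition stated above.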
Since its original proposal, much progress has been made on mechanisms that publish data under differential privacy while maximizing information content. National statistical offices have also started to pay attention; for example, OnTheMap, a U.S. Census Bureau application that provides maps showing where workers live and are employed, is now published with a variant of differential privacy.6
The following paper by Frank McSherry introduces PINQ, a system that integrates differential privacy into the C# LINQ framework, which adds database query functionality to C#. PINQ enables queries over data while elegantly hiding the complexity of the underlying differentially private mechanisms. Users of PINQ write programs that look almost identical to standard LINQ programs, but PINQ ensures that all query answers satisfy differential privacy, and it composes the information leakage across queries until the program's privacy budget runs out.
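PINQ itself is a C# library built on LINQ, but the bookkeeping it performs is compact enough to sketch. The Python fragment below is a hypothetical illustration of the pattern the paper describes, not PINQ's actual API: aggregates answered with Laplace noise, and a privacy budget that is debited on every query and refuses further queries once it is exhausted.

```python
import random

class PrivacyBudgetExceeded(Exception):
    """Raised once the analyst has spent the entire privacy budget."""

class PrivateDataset:
    """Illustrative sketch of PINQ-style budget accounting (hypothetical API)."""

    def __init__(self, records, total_budget):
        self._records = list(records)
        self._remaining = total_budget  # total epsilon available to the analyst

    def _charge(self, epsilon):
        # Sequential composition: the costs of successive queries add up.
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise PrivacyBudgetExceeded("privacy budget exhausted")
        self._remaining -= epsilon

    def noisy_count(self, predicate, epsilon):
        """Count records matching `predicate`, plus Laplace(1/epsilon) noise.

        A counting query has sensitivity 1 (one person's record changes it
        by at most 1), so Laplace noise with scale 1/epsilon yields
        epsilon-differential privacy for this query.
        """
        self._charge(epsilon)
        true_count = sum(1 for r in self._records if predicate(r))
        # The difference of two Exponential(rate=epsilon) draws is Laplace(0, 1/epsilon).
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

# Usage: each call spends part of the budget; a third call of the same cost would fail.
data = PrivateDataset([{"smoker": True}, {"smoker": False}], total_budget=0.2)
print(data.noisy_count(lambda r: r["smoker"], epsilon=0.1))
print(data.noisy_count(lambda r: not r["smoker"], epsilon=0.1))
```

The real system goes much further: PINQ also tracks how transformations such as filtering, grouping, and joins affect the sensitivity of downstream aggregates, so that the noise and the budget charge remain correct for arbitrary LINQ-style query plans.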
Differential privacy and PINQ give only a glimpse into an exciting new area at the confluence of ideas from computer science, statistics, law, and the social sciences. I believe we will see much further progress on formal privacy definitions and improved methods, and I hope that future data products from the national statistical offices will be published with some formal notion of disclosure control.
Carroll Wright would be amazed by the field today.