K-Anonymity Privacy Protection Model Needs a Little Help

A new method of providing anonymity to large data sets has raised excitement in realms as diverse as social networks and medical records. But the "K-anonymity" protection model, in which so-called "nonkey attributes" (gender, Zip code, birth date, etc.) are suppressed or generalized, appears to need a little help. Some researchers propose using it in conjunction with complementary methods such as "L-diversity." A group based at Stanford University has proposed adding a clustering method to K-anonymity to broaden its appeal in modern networks. In fact, other derivatives of K-anonymity already are being proposed for real-time privacy enablers in social networks such as Facebook.

The Stanford team published its work on clustering and K-anonymity in the June 2010 ACM Transactions on Algorithms. Team leader Tomas Feder says that the team has not been able to follow up its original studies since then, as Ph.D. candidates in charge of the project have gone to Google Inc., Microsoft Research Labs, and Oracle, but initial work holds out the hope for keeping information private even when attributes of individuals are used in large data sets for the U.S. Census, group studies, and double-blind medical studies.

The problem addressed by K-anonymity arose due to the unexpected power of large-scale data mining. In order to identify epidemics or predict purchasing patterns, researchers often publish charts with survey results listing "quasi-identifiers," or information such as gender that was not considered central to learning a user’s identity. It turns out that when multiple quasi-identifiers are displayed, a unique individual corresponding to that set can be found in 87 percent of cases. Pierangela Samarati was the first to propose "K-anonymity" in 1998, a method that suppressed many of these quasi-identifier fields. The goal in optimizing the algorithm is to insure that at least K-1 other records correspond to the record that could theoretically identify one person.

The Stanford group suggested that these methods could be more powerful by using the quasi-identifiers to define a metric space over the database, cluster the points in that space, and publish only the cluster information. The most powerful version studied by Feder’s group looked at publishing the radius of each information cluster, leading to a "cellular cluster." The group concluded that more information could be published about all data sets in a cellular-cluster model, as compared to general K-anonymity.

Meanwhile, a group at Cornell University has defined a way to re-define queries to databases to preserve diversity by modeling background knowledge as a probability space. In this method, the developer must be cognizant of the sensitivity of certain query types that can be made—in the example cited by the authors, the set of people extremely unlikely to be exposed to ebola virus is far less sensitive than the subset of possible candidates for ebola. The team under Professor Johannes Gehrke used a K-anonymity tool called Incognito to study diversity opportunities in data sets from Lands’ End and other vendors.

Will any of these methods have near-term payoff? Aaron Beach, Mike Gartrell, and Richard Han at The University of Colorado at Boulder already have developed a K-anonymity tool for the Facebook API called "Social-K," in which K-anonymity rules are modified to adjust to the real-time nature of privacy tools within Facebook. The CU Boulder researchers were looking for the special case of Facebook to Netflix links, to ensure that combined use of data sets would retain the privacy of the user.