
Designing Statistical Privacy For Your Data

Preparing data for public release requires significant attention to fundamental principles of privacy.

In 2006, AOL released a file containing search queries posed by many of its users. The user names were replaced with random hashes, though the query text was not modified. It turns out some users had queried their own names, or “vanity queries,” and nearby locations like local businesses. As a result, it was not difficult for reporters to find and interview an AOL user1 and then learn personal details about her (such as age and medical history) from the rest of her queries.

Could AOL have protected all its users by also replacing each word in the search queries with a random hash? Probably not; Kumar et al.27 showed that word co-occurrence patterns would provide clues about which hashes correspond to which words, thus allowing an attacker to partially reconstruct the original queries. Such privacy concerns are not unique to Web-search data. Businesses, government agencies, and research groups routinely collect data about individuals and need to release some form of it for a variety of reasons (such as meeting legal requirements, satisfying business obligations, and encouraging reproducible scientific research). However, they must also protect sensitive information, including identities, facts about individuals, trade secrets, and other application-specific considerations, in the raw data. The privacy challenge is that sensitive information can be inferred in many ways from the data releases. Homer et al.20 showed participants in genomic research studies may be identified from publication of aggregated research results. Greveler et al.17 showed smart meter readings can be used to identify the TV shows and movies being watched in a target household. Coull et al.6 showed webpages viewed by users can be deduced from metadata about network flows, even when server IP addresses are replaced with pseudonyms. And Goljan and Fridrich16 showed how cameras can be identified from noise in the images they produce.


Key Insights

  • Data snoopers are highly motivated to publicize or take advantage of private information they can deduce from public data.
  • History shows simple data anonymization and perturbation methods frequently leak sensitive information.
  • Focusing on privacy design principles can help mitigate this risk.

Naive aggregation and perturbation of the raw data often leave exposed channels for making inferences about sensitive information;6,20,32,35 for instance, simply perturbing energy readings from a smart meter independently does not hide trends in energy use. “Privacy mechanisms,” or algorithms that transform the data to ensure privacy, must be designed carefully according to guidelines set by a privacy definition. If the data curator chooses a privacy definition wisely, the sensitive information will be protected. Unfortunately, privacy definitions are not one-size-fits-all. Each application could have its own unique privacy requirements. Working independently, researchers from disparate fields rediscover similar privacy technologies, along with their weaknesses, new fixes, and other vulnerabilities. Our goal here is to synthesize some of the latest findings in the science of data privacy in order to explain considerations and best practices important for the design of robust privacy definitions for new applications. We begin by describing best practices, then explain how they lead to a generic template for privacy definitions, explore various semantic privacy guarantees achievable with this template, and end with an example of a recent privacy definition based on the template, which we apply to privacy-preserving k-means clustering.


Desiderata of Privacy Definitions

When data is collected, the curator, with the aid of a privacy definition, puts it in a form that is safe to release. A privacy definition is a specification for the behavior of randomized and deterministic algorithms. Algorithms that satisfy the spec are called privacy mechanisms. The curator first chooses a privacy definition, then a privacy mechanism M satisfying the definition. The curator will run M on the sensitive data, then grant external users access to the output of M, or the “sanitized output.”

There is a long history of proposed privacy definitions, new vulnerabilities discovered, and amended privacy definitions developed only to be broken once again. As privacy concerns spread, parallel copies of this process are spawned in many research areas. Fortunately, current research has identified many best practices for engineering robust privacy protections for sensitive data. Although they can be formalized in a mathematically rigorous way, we present them at a more intuitive level, leveraging the following privacy definitions as sources of examples.

Definition 1 (ε-differential privacy9,11). An algorithm M satisfies ε-differential privacy if for each of its possible outputs ω and for every pair of databases D1, D2 that differ on the addition or removal of a single record, P(M(D1) = ω) ≤ e^ε · P(M(D2) = ω).

Intuitively, ε-differential privacy guarantees that adding or removing a record from the data will have little effect on the output of a privacy mechanism M. For small ε, it means M will probably produce the same sanitized output regardless of whether or not Bob’s record is in the data.

How should a data curator choose ε? Consider a highly targeted query about an individual (such as asking if Bob’s record is in the data). For ε-differential privacy, the most revealing privacy mechanism answers truthfully with probability e^ε/(1 + e^ε) and falsely with probability 1/(1 + e^ε).24 When ε is close to 0, both these probabilities are close to ½, and little information is provided; the mechanism is almost as likely to lie as respond truthfully; for example, when ε = 0.1, the true answer probability is ≈ 0.525, and when ε = 0.01, the probability is ≈ 0.502. We recommend choosing ε based on how close the curator wants this value to be to ½.
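
To make these numbers concrete, here is a small Python sketch (ours, not part of the original article) of this most revealing mechanism, often called randomized response; the function names are illustrative.

```python
import math
import random

def truthful_probability(epsilon):
    """Probability e^eps / (1 + e^eps) that the most revealing
    epsilon-differentially private mechanism answers truthfully."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))

def randomized_response(true_answer, epsilon):
    """Answer a yes/no question about one record, lying just often
    enough to satisfy epsilon-differential privacy."""
    return true_answer if random.random() < truthful_probability(epsilon) else not true_answer

for eps in (0.01, 0.1, 1.0):
    print(f"epsilon = {eps}: truthful with probability {truthful_probability(eps):.3f}")
# epsilon = 0.01 -> 0.502, epsilon = 0.1 -> 0.525, epsilon = 1.0 -> 0.731
```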

The Laplace mechanism is a popular mechanism for ε-differential privacy. Let f be a function that computes a vector of query answers on the data. To each query answer, the Laplace mechanism adds an independent Laplace random variable with mean 0 and standard deviation √2 S(f)/ε, where S(f) is the global sensitivity of f – the largest possible change in f due to the addition of one record, or the maximum of ||f(D1) − f(D2)||1 over pairs of databases D1, D2 that differ in one record. Intuitively, the noise masks the influence of any single record on the result of f.
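
The following Python sketch shows one way to implement the Laplace mechanism, assuming numpy and a caller-supplied global sensitivity; the function name and the histogram example are ours, for illustration only.

```python
import numpy as np

def laplace_mechanism(true_answers, sensitivity, epsilon, rng=None):
    """Add independent Laplace noise with scale sensitivity/epsilon
    (standard deviation sqrt(2)*sensitivity/epsilon) to each query answer."""
    rng = np.random.default_rng() if rng is None else rng
    answers = np.asarray(true_answers, dtype=float)
    return answers + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=answers.shape)

# Example: histogram counts, where adding or removing one record changes
# exactly one count by 1, so the global sensitivity S(f) is 1.
noisy_counts = laplace_mechanism([120, 45, 9], sensitivity=1.0, epsilon=0.5)
```

Now consider: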

Definition 2 (k-anonymity34,35). Given a set Q of attributes, known as the quasi-identifier, a table is k-anonymous if every record in it has the same quasi-identifier values as k−1 other records. An algorithm satisfies k-anonymity if it outputs only k-anonymous tables.

K-anonymity defends against one type of attack called a “linkage attack”—joining an external dataset that associates an identity (such as name) with the quasi-identifier (such as ZIP code and age) to a k-anonymous table containing this publicly available quasi-identifier. Its goal is to prevent the matching of an identity to a single tuple in the k-anonymous table; clearly, there will always be at least k candidates in the join result. K-anonymity mechanisms usually operate by coarsening attributes (such as dropping digits from ZIP codes and changing ages to age ranges); see Figure 1 for two examples of k-anonymous tables.
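
As a sketch of how such a release might be produced and checked, the short Python fragment below coarsens ZIP codes and ages and then verifies k-anonymity; the attribute names and values are made up and do not correspond to Figure 1.

```python
from collections import Counter

def is_k_anonymous(table, quasi_identifier, k):
    """True if every quasi-identifier combination appears in at least k records."""
    groups = Counter(tuple(row[a] for a in quasi_identifier) for row in table)
    return all(count >= k for count in groups.values())

def coarsen(row):
    """One possible coarsening: keep a 3-digit ZIP prefix and a 10-year age range."""
    low = (row["age"] // 10) * 10
    return {**row, "zip": row["zip"][:3] + "**", "age": f"{low}-{low + 9}"}

raw = [{"zip": "47677", "age": 29, "disease": "stroke"},
       {"zip": "47602", "age": 22, "disease": "flu"},
       {"zip": "47678", "age": 27, "disease": "stroke"},
       {"zip": "47606", "age": 23, "disease": "flu"}]
release = [coarsen(r) for r in raw]
print(is_k_anonymous(release, ["zip", "age"], k=2))   # True
```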


Security Without Obscurity

The process of sanitizing sensitive data through a privacy mechanism M must follow Kerckhoffs’s principle21 and ensure privacy even against adversaries who might know the details of M, except for the specific random bits it may have used. Better yet, the mechanism M must be revealed along with the sanitized output.

The reasons for making the mechanism M publicly known are twofold: First, history shows “security through obscurity” is unreliable in many applications; and, second, the output of M must be useful. This sanitized output often takes the form of a dataset or statistical model and could be intended to support scientific research. In such a case, statistical validity is an important concern, and statistically valid conclusions can be drawn only when scientists know precisely what transformations were applied to the base sensitive data.

Likewise, privacy-mechanism designers should always assume attackers are smarter than they are. Just because the designer of a privacy mechanism cannot deduce sensitive information from the output of a piece of software does not mean an adversary will also fail. A well-engineered privacy definition will overcome these disadvantages, protecting sensitive information from clever attackers who know how M operates. We explain how in subsequent sections.

Post-processing. A privacy definition determines the mechanisms that data curators can trust to not leak sensitive information. Let M be one such mechanism, and suppose A is some algorithm that can be applied to the output of M; for example, suppose M creates synthetic data from its inputs, and A builds a decision tree. Let the notation A ∘ M denote the composite algorithm that first applies M to the sensitive data and then runs A on the sanitized output of M.

If M is trusted, should this composite algorithm A ∘ M also be trusted? Intuitively, the answer is yes. It would be very strange if a data curator released privacy-preserving synthetic data but then claimed building statistical models from this data is a violation of privacy.

A privacy definition is closed under post-processing if A ∘ M satisfies the constraints defining the privacy definition whenever M does. Differential privacy11 satisfies this property, but k-anonymity does not.23 Closure under post-processing has two important consequences: First, it ensures compatibility between a privacy definition and Kerckhoffs’s principle; for example, some algorithms that satisfy k-anonymity are susceptible to a so-called minimality attack.13,36 For each such k-anonymity mechanism M, it is possible to craft a post-processing algorithm A that takes the output of M, undoes some of its data transformations, and outputs a new dataset having records with possibly unique quasi-identifier values that are vulnerable to linkage attacks with external data. That is, the composite algorithm A ∘ M does not satisfy the same conditions as M (that is, k-anonymity) and often reveals sensitive records.

By contrast, suppose an ε-differentially private algorithm M is applied to the data D, and the result M(D) is published. Given knowledge of M, a clever adversary can design an attack algorithm A and run it on the published data to obtain the result A(M(D)). Note A(M(D)) is the result of applying the composite algorithm A ∘ M to the data D. Since ε-differential privacy is closed under post-processing, the composite algorithm A ∘ M still satisfies ε-differential privacy and hence has the same semantics; the output of A ∘ M is barely affected by the presence or absence of Bob’s (or any other individual’s) record in the database.
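
As a concrete illustration (ours, under the same assumptions as the earlier Laplace sketch), post-processing noisy counts by rounding and clamping touches only the sanitized output, so the composite release enjoys the same guarantee.

```python
import numpy as np

def postprocess(noisy_counts):
    """A: round the noisy counts and clamp negatives to zero; this step
    reads only M's sanitized output, never the raw data."""
    return np.clip(np.rint(noisy_counts), 0, None).astype(int)

# M: release histogram counts with Laplace noise (sensitivity 1, epsilon 0.5).
rng = np.random.default_rng(0)
true_counts = np.array([120.0, 45.0, 9.0])
noisy_counts = true_counts + rng.laplace(scale=1.0 / 0.5, size=true_counts.shape)

# A ∘ M: because A depends only on M's output, this composite release
# satisfies the same 0.5-differential privacy guarantee as M alone.
released = postprocess(noisy_counts)
```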

The second important consequence of closure under post-processing is how a data curator must express privacy definitions. Consider k-anonymity and ε-differential privacy. By analogy to database-query languages, the definition of k-anonymity is declarative; that is, it specifies what we want from the output but not how to produce this output. On the other hand, differential privacy is more procedural, specifying constraints on the input/output behaviors of algorithms through constraints on probabilities (such as P(M(D) = ω)). This is no coincidence; in order to achieve closure under post-processing, it is necessary that the privacy definition impose conditions on the probabilities (even when M is deterministic) rather than on the syntactic form of the outputs.22

Composition. We introduce the concept of composition with an example. Suppose the 4-anonymous table in Figure 1 was generated from data from Hospital A, while the 3-anonymous table in Figure 1 was generated by Hospital B. Suppose Alice knows her neighbor Bob was treated by both hospitals for the same condition. What can Alice infer about, say, Bob’s private records? Bob corresponds to an anonymized record in each table. By matching ZIP code, age, and disease, Alice can deduce that Bob must have had a stroke. Each anonymized table individually might have afforded Bob some privacy, but the combination of the two tables together resulted in a privacy breach. The degradation in privacy that results from combining multiple sanitized outputs is known as “composition.”14
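
The intersection attack can be made concrete with a few lines of Python; the two coarsened tables below are hypothetical stand-ins for the tables in Figure 1, not the article’s actual data.

```python
def candidate_diseases(release, zip_code, age):
    """Diseases of records whose coarsened quasi-identifier is consistent
    with the target's ZIP code and age."""
    return {row["disease"] for row in release
            if zip_code.startswith(row["zip"].rstrip("*"))
            and row["age_low"] <= age <= row["age_high"]}

# Hypothetical k-anonymous releases from the two hospitals.
hospital_a = [{"zip": "476**", "age_low": 20, "age_high": 29, "disease": "stroke"},
              {"zip": "476**", "age_low": 20, "age_high": 29, "disease": "flu"}]
hospital_b = [{"zip": "4767*", "age_low": 25, "age_high": 30, "disease": "stroke"},
              {"zip": "4767*", "age_low": 25, "age_high": 30, "disease": "cancer"}]

# Alice knows Bob (ZIP 47677, age 27) is in both releases with the same
# condition, so intersecting the candidate sets pins down his disease.
print(candidate_diseases(hospital_a, "47677", 27)
      & candidate_diseases(hospital_b, "47677", 27))   # {'stroke'}
```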

Self-composition. “Self-composition” refers to the scenario where the sanitized outputs are all produced from privacy mechanisms that satisfy the same privacy definition. Fundamental limits on a privacy definition’s ability to withstand composition are part of a growing literature inspired by the results of Dinur and Nissim,7 who showed that the vast majority of records in a database of size n can be reconstructed when n log²(n) statistical queries are answered, even if each answer has been arbitrarily altered to have up to o(√n) error; that is, a distortion that is less than the natural variation of query answers that an adversary would get from collecting a sample of size n from a much larger population.
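
The flavor of this result can be seen in a toy Python sketch (ours): answer roughly n log²(n) random subset-sum queries about a synthetic bit-vector database with small bounded noise, then recover nearly all of the bits with an ordinary least-squares fit, a simpler stand-in for the linear-programming attack Dinur and Nissim analyze.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 128                                   # database of n secret bits
secret = rng.integers(0, 2, size=n)

# Answer about n*log^2(n) random subset-sum queries, each distorted by
# bounded noise well below sqrt(n).
m = n * int(np.log2(n)) ** 2
queries = rng.integers(0, 2, size=(m, n))
noise = rng.integers(-1, 2, size=m)       # noise in {-1, 0, 1}
answers = queries @ secret + noise

# Reconstruction attack: least-squares fit, then round to bits.
estimate, *_ = np.linalg.lstsq(queries, answers, rcond=None)
reconstructed = (estimate > 0.5).astype(int)
print("fraction of records recovered:", (reconstructed == secret).mean())
```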


Despite such negative results that limit the number of times a private database can be queried safely, there can be a graceful degradation of privacy protections, as in the case of ε-differential privacy. If M1,