
In 2006, AOL released a file containing search queries posed by many of its users. The user names were replaced with random hashes, though the query text was not modified. It turns out some users had queried their own names ("vanity queries") as well as nearby locations like local businesses. As a result, it was not difficult for reporters to find and interview an AOL user1 and then learn personal details about her (such as age and medical history) from the rest of her queries.
Could AOL have protected all its users by also replacing each word in the search queries with a random hash? Probably not; Kumar et al.27 showed that word co-occurrence patterns would provide clues about which hashes correspond to which words, thus allowing an attacker to partially reconstruct the original queries. Such privacy concerns are not unique to Web-search data. Businesses, government agencies, and research groups routinely collect data about individuals and need to release some form of it for a variety of reasons (such as meeting legal requirements, satisfying business obligations, and encouraging reproducible scientific research). However, they must also protect sensitive information, including identities, facts about individuals, trade secrets, and other application-specific considerations, in the raw data. The privacy challenge is that sensitive information can be inferred in many ways from the data releases. Homer et al.20 showed participants in genomic research studies may be identified from publication of aggregated research results. Greveler et al.17 showed smart meter readings can be used to identify the TV shows and movies being watched in a target household. Coull et al.6 showed webpages viewed by users can be deduced from metadata about network flows, even when server IP addresses are replaced with pseudonyms. And Goljan and Fridrich16 showed how cameras can be identified from noise in the images they produce.
Key Insights
- Data snoopers are highly motivated to publicize or take advantage of private information they can deduce from public data.
- History shows simple data anonymization and perturbation methods frequently leak sensitive information.
- Focusing on privacy design principles can help mitigate this risk.
Naive aggregation and perturbation of the raw data often leave exposed channels for making inferences about sensitive information;6,20,32,35 for instance, simply perturbing energy readings from a smart meter independently does not hide trends in energy use. "Privacy mechanisms," or algorithms that transform the data to ensure privacy, must be designed carefully according to guidelines set by a privacy definition. If a privacy definition is chosen wisely by the data curator, the sensitive information will be protected. Unfortunately, privacy definitions are not one-size-fits-all. Each application could have its own unique privacy requirements. Working independently, researchers from disparate fields rediscover similar privacy technologies, along with their weaknesses, new fixes, and other vulnerabilities. Our goal here is to synthesize some of the latest findings in the science of data privacy in order to explain considerations and best practices important for the design of robust privacy definitions for new applications. We begin by describing best practices, then explain how they lead to a generic template for privacy definitions, explore various semantic privacy guarantees achievable with this template, and end with an example of a recent privacy definition based on the template, applying it to privacy-preserving k-means clustering.
Desiderata of Privacy Definitions
When data is collected, the curator, with the aid of a privacy definition, puts it in a form that is safe to release. A privacy definition is a specification for the behavior of randomized and deterministic algorithms. Algorithms that satisfy the spec are called privacy mechanisms. The curator first chooses a privacy definition, then a privacy mechanism 𝓜 satisfying the definition. The curator will run 𝓜 on the sensitive data, then grant external users access to the output of 𝓜, or the "sanitized output."
There is a long history of proposed privacy definitions, new vulnerabilities discovered, and amended privacy definitions developed only to be broken once again. As privacy concerns spread, parallel copies of this process are spawned in many research areas. Fortunately, current research has identified many best practices for engineering robust privacy protections for sensitive data. Although they can be formalized in a mathematically rigorous way, we present them at a more intuitive level, leveraging the following privacy definitions as sources of examples.
Definition 1 (ε-differential privacy9,11). An algorithm 𝓜 satisfies ε-differential privacy if for each of its possible outputs ω and for every pair of databases D1, D2 that differ on the addition or removal of a single record, P(𝓜(D1) = ω) ≤ eε · P(𝓜(D2) = ω).
Intuitively, ε-differential privacy guarantees that adding or removing a record from the data will have little effect on the output of a privacy mechanism 𝓜. For small ε, it means 𝓜 will probably produce the same sanitized output regardless of whether or not Bob's record is in the data.
How should a data curator choose ε? Consider a highly targeted query about an individual (such as asking if Bob's record is in the data). For ε-differential privacy, the most revealing privacy mechanism answers truthfully with probability eε/(1 + eε) and falsely with probability 1/(1 + eε).24 When ε is close to 0, both these probabilities are close to ½, and little information is provided; the mechanism is almost as likely to lie as respond truthfully; for example, when ε = 0.1, the true answer probability is ≈ 0.525, and when ε = 0.01, the probability is ≈ 0.502. We recommend choosing ε based on how close the curator wants this value to be to ½.
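To make this concrete, here is a minimal Python sketch (ours, not from the article) of this most revealing mechanism, often called randomized response; it answers a yes/no question about an individual truthfully with probability eε/(1 + eε) and prints that probability for the values of ε discussed above. The function names are ours.

```python
import math
import random

def truth_probability(epsilon: float) -> float:
    """Probability that the mechanism answers a yes/no question truthfully."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))

def randomized_response(true_answer: bool, epsilon: float) -> bool:
    """Answer truthfully with probability e^eps/(1 + e^eps); otherwise lie."""
    if random.random() < truth_probability(epsilon):
        return true_answer
    return not true_answer

for eps in (0.01, 0.1, 1.0):
    print(f"epsilon = {eps}: truthful with probability {truth_probability(eps):.3f}")
# epsilon = 0.01 -> ~0.502 and epsilon = 0.1 -> ~0.525, matching the text.
```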
The Laplace mechanism is a popular mechanism for ε-differential privacy. Let f be a function that computes a vector of query answers on the data. To each query answer, the Laplace mechanism adds an independent Laplace random variable with mean 0 and standard deviation √2 · S(f)/ε, where S(f) is the global sensitivity of f – the largest possible change in f due to the addition of one record, or the maximum of ||f(D1) − f(D2)||1 over pairs of databases D1, D2 that differ in one record. Intuitively, the noise masks the influence of any single record on the result of f. Now consider:
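The following is a minimal sketch of a Laplace mechanism in Python, assuming a vector of counting queries in which each record contributes to at most one count, so the global sensitivity is 1; numpy's Laplace sampler is parameterized by the scale b = S(f)/ε, which corresponds to the standard deviation √2 · S(f)/ε given above. The query values and function names are ours.

```python
import numpy as np

def laplace_mechanism(query_answers: np.ndarray, sensitivity: float, epsilon: float) -> np.ndarray:
    """Add independent Laplace noise with scale S(f)/epsilon to each query answer."""
    scale = sensitivity / epsilon  # standard deviation is sqrt(2) * scale
    noise = np.random.laplace(loc=0.0, scale=scale, size=query_answers.shape)
    return query_answers + noise

# Example: counts of patients by disease; adding or removing one record
# changes exactly one count by at most 1, so the global sensitivity is 1.
true_counts = np.array([42.0, 17.0, 8.0])
print(laplace_mechanism(true_counts, sensitivity=1.0, epsilon=0.1))
```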
Definition 2 (k-anonymity34,35). Given a set Q of attributes, known as the quasi-identifier, a table is k-anonymous if every record in it has the same quasi-identifier values as at least k–1 other records. An algorithm satisfies k-anonymity if it outputs only k-anonymous tables.
K-anonymity defends against one type of attack called a “linkage attack”—joining an external dataset that associates an identity (such as name) with the quasi-identifier (such as ZIP code and age) to a k-anonymous table containing this publicly available quasi-identifier. Its goal is to prevent the matching of an identity to a single tuple in the k-anonymous table; clearly, there will always be at least k candidates in the join result. K-anonymity mechanisms usually operate by coarsening attributes (such as dropping digits from ZIP codes and changing ages to age ranges); see Figure 1 for two examples of k-anonymous tables.
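As a toy illustration (ours, with hypothetical records rather than the values in Figure 1), the following sketch coarsens ZIP code and age as described and checks that the result is k-anonymous with respect to the quasi-identifier.

```python
from collections import Counter

# Hypothetical records: (ZIP code, age, disease), with disease the sensitive attribute.
records = [("13053", 28, "Heart disease"), ("13068", 29, "Flu"),
           ("13068", 21, "Stroke"), ("13053", 23, "Cancer")]

def coarsen(record):
    """Generalize the quasi-identifier: drop ZIP digits and bucket age into a range."""
    zip_code, age, disease = record
    age_range = "20-29" if 20 <= age <= 29 else "30+"
    return (zip_code[:3] + "**", age_range, disease)

def is_k_anonymous(table, k):
    """Check that every quasi-identifier combination appears in at least k records."""
    counts = Counter((zip_code, age_range) for zip_code, age_range, _ in table)
    return all(count >= k for count in counts.values())

coarsened = [coarsen(r) for r in records]
print(is_k_anonymous(coarsened, k=4))  # True: all four records share ("130**", "20-29")
```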
Security Without Obscurity
The process of sanitizing sensitive data through a privacy mechanism 𝓜 must follow Kerckhoffs's principle21 and ensure privacy even against adversaries who might know the details of 𝓜, except for the specific random bits it may have used. Better yet, the mechanism 𝓜 should be revealed along with the sanitized output.
The reasons for making the mechanism 𝓜 publicly known are twofold: First, history shows "security through obscurity" is unreliable in many applications; and, second, the output of 𝓜 must be useful. This sanitized output often takes the form of a dataset or statistical model and could be intended to support scientific research. In such a case, statistical validity is an important concern, and statistically valid conclusions can be drawn only when scientists know precisely what transformations were applied to the base sensitive data.
Likewise, privacy-mechanism designers should always assume attackers are smarter than they are. Just because the designer of a privacy mechanism cannot deduce sensitive information from the output of a piece of software does not mean an adversary will also fail. A well-engineered privacy definition will overcome these disadvantages, protecting sensitive information from clever attackers who know how 𝓜 operates. We explain how in subsequent sections.
Post-processing. A privacy definition determines the mechanisms that data curators can trust to not leak sensitive information. Let 𝓜 be one such mechanism, and suppose 𝓐 is some algorithm that can be applied to the output of 𝓜; for example, suppose 𝓜 creates synthetic data from its inputs, and 𝓐 builds a decision tree. Let the notation 𝓐∘𝓜 denote the composite algorithm that first applies 𝓜 to the sensitive data and then runs 𝓐 on the sanitized output of 𝓜.
If 𝓜 is trusted, should this composite algorithm 𝓐∘𝓜 also be trusted? Intuitively, the answer is yes. It would be very strange if a data curator released privacy-preserving synthetic data but then claimed building statistical models from this data is a violation of privacy.
A privacy definition is closed under post-processing if 𝓐∘𝓜 satisfies the constraints defining the privacy definition whenever 𝓜 does. Differential privacy11 satisfies this property, but k-anonymity does not.23 Closure under post-processing has two important consequences: First, it ensures compatibility between a privacy definition and Kerckhoffs's principle; for example, some algorithms that satisfy k-anonymity are susceptible to a so-called minimality attack.13,36 For each such k-anonymity mechanism 𝓜, it is possible to craft a post-processing algorithm 𝓐 that takes the output of 𝓜, undoes some of its data transformations, and outputs a new dataset having records with possibly unique quasi-identifier values that are vulnerable to linkage attacks with external data. That is, the composite algorithm 𝓐∘𝓜 does not satisfy the same conditions as 𝓜, or k-anonymity, and often reveals sensitive records.
By contrast, suppose an ε-differentially private algorithm 𝓜 is applied to the data D, and the result 𝓜(D) is published. Given knowledge of 𝓜, a clever adversary can design an attack algorithm 𝓐 and run it on the published data to obtain the result 𝓐(𝓜(D)). Note 𝓐(𝓜(D)) is the result of applying the composite algorithm 𝓐∘𝓜 to the data D. Since ε-differential privacy is closed under post-processing, the composite algorithm 𝓐∘𝓜 still satisfies ε-differential privacy and hence has the same semantics; the output of 𝓐∘𝓜 is barely affected by the presence or absence of Bob's (or any other individual's) record in the database.
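As a small illustration of why this matters in practice (our sketch, assuming the published output is a Laplace-noised count with sensitivity 1): any post-processing step 𝓐, here rounding and clamping to a nonnegative integer, operates only on the sanitized output, so the composite 𝓐∘𝓜 needs no further privacy analysis.

```python
import numpy as np

def dp_count(records: list, epsilon: float) -> float:
    """M: an epsilon-differentially private count via the Laplace mechanism (sensitivity 1)."""
    return len(records) + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

def postprocess(noisy_count: float) -> int:
    """A: round and clamp to a nonnegative integer; touches only the published value."""
    return max(0, round(noisy_count))

published = dp_count(["record"] * 42, epsilon=0.5)  # output of M(D)
print(postprocess(published))                       # A(M(D)) still satisfies 0.5-differential privacy
```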
The second important consequence of closure under post-processing is how a data curator must express privacy definitions. Consider k-anonymity and ε-differential privacy. By analogy to database-query languages, the definition of k-anonymity is declarative; that is, it specifies what we want from the output but not how to produce this output. On the other hand, differential privacy is more procedural, specifying constraints on the input/output behaviors of algorithms through constraints on probabilities (such as P(𝓜(D) = ω)). This is no coincidence; in order to achieve closure under post-processing, it is necessary that the privacy definition impose conditions on the probabilities (even when 𝓜 is deterministic) rather than on the syntactic form of the outputs.22
Composition. We introduce the concept of composition with an example. Suppose the 4-anonymous table in Figure 1 was generated from data from Hospital A, while the 3-anonymous table in Figure 1 was generated by Hospital B. Suppose Alice knows her neighbor Bob was treated by both hospitals for the same condition. What can Alice infer about, say, Bob’s private records? Bob corresponds to an anonymized record in each table. By matching ZIP code, age, and disease, Alice can deduce that Bob must have had a stroke. Each anonymized table individually might have afforded Bob some privacy, but the combination of the two tables together resulted in a privacy breach. The degradation in privacy that results from combining multiple sanitized outputs is known as “composition.”14
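A toy version of Alice's reasoning, using hypothetical tables in place of Figure 1 (whose exact values are not reproduced here): each release alone leaves several candidate diseases for Bob, but intersecting the candidates from the two releases leaves only one.

```python
# Hypothetical coarsened releases from two hospitals; Bob's ZIP code and age
# are consistent with the quasi-identifier groups shown.
hospital_a = [("130**", "20-29", "Stroke"), ("130**", "20-29", "Flu"),
              ("130**", "20-29", "Cancer"), ("130**", "20-29", "Stroke")]
hospital_b = [("1301*", "25-30", "Stroke"), ("1301*", "25-30", "Heart disease"),
              ("1301*", "25-30", "Measles")]

def candidate_diseases(release, zip_pattern, age_range):
    """Diseases of records whose quasi-identifier is consistent with Bob."""
    return {disease for z, a, disease in release if z == zip_pattern and a == age_range}

from_a = candidate_diseases(hospital_a, "130**", "20-29")
from_b = candidate_diseases(hospital_b, "1301*", "25-30")
print(from_a & from_b)  # {'Stroke'}: the two releases together reveal Bob's condition
```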
Self-composition. "Self-composition" refers to the scenario where the sanitized outputs are all produced from privacy mechanisms that satisfy the same privacy definition. Fundamental limits on a privacy definition's ability to withstand composition are part of a growing literature inspired by the results of Dinur and Nissim,7 who showed that the vast majority of records in a database of size n can be reconstructed when n log²(n) statistical queries are answered, even if each answer has been arbitrarily altered to have up to o(√n) error; that is, a distortion that is less than the natural variation of query answers that an adversary would get from collecting a sample of size n from a much larger population.
Despite such negative results that limit the number of times a private database can be queried safely, there can be a graceful degradation of privacy protections, as in the case of ε-differential privacy. If 𝓜1, …, 𝓜k are algorithms such that each 𝓜i satisfies εi-differential privacy, then releasing the outputs of all of them satisfies (ε1 + … + εk)-differential privacy.
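A minimal sketch (ours) of how a curator might track this additive degradation: each release is charged against a total ε budget, and under sequential composition the sum of the charged values bounds the privacy guarantee of everything released so far.

```python
class PrivacyBudget:
    """Track cumulative epsilon across sequentially composed differentially private releases."""

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        """Record a release produced by an epsilon-differentially private mechanism."""
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.5)   # first release
budget.charge(0.3)   # second release; combined guarantee is (0.5 + 0.3)-differential privacy
print(budget.spent)  # 0.8
```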