Home/Magazine Archive/January 2011 (Vol. 54, No. 1)/A Firm Foundation for Private Data Analysis/Full Text

Review article
# A Firm Foundation for Private Data Analysis

In the information realm, loss of privacy is usually associated with failure to control access to information, to control the flow of information, or to control the purposes for which information is employed. Differential privacy arose in a context in which ensuring privacy is a challenge even if all these control problems are solved: privacy-preserving statistical analysis of data.

The problem of *statistical disclosure
control*—revealing accurate statistics about a set of
respondents while preserving the privacy of individuals—has
a venerable history, with an extensive literature spanning
statistics, theoretical computer science, security, databases,
and cryptography (see, for example, the excellent survey of Adam
and Wortmann,^{1} the discussion of
related work in Blum et al.,^{2} and the
*Journal of Official Statistics* dedicated to
confidentiality and disclosure control).

This long history is a testament to the importance of the problem. Statistical databases can be of enormous social value; they are used for apportioning resources, evaluating medical therapies, understanding the spread of disease, improving economic utility, and informing us about ourselves as a species.

The data may be obtained in diverse ways. Some data, such as census, tax, and other sorts of official data, is compelled; other data is collected opportunistically, for example, from traffic on the Internet, transactions on Amazon, and search engine query logs; other data is provided altruistically, by respondents who hope that sharing their information will help others to avoid a specific misfortune, or more generally, to increase the public good. Altruistic data donors are typically promised their individual data will be kept confidential—in short, they are promised "privacy." Similarly, medical data and legally compelled data, such as census data and tax return data, have legal privacy mandates. In my view, ethics demand that opportunistically obtained data should be treated no differently, especially when there is no reasonable alternative to engaging in the actions that generate the data in question.

The problems remain: even if data encryption, key management, access control, and the motives of the data curator are all unimpeachable, what does it mean to preserve privacy, and how can it be accomplished?

Let us consider a few common suggestions and some of the difficulties they can encounter.

** Large Query Sets.** One frequent suggestion is to
disallow queries about a specific individual or small set of
individuals. A well-known differencing argument demonstrates the
inadequacy of the suggestion. Suppose it is known that Mr. X is
in a certain medical database. Taken together, the answers to the
two large queries "How many people in the database have the
sickle cell trait?" and "How many people, not named X, in the
database have the sickle cell trait?" yield the sickle cell
status of Mr. X. The example also shows that encrypting the data,
another frequent suggestion (oddly), would be of no help at all.
The privacy compromise arises from correct operation of the
database.

In *query auditing*, each query to the database is
evaluated in the context of the query history to determine if a
response would be disclosive; if so, then the query is refused.
For example, query auditing might be used to interdict the pair
of queries about sickle cell trait just described. This approach
is problematic for several reasons, among them that query
monitoring is computationally
infeasible^{16} and that the refusal to
respond to a query may itself be
disclosive.^{15}

We think of a database as a collection of *rows*, with
each row containing the data of a different respondent. In
*subsampling* a subset of the rows is chosen at random and
released. Statistics can then be computed on the subsample and,
if the subsample is sufficiently large, these may be
representative of the dataset as a whole. If the size of the
subsample is very small compared to the size of the dataset, this
approach has the property that every respondent is unlikely to
appear in the subsample. However, this is clearly insufficient:
Suppose appearing in a subsample has terrible consequences. Then
every time subsampling occurs *some* individual suffers
horribly.

In *input perturbation*, either the data or the queries
are modified before a response is generated. This broad category
encompasses a generalization of subsampling, in which the curator
first chooses, based on a secret, random, function of the query,
a subsample from the database, and then returns the result
obtained by applying the query to the
subsample.^{4} A nice feature of this
approach is that repeating the same query yields the same answer,
while semantically equivalent but syntactially different queries
are made on essentially unrelated subsamples. However, an outlier
may only be protected by the unlikelihood of being in the
subsample.

In what is traditionally called *randomized response*,
the data itself is randomized once and for all and statistics are
computed from the noisy responses, taking into account the
distribution on the perturbation.^{23}
The term "randomized response" comes from the practice of having
the respondents to a survey flip a coin and, based on the
outcome, answering an invasive yes/no question or answering a
more emotionally neutral one. In the computer science literature
the choice governed by the coin flip is usually between honestly
reporting one's value and responding randomly, typically by
flipping a second coin and reporting the outcome. Randomized
response was devised for the setting in which the individuals do
not trust the curator, so we can think of the randomized
responses as simply being published. Privacy comes from the
uncertainty of how to interpret a reported value. The approach
becomes untenable for complex data.

*Adding random noise to the output* has promise, and we
will return to it later; here we point out that if done
naïvely this approach will fail. To see this, suppose the
noise has mean zero and that fresh randomness is used in
generating every response. In this case, if the same query is
asked repeatedly, then the responses can be averaged, and the
true answer will eventually emerge. This is disastrous: an
adversarial analyst could exploit this to carry out the
difference attack described above. The approach cannot be "fixed"
by recording each query and providing the same response each time
a query is re-issued. There are several reasons for this. For
example, syntactically different queries may be semantically
equivalent, and if the query language is sufficiently rich, then
the equivalence problem itself is undecidable, so the curator
cannot even test for this.

Problems with noise addition arise even when successive
queries are completely unrelated to previous
queries.^{5} Let us assume for simplicity
that the database consists of a single—but very
sensitive—bit per person, so we can think of the database
as an *n*-bit Boolean vector *d* =
(*d*_{1}, ..., *d*_{n}).
This is an abstraction of a setting in which the database rows
are quite complex, for example, they may be medical records, but
the attacker is interested in one specific field, such as HIV
status. The abstracted attack consists of issuing a string of
queries, each described by a subset *S* of the database
rows. The query is asking how many 1's are in the selected rows.
Representing the query as the *n*-bit characteristic vector
**S** of the set *S*, with 1's in all the positions
corresponding to rows in *S* and 0's everywhere else; the
true answer to the query is the inner product .
Suppose the privacy mechanism responds with *A*(*S*) +
random noise. How much noise is needed in order to preserve
privacy?

Since we have not yet defined privacy, let us consider the
easier problem of avoiding blatant "non-privacy," defined as
follows: the system is blatantly non-private if an adversary can
construct a candidate database that agrees with the real database
*D* in, say, 99% of the entries. An easy consequence of the
following theorem is that a privacy mechanism adding noise with
magnitude always bounded by, say, *n*/401 is blatantly
non-private against an adversary that can ask all
2^{n} possible
queries.^{5} There is nothing special
about 401; any number exceeding 400 would work.

THEOREM 1. *Let M be a mechanism that adds noise bounded by
E. Then there exists an adversary that can reconstruct the
database to within* 4*E
positions*.^{5}

Blatant non-privacy with *E* = *n*/401 follows
immediately from the theorem, as the reconstruction will be
accurate in all but at most positions.

PROOF. Let *d* be the true database. The adversary can
attack in two phases:

**Estimate the number of 1's in all possible sets:**Query*M*on all subsets*S*Í [*n*].**Rule out "distant" databases:**For every candidate database*c*{0, 1}^{n}, if, for any*S*Í [*n*], |Σ_{iS}*c*_{i}–*M*(*S*)| >*E*, then rule out*c*. If*c*is not ruled out, then output*c*and halt.

Since *M* (*S*) never errs by more than *E*,
the real database will not be ruled out, so this simple (but
inefficient!) algorithm will output *some* database; let us
call it *c*. We will argue that the number of positions in
which *c* and *d* differ is at most 4 ·
*E*.

Let *I*_{0} be the indices in which
*d*_{i} = 0, that is, *I*_{0} =
{*i* | *d*_{i} = 0}. Similarly, define
*I*_{1} = {*i* | *d*_{i} =
1}. Since *c* was not ruled out, . However, by
assumption . It follows from the triangle inequality
that *c* and *d* differ in at most 2*E* positions
in *I*_{0}; the same argument shows that they differ
in at most 2*E* positions in *I*_{1}. Thus,
*c* and *d* agree on all but at most 4*E*
positions.

What if we consider more realistic bounds on the number of
queries? We think of as an interesting threshold on
noise, for the following reason: If the database contains
*n* people drawn uniformly at random from a population of
size *N* >> *n*, and the fraction of the
population satisfying a given condition is *p*, then we
expect the number of rows in the database satisfying *p* to
be roughly *np* ± Θ (), by the
properties of the hypergeometric distribution. That is, the
sampling error is on the order of . We would like
that the noise introduced for privacy is smaller than the
sampling error, ideally *o*(). Unfortunately,
noise of magnitude *o*() is blatantly
non-private against a series of *n* log^{2} *n
randomly generated* queries,^{5} no
matter the distribution on the noise. Several strengthenings of
this pioneering result are now known. For example, if the entries
in *S* are chosen independently according to a standard
normal distribution, then blatant non-privacy continues to hold
even against an adversary asking only Θ(*n*)
questions, and even if more than a fifth of the responses have
arbitrarily wild noise magnitudes, provided the other responses
have noise magnitude
*o*().^{8}

These are not just interesting mathematical exercises. We have
been focusing on *interactive* privacy mechanisms,
distinguished by the involvement of the curator in answering each
query. In the noninteractive setting the curator publishes some
information of arbitrary form, and the data is not used further.
Research statisticians like to "look at the data," and we have
frequently been asked for a method of generating a "noisy table"
that will permit highly accurate answers to be derived for
computations that are not specified at the outset. The noise
bounds say this is impossible: No such table can safely provide
very accurate answers to too many weighted subset sum questions;
otherwise the table could be used in a simulation of the
interactive mechanism, and an attack could be mounted against the
table. Thus, even if the analyst only requires the responses to a
small number of unspecified queries, the fact that the table can
be exploited to gain answers to other queries is problematic.

In the case of "Internet scale" data-sets, obtaining responses
to, say, *n* ≥ 10^{8} queries is infeasible. What
happens if the curator permits only a sublinear number of
questions? This inquiry led to the first algorithmic results in
differential privacy, in which it was shown how to maintain
privacy against a sublinear number of *counting* queries,
that is, queries of the form "How many rows in the database
satisfy property *P*?" by adding noise of order
*o*()—less than the sampling
error—to each answer.^{12} The
cumbersome privacy guarantee, which focused on the question of
what an adversary can learn about a row in the database, is now
known to imply a natural and still very powerful relaxation of
differential privacy, defined here.

Newspaper horror stories about "anonymized" and
"de-identified" data typically refer to noninteractive approaches
in which certain kinds of information in each data record have
been suppressed or altered. A famous example is AOL's release of
a set of "anonymized" search query logs. People search for many
"obviously" disclosive things, such as their full names (vanity
searches), their own social security numbers (to see if their
numbers are publicly available on the Web, possibly with a goal
of assessing the threat of identity theft), and even the
combination of mother's maiden name and social security number.
AOL carefully redacted such obviously disclosive "personally
identifiable information," and each user id was replaced by a
random string. However, search histories can be very
idiosyncratic, and a *New York Times* reporter correctly
traced such an "anonymized" search history to a specific resident
of Georgia.

It has taken several years to fully appreciate the importance of taking auxiliary information into account in privacy-preserving data release.

In a *linkage attack*, released data are linked to other
databases or other sources of information. We use the term
*auxiliary information* to capture information about the
respondents *other* than that which is obtained through the
(interactive or noninteractive) statistical database. Any priors,
beliefs, or information from newspapers, labor statistics, and so
on, all fall into this category.

In a notable demonstration of the power of auxiliary
information, medical records of the governor of Massachusetts
were identified by linking voter registration records to
"anonymized" Massachusetts Group Insurance Commission (GIC)
medical encounter data, which retained the birthdate, sex, and
zip code of the patient.^{22}

Despite this exemplary work, it has taken several years to
fully appreciate the importance of taking auxiliary information
into account in privacy-preserving data release. Sources and uses
of auxiliary information are endlessly varied. As a final
example, it has been proposed to modify search query logs by
mapping *all* terms, not just the user ids, to random
strings. In *token-based hashing* each query is tokenized,
and then an uninvertible hash function is applied to each token.
The intuition is that the hashes completely obscure the terms in
the query. However, using a statistical analysis of the hashed
log and *any* (unhashed) query log, for example, the
released AOL log discussed above, the anonymization can be
severely compromised, showing that token-based hashing is
unsuitable for anonymization.^{17}

As we will see next, there are deep reasons for the fact that auxiliary information plays such a prominent role in these examples.

In 1977 the statistician Tore Dalenius articulated an "*ad
omnia*" (as opposed to *ad hoc*) privacy goal for
statistical databases: Anything that can be learned about a
respondent from the statistical database should be learnable
without access to the database. Although informal, this feels
like the "right" direction. The breadth of the goal captures all
the common intuitions for privacy. In addition, the definition
only holds the database accountable for whatever "extra" is
learned about an individual, beyond that which can be learned
from other sources. In particular, an extrovert who posts
personal information on the Web may destroy his or her own
privacy, and the database should not be held accountable.

Formalized, Dalenius' goal is strikingly similar to the gold
standard for security of a cryptosystem against a passive
eavesdropper, defined five years later. *Semantic security*
captures the intuition that the encryption of a message reveals
no information about the message. This is formalized by comparing
the ability of a computationally efficient adversary, having
access to both the ciphertext and any auxiliary information, to
output (anything about) the plaintext, to the ability of a
computationally efficient party having access *only* to the
auxiliary information (and not the ciphertext), to achieve the
same goal.^{13} Abilities are measured
by probabilities of success, where the probability space is over
the random choices made in choosing the encryption keys, the
ciphertexts, and by the adversaries. Clearly, if this difference
is very, very tiny, then in a rigorous sense the ciphertext leaks
(almost) no information about the plaintext.

The formal definition of semantic security is a pillar of modern cryptography. It is therefore natural to ask whether a similar property, such as Dalenius' goal, can be achieved for statistical databases. But there is an essential difference in the two problems. Unlike the eavesdropper on a conversation, the statistical database attacker is also a user, that is, a legitimate consumer of the information provided by the statistical database (not to mention the fact that he or she may also be a respondent in the database).

The formal definition of semantic security is a pillar of modern cryptography. It is therefore natural to ask whether a similar property, such as Dalenius' goal, can be achieved for statistical databases.

Many papers in the literature attempt to formalize Dalenius'
goal (in some cases unknowingly) by requiring that the
adversary's prior and posterior beliefs about an individual (that
is, before and after having access to the statistical database)
should not be "too different," or that access to the statistical
database should not change the adversary's views about any
individual "too much." The difficulty with this approach is that
if the statistical database teaches us anything at all, then it
*should* change our beliefs about individuals. For example,
suppose the adversary's (incorrect) prior view is that everyone
has two left feet. Access to the statistical database teaches
that almost everyone has one left foot and one right foot. The
adversary now has a very different view of whether or not any
given respondent has two left feet. But has privacy been
compromised?

The last hopes for Dalenius' goal evaporate in light of the following parable, which again involves auxiliary information. Suppose we have a statistical database that teaches average heights of population subgroups, and suppose further that it is infeasible to learn this information (perhaps for financial reasons) in any other way (say, by conducting a new study). Finally, suppose that one's true height is considered sensitive. Given the auxiliary information "Turing is two inches taller than the average Lithuanian woman," access to the statistical database teaches Turing's height. In contrast, anyone without access to the database, knowing only the auxiliary information, learns much less about Turing's height.

A rigorous impossibility result generalizes this argument,
extending to essentially any notion of privacy compromise,
*assuming the statistical database is useful*. The heart of
the attack uses extracted randomness from the statistical
database as a one-time pad for conveying the privacy compromise
to the
adversary/user.^{6,9}

Turing did not have to be a member of the database for the attack described earlier to be prosecuted against him. More generally, the things that statistical databases are designed to teach can, sometimes indirectly, cause damage to an individual, even if this individual is not in the database.

In practice, statistical databases are (typically) created to
provide some anticipated social gain; they teach us something we
could not (easily) learn without the database. Together with the
attack against Turing, and the fact that he did not have to be a
member of the database for the attack to work, this suggests a
new privacy goal: Minimize the increased risk to an individual
incurred by joining (or leaving) the database. That is, we move
from comparing an adversary's prior and posterior views of an
individual to comparing the risk to an individual when included
in, versus when not included in, the database. This makes sense.
A privacy guarantee that limits risk incurred by joining
encourages participation in the dataset, increasing social
utility. This is the starting point on our path to
*differential privacy*.

Differential privacy will ensure that the ability of an
adversary to inflict harm (or good, for that matter)—of any
sort, to any set of people—should be essentially the same,
independent of whether any individual opts in to, or opts out of,
the dataset. We will do this indirectly, simultaneously
addressing all possible forms of harm and good, by focusing on
the probability of any given output of a privacy mechanism and
how this probability can change with the addition or deletion of
any row. Thus, we will concentrate on pairs of databases (*D,
D'*) differing only in one row, meaning one is a subset of the
other and the larger database contains just one additional row.
Finally, to handle worst-case pairs of databases, our
probabilities will be over the random choices made by the privacy
mechanism.

DEFINITION 1. *A randomized function Κ gives*
ε-differential privacy *if for all datasets D and D'
differing on at most one row, and all S Í
Range*(*Κ*),

*where the probability space in each case is over the coin
flips of Κ*.

The multiplicative nature of the guarantee implies that an
output whose probability is zero on a given database must also
have probability zero on any neighboring database, and hence, by
repeated application of the definition, on any other database.
Thus, Definition 1 trivially rules out the subsample-and-release
paradigm discussed: For an individual *x* not in the
dataset, the probability that *x*'s data is sampled and
released is obviously zero; the multiplicative nature of the
guarantee ensures that the same is true for an individual whose
data *is* in the dataset.

Any mechanism satisfying this definition addresses all
concerns that any participant might have about the leakage of his
or her personal information, regardless of any auxiliary
information known to an adversary: Even if the participant
removed his or her data from the dataset, no outputs (and thus
consequences of outputs) would become significantly more or less
likely. For example, if the database were to be consulted by an
insurance provider before deciding whether or not to insure a
given individual, then the presence or absence of *any*
individual's data in the database will not significantly affect
his or her chance of receiving coverage.

Definition 1 extends naturally to group privacy. Repeated
application of the definition bounds the ratios of probabilities
of outputs when a collection *C* of participants opts in or
opts out, by a factor of *e*^{|C|ε}.
Of course, the point of the statistical database is to disclose
aggregate information about large groups (while simultaneously
protecting individuals), so we should expect privacy bounds to
disintegrate with increasing group size.

The parameter ε is public, and its selection is a social question. We tend to think of ε as, say, 0.01, 0.1, or in some cases, ln 2 or ln 3.

Sometimes, for example, in the census, an individual's participation is known, so hiding presence or absence makes no sense; instead we wish to hide the values in an individual's row. Thus, we can (and sometimes do) extend "differing in at most one row" to mean having symmetric difference at most 1 to capture both possibilities. However, we will continue to use the original definition.

Returning to randomized response, we see that it yields
ε-differential privacy for a value of ε that
depends on the universe from which the rows are chosen and the
probability with which a random, rather than non-random, value is
contributed by the respondent. As an example, suppose each row
consists of a single bit, and that the respondent's instructions
are to first flip an unbiased coin to determine whether he or she
will answer randomly or truthfully. If heads (respond randomly),
then the respondent is to flip a second unbiased coin and report
the outcome; if tails, the respondent answers truthfully. Fix
*b* {0, 1}. If the true value of the input is
*b*, then *b* is output with probability 3/4. On the
other hand, if the true value of the input is 1 – *b*,
then *b* is output with probability 1/4. The ratio is 3,
yielding (ln 3)-differential privacy.

Suppose *n* respondents each employ randomized response
independently, but using coins of known, fixed, bias. Then, given
the randomized data, by the properties of the binomial
distribution the analyst can approximate the true answer to the
question "How many respondents have value *b*?" to within an
expected error on the order of Θ (). As we will
see, it is possible to do much better—obtaining
*constant* expected error, independent of *n*.

Generalizing in a different direction, suppose each row now
has two bits, each one randomized independently, as described
earlier. While each bit remains (ln 3)-differentially private,
their logical-AND enjoys less privacy. That is, consider a
privacy mechanism in which each bit is protected by this exact
method of randomized response, and consider the query: "What is
the logical-AND of the bits in the row of respondent *i*
(after randomization)?" If we consider the two extremes, one in
which respondent *i* has data 11 and the other in which
respondent *i* has data 00, we see that in the first case
the probability of output 1 is 9/16, while in the second case the
probability is 1/16. Thus, this mechanism is at best (ln
9)-differentially private, not ln 3. Again, it is possible to do
much better, even while releasing the entire 4-element histogram,
also known as a *contingency table*, with only constant
expected error in each cell.

Achieving differential privacy revolves around hiding the
presence or absence of a single individual. Consider the query
"How many rows in the database satisfy property *P*?" The
presence or absence of a single row can affect the answer by at
most 1. Thus, a differentially private mechanism for a query of
this type can be designed by first computing the true answer and
then adding random noise according to a distribution with the
following property:

To see why this is desirable, consider any feasible response
*r*. For any *m*, if *m* is the true answer and
the response is *r* then the random noise must have value
*r − m*; similarly, if *m* − 1 is the true
answer and the response is *r*, then the random noise must
have value *r − m* + 1. In order for the response
*r* to be generated in a differentially private fashion, it
suffices for

In general we are interested in vector-valued queries; for
example, the data may be points in **R**^{d}
and we wish to carry out an analysis that clusters the points and
reports the location of the largest cluster.

DEFINITION 2. *For f : D* →
**R**^{d}, *the L*_{1} sensitivity
*of f is*^{7}

*for all D, D' differing in at most one row*.

In particular, when *d* = 1 the sensitivity of *f*
is the maximum difference in the values that the function
*f* may take on a pair of databases that differ in only one
row. This is the difference our noise must be designed to hide.
For now, let us focus on the case *d* = 1.

The Laplace distribution with parameter *b*, denoted
Lap(*b*), has density function ; its variance is
2*b*^{2}. Taking *b* = 1/ε we have that
the density at *z* is proportional to
*e*^{−ε|z|}. This distribution
has highest density at 0 (good for accuracy), and for any *z,
z*' such that |*z − z'*| ≤ 1 the density at
*z* is at most *e*^{ε} times the
density at *z'*, satisfying the condition in Equation 2. It
is also symmetric about 0, and this is important. We cannot, for
example, have a distribution that only yields non-negative noise.
Otherwise the only databases on which a counting query could
return a response of 0 would be databases in which no row
satisfies the query. Letting *D* be such a database, and
letting *D*' = *D* {*r*} for some row
*r* satisfying the query, the pair *D, D*' would
violate ε-differential privacy. Finally, the distribution
gets flatter as ε decreases. This is correct: smaller
ε means better privacy, so the noise density should be
less "peaked" at 0 and change more gradually as the magnitude of
the noise increases.

There is nothing special about the cases *d* = 1,
Δ*f* = 1:

THEOREM 2. *For f : D* →
**R**^{d}, *the mechanism Κ that adds
independently generated noise with distribution Lap
(Δf/ε) to each of the d output terms enjoys
ε-differential
privacy*.^{7}

Before proving the theorem, we illustrate the situation for
the case of a counting query (Δ*f* = 1) when ε
= ln2 and the true answer to the query is 100. The distribution
on the outputs (in gray) is centered at 100. The distribution on
outputs when the true answer is 101 is shown in orange.

PROOF. (Theorem 2) The proof is a simple generalization of the reasoning we used to illustrate the case of a single counting query.

Consider any subset *S* Í Range(*Κ*), and
let *D, D*' be any pair of databases differing in at most
one row. When the database is *D*, the probability density
at any *r* *S* is proportional to
exp(−||*f*(*D*) − *r*||_{1}
(ε/Δ*f*)). Similarly, when the database is
*D*', the probability density at any *r*
Range(*K*) is proportional to exp (−||*f*
(*D*') −
*r*||_{1}(ε/Δ*f*)).

We have

where the inequality follows from the triangle inequality. By
definition of sensitivity, ||*f*(*D*') −
*f*(*D*)||_{1} ≤ Δ*f*, and so the
ratio is bounded by exp(ε). Integrating over *S*
yields ε-differential privacy.

Given any query sequence *f*_{1},...,
*f*_{m}, ε-differential privacy can
be achieved by running *Κ* with noise distribution
Lap() on *each* query, even if the queries are
chosen adaptively, with each successive query depending on the
answers to the previous queries. In other words, by allowing the
quality of each answer to deteriorate in a controlled way with
the sum of the sensitivities of the queries, we can maintain
ε-differential privacy.

With this in mind, let us return to some of the suggestions we
considered earlier. Recall that using the specific randomized
response strategy described above, for a single Boolean
attribute, yielded error Θ() on databases of
size *n* and (ln 3)-differential privacy. In contrast, using
Theorem 2 with the same value of ε, noting that
Δ*f* = 1 yields a variance of 2(1/ln
3)^{2}, or an expected error of
/ln3. More generally, to obtain
ε-differential privacy we get an expected error of
/ε. Thus, our expected error magnitude is
constant, independent of *n*.

What about two queries? The sensitivity of a sequence of two
counting queries is 2. Applying the theorem with
Δ*f*/ε = 2/ε, adding independently
generated noise distributed as Lap(2/ε) to each true
answer yields ε-differential privacy. The variance is
2(2/ε)^{2}, or standard deviation
22/ε. Thus, for any desired ε we can
achieve ε-differential privacy by increasing the expected
magnitude of the errors as a function of the total sensitivity of
the two-query sequence. This holds equally for:

- Two instances of the
*same query*, addressing the repeated query problem - One count for each of two different bit positions, for example, when each row consists of two bits
- A pair of queries of the form: "How many rows satisfy
property
*P*?" and "How many rows satisfy property*Q*?" (where possibly*P*=*Q*) - An arbitrary pair of queries

However, the theorem also shows we can sometimes do better. The logical-AND count we discussed earlier, even though it involves two different bits in each row, still only has sensitivity 1: The number of 2-bit rows whose entries are both 1 can change by at most 1 with the addition or deletion of a single row. Thus, this more complicated query can be answered in an ε-differentially private fashion using noise distributed as Lap(1/ε); we do not need to use the distribution Lap(2/ε).

** Histogram Queries.** The power of Theorem 2 really
becomes clear when considering

** When Noise Makes No Sense.** There are times when
the addition of noise for achieving privacy makes no sense. For
example, the function

There are times when the addition of noise for achieving privacy makes no sense.

Assume the curator holds a database *D* and the goal is
to produce an object *y*. The *exponential
mechanism*^{19} works as follows. We
assume the existence of a *utility function u*(*D, y*)
that measures the quality of an output *y*, given that the
database is *D*. For example, the data may be a set of
labeled points in **R**^{d} and the output
*y* might be a *d*-ary vector describing a (*d*
− 1)-dimensional hyperplane that attempts to classify the
points, so that those labeled with +1 have non-negative inner
product with *y* and those labeled with −1 have
negative inner product. In this case the utility would be the
number of points correctly classified, so that higher utility
corresponds to a better classifier. The exponential mechanism
outputs *y* with probability proportional to
exp(*u*(*D,y*)ε/Δ*u*) and ensures
ε-differential privacy. Here Δ*u* is the
sensitivity of the utility function bounding, for all databases
(*D, D*') differing in only one row and potential outputs
*y*, the difference |*u*(*D, y*) −
*u*(*D',y*)|. In our example, Δ*u* = 1. The
mechanism assigns most mass to the best classifier, and the mass
assigned to any other drops off exponentially in the decline in
its utility for the current dataset—hence the name
"exponential mechanism."

** When Sensitivity Is Hard to Analyze.** The Laplace
and exponential mechanisms provide a differentially private
interface through which the analyst can access the data. Such an
interface can be useful even when it is difficult to determine
the sensitivity of the desired function or query sequence; it can
also be used to run an iterative algorithm, composed of easily
analyzed steps, for as many iterations as a given privacy budget
permits. This is a powerful observation; for example, using only
noisy sum queries, it is possible to carry out many standard data
mining tasks, such as singular value decompositions, finding an
ID3 decision tree, clustering, learning association rules, and
learning anything learnable in the statistical queries learning
model, frequently with good accuracy, in a privacy-preserving
fashion.

** k-Means Clustering.** As an example of "private
programming,"

- Partition the samples {
*p*_{i}} into*k*sets*S*_{1}, ...,*S*_{k}, associating each*p*_{i}with the nearest μ_{j}. - For 1 ≤
*j*≤*k*, set μ'_{j}= Σ_{iεsj}*p*_{i}/|*S*_{j}|, the mean of the samples associated with μ_{j}.

This update rule is typically iterated until some convergence criterion has been reached, or a fixed number of iterations have been applied.

Although computing the nearest mean of any one sample (Step 1)
would breach privacy, we observe that to compute an average among
an unknown set of points it is enough to compute their sum and
divide by their number. Thus, the computation only needs to
expose the approximate cardinalities of the
*S*_{j}, not the sets themselves. Happily,
the *k* candidate means implicitly define a histogram query,
since they partition the space [0, 1]^{d}
according to their Voronoi cells, and so the vector
(|*S*_{1}|, ..., |*S*_{k}|)
can be released with very low noise in each coordinate. This
gives us a differentially private approximation to the
denominators in Step 2. As for the numerators, the sum of a
subset of the *p*_{i} has sensitivity at most
*d*, since the points come from the bounded region
[0,1]^{d}. Even better, the sensitivity of the
*d*-ary function that returns, for each of the *k*
Voronoi cells, the *d*-ary sum of the points in the cell is
at most *d:* Adding or deleting a single *d*-ary point
can affect at most one sum, and that sum can change by at most 1
in each of the *d* dimensions. Thus, using a query sequence
with total sensitivity at most *d* + 1, the analyst can
compute a new set of candidate means by dividing, for each
μ_{j}, the approximate sum of the points in
*S*_{j} by the approximation to the
cardinality |*S*_{j}|.

If we run the algorithm for a fixed number *N* of
iterations we can use the noise distribution Lap((*d* +
1)*N*/ε) to obtain ε-differential privacy.
If we do not know the number of iterations in advance we can
increase the noise parameter as the computation proceeds. There
are many ways to do this. For example, we can answer in the first
iteration with parameter (*d* + 1)(ε/2), in the next
with parameter (*d* + 1)(ε/4), and so on, each time
using up half of the remaining "privacy budget."

The idea of creating a synthetic dataset whose statistics
closely mirror those of the original dataset, but which preserves
privacy of individuals, was proposed in the statistics community
no later than 1993.^{21} The lower
bounds on noise discussed at the end of Section on "How Is Hard"
imply that no such dataset can safely provide very accurate
answers to too many weighted subset sum questions, motivating the
interactive approach to private data analysis discussed herein.
Intuitively, the advantage of the interactive approach is that
only the questions actually asked receive responses.

Against this backdrop, the non-interactive case was revisited
from a learning theory perspective, challenging the
interpretation of the noise lower bounds as a limit on the number
of queries that can be answered
privately.^{3} This work, described next,
has excited interest in interactive and non-interactive solutions
yielding noise in the range [ω(),
*o(n*)].

Let *X* be a universe of data items and let *C* be a
*concept class* consisting of functions *c* : *X*
→ {0,1}. We say *x* *X satisfies* a concept
*c* *C* if and only if *c(x*) = 1. A
concept class can be extremely general; for example, it might
consist of all rectangles in the plane, or all Boolean circuits
containing a given number of gates.

Given a sufficiently large database *D*
*X*^{n}, it is possible to privately generate
a synthetic database that maintains approximately correct
fractional counts for *all* concepts in *C* (there may
be infinitely many!). That is, letting *S* denote the
synthetic database produced, with high probability over the
choices made by the privacy mechanism, for every concept *c*
*C*, the fraction of elements in *S* that
satisfy *c* is approximately the same as the fraction of
elements in *D* that satisfy *c*.

The minimal size of the input database depends on the quality
of the approximation, the logarithm of the cardinality of the
universe *X*, the privacy parameter ε, and the
*Vapnick–Chervonenkis dimension* of the concept class
*C* (for finite |*C*| this is at most log_{2}
|*C*|). The synthetic dataset, chosen by the exponential
mechanism, will be a set of *m* =
*O*(VCdim(*C*)/γ^{2}), elements in
*X* (γgoverns the maximum permissible inaccuracy in
the fractional count). Letting *D* denote the input dataset
and a candidate synthetic dataset, the utility
function for the exponential mechanism is given by

Data collected by a curator for a given purpose may be subject to "mission creep" and legal compulsion, such as a subpoena. Of course, we could analyze data and then throw it away, but can we do something even stronger, never storing the data in the first place? Can we strengthen our notion of privacy to capture the "never store" requirement?

These questions suggest an investigation of differentially private streaming algorithms with small state—much too small to store the data. However, nothing in the definition of a streaming algorithm, even one with very small state, precludes storing a few individual data points. Indeed, popular techniques from the streaming literature, such as Count-Min Sketch and subsampling, do precisely this. In such a situation, a subpoena or other intrusion into the local state will breach privacy.

A *pan-private* algorithm is private "inside and out,"
remaining differentially private even if its internal state
becomes visible to an adversary.^{10} To
understand the pan-privacy guarantee, consider click stream data.
This data is generated by individuals, and an individual may
appear many times in the stream. Pan-privacy requires that any
two streams differing only in the information of a single
individual should produce very similar distributions on the
*internal states* of the algorithm *and on its
outputs*, even though the data of an individual are
interleaved arbitrarily with other data in the stream.

As an example, consider the problem of *density
estimation*. Assuming, for simplicity, that the data stream is
just a sequence of IP addresses in a certain range, we wish to
know what fraction of the set of IP addresses in the range
actually appears in the stream. A solution inspired by randomized
response can be designed using the following
technique.^{10}

Define two probability distributions, *D*_{0} and
*D*_{1}, on the set {0, 1}. *D*_{0}
assigns equal mass to zero and to one. *D*_{1} has a
slight bias toward 1; specifically, 1 has mass 1/2 + ε/4,
while 0 has mass 1/2 − ε/4.

Let *X* denote the set of all possible IP addresses in
the range of interest. The algorithm creates a table, with a
1-bit entry *b*_{x} for each *x*
*X*, initialized to an independent random draw from
*D*_{0}. So initially the table is roughly half
zeroes and half ones.

In an atomic step, the algorithm receives an element from the
stream, changes state, and discards the element. When processing
*x* *X*, the algorithm makes a fresh random draw
from *D*_{1}, and stores the result in
*b*_{x}. This is done no matter how many
times *x* may have appeared in the past. Thus, for any
*x* appearing at least once, *b*_{x}
will be distributed according to *D*_{1}. However,
if *x* never appears, then the entry for *x* is the bit
drawn according to *D*_{0} during the initialization
of the table.

As with randomized response, the density in *X* of the
items in the stream can be approximated from the number of 1's in
the table, taking into account the expected fraction of "false
positives" from the initialization phase and the "false
negatives" when sampling from *D*_{1}. Letting
θ denote the fraction of entries in the table with value 1,
the output is 4(θ − 1/2)/ε +
Lap(1/ε|*X*|).

Intuitively, the internal state is differentially private
because, for each *b* {0, 1},
*e*^{−ε} ≤ Pr_{D1}
[*b*]/Pr_{D0} [*b*] ≤
*e*^{ε}; privacy for the output is ensured
by the addition of Laplacian noise. Over all, the algorithm is
2ε-differentially pan-private.

The differential privacy frontier is expanding rapidly, and there is insufficient space here to list all the interesting directions currently under investigation by the community. We identify a few of these.

** The Geometry of Differential Privacy.** Sharper
upper and lower bounds on noise required for achieving
differential privacy against a sequence of linear queries can be
obtained by understanding the geometry of the query
sequence.

** Algorithmic Complexity.** We have so far ignored
questions of computational complexity. Many, but not all, of the
techniques described here have efficient implementations. For
example, there are instances of the synthetic data generation
problem that, under standard cryptographic assumptions, have no
polynomial time implementation.

** An Alternative to Differential Privacy?** Is there
an alternative,

The work described herein has, for the first time, placed private data analysis on a strong mathematical foundation. The literature connects differential privacy to decision theory, economics, robust statistics, geometry, additive combinatorics, cryptography, complexity theory learning theory, and machine learning. Differential privacy thrives because it is natural, it is not domain-specific, and it enjoys fruitful interplay with other fields. This flexibility gives hope for a principled approach to privacy in cases, like private data analysis, where traditional notions of cryptographic security are inappropriate or impracticable.

1. Adam, N.R., Wortmann, J. Security-control
methods for statistical databases: A comparative study. *ACM
Comput. Surv. 21* (1989), 515–556.

2. Blum, A., Dwork, C., McSherry, F., Nissim,
K. Practical privacy: The SuLQ framework. In *Proceedings of
the 24th ACM Symposium on Principles of Database Systems*
(2005), 128–138.

3. Blum, A., Ligett, K., Roth, A. A learning
theory approach to non-interactive database privacy. In
*Proceedings of the 40th ACM Symposium on Theory of
Computing* (2008), 609–618.

4. Denning, D.E. Secure statistical databases
with random sample queries. *ACM Trans. Database Syst. 5*
(1980), 291–315.

5. Dinur, I., Nissim, K. Revealing
information while preserving privacy. In *Proceedings of the
22nd ACM Symposium on Principles of Database Systems* (2003),
202–210.

6. Dwork, C. Differential privacy. In
*Proceedings of the 33rd International Colloquium on Automata,
Languages and Programming (ICALP) (2)* (2006), 1–12.

7. Dwork, C., McSherry, F., Nissim, K.,
Smith, A. Calibrating noise to sensitivity in private data
analysis. In *Proceedings of the 3rd Theory of Cryptography
Conference* (2006), 265–284.

8. Dwork, C., McSherry, F., Talwar, K. The
price of privacy and the limits of lp decoding. In *Proceedings
of the 39th ACM Symposium on Theory of Computing* (2007),
85–94.

9. Dwork, C. Naor, M. On the difficulties of
disclosure prevention in statistical databases or the case for
differential privacy. *J. Privacy Confidentiality 2* (2010).
Available at: http://repository.cmu.edu/jpc/vol2/iss1/8.

10. Dwork, C., Naor, M., Pitassi, T.,
Rothblum, G., Yekhanin, S. Pan-private streaming algorithms. In
*Proceedings of the 1st Symposium on Innovations in Computer
Science* (2010).

11. Dwork, C., Naor, M., Reingold, O.,
Rothblum, G., Vadhan, S. When and how can privacy-preserving data
release be done efficiently? In *Proceedings of the 41st ACM
Symposium on Theory of Computing* (2009), 381–390.

12. Dwork, C., Nissim, K. Privacy-preserving
datamining on vertically partitioned databases. In *Advances in
Cryptology—CRYPTO'04* (2004), 528–544.

13. Goldwasser, S., Micali, S. Probabilistic
encryption. *JCSS 28* (1984), 270–299.

14. Hardt, M., Talwar, K. On the geometry of
differential privacy, (2009). In *Proceedings of the
42 ^{nd} ACM Symposium on Theory of Computing* (2010),
705–714.

15. Kenthapadi K., Mishra, N., Nissim, K.
Simulatable auditing. In *Proceedings of the 24th ACM Symposium
on Principles of Database Systems* (2005), 118–127.

16. Kleinberg, J., Papadimitriou, C.,
Raghavan, P. Auditing boolean attributes. In *Proceedings of
the 19th ACM Symposium on Principles of Database Systems*
(2000), 86–91.

17. Kumar, R., Novak, J., Pang, B., Tomkins,
A. On anonymizing query logs via token-based hashing. In
*Proceedings of the WWW 2007* (2007), 629–638.

18. McSherry, F. Privacy integrated queries
(codebase). Available on Microsoft Research downloads website.
See also *Proceedings of SIGMOD* (2009), 19–30.

19. McSherry, F., Talwar, K. Mechanism
design via differential privacy. In *Proceedings of the 48th
Annual Symposium on Foundations of Computer Science*
(2007).

20. Mironov, I., Pandey, O., Reingold, O.,
Vadhan, S. Computational differential privacy. In *Advances in
Cryptology—CRYPTO'09* (2009), 126–142.

21. Rubin, D. Discussion: Statistical
disclosure limitation. *J. Official Statist. 9* (1993),
462–468.

22. Sweeney, L. Weaving technology and
policy together to maintain confidentiality. *J. Law Med.
Ethics 25* (1997), 98–110.

23. Warner, S. Randomized response: a survey
technique for eliminating evasive answer bias. *JASA*
(1965), 63–69.

**©2011
ACM 0001-0782/11/0100 $10.00**

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2011 ACM, Inc.

The following letter was published in the Letters to the Editor in the May 2011 CACM (http://cacm.acm.org/magazines/2011/5/107681).

--CACM Administrator

Many thanks for Cynthia Dwork's article "A Firm Foundation for Private Data Analysis" (Jan. 2011), explaining why, in trying to formalize what is perfect privacy, we cannot use the late University of Stockholm economist Tore E. Dalenius's criterion that asking allowed queries of a statistical database, we should not be able to learn new (private) information about a particular individual. When preparing to discuss Dwork's article at a recent colloquium in our computer science department, we came up with an even simpler explanation of such an impossibility:

One important purpose of collecting statistical data is to help identify correlations between, say, weight and blood pressure. Suppose, for example, it turns out that blood pressure is equal to weight, and we know that person A (not in this database) weighs 180 pounds. Without the database, A's blood pressure might be private, but once we learn the perfect correlation from it, we can conclude that A's blood pressure is 180.

In real life, we never see such perfect correlation, but, by analyzing the database and discovering some correlation, we know more about the probability of different values of blood pressure than we would otherwise know.

Vladik Kreinovich and Luc Longpre

El Paso, TX

" However, this is clearly insufficient: Suppose appearing in a subsample has terrible consequences. Then every time subsampling occurs some individual suffers horribly."

I don't understand this line of reasoning. In most randomized trial experiments, which are considered as a gold standard in both medical and social sciences this is ethically acceptable. At the same time appearing in a subsample might need not end up always horribly for the selected individual, it can also provide a benefit. The underlying assumption that appearance in a randomized subsample is somehow discriminatory goes completely against ethical norms followed by all other fields.

Akshay Bhat,

New York, NY

Displaying **all 2** comments