Sign In

Communications of the ACM

Research highlights

A Theory Of Pricing Private Data


A Theory of Pricing Private Data, illustration

Credit: Getty Images

When the analysis of individuals' personal information has value to an institution, but it compromises privacy, should individuals be compensated? We describe the foundations of a market in which those seeking access to data must pay for it and individuals are compensated for the loss of privacy they may suffer.

Back to Top

1. Introduction

The interests of individuals and institutions with respect to personal data are often at odds. Personal data has great value to institutions: they eagerly collect it and monetize it by using it to model customer behavior, personalize services, target advertisements, or by selling the data directly. Yet the inappropriate disclosure of personal data poses a risk to individuals. They may suffer a range of harms including elevated prices for goods or services, discrimination, or exclusion from employment opportunities.3

A rich literature on privacy-preserving data analysis4,6,11 has tried to devise technical means for negotiating these competing interests. The goal is to derive accurate aggregate information from data collected from a group of individuals while at the same time protecting each member's personal information. But this approach necessarily imposes restrictions on the use of data. A seminal result from this line of work is that any mechanism providing reasonable privacy must strictly limit the number of query answers that can be accurately released.5 Nevertheless, recent research into differential privacy,7 a formal model of privacy in which an individual's privacy loss is rigorously measured and bounded, has shown that, for some applications, accurate aggregate analysis need not entail significant disclosure about individuals. Practical adoption of these techniques is slowing increasing: they have been used in a U.S. Census product16 and for application monitoring by Google9 and Apple.13

But there remain settings where strictly limiting privacy loss degrades the utility of data and means the intended use of data will be impossible. We therefore pursue an alternative approach which allows a non-negligible degree of privacy loss if that loss is compensated in accordance with users' preferences. Compensating privacy loss is an improvement over the narrower view that mandates negligible privacy loss because it empowers individuals to control their data through financial means and permits more accurate data analysis if end-users are willing to pay for it.

Considering personal information as a resource, one that is valuable but also exchangeable, is not new. Twenty years ago, Laudon proposed that personal information be bought and sold in a national market18 and there is a mature literature on economic aspects of privacy.1 And in today's online services, one could argue that individuals are compensated indirectly for contributing their personal data. Many internet companies acquire personal data by offering a (purportedly) free service, attracting users who provide their data, and then monetizing the personal data by selling it, or by selling information derived from it, to third parties.

Even so, a technical foundation for a market for personal information is lacking, particularly one that is consistent with recent advances in the formal modeling of privacy. We address this by proposing a formal framework for assigning prices to queries in order to compensate data owners for their loss of privacy. Our framework borrows from, and extends, key principles from both differential privacy7,8 and data markets.17,21

There are three types of actors in our setting: individuals, or data owners, contribute their personal data; a buyer submits an aggregate query over many owners' data; and a market maker, trusted to answer queries on behalf of owners, charges the buyer and compensates the owners.

Our framework makes three important connections:

* 1.1. Perturbation and price

In response to a buyer's query, the market maker computes the true query answer, adds random noise, and returns a perturbed result. Using differential privacy, perturbation is always necessary. Here query answers can be sold unperturbed, but the price would be high because each data owner contributing to an aggregate query needs to be compensated. By adding perturbation to the query answer, the price can be lowered: the more perturbation, the lower the price. When issuing the query, the buyer specifies the degree of accuracy for which he is willing to pay. Unperturbed query answers are very expensive, but at the other extreme, query answers are almost free if the noise added is the same as required by differential privacy with conservative privacy parameters. The relationship between the accuracy of a query result and its cost depends on the query and the preferences of contributing data owners. Formalizing this relationship is one of the goals of this article.

* 1.2. Arbitrage and perturbation

Arbitrage allows a buyer to obtain the answer to a query more cheaply than its advertised price by deriving the answer from a less expensive alternative set of queries. Arbitrage is possible because of inconsistency in a set of priced queries. As a simple example, suppose that a given query is sold with two options for perturbation, measured by variance: $5 for a variance of 10 and $200 for a variance of 1. A savvy buyer seeking a variance of 1 would never pay $200. Instead, he would purchase the first query 10 times, receive 10 noisy answers, and compute their average. Since noise is added independently, the variance of the resulting average is 1, and the total cost is only $50. The pricing of queries should avoid arbitrage opportunities. While this has been considered before for data markets,2, 17, 21 it has not been studied for perturbed query answers. Formalizing arbitrage for noisy queries is a second goal of this article.

* 1.3. Privacy-loss and payments

Given a randomized mechanism for answering a query q, a common measure of privacy loss to an individual is defined by differential privacy7:it is the maximum ratio between the probability of returning some fixed output with and without that individual's data. Differential privacy imposes a bound of eε on this quantity, where ε is a small constant, presumed acceptable to all individuals. Our framework contrasts with this in several ways. First, the privacy loss is not bounded, but depends on the buyer's request. If the buyer asks for a query with low variance, then the privacy loss to individuals will be high. These data owners must be compensated for their privacy loss by the buyer's payment. In addition, we allow each data owner to value their privacy loss separately, by demanding greater or lesser payments. Formalizing the relationship between privacy loss and payments to data owners is a third goal of this article.

In our framework, the burden of the market maker is not to enforce a strict limit on the privacy loss of individuals. Instead, they must ensure that prices are set such that, whatever disclosure is obtained by the buyer, all contributing individuals are properly compensated. In particular, if a sequence of queries can indeed reveal the private data for most individuals, its price must approach the total cost of the entire database.

Back to Top

2. Background

In this section we describe the basic architecture of the private data pricing framework, illustrated as shown in Figure 1.

f1.jpg
Figure 1: The pricing framework has 3 components: (A) Pricing and purchase: the buyer asks a query Q = (q, v) and must pay its price, π(Q); (B) Privacy loss: by answering Q, the market maker leaks some information ε about the private data of the data owners to the buyer; (C) Compensation: the market maker must compensate each data owner for their privacy loss with micro-payments; μi(Q) is the total of the micro-payments for all users in bucket i. The pricing framework is balanced if the price π(Q) is sufficient to cover all micro-payments μj and the micro-payments μi compensate the owners for their privacy loss ε.

* 2.1. The main actors

The main actors in our proposed marketplace are data owners, query buyers, and a market maker negotiating between the two.

The market maker. The market maker is trusted by the buyer and by each of the data owners. He collects data from the owners and sells it in the form of queries. When a buyer decides to purchase a query, the market maker collects payment, computes the answer to the query, adds noise as appropriate, returns the result to the buyer, and finally distributes individual payments to the data owners. The market maker may retain a fraction of the price as profit.

The owner and her data. Each owner contributes a single tuple conforming to a relational schema R(A), with attributes A = {A1, A2, ..., Ak}. The crossproduct of the attribute domains, written dom(A), is the set of all possible tuples that could occur. For a fixed k, its size, n = | dom(A)|, is polynomial in the number of users; n grows exponentially with k. Having collected tuples from each owner, the market maker forms a relational table I, an instance of R(A).

The buyer and his queries. The buyer is a data analyst who wishes to compute some queries over a data. We restrict our attention to linear aggregation queries over instance I, a common class of queries that has received considerable attention from the privacy community.8, 15, 19

To express linear queries, we first represent the instance I by a finite data vector x consisting of non-negative integral counts. For each element tdom(A), the vector x has one entry that reports the number of individuals whose tuple matches t. We assume that dom(A) is ordered, and denote xi the i'th entry in the vector x. In other words, xi represents the number of individuals whose attribute values match the i'th entry in dom(A). Although the size n of the vector x can be large, in practice one avoids fully materializing x and retains only the relational representation of I.

DEFINITION 1 (LINEAR QUERY). A linear query is a real-valued vector q = (q1, q2 ... qn). The answer q(x) to a linear query on x is the vector product qx = q1x1 + ... + qnxn.

The class of linear queries includes many common aggregation queries including general predicate counting queries and familiar statistical queries such as histogram counts, data cube queries, and marginals. Linear queries can express differences, weighted counts, and averages over discrete domains. While the query vector q is large, in practice queries are expressed compactly, directly over the relational representation of a data, in a query language like Structured Query Language (SQL).

EXAMPLE 2. Consider a competition between candidates A and B that is decided by a population of voters. The ratings and descriptive fields of voters are sensitive and should be compensated properly if used in any way. Imagine that users contribute their gender and age along with two numerical ratings, each consisting of values from the domain {0, 1, 2, 3, 4, 5}. Thus Alice's tuple may be ("Female", 39, 0, 5) if she strongly favors candidate B over candidate A. If the domain for the age attribute is [1 ... 120], then the database vector will have size 2 × 120 × 6 × 6 = 8640.

Examples of linear queries include: the total of all ratings for candidate A; the number of female voters over 40 who gave candidate A a rating of 5; the number of voters whose rating for candidate A exceeds that for candidate B.

We make two comments about our framework. First, although the vector x removes personally identifiable information (since xi is the total number of users with a particular combination of attribute values), queries can still leak information about individual users. Differential privacy is designed to limit such leaks. Second, we assume that the buyer is allowed to issue multiple queries, which means that he can combine information derived from multiple queries to infer answers to other queries not explicitly requested. This presents a challenge we must address: to ensure that the buyer pays for any information that he might derive directly or indirectly.

* 2.2. Balanced pricing frameworks

The market maker enters into two contracts: (1) they promise to answer a buyer's queries according to an agreed price, π, (Section 3), and (2) they promise to compensate the data owners with a micro-payment μi(ε) whenever they suffer a privacy loss ε in response to a buyer's query (Section 5).

In the contract with the buyer, the market maker allows the buyer to specify, for each linear query q, an amount of noise v that he is willing to tolerate in the answer. Adding noise reduces the price. Thus, the buyer's query is a pair Q = (q, v), where q is a linear query and v ≥ 0 represents an upper bound on the variance, and the price depends on both, π(Q) = π(q, v) ≥ 0. The market maker answers by first computing the exact answer q(x), then adding noise sampled from a distribution with mean 0 and variance at most v. This feature gives the buyer more pricing options because, by increasing v, he can lower his price. Notice that the price depends only on the variance v, and not on the type of noise.

In the contract with the data owners, the market maker promises to compensate her with a micro-payment. We denote μi(ε) the sum of all micropayments to all users whose attributes match the i'th entry to the vector, where ε is the privacy loss incurred by answering the query Q. The privacy loss depends both on the variance, and on the type of noise that the market maker uses to answer queries: in Section 4 we will restrict the noise to the Laplace distribution, for which there exists an explicit formula connecting the privacy loss ε to the variance. In that case the micro-payment depends on the buyer's query, μi(Q). For the pricing framework to be balanced, we must have cacm6012_a.gif. Note that μi(Q) needs to be further split among all users participating in the i'th bucket of the data x.

EXAMPLE 3. Continuing Example 2, suppose that there are 1000 voters, and that Bob, the buyer, wants to compute the sum of ratings for candidate A. Assume that each voter requires $10 for each raw vote. For an accurate answer to the query, Bob needs to pay $10,000, which is, arguably, too expensive.

Assume Bob is willing to buy the query perturbed with variance v = 5,000, which would give an error of ±300 with 94% confidence. The market maker should charge Bob a much lower price; to see how low, we need to consider how the market maker compensates the data owners. We assume that he uses Laplace noise for the perturbation, and therefore one can show that the answer to the query is ε-differentially private with ε = 0.1, which offers reasonably good privacy to all data owners: each will be happy to accept only $0.001 for basically no loss of privacy, and Bob pays only $1 for the entire query. The challenge is to design the prices in between. For example, suppose the data owner wants to buy more accuracy, say a variance v = 50, to reduce the error to ±30. How should the price be set? For now, let us observe that the price cannot exceed $100. If it did, then a savvy buyer would never pay that price, instead he would purchase the $1 query 100 times, compute the average, and obtain the answer with a variance of 5000/100 = 50. This is an example of arbitrage and the market maker should define a pricing function that avoids it.

While in the contract with the buyer the price depends only on the variance and not on the type of perturbation, the contract with the data owner is highly sensitive to the type of noise. For example, consider noise defined by the following probability distribution: P(0) = 1 – 2/m and Pm) = 1/m, where m = 1064. This noise distribution has mean 0 and variance 2m, but is a poor choice for this market. On the one hand, it has a high variance, which implies a low price π. On the other hand, it returns an accurate answer with extremely high probability, leading to huge privacy losses εi, and, consequently, to huge micro-payments. The market maker will not be able to recover his costs. Thus, in order to design a balanced pricing framework, we need to have a perturbation mechanism where the privacy loss is given by an explicit function in the variance; in Section 5.1 we will only consider the Laplace mechanism, where such a function exists.

Back to Top

3. Pricing queries

In this section we describe the first component of the framework as shown in Figure 1: the pricing function π(Q) = π(q, v). We denote + = [0, ∞) and cacm6012_b.gif.

DEFINITION 4. A pricing function is cacm6012_c.gif.

In our framework, the buyer is allowed to issue multiple queries. As a consequence, an important concern is that the buyer may combine answers from multiple queries and derive an answer to a new query, without paying the full price for the latter, a situation we call arbitrage. A reasonable pricing function must guarantee that no arbitrage is possible, in which case we call it arbitrage-free. Such a pricing function ensures that the market maker receives proper payment for each query by removing any incentive for the buyer to "game" the system by asking a set of cheaper queries in order to obtain the desired answer. Here we formally define arbitrage-free pricing functions, study their properties, and discuss the construction of arbitrage-free pricing functions.

* 3.1. Queries and answers

A randomized mechanism is a random function κ, with some input x, denoted κ(x). For a given query Q = (q, v), the market maker answers it using a randomized mechanism κQ, with the property that, for any x, E(κQ(x) ) = q(x) and Var (κQ(x) ≤ v. In other words, when the buyer asks for a query Q, the market maker samples one value from the distribution κQ and returns it to the buyer, in exchange for payment π(Q). We abbreviate κQ with κ when Q is clear from the context.

DEFINITION 5. We say that a random function κ(x) answers the query Q = (q, v) on the database x if its expectation is q(x) and its variance is less than or equal to v.

Other options for answering queries are possible, and we briefly discuss them in Section 6.

To illustrate with a simple example, consider the mechanism κQ which, on input x first computes the query q(x), then adds random noise with mean 0 and variance v. In this section we do not impose any restrictions on the type of perturbation used in answering the query.

We assume that the market maker is stateless: he does not keep a log of previous users, their queries, or of released answers. As a consequence, each query is answered using an independent random variable. If the same buyer issues the same query repeatedly, the market maker answers using independent samples from the random function κ. Of course, the buyer would have to pay for each query separately.

* 3.2. Answerability

Before investigating arbitrage we establish the key concept of query answerability. This notion is well studied for deterministic queries and views,14, 22 but, in our setting, the query answers are random variables, and it requires a precise definition.

DEFINITION 6 (ANSWERABILITY). A query Q is answerable from a multi-set of queries S = {Q1, ..., Qk} if there exists a function f : ksuch that, for any mechanisms K1 ..., Kk, that answer the queries Q1, ..., Qk, the composite mechanism defined as cacm6012_d.gif answers the query Q.

We say that Q is linearly answerable from S, and write SQ, if the function f is linear; in that case, we will also say that S determines Q.

In other words, if we have already computed some queries, then we can compute a new query by applying a function to their results. For example, suppose we have computed Q1 = (q1, v1) and Q2 = (q2, v2), and obtained the answers y1, y2. Then we can answer the query Q3 = ((q1+q2)/2, (v1+v2)/4) by applying the function f(y1, y2) = (y1 + y2)/2. We pay only for Q1, Q2, and do not pay again for Q3.

How do we check if a query can be answered from a given set of queries? In this paper we give a partial answer, by characterizing when a query is linearly answerable.

PROPOSITION 7. Let S = {(q1, v1),...,(qm, vm)} be a multi-set of queries, and Q = (q, v) be a query. Then the following conditions are equivalent.

  1. S → Q.
  2. There exists c1, ..., cm such that c1q1+...+cmqm = q and cacm6012_e.gif

It follows that deciding determinacy can be done in polynomial time, by solving for the coefficients c1, ..., cm using a quadratic program.

* 3.3. Arbitrage-free pricing functions: definition

Arbitrage is possible when the answer to a query Q can be obtained more cheaply than the advertised price π(Q) from an alternative set of priced queries. When arbitrage is possible it complicates the interface between the buyer and market maker: the buyer may need to reason carefully about his queries to achieve the lowest price, while at the same time the market maker may not achieve the revenue intended by some of his advertised prices. Thus, arbitrage is undesirable, and we want to design pricing functions that are arbitrage-free.

DEFINITION 8 (ARBITRAGE-FREE). A pricing function π is arbitrage-free if, whenever a multi-set of queries determines a query, the multi-set is at least as expensive as the query. That is, SQ impliesQ' ∈ s π(Q') ≥ π (Q).

If π does permit arbitrage, then a buyer would never pay full price for the determined query, but instead would purchase the multiset of queries that determine it.

EXAMPLE 9. Consider a query (q, v) offered for price π(q, v). A buyer who wishes to improve the accuracy of the query may ask the same query n times, (q, v), (q, v), ..., (q, v), at a total cost of n · π(q, v). The buyer then computes the average of the query answers to get an estimated answer with a much lower variance, namely v/n. The pricing function must ensure that the total payment collected from the buyer covers the cost of this lower variance, in other words n · π(q, v) ≥ π(q, v/n). If π is arbitrage free, then it is easy to check that this condition holds. Indeed, {(q, v), ..., (q, v)} → (nq, nv) → (q, v/n), and arbitrage-freeness implies π(q, v/n) ≤ π(q, v) + ... + π(q, v) = n · π (q, v).

One can prove20 that any arbitrage-free pricing function π has the following properties:

  1. The zero query is free: π (0, v) = 0.
  2. Higher variance is cheaper: vv' implies π(q, v) ≥ π(q, v').
  3. The zero-variance query is the most expensivea: π(q, 0) ≥ π(q, v) for all v ≥ 0.
  4. Infinite noise is free: if π is continuous at q = 0, then π(q, ∞) = 0.
  5. As v → ∞, π (q, v) = Ω(1/v).

Arbitrage-free pricing functions have been studied before,17, 21 but only in the context of deterministic (i.e. unperturbed) query answers. Our definition extends those in17, 21 to queries with perturbed answers.

* 3.4. Arbitrage-free pricing functions: synthesis

Does an arbitrage-free pricing function exist? The zero-cost function π = 0, where every query is free, is arbitrage-free. A constant-price pricing function, which charges the same amount c > 0 for every query, is not arbitrage-free, because in that case the zero-query would also cost c, and by item (1) above π must have arbitrage. In this section we prove that non-trivial arbitrage-free pricing functions exist, and give some sufficient rules for synthesizing such functions, and in Section 3.5 we show that general construction of arbitrage-free pricing functions remains a difficult problem.

Example 9 and item (5) suggest defining a pricing function that decreases linearly with the variance, π ∼ 1/v. Recall that a semi-norm is a function f : n+ that satisfies the following properties:

  • For any c ε and any qn, f(cq) = |c| f(b).
  • For any q1, q2n, f(q1 + q2) ≤ (q1) + f (q2).

One can check that, q = 0 implies f(q) = 0; if the converse also holds, then f is called a norm. We prove:

PROPOSITION 10 (Basic pricing functions). Let π be a pricing function of the form π(q, v) = f 2(q)/v , for some function f. Then π is arbitrage-free iff f is a semi-norm.

Thus, we can obtain an arbitrage-free pricing function from a semi-norm. However, all these pricing functions set an infinite price for the raw (unperturbed) data, because in that case the variance is v = 0. In some cases, the market maker may be willing to sell the raw data for some high, but finite price. We describe next a method for synthesizing other arbitrage-free pricing functions, including ones that set a finite price for the raw data.

PROPOSITION 11 (COMPOSED PRICING FUNCTIONS). Let cacm6012_f.gif be a function that is sub-additive and non-decreasing (meaning f(x + y) ≤ f(x) + f(y), and xy implies f(x) ≤ f(y)). Then, for any arbitrage-free pricing functions π1,...,πk, the function π = f1,...,πk) is also arbitrage-free.

A simple sufficient criterion for checking if a function f satisfies the properties required by the proposition is the following: if f is continuous, twice differentiable, f(0) = 0, ∂f/∂xi ≥ 0, ∂2 f/∂xixj ≤ 0 for all i, j = 1, k, then it is non-decreasing and sub-additive.

By using these two propositions we can construct a large set of arbitrage-free pricing functions. We list here a few:

COROLLARY 12. All functions defined below are arbitrage-free, assuming π1, ..., πk are arbitrage free:

ueq01.gif

The last entry defines bounded pricing functions, in particular they charge a finite price for the raw data.

* 3.5. Deriving pricing functions from price points

We discuss here a more flexible framework for defining an arbitrage-free pricing function: the market maker sets the price for a finite set of queries called views, then the system automatically extrapolates this to a pricing function on all queries.

DEFINITION 13. A price point is a pair (V, p), where V = (q, v) is a query (called a view) and p ε + is a fixed price.

Given a set of price points V = ((V1, p1),..., (Vm, pm)), we say that a pricing function π is valid w.r.t. V if it is arbitrage-free and agrees with all price points, π(Vi)=pi, for all (Vi, pi) ∈ V,

This represents an ideal pricing framework because the market maker only needs to set prices for a small number of queries and variances, and the system extrapolates it to a valid pricing function. However, the technical challenge is: how do we compute a valid pricing function from a set of price points?

In general, there may not exist a valid pricing function. For example, the market maker may choose two views such that the first determines the second, yet the price of the second is set higher than that of the first view. No arbitrage-free pricing function can agree with both price points. If any valid pricing function exists, then we say that the set of price points is consistent.

The key to designing a valid pricing function is to compute, for any given query, the cheapest plan to answer it from the views in the set of price points. A procurement plan is a plan for answering a query by purchasing views in the set of price points.

DEFINITION 14 (PROCUREMENT PLAN). Consider an ordered set of price points, V = {(V1, p1)(V2, p2),...,(Vm, pm)}. A procurement plan is an ordered multi-set of non-negative integers, B = {b1, b2, ..., bm}, such that the multiset cacm6012_h.gif (where each query Vi occurs bi times) determines Q: VBQ. The cost of the procurement plan is cost(B) = ∑i bi pi.

For a simple example, suppose there is a single price point, V = {((q, 100), $5)}, which charges $5 for some query with variance 100. The buyer wants to purchase Q = (q, 25). Then, a procurement plan is {b1 = 4}. In other words, we must purchase the query 4 times, and compute the average, in order to reduce the variance to 25 (since the random noise added to different purchases is independent).

Procurement plans should not be confused with answerability (Definition 6). There the queries were already purchased. Now we have to decide which views to purchase and, in particular, we may purchase a view several times. We call the cost of the cheapest procurement plan the arbitrage pricing function.

DEFINITION 15. Given a set of price-points ν, the arbitrage pricing function is:

ueq02.gif

Note that, by this definition, if a query is not answerable from a set of price points, then the arbitrage price is ∞. The arbitrage function has a simple interpretation. Once the views are offered for sale, a buyer can purchase any query and pay only the arbitrage function, regardless of what the market maker wants to charge for that query. The arbitrage function is thus an upper bound on any valid pricing function. Could the market maker choose the arbitrage function as the official price for all queries? We answer this positively, by proving that the arbitrage function is indeed arbitrage-free.

THEOREM 16. For any set of price points ν, the arbitrage function πν is arbitrage-free. ν is consistent, if the arbitrage price does not lower the price of any view: pi ≤ πν(Vi) for all (Vi, pi) ∈ ν.

Thus, one choice for the market maker is to set all prices to be the arbitrage function. Unfortunately, it turns out that computing the arbitrage function is NP-hard.

THEOREM 17. The following problem: "Given a set of price points ν, a query Q, and budget p > 0, decide whether Q is answerable from ν within a budget p (in other words, check πν(Q) ≤ p)", is NP-complete.

The proof if by reduction from the Exact Cover by 3-Sets problem. Hardness holds even if all views in ν have variance 0.

Back to Top

4. Privacy loss

In this section we describe the second component of the pricing framework as shown in Figure 1: the privacy loss for each owner, denoted by ε. Our definition of privacy loss follows that of differential privacy. Given the database vector x, denote by x(j) the database vector that results from adding one count to entry j and leaving all other values unchanged. An individual's privacy loss is measured by comparing the mechanism output on two data vectors that differ in any one entry.

DEFINITION 18. Let K be any mechanism (meaning: for any database instance x, K(x) is a random variable). The privacy loss to each user, in notation cacm6012_i.gif, is defined as:

ueq03.gif

where x ranges over all possible databases and S ranges over measurable sets of .

Recall that in Section 3 we defined the price π(Q) to depend only on the variance, and not on the mechanism used to answer the query. But the definition of privacy loss depends on the mechanism K, so now we need to choose a particular mechanism. We will use the classical Laplace mechanism; in that case, the privacy loss can be given as a function of the variance, and of a property that depends only on the query, called query sensitivity.7 The sensitivity of q, denoted sq, is the largest possible change to the answer that can be caused by adding or removing one individual's contribution: cacm6012_j.gif.

Formally, the Laplace Mechanism, denoted , answers a query Q = (q, v) by returning Q(x) = q(x) + ρ, where ρ is noise drawn from a Laplace distribution, centered at zero, with scale cacm6012_k.gif. The privacy loss of each individual is bounded by cacm6012_l.gif.

Back to Top

5. Balanced pricing frameworks

In this section we define formally when a pricing framework is balanced and we provide a general procedure for designing a balanced pricing framework. The concept of balance brings together the three components as shown in Figure 1: the query price π, the privacy loss ε, and the micro-payments μi. We begin with a description of micro-payments.

* 5.1. Micro-payments to data owners

By answering a buyer's query Q, using some mechanism κQ, the market maker leaks some of the private data of the data owners. He must compensate each data owner with some micro-payment. Recall that μi(Q) denotes the sum of all micro-payments due to the data owner whose attributes match xi, the i'th entry in the data vector. The micro-payments close the loop as shown in Figure 1: they must be covered by the buyer's payment π, and must also be a function of the degree of the privacy loss ε. We require the micro-payment functions μi to be arbitrage-free, in order to promise that the owner's loss of privacy will be compensated, and that there is no way for the buyer to circumvent the owed micro-payment by asking other queries and combining their answers.

* 5.2. Balanced pricing frameworks: definition

The contract between data owners and the market-maker consists of non-decreasing functions cacm6012_m.gif, i = 1, n, s.t Wi(0) = 0. Here cacm6012_n.gif the market maker promises to pay the data owners contributing to the i' bucket a micro-payment of at least μiWi(ε)in the event of a privacy loss ε. We denote W = (W1, ..., Wn) the set of contracts between the market maker and all data owners.

The connection between the micro-payments μi, the query price π and the privacy loss εi is captured by the following definition.

DEFINITION 19. We say that the micro-payment functions μi, i = 1, ..., n are cost-recovering for a pricing function π if, for any query Q, π(Q) ≥ ∑i μi (Q).

Fix a query answering mechanism κ. We say that a micro-payment function μi is compensating for a contract function Wi, if for any query Q, μi(Q) ≥ Wi (ε(κQ)).

The market maker will insist that the micro-payment functions are cost-recovering: otherwise, he will not be able to pay the data owners from the buyer's payment. In addition, a data owner will insist that the micro-payment function is compensating: this enforces the contract between her and the market maker, guaranteeing that she will be compensated at least Wi(ε), in the event of a privacy loss ε.

Fix a query answering mechanism κ. We denote a pricing framework (π, ε, μ, W), where π(Q), μi(Q) are the buyer's price and the micro-payments, ε = (ε1,..., εn) where εiQ) is the privacy loss corresponding to the mechanism κ, and Wi(ε) is the contract with the data owner i.

DEFINITION 20. A pricing framework (π, ε, μ, W) is balanced if (1) π is arbitrage-free and (2) the micro-payment functions μ are arbitrage-free, cost-recovering for π, and compensating for W.

We explain how the contract between the data owner and the market maker differs from that in privacy-preserving mechanisms. Let ε > 0 be a small constant. A mechanism K is called differentially private7 if, for any measurable set S, any database vector x and for any entry j of x:

ueq04.gif

In differential privacy, the basic contract between the mechanism and the data owner is the promise to every user that her privacy loss is no larger than ε. In our framework for pricing private data we turn this contract around. Now, privacy is lost, and Definition 18 quantifies this loss. The contract requires that the users are compensated according to their privacy loss. At an extreme, if the mechanism is ε-differentially private for a tiny ε, then each user will receive only a tiny micro-payment Wi(ε); as her privacy loss increases, she will be compensated more.

The micro-payments circumvent a common limitation of differentially-private mechanisms. In differential privacy, the data analyst typically has a fixed budget, ε, granted by the data curator, for all queries that he may ever ask. In order to issue N queries, he needs to divide the privacy budget among these queries, and, as a result, each query will be perturbed with greater noise. After issuing these N queries, he can no longer query the database, because otherwise the contract with the data owner would be breached.

In our pricing framework there is no such hard limitation because the buyer simply pays for each query. The budget is now a real financial quantity, and the buyer can ask as many query as he wants, with as high accuracy as he wants, as long as he has money to pay for them. As a consequence, it is the analyst-buyer, rather than the data owner, who ultimately determines the actual privacy loss. A similar model was proposed by Ghosh et al., in which the ε budget is determined by the budget of the data analyst.12

* 5.3. Balanced pricing frameworks: synthesis

Call (ε, μ, W) semi-balanced if all micro-payment functions are arbitrage free and compensating w.r.t. κ; that is, we leave out the pricing function π and the cost-recovering requirement. The first step is to design a semi-balanced set of micro-payment functions.

PROPOSITION 21. Let be the Laplace Mechanism, and let the contract functions be linear, Wi(ε) = ci · ε, where ci > 0 is a fixed constant, for i = 1, ..., n. Define the micro-payment functions cacm6012_o.gif, for i =1,...,n. Then (ε, μ, W) is semi-balanced.

The proposition defines the micro-payment associated to linear contracts Wi, which requires infinite payment for total loss of privacy. In practice, data owners are willing to sell their raw data at some high (but finite) price. We show next how to derive new semi-balanced micro-payments by applying sub-additive functions to other micro-payments.

PROPOSITION 22. Fix k semi-balanced pricing frameworks, (ε, μ, Wj), j = 1, ..., k. Define cacm6012_q.gif, and cacm6012_r.gif, for each i = 1, ..., n, where each function cacm6012_p.gif, i = 1,...,n, is non-decreasing, sub-additive, and satisfies fi(0) = 0. Then, (ε, μ, W) is also semi-balanced, where μ = (μ1, ..., μn) and W = (W1, ..., Wn).

Finally, we choose a payment function such as to ensure that the micro-payments are cost-recovering.

PROPOSITION 23. Suppose that (ε, μ, W) is semi-balanced, and define π (Q) = ∑iμi(Q). Then, (π ε, μ, W) is balanced.

To summarize, the synthesis procedure for a pricing framework proceeds as follows. Start with the simple micro-payment functions given by Proposition 21, which ensure linear compensation for each user. Next, modify both the micro-payment and the contract functions using Proposition 22, as desired, in order to adjust to the preferences of individual users (e.g. in order to allow a user to set a price for her true data). Finally, define the query price to be the sum of all micro-payments (Proposition 23), then increase this price freely, by using any method in Section 3.4.

Back to Top

6. Discussion

A number of issues deserve further attention if we are to achieve a fully-functional marketplace for private data.

First, we have made two restrictions which limit the prevention of arbitrage in our market. We require the market maker to answer queries using an unbiased estimator (Definition 5) and therefore arbitrage-freeness only prevents an adversary from deriving an unbiased estimator of a query using cheaper queries (Definition 6). In addition, we have considered a restricted notion of answerability, limiting our attention to adversaries who only consider queries computed by linear functions of other queries. Both of these assumptions limit the class of adversaries against which the arbitrage-free property is guaranteed to hold.

A more general approach would be to relax Definition 5 to allow queries to be answered using either biased or unbiased estimators and to include answerability using non-linear functions. This would provide the market maker with a stronger guarantee against arbitrage. However it would likely make reasoning about determinacy and arbitrage-free pricing significantly more difficult, and it would further restrict the set of arbitrage-free pricing functions available to the market maker. In other words, a more powerful notion of arbitrage would lead to more restrictive pricing functions, potentially limiting the ability of the market maker to set prices. The tradeoff between completeness of the framework and the feasibility of analyzing arbitrage is an important topic for future research.

A second issue is the problem of incentivizing the data owner to participate in the database and truthfully report her privacy valuations. Currently, there is nothing stopping the data owner from quoting an impossibly high price, for even a tiny loss of her privacy. In other words, she would choose a contract function W(ε) that is as close to ∞ as possible. Incentivizing users to report their true valuation is a goal of mechanism design. This has been studied for private data only in the restricted case of a single query, and has been shown to be a difficult task.10,12 In Ref. Li et al.20 we make a preliminary attempt to address this issue by requiring that users choose from a limited set of contract functions (e.g. one appropriate for a risk-tolerant user and one appropriate for a risk-averse user).

A third issue is the protection of privacy valuations themselves. When a user has sufficient freedom to choose their privacy valuation, it may be strongly correlated with the data xi itself. In that case, even releasing the price of a query may lead to privacy loss, a factor not considered in the framework described above. Hiding the valuation itself is a difficult problem which is still being actively investigated in mechanism design.10,12,23 In, Ref. Li et al.20 we describe a simple approach that is based on perturbing the price itself, in the same way that we perturb the data. Thus both π(Q) and μi(Q) are perturbed in the same fashion as query answers, and are therefore random variables. All three properties of arbitrage-freeness, cost-recovery, and compensation are then defined in terms of expected values. In addition, the privacy loss for data item xi includes two parts: one is due to the release of the query answer and the other is due to the release of the price.

Back to Top

7. Conclusion

We have introduced a framework for selling private data. Buyers can purchase any linear query, with any amount of perturbation, and need to pay accordingly. Data owners, in turn, are compensated according to the privacy loss they incur for each query. Buyers are allowed to ask an arbitrary number of queries, and we have designed techniques for ensuring that the prices are arbitrage-free, according to a specific definition of arbitrage-freeness, meaning that buyers are guaranteed to pay for any information they may further extract from the queries. Our pricing framework is balanced, in the sense that the buyer's price covers the micro-payments to the data owner and each micro-payment compensates the users according to their privacy loss.

An interesting open question is whether we can achieve both truthfulness (e.g. as discussed in Ghosh and Roth12) and arbitrage-freeness (as discussed in the current paper) when pricing private data. Further, it remains to consider general notions of answerability that go beyond linear answerability, or to bound the impact non-linear estimation methods could have in the context of arbitrage.

Back to Top

References

1. Acquisti, A., Taylor, C.R., Wagman, L. The economics of privacy. Journal of Economic Literature 52, 2 (2016).

2. Balazinska, M., Howe, B., Suciu, D. Data markets in the cloud: An opportunity for the database community. PVLDB 4, 12 (2011), 1482–1485.

3. Calo, R. The boundaries of privacy harm. Indiana Law Journal 86, 3 (2011).

4. Chen, B.-C., Kifer, D., LeFevre, K., Machanavajjhala, A. Privacy-preserving data publishing. In Foundations and Trends in Databases (2010).

5. Dinur, I., Nissim, K. Revealing information while preserving privacy. In PODS (2003), 202–210.

6. Dwork, C. A firm foundation for private data analysis. Commun. ACM 54, 1 (2011), 86–95.

7. Dwork, C., McSherry, F., Nissim, K., Smith, A. Calibrating noise to sensitivity in private data analysis. In TCC (2006), 265–284.

8. Dwork, C., Roth, A. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science (2014) Now Publishers Inc., Hanover, MA, USA.

9. Erlingsson, U., Pihur, V., Korolova, A. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Computer and Communications Security (CCS) (2014), 1054–1067.

10. Fleischer, L., Lyu, Y.-H. Approximately optimal auctions for selling privacy when costs are correlated with data. In ACM Conference on Electronic Commerce (2012), 568–585.

11. Fung, B.C.M., Wang, K., Chen, R., Yu, P.S. Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv. 42, 4 (2010).

12. Ghosh, A., Roth, A. Selling privacy at auction. In ACM Conference on Electronic Commerce (2011), 199–208.

13. Greenberg, A. Apple's "differential privacy" is about collecting your data—but not your data. Wired (Jun 13 2016).

14. Halevy, A.Y. Answering queries using views: A survey. VLDB J. 10, 4 (2001), 270–294.

15. Hardt, M., Ligett, K., McSherry, F. A simple and practical algorithm for differentially private data release. In NIPS (2012).

16. Kifer, D., Abowd, J., Gehrke, J. Vilhuber, L. Privacy: Theory meets practice on the map. In ICDE (2008).

17. Koutris, P., Upadhyaya, P., Balazinska, M., Howe, B., Suciu, D. Query-based data pricing. J. ACM 62, 5 (Nov. 2015), 43:1–43:44.

18. Laudon, K.C. Markets and privacy. Commun. ACM 39, 9 (1996), 92–104.

19. Li, C., Hay, M., Rastogi, V., Miklau, G., McGregor, A. Optimizing linear counting queries under differential privacy. In PODS (2010).

20. Li, C., Li, D.Y., Miklau, G. Suciu, D. A theory of pricing private data. ACM Trans. Database Syst. 39, 4 (Dec. 2014), 1–28.

21. Li, C., Miklau, G. Pricing aggregate queries in a data marketplace. In WebDB (2012).

22. Nash, A., Segoufin, L., Vianu, V. Views and queries: Determinacy and rewriting. TODS 35, 3 (2010).

23. Nissim, K., Vadhan, S., Xiao, D. Redrawing the boundaries on purchasing data from privacy-sensitive individuals. In Conference on Innovations in Theoretical Computer Science (2014), 411–422.

Back to Top

Authors

Chao Li (chaoli@google.com), Google Inc.

Daniel Yang Li (dyli@cs.washington.edu), University of Washington, Seattle, WA.

Gerome Miklau (miklau@cs.umass.edu), University of Massachusetts, Amherst, MA.

Dan Suciu (suciu@cs.washington.edu), University of Washington, Seattle, WA.

Back to Top

Footnotes

a. It is possible that π (q, 0) = ∞.

The original version of this paper was published in ACM Transactions on Database Systems 39, 4 (Dec. 2014), 1–28.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.


©2017 ACM  0001-0782/17/12

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2017 ACM, Inc.


 

No entries found