Research highlights
# A Theory Of Pricing Private Data

When the analysis of individuals' personal information has value to an institution, but it compromises privacy, should individuals be compensated? We describe the foundations of a market in which those seeking access to data must pay for it and individuals are compensated for the loss of privacy they may suffer.

The interests of individuals and institutions with respect to personal data are often at odds. Personal data has great value to institutions: they eagerly collect it and monetize it by using it to model customer behavior, personalize services, target advertisements, or by selling the data directly. Yet the inappropriate disclosure of personal data poses a risk to individuals. They may suffer a range of harms including elevated prices for goods or services, discrimination, or exclusion from employment opportunities.^{3}

A rich literature on privacy-preserving data analysis^{4,6,11} has tried to devise technical means for negotiating these competing interests. The goal is to derive accurate aggregate information from data collected from a group of individuals while at the same time protecting each member's personal information. But this approach necessarily imposes restrictions on the use of data. A seminal result from this line of work is that any mechanism providing reasonable privacy must strictly limit the number of query answers that can be accurately released.^{5} Nevertheless, recent research into differential privacy,^{7} a formal model of privacy in which an individual's privacy loss is rigorously measured and bounded, has shown that, for some applications, accurate aggregate analysis need not entail significant disclosure about individuals. Practical adoption of these techniques is slowing increasing: they have been used in a U.S. Census product^{16} and for application monitoring by Google^{9} and Apple.^{13}

But there remain settings where strictly limiting privacy loss degrades the utility of data and means the intended use of data will be impossible. We therefore pursue an alternative approach which allows a non-negligible degree of privacy loss if that loss is compensated in accordance with users' preferences. Compensating privacy loss is an improvement over the narrower view that mandates negligible privacy loss because it empowers individuals to control their data through financial means and permits more accurate data analysis if end-users are willing to pay for it.

Considering personal information as a resource, one that is valuable but also exchangeable, is not new. Twenty years ago, Laudon proposed that personal information be bought and sold in a national market^{18} and there is a mature literature on economic aspects of privacy.^{1} And in today's online services, one could argue that individuals are compensated indirectly for contributing their personal data. Many internet companies acquire personal data by offering a (purportedly) free service, attracting users who provide their data, and then monetizing the personal data by selling it, or by selling information derived from it, to third parties.

Even so, a technical foundation for a market for personal information is lacking, particularly one that is consistent with recent advances in the formal modeling of privacy. We address this by proposing a formal framework for assigning prices to queries in order to compensate data owners for their loss of privacy. Our framework borrows from, and extends, key principles from both differential privacy^{7,8} and data markets.^{17,21}

There are three types of actors in our setting: individuals, or data *owners*, contribute their personal data; a *buyer* submits an aggregate query over many owners' data; and a *market maker*, trusted to answer queries on behalf of owners, charges the buyer and compensates the owners.

Our framework makes three important connections:

**1.1. Perturbation and price**

In response to a buyer's query, the market maker computes the true query answer, adds random noise, and returns a perturbed result. Using differential privacy, perturbation is always necessary. Here query answers can be sold unperturbed, but the price would be high because each data owner contributing to an aggregate query needs to be compensated. By adding perturbation to the query answer, the price can be lowered: the more perturbation, the lower the price. When issuing the query, the buyer specifies the degree of accuracy for which he is willing to pay. Unperturbed query answers are very expensive, but at the other extreme, query answers are almost free if the noise added is the same as required by differential privacy with conservative privacy parameters. The relationship between the accuracy of a query result and its cost depends on the query and the preferences of contributing data owners. Formalizing this relationship is one of the goals of this article.

**1.2. Arbitrage and perturbation**

Arbitrage allows a buyer to obtain the answer to a query more cheaply than its advertised price by deriving the answer from a less expensive alternative set of queries. Arbitrage is possible because of inconsistency in a set of priced queries. As a simple example, suppose that a given query is sold with two options for perturbation, measured by variance: $5 for a variance of 10 and $200 for a variance of 1. A savvy buyer seeking a variance of 1 would never pay $200. Instead, he would purchase the first query 10 times, receive 10 noisy answers, and compute their average. Since noise is added independently, the variance of the resulting average is 1, and the total cost is only $50. The pricing of queries should avoid arbitrage opportunities. While this has been considered before for data markets,^{2, 17, 21} it has not been studied for perturbed query answers. Formalizing arbitrage for noisy queries is a second goal of this article.

**1.3. Privacy-loss and payments**

Given a randomized mechanism for answering a query *q*, a common measure of privacy loss to an individual is defined by differential privacy^{7}:it is the maximum ratio between the probability of returning some fixed output with and without that individual's data. Differential privacy imposes a bound of *e*^{ε} on this quantity, where ε is a small constant, presumed acceptable to all individuals. Our framework contrasts with this in several ways. First, the privacy loss is not bounded, but depends on the buyer's request. If the buyer asks for a query with low variance, then the privacy loss to individuals will be high. These data owners must be compensated for their privacy loss by the buyer's payment. In addition, we allow each data owner to value their privacy loss separately, by demanding greater or lesser payments. Formalizing the relationship between privacy loss and payments to data owners is a third goal of this article.

In our framework, the burden of the market maker is not to enforce a strict limit on the privacy loss of individuals. Instead, they must ensure that prices are set such that, whatever disclosure is obtained by the buyer, all contributing individuals are properly compensated. In particular, if a sequence of queries can indeed reveal the private data for most individuals, its price must approach the total cost of the entire database.

In this section we describe the basic architecture of the private data pricing framework, illustrated as shown in Figure 1.

**Figure 1: The pricing framework has 3 components: (A) Pricing and purchase: the buyer asks a query Q = (q, v) and must pay its price, π(Q); (B) Privacy loss: by answering Q, the market maker leaks some information ε about the private data of the data owners to the buyer; (C) Compensation: the market maker must compensate each data owner for their privacy loss with micro-payments; μ_{i}(Q) is the total of the micro-payments for all users in bucket i. The pricing framework is balanced if the price π(Q) is sufficient to cover all micro-payments μ_{j} and the micro-payments μ_{i} compensate the owners for their privacy loss ε.**

**2.1. The main actors**

The main actors in our proposed marketplace are data owners, query buyers, and a market maker negotiating between the two.

**The market maker**. The market maker is trusted by the buyer and by each of the data owners. He collects data from the owners and sells it in the form of queries. When a buyer decides to purchase a query, the market maker collects payment, computes the answer to the query, adds noise as appropriate, returns the result to the buyer, and finally distributes individual payments to the data owners. The market maker may retain a fraction of the price as profit.

**The owner and her data**. Each owner contributes a single tuple conforming to a relational schema *R*(𝔸), with attributes 𝔸 = {*A*_{1}, *A*_{2}, ..., *A _{k}*}. The crossproduct of the attribute domains, written

**The buyer and his queries**. The buyer is a data analyst who wishes to compute some queries over a data. We restrict our attention to linear aggregation queries over instance *I*, a common class of queries that has received considerable attention from the privacy community.^{8, 15, 19}

To express linear queries, we first represent the instance *I* by a finite *data vector* **x** consisting of non-negative integral counts. For each element *t* ∈ *dom*(𝔸), the vector **x** has one entry that reports the number of individuals whose tuple matches *t*. We assume that *dom*(𝔸) is ordered, and denote *x _{i}* the

DEFINITION 1 (LINEAR QUERY). A linear query *is a real-valued vector* **q** = (*q*_{1}, *q*_{2} ... *q _{n}*).

The class of linear queries includes many common aggregation queries including general predicate counting queries and familiar statistical queries such as histogram counts, data cube queries, and marginals. Linear queries can express differences, weighted counts, and averages over discrete domains. While the query vector **q** is large, in practice queries are expressed compactly, directly over the relational representation of a data, in a query language like Structured Query Language (SQL).

EXAMPLE 2. *Consider a competition between candidates A and B that is decided by a population of voters. The ratings and descriptive fields of voters are sensitive and should be compensated properly if used in any way. Imagine that users contribute their gender and age along with two numerical ratings, each consisting of values from the domain* {0, 1, 2, 3, 4, 5}. *Thus Alice's tuple may be* ("*Female*", 39, 0, 5) *if she strongly favors candidate B over candidate A. If the domain for the age attribute is* [1 ... 120], *then the database vector will have size* 2 × 120 × 6 × 6 = 8640.

*Examples of linear queries include: the total of all ratings for candidate A; the number of female voters over 40 who gave candidate A a rating of 5; the number of voters whose rating for candidate A exceeds that for candidate B*.

We make two comments about our framework. First, although the vector **x** removes personally identifiable information (since *x _{i}* is the total number of users with a particular combination of attribute values), queries can still leak information about individual users. Differential privacy is designed to limit such leaks. Second, we assume that the buyer is allowed to issue multiple queries, which means that he can combine information derived from multiple queries to infer answers to other queries not explicitly requested. This presents a challenge we must address: to ensure that the buyer pays for any information that he might derive directly or indirectly.

**2.2. Balanced pricing frameworks**

The market maker enters into two contracts: (1) they promise to answer a buyer's queries according to an agreed price, *π*, (Section 3), and (2) they promise to compensate the data owners with a micro-payment *μ*_{i}(*ε*) whenever they suffer a privacy loss *ε* in response to a buyer's query (Section 5).

In the contract with the buyer, the market maker allows the buyer to specify, for each linear query **q**, an amount of noise *v* that he is willing to tolerate in the answer. Adding noise reduces the price. Thus, the buyer's query is a pair **Q** = (**q**, *v*), where **q** is a linear query and *v* ≥ 0 represents an upper bound on the variance, and the price depends on both, π(**Q**) = π(**q**, *v*) ≥ 0. The market maker answers by first computing the exact answer **q**(**x**), then adding noise sampled from a distribution with mean 0 and variance at most *v*. This feature gives the buyer more pricing options because, by increasing *v*, he can lower his price. Notice that the price depends only on the variance *v*, and not on the type of noise.

In the contract with the data owners, the market maker promises to compensate her with a micro-payment. We denote μ_{i}(ε) the sum of all micropayments to all users whose attributes match the *i*'th entry to the vector, where ε is the privacy loss incurred by answering the query **Q**. The privacy loss depends both on the variance, and on the type of noise that the market maker uses to answer queries: in Section 4 we will restrict the noise to the Laplace distribution, for which there exists an explicit formula connecting the privacy loss ε to the variance. In that case the micro-payment depends on the buyer's query, μ_{i}(**Q**). For the pricing framework to be balanced, we must have . Note that μ_{i}(**Q**) needs to be further split among all users participating in the *i*'th bucket of the data **x**.

EXAMPLE 3. *Continuing Example 2, suppose that there are 1000 voters, and that Bob, the buyer, wants to compute the sum of ratings for candidate A. Assume that each voter requires $10 for each raw vote. For an accurate answer to the query, Bob needs to pay* $10,000, *which is, arguably, too expensive*.

*Assume Bob is willing to buy the query perturbed with variance v* = 5,000, *which would give an error of* ±300 *with* 94% *confidence. The market maker should charge Bob a much lower price; to see how low, we need to consider how the market maker compensates the data owners. We assume that he uses Laplace noise for the perturbation, and therefore one can show that the answer to the query is ε-differentially private with ε* = 0.1, *which offers reasonably good privacy to all data owners: each will be happy to accept only* $0.001 *for basically no loss of privacy, and Bob pays only* $1 *for the entire query. The challenge is to design the prices* in between. *For example, suppose the data owner wants to buy more accuracy, say a variance v* = 50, *to reduce the error to* ±30. *How should the price be set? For now, let us observe that the price cannot exceed* $100. *If it did, then a savvy buyer would never pay that price, instead he would purchase the* $1 *query 100 times, compute the average, and obtain the answer with a variance of* 5000/100 = 50. *This is an example of arbitrage and the market maker should define a pricing function that avoids it*.

While in the contract with the buyer the price depends only on the variance and not on the type of perturbation, the contract with the data owner is highly sensitive to the type of noise. For example, consider noise defined by the following probability distribution: *P*(0) = 1 – 2/*m* and *P*(±*m*) = 1/*m*, where *m* = 10^{64}. This noise distribution has mean 0 and variance 2*m*, but is a poor choice for this market. On the one hand, it has a high variance, which implies a low price π. On the other hand, it returns an accurate answer with extremely high probability, leading to huge privacy losses ε_{i}, and, consequently, to huge micro-payments. The market maker will not be able to recover his costs. Thus, in order to design a balanced pricing framework, we need to have a perturbation mechanism where the privacy loss is given by an explicit function in the variance; in Section 5.1 we will only consider the Laplace mechanism, where such a function exists.

In this section we describe the first component of the framework as shown in Figure 1: the pricing function π(**Q**) = π(**q**, *v*). We denote ℝ^{+} = [0, ∞) and .

DEFINITION 4. *A pricing function is* .

In our framework, the buyer is allowed to issue multiple queries. As a consequence, an important concern is that the buyer may combine answers from multiple queries and derive an answer to a new query, without paying the full price for the latter, a situation we call *arbitrage*. A reasonable pricing function must guarantee that no arbitrage is possible, in which case we call it *arbitrage-free*. Such a pricing function ensures that the market maker receives proper payment for each query by removing any incentive for the buyer to "game" the system by asking a set of cheaper queries in order to obtain the desired answer. Here we formally define arbitrage-free pricing functions, study their properties, and discuss the construction of arbitrage-free pricing functions.

**3.1. Queries and answers**

A *randomized mechanism* is a random function *κ*, with some input **x**, denoted *κ*(**x**). For a given query **Q** = (**q**, *v*), the market maker answers it using a randomized mechanism *κ*_{Q}, with the property that, for any **x**, **E**(*κ*_{Q}(**x**) ) = **q**(**x**) and **Var** (*κ*_{Q}(**x**) ≤ *v*. In other words, when the buyer asks for a query **Q**, the market maker samples one value from the distribution *κ*_{Q} and returns it to the buyer, in exchange for payment π(**Q**). We abbreviate *κ*_{Q} with *κ* when **Q** is clear from the context.

DEFINITION 5. *We say that a random function* *κ*(**x**) answers the query **Q** = (**q**, *v*) on the database **x** *if its expectation is* **q**(**x**) *and its variance is less than or equal to v*.

Other options for answering queries are possible, and we briefly discuss them in Section 6.

To illustrate with a simple example, consider the mechanism *κ*_{Q} which, on input **x** first computes the query **q**(**x**), then adds random noise with mean 0 and variance *v*. In this section we do not impose any restrictions on the type of perturbation used in answering the query.

We assume that the market maker is stateless: he does not keep a log of previous users, their queries, or of released answers. As a consequence, each query is answered using an independent random variable. If the same buyer issues the same query repeatedly, the market maker answers using independent samples from the random function *κ*. Of course, the buyer would have to pay for each query separately.

**3.2. Answerability**

Before investigating arbitrage we establish the key concept of query answerability. This notion is well studied for deterministic queries and views,^{14, 22} but, in our setting, the query answers are random variables, and it requires a precise definition.

DEFINITION 6 (ANSWERABILITY). *A query* **Q** *is* answerable *from a multi-set of queries* **S** = {**Q**_{1}, ..., **Q**_{k}} *if there exists a function f* : ℝ^{k} → ℝ *such that, for any mechanisms K*_{1} ..., *K _{k}*,

*We say that* **Q** *is* linearly answerable *from* **S**, *and write* **S** → **Q**, *if the function f is linear; in that case, we will also say that* **S** determines **Q**.

In other words, if we have already computed some queries, then we can compute a new query by applying a function to their results. For example, suppose we have computed **Q**_{1} = (**q**_{1}, *v*_{1}) and **Q**_{2} = (**q**_{2}, *v*_{2}), and obtained the answers *y*_{1}, *y*_{2}. Then we can answer the query **Q**_{3} = ((**q**_{1}+**q**_{2})/2, (*v*_{1}+*v*_{2})/4) by applying the function *f*(*y*_{1}, *y*_{2}) = (*y*_{1} + *y*_{2})/2. We pay only for **Q**_{1}, **Q**_{2}, and do not pay again for **Q**_{3}.

How do we check if a query can be answered from a given set of queries? In this paper we give a partial answer, by characterizing when a query is *linearly* answerable.

PROPOSITION 7. *Let* **S** = {(**q**_{1}, *v*_{1}),...,(**q**_{m}, *v*_{m})} *be a multi-set of queries, and* **Q** = (**q**, *v*) *be a query. Then the following conditions are equivalent*.

**S → Q**.*There exists c*_{1}, ...,*c*such that_{m}*c*_{1}**q**_{1}+...+*c*_{m}**q**_{m}=**q***and*

It follows that deciding determinacy can be done in polynomial time, by solving for the coefficients *c*_{1}, ..., *c _{m}* using a quadratic program.

**3.3. Arbitrage-free pricing functions: definition**

Arbitrage is possible when the answer to a query **Q** can be obtained more cheaply than the advertised price π(**Q**) from an alternative set of priced queries. When arbitrage is possible it complicates the interface between the buyer and market maker: the buyer may need to reason carefully about his queries to achieve the lowest price, while at the same time the market maker may not achieve the revenue intended by some of his advertised prices. Thus, arbitrage is undesirable, and we want to design pricing functions that are arbitrage-free.

DEFINITION 8 (ARBITRAGE-FREE). *A pricing function π is arbitrage-free if, whenever a multi-set of queries determines a query, the multi-set is at least as expensive as the query. That is*, **S** → **Q** *implies* ∑_{Q' ∈ s} π(**Q**') ≥ π (**Q**).

If π does permit arbitrage, then a buyer would never pay full price for the determined query, but instead would purchase the multiset of queries that determine it.

EXAMPLE 9. *Consider a query* (**q**, *v*) *offered for price* π(**q**, *v*). *A buyer who wishes to improve the accuracy of the query may ask the same query n times*, (**q**, *v*), (**q**, *v*), ..., (**q**, *v*), *at a total cost of n* · π(**q**, *v*). *The buyer then computes the average of the query answers to get an estimated answer with a much lower variance, namely v/n. The pricing function must ensure that the total payment collected from the buyer covers the cost of this lower variance, in other words n* · π(**q**, *v*) ≥ π(**q**, *v/n*). *If π is arbitrage free, then it is easy to check that this condition holds. Indeed*, {(**q**, *v*), ..., (**q**, *v*)} → (*n***q**, *nv*) → (**q**, *v/n*), *and arbitrage-freeness implies* π(**q**, *v/n*) ≤ π(**q**, *v*) + ... + π(**q**, *v*) = *n* · π (**q**, *v*).

One can prove^{20} that any arbitrage-free pricing function π has the following properties:

- The zero query is free: π (
**0**,*v*) = 0. - Higher variance is cheaper:
*v*≤*v*' implies π(**q**,*v*) ≥ π(**q**,*v*'). - The zero-variance query is the most expensive
^{a}: π(**q**, 0) ≥ π(**q**,*v*) for all*v*≥ 0. - Infinite noise is free: if π is continuous at
**q**= 0, then π(**q**, ∞) = 0. - As
*v*→ ∞, π (**q**,*v*) = Ω(1/*v*).

Arbitrage-free pricing functions have been studied before,^{17, 21} but only in the context of deterministic (i.e. unperturbed) query answers. Our definition extends those in^{17, 21} to queries with perturbed answers.

**3.4. Arbitrage-free pricing functions: synthesis**

Does an arbitrage-free pricing function exist? The zero-cost function π = 0, where every query is free, is arbitrage-free. A constant-price pricing function, which charges the same amount *c* > 0 for every query, is not arbitrage-free, because in that case the zero-query would also cost *c*, and by item (1) above π must have arbitrage. In this section we prove that non-trivial arbitrage-free pricing functions exist, and give some sufficient rules for synthesizing such functions, and in Section 3.5 we show that general construction of arbitrage-free pricing functions remains a difficult problem.

Example 9 and item (5) suggest defining a pricing function that decreases linearly with the variance, π ∼ 1/*v*. Recall that a *semi-norm* is a function *f* : ℝ^{n} → ℝ^{+} that satisfies the following properties:

- For any
*c*ε ℝ and any**q**∈ ℝ^{n},*f*(*c***q**) = |*c*|*f*(**b**). - For any
**q**_{1},**q**_{2}∈ ℝ^{n},*f*(**q**_{1}+**q**_{2}) ≤ (**q**_{1}) +*f*(**q**_{2}).

One can check that, **q** = 0 implies *f*(**q**) = 0; if the converse also holds, then *f* is called a norm. We prove:

PROPOSITION 10 (Basic pricing functions). *Let π be a pricing function of the form* π(**q**, *v*) = *f* ^{2}(**q**)/*v* , *for some function f. Then π is arbitrage-free iff f is a semi-norm*.

Thus, we can obtain an arbitrage-free pricing function from a semi-norm. However, all these pricing functions set an infinite price for the raw (unperturbed) data, because in that case the variance is *v* = 0. In some cases, the market maker may be willing to sell the raw data for some high, but finite price. We describe next a method for synthesizing other arbitrage-free pricing functions, including ones that set a finite price for the raw data.

PROPOSITION 11 (COMPOSED PRICING FUNCTIONS). *Let* *be a function that is sub-additive and non-decreasing (meaning f*(**x** + **y**) ≤ *f*(**x**) + *f*(**y**), *and* **x** ≤ **y** *implies f*(**x**) ≤ *f*(**y**)). *Then, for any arbitrage-free pricing functions* π_{1},...,π_{k}, *the function* π = *f*(π_{1},...,π_{k}) *is also arbitrage-free*.

A simple sufficient criterion for checking if a function *f* satisfies the properties required by the proposition is the following: if *f* is continuous, twice differentiable, *f*(**0**) = 0, ∂*f*/∂*x*_{i} ≥ 0, ∂^{2} *f*/∂*x*_{i}∂*x*_{j} ≤ 0 for all *i*, *j* = 1, *k*, then it is non-decreasing and sub-additive.

By using these two propositions we can construct a large set of arbitrage-free pricing functions. We list here a few:

COROLLARY 12. *All functions defined below are arbitrage-free, assuming* π_{1}, ..., π_{k} *are arbitrage free:*

The last entry defines bounded pricing functions, in particular they charge a finite price for the raw data.

**3.5. Deriving pricing functions from price points**

We discuss here a more flexible framework for defining an arbitrage-free pricing function: the market maker sets the price for a finite set of queries called *views*, then the system automatically extrapolates this to a pricing function on all queries.

DEFINITION 13. *A* price point *is a pair* (**V**, *p*), *where* **V** = (**q**, *v*) *is a query (called a view) and p* ε ℝ^{+} *is a fixed price*.

*Given a set of price points* **V** = ((**V**_{1}, *p*_{1}),..., (**V**_{m}, *p*_{m})), *we say that a pricing function π is* valid *w.r.t*. **V** *if it is arbitrage-free and agrees with all price points*, π(**V**_{i})=*p*_{i}, *for all* (**V**_{i}, *p*_{i}) ∈ **V**,

This represents an ideal pricing framework because the market maker only needs to set prices for a small number of queries and variances, and the system extrapolates it to a valid pricing function. However, the technical challenge is: how do we compute a valid pricing function from a set of price points?

In general, there may not exist a valid pricing function. For example, the market maker may choose two views such that the first determines the second, yet the price of the second is set higher than that of the first view. No arbitrage-free pricing function can agree with both price points. If any valid pricing function exists, then we say that the set of price points is *consistent*.

The key to designing a valid pricing function is to compute, for any given query, the cheapest plan to answer it from the views in the set of price points. A *procurement plan* is a plan for answering a query by purchasing views in the set of price points.

DEFINITION 14 (PROCUREMENT PLAN). *Consider an ordered set of price points*, **V** = {(**V**_{1}, *p*_{1})(**V**_{2}, *p*_{2}),...,(**V**_{m}, *p*_{m})}. *A* procurement plan *is an ordered multi-set of non-negative integers*, **B** = {*b*_{1}, *b*_{2}, ..., *b _{m}*},

For a simple example, suppose there is a single price point, **V** = {((**q**, 100), $5)}, which charges $5 for some query with variance 100. The buyer wants to purchase **Q** = (**q**, 25). Then, a procurement plan is {*b*_{1} = 4}. In other words, we must purchase the query 4 times, and compute the average, in order to reduce the variance to 25 (since the random noise added to different purchases is independent).

Procurement plans should not be confused with answerability (Definition 6). There the queries were already purchased. Now we have to decide which views to purchase and, in particular, we may purchase a view several times. We call the cost of the cheapest procurement plan the *arbitrage pricing function*.

DEFINITION 15. *Given a set of price-points ν, the* arbitrage pricing function

Note that, by this definition, if a query is not answerable from a set of price points, then the arbitrage price is ∞. The arbitrage function has a simple interpretation. Once the views are offered for sale, a buyer can purchase any query and pay only the arbitrage function, regardless of what the market maker wants to charge for that query. The arbitrage function is thus an upper bound on any valid pricing function. Could the market maker choose the arbitrage function as the official price for all queries? We answer this positively, by proving that the arbitrage function is indeed arbitrage-free.

THEOREM 16. *For any set of price points ν, the arbitrage function π_{ν} is arbitrage-free. ν is consistent, if the arbitrage price does not lower the price of any view: p_{i} ≤ π_{ν}(V_{i}) for all (V_{i}, p_{i}) ∈ ν*.

Thus, one choice for the market maker is to set all prices to be the arbitrage function. Unfortunately, it turns out that computing the arbitrage function is NP-hard.

THEOREM 17. *The following problem: "Given a set of price points ν, a query*

The proof if by reduction from the *Exact Cover by 3-Sets* problem. Hardness holds even if all views in **ν** have variance 0.

In this section we describe the second component of the pricing framework as shown in Figure 1: the privacy loss for each owner, denoted by ε. Our definition of privacy loss follows that of differential privacy. Given the database vector **x**, denote by **x**^{(j)} the database vector that results from adding one count to entry *j* and leaving all other values unchanged. An individual's privacy loss is measured by comparing the mechanism output on two data vectors that differ in *any* one entry.

DEFINITION 18. *Let K be any mechanism (meaning: for any database instance* **x**, *K*(**x**) *is a random variable). The* privacy loss *to each user, in notation* , *is defined as:*

*where* **x** *ranges over all possible databases and S ranges over measurable sets of* ℝ.

Recall that in Section 3 we defined the price π(**Q**) to depend only on the variance, and not on the mechanism used to answer the query. But the definition of privacy loss depends on the mechanism *K*, so now we need to choose a particular mechanism. We will use the classical Laplace mechanism; in that case, the privacy loss can be given as a function of the variance, and of a property that depends only on the query, called *query sensitivity*.^{7} The sensitivity of **q**, denoted *s*_{q}, is the largest possible change to the answer that can be caused by adding or removing one individual's contribution: .

Formally, the *Laplace Mechanism*, denoted ℒ, answers a query **Q** = (**q**, *v*) by returning ℒ_{Q}(**x**) = _{q}(_{x}) + *ρ*, where *ρ* is noise drawn from a Laplace distribution, centered at zero, with scale . The privacy loss of each individual is bounded by .

In this section we define formally when a pricing framework is *balanced* and we provide a general procedure for designing a balanced pricing framework. The concept of balance brings together the three components as shown in Figure 1: the query price π, the privacy loss ε, and the micro-payments μ_{i}. We begin with a description of micro-payments.

**5.1. Micro-payments to data owners**

By answering a buyer's query **Q**, using some mechanism *κ*_{Q}, the market maker leaks some of the private data of the data owners. He must compensate each data owner with some micro-payment. Recall that μ_{i}(**Q**) denotes the sum of all micro-payments due to the data owner whose attributes match *x _{i}*, the

**5.2. Balanced pricing frameworks: definition**

The contract between data owners and the market-maker consists of non-decreasing functions , *i* = 1, *n*, s.t *W*_{i}(0) = 0. Here the market maker promises to pay the data owners contributing to the *i*' bucket a micro-payment of at least μ_{i} ≥ *W*_{i}(*ε*)in the event of a privacy loss ε. We denote **W** = (*W*_{1}, ..., *W _{n}*) the set of contracts between the market maker and all data owners.

The connection between the micro-payments μ_{i}, the query price π and the privacy loss ε_{i} is captured by the following definition.

DEFINITION 19. *We say that the micro-payment functions μ _{i}*,

*Fix a query answering mechanism κ. We say that a micro-payment function μ _{i} is* compensating

The market maker will insist that the micro-payment functions are cost-recovering: otherwise, he will not be able to pay the data owners from the buyer's payment. In addition, a data owner will insist that the micro-payment function is compensating: this enforces the contract between her and the market maker, guaranteeing that she will be compensated at least *W _{i}*(ε), in the event of a privacy loss ε.

Fix a query answering mechanism *κ*. We denote a pricing framework (π, ε, μ, **W**), where π(**Q**), μ_{i}(**Q**) are the buyer's price and the micro-payments, ε = (ε_{1},..., ε_{n}) where ε_{i}(κ_{Q}) is the privacy loss corresponding to the mechanism *κ*, and *W _{i}*(ε) is the contract with the data owner

DEFINITION 20. *A pricing framework* (π, ε, μ, **W**) *is* balanced *if (1) π is arbitrage-free and (2) the micro-payment functions μ are arbitrage-free, cost-recovering for π, and compensating for* **W**.

We explain how the contract between the data owner and the market maker differs from that in privacy-preserving mechanisms. Let ε > 0 be a small constant. A mechanism *K* is called *differentially private*^{7} if, for any measurable set *S*, any database vector **x** and for any entry *j* of **x:**

In differential privacy, the basic contract between the mechanism and the data owner is the promise to every user that her privacy loss is no larger than ε. In our framework for pricing private data we turn this contract around. Now, privacy *is* lost, and Definition 18 quantifies this loss. The contract requires that the users are compensated according to their privacy loss. At an extreme, if the mechanism is ε-differentially private for a tiny ε, then each user will receive only a tiny micro-payment *W _{i}*(ε); as her privacy loss increases, she will be compensated more.

The micro-payments circumvent a common limitation of differentially-private mechanisms. In differential privacy, the data analyst typically has a fixed budget, ε, granted by the data curator, for all queries that he may ever ask. In order to issue *N* queries, he needs to divide the privacy budget among these queries, and, as a result, each query will be perturbed with greater noise. After issuing these *N* queries, he can no longer query the database, because otherwise the contract with the data owner would be breached.

In our pricing framework there is no such hard limitation because the buyer simply pays for each query. The budget is now a real financial quantity, and the buyer can ask as many query as he wants, with as high accuracy as he wants, as long as he has money to pay for them. As a consequence, it is the analyst-buyer, rather than the data owner, who ultimately determines the actual privacy loss. A similar model was proposed by Ghosh et al., in which the ε budget is determined by the budget of the data analyst.^{12}

**5.3. Balanced pricing frameworks: synthesis**

Call (ε, μ, **W**) *semi-balanced* if all micro-payment functions are arbitrage free and compensating w.r.t. *κ*; that is, we leave out the pricing function π and the cost-recovering requirement. The first step is to design a semi-balanced set of micro-payment functions.

PROPOSITION 21. *Let ℒ be the Laplace Mechanism, and let the contract functions be linear*, *W _{i}*(ε) =

The proposition defines the micro-payment associated to linear contracts *W _{i}*, which requires infinite payment for total loss of privacy. In practice, data owners are willing to sell their raw data at some high (but finite) price. We show next how to derive new semi-balanced micro-payments by applying sub-additive functions to other micro-payments.

PROPOSITION 22. *Fix k semi-balanced pricing frameworks*, (ε, μ, **W**^{j}), *j* = 1, ..., *k*. *Define* , *and* , *for each i* = **1**, ..., *n, where each function* , *i* = 1,...,*n*, *is non-decreasing, sub-additive, and satisfies f _{i}*(

Finally, we choose a payment function such as to ensure that the micro-payments are cost-recovering.

PROPOSITION 23. *Suppose that* (ε, μ, **W**) *is semi-balanced, and define* π (**Q**) = ∑_{i}μ_{i}(**Q**). *Then*, (π ε, μ, **W**) *is balanced*.

To summarize, the synthesis procedure for a pricing framework proceeds as follows. Start with the simple micro-payment functions given by Proposition 21, which ensure linear compensation for each user. Next, modify both the micro-payment and the contract functions using Proposition 22, as desired, in order to adjust to the preferences of individual users (e.g. in order to allow a user to set a price for her true data). Finally, define the query price to be the sum of all micro-payments (Proposition 23), then increase this price freely, by using any method in Section 3.4.

A number of issues deserve further attention if we are to achieve a fully-functional marketplace for private data.

First, we have made two restrictions which limit the prevention of arbitrage in our market. We require the market maker to answer queries using an unbiased estimator (Definition 5) and therefore arbitrage-freeness only prevents an adversary from deriving an unbiased estimator of a query using cheaper queries (Definition 6). In addition, we have considered a restricted notion of answerability, limiting our attention to adversaries who only consider queries computed by linear functions of other queries. Both of these assumptions limit the class of adversaries against which the arbitrage-free property is guaranteed to hold.

A more general approach would be to relax Definition 5 to allow queries to be answered using either biased or unbiased estimators and to include answerability using non-linear functions. This would provide the market maker with a stronger guarantee against arbitrage. However it would likely make reasoning about determinacy and arbitrage-free pricing significantly more difficult, and it would further restrict the set of arbitrage-free pricing functions available to the market maker. In other words, a more powerful notion of arbitrage would lead to more restrictive pricing functions, potentially limiting the ability of the market maker to set prices. The tradeoff between completeness of the framework and the feasibility of analyzing arbitrage is an important topic for future research.

A second issue is the problem of incentivizing the data owner to participate in the database and truthfully report her privacy valuations. Currently, there is nothing stopping the data owner from quoting an impossibly high price, for even a tiny loss of her privacy. In other words, she would choose a contract function *W*(ε) that is as close to ∞ as possible. Incentivizing users to report their true valuation is a goal of *mechanism design*. This has been studied for private data only in the restricted case of a single query, and has been shown to be a difficult task.^{10,12} In Ref. Li et al.^{20} we make a preliminary attempt to address this issue by requiring that users choose from a limited set of contract functions (e.g. one appropriate for a risk-tolerant user and one appropriate for a risk-averse user).

A third issue is the protection of privacy valuations themselves. When a user has sufficient freedom to choose their privacy valuation, it may be strongly correlated with the data *x _{i}* itself. In that case, even releasing the price of a query may lead to privacy loss, a factor not considered in the framework described above. Hiding the valuation itself is a difficult problem which is still being actively investigated in mechanism design.

We have introduced a framework for selling private data. Buyers can purchase any linear query, with any amount of perturbation, and need to pay accordingly. Data owners, in turn, are compensated according to the privacy loss they incur for each query. Buyers are allowed to ask an arbitrary number of queries, and we have designed techniques for ensuring that the prices are arbitrage-free, according to a specific definition of arbitrage-freeness, meaning that buyers are guaranteed to pay for any information they may further extract from the queries. Our pricing framework is balanced, in the sense that the buyer's price covers the micro-payments to the data owner and each micro-payment compensates the users according to their privacy loss.

An interesting open question is whether we can achieve both truthfulness (e.g. as discussed in Ghosh and Roth^{12}) and arbitrage-freeness (as discussed in the current paper) when pricing private data. Further, it remains to consider general notions of answerability that go beyond linear answerability, or to bound the impact non-linear estimation methods could have in the context of arbitrage.

1. Acquisti, A., Taylor, C.R., Wagman, L. The economics of privacy. *Journal of Economic Literature 52*, 2 (2016).

2. Balazinska, M., Howe, B., Suciu, D. Data markets in the cloud: An opportunity for the database community. *PVLDB 4*, 12 (2011), 1482–1485.

3. Calo, R. The boundaries of privacy harm. *Indiana Law Journal 86*, 3 (2011).

4. Chen, B.-C., Kifer, D., LeFevre, K., Machanavajjhala, A. Privacy-preserving data publishing. In *Foundations and Trends in Databases* (2010).

5. Dinur, I., Nissim, K. Revealing information while preserving privacy. In *PODS* (2003), 202–210.

6. Dwork, C. A firm foundation for private data analysis. *Commun. ACM 54*, 1 (2011), 86–95.

7. Dwork, C., McSherry, F., Nissim, K., Smith, A. Calibrating noise to sensitivity in private data analysis. In *TCC* (2006), 265–284.

8. Dwork, C., Roth, A. *The Algorithmic Foundations of Differential Privacy*. Foundations and Trends in Theoretical Computer Science (2014) Now Publishers Inc., Hanover, MA, USA.

9. Erlingsson, U., Pihur, V., Korolova, A. Rappor: Randomized aggregatable privacy-preserving ordinal response. In *Computer and Communications Security (CCS)* (2014), 1054–1067.

10. Fleischer, L., Lyu, Y.-H. Approximately optimal auctions for selling privacy when costs are correlated with data. In *ACM Conference on Electronic Commerce* (2012), 568–585.

11. Fung, B.C.M., Wang, K., Chen, R., Yu, P.S. Privacy-preserving data publishing: A survey of recent developments. *ACM Comput. Surv. 42*, 4 (2010).

12. Ghosh, A., Roth, A. Selling privacy at auction. In *ACM Conference on Electronic Commerce* (2011), 199–208.

13. Greenberg, A. Apple's "differential privacy" is about collecting your data—but not *your* data. *Wired* (Jun 13 2016).

14. Halevy, A.Y. Answering queries using views: A survey. *VLDB J. 10*, 4 (2001), 270–294.

15. Hardt, M., Ligett, K., McSherry, F. A simple and practical algorithm for differentially private data release. In *NIPS* (2012).

16. Kifer, D., Abowd, J., Gehrke, J. Vilhuber, L. Privacy: Theory meets practice on the map. In *ICDE* (2008).

17. Koutris, P., Upadhyaya, P., Balazinska, M., Howe, B., Suciu, D. Query-based data pricing. *J. ACM 62*, 5 (Nov. 2015), 43:1–43:44.

18. Laudon, K.C. Markets and privacy. *Commun. ACM 39*, 9 (1996), 92–104.

19. Li, C., Hay, M., Rastogi, V., Miklau, G., McGregor, A. Optimizing linear counting queries under differential privacy. In *PODS* (2010).

20. Li, C., Li, D.Y., Miklau, G. Suciu, D. A theory of pricing private data. *ACM Trans. Database Syst. 39*, 4 (Dec. 2014), 1–28.

21. Li, C., Miklau, G. Pricing aggregate queries in a data marketplace. In *WebDB* (2012).

22. Nash, A., Segoufin, L., Vianu, V. Views and queries: Determinacy and rewriting. *TODS 35*, 3 (2010).

23. Nissim, K., Vadhan, S., Xiao, D. Redrawing the boundaries on purchasing data from privacy-sensitive individuals. In *Conference on Innovations in Theoretical Computer Science* (2014), 411–422.

a. It is possible that π (**q**, 0) = ∞.

The original version of this paper was published in *ACM Transactions on Database Systems 39*, 4 (Dec. 2014), 1–28.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

**©2017 ACM 0001-0782/17/12**

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected] or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2017 ACM, Inc.

No entries found