Selling personal information is very different from selling physical goods, and raises novel challenges. On the sell-side of the market, individuals own their own personal data and experience costs based on the usage of their data insofar as that usage leads to future quantifiable harm. On the buy-side of the market, buyers are interested in "statistical information" about the dataset, that is, aggregate information, rather than information derived from a single individual. Differential privacy1 provides a means to quantify the harm that can come to individual data owners as the result of the use of their data. This ability to quantify harm allows for data owners to be compensated for the risk they incur. Past work studying markets for private data focused on the simple case in which the buyer is interested in only the answer to a single linear function of the data,2,3,4,6 which makes the buy-side of the market particularly simple.
The following paper introduces a fascinating and complicated issue that arises on the buy-side of the market when buyers are interested in multiple linear functions of the same dataset. Information exhibits complementarities: given some information about a dataset, it is possible to learn other things about the dataset. This means that when pricing information, there might be opportunities for arbitrage: rather than directly buying the answer to the query he is interested in, the buyer might instead more cheaply buy a bundle of queries that lets him deduce the answer he is interested in. The authors give conditions under which a pricing is arbitrage free. This is a compelling condition to ask for: it means that it is a dominant strategy for arriving buyers to faithfully request the answer to the query they are interested in, rather than trying to game the system. By asking for arbitrage-free pricings, the authors are making the market safe for buyers.
Reasoning about these arbitrage opportunities can be complicated: if the values of the purchased linear functions were revealed exactly, then the answer to any other query in the span of the purchased queries would be derivable. But to guarantee the sellers differential privacy, it is necessary to sell only noisy estimates of the data. This makes reasoning about what is derivable complex. Sensibly, since they are introducing a new problem, the authors opt to study a restricted notion of derivability and arbitrage. They give pricings that rule out arbitrage opportunities when the buyer is only allowed to learn by taking linear combinations of observed queries, is interested only in unbiased estimates of query values, and will attempt arbitrage only at the level of one query at a time. Because of the richness of the authors’ problem, one of the most exciting aspects of this work is the doors it opens for future exploration. Here, I will highlight what I think are the most interesting problems coming out of this paper:
- Multi-query arbitrage: The paper gives query pricings such that a buyer can never more cheaply derive the answer to a single query by buying a different bundle of queries. However, the buyer can still sometimes more cheaply derive the answer to one bundle of queries by buying the answers to another bundle! Which pricings can prevent this?
- Arbitrage for biased estimators: When buyers are only interested in unbiased estimates, the best linear unbiased estimator is always given by least-squares linear regression. However, buyers can often improve the accuracy of their derivations by trading off a small amount of bias for a large reduction in variance. The space of optimal estimators is much more complex when they do so. In general, a buyer can access the dataset by asking arbitrary "statistical queries," and can run any learning algorithm in the "SQ model" to derive the answers to other queries.5 Can the large literature on SQ learning be used to give arbitrage-free pricings for more general notions of arbitrage?
- Seller profit: Arbitrage-free pricings give a family of dominant-strategy truthful mechanisms for selling information. Suppose we know something about the distribution of buyer demands—can we find the arbitrage-free pricing that maximizes seller profit? This is particularly intriguing, because it seems the seller can sometimes increase her profit by selling noisier queries, thereby reducing the complementarities among the goods she is selling. The opportunity to increase profit by degrading product quality rarely arises in markets for physical goods.
This paper opens a rich research direction. I recommend that new Ph.D. students (or anyone looking for an attractive problem) read it.