From the demographic statistics produced by national census bodies to the complex predictive models built by companies in “Big Tech” and finance, data is the fuel that powers modern applications. Most of these applications rely on data derived from the properties and actions of individual people. Such data is therefore considered sensitive and in need of protection against inappropriate use or disclosure. Some protection comes from enforcing policies, access control, and contractual agreements. But we also seek technical interventions—definitions and algorithms that computer systems can apply to protect private information while still enabling the intended use.
Although there is not universal consensus, the model of differential privacy (DP) has emerged as the prevailing notion of privacy, with many deployments in industry and government, and thousands of research papers studying different aspects of the definition. At its heart, DP places a requirement on algorithms to introduce uncertainty to their output via randomization, so that the uncertainty is sufficient to provide reasonable doubt over whether the data of any particular person was part of the algorithm’s input. Contributing to its success is the fact that differential privacy can be achieved by following some simple recipes. A starting point is to compute the true answer to a numerical query and then add noise from a suitable random distribution to obtain a “privatized” output that can be shared. These recipes can then be augmented with additional ingredients, such as sampling, aggregation, and optimization, along with combining and post-processing the results of multiple queries to provide private algorithms for a range of questions. Consequently, DP techniques offering precise privacy guarantees have been presented for tasks ranging from spectral graph analysis to training neural networks.
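As a concrete illustration, the sketch below shows the basic “compute the true answer, then add noise” recipe for a counting query, whose answer changes by at most one when a single person’s record is added or removed. The names (such as laplace_count) and the example data are illustrative assumptions, not drawn from the paper.

```python
import numpy as np

def laplace_count(records, predicate, epsilon, rng=None):
    """Release a differentially private count: true answer plus Laplace noise.

    A counting query changes by at most 1 when one person's record is added
    or removed, so Laplace noise with scale 1/epsilon suffices for epsilon-DP.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(1 for r in records if predicate(r))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: how many users in the table are over 65?
users = [{"age": 70}, {"age": 34}, {"age": 68}, {"age": 51}]
print(laplace_count(users, lambda r: r["age"] > 65, epsilon=1.0))
```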
To achieve widespread adoption, the privacy research community needs to move beyond developing bespoke algorithms for each particular question at hand. Instead, we seek to build tools and systems that can handle a broad class of queries specified in a high-level language, and that automatically introduce the randomness needed to provide a DP guarantee. That is, we can aspire to a DP-DBMS: a differentially private data management system. The first attempts in this direction were based on the simple “evaluate and add noise” recipe above, but they ran into trouble: for many queries, the magnitude of the noise required was so large that it overwhelmed the true answer.
“R2T: Instance-optimal Truncation for Differentially Private Query Evaluation with Foreign Keys” by Dong et al. tackles an important technical question when trying to apply DP in this systematic fashion: how to tame the amount of privacy noise needed to answer database queries. It starts from a simple observation: The scale of noise required is directly proportional to the amount that a query answer can change when one individual’s data is changed (the query’s sensitivity). This grows large in databases, where a single record pertaining to one person can link in turn to many other records spread throughout the data tables. But the impact can be kept in check by “truncating” the query output—clipping the contribution of each user to a limited range so that a user’s influence is effectively bounded. This sets up a trade-off: Truncating user contributions introduces bias into the query results, while relaxing the bound increases the variance of the added privacy noise.
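To make that trade-off concrete, the following hedged sketch (again with hypothetical names and a hand-picked threshold tau, not the paper’s algorithm) truncates a sum over per-user contributions: clipping at a small tau biases the answer downward, while a large tau forces proportionally larger noise.

```python
import numpy as np
from collections import defaultdict

def truncated_sum(rows, epsilon, tau, rng=None):
    """DP sum of row values where one user may own many rows.

    Assumes non-negative contributions. Each user's total is clipped to at
    most tau, so adding or removing one user changes the clipped sum by at
    most tau; Laplace noise with scale tau/epsilon then gives epsilon-DP.
    Small tau biases the answer downward; large tau inflates the variance.
    """
    rng = rng or np.random.default_rng()
    per_user = defaultdict(float)
    for user_id, value in rows:          # rows: (user_id, numeric contribution)
        per_user[user_id] += value
    clipped = sum(min(total, tau) for total in per_user.values())
    return clipped + rng.laplace(scale=tau / epsilon)

# One heavy user (u3) owns many rows; tau caps that user's influence.
rows = [("u1", 2.0), ("u2", 5.0)] + [("u3", 1.0)] * 100
print(truncated_sum(rows, epsilon=1.0, tau=10.0))
```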
The paper formalizes truncation as an optimization problem and presents the R2T algorithm as an efficient, iterative approach to solving it. It makes several other important contributions. First, it identifies the class of select-project-join-aggregate (SPJA) queries as the Goldilocks template: expressive enough to cover a wide range of database queries, yet not so complicated as to preclude a generic solution. It handles the case of self-joins, where simple approaches that assume independence fail. The optimization requires only the solution of a linear program, which can be computed readily. The findings are validated with a proof-of-concept prototype system, which is shown to outperform existing state-of-the-art DP systems in evaluations over a number of benchmarks.
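For intuition only, here is a heavily simplified sketch of the iterative pattern described above: evaluate the truncated query at geometrically increasing thresholds, add noise scaled to each threshold, subtract a penalty so that overly noisy estimates rarely win, and release the maximum. The function name, constants, budget split, and penalty term are assumptions made for illustration; the actual R2T algorithm obtains each truncated answer by solving a linear program and uses different accounting, so consult the paper for the real construction.

```python
import numpy as np

def r2t_style_release(truncated_query, global_sensitivity, epsilon, beta=0.1, rng=None):
    """Hedged sketch of the 'try many truncation thresholds' pattern.

    truncated_query(tau) is assumed to return the (non-private) query answer
    with per-user contributions truncated at tau. We evaluate geometrically
    increasing thresholds, add Laplace noise scaled to each threshold,
    subtract a penalty so over-noisy estimates are unlikely to dominate, and
    return the maximum. Constants and privacy accounting differ from R2T.
    """
    rng = rng or np.random.default_rng()
    num_thresholds = int(np.ceil(np.log2(global_sensitivity)))
    best = 0.0
    for j in range(1, num_thresholds + 1):
        tau = 2.0 ** j
        scale = num_thresholds * tau / epsilon           # split the budget across thresholds
        noisy = truncated_query(tau) + rng.laplace(scale=scale)
        penalty = scale * np.log(num_thresholds / beta)  # discourages overestimation
        best = max(best, noisy - penalty)
    return best

# Toy usage: 50 users, each with total contribution 8; truncation at tau caps each at min(8, tau).
contributions = [8.0] * 50
print(r2t_style_release(
    lambda tau: sum(min(c, tau) for c in contributions),
    global_sensitivity=1024, epsilon=1.0))
```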
The next steps for this line of work are to consider further families of queries over other canonical types of data. We can seek to build analogous systems for other structured and unstructured forms of data—movement patterns (trajectories), medical readings, and written text, for instance. The long-term goal may be to build systems that handle increasingly high-level queries over data while guaranteeing meaningful privacy protection and still delivering acceptable utility in the outputs. Automatically ensuring privacy for arbitrary computations may be a long way off, but this work is an important step toward that end.