Research and Advances
Computing Applications Digital government

Table Servers Protect Confidentiality in Tabular Data Releases

Federal statistical agencies must balance concern over confidentiality of data with their obligation to report information to the public. Advances in IT threaten privacy, but new technologies can also protect confidentiality while meeting user needs in innovative ways.
Posted
  1. Introduction
  2. System Design and Prototypes
  3. References
  4. Authors
  5. Figures

Here we describe table servers being developed by the National Institute of Statistical Sciences (NISS) that disseminate tabular summaries of statistical data in response to user queries for marginal sub-tables of a large (for example, 40 dimensions with 4 categories each) contingency table containing counts or sums. Table servers evaluate disclosure risk dynamically, in light of previously answered queries.

The query space Q, which contains all 2K sub-tables of a K-way table, is partially ordered by set inclusion of attributes in sub-tables. The set R(t) of all tables released through some time t contains direct releases in response to queries and indirect releases (previously unreleased children of direct releases); R(t) is specified by the released frontier RF(t) of its maximal elements.

Underlying dynamic release decisions is a risk criterion RC defined on subsets of Q: at all times the system must satisfy RC (R(t))a, where a is a risk threshold set by the operators. A typical risk criterion is accuracy of bounds based on R(t) for sensitive (small count) cells in the full table. Bounds can be computed using integer programming methods, but these techniques do not scale to the table sizes under consideration here. There are also exact techniques for special cases. For example, if the released sub-tables constitute the minimal sufficient statistics of a decomposable graphical model [4], then bounds can be expressed as explicit functions of these sub-tables [2].

Whenever an answered query releases previously unreleased information, other queries become unanswerable. Consequently, at t there is an unreleasable set U(t) of sub-tables whose release would be too risky, with an unreleasable frontier UF (t) of its minimal elements.

Release rules determine which requests for unreleased tables will be fulfilled. The simplest is the myopic rule of releasing T at t as long as RC(R(t) cup.gif T) ≤ a. To prevent the table server from taking excessively large steps, one can allow only tables adding but one attribute to a previously released table to be eligible for release. To prevent a single user (or a set of colluding users) from driving the table server into a region of Q that suits their needs but not those of other users, release rules can be biased against releases that add large numbers of tables to U(t). Rules can also incorporate the value of releasing T [3, 5].

Back to Top

System Design and Prototypes

A prototype table server, written as a Java application, is shown in the accompanying figure. Its principal strength is the engaging (but non-scalable) visualization of Q, which illustrates the underlying abstractions such as releasable and unreleasable frontiers.

A more powerful table server has been written using the Java 2 Enterprise Edition platform, with query processing performed by Java Servlets and risk computations by native executables. This prototype has been tested on a 14-dimensional, 300,000,000-cell, but extremely sparse, table derived from the Current Population Survey.


Table servers disseminate tabular summaries of statistical data in response to user queries for marginal sub-tables of a large contingency table containing counts or sums. Table servers evaluate disclosure risk dynamically, in light of previously answered queries.


If the requested table lies on or below RF (t), it is provided immediately, ordinarily via downloaded XML (HTML tables and text are alternatives). Releases are governed by the myopic and at most one step away from R(t) rules, and disclosure risk is evaluated in real time. The query history database, with tables for users, queries and the time trajectories of RF (t) and UF (t), is maintained in a MySQL database server. A frontier display facility monitors evolution of RF (t).

The system employs data structures based on hash tables for storing tables and algorithms that exploit sparsity and the fact that R(t) and U(t) are characterized completely by RF (t) and UF (t). The risk criterion is narrowness of cell bounds computed via a generalized shuttle algorithm [1].

Back to Top

Back to Top

Back to Top

Figures

UF1 Figure. Java table server prototype. The visualization of the query space Q shows direct releases (yellow), indirect releases (blue), unacceptably risky (red), and the potential effect (dark blue, magenta, and dark red) of releasing the 5-way table indicated by the cursor. The released (unreleasable) frontier lies at the top of the lower-left (bottom of the upper-right) portion of the visualization.

Back to top

    1. Buzzigoli, L. and Giusti, A. An algorithm to calculate the lower and upper bounds of the elements of an array given its marginals. In Proceedings of the Conference on Statistical Data Protection. (Luxembourg, Germany, 1999) Eurostat, 131–147.

    2. Dobra, A. and Fienberg, S.E. Bounds for cell entries in contingency tables given marginal totals and decomposable graphs. In Proceedings of the National Academy of Science 97, 22 (2000) 11885–11892.

    3. Duncan, G.T., Keller-McNulty, S.A., and Stokes, S.L. Disclosure risk vs. data utility: The R-U confidentiality map. Management Science.

    4. Lauritzen, S.L. Graphical Models. Clarendon Press, Oxford, UK, 1996.

    5. Trottini, M. A decision-theoretic approach to data disclosure problems. In 2nd Joint ECE/Eurostat Work Session on Statistical Data Confidentiality. (Luxembourg, Germany, 2001). Eurostat.

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More