Research Highlights

R2T: Instance-Optimal Truncation for Differentially Private Query Evaluation with Foreign Keys

In this paper, we propose the first differential privacy mechanism for answering arbitrary SPJA queries in a database with foreign-key constraints.


Abstract

Answering SPJA queries under differential privacy (DP), including graph-pattern counting under node-DP as an important special case, has received considerable attention in recent years. The dual challenge of foreign-key constraints and self-joins is particularly tricky to deal with, and no existing DP mechanisms can correctly handle both. For the special case of graph pattern counting under node-DP, the existing mechanisms are correct (that is, satisfy DP), but they either do not offer nontrivial utility guarantees or are very complicated and costly. In this paper, we propose the first DP mechanism for answering arbitrary SPJA queries in a database with foreign-key constraints. Meanwhile, it achieves a fairly strong notion of optimality, which can be considered a small and natural relaxation of instance optimality. Finally, our mechanism is simple enough that it can be easily implemented on top of any RDBMS and an LP solver. Experimental results show that it offers order-of-magnitude improvements in terms of utility over existing techniques, even those specifically designed for graph pattern counting.

1. Introduction

Differential privacy (DP) has become the standard notion for private data release, due to its strong protection of individual information. Informally speaking, DP requires the query results to be indistinguishable whether or not any particular individual's data is in the database. The standard Laplace mechanism first finds GS_Q, the global sensitivity of the query, that is, how much the query result may change if an individual's data is added to or removed from the database. Then it adds Laplace noise calibrated to this sensitivity to the query result to mask the difference. However, this mechanism runs into issues in a relational database, as illustrated in the following example.
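
As a minimal illustration (ours, not the paper's code; all names are hypothetical), the following Python sketch shows the standard Laplace mechanism: the released answer is the true answer plus Laplace noise whose scale is the global sensitivity divided by the privacy parameter ε.

    import numpy as np

    def laplace_mechanism(true_answer, global_sensitivity, epsilon, rng=np.random.default_rng()):
        """Release true_answer under epsilon-DP by adding Laplace noise
        of scale global_sensitivity / epsilon."""
        return true_answer + rng.laplace(scale=global_sensitivity / epsilon)

    # A counting query over independent individuals changes by at most 1
    # per individual, so global_sensitivity = 1 suffices.
    print(laplace_mechanism(true_answer=10_000, global_sensitivity=1, epsilon=1.0))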

Example 1.1.

Consider a simple join-counting query
$$Q := |R_1(\underline{x_1}, \ldots) \bowtie R_2(x_1, x_2, \ldots)|.$$

Here, the underlined attribute x_1 is the primary key (PK), while R_2.x_1 is a foreign key (FK) referencing R_1.x_1. For instance, R_1 may store customer information where x_1 is the customer ID and R_2 stores the orders the customers have placed. Then this query simply returns the total number of orders; more meaningful queries could be formed with some predicates, for example, all customers from a certain region and/or orders in a certain category. Furthermore, suppose the customers, namely, the tuples in R_1, are the entities whose privacy we aim to protect.

What is the GS_Q for this query? It is, unfortunately, unbounded. This is because a customer, theoretically, could have an unbounded number of orders, and adding such a customer to the database can cause an unbounded change in the query result. A simple fix is to assume a finite GS_Q, which can be justified in practice because we may never have a customer with, say, more than a million orders. However, as assuming such a GS_Q limits the allowable database instances, one tends to be conservative and set a large GS_Q. This allows the Laplace mechanism to work, but adding noise of this scale clearly eliminates any utility of the released query answer.

1.1 The truncation mechanism.

The issue above was first identified by Kotsogiannis et al.,21 who also formalized the DP policy for relational databases with foreign key (FK) constraints. The essence of their model (a rigorous definition is given in Section 2) is that individuals and their private data are stored in separate relations linked by FKs. This is perhaps the most crucial feature of the relational model, yet it causes a major difficulty in designing DP mechanisms as illustrated above. Their solution is the truncation mechanism, which simply deletes all customers with more than τ orders before applying the Laplace mechanism, for some threshold τ. After truncation, the query has sensitivity τ, so adding a noise of scale τ is sufficient.
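
A minimal sketch of this truncation idea for the query in Example 1.1, assuming the orders are represented simply as a list of customer IDs (the data layout and names are ours, not those of any cited system):

    import numpy as np
    from collections import Counter

    def truncated_order_count(customer_of_order, tau, epsilon, rng=np.random.default_rng()):
        """Naive truncation: drop every customer with more than tau orders,
        then release the remaining order count with Lap(tau/epsilon) noise."""
        per_customer = Counter(customer_of_order)
        kept = sum(c for c in per_customer.values() if c <= tau)
        return kept + rng.laplace(scale=tau / epsilon)

With τ near a typical customer's order count, both the bias (the dropped orders) and the noise (scale τ/ε) stay small; setting τ = GS_Q recovers the naive Laplace mechanism.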

Truncation is a special case of Lipschitz extensions and has been studied extensively for graph pattern-counting queries20 and machine learning (ML).1 A well-known issue for the truncation mechanism is the bias-variance trade-off: In one extreme, τ = GS_Q, it degenerates into the naive Laplace mechanism with a large noise (that is, large variance). In the other extreme, τ = 0, the truncation introduces a bias as large as the query answer. The issue of how to choose a near-optimal τ has been extensively studied in the statistics and ML communities.2,17 In fact, the particular query in Example 1.1 is equivalent to the 1-dimensional mean (sum) estimation problem, which is important for many ML tasks. A key challenge there is that the selection of τ must also be done in a DP manner.

1.2 The issue with self-joins.

While self-join-free queries are equivalent to mean (sum) estimation (see Section 3 for a more formal statement), self-joins introduce another challenge unique to relational queries. In particular, all techniques from the statistics and machine-learning literature for choosing a τ critically rely on the fact that the individuals are independent, that is, adding/removing one individual does not affect the data associated with another, which is not true when the query involves self-joins. In fact, when there are self-joins, even the truncation mechanism itself fails, as illustrated in the example below.

Example 1.2.

Suppose we extend the query from Example 1.1 to the following one with a self-join:
$$Q := |R_1(\underline{x_1}, \ldots) \bowtie R_1(\underline{y_1}, \ldots) \bowtie R_2(x_1, y_1, x_2, \ldots)|.$$

Note that the PK of R_1 has been renamed differently in its two logical copies, so that they join with different attributes of R_2. For instance, R_2 may store the transactions between pairs of customers, and this query would count the total number of transactions. Again, predicates can be added to make the query more meaningful.

Let G be an undirected τ-regular graph (that is, every vertex has degree τ) with n vertices. We will construct an instance I = (I_1, I_2) on which the truncation mechanism fails. Let I_1 be the vertices of G and let I_2 be its edges (each edge appears twice since G is undirected). Thus, Q simply returns twice the number of edges in the graph. Let I′ be the neighboring instance corresponding to the graph G′ obtained from G by adding a vertex v that connects to every existing vertex. Note that in G′, v has degree n while every other vertex has degree τ+1. Now truncating by τ fails DP: The query answer on I is nτ, while that on I′ is 0 (all vertices are truncated). Adding noise of scale τ cannot mask this gap, violating the DP definition.
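
To make the counterexample concrete, the sketch below (our own illustration) builds a 2-regular graph (a cycle), applies naive truncation with τ = 2 to it and to its neighbor obtained by adding one vertex connected to everything, and shows that the two truncated answers differ by far more than τ:

    from collections import Counter

    def truncated_answer(edges, tau):
        """Naive truncation for the self-join query of Example 1.2: drop every
        vertex whose degree exceeds tau, then count the surviving edges twice
        (the query counts each undirected edge once per direction)."""
        deg = Counter()
        for u, v in edges:
            deg[u] += 1
            deg[v] += 1
        return sum(2 for u, v in edges if deg[u] <= tau and deg[v] <= tau)

    n, tau = 1000, 2
    cycle = [(i, (i + 1) % n) for i in range(n)]      # 2-regular graph G
    plus_v = cycle + [(n, i) for i in range(n)]       # add vertex v joined to every vertex of G
    print(truncated_answer(cycle, tau))    # 2000 = n * tau
    print(truncated_answer(plus_v, tau))   # 0: every vertex now has degree > tau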

The reason why the truncation mechanism fails is that the claim above, that adding or removing one individual does not affect the data associated with another, does not hold in the presence of self-joins. More fundamentally, this is due to the correlation among the individuals introduced by self-joins. In the example above, we see that the addition of one node may cause the degrees of many others to increase. For the problem of graph pattern counting under node-DP, which can be formulated as a multi-way self-join counting query on the special schema R = {Node(ID), Edge(src, dst)}, Kasiviswanathan et al.20 propose an LP-based truncation mechanism (to differentiate, we call the mechanism above naive truncation) to fix the issue, but they do not study how to choose τ. As a result, while their mechanism satisfies DP, there is no optimality guarantee in terms of utility. In fact, if τ is chosen inappropriately, their error can be even larger than GS_Q, namely, worse than the naive Laplace mechanism.

1.3 Our contributions.

In this paper, we start by studying how to choose a near-optimal τ in a DP manner in the presence of self-joins. As with all prior τ-selection mechanisms over mean (sum) estimation2,17 and self-join-free queries,24 we assume that the global sensitivity of the given query Q is bounded by GSQ. Since one tends to set a large GSQ as argued in Example 1.1, we must try to minimize the dependency on GSQ.

The first contribution of this paper (Section 4) is a simple and general DP mechanism, called Race-to-the-Top (R2T), which can be used to adaptively choose τ in combination with any valid DP truncation mechanism that satisfies certain properties. In fact, it does not choose τ per se; instead, it directly returns a privatized query answer with error at most O(log(GS_Q) · log log(GS_Q) · DS_Q(I)) for any instance I, with constant probability. While we defer the formal definition of DS_Q(I) to Section 3, what we can show is that it is a per-instance lower bound, that is, any valid DP mechanism has to incur error Ω(DS_Q(I)) on I (in a certain sense). Thus, the error of R2T is instance-optimal up to logarithmic factors in GS_Q. Furthermore, a logarithmic dependency on GS_Q is also unavoidable,19 even for the mean estimation problem, that is, the simple self-join-free query in Example 1.1. In practice, these log factors are usually between 10 and 100, and our experiments show that R2T has better utility than previous methods in most cases.

However, as we saw in Example 1.2, naive truncation is not a valid DP mechanism in the presence of self-joins. As our second contribution (Section 5), we extend the LP-based mechanism of Kasiviswanathan et al.,20 which only works for graph pattern-counting queries, to general queries on an arbitrary relational schema that use the four basic relational operators: Selection (with arbitrary predicates), Projection, Join (including self-join), and sum Aggregation. When plugged into R2T, this yields the first DP mechanism for answering arbitrary SPJA queries in a database with FK constraints. For SJA queries, the utility is instance-optimal, while the optimality guarantee for SPJA queries is slightly weaker, but we argue that this is unavoidable.

Furthermore, the simplicity of our mechanism allows it to be built on top of any RDBMS and an LP solver. To demonstrate its practicality, we built a system prototype (Section 7) using PostgreSQL and CPLEX. Experimental results (Section 8) show that it provides order-of-magnitude improvements in terms of utility over the state-of-the-art DP-SQL engines. We obtain similar improvements even over node-DP mechanisms specifically designed for graph pattern-counting problems, which are just special SJA queries.

2. Preliminaries

2.1 Database queries.

Let R be a database schema. We start with a multi-way join:

$$J := R_1(\mathbf{x}_1) \bowtie \cdots \bowtie R_n(\mathbf{x}_n),$$

where R_1, …, R_n are relation names in R and each x_i is a set of arity(R_i) variables. When considering self-joins, there can be repeats, that is, R_i = R_j; in this case, we must have x_i ≠ x_j, or one of the two atoms would be redundant. Let var(J) := x_1 ∪ ⋯ ∪ x_n.

Let I be a database instance over R. For any RR, denote the corresponding relation instance in I as I(R). This is a physical relation instance of R. We use I(R,x) to denote I(R) after renaming its attributes to x, which is also called a logical relation instance of R. When there are self-joins, one physical relation instance may have multiple logical relation instances; they have the same rows but with different column (variable) names.

A JA or an SJA query Q aggregates over the join results J(I). More abstractly, let ψ : dom(var(J)) → ℕ be a function that assigns non-negative integer weights to the join results, where dom(var(J)) denotes the domain of var(J). The result of evaluating Q on I is

$$Q(I) := \sum_{q \in J(I)} \psi(q).$$

Note that the function ψ only depends on the query. For a counting query, ψ(·) ≡ 1; for an aggregation query, for example, SUM(A*B), ψ(q) is the value of A*B for q. An SJA query with an arbitrary predicate over var(J) can be easily incorporated into this formulation: If some q ∈ J(I) does not satisfy the predicate, we simply set ψ(q) = 0.

Example 2.1.

Graph pattern-counting queries can be formulated as SJA queries. Suppose we store a graph in a relational database by the schema R = {Edge(src, dst), Node(ID)}, where src and dst are FKs referencing ID. Then the number of length-3 paths can be counted by first computing the join
$$\text{Edge}(A, B) \bowtie \text{Edge}(B, C) \bowtie \text{Edge}(C, D),$$

followed by a count aggregation. Note that this also counts triangles and non-simple paths (for example, x → y → x → z), which may or may not be considered as length-3 paths depending on the application. If not, they can be excluded by introducing a predicate (that is, redefining ψ) A ≠ C ∧ A ≠ D ∧ B ≠ D. If the graph is undirected, then the query counts every path twice, so we should divide the answer by 2. Alternatively, we may introduce the predicate A < D to eliminate the double counting.
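
As a concrete (and purely illustrative) rendering of this formulation, the sketch below materializes the three-way self-join over an edge list and folds the predicates into ψ; it assumes an undirected graph stored with both directions of each edge:

    from collections import defaultdict

    def count_length3_paths(edges, simple=True, undirected=True):
        """Evaluate the SJA query Edge(A,B) joined with Edge(B,C) and Edge(C,D),
        with psi(q) = 1, redefining psi to 0 for join results failing the predicates."""
        out = defaultdict(list)
        for a, b in edges:
            out[a].append(b)
        total = 0
        for a in list(out):
            for b in out[a]:
                for c in out.get(b, []):
                    for d in out.get(c, []):
                        psi = 1
                        if simple and (a == c or a == d or b == d):
                            psi = 0          # exclude triangles and non-simple paths
                        if undirected and not a < d:
                            psi = 0          # predicate A < D: count each path once
                        total += psi
        return total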

2.2 DP in relational databases with FK constraints.

We adopt the DP policy of Kotsogiannis et al.,21 which defines neighboring instances by taking FK constraints into consideration. We model all the FK relationships as a directed acyclic graph (DAG) over R by adding a directed edge from R to R′ if R has an FK referencing the PK of R′. There is a designated primary private relation R_P, and any relation that has a direct or indirect FK referencing R_P is called a secondary private relation. The referencing relationship over the tuples is defined recursively as follows: (1) any tuple t_P ∈ I(R_P) is said to reference itself; (2) for t_P ∈ I(R_P), t ∈ I(R), and t′ ∈ I(R′), if t′ references t_P, R has an FK referencing the PK of R′, and the FK of t equals the PK of t′, then we say that t references t_P. Then two instances I and I′ are considered neighbors if I′ can be obtained from I by deleting a set of tuples, all of which reference the same tuple t_P ∈ I(R_P), or vice versa. In particular, t_P itself may also be deleted, in which case all tuples referencing t_P must be deleted in order to preserve the FK constraints. Finally, for a join result q ∈ J(I), we say that q references t_P ∈ I(R_P) if |t_P ⋈ q| = 1.

We use the notation I ∼ I′ to denote two neighboring instances, and I ∼_{t_P} I′ denotes that all tuples in the difference between I and I′ reference the tuple t_P ∈ I(R_P).

Example 2.2.

Consider the TPC-H schema:

$$\mathbf{R} = \{\text{Nation}(\underline{NK}),\ \text{Customer}(\underline{CK}, NK),\ \text{Order}(\underline{OK}, CK),\ \text{Lineitem}(OK)\}.$$

If the customers are the individuals whose privacy we wish to protect, then we designate Customer as the primary private relation, which implies that Order and Lineitem will be secondary private relations, while Nation will be public. Note that once Customer is designated as a primary private relation, the information in Order and Lineitem is also protected since the privacy induced by Customer is stronger than that induced by Order and Lineitem. Alternatively, one may designate Order as the primary private relation, which implies that Lineitem will be a secondary private relation, while Customer and Nation will be public. This would result in weaker privacy protection but offer higher utility.

Some queries, as given, may be incomplete, that is, they have a variable that is an FK but whose referenced PK does not appear in the query Q. The query in Example 2.1 is such an example. Following Kotsogiannis et al.,21 we always make the query complete by iteratively adding to Q those relations whose PKs are referenced. The PKs will be given variable names matching the FKs. For example, for the query in Example 2.1, we add Node(A), Node(B), Node(C), and Node(D).

The DP policy above incorporates both edge-DP and node-DP, two commonly used DP policies for private graph analysis, as special cases. In Example 2.1, by designating Edge as the private relation (Node is thus public, and we may even assume it contains all possible vertex IDs), we obtain edge-DP; for node-DP, we add FK constraints from src and dst to ID, and designate Node as the primary private relation, while Edge becomes a secondary private relation.

A mechanism M is ε-DP if for any neighboring instances I ∼ I′ and any output y, we have

$$\Pr[M(I) = y] \le e^{\varepsilon} \Pr[M(I') = y].$$

Typical values of ε used in practice range from 0.1 to 10, where a smaller value corresponds to stronger privacy protection.

3. Instance Optimality of DP Mechanisms with FK Constraints

Global sensitivity and worst-case optimality.  The standard DP mechanism is the Laplace mechanism,15 which adds Lap(GS_Q/ε) to the query answer. Here, Lap(b) denotes a random variable drawn from the Laplace distribution with scale b, and GS_Q = max_{I ∼ I′} |Q(I) − Q(I′)| is the global sensitivity of Q. However, either a join or a sum aggregation makes GS_Q unbounded. The issue with the former is illustrated in Example 1.1, where a customer may have unboundedly many orders; a sum aggregation with an unbounded ψ results in the same situation. Thus, as in prior work,2,17,24 we restrict to a set of instances I such that

$$\max_{I, I' \in \mathcal{I},\ I \sim I'} |Q(I) - Q(I')| = \mathrm{GS}_Q, \qquad (1)$$

where GSQ is a parameter given in advance. For the query in Example 1.1, this is equivalent to assuming that a customer is allowed to have at most GSQ orders in any instance.

For general queries, the situation is more complicated. We first consider SJA queries. Given an instance I and an SJA query Q, the sensitivity of a tuple t_P ∈ I(R_P) is

$$S_Q(I, t_P) := \sum_{q \in J(I)} \psi(q) \cdot \mathbb{1}(q \text{ references } t_P),$$

where 𝟙(·) is the indicator function. For SJA queries, (1) is equivalent to

$$\max_{I \in \mathcal{I}}\ \max_{t_P \in I(R_P)} S_Q(I, t_P) = \mathrm{GS}_Q.$$

For self-join-free SJA queries, it is clear that

$$Q(I) = \sum_{t_P \in I(R_P)} S_Q(I, t_P),$$

which turns the problem into a sum estimation problem. However, when self-joins are present, this equality no longer holds since one join result q references multiple tP’s. This also implies that removing one tuple from I(RP) may affect multiple SQ(I,tP)’s, making the neighboring relationship more complicated than in the sum estimation problem, where two neighboring instances differ by only one datum.2,17
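
For intuition, here is a tiny sketch (ours, with made-up data) of the per-tuple sensitivities for the self-join-free query of Example 1.1 and of the decomposition above:

    from collections import Counter

    # Each order references its customer, the primary private tuple.
    orders = ["alice", "alice", "bob", "carol", "carol", "carol"]

    S = Counter(orders)              # S_Q(I, t_P): number of orders referencing each customer
    Q = len(orders)                  # query answer: total number of orders
    assert Q == sum(S.values())      # self-join-free: Q(I) = sum of per-tuple sensitivities
    print(max(S.values()))           # largest per-tuple sensitivity (3, from carol)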

What notion of optimality shall we use for DP mechanisms over SJA queries? The traditional worst-case optimality is meaningless, since the naive Laplace mechanism that adds noise of scale GSQ is already worst-case optimal, just by the definition of GSQ. In fact, the basis of the entire line of work on the truncation mechanism and smooth sensitivity is the observation that typical instances should be much easier than the worst case, so these mechanisms all add instance-specific noises, which are often much smaller than the worst-case noise level GSQ.

Instance optimality.  The standard notion of optimality for measuring the performance of an algorithm on a per-instance basis is instance optimality. More precisely, let 𝓜 be the class of DP mechanisms and let

$$L_{\mathrm{ins}}(I) := \min_{M' \in \mathcal{M}} \min\{\xi : \Pr[|M'(I) - Q(I)| \le \xi] \ge 2/3\}$$

be the lower bound any M′ ∈ 𝓜 can achieve (with probability 2/3) on I. Then the standard definition of instance optimality requires us to design an M such that

$$\Pr[|M(I) - Q(I)| \le c \cdot L_{\mathrm{ins}}(I)] \ge 2/3$$

for every I, where c is called the optimality ratio. Unfortunately, for any I, one can design a trivial M′(·) ≡ Q(I) that has 0 error on I (but fails miserably on other instances), so L_ins(·) ≡ 0, which rules out instance-optimal DP mechanisms by a standard argument.15

To avoid such a trivial M′, prior work3,12 considers a relaxed version of instance optimality where we compare M against any M′ that is required to work well not just on I, but also on its neighbors; that is, we raise the target error from L_ins(I) to

$$L_{\mathrm{nbr}}(I) := \min_{M' \in \mathcal{M}} \max_{I' : I' \sim I} \min\{\xi : \Pr[|M'(I') - Q(I')| \le \xi] \ge 2/3\}.$$

Vadhan25 observes that L_nbr(I) ≥ LS_Q(I)/2, where

$$\mathrm{LS}_Q(I) := \max_{I' : I' \sim I} |Q(I) - Q(I')|$$

is the local sensitivity of Q at I. This relaxed instance optimality has been used for certain ML problems3 and conjunctive queries without FKs.12 However, it has an issue for SJA queries in a database with FK constraints: For any I, we can add a tuple t_P to I(R_P), together with tuples in the secondary private relations all referencing t_P, obtaining an I′ such that S_Q(I′, t_P) = GS_Q; that is, LS_Q(·) ≡ GS_Q. This means that this relaxed instance optimality degenerates into worst-case optimality. This is also why smooth sensitivity, including all its efficiently computable versions,11,12,18,23 will not have better utility than the naive Laplace mechanism on databases with FK constraints, since these quantities are all no lower than the local sensitivity.

The reason why the above relaxation is "too much" is that we require M′ to work well on any neighbor I′ of I. Under the neighborhood definition with FK constraints, this means that I′ can be any instance obtained from I by adding a tuple t_P and arbitrary tuples referencing t_P in the secondary private relations. This is too high a requirement for M′, hence too low an optimality notion for M.

To address the issue, Huang et al.17 restrict the neighborhood in which M′ is required to work well, but their definition only works for the mean estimation problem. For SJA queries under FK constraints, we revise L_nbr(·) to

$$L_{\text{d-nbr}}(I) := \min_{M' \in \mathcal{M}} \max_{I' : I' \subseteq I,\ I' \sim I} \min\{\xi : \Pr[|M'(I') - Q(I')| \le \xi] \ge 2/3\},$$

namely, we require M′ to work well only on I and its down-neighbors, which can be obtained only by removing a tuple t_P already in I(R_P) together with all tuples referencing t_P. Correspondingly, an instance-optimal M (with regard to the down-neighborhood) is one for which the instance-optimality condition above holds with L_ins replaced by L_d-nbr.

Clearly, the smaller the neighborhood, the stronger the optimality notion. Our instance optimality notion is thus stronger than those in prior work.3,12,17 Note that for such an instance-optimal M (by our definition), there may still exist I and M′ such that M′ does better on I than M, but if this happens, M′ must do worse on one of the down-neighbors of I, which is as typical as I itself.

Using the same argument as Vadhan,25 we have L_d-nbr(I) ≥ DS_Q(I)/2, where

$$\mathrm{DS}_Q(I) := \max_{I' : I' \subseteq I,\ I' \sim I} |Q(I) - Q(I')| = \max_{t_P \in I(R_P)} S_Q(I, t_P)$$

is the downward local sensitivity of Q at I. Thus, DS_Q(I) is a per-instance lower bound, which can replace L_ins(I) in the definition of instance-optimal DP mechanisms.

4. R2T: Instance-Optimal Truncation

Our instance-optimal truncation mechanism, Race-to-the-Top (R2T), can be used in combination with any truncation method Q(I, τ), which is a function Q : I × ℕ → ℕ with the following properties:

  1. For any τ, the global sensitivity of Q(·, τ) is at most τ.

  2. For any τ, Q(I, τ) ≤ Q(I).

  3. For any I, there exists a non-negative integer τ*(I) ≤ GS_Q such that for any τ ≥ τ*(I), Q(I, τ) = Q(I).

We describe various choices for Q(I, τ), depending on the DP policy and on whether the query contains self-joins and/or projections, in the subsequent sections. Intuitively, such a Q(I, τ) gives a stable (property (1)) underestimate (property (2)) of Q(I), while reaching Q(I) for a sufficiently large τ (property (3)). Note that Q(I, τ) itself is not DP. To make it DP, we can add Lap(τ/ε) noise, which turns it into an ε-DP mechanism by property (1). The issue, of course, is how to set τ. The basic idea of R2T is to try geometrically increasing values of τ and somehow pick the "winner" of the race.

Assuming such a Q(I, τ), R2T works as follows. For a failure probability β, we first compute

$$\tilde{Q}(I, \tau^{(j)}) := Q(I, \tau^{(j)}) + \mathrm{Lap}\!\left(\frac{\log(\mathrm{GS}_Q)\,\tau^{(j)}}{\varepsilon}\right) - \frac{\log(\mathrm{GS}_Q)\,\ln\!\frac{\log(\mathrm{GS}_Q)}{\beta}}{\varepsilon} \cdot \tau^{(j)} \qquad (7)$$

for τ^(j) = 2^j, j = 1, …, log(GS_Q). Then R2T outputs

$$\tilde{Q}(I) := \max\left\{\max_j\, \tilde{Q}(I, \tau^{(j)}),\ Q(I, 0)\right\}.$$

The privacy of R2T is straightforward: Since Q(I, τ^(j)) has global sensitivity at most τ^(j), and the third term of (7) is independent of I, each Q~(I, τ^(j)) satisfies ε/log(GS_Q)-DP by the standard Laplace mechanism. Collectively, all the Q~(I, τ^(j))'s satisfy ε-DP by the basic composition theorem.15 Finally, returning the maximum preserves DP by the post-processing property of DP.
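
A minimal sketch of R2T as just described, assuming a black-box truncated query trunc_q(I, τ) with the three properties above (the function and variable names are ours):

    import math
    import numpy as np

    def r2t(instance, trunc_q, gs_q, epsilon, beta, rng=np.random.default_rng()):
        """Race-to-the-Top: run the truncated query at geometrically increasing tau,
        privatize each run, shift it down by its noise scale, and return the maximum."""
        L = int(math.log2(gs_q))                 # gs_q is assumed to be a power of 2
        best = trunc_q(instance, 0)              # Q(I, 0); equals 0 for the truncations in this paper
        for j in range(1, L + 1):
            tau = 2 ** j
            noisy = (trunc_q(instance, tau)
                     + rng.laplace(scale=L * tau / epsilon)        # Lap(log(GS_Q) * tau / eps)
                     - L * math.log(L / beta) * tau / epsilon)     # downward shift of (7)
            best = max(best, noisy)
        return best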

Utility analysis.  For some intuition on why R2T offers good utility, see Figure 1. By properties (2) and (3), as we increase τ, Q(I, τ) gradually approaches the true answer Q(I) from below and reaches Q(I, τ) = Q(I) when τ ≥ τ*(I). However, we cannot use Q(I, τ) or τ*(I) directly as this would violate DP. Instead, we only get to see Q~(I, τ), which is masked with noise of scale proportional to τ. We thus face a dilemma: The closer we get to Q(I), the more uncertain we are about the estimate Q~(I, τ). To get out of the dilemma, we shift Q(I, τ) down by an amount that equals the scale of the noise (ignoring the log log factor). This penalty for Q~(I, τ̂), where τ̂ is the smallest power of 2 above τ*(I), will be on the same order as τ*(I), so it does not affect the error by more than a constant factor, while taking the maximum ensures that the winner is at least as good as Q~(I, τ̂). Meanwhile, the extra log log factor ensures that no Q~(I, τ) overshoots the target. Below, we formalize this intuition.

Figure 1.  An illustration of R2T.
Theorem 1.

On any instance I, with probability at least 1 − β, we have
$$Q(I) - \frac{4 \log(\mathrm{GS}_Q)\,\ln\!\frac{\log(\mathrm{GS}_Q)}{\beta}}{\varepsilon} \cdot \tau^*(I)\ \le\ \tilde{Q}(I)\ \le\ Q(I).$$

5. Truncation for SJA Queries

In this section, we design a Q(I, τ) with τ*(I) = DS_Q(I) for SJA queries. Plugging this into Theorem 1 with β = 1/3, together with the definition of instance optimality, turns R2T into an instance-optimal DP mechanism with an optimality ratio of O(log(GS_Q) log log(GS_Q)/ε).

For self-join-free SJA queries, each join result q ∈ J(I) references only one tuple in R_P. Thus, the tuples in R_P are independent, that is, removing one does not affect the sensitivities of others. This means that naive truncation (that is, removing all tuples t_P with S_Q(I, t_P) > τ and then summing up the rest) is a valid Q(I, τ) that satisfies the three properties required by R2T with τ*(I) = DS_Q(I).

When there are self-joins, naive truncation does not satisfy property (1), as illustrated in Example 1.2, where all SQ(I,tP)’s in two neighboring instances may differ. Below we generalize the LP-based mechanism for graph pattern counting20 to arbitrary SJA queries, and show that it satisfies the three properties with τ*(I)=DSQ(I).

Given an SJA query Q and an instance I, recall that Q(I) = Σ_{q ∈ J(I)} ψ(q), where J(I) is the set of join results. For k ∈ [|J(I)|], let q_k(I) be the kth join result. For each j ∈ [|I(R_P)|], let t_j(I) be the jth tuple in I(R_P). We use C_j(I) to denote (the indices of) the set of join results that reference t_j(I). More precisely,

$$C_j(I) := \{k : q_k(I) \text{ references } t_j(I)\}.$$

For each k ∈ [|J(I)|], we introduce a variable u_k representing the weight assigned to the join result q_k(I), and return the optimal objective value of the following LP as Q(I, τ):

$$\begin{aligned} \text{maximize} \quad & Q(I, \tau) = \sum_{k \in [|J(I)|]} u_k, \\ \text{subject to} \quad & \sum_{k \in C_j(I)} u_k \le \tau, \quad j \in [|I(R_P)|], \\ & 0 \le u_k \le \psi(q_k(I)), \quad k \in [|J(I)|]. \end{aligned}$$
Lemma 1.

For SJA queries, the Q(I,τ) defined above satisfies the three properties required by R2T with τ*(I)=DSQ(I).
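
A sketch of this LP using scipy.optimize.linprog as the solver (our own illustration; the prototype in Section 7 uses CPLEX). The inputs are the weights ψ(q_k) of the join results and, for each primary private tuple t_j, the index set C_j of join results referencing it:

    import numpy as np
    from scipy.optimize import linprog

    def lp_truncated_query(psi, C, tau):
        """Q(I, tau): maximize sum_k u_k subject to sum_{k in C_j} u_k <= tau
        for every j, and 0 <= u_k <= psi[k]."""
        m = len(psi)
        A_ub = np.zeros((len(C), m))
        for j, refs in enumerate(C):
            A_ub[j, list(refs)] = 1.0
        res = linprog(c=-np.ones(m),                       # linprog minimizes, so negate
                      A_ub=A_ub, b_ub=np.full(len(C), float(tau)),
                      bounds=[(0.0, float(w)) for w in psi],
                      method="highs")
        return -res.fun

    # A 4-clique (cf. Example 5.1 below): 6 edges as join results, each vertex touches 3 of them.
    psi = [1] * 6
    C = [{0, 1, 2}, {0, 3, 4}, {1, 3, 5}, {2, 4, 5}]
    print(lp_truncated_query(psi, C, tau=2))   # 4.0, i.e., weight 2/3 on each edge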

Example 5.1.

We now give a step-by-step example to show how this truncation method works together with R2T. Consider the problem of edge counting under node-DP, which corresponds to the SJA query
$$Q := |\sigma_{ID_1 < ID_2}(\text{Node}(ID_1) \bowtie \text{Node}(ID_2) \bowtie \text{Edge}(ID_1, ID_2))|$$

on the graph data schema introduced in Example 2.1. Note that in SQL, the query would be written as

    SELECT count(*)
    FROM Node AS Node1, Node AS Node2, Edge
    WHERE Edge.src = Node1.ID AND Edge.dst = Node2.ID AND Node1.ID < Node2.ID

Suppose we set GS_Q = 2^10 = 1,024. For this particular Q, this means the maximum degree of any node in any allowable instance is 1,024. We set β = 0.1 and ε = 1.

Now, suppose we are given an I containing 8,103 nodes, which form 1,000 triangles, 1,000 4-cliques, 100 8-stars, 10 16-stars, and one 32-star, as shown in Figure 2. The true query result is

Q(I) = 3×1,000 + 6×1,000 + 8×100 + 16×10 + 32 = 9,992.

We run R2T with τ^(j) = 2^j for j = 1, …, 10. For each τ = τ^(j), we assign a weight u_k ∈ [0, 1] to each join result (that is, an edge) that satisfies the predicate ID_1 < ID_2. To calculate Q(I, τ), we can consider the LP on each clique/star separately. For a triangle, the optimal LP solution always assigns u_k = 1 to each edge. For each 4-clique, it assigns 2/3 to each edge for τ = 2, and 1 for τ ≥ 4. For each k-star, the optimal LP value is min{k, τ}. Thus, the optimal LP values are

Q(I, 2) = 1×3,000 + (2/3)×6,000 + 2×100 + 2×10 + 2×1 = 7,222,
Q(I, 4) = 1×3,000 + 1×6,000 + 4×100 + 4×10 + 4×1 = 9,444,
Q(I, 8) = 1×3,000 + 1×6,000 + 8×100 + 8×10 + 8×1 = 9,888,
Q(I, 16) = 1×3,000 + 1×6,000 + 8×100 + 16×10 + 16×1 = 9,976.

In addition, we have Q(I, 0) = 0 and Q(I, τ) = 9,992 for τ ≥ 32.

Then, let us see how to run R2T with these Q(I, τ)'s. Recall that ε = 1, β = 0.1, and GS_Q = 2^10. For convenience, assume Lap(1) returns −1 and +1 by turns. Plugging these into (7), we have

Q~(I, 2) = 7,222 + (−1)·20 − 92.1 ≈ 7,110,
Q~(I, 4) = 9,444 + 1·40 − 184 ≈ 9,300,
Q~(I, 8) = 9,888 + (−1)·80 − 368 ≈ 9,440,
Q~(I, 16) = 9,976 + 1·160 − 737 ≈ 9,399,
Q~(I, 32) = 9,992 + (−1)·320 − 1,474 ≈ 8,198,
Q~(I, 64) = 9,992 + 1·640 − 2,947 ≈ 7,685,
...

Finally, taking the maximum over all these values and Q(I, 0), we have Q~(I) = Q~(I, 8) = 9,440.

Figure 2.  Example of edge counting.
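
The R2T arithmetic in Example 5.1 can be replayed mechanically; the sketch below (ours) uses the same deterministic ±1 stand-in for Lap(1):

    import math

    eps, beta, log_gs = 1.0, 0.1, 10            # GS_Q = 2^10
    Q_tau = {2: 7222, 4: 9444, 8: 9888, 16: 9976, 32: 9992, 64: 9992}

    sign = -1                                    # Lap(1) "returns" -1, +1, -1, ... by turns
    best = 0                                     # Q(I, 0) = 0
    for tau, q in Q_tau.items():
        noisy = q + sign * log_gs * tau / eps - log_gs * math.log(log_gs / beta) * tau / eps
        best = max(best, noisy)
        sign = -sign
    print(round(best))                           # 9440, attained at tau = 8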

6. Truncation for SPJA Queries

A projection reduces the query answer, hence its sensitivity, so it requires less noise. However, it makes achieving instance optimality harder: Even for the simple query |π_{x_2}(R_1(x_1) ⋈ R_2(x_1, x_2))|, it is impossible to achieve error f · DS_Q(I) on every instance I, for any factor f. To address this issue, we propose a truncation for SPJA queries whose error depends on another instance-specific notion. Please see the full version of the paper for more details.

7. System Implementation

Based on the R2T algorithm, we have implemented a system on top of PostgreSQL and CPLEX. The system structure is shown in Figure 3. The input to our system is any SPJA query written in SQL, together with a designated primary private relation RP (interestingly, while R2T satisfies the DP policy with FK constraints, the algorithm itself does not need to know the PK-FK constraints).

Figure 3.  System structure.

The system supports SUM and COUNT aggregation. Our SQL parser first unpacks the aggregation into a reporting query so as to find ψ(qk(I)) for each join result, as well as Cj(I), which stores the referencing relationships between tuples in I(RP) and J(I).

Example 7.1.

Suppose we use the TPC-H schema (shown in Figure 4), where we designate Supplier and Customer as primary private relations. Consider the following query:

    SELECT SUM(price * (1 - discount))
    FROM Supplier, Lineitem, Orders, Customer
    WHERE Supplier.SK = Lineitem.SK AND Lineitem.OK = Orders.OK
      AND Orders.CK = Customer.CK AND Orders.orderdate >= 20200801

We rewrite it as

    SELECT Supplier.SK, Customer.CK, price * (1 - discount)
    FROM Supplier, Lineitem, Orders, Customer
    WHERE Supplier.SK = Lineitem.SK AND Lineitem.OK = Orders.OK
      AND Orders.CK = Customer.CK AND Orders.orderdate >= 20200801

The price * (1 - discount) column in the query results gives all the ψ(q_k(I)) values, while Supplier.SK and Customer.CK yield the referencing relationships from each supplier and customer to all the join results they contribute to.

Figure 4.  The foreign-key graph of TPC-H schema.

We execute the rewritten query in PostgreSQL, and export the query results to a file. Then, an external program is invoked to construct the log(GSQ) LPs from the query results, which are then solved by CPLEX. Finally, we use R2T to compute a privatized output.
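
For illustration only (not the prototype's actual code), the exported rows of the rewritten query can be turned into the LP inputs roughly as follows, where each row carries the referencing keys and the per-row ψ value:

    from collections import defaultdict

    def build_lp_inputs(rows):
        """rows: one tuple per join result, e.g. (supplier_key, customer_key, psi_value).
        Returns the psi vector and, for each primary private tuple, the index set C_j
        of join results referencing it."""
        psi, C = [], defaultdict(set)
        for k, (sk, ck, value) in enumerate(rows):
            psi.append(value)
            C[("Supplier", sk)].add(k)
            C[("Customer", ck)].add(k)
        return psi, list(C.values())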

The computational bottleneck is the log(GS_Q) LPs, each of which contains |J(I)| variables and |J(I)| + |I(R_P)| constraints. This takes polynomial time, but can still be very expensive in practice. One immediate optimization is to solve them in parallel; we present another effective technique to speed up the process in the full version of the paper.

8. Experiments

We conducted experiments on graph pattern-counting queries under node-DP, an important special case of SPJA queries with FK constraints. Here, we compare R2T with naive truncation with smooth sensitivity (NT),20 the smooth distance estimator (SDE),4 the recursive mechanism (RM),6 and the LP-based mechanism (LP).20 We also ran experiments on general SPJA queries to compare R2T with the local-sensitivity-based mechanism (LS).24 The experimental results show that R2T achieves order-of-magnitude improvements over LS in terms of utility, with similar running times; more details are covered in the full version of the paper.

8.1 Setup.

For graph pattern-counting queries, we used four queries: edge counting Q1, length-2 path counting Q2, triangle counting Q△, and rectangle counting Q□. We used five real-world network datasets: Deezer, Amazon1, Amazon2, RoadnetPA, and RoadnetCA. Deezer collects the friendships of users from the music-streaming service Deezer. Amazon1 and Amazon2 are two Amazon co-purchasing networks. RoadnetPA and RoadnetCA are road networks of Pennsylvania and California, respectively. All these datasets were obtained from SNAP.22 Table 1 shows the basic statistics of these datasets.

Table 1. Graph datasets used in the experiments.
Dataset           Deezer     Amazon1    Amazon2    RoadnetPA    RoadnetCA
Nodes             144,000    262,000    335,000    1,090,000    1,970,000
Edges             847,000    900,000    926,000    1,540,000    2,770,000
Maximum degree    420        420        549        9            12
Degree bound D    1,024      1,024      1,024      16           16

Most algorithms need to assume a GS_Q in advance. Note that the value of GS_Q should not depend on the instance, but may use some background knowledge for a particular class of instances. Thus, for the three social networks, we set a degree upper bound of D = 1,024, while for the two road networks, we set D = 16. We then set GS_Q as the maximum number of graph patterns containing any node, which gives GS_Q1 = D, GS_Q2 = GS_Q△ = D², and GS_Q□ = D³.

The LP mechanism requires a truncation threshold τ, but Kasiviswanathan et al.20 do not discuss how it should be set. Initially, we used a random threshold uniformly chosen from [1, GS_Q]. This turned out to be very bad, as with constant probability the picked threshold is Ω(GS_Q), which makes these mechanisms as bad as the naive mechanism that adds noise of scale GS_Q. To achieve better results, as in R2T, we consider {2, 4, 8, …, GS_Q} as the possible choices. Similarly, NT and SDE need a truncation threshold θ on the degree, and we choose one from {2, 4, 8, …, D} randomly.

All experiments were conducted on a Linux server with a 24-core 2.2GHz Intel Xeon CPU and 256GB of memory. Each program was allowed to use at most 10 threads and we set a time limit of 6 hours for each run. Each experiment was repeated 100 times and we report the average running time. The errors are less stable due to the random noise, so we remove the best 20 and worst 20 runs, and report the average error of the remaining 60 runs. The failure probability β in R2T is set to 0.1. The default DP parameter is ε=0.8.

8.2 Experimental results.

The errors and running times of all mechanisms on the graph pattern-counting queries are shown in Table 2. These results indicate a clear superiority of R2T in terms of utility, offering order-of-magnitude improvements over the other methods in many cases. What is more desirable is its robustness: In all 20 query-dataset combinations, R2T consistently achieves an error below 20%, and below 10% in all but three cases. We also notice that, for a given query, R2T performs better on road networks than on social networks. This is because the error of R2T is proportional to DS_Q(I) by our theoretical analysis, so the relative error is proportional to DS_Q(I)/Q(I). Therefore, larger and sparser graphs, such as road networks, lead to smaller relative errors.

Table 2. Comparison between R2T, naive truncation with smooth sensitivity (NT), smooth distance estimator (SDE), LP-based mechanism (LP), and recursive mechanism (RM) on graph pattern-counting queries. Each cell shows relative error (%) / running time (s); the "Query result" rows show the true query answer and the time to evaluate it.

Query  Method        Deezer              Amazon1             Amazon2             RoadnetPA           RoadnetCA
Q1     Query result  847,000 / 1.28      900,000 / 1.52      926,000 / 1.62      1,540,000 / 1.51    2,770,000 / 2.64
       R2T           0.535 / 12.3        0.557 / 15.6        0.432 / 16.2        0.0114 / 26.8       0.00635 / 48.7
       NT            59.1 / 18.1         101 / 29.3          125 / 40.4          1,370 / 21.9        1,410 / 39.7
       SDE           548 / 9,870         363 / 4,570         286 / 1,130         55.2 / 105          81.8 / 292
       LP            14.3 / 16.9         5.72 / 14.7         6.75 / 14.4         3.62 / 8.3          3.02 / 54
Q2     Query result  21,800,000 / 13.8   9,120,000 / 11.8    9,750,000 / 13.8    3,390,000 / 6.39    6,000,000 / 6.06
       R2T           6.64 / 356          12.2 / 170          9.06 / 196          0.0539 / 80.2       0.0352 / 145
       NT            116 / 21.0          398 / 28.4          390 / 41.0          6,160 / 23.2        6,530 / 44.2
       SDE           8,900 / 9,870       5,110 / 4,570       1,930 / 1,130       211 / 104           228 / 296
       LP            35.9 / 8,820        23.2 / 3,600        27.8 / 461          11.1 / 148          13.3 / 404
Q△     Query result  794,000 / 4.53      718,000 / 5.03      667,000 / 4.20      67,200 / 2.96       121,000 / 5.17
       R2T           5.58 / 17.3         1.27 / 18.8         2.03 / 19.9         0.102 / 4.21        0.061 / 7.5
       NT            782 / 23.0          1,660 / 31.7        1,920 / 41.0        110,000 / 23.3      105,000 / 45.0
       SDE           67,300 / 9,880      26,000 / 4,570      9,600 / 1,130       4,150 / 106         3,830 / 297
       LP            24.6 / 131          12.8 / 18.2         14.2 / 18.3         0.104 / 3.95        0.0625 / 7.06
       RM            over time limit     over time limit     over time limit     0.0388 / 1,280      0.0193 / 2,550
Q□     Query result  11,900,000 / 74.3   2,480,000 / 21.6    3,130,000 / 15.6    158,000 / 4.50      262,000 / 10.1
       R2T           16.9 / 289          6.29 / 70.5         10.5 / 86.8         0.0729 / 8.18       0.0638 / 16.2
       NT            3,750 / 57.6        30,700 / 35.8       26,100 / 50.6       319,000 / 24.8      368,000 / 45.0
       SDE           6,970,000 / 9,930   11,400,000 / 4,580  202,000 / 1,140     10,300 / 108        9,130 / 300
       LP            92.6 / 2,530        94.8 / 70.4         77.8 / 81.2         0.223 / 7.83        0.165 / 14.2
       RM            over time limit     over time limit     over time limit     0.0217 / 10,500     over time limit

In terms of running time, all mechanisms are reasonable, except for RM and SDE. RM can only complete within the six-hour time limit in three cases, although it achieves very small errors in those cases. SDE is faster than RM but runs slower than the others. It is also interesting to see that R2T sometimes even runs faster than LP, despite the fact that R2T needs to solve O(log GS_Q) LPs. This is due to the early-stop optimization: The running time of R2T is determined by the LP that corresponds to the near-optimal τ, which often happens to be one of the LPs that can be solved fastest. We also conducted experiments on how the privacy parameter ε affects the various mechanisms; see the full version of the paper. The results show that R2T retains high utility even for small ε.

Selection of τ.  In the next set of experiments, we dive deeper and see how sensitive the utility is to the truncation threshold τ. We tested the queries on Amazon2 and measured the error of the LP-based mechanism20 with different τ. For each query, we tried various τ from 2 to GS_Q and compared the errors with R2T's. The results are shown in Table 3, where the optimal error is marked with an asterisk. The results indicate that the error is highly sensitive to τ, and, more importantly, the optimal choice of τ closely depends on the query; there is no fixed τ that works for all cases. On the other hand, the error of R2T is within a small constant factor (around 6) of the error under the optimal choice of τ, which is exactly what instance optimality promises.

Table 3. Errors of R2T and the LP-based mechanism (LP) with different τ on Amazon2 (* marks the optimal choice of τ for each query).

                       Q1         Q2           Q△           Q□
Query result           926,000    9,750,000    667,000      3,130,000
R2T                    4,000      883,000      13,500       328,000
LP, τ = GS_Q           1,440 *    1,580,000    1,290,000    1,370,000,000
LP, τ = GS_Q/8         2,100      181,000 *    157,000      140,000,000
LP, τ = GS_Q/64        110,000    259,000      15,100       25,800,000
LP, τ = GS_Q/512       645,000    1,260,000    2,790        2,630,000
LP, τ = GS_Q/4096      810,000    3,950,000    2,090 *      274,000
LP, τ = GS_Q/32768     911,000    7,580,000    92,300       48,700 *
LP, τ = GS_Q/262144    924,000    9,340,000    459,000      76,400
Average error          62,500     2,710,000    94,900       2,430,000

9. More Discussions

Following this work, there have been many efforts on query evaluation in relational databases under DP. For instance, Dong and Yi14 and Dong et al.8 improve the logarithmic factor in the error for self-join-free queries and self-join queries, while Fang et al.16 explore answering SPJA queries with max aggregation. In addition, Cai et al.5 and Dong et al.10 focus on answering multiple queries, while Dong et al.7,9 investigate SPJA queries over dynamic databases. For more details, please refer to a recent survey.13 Moreover, by integrating this work with Dong et al.10 and Fang et al.,16 we have developed a DP SQL system26 capable of answering a broad class of queries that include selection, projection, aggregation, join, and group-by operations.

Acknowledgement

This work has been supported by HKRGC under grants 16201318, 16201819, and 16205420; by NTU-NAP startup grant 024584-00001; by the National Science Foundation under grant 2016393; and by DARPA and SPAWAR under contract N66001-15-C-4067.

    References

    • 1. Abadi, M. et al. Deep learning with differential privacy. In Proceedings of the 2016 ACM Conf. on Computer and Communications Security (2016), 308–318.
    • 2. Amin, K., Kulesza, A., Munoz, A., and Vassilvitskii, S. Bounding user contributions: A bias-variance trade-off in differential privacy. In Intern. Conf. on Machine Learning, PMLR (2019), 263–271.
    • 3. Asi, H. and Duchi, J.C. Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. Advances in Neural Information Processing Systems 33 (2020).
    • 4. Blocki, J., Blum, A., Datta, A., and Sheffet, O. Differentially private data analysis of social networks via restricted sensitivity. In Proceedings of the 4th Conf. on Innovations in Theoretical Computer Science (2013), 87–96.
    • 5. Cai, K., Xiao, X., and Cormode, G. PrivLava: Synthesizing relational data with foreign keys under differential privacy. Proceedings of the ACM on Management of Data 1, 2 (2023), 1–25.
    • 6. Chen, S. and Zhou, S. Recursive mechanism: Towards node differential privacy and unrestricted joins. In Proceedings of the 2013 ACM SIGMOD Intern. Conf. on Management of Data (2013), 653–664.
    • 7. Dong, W. et al. Continual observation of joins under differential privacy. Proceedings of the ACM on Management of Data 2, 3 (2024), 1–27.
    • 8. Dong, W. et al. Instance-optimal truncation for differentially private query evaluation with foreign keys. ACM Transactions on Database Systems 49, 4 (2024), 1–40.
    • 9. Dong, W., Luo, Q., and Yi, K. Continual observation under user-level differential privacy. In 2023 IEEE Symp. on Security and Privacy (SP), IEEE (2023), 2190–2207.
    • 10. Dong, W., Sun, D., and Yi, K. Better than composition: How to answer multiple relational queries under differential privacy. Proceedings of the ACM on Management of Data 1, 2 (2023), 1–26.
    • 11. Dong, W. and Yi, K. Residual sensitivity for differentially private multi-way joins. In Proc. ACM SIGMOD Intern. Conf. on Management of Data (2021).
    • 12. Dong, W. and Yi, K. A nearly instance-optimal differentially private mechanism for conjunctive queries. In Proc. ACM Symp. on Principles of Database Systems (2022).
    • 13. Dong, W. and Yi, K. Query evaluation under differential privacy. ACM SIGMOD Record 52, 3 (2023), 6–17.
    • 14. Dong, W. and Yi, K. Universal private estimators. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symp. on Principles of Database Systems (2023), 195–206.
    • 15. Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9, 3–4 (2014), 211–407.
    • 16. Fang, J., Dong, W., and Yi, K. Shifted inverse: A general mechanism for monotonic functions under user differential privacy (2022).
    • 17. Huang, Z., Liang, Y., and Yi, K. Instance-optimal mean estimation under differential privacy. In NeurIPS (2021).
    • 18. Johnson, N., Near, J.P., and Song, D. Towards practical differential privacy for SQL queries. Proceedings of the VLDB Endowment 11, 5 (2018), 526–539.
    • 19. Karwa, V. and Vadhan, S. Finite sample differentially private confidence intervals. In Proceedings of the 9th Innovations in Theoretical Computer Science Conf., Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2018).
    • 20. Kasiviswanathan, S.P., Nissim, K., Raskhodnikova, S., and Smith, A. Analyzing graphs with node differential privacy. In Theory of Cryptography Conf., Springer (2013), 457–476.
    • 21. Kotsogiannis, I. et al. PrivateSQL: A differentially private SQL query engine. Proceedings of the VLDB Endowment 12, 11 (2019), 1371–1384.
    • 22. Leskovec, J. and Krevl, A. SNAP Datasets: Stanford large network dataset collection (2014); https://tinyurl.com/22cypyg3
    • 23. Nissim, K., Raskhodnikova, S., and Smith, A. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Ann. ACM Symp. on Theory of Computing (2007), 75–84.
    • 24. Tao, Y., He, X., Machanavajjhala, A., and Roy, S. Computing local sensitivities of counting queries with joins. In Proceedings of the 2020 ACM SIGMOD Intern. Conf. on Management of Data (2020), 479–494.
    • 25. Vadhan, S. The complexity of differential privacy. In Tutorials on the Foundations of Cryptography, Springer (2017), 347–450.
    • 26. Yu, J. et al. Dop-SQL: A general-purpose, high-utility, and extensible private SQL system. In Proc. Intern. Conf. on Very Large Data Bases (2024).
Footnotes

    • For most of the paper, we consider the case where there is only one primary private relation in R; the case with multiple primary private relations can be transformed into one with a single primary private relation (see the full version of the paper for more details).
    • The probability constant 2/3 can be changed to any constant larger than 1/2 without affecting the asymptotics.
    • The probability β only concerns the utility, not privacy.
    • log has base 2 and ln has base e.
