Research Highlights

R2T: Instance-Optimal Truncation for Differentially Private Query Evaluation with Foreign Keys

In this paper, we propose the first differential privacy mechanism for answering arbitrary SPJA queries in a database with foreign-key constraints.


Abstract

Answering SPJA queries under differential privacy (DP), including graph-pattern counting under node-DP as an important special case, has received considerable attention in recent years. The dual challenge of foreign-key constraints and self-joins is particularly tricky to deal with, and no existing DP mechanisms can correctly handle both. For the special case of graph pattern counting under node-DP, the existing mechanisms are correct (that is, satisfy DP), but they either do not offer nontrivial utility guarantees or are very complicated and costly. In this paper, we propose the first DP mechanism for answering arbitrary SPJA queries in a database with foreign-key constraints. Meanwhile, it achieves a fairly strong notion of optimality, which can be considered a small and natural relaxation of instance optimality. Finally, our mechanism is simple enough that it can be easily implemented on top of any RDBMS and an LP solver. Experimental results show that it offers order-of-magnitude improvements in terms of utility over existing techniques, even those specifically designed for graph pattern counting.

1. Introduction

Differential privacy (DP) has become the standard notion for private data release, due to its strong protection of individual information. Informally speaking, DP requires the query results to be indistinguishable whether or not any particular individual's data is in the database. The standard Laplace mechanism first finds GS_Q, the global sensitivity of the query, that is, how much the query result may change if an individual's data is added to or removed from the database. Then it adds Laplace noise calibrated to this sensitivity to the query result to mask the difference. However, this mechanism runs into issues in a relational database, as illustrated in the following example.
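
As a minimal illustration (ours, not the paper's code; all names are hypothetical), the following Python sketch shows the standard Laplace mechanism: the released answer is the true answer plus Laplace noise whose scale is the global sensitivity divided by the privacy parameter ε.

    import numpy as np

    def laplace_mechanism(true_answer, global_sensitivity, epsilon, rng=np.random.default_rng()):
        """Release true_answer under epsilon-DP by adding Laplace noise
        of scale global_sensitivity / epsilon."""
        return true_answer + rng.laplace(scale=global_sensitivity / epsilon)

    # A counting query over independent individuals changes by at most 1
    # per individual, so global_sensitivity = 1 suffices.
    print(laplace_mechanism(true_answer=10_000, global_sensitivity=1, epsilon=1.0))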

Example 1.1.

Consider a simple join-counting query
$$Q := |R_1(\underline{x_1}, \ldots) \bowtie R_2(x_1, x_2, \ldots)|.$$

Here, the underlined attribute x_1 is the primary key (PK), while R_2.x_1 is a foreign key (FK) referencing R_1.x_1. For instance, R_1 may store customer information where x_1 is the customer ID and R_2 stores the orders the customers have placed. Then this query simply returns the total number of orders; more meaningful queries could be formed with some predicates, for example, all customers from a certain region and/or orders in a certain category. Furthermore, suppose the customers, namely, the tuples in R_1, are the entities whose privacy we aim to protect.

What is the GS_Q for this query? It is, unfortunately, unbounded. This is because a customer, theoretically, could have an unbounded number of orders, and adding such a customer to the database can cause an unbounded change in the query result. A simple fix is to assume a finite GS_Q, which can be justified in practice because we may never have a customer with, say, more than a million orders. However, as assuming such a GS_Q limits the allowable database instances, one tends to be conservative and set a large GS_Q. This allows the Laplace mechanism to work, but adding noise of this scale clearly eliminates any utility of the released query answer.

1.1 The truncation mechanism.

The issue above was first identified by Kotsogiannis et al.,21 who also formalized the DP policy for relational databases with foreign key (FK) constraints. The essence of their model (a rigorous definition is given in Section 2) is that individuals and their private data are stored in separate relations linked by FKs. This is perhaps the most crucial feature of the relational model, yet it causes a major difficulty in designing DP mechanisms as illustrated above. Their solution is the truncation mechanism, which simply deletes all customers with more than τ orders before applying the Laplace mechanism, for some threshold τ. After truncation, the query has sensitivity τ, so adding a noise of scale τ is sufficient.
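
A minimal sketch of this truncation idea for the query in Example 1.1, assuming the orders are represented simply as a list of customer IDs (the data layout and names are ours, not those of any cited system):

    import numpy as np
    from collections import Counter

    def truncated_order_count(customer_of_order, tau, epsilon, rng=np.random.default_rng()):
        """Naive truncation: drop every customer with more than tau orders,
        then release the remaining order count with Lap(tau/epsilon) noise."""
        per_customer = Counter(customer_of_order)
        kept = sum(c for c in per_customer.values() if c <= tau)
        return kept + rng.laplace(scale=tau / epsilon)

With τ near a typical customer's order count, both the bias (the dropped orders) and the noise (scale τ/ε) stay small; setting τ = GS_Q recovers the naive Laplace mechanism.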

Truncation is a special case of Lipschitz extensions and has been studied extensively for graph pattern-counting queries20 and machine learning (ML).1 A well-known issue for the truncation mechanism is the bias-variance trade-off: In one extreme, τ = GS_Q, it degenerates into the naive Laplace mechanism with a large noise (that is, large variance). In the other extreme, τ = 0, the truncation introduces a bias as large as the query answer. The issue of how to choose a near-optimal τ has been extensively studied in the statistics and ML communities.2,17 In fact, the particular query in Example 1.1 is equivalent to the 1-dimensional mean (sum) estimation problem, which is important for many ML tasks. A key challenge there is that the selection of τ must also be done in a DP manner.

1.2 The issue with self-joins.

While self-join-free queries are equivalent to mean (sum) estimation (see Section 3 for a more formal statement), self-joins introduce another challenge unique to relational queries. In particular, all techniques from the statistics and machine-learning literature for choosing a τ critically rely on the fact that the individuals are independent, that is, adding/removing one individual does not affect the data associated with another, which is not true when the query involves self-joins. In fact, when there are self-joins, even the truncation mechanism itself fails, as illustrated in the example below.

Example 1.2.

Suppose we extend the query from Example 1.1 to the following one with a self-join:
$$Q := |R_1(\underline{x_1}, \ldots) \bowtie R_1(\underline{y_1}, \ldots) \bowtie R_2(x_1, y_1, x_2, \ldots)|.$$

Note that the PK of R_1 has been renamed differently in its two logical copies, so that they join with different attributes of R_2. For instance, R_2 may store the transactions between pairs of customers, and this query would count the total number of transactions. Again, predicates can be added to make the query more meaningful.

Let G be an undirected τ-regular graph (that is, every vertex has degree τ) with n vertices. We will construct an instance I = (I_1, I_2) on which the truncation mechanism fails. Let I_1 be the vertices of G and let I_2 be its edges (each edge appears twice since G is undirected). Thus, Q simply returns twice the number of edges in the graph. Let I′ be the neighboring instance corresponding to the graph G′ obtained from G by adding a vertex v that connects to every existing vertex. Note that in G′, v has degree n while every other vertex has degree τ+1. Now truncating by τ fails DP: The query answer on I is nτ, while that on I′ is 0 (all vertices are truncated). Adding noise of scale τ cannot mask this gap, violating the DP definition.
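
To make the counterexample concrete, the sketch below (our own illustration) builds a 2-regular graph (a cycle), applies naive truncation with τ = 2 to it and to its neighbor obtained by adding one vertex connected to everything, and shows that the two truncated answers differ by far more than τ:

    from collections import Counter

    def truncated_answer(edges, tau):
        """Naive truncation for the self-join query of Example 1.2: drop every
        vertex whose degree exceeds tau, then count the surviving edges twice
        (the query counts each undirected edge once per direction)."""
        deg = Counter()
        for u, v in edges:
            deg[u] += 1
            deg[v] += 1
        return sum(2 for u, v in edges if deg[u] <= tau and deg[v] <= tau)

    n, tau = 1000, 2
    cycle = [(i, (i + 1) % n) for i in range(n)]      # 2-regular graph G
    plus_v = cycle + [(n, i) for i in range(n)]       # add vertex v joined to every vertex of G
    print(truncated_answer(cycle, tau))    # 2000 = n * tau
    print(truncated_answer(plus_v, tau))   # 0: every vertex now has degree > tau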

The reason why the truncation mechanism fails is that the claim above, that adding or removing one individual does not affect the data associated with another, does not hold in the presence of self-joins. More fundamentally, this is due to the correlation among the individuals introduced by self-joins. In the example above, we see that the addition of one node may cause the degrees of many others to increase. For the problem of graph pattern counting under node-DP, which can be formulated as a multi-way self-join counting query on the special schema R = {Node(ID), Edge(src, dst)}, Kasiviswanathan et al.20 propose an LP-based truncation mechanism (to differentiate, we call the mechanism above naive truncation) to fix the issue, but they do not study how to choose τ. As a result, while their mechanism satisfies DP, there is no optimality guarantee in terms of utility. In fact, if τ is chosen inappropriately, their error can be even larger than GS_Q, namely, worse than the naive Laplace mechanism.

1.3 Our contributions.

In this paper, we start by studying how to choose a near-optimal τ in a DP manner in the presence of self-joins. As with all prior τ-selection mechanisms over mean (sum) estimation2,17 and self-join-free queries,24 we assume that the global sensitivity of the given query Q is bounded by GSQ. Since one tends to set a large GSQ as argued in Example 1.1, we must try to minimize the dependency on GSQ.

The first contribution of this paper (Section 4) is a simple and general DP mechanism, called Race-to-the-Top (R2T), which can be used to adaptively choose τ in combination with any valid DP truncation mechanism that satisfies certain properties. In fact, it does not choose τ per se; instead, it directly returns a privatized query answer with error at most O(log(GS_Q) · log log(GS_Q) · DS_Q(I)) for any instance I, with constant probability. While we defer the formal definition of DS_Q(I) to Section 3, what we can show is that it is a per-instance lower bound, that is, any valid DP mechanism has to incur error Ω(DS_Q(I)) on I (in a certain sense). Thus, the error of R2T is instance-optimal up to logarithmic factors in GS_Q. Furthermore, a logarithmic dependency on GS_Q is also unavoidable,19 even for the mean estimation problem, that is, the simple self-join-free query in Example 1.1. In practice, these log factors are usually between 10 and 100, and our experiments show that R2T has better utility than previous methods in most cases.

However, as we saw in Example 1.2, naive truncation is not a valid DP mechanism in the presence of self-joins. As our second contribution (Section 5), we extend the LP-based mechanism of Kasiviswanathan et al.,20 which only works for graph pattern-counting queries, to general queries on an arbitrary relational schema that use the four basic relational operators: Selection (with arbitrary predicates), Projection, Join (including self-join), and sum Aggregation. When plugged into R2T, this yields the first DP mechanism for answering arbitrary SPJA queries in a database with FK constraints. For SJA queries, the utility is instance-optimal, while the optimality guarantee for SPJA queries is slightly weaker, but we argue that this is unavoidable.

Furthermore, the simplicity of our mechanism allows it to be built on top of any RDBMS and an LP solver. To demonstrate its practicality, we built a system prototype (Section 7) using PostgreSQL and CPLEX. Experimental results (Section 8) show that it provides order-of-magnitude improvements in terms of utility over the state-of-the-art DP-SQL engines. We obtain similar improvements even over node-DP mechanisms specifically designed for graph pattern-counting problems, which are just special SJA queries.

2. Preliminaries

2.1 Database queries.

Let R be a database schema. We start with a multi-way join:

$$J := R_1(\mathbf{x}_1) \bowtie \cdots \bowtie R_n(\mathbf{x}_n),$$

where R_1, …, R_n are relation names in R and each x_i is a set of arity(R_i) variables. When considering self-joins, there can be repeats, that is, R_i = R_j; in this case, we must have x_i ≠ x_j, or one of the two atoms would be redundant. Let var(J) := x_1 ∪ ⋯ ∪ x_n.

Let I be a database instance over R. For any RR, denote the corresponding relation instance in I as I(R). This is a physical relation instance of R. We use I(R,x) to denote I(R) after renaming its attributes to x, which is also called a logical relation instance of R. When there are self-joins, one physical relation instance may have multiple logical relation instances; they have the same rows but with different column (variable) names.

A JA or an SJA query Q aggregates over the join results J(I). More abstractly, let ψ : dom(var(J)) → ℕ be a function that assigns non-negative integer weights to the join results, where dom(var(J)) denotes the domain of var(J). The result of evaluating Q on I is

$$Q(I) := \sum_{q \in J(I)} \psi(q).$$

Note that the function ψ only depends on the query. For a counting query, ψ(·) ≡ 1; for an aggregation query, for example, SUM(A*B), ψ(q) is the value of A*B for q. An SJA query with an arbitrary predicate over var(J) can be easily incorporated into this formulation: If some q ∈ J(I) does not satisfy the predicate, we simply set ψ(q) = 0.

Example 2.1.

Graph pattern-counting queries can be formulated as SJA queries. Suppose we store a graph in a relational database by the schema R = {Edge(src, dst), Node(ID)}, where src and dst are FKs referencing ID. Then the number of length-3 paths can be counted by first computing the join
$$\text{Edge}(A, B) \bowtie \text{Edge}(B, C) \bowtie \text{Edge}(C, D),$$

followed by a count aggregation. Note that this also counts triangles and non-simple paths (for example, x → y → x → z), which may or may not be considered as length-3 paths depending on the application. If not, they can be excluded by introducing a predicate (that is, redefining ψ) A ≠ C ∧ A ≠ D ∧ B ≠ D. If the graph is undirected, then the query counts every path twice, so we should divide the answer by 2. Alternatively, we may introduce the predicate A < D to eliminate the double counting.
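
As a concrete (and purely illustrative) rendering of this formulation, the sketch below materializes the three-way self-join over an edge list and folds the predicates into ψ; it assumes an undirected graph stored with both directions of each edge:

    from collections import defaultdict

    def count_length3_paths(edges, simple=True, undirected=True):
        """Evaluate the SJA query Edge(A,B) joined with Edge(B,C) and Edge(C,D),
        with psi(q) = 1, redefining psi to 0 for join results failing the predicates."""
        out = defaultdict(list)
        for a, b in edges:
            out[a].append(b)
        total = 0
        for a in list(out):
            for b in out[a]:
                for c in out.get(b, []):
                    for d in out.get(c, []):
                        psi = 1
                        if simple and (a == c or a == d or b == d):
                            psi = 0          # exclude triangles and non-simple paths
                        if undirected and not a < d:
                            psi = 0          # predicate A < D: count each path once
                        total += psi
        return total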

2.2 DP in relational databases with FK constraints.

We adopt the DP policy of Kotsogiannis et al.,21 which defines neighboring instances by taking FK constraints into consideration. We model all the FK relationships as a directed acyclic graph (DAG) over R by adding a directed edge from R to R′ if R has an FK referencing the PK of R′. There is a designated primary private relation R_P, and any relation that has a direct or indirect FK referencing R_P is called a secondary private relation. The referencing relationship over the tuples is defined recursively as follows: (1) any tuple t_P ∈ I(R_P) is said to reference itself; (2) for t_P ∈ I(R_P), t ∈ I(R), and t′ ∈ I(R′), if t′ references t_P, R has an FK referencing the PK of R′, and the FK of t equals the PK of t′, then we say that t references t_P. Then two instances I and I′ are considered neighbors if I′ can be obtained from I by deleting a set of tuples, all of which reference the same tuple t_P ∈ I(R_P), or vice versa. In particular, t_P itself may also be deleted, in which case all tuples referencing t_P must be deleted in order to preserve the FK constraints. Finally, for a join result q ∈ J(I), we say that q references t_P ∈ I(R_P) if |t_P ⋈ q| = 1.

We use the notation I ∼ I′ to denote two neighboring instances, and I ∼_{t_P} I′ denotes that all tuples in the difference between I and I′ reference the tuple t_P ∈ I(R_P).

Example 2.2.

Consider the TPC-H schema:

$$\mathbf{R} = \{\text{Nation}(\underline{NK}),\ \text{Customer}(\underline{CK}, NK),\ \text{Order}(\underline{OK}, CK),\ \text{Lineitem}(OK)\}.$$

If the customers are the individuals whose privacy we wish to protect, then we designate Customer as the primary private relation, which implies that Order and Lineitem will be secondary private relations, while Nation will be public. Note that once Customer is designated as a primary private relation, the information in Order and Lineitem is also protected since the privacy induced by Customer is stronger than that induced by Order and Lineitem. Alternatively, one may designate Order as the primary private relation, which implies that Lineitem will be a secondary private relation, while Customer and Nation will be public. This would result in weaker privacy protection but offer higher utility.

Some queries, as given, may be incomplete, that is, they have a variable that is an FK but whose referenced PK does not appear in the query Q. The query in Example 2.1 is such an example. Following Kotsogiannis et al.,21 we always make the query complete by iteratively adding to Q those relations whose PKs are referenced. The PKs will be given variable names matching the FKs. For example, for the query in Example 2.1, we add Node(A), Node(B), Node(C), and Node(D).

The DP policy above incorporates both edge-DP and node-DP, two commonly used DP policies for private graph analysis, as special cases. In Example 2.1, by designating Edge as the private relation (Node is thus public, and we may even assume it contains all possible vertex IDs), we obtain edge-DP; for node-DP, we add FK constraints from src and dst to ID, and designate Node as the primary private relation, while Edge becomes a secondary private relation.

A mechanism M is ε-DP if for any neighboring instances I ∼ I′ and any output y, we have

$$\Pr[M(I) = y] \le e^{\varepsilon} \Pr[M(I') = y].$$

Typical values of ε used in practice range from 0.1 to 10, where a smaller value corresponds to stronger privacy protection.

3. Instance Optimality of DP Mechanisms with FK Constraints

Global sensitivity and worst-case optimality.  The standard DP mechanism is the Laplace mechanism,15 which adds Lap(GS_Q/ε) to the query answer. Here, Lap(b) denotes a random variable drawn from the Laplace distribution with scale b, and GS_Q = max_{I ∼ I′} |Q(I) − Q(I′)| is the global sensitivity of Q. However, either a join or a sum aggregation makes GS_Q unbounded. The issue with the former is illustrated in Example 1.1, where a customer may have unboundedly many orders; a sum aggregation with an unbounded ψ results in the same situation. Thus, as in prior work,2,17,24 we restrict to a set of instances I such that

$$\max_{I, I' \in \mathcal{I},\ I \sim I'} |Q(I) - Q(I')| = \mathrm{GS}_Q, \qquad (1)$$

where GSQ is a parameter given in advance. For the query in Example 1.1, this is equivalent to assuming that a customer is allowed to have at most GSQ orders in any instance.

For general queries, the situation is more complicated. We first consider SJA queries. Given an instance I and an SJA query Q, the sensitivity of a tuple t_P ∈ I(R_P) is

$$S_Q(I, t_P) := \sum_{q \in J(I)} \psi(q) \cdot \mathbb{1}(q \text{ references } t_P),$$

where 𝟙(·) is the indicator function. For SJA queries, (1) is equivalent to

$$\max_{I \in \mathcal{I}}\ \max_{t_P \in I(R_P)} S_Q(I, t_P) = \mathrm{GS}_Q.$$

For self-join-free SJA queries, it is clear that

$$Q(I) = \sum_{t_P \in I(R_P)} S_Q(I, t_P),$$

which turns the problem into a sum estimation problem. However, when self-joins are present, this equality no longer holds since one join result q references multiple tP’s. This also implies that removing one tuple from I(RP) may affect multiple SQ(I,tP)’s, making the neighboring relationship more complicated than in the sum estimation problem, where two neighboring instances differ by only one datum.2,17
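
For intuition, here is a tiny sketch (ours, with made-up data) of the per-tuple sensitivities for the self-join-free query of Example 1.1 and of the decomposition above:

    from collections import Counter

    # Each order references its customer, the primary private tuple.
    orders = ["alice", "alice", "bob", "carol", "carol", "carol"]

    S = Counter(orders)              # S_Q(I, t_P): number of orders referencing each customer
    Q = len(orders)                  # query answer: total number of orders
    assert Q == sum(S.values())      # self-join-free: Q(I) = sum of per-tuple sensitivities
    print(max(S.values()))           # largest per-tuple sensitivity (3, from carol)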

What notion of optimality shall we use for DP mechanisms over SJA queries? The traditional worst-case optimality is meaningless, since the naive Laplace mechanism that adds noise of scale GSQ is already worst-case optimal, just by the definition of GSQ. In fact, the basis of the entire line of work on the truncation mechanism and smooth sensitivity is the observation that typical instances should be much easier than the worst case, so these mechanisms all add instance-specific noises, which are often much smaller than the worst-case noise level GSQ.

Instance optimality.  The standard notion of optimality for measuring the performance of an algorithm on a per-instance basis is instance optimality. More precisely, let 𝓜 be the class of DP mechanisms and let

$$L_{\mathrm{ins}}(I) := \min_{M' \in \mathcal{M}} \min\{\xi : \Pr[|M'(I) - Q(I)| \le \xi] \ge 2/3\}$$

be the lower bound any M′ ∈ 𝓜 can achieve (with probability 2/3) on I. Then the standard definition of instance optimality requires us to design an M such that

$$\Pr[|M(I) - Q(I)| \le c \cdot L_{\mathrm{ins}}(I)] \ge 2/3$$

for every I, where c is called the optimality ratio. Unfortunately, for any I, one can design a trivial M′(·) ≡ Q(I) that has 0 error on I (but fails miserably on other instances), so L_ins(·) ≡ 0, which rules out instance-optimal DP mechanisms by a standard argument.15

To avoid such a trivial M′, prior work3,12 considers a relaxed version of instance optimality where we compare M against any M′ that is required to work well not just on I, but also on its neighbors; that is, we raise the target error from L_ins(I) to

$$L_{\mathrm{nbr}}(I) := \min_{M' \in \mathcal{M}} \max_{I' : I' \sim I} \min\{\xi : \Pr[|M'(I') - Q(I')| \le \xi] \ge 2/3\}.$$

Vadhan25 observes that L_nbr(I) ≥ LS_Q(I)/2, where

$$\mathrm{LS}_Q(I) := \max_{I' : I' \sim I} |Q(I) - Q(I')|$$

is the local sensitivity of Q at I. This relaxed instance optimality has been used for certain ML problems3 and conjunctive queries without FKs.12 However, it has an issue for SJA queries in a database with FK constraints: For any I, we can add a tuple t_P to I(R_P), together with tuples in the secondary private relations all referencing t_P, obtaining an I′ such that S_Q(I′, t_P) = GS_Q; that is, LS_Q(·) ≡ GS_Q. This means that this relaxed instance optimality degenerates into worst-case optimality. This is also why smooth sensitivity, including all its efficiently computable versions,11,12,18,23 will not have better utility than the naive Laplace mechanism on databases with FK constraints, since these quantities are all no lower than the local sensitivity.

The reason why the above relaxation is "too much" is that we require M′ to work well on any neighbor I′ of I. Under the neighborhood definition with FK constraints, this means that I′ can be any instance obtained from I by adding a tuple t_P and arbitrary tuples referencing t_P in the secondary private relations. This is too high a requirement for M′, hence too low an optimality notion for M.

To address the issue, Huang et al.17 restrict the neighborhood in which M′ is required to work well, but their definition only works for the mean estimation problem. For SJA queries under FK constraints, we revise L_nbr(·) to

$$L_{\text{d-nbr}}(I) := \min_{M' \in \mathcal{M}} \max_{I' : I' \subseteq I,\ I' \sim I} \min\{\xi : \Pr[|M'(I') - Q(I')| \le \xi] \ge 2/3\},$$

namely, we require M′ to work well only on I and its down-neighbors, which can be obtained only by removing a tuple t_P already in I(R_P) together with all tuples referencing t_P. Correspondingly, an instance-optimal M (with regard to the down-neighborhood) is one for which the instance-optimality condition above holds with L_ins replaced by L_d-nbr.

Clearly, the smaller the neighborhood, the stronger the optimality notion. Our instance optimality notion is thus stronger than those in prior work.3,12,17 Note that for such an instance-optimal M (by our definition), there may still exist I and M′ such that M′ does better on I than M, but if this happens, M′ must do worse on one of the down-neighbors of I, which is as typical as I itself.

Using the same argument as Vadhan,25 we have L_d-nbr(I) ≥ DS_Q(I)/2, where

$$\mathrm{DS}_Q(I) := \max_{I' : I' \subseteq I,\ I' \sim I} |Q(I) - Q(I')| = \max_{t_P \in I(R_P)} S_Q(I, t_P)$$

is the downward local sensitivity of Q at I. Thus, DS_Q(I) is a per-instance lower bound, which can replace L_ins(I) in the definition of instance-optimal DP mechanisms.

4. R2T: Instance-Optimal Truncation

Our instance-optimal truncation mechanism, Race-to-the-Top (R2T), can be used in combination with any truncation method Q(I, τ), which is a function Q : I × ℕ → ℕ with the following properties:

  1. For any τ, the global sensitivity of Q(·, τ) is at most τ.

  2. For any τ, Q(I, τ) ≤ Q(I).

  3. For any I, there exists a non-negative integer τ*(I) ≤ GS_Q such that for any τ ≥ τ*(I), Q(I, τ) = Q(I).

We describe various choices for Q(I, τ), depending on the DP policy and on whether the query contains self-joins and/or projections, in the subsequent sections. Intuitively, such a Q(I, τ) gives a stable (property (1)) underestimate (property (2)) of Q(I), while reaching Q(I) for a sufficiently large τ (property (3)). Note that Q(I, τ) itself is not DP. To make it DP, we can add Lap(τ/ε) noise, which turns it into an ε-DP mechanism by property (1). The issue, of course, is how to set τ. The basic idea of R2T is to try geometrically increasing values of τ and somehow pick the "winner" of the race.

Assuming such a Q(I, τ), R2T works as follows. For a failure probability β, we first compute

$$\tilde{Q}(I, \tau^{(j)}) := Q(I, \tau^{(j)}) + \mathrm{Lap}\!\left(\frac{\log(\mathrm{GS}_Q)\,\tau^{(j)}}{\varepsilon}\right) - \frac{\log(\mathrm{GS}_Q)\,\ln\!\frac{\log(\mathrm{GS}_Q)}{\beta}}{\varepsilon} \cdot \tau^{(j)} \qquad (7)$$

for τ^(j) = 2^j, j = 1, …, log(GS_Q). Then R2T outputs

$$\tilde{Q}(I) := \max\left\{\max_j\, \tilde{Q}(I, \tau^{(j)}),\ Q(I, 0)\right\}.$$

The privacy of R2T is straightforward: Since Q(I, τ^(j)) has global sensitivity at most τ^(j), and the third term of (7) is independent of I, each Q~(I, τ^(j)) satisfies ε/log(GS_Q)-DP by the standard Laplace mechanism. Collectively, all the Q~(I, τ^(j))'s satisfy ε-DP by the basic composition theorem.15 Finally, returning the maximum preserves DP by the post-processing property of DP.
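
A minimal sketch of R2T as just described, assuming a black-box truncated query trunc_q(I, τ) with the three properties above (the function and variable names are ours):

    import math
    import numpy as np

    def r2t(instance, trunc_q, gs_q, epsilon, beta, rng=np.random.default_rng()):
        """Race-to-the-Top: run the truncated query at geometrically increasing tau,
        privatize each run, shift it down by its noise scale, and return the maximum."""
        L = int(math.log2(gs_q))                 # gs_q is assumed to be a power of 2
        best = trunc_q(instance, 0)              # Q(I, 0); equals 0 for the truncations in this paper
        for j in range(1, L + 1):
            tau = 2 ** j
            noisy = (trunc_q(instance, tau)
                     + rng.laplace(scale=L * tau / epsilon)        # Lap(log(GS_Q) * tau / eps)
                     - L * math.log(L / beta) * tau / epsilon)     # downward shift of (7)
            best = max(best, noisy)
        return best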

Utility analysis.  For some intuition on why R2T offers good utility, see Figure 1. By properties (2) and (3), as we increase τ, Q(I, τ) gradually approaches the true answer Q(I) from below and reaches Q(I, τ) = Q(I) when τ ≥ τ*(I). However, we cannot use Q(I, τ) or τ*(I) directly as this would violate DP. Instead, we only get to see Q~(I, τ), which is masked with noise of scale proportional to τ. We thus face a dilemma: The closer we get to Q(I), the more uncertain we are about the estimate Q~(I, τ). To get out of the dilemma, we shift Q(I, τ) down by an amount that equals the scale of the noise (ignoring the log log factor). This penalty for Q~(I, τ̂), where τ̂ is the smallest power of 2 above τ*(I), will be on the same order as τ*(I), so it does not affect the error by more than a constant factor, while taking the maximum ensures that the winner is at least as good as Q~(I, τ̂). Meanwhile, the extra log log factor ensures that no Q~(I, τ) overshoots the target. Below, we formalize this intuition.

Figure 1.  An illustration of R2T.
Theorem 1.

On any instance I, with probability at least 1 − β, we have
$$Q(I) - \frac{4 \log(\mathrm{GS}_Q)\,\ln\!\frac{\log(\mathrm{GS}_Q)}{\beta}}{\varepsilon} \cdot \tau^*(I)\ \le\ \tilde{Q}(I)\ \le\ Q(I).$$

5. Truncation for SJA Queries

In this section, we design a Q(I, τ) with τ*(I) = DS_Q(I) for SJA queries. Plugging this into Theorem 1 with β = 1/3, together with the definition of instance optimality, turns R2T into an instance-optimal DP mechanism with an optimality ratio of O(log(GS_Q) log log(GS_Q)/ε).

For self-join-free SJA queries, each join result q ∈ J(I) references only one tuple in R_P. Thus, the tuples in R_P are independent, that is, removing one does not affect the sensitivities of others. This means that naive truncation (that is, removing all tuples t_P with S_Q(I, t_P) > τ and then summing up the rest) is a valid Q(I, τ) that satisfies the three properties required by R2T with τ*(I) = DS_Q(I).

When there are self-joins, naive truncation does not satisfy property (1), as illustrated in Example 1.2, where all SQ(I,tP)’s in two neighboring instances may differ. Below we generalize the LP-based mechanism for graph pattern counting20 to arbitrary SJA queries, and show that it satisfies the three properties with τ*(I)=DSQ(I).

Given an SJA query Q and an instance I, recall that Q(I) = Σ_{q ∈ J(I)} ψ(q), where J(I) is the set of join results. For k ∈ [|J(I)|], let q_k(I) be the kth join result. For each j ∈ [|I(R_P)|], let t_j(I) be the jth tuple in I(R_P). We use C_j(I) to denote (the indices of) the set of join results that reference t_j(I). More precisely,

$$C_j(I) := \{k : q_k(I) \text{ references } t_j(I)\}.$$

For each k ∈ [|J(I)|], we introduce a variable u_k representing the weight assigned to the join result q_k(I), and return the optimal objective value of the following LP as Q(I, τ):

$$\begin{aligned} \text{maximize} \quad & Q(I, \tau) = \sum_{k \in [|J(I)|]} u_k, \\ \text{subject to} \quad & \sum_{k \in C_j(I)} u_k \le \tau, \quad j \in [|I(R_P)|], \\ & 0 \le u_k \le \psi(q_k(I)), \quad k \in [|J(I)|]. \end{aligned}$$
Lemma 1.

For SJA queries, the Q(I,τ) defined above satisfies the three properties required by R2T with τ*(I)=DSQ(I).
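
A sketch of this LP using scipy.optimize.linprog as the solver (our own illustration; the prototype in Section 7 uses CPLEX). The inputs are the weights ψ(q_k) of the join results and, for each primary private tuple t_j, the index set C_j of join results referencing it:

    import numpy as np
    from scipy.optimize import linprog

    def lp_truncated_query(psi, C, tau):
        """Q(I, tau): maximize sum_k u_k subject to sum_{k in C_j} u_k <= tau
        for every j, and 0 <= u_k <= psi[k]."""
        m = len(psi)
        A_ub = np.zeros((len(C), m))
        for j, refs in enumerate(C):
            A_ub[j, list(refs)] = 1.0
        res = linprog(c=-np.ones(m),                       # linprog minimizes, so negate
                      A_ub=A_ub, b_ub=np.full(len(C), float(tau)),
                      bounds=[(0.0, float(w)) for w in psi],
                      method="highs")
        return -res.fun

    # A 4-clique (cf. Example 5.1 below): 6 edges as join results, each vertex touches 3 of them.
    psi = [1] * 6
    C = [{0, 1, 2}, {0, 3, 4}, {1, 3, 5}, {2, 4, 5}]
    print(lp_truncated_query(psi, C, tau=2))   # 4.0, i.e., weight 2/3 on each edge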

Example 5.1.

We now give a step-by-step example to show how this truncation method works together with R2T. Consider the problem of edge counting under node-DP, which corresponds to the SJA query
$$Q := |\sigma_{ID_1 < ID_2}(\text{Node}(ID_1) \bowtie \text{Node}(ID_2) \bowtie \text{Edge}(ID_1, ID_2))|$$

on the graph data schema introduced in Example 2.1. Note that in SQL, the query would be written as

    SELECT count(*)
    FROM Node AS Node1, Node AS Node2, Edge
    WHERE Edge.src = Node1.ID AND Edge.dst = Node2.ID AND Node1.ID < Node2.ID

Suppose we set GS_Q = 2^10 = 1,024. For this particular Q, this means the maximum degree of any node in any allowable instance is 1,024. We set β = 0.1 and ε = 1.

Now, suppose we are given an I containing 8,103 nodes, which form 1,000 triangles, 1,000 4-cliques, 100 8-stars, 10 16-stars, and one 32-star, as shown in Figure 2. The true query result is

Q(I) = 3×1,000 + 6×1,000 + 8×100 + 16×10 + 32 = 9,992.

We run R2T with τ^(j) = 2^j for j = 1, …, 10. For each τ = τ^(j), we assign a weight u_k ∈ [0, 1] to each join result (that is, an edge) that satisfies the predicate ID_1 < ID_2. To calculate Q(I, τ), we can consider the LP on each clique/star separately. For a triangle, the optimal LP solution always assigns u_k = 1 to each edge. For each 4-clique, it assigns 2/3 to each edge for τ = 2, and 1 for τ ≥ 4. For each k-star, the optimal LP value is min{k, τ}. Thus, the optimal LP values are

Q(I, 2) = 1×3,000 + (2/3)×6,000 + 2×100 + 2×10 + 2×1 = 7,222,
Q(I, 4) = 1×3,000 + 1×6,000 + 4×100 + 4×10 + 4×1 = 9,444,
Q(I, 8) = 1×3,000 + 1×6,000 + 8×100 + 8×10 + 8×1 = 9,888,
Q(I, 16) = 1×3,000 + 1×6,000 + 8×100 + 16×10 + 16×1 = 9,976.

In addition, we have Q(I, 0) = 0 and Q(I, τ) = 9,992 for τ ≥ 32.

Then, let us see how to run R2T with these Q(I, τ)'s. Recall that ε = 1, β = 0.1, and GS_Q = 2^10. For convenience, assume Lap(1) returns −1 and +1 by turns. Plugging these into (7), we have

Q~(I, 2) = 7,222 + (−1)·20 − 92.1 ≈ 7,110,
Q~(I, 4) = 9,444 + 1·40 − 184 ≈ 9,300,
Q~(I, 8) = 9,888 + (−1)·80 − 368 ≈ 9,440,
Q~(I, 16) = 9,976 + 1·160 − 737 ≈ 9,399,
Q~(I, 32) = 9,992 + (−1)·320 − 1,474 ≈ 8,198,
Q~(I, 64) = 9,992 + 1·640 − 2,947 ≈ 7,685,
...

Finally, taking the maximum over all these values and Q(I, 0), we have Q~(I) = Q~(I, 8) = 9,440.

Figure 2.  Example of edge counting.
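
The R2T arithmetic in Example 5.1 can be replayed mechanically; the sketch below (ours) uses the same deterministic ±1 stand-in for Lap(1):

    import math

    eps, beta, log_gs = 1.0, 0.1, 10            # GS_Q = 2^10
    Q_tau = {2: 7222, 4: 9444, 8: 9888, 16: 9976, 32: 9992, 64: 9992}

    sign = -1                                    # Lap(1) "returns" -1, +1, -1, ... by turns
    best = 0                                     # Q(I, 0) = 0
    for tau, q in Q_tau.items():
        noisy = q + sign * log_gs * tau / eps - log_gs * math.log(log_gs / beta) * tau / eps
        best = max(best, noisy)
        sign = -sign
    print(round(best))                           # 9440, attained at tau = 8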

6. Truncation for SPJA Queries

A projection reduces the query answer, hence its sensitivity, so it requires less noise. However, it makes achieving instance optimality harder: Even for the simple query |π_{x_2}(R_1(x_1) ⋈ R_2(x_1, x_2))|, it is impossible to achieve error f · DS_Q(I) on every instance I, for any factor f. To address this issue, we propose a truncation for SPJA queries whose error depends on another instance-specific notion. Please see the full version of the paper for more details.

7. System Implementation

Based on the R2T algorithm, we have implemented a system on top of PostgreSQL and CPLEX. The system structure is shown in Figure 3. The input to our system is any SPJA query written in SQL, together with a designated primary private relation RP (interestingly, while R2T satisfies the DP policy with FK constraints, the algorithm itself does not need to know the PK-FK constraints).

Figure 3.  System structure.

The system supports SUM and COUNT aggregation. Our SQL parser first unpacks the aggregation into a reporting query so as to find ψ(qk(I)) for each join result, as well as Cj(I), which stores the referencing relationships between tuples in I(RP) and J(I).

Example 7.1.

Suppose we use the TPC-H schema (shown in Figure 4), where we designate Supplier and Customer as primary private relations. Consider the following query:

    SELECT SUM(price * (1 - discount))
    FROM Supplier, Lineitem, Orders, Customer
    WHERE Supplier.SK = Lineitem.SK AND Lineitem.OK = Orders.OK
      AND Orders.CK = Customer.CK AND Orders.orderdate >= 20200801

We rewrite it as

    SELECT Supplier.SK, Customer.CK, price * (1 - discount)
    FROM Supplier, Lineitem, Orders, Customer
    WHERE Supplier.SK = Lineitem.SK AND Lineitem.OK = Orders.OK
      AND Orders.CK = Customer.CK AND Orders.orderdate >= 20200801

The price * (1 - discount) column in the query results gives all the ψ(q_k(I)) values, while Supplier.SK and Customer.CK yield the referencing relationships from each supplier and customer to all the join results they contribute to.

Figure 4.  The foreign-key graph of TPC-H schema.

We execute the rewritten query in PostgreSQL, and export the query results to a file. Then, an external program is invoked to construct the log(GSQ) LPs from the query results, which are then solved by CPLEX. Finally, we use R2T to compute a privatized output.
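
For illustration only (not the prototype's actual code), the exported rows of the rewritten query can be turned into the LP inputs roughly as follows, where each row carries the referencing keys and the per-row ψ value:

    from collections import defaultdict

    def build_lp_inputs(rows):
        """rows: one tuple per join result, e.g. (supplier_key, customer_key, psi_value).
        Returns the psi vector and, for each primary private tuple, the index set C_j
        of join results referencing it."""
        psi, C = [], defaultdict(set)
        for k, (sk, ck, value) in enumerate(rows):
            psi.append(value)
            C[("Supplier", sk)].add(k)
            C[("Customer", ck)].add(k)
        return psi, list(C.values())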

The computational bottleneck is the log(GS_Q) LPs, each of which contains |J(I)| variables and |J(I)| + |I(R_P)| constraints. This takes polynomial time, but can still be very expensive in practice. One immediate optimization is to solve them in parallel; we present another effective technique to speed up the process in the full version of the paper.

8. Experiments

We conducted experiments on graph pattern-counting queries under node-DP, an important special case of SPJA queries with FK constraints. Here, we compare R2T with naive truncation with smooth sensitivity (NT),20 the smooth distance estimator (SDE),4 the recursive mechanism (RM),6 and the LP-based mechanism (LP).20 We also ran experiments on general SPJA queries to compare R2T with the local-sensitivity-based mechanism (LS).24 The experimental results show that R2T achieves order-of-magnitude improvements over LS in terms of utility, with similar running times; more details are covered in the full version of the paper.

8.1 Setup.

For graph pattern-counting queries, we used four queries: edge counting Q1, length-2 path counting Q2, triangle counting Q△, and rectangle counting Q□. We used five real-world network datasets: Deezer, Amazon1, Amazon2, RoadnetPA, and RoadnetCA. Deezer collects the friendships of users from the music-streaming service Deezer. Amazon1 and Amazon2 are two Amazon co-purchasing networks. RoadnetPA and RoadnetCA are road networks of Pennsylvania and California, respectively. All these datasets were obtained from SNAP.22 Table 1 shows the basic statistics of these datasets.

Table 1. Graph datasets used in the experiments.
Dataset           Deezer     Amazon1    Amazon2    RoadnetPA    RoadnetCA
Nodes             144,000    262,000    335,000    1,090,000    1,970,000
Edges             847,000    900,000    926,000    1,540,000    2,770,000
Maximum degree    420        420        549        9            12
Degree bound D    1,024      1,024      1,024      16           16

Most algorithms need to assume a GS_Q in advance. Note that the value of GS_Q should not depend on the instance, but may use some background knowledge for a particular class of instances. Thus, for the three social networks, we set a degree upper bound of D = 1,024, while for the two road networks, we set D = 16. We then set GS_Q as the maximum number of graph patterns containing any node, which gives GS_Q1 = D, GS_Q2 = GS_Q△ = D², and GS_Q□ = D³.

The LP mechanism requires a truncation threshold τ, but Kasiviswanathan et al.20 do not discuss how it should be set. Initially, we used a random threshold uniformly chosen from [1, GS_Q]. This turned out to be very bad, as with constant probability the picked threshold is Ω(GS_Q), which makes these mechanisms as bad as the naive mechanism that adds noise of scale GS_Q. To achieve better results, as in R2T, we consider {2, 4, 8, …, GS_Q} as the possible choices. Similarly, NT and SDE need a truncation threshold θ on the degree, and we choose one from {2, 4, 8, …, D} randomly.

All experiments were conducted on a Linux server with a 24-core 2.2GHz Intel Xeon CPU and 256GB of memory. Each program was allowed to use at most 10 threads and we set a time limit of 6 hours for each run. Each experiment was repeated 100 times and we report the average running time. The errors are less stable due to the random noise, so we remove the best 20 and worst 20 runs, and report the average error of the remaining 60 runs. The failure probability β in R2T is set to 0.1. The default DP parameter is ε=0.8.

8.2 Experimental results.

The errors and running times of all mechanisms on the graph pattern-counting queries are shown in Table 2. These results indicate a clear superiority of R2T in terms of utility, offering order-of-magnitude improvements over the other methods in many cases. What is more desirable is its robustness: In all 20 query-dataset combinations, R2T consistently achieves an error below 20%, and below 10% in all but three cases. We also notice that, for a given query, R2T performs better on road networks than on social networks. This is because the error of R2T is proportional to DS_Q(I) by our theoretical analysis, so the relative error is proportional to DS_Q(I)/Q(I). Therefore, larger and sparser graphs, such as road networks, lead to smaller relative errors.

Table 2. Comparison between R2T, naive truncation with smooth sensitivity (NT), smooth distance estimator (SDE), LP-based mechanism (LP), and recursive mechanism (RM) on graph pattern-counting queries. Each cell shows relative error (%) / running time (s); the "Query result" rows show the true query answer and the time to evaluate it.

Query  Method        Deezer              Amazon1             Amazon2             RoadnetPA           RoadnetCA
Q1     Query result  847,000 / 1.28      900,000 / 1.52      926,000 / 1.62      1,540,000 / 1.51    2,770,000 / 2.64
       R2T           0.535 / 12.3        0.557 / 15.6        0.432 / 16.2        0.0114 / 26.8       0.00635 / 48.7
       NT            59.1 / 18.1         101 / 29.3          125 / 40.4          1,370 / 21.9        1,410 / 39.7
       SDE           548 / 9,870         363 / 4,570         286 / 1,130         55.2 / 105          81.8 / 292
       LP            14.3 / 16.9         5.72 / 14.7         6.75 / 14.4         3.62 / 8.3          3.02 / 54
Q2     Query result  21,800,000 / 13.8   9,120,000 / 11.8    9,750,000 / 13.8    3,390,000 / 6.39    6,000,000 / 6.06
       R2T           6.64 / 356          12.2 / 170          9.06 / 196          0.0539 / 80.2       0.0352 / 145
       NT            116 / 21.0          398 / 28.4          390 / 41.0          6,160 / 23.2        6,530 / 44.2
       SDE           8,900 / 9,870       5,110 / 4,570       1,930 / 1,130       211 / 104           228 / 296
       LP            35.9 / 8,820        23.2 / 3,600        27.8 / 461          11.1 / 148          13.3 / 404
Q△     Query result  794,000 / 4.53      718,000 / 5.03      667,000 / 4.20      67,200 / 2.96       121,000 / 5.17
       R2T           5.58 / 17.3         1.27 / 18.8         2.03 / 19.9         0.102 / 4.21        0.061 / 7.5
       NT            782 / 23.0          1,660 / 31.7        1,920 / 41.0        110,000 / 23.3      105,000 / 45.0
       SDE           67,300 / 9,880      26,000 / 4,570      9,600 / 1,130       4,150 / 106         3,830 / 297
       LP            24.6 / 131          12.8 / 18.2         14.2 / 18.3         0.104 / 3.95        0.0625 / 7.06
       RM            over time limit     over time limit     over time limit     0.0388 / 1,280      0.0193 / 2,550
Q□     Query result  11,900,000 / 74.3   2,480,000 / 21.6    3,130,000 / 15.6    158,000 / 4.50      262,000 / 10.1
       R2T           16.9 / 289          6.29 / 70.5         10.5 / 86.8         0.0729 / 8.18       0.0638 / 16.2
       NT            3,750 / 57.6        30,700 / 35.8       26,100 / 50.6       319,000 / 24.8      368,000 / 45.0
       SDE           6,970,000 / 9,930   11,400,000 / 4,580  202,000 / 1,140     10,300 / 108        9,130 / 300
       LP            92.6 / 2,530        94.8 / 70.4         77.8 / 81.2         0.223 / 7.83        0.165 / 14.2
       RM            over time limit     over time limit     over time limit     0.0217 / 10,500     over time limit

In terms of running time, all mechanisms are reasonable, except for RM and SDE. RM can only complete within the six-hour time limit in three cases, although it achieves very small errors in those cases. SDE is faster than RM but runs slower than the others. It is also interesting to see that R2T sometimes even runs faster than LP, despite the fact that R2T needs to solve O(log GS_Q) LPs. This is due to the early-stop optimization: The running time of R2T is determined by the LP that corresponds to the near-optimal τ, which often happens to be one of the LPs that can be solved fastest. We also conducted experiments on how the privacy parameter ε affects the various mechanisms; see the full version of the paper. The results show that R2T retains high utility even for small ε.

Selection of τ.  In the next set of experiments, we dive deeper and see how sensitive the utility is to the truncation threshold τ. We tested the queries on Amazon2 and measured the error of the LP-based mechanism20 with different τ. For each query, we tried various τ from 2 to GS_Q and compared the errors with R2T's. The results are shown in Table 3, where the optimal error is marked with an asterisk. The results indicate that the error is highly sensitive to τ, and, more importantly, the optimal choice of τ closely depends on the query; there is no fixed τ that works for all cases. On the other hand, the error of R2T is within a small constant factor (around 6) of the error under the optimal choice of τ, which is exactly what instance optimality promises.

Table 3. Errors of R2T and the LP-based mechanism (LP) with different τ on Amazon2 (* marks the optimal choice of τ for each query).

                       Q1         Q2           Q△           Q□
Query result           926,000    9,750,000    667,000      3,130,000
R2T                    4,000      883,000      13,500       328,000
LP, τ = GS_Q           1,440 *    1,580,000    1,290,000    1,370,000,000
LP, τ = GS_Q/8         2,100      181,000 *    157,000      140,000,000
LP, τ = GS_Q/64        110,000    259,000      15,100       25,800,000
LP, τ = GS_Q/512       645,000    1,260,000    2,790        2,630,000
LP, τ = GS_Q/4096      810,000    3,950,000    2,090 *      274,000
LP, τ = GS_Q/32768     911,000    7,580,000    92,300       48,700 *
LP, τ = GS_Q/262144    924,000    9,340,000    459,000      76,400
Average error          62,500     2,710,000    94,900       2,430,000

9. More Discussions

Following this work, there have been many efforts on query evaluation in relational databases under DP. For instance, Dong and Yi14 and Dong et al.8 improve the logarithmic factor in the error for self-join-free queries and self-join queries, while Fang et al.16 explore answering SPJA queries with max aggregation. In addition, Cai et al.5 and Dong et al.10 focus on answering multiple queries, while Dong et al.7,9 investigate SPJA queries over dynamic databases. For more details, please refer to a recent survey.13 Moreover, by integrating this work with Dong et al.10 and Fang et al.,16 we have developed a DP SQL system26 capable of answering a broad class of queries that include selection, projection, aggregation, join, and group-by operations.

Acknowledgement

This work has been supported by HKRGC under grants 16201318, 16201819, and 16205420; by NTU-NAP startup grant 024584-00001; by the National Science Foundation under grant 2016393; and by DARPA and SPAWAR under contract N66001-15-C-4067.

    References

    • 1. Abadi, M. et al. Deep learning with differential privacy. In Proceedings of the 2016 ACM Conf. on Computer and Communications Security (2016), 308–318.
    • 2. Amin, K., Kulesza, A., Munoz, A., and Vassilvitskii, S. Bounding user contributions: A bias-variance trade-off in differential privacy. In Intern. Conf. on Machine Learning, PMLR (2019), 263–271.
    • 3. Asi, H. and Duchi, J.C. Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. Advances in Neural Information Processing Systems 33 (2020).
    • 4. Blocki, J., Blum, A., Datta, A., and Sheffet, O. Differentially private data analysis of social networks via restricted sensitivity. In Proceedings of the 4th Conf. on Innovations in Theoretical Computer Science (2013), 87–96.
    • 5. Cai, K., Xiao, X., and Cormode, G. PrivLava: Synthesizing relational data with foreign keys under differential privacy. Proceedings of the ACM on Management of Data 1, 2 (2023), 1–25.
    • 6. Chen, S. and Zhou, S. Recursive mechanism: Towards node differential privacy and unrestricted joins. In Proceedings of the 2013 ACM SIGMOD Intern. Conf. on Management of Data (2013), 653–664.
    • 7. Dong, W. et al. Continual observation of joins under differential privacy. Proceedings of the ACM on Management of Data 2, 3 (2024), 1–27.
    • 8. Dong, W. et al. Instance-optimal truncation for differentially private query evaluation with foreign keys. ACM Transactions on Database Systems 49, 4 (2024), 1–40.
    • 9. Dong, W., Luo, Q., and Yi, K. Continual observation under user-level differential privacy. In 2023 IEEE Symp. on Security and Privacy (SP), IEEE (2023), 2190–2207.
    • 10. Dong, W., Sun, D., and Yi, K. Better than composition: How to answer multiple relational queries under differential privacy. Proceedings of the ACM on Management of Data 1, 2 (2023), 1–26.
    • 11. Dong, W. and Yi, K. Residual sensitivity for differentially private multi-way joins. In Proc. ACM SIGMOD Intern. Conf. on Management of Data (2021).
    • 12. Dong, W. and Yi, K. A nearly instance-optimal differentially private mechanism for conjunctive queries. In Proc. ACM Symp. on Principles of Database Systems (2022).
    • 13. Dong, W. and Yi, K. Query evaluation under differential privacy. ACM SIGMOD Record 52, 3 (2023), 6–17.
    • 14. Dong, W. and Yi, K. Universal private estimators. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symp. on Principles of Database Systems (2023), 195–206.
    • 15. Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9, 3–4 (2014), 211–407.
    • 16. Fang, J., Dong, W., and Yi, K. Shifted inverse: A general mechanism for monotonic functions under user differential privacy (2022).
    • 17. Huang, Z., Liang, Y., and Yi, K. Instance-optimal mean estimation under differential privacy. In NeurIPS (2021).
    • 18. Johnson, N., Near, J.P., and Song, D. Towards practical differential privacy for SQL queries. Proceedings of the VLDB Endowment 11, 5 (2018), 526–539.
    • 19. Karwa, V. and Vadhan, S. Finite sample differentially private confidence intervals. In Proceedings of the 9th Innovations in Theoretical Computer Science Conf., Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2018).
    • 20. Kasiviswanathan, S.P., Nissim, K., Raskhodnikova, S., and Smith, A. Analyzing graphs with node differential privacy. In Theory of Cryptography Conf., Springer (2013), 457–476.
    • 21. Kotsogiannis, I. et al. PrivateSQL: A differentially private SQL query engine. Proceedings of the VLDB Endowment 12, 11 (2019), 1371–1384.
    • 22. Leskovec, J. and Krevl, A. SNAP Datasets: Stanford large network dataset collection (2014); https://tinyurl.com/22cypyg3
    • 23. Nissim, K., Raskhodnikova, S., and Smith, A. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Ann. ACM Symp. on Theory of Computing (2007), 75–84.
    • 24. Tao, Y., He, X., Machanavajjhala, A., and Roy, S. Computing local sensitivities of counting queries with joins. In Proceedings of the 2020 ACM SIGMOD Intern. Conf. on Management of Data (2020), 479–494.
    • 25. Vadhan, S. The complexity of differential privacy. In Tutorials on the Foundations of Cryptography, Springer (2017), 347–450.
    • 26. Yu, J. et al. Dop-SQL: A general-purpose, high-utility, and extensible private SQL system. In Proc. Intern. Conf. on Very Large Data Bases (2024).
Footnotes

    • For most of the paper, we consider the case where there is only one primary private relation in R; the case with multiple primary private relations can be transformed into one with a single primary private relation (see the full version of the paper for more details).
    • The probability constant 2/3 can be changed to any constant larger than 1/2 without affecting the asymptotics.
    • The probability β only concerns the utility, not privacy.
    • log has base 2 and ln has base e.
