Abstract
Answering SPJA queries under differential privacy (DP), including graph-pattern counting under node-DP as an important special case, has received considerable attention in recent years. The dual challenge of foreign-key constraints and self-joins is particularly tricky to deal with, and no existing DP mechanisms can correctly handle both. For the special case of graph pattern counting under node-DP, the existing mechanisms are correct (that is, satisfy DP), but they either do not offer nontrivial utility guarantees or are very complicated and costly. In this paper, we propose the first DP mechanism for answering arbitrary SPJA queries in a database with foreign-key constraints. Meanwhile, it achieves a fairly strong notion of optimality, which can be considered as a small and natural relaxation of instance optimality. Finally, our mechanism is simple enough that it can be easily implemented on top of any RDBMS and an LP solver. Experimental results show that it offers order-of-magnitude improvements in terms of utility over existing techniques, even those specifically designed for graph pattern counting.
1. Introduction
Differential privacy (DP) has become the standard notion for private data release, due to its strong protection of individual information. Informally speaking, DP requires indistinguishability of the query results whether any particular individual’s data is in the database or not. The standard Laplace mechanism first finds GS_Q, the global sensitivity of the query, that is, how much the query result can change if an individual’s data is added to or removed from the database. Then it adds Laplace noise calibrated to GS_Q to the query result to mask this difference. However, this mechanism runs into issues in a relational database, as illustrated in the following example.
Example 1.1. Consider a simple join-counting query Q := |R1(x1_, …) ⋈ R2(x1, x2, …)|.
Here, the underlined attribute x1_ is the primary key (PK), while R2.x1 is a foreign key (FK) referencing R1.x1. For instance, R1 may store customer information where x1 is the customer ID and R2 stores the orders the customers have placed. Then this query simply returns the total number of orders; more meaningful queries could be formed with some predicates, for example, all customers from a certain region and/or orders in a certain category. Furthermore, suppose the customers, namely, the tuples in R1, are the entities whose privacy we aim to protect.
What is GS_Q for this query? It is, unfortunately, ∞. This is because a customer could, in theory, have an unbounded number of orders, and adding such a customer to the database can cause an unbounded change in the query result. A simple fix is to assume a finite GS_Q, which can be justified in practice because we may never have a customer with, say, more than a million orders. However, as assuming such a GS_Q limits the allowable database instances, one tends to be conservative and set a large GS_Q. This allows the Laplace mechanism to work, but adding noise of this scale clearly eliminates any utility of the released query answer.
1.1 The truncation mechanism.
The issue above was first identified by Kotsogiannis et al.,21 who also formalized the DP policy for relational databases with foreign key (FK) constraints. The essence of their model (a rigorous definition is given in Section 2) is that individuals and their private data are stored in separate relations linked by FKs. This is perhaps the most crucial feature of the relational model, yet it causes a major difficulty in designing DP mechanisms as illustrated above. Their solution is the truncation mechanism, which simply deletes all customers with more than τ orders before applying the Laplace mechanism, for some threshold τ. After truncation, the query has sensitivity τ, so adding noise of scale τ is sufficient.
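To make the mechanics concrete, here is a minimal sketch of naive truncation followed by the Laplace mechanism for the self-join-free query in Example 1.1. It assumes the per-customer order counts have already been computed; the function name and the toy data are illustrative, not part of the authors' implementation.

```python
import numpy as np

def truncated_count(orders_per_customer, tau, eps):
    # Naive truncation: drop every customer with more than tau orders.
    # After truncation, adding/removing one customer changes the result
    # by at most tau, so Laplace noise of scale tau/eps suffices for eps-DP.
    kept_total = sum(c for c in orders_per_customer if c <= tau)
    return kept_total + np.random.laplace(scale=tau / eps)

# Hypothetical per-customer order counts.
print(truncated_count([3, 1, 7, 2, 120, 4], tau=10, eps=1.0))
```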
Truncation is a special case of Lipschitz extensions and has been studied extensively for graph pattern-counting queries20 and machine learning (ML).1 A well-known issue for the truncation mechanism is the bias-variance trade-off: In one extreme, τ = GS_Q, and it degenerates into the naive Laplace mechanism with a large noise (that is, large variance). In the other extreme, τ = 0, and the truncation introduces a bias as large as the query answer. The issue of how to choose a near-optimal τ has been extensively studied in the statistics and ML communities.2,17 In fact, the particular query in Example 1.1 is equivalent to the 1-dimensional mean (sum) estimation problem, which is important for many ML tasks. A key challenge there is that the selection of τ must also be done in a DP manner.
1.2 The issue with self-joins.
While self-join-free queries are equivalent to mean (sum) estimation (see Section 3 for a more formal statement), self-joins introduce another challenge unique to relational queries. In particular, all techniques from the statistics and machine-learning literature for choosing a τ critically rely on the fact that the individuals are independent, that is, adding/removing one individual does not affect the data associated with another, which is not true when the query involves self-joins. In fact, when there are self-joins, even the truncation mechanism itself fails, as illustrated in the example below.
Example 1.2. Suppose we extend the query from Example 1.1 to the following one with a self-join: Q := |R1(x1_, …) ⋈ R1(y1_, …) ⋈ R2(x1, y1, x2, …)|.
Note that the PK of R1 has been renamed differently in the two logical copies of R1, so that they join with different attributes of R2. For instance, R2 may store the transactions between pairs of customers, and this query would count the total number of transactions. Again, predicates can be added to make the query more meaningful.
Let G be an undirected τ-regular graph (that is, every vertex has degree τ) with n vertices. We will construct an instance I = (I1, I2) on which the truncation mechanism fails. Let I1 be the vertices of G and let I2 be the edges (each edge appears twice since G is undirected). Thus, Q simply returns twice the number of edges in the graph. Let I′ be the neighboring instance corresponding to the graph G′ obtained from G by adding a vertex v that connects to every existing vertex. Note that in G′, v has degree n while every other vertex has degree τ+1. Now truncating with threshold τ fails DP: The query answer on I is nτ, and that on I′ is 0 (all vertices are truncated). Adding noise of scale τ cannot mask this gap, violating the DP definition.
The reason why the truncation mechanism fails is that the independence claim above does not hold in the presence of self-joins. More fundamentally, this is due to the correlation among the individuals introduced by self-joins. In the example above, we see that the addition of one node may cause the degrees of many others to increase. For the problem of graph pattern counting under node-DP, which can be formulated as a multi-way self-join counting query on the special schema R = {Node(ID_), Edge(src, dst)}, Kasiviswanathan et al.20 propose an LP-based truncation mechanism (to differentiate, we call the mechanism above naive truncation) to fix the issue, but they do not study how to choose τ. As a result, while their mechanism satisfies DP, there is no optimality guarantee in terms of utility. In fact, if τ is chosen inappropriately, their error can be even larger than GS_Q, namely, worse than the naive Laplace mechanism.
1.3 Our contributions.
In this paper, we start by studying how to choose a near-optimal τ in a DP manner in the presence of self-joins. As with all prior τ-selection mechanisms over mean (sum) estimation2,17 and self-join-free queries,24 we assume that the global sensitivity of the given query Q is bounded by GS_Q. Since one tends to set a large GS_Q, as argued in Example 1.1, we must try to minimize the dependency on GS_Q.
The first contribution of this paper (Section 4) is a simple and general DP mechanism, called Race-to-the-Top (R2T), which can be used to adaptively choose τ in combination with any valid DP truncation mechanism that satisfies certain properties. In fact, it does not choose a τ per se; instead, it directly returns a privatized query answer whose error is at most O(log(GS_Q)·loglog(GS_Q)·DS_Q(I)) on any instance I with constant probability. While we defer the formal definition of DS_Q(I) to Section 3, what we can show is that it is a per-instance lower bound, that is, any valid DP mechanism has to incur error Ω(DS_Q(I)) on I (in a certain sense). Thus, the error of R2T is instance-optimal up to logarithmic factors in GS_Q. Furthermore, a logarithmic dependency on GS_Q is also unavoidable,19 even for the mean estimation problem, that is, the simple self-join-free query in Example 1.1. In practice, these log factors are usually between 10 and 100, and our experiments show that R2T has better utility than previous methods in most cases.
However, as we see in Example 1.2, naive truncation is not a valid DP mechanism in the presence of self-joins. As our second contribution (Section 5), we extend the LP-based mechanism of Kasiviswanathan et al.,20 which only works for graph pattern-counting queries, to general queries on an arbitrary relational schema that use the four basic relational operators: Selection (with arbitrary predicates), Projection, Join (including self-join), and sum Aggregation. When plugged into R2T, this yields the first DP mechanism for answering arbitrary SPJA queries in a database with FK constraints. For SJA queries, the utility is instance-optimal, while the optimality guarantee for SPJA queries is slightly weaker, but we argue that this is unavoidable.
Furthermore, the simplicity of our mechanism allows it to be built on top of any RDBMS and an LP solver. To demonstrate its practicality, we built a system prototype (Section 7) using PostgreSQL and CPLEX. Experimental results (Section 8) show it can provide order-of-magnitude improvements in terms of utility over the state-of-the-art DP-SQL engines. We obtain similar improvements even over node-DP mechanisms specifically designed for graph pattern-counting problems, which are just special SJA queries.
2. Preliminaries
2.1 Database queries.
Let R be a database schema. We start with a multi-way join:

J := R1(x1) ⋈ R2(x2) ⋈ ⋯ ⋈ Rn(xn),    (1)

where R1, …, Rn are relation names in R and each xi is a set of variables. When considering self-joins, there can be repeats, that is, Ri = Rj for some i ≠ j; in this case, we must have xi ≠ xj, or one of the two atoms would be redundant. Let var(J) := x1 ∪ ⋯ ∪ xn.
Let I be a database instance over R. For any Ri, denote the corresponding relation instance in I as I(Ri). This is a physical relation instance of Ri. We use I(Ri(xi)) to denote I(Ri) after renaming its attributes to xi, which is also called a logical relation instance of Ri. When there are self-joins, one physical relation instance may have multiple logical relation instances; they have the same rows but with different column (variable) names.
A JA or an SJA query Q aggregates over the join results J(I). More abstractly, let ψ be a function that assigns a non-negative integer weight ψ(q) to every q ∈ dom(var(J)), where dom(var(J)) denotes the domain of var(J). The result of evaluating Q on I is
Q(I) := Σ_{q ∈ J(I)} ψ(q).    (2)

Note that the function ψ only depends on the query. For a counting query, ψ(q) = 1 for every join result q; for an aggregation query, for example, SUM(A), ψ(q) is the value of attribute A in q. An SJA query with an arbitrary predicate over var(J) can also be easily incorporated into this formulation: If some q does not satisfy the predicate, we simply set ψ(q) = 0.
Example 2.1. Graph pattern-counting queries can be formulated as SJA queries. Suppose we store a graph in a relational database using the schema R = {Node(ID_), Edge(src, dst)}, where src and dst are FKs referencing Node.ID. Then the number of length-3 paths can be counted by first computing the join

Edge(x1, x2) ⋈ Edge(x2, x3) ⋈ Edge(x3, x4)
followed by a count aggregation. Note that this also counts triangles and non-simple paths (for example, v1–v2–v1–v2), which may or may not be considered as length-3 paths depending on the application. If not, they can be excluded by introducing a predicate (that is, redefining ψ) that requires x1, x2, x3, x4 to be distinct where needed. If the graph is undirected, then the query counts every path twice, so we should divide the answer by 2. Alternatively, we may introduce a predicate such as x1 < x4 to eliminate the double counting.
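As a toy illustration of this formulation (not code from the paper), the following sketch enumerates the join results of the length-3 path query over a small, hypothetical directed edge list and applies a predicate that zeroes out triangles and non-simple paths.

```python
from itertools import product

# Hypothetical directed edge list standing in for the Edge relation.
edges = [(1, 2), (2, 3), (3, 4), (2, 4)]

count = 0
for e1, e2, e3 in product(edges, repeat=3):
    x1, x2 = e1
    # Join condition of Edge(x1,x2) joined with Edge(x2,x3) and Edge(x3,x4).
    if e2[0] != x2 or e3[0] != e2[1]:
        continue
    x3, x4 = e2[1], e3[1]
    # Predicate (redefining psi): skip triangles and non-simple paths.
    if x1 == x3 or x2 == x4 or x1 == x4:
        continue
    count += 1  # psi(q) = 1 for every remaining join result
print(count)  # number of simple length-3 paths in the toy graph
```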
2.2 DP in relational databases with FK constraints.
We adopt the DP policy of Kotsogiannis et al.,21 which defines neighboring instances by taking FK constraints into consideration. We model all the FK relationships as a directed acyclic graph (DAG) over R by adding a directed edge from Ri to Rj if Ri has an FK referencing the PK of Rj. There is a designated primary private relation R_P, and any relation that has a direct or indirect FK referencing R_P is called a secondary private relation. The referencing relationship over the tuples is defined recursively as follows: (1) any tuple is said to reference itself; (2) for t1 ∈ I(Ri) and t2 ∈ I(Rj), if t1 references t, Rj has an FK referencing the PK of Ri, and the FK of t2 equals the PK of t1, then we say that t2 also references t. Then two instances I and I′ are considered neighbors if I′ can be obtained from I by deleting a set of tuples, all of which reference the same tuple t_P ∈ I(R_P), or vice versa. In particular, t_P itself may also be deleted, in which case all tuples referencing t_P must be deleted in order to preserve the FK constraints. Finally, for a join result q ∈ J(I), we say that q references t_P if at least one of the tuples comprising q references t_P.
We use the notation I ∼ I′ to denote two neighboring instances, and I ∼_t I′ denotes that all tuples in the difference between I and I′ reference the tuple t.
Consider the (simplified) TPC-H schema with relations Customer(custkey, …), Orders(orderkey, custkey, …), Lineitem(orderkey, partkey, …), and Part(partkey, …), where Orders references Customer, and Lineitem references Orders and Part.
If the customers are the individuals whose privacy we wish to protect, then we designate Customer as the primary private relation, which implies that Orders and Lineitem will be secondary private relations, while Part will be public. Note that once Customer is designated as the primary private relation, the information in Orders and Lineitem is also protected, since the privacy requirement induced by Customer is stronger than that induced by Orders or Lineitem. Alternatively, one may designate Orders as the primary private relation, which implies that Lineitem will be a secondary private relation, while Customer and Part will be public. This would result in weaker privacy protection but offer higher utility.
Some queries, as given, may be incomplete, that is, they have a variable that is an FK but whose referenced PK does not appear in the query. The query in Example 2.1 is such an example. Following Kotsogiannis et al.,21 we always make the query complete by iteratively adding those relations whose PKs are referenced. The PKs are given variable names matching the FKs. For example, for the query in Example 2.1, we add Node(x1), Node(x2), Node(x3), and Node(x4).
The DP policy above incorporates both edge-DP and node-DP, two commonly used DP policies for private graph analysis, as special cases. In Example 2.1, by designating Edge as the primary private relation (Node is thus public, and we may even assume it contains all possible vertex IDs), we obtain edge-DP; for node-DP, we add FK constraints from Edge.src and Edge.dst to Node.ID, and designate Node as the primary private relation, so that Edge becomes a secondary private relation.
A mechanism M is ε-DP if for any neighboring instances I ∼ I′ and any output y, we have

Pr[M(I) = y] ≤ e^ε · Pr[M(I′) = y].
Typical values of ε used in practice range from 0.1 to 10, where a smaller ε corresponds to stronger privacy protection.
3. Instance Optimality of DP Mechanisms with FK Constraints
Global sensitivity and worst-case optimality. The standard DP mechanism is the Laplace mechanism,15 which adds Lap(GS_Q/ε) to the query answer. Here, Lap(λ) denotes a random variable drawn from the Laplace distribution with scale λ, and GS_Q := max_{I ∼ I′} |Q(I) − Q(I′)| is the global sensitivity of Q. However, either a join or a sum aggregation makes GS_Q unbounded. The issue with the former is illustrated in Example 1.1, where a customer may have an unbounded number of orders; a sum aggregation over an attribute with an unbounded domain results in the same situation. Thus, as with prior work,2,17,24 we restrict to a set of instances such that
max_{I ∼ I′} |Q(I) − Q(I′)| ≤ GS_Q,    (3)

where GS_Q is a parameter given in advance. For the query in Example 1.1, this is equivalent to assuming that a customer is allowed to have at most GS_Q orders in any instance.
For general queries, the situation is more complicated. We first consider SJA queries. Given an instance I and an SJA query Q, for a tuple t ∈ I(R_P), its sensitivity is
S_Q(t, I) := Σ_{q ∈ J(I)} ψ(q) · 1[q references t],    (4)

where 1[·] is the indicator function. For SJA queries, (3) is equivalent to

max_I max_{t ∈ I(R_P)} S_Q(t, I) ≤ GS_Q,

where the outer max is over all allowable instances.
For self-join-free SJA queries, it is clear that

Q(I) = Σ_{t ∈ I(R_P)} S_Q(t, I),
which turns the problem into a sum estimation problem. However, when self-joins are present, this equality no longer holds, since one join result may reference multiple tuples in I(R_P). This also implies that removing one tuple from I(R_P) may affect the sensitivities S_Q(t, I) of multiple other tuples, making the neighboring relationship more complicated than in the sum estimation problem, where two neighboring instances differ by only one datum.2,17
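As a quick illustration of how this equality breaks, consider the self-join query of Example 1.2 on the graph instance constructed there, where each undirected edge is stored twice in R2. Every join result (a transaction) references both of its endpoints, so S_Q(v, I) = 2·deg(v) for each customer v, and therefore

Σ_v S_Q(v, I) = 4|E| = 2·Q(I),

twice the true answer.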
What notion of optimality shall we use for DP mechanisms over SJA queries? The traditional worst-case optimality is meaningless, since the naive Laplace mechanism that adds noise of scale GS_Q/ε is already worst-case optimal, just by the definition of GS_Q. In fact, the basis of the entire line of work on the truncation mechanism and smooth sensitivity is the observation that typical instances should be much easier than the worst case, so these mechanisms all add instance-specific noise, which is often much smaller than the worst-case noise level GS_Q/ε.
Instance optimality. The standard notion of optimality for measuring the performance of an algorithm on a per-instance basis is instance optimality. More precisely, let ℳ be the class of ε-DP mechanisms, and let

L(I, γ) := min_{M′ ∈ ℳ} min{ ξ : Pr[ |M′(I) − Q(I)| ≤ ξ ] ≥ 1 − γ }

be the lower bound any M′ ∈ ℳ can achieve (with probability 1 − γ) on I; then the standard definition of instance optimality requires us to design an M such that

|M(I) − Q(I)| ≤ c · L(I, γ) with probability at least 1 − γ    (5)

for every instance I, where c is called the optimality ratio. Unfortunately, for any I, one can design a trivial M′ ∈ ℳ that has 0 error on I (but fails miserably on other instances), so L(I, γ) = 0, which rules out instance-optimal DP mechanisms by a standard argument.15
To avoid such a trivial M′,3,12 consider a relaxed version of instance optimality where we compare against any M′ that is required to work well not just on I, but also on its neighbors; that is, we raise the target error from L(I, γ) to

max_{I′ : I′ ∼ I} L(I′, γ).

Vadhan25 observes that this relaxed target is Ω(LS_Q(I)), where

LS_Q(I) := max_{I′ : I′ ∼ I} |Q(I) − Q(I′)|

is the local sensitivity of Q at I. This instance optimality has been used for certain ML problems3 and conjunctive queries without FKs.12 However, it has an issue for SJA queries in a database with FK constraints: For any I, we can add a tuple t_P to I(R_P), together with GS_Q tuples in the secondary private relations all referencing t_P, obtaining a neighboring instance I′ with |Q(I′) − Q(I)| = GS_Q, that is, LS_Q(I) = GS_Q. This means that this relaxed instance optimality degenerates into worst-case optimality. This is also why smooth sensitivity, including all its efficiently computable versions,11,12,18,23 will not have better utility than the naive Laplace mechanism on databases with FK constraints, since these noise levels are all no lower than the local sensitivity.
The reason why the above relaxation is “too much” is that we require M′ to work well on every neighbor of I. Under the neighborhood definition with FK constraints, this means that I′ can be any instance obtained from I by adding a tuple t_P and arbitrarily many tuples referencing t_P in the secondary private relations. This is too high a requirement for M′, and hence too low an optimality notion for M.
To address the issue, Huang et al.17 restrict the neighborhood in which M′ is required to work well, but their definition only works for the mean estimation problem. For SJA queries under FK constraints, we revise the target error to

max{ L(I′, γ) : I′ = I or I′ is a down-neighbor of I },
namely, we require M′ to work well only on I and its down-neighbors, which can be obtained from I only by removing a tuple t_P already in I(R_P) together with all tuples referencing t_P. Correspondingly, an instance-optimal M (with regard to the down-neighborhood) is one such that (5) holds with L(I, γ) replaced by this revised target.
Clearly, the smaller the neighborhood, the stronger the optimality notion. Our instance-optimality notion is thus stronger than those in prior work.3,12,17 Note that for such an instance-optimal M (by our definition), there may still exist an M′ that does better than M on some instance I, but if this happens, M′ must do worse on one of the down-neighbors of I, which is as typical an instance as I itself.
Using the same argument from Vadhan,25 we have that this revised target is Ω(DS_Q(I)), where

DS_Q(I) := max{ |Q(I) − Q(I′)| : I′ is a down-neighbor of I }    (6)

is the downward local sensitivity of Q at I. Thus, DS_Q(I) is a per-instance lower bound, which can be used to replace L(I, γ) in (5) in the definition of instance-optimal DP mechanisms.
4. R2T: Instance-Optimal Truncation
Our instance-optimal truncation mechanism, Race-to-the-Top (R2T), can be used in combination with any truncation method Q(I, τ), which is a function of the instance I and a truncation threshold τ with the following properties:
(1) For any τ, the global sensitivity of Q(·, τ) is at most τ.
(2) For any I and τ, Q(I, τ) ≤ Q(I).
(3) For any I, there exists a non-negative integer τ*(I) such that Q(I, τ) = Q(I) for any τ ≥ τ*(I).
We describe various choices for Q(I, τ), depending on the DP policy and whether the query contains self-joins and/or projections, in the subsequent sections. Intuitively, such a Q(I, τ) gives a stable (property (1)) underestimate (property (2)) of Q(I), and it reaches Q(I) for a sufficiently large τ (property (3)). Note that Q(I, τ) itself is not DP. To make it DP, we can add Lap(τ/ε), which would turn it into an ε-DP mechanism by property (1). The issue, of course, is how to set τ. The basic idea of R2T is to try geometrically increasing values of τ and somehow pick the “winner” of the race.
Assuming such a Q(I, τ), R2T works as follows. For a failure probability β, we first compute

Q̃(I, τ^(j)) := Q(I, τ^(j)) + Lap( log(GS_Q) · τ^(j) / ε ) − log(GS_Q) · ln( log(GS_Q)/β ) · τ^(j) / ε    (7)

for τ^(j) = 2^j, j = 1, …, log(GS_Q). Then R2T outputs

Q̃(I) := max_j Q̃(I, τ^(j)).    (8)

The privacy of R2T is straightforward: Since Q(·, τ^(j)) has global sensitivity at most τ^(j), and the third term of (7) is independent of I, each Q̃(I, τ^(j)) satisfies (ε/log(GS_Q))-DP by the standard Laplace mechanism. Collectively, all the Q̃(I, τ^(j))'s satisfy ε-DP by the basic composition theorem.15 Finally, returning the maximum preserves DP by the post-processing property of DP.
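For concreteness, the following is a minimal Python sketch of the computation in (7) and (8). It assumes GS_Q is a power of 2 and that trunc(tau) implements some truncation method Q(I, τ) with the three properties above (for example, the LP-based one of Section 5); the calibration follows our reading of (7), and the authors' actual implementation may differ in details.

```python
import math
import numpy as np

def r2t(trunc, GSQ, eps, beta):
    # trunc(tau) returns Q(I, tau), whose global sensitivity is at most tau (property (1)).
    k = int(math.log2(GSQ))            # number of candidate thresholds tau = 2, 4, ..., GSQ
    estimates = []
    for j in range(1, k + 1):
        tau = 2 ** j
        noise = np.random.laplace(scale=k * tau / eps)     # Lap(log(GSQ) * tau / eps) in (7)
        penalty = (k * tau / eps) * math.log(k / beta)     # downward shift in (7)
        estimates.append(trunc(tau) + noise - penalty)
    return max(estimates)                                  # equation (8): the race's winner
```

Each of the log(GS_Q) noisy estimates consumes an ε/log(GS_Q) share of the privacy budget, matching the composition argument above.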
Utility analysis. For some intuition on why R2T offers good utility, see Figure 1. By properties (2) and (3), as we increase τ, Q(I, τ) gradually approaches the true answer Q(I) from below and reaches it when τ ≥ τ*(I). However, we cannot use Q(I, τ) or τ*(I) directly, as this would violate DP. Instead, we only get to see Q̃(I, τ), which is Q(I, τ) masked with noise of scale proportional to τ. We thus face a dilemma: The closer we get to Q(I), the more uncertain we are about the estimate. To get out of the dilemma, we shift each estimate down by an amount that equals the scale of its noise (ignoring the ln(log(GS_Q)/β) factor). This penalty for τ^(j*), where τ^(j*) is the smallest power of 2 above τ*(I), will be on the same order as τ*(I), so it will not affect the error by more than a constant factor, while taking the maximum ensures that the winner is at least as good as Q̃(I, τ^(j*)). Meanwhile, the extra ln(log(GS_Q)/β) factor ensures that, with high probability, no Q̃(I, τ^(j)) overshoots the target Q(I). Below, we formalize the intuition.
Theorem 1. On any instance I, with probability at least 1 − β, we have

Q(I) − O( log(GS_Q) · log(log(GS_Q)/β) · τ*(I) / ε ) ≤ Q̃(I) ≤ Q(I).
5. Truncation for SJA Queries
In this section, we design a truncation method Q(I, τ) with τ*(I) = DS_Q(I) for SJA queries. Plugged into Theorem 1 with τ*(I) = DS_Q(I) and the definition of instance optimality, this turns R2T into an instance-optimal DP mechanism with an optimality ratio of O(log(GS_Q) · loglog(GS_Q)).
For self-join-free SJA queries, each join result references only one tuple in I(R_P). Thus, the tuples in I(R_P) are independent, that is, removing one does not affect the sensitivities of others. This means that naive truncation (that is, removing all tuples t with S_Q(t, I) > τ and then summing up the rest) is a valid Q(I, τ) that satisfies the three properties required by R2T with τ*(I) = DS_Q(I).
When there are self-joins, naive truncation does not satisfy property (1), as illustrated in Example 1.2, where all the S_Q(t, I)'s in two neighboring instances may differ. Below we generalize the LP-based mechanism for graph pattern counting20 to arbitrary SJA queries, and show that it satisfies the three properties with τ*(I) = DS_Q(I).
Given an SJA query Q and an instance I, recall that Q(I) = Σ_{q ∈ J(I)} ψ(q), where J(I) is the set of join results. For j = 1, …, |J(I)|, let q_j be the jth join result. For i = 1, …, |I(R_P)|, let t_i be the ith tuple in I(R_P). We use C_i to denote (the indices of) the set of join results that reference t_i. More precisely,
C_i := { j : q_j references t_i }.    (9)

For each j, introduce a variable u_j, which represents the weight assigned to the join result q_j. We return the optimal value of the following LP as Q(I, τ):

maximize Σ_j u_j
subject to Σ_{j ∈ C_i} u_j ≤ τ for all i,
0 ≤ u_j ≤ ψ(q_j) for all j.
For SJA queries, the Q(I, τ) defined above satisfies the three properties required by R2T with τ*(I) = DS_Q(I).
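Here is a minimal sketch of this LP, solved with scipy.optimize.linprog rather than CPLEX (which Section 7 uses); psi holds the weights ψ(q_j), C holds the index sets of (9), and the toy instance at the end is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def lp_truncate(psi, C, tau):
    # Maximize sum_j u_j subject to 0 <= u_j <= psi[j] and, for every private
    # tuple t_i, sum_{j in C[i]} u_j <= tau.  linprog minimizes, so negate.
    m = len(psi)
    A_ub = np.zeros((len(C), m))
    for i, idx in enumerate(C):
        A_ub[i, list(idx)] = 1.0
    res = linprog(c=-np.ones(m), A_ub=A_ub, b_ub=np.full(len(C), float(tau)),
                  bounds=[(0.0, float(w)) for w in psi], method="highs")
    return -res.fun

# Toy instance: a 3-star whose center is tuple 0 and whose leaves are tuples 1-3;
# each of the three edges references the center and one leaf.
psi = [1, 1, 1]
C = [{0, 1, 2}, {0}, {1}, {2}]
print(lp_truncate(psi, C, tau=2))  # 2.0: the center's constraint caps the total
```

Any off-the-shelf LP solver can be substituted here; the system prototype in Section 7 uses CPLEX.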
We now give a step-by-step example to show how this truncation method works together with R2T. Consider the problem of edge counting under node-DP, which corresponds to the SJA query

Q := |Node(x1) ⋈ Edge(x1, x2) ⋈ Node(x2)|

on the graph data schema introduced in Example 2.1. Note that in SQL, the query would be written as

SELECT count(*) FROM Edge;
Suppose we set GS_Q = 256. For this particular Q, this means the maximum degree of any node in any instance is 256. We also fix the privacy parameter ε and the failure probability β.
Now, suppose we are given an instance I containing 8,103 nodes, which form 1,000 triangles, 1,000 4-cliques, 100 8-stars, 10 16-stars, and one 32-star, as shown in Figure 2. The true query result is

Q(I) = 1,000 × 3 + 1,000 × 6 + 100 × 8 + 10 × 16 + 32 = 9,992.
We run R2T with τ^(j) = 2^j for j = 1, …, 8. For each τ^(j), the LP assigns a weight to each join result (that is, an edge) that satisfies the de-duplication predicate, so that each undirected edge is counted once. To calculate Q(I, τ^(j)), we can consider the LP on each clique/star separately. For a triangle, the optimal LP solution always assigns 1 to each edge (recall τ ≥ 2). For each 4-clique, it assigns 2/3 to each edge for τ = 2, and 1 for τ ≥ 4. For each k-star, the LP optimum is min(k, τ). Thus, the optimal LP solutions are

Q(I, 2) = 7,222, Q(I, 4) = 9,444, Q(I, 8) = 9,888, Q(I, 16) = 9,976.

In addition, we have τ*(I) = DS_Q(I) = 32 and Q(I, τ^(j)) = 9,992 for j ≥ 5.
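As a sanity check, the per-structure optima above can be combined mechanically; the short computation below reproduces the Q(I, τ^(j)) values listed, under the same convention that each undirected edge contributes one join result. This is just arithmetic, not part of the system.

```python
def Q_trunc(tau):  # tau is a power of 2, tau >= 2
    triangles = 1000 * 3                       # each triangle edge gets weight 1
    cliques   = 1000 * (4 if tau == 2 else 6)  # a 4-clique totals 6 * 2/3 = 4 at tau = 2
    stars     = sum(cnt * min(k, tau) for cnt, k in [(100, 8), (10, 16), (1, 32)])
    return triangles + cliques + stars

print([Q_trunc(2 ** j) for j in range(1, 9)])
# [7222, 9444, 9888, 9976, 9992, 9992, 9992, 9992]
```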
Then, let us see how to run R2T with these Q(I, τ^(j))'s. Fix ε and β, and, for convenience, assume the Laplace noise returns −1 and +1 by turns. Plugging these into (7) yields the shifted noisy estimates Q̃(I, τ^(j)), and finally, with (8), we output the largest of them as Q̃(I).
6. Truncation for SPJA Queries
A projection reduces the query answer and hence its sensitivity, so it requires less noise. However, it makes achieving instance optimality harder: Even in simple cases, it is impossible to achieve error f(DS_Q(I)) at each instance I for any function f. To address this issue, we propose a truncation method for SPJA queries whose error depends on another instance-specific quantity. Please see the full version of the paper for more details.
7. System Implementation
Based on the R2T algorithm, we have implemented a system on top of PostgreSQL and CPLEX. The system structure is shown in Figure 3. The input to our system is any SPJA query written in SQL, together with a designated primary private relation (interestingly, while R2T satisfies the DP policy with FK constraints, the algorithm itself does not need to know the PK-FK constraints).
The system supports count and sum aggregations. Our SQL parser first unpacks the aggregation into a reporting query so as to find ψ(q) for each join result q, as well as the keys of the primary private relation, which give the referencing relationships between the tuples in I(R_P) and the join results.
Suppose we use the TPC-H schema (shown in Figure 4), where we designate Supplier and Customer as primary private relations, and consider an SJA query that joins these relations with Orders and Lineitem and aggregates over the join results. We rewrite it into a reporting query that returns one row per join result: one output column gives all the ψ(q) values, while the supplier-key and customer-key columns yield the referencing relationships from each supplier and customer to all the join results they contribute to.
We execute the rewritten query in PostgreSQL, and export the query results to a file. Then, an external program is invoked to construct the LPs from the query results, which are then solved by CPLEX. Finally, we use R2T to compute a privatized output.
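To illustrate the pipeline, here is a hedged sketch of how an external Python program could consume such a reporting query and assemble the LP inputs of Section 5. The SQL text, column names, and the choice of Customer as the only primary private relation are illustrative assumptions (the paper's example uses both Supplier and Customer); only standard psycopg2 calls are used.

```python
import psycopg2
from collections import defaultdict

# Hypothetical reporting query: one row per join result, carrying its weight psi(q)
# and the key of the primary private tuple (here, the customer) it references.
REPORTING_SQL = """
SELECT o_custkey, l_extendedprice * (1 - l_discount) AS psi
FROM customer, orders, lineitem
WHERE c_custkey = o_custkey AND o_orderkey = l_orderkey;
"""

conn = psycopg2.connect("dbname=tpch")
cur = conn.cursor()
cur.execute(REPORTING_SQL)

psi, refs = [], defaultdict(set)
for j, (custkey, weight) in enumerate(cur.fetchall()):
    psi.append(float(weight))   # psi(q_j)
    refs[custkey].add(j)        # join result q_j references this customer

C = list(refs.values())         # the index sets of (9), ready for the LP and R2T
```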
The computation bottleneck is solving the LPs, each of which contains |J(I)| variables and |I(R_P)| constraints. This takes polynomial time but can still be very expensive in practice. One immediate optimization is to solve them in parallel; we present another effective technique to speed up the process in the full version of the paper.
8. Experiments
We conducted experiments on graph pattern-counting queries under node-DP, an important special case of SPJA queries with FK constraints. Here, we compare R2T with naive truncation with smooth sensitivity (NT),20 the smooth distance estimator (SDE),4 the recursive mechanism (RM),6 and the LP-based mechanism (LP).20 We also conducted experiments on general SPJA queries to compare R2T with the local-sensitivity-based mechanism (LS).24 Those experiments show that R2T achieves order-of-magnitude improvements over LS in terms of utility, with similar running times; they are covered in the full version of the paper.
8.1 Setup.
For graph pattern-counting queries, we used four queries: edge counting, length-2 path counting, triangle counting, and rectangle counting. We used five real-world network datasets: Deezer, Amazon1, Amazon2, RoadNet-PA, and RoadNet-CA. Deezer collects the friendships of users from the music-streaming service Deezer. Amazon1 and Amazon2 are two Amazon co-purchasing networks. RoadNet-PA and RoadNet-CA are road networks of Pennsylvania and California, respectively. All these datasets were obtained from SNAP.22 Table 1 shows the basic statistics of these datasets.
Dataset | Deezer | Amazon1 | Amazon2 | RoadNet-PA | RoadNet-CA |
---|---|---|---|---|---|
Nodes | 144,000 | 262,000 | 335,000 | 1,090,000 | 1,970,000 |
Edges | 847,000 | 900,000 | 926,000 | 1,540,000 | 2,770,000 |
Maximum degree | 420 | 420 | 549 | 9 | 12 |
Degree bound | 1,024 | 1,024 | 1,024 | 16 | 16 |
Most algorithms need to assume a GS_Q in advance. Note that the value of GS_Q should not depend on the instance, but may use some background knowledge about a particular class of instances. Thus, for the three social networks, we set a degree upper bound of D = 1,024, while for the two road networks, we set D = 16. Then we set GS_Q as the maximum number of graph patterns that can contain any node under this degree bound, which is D for edge counting and a correspondingly larger polynomial in D for the other three queries.
The LP mechanism requires a truncation threshold τ, but Kasiviswanathan et al.20 do not discuss how it should be set. Initially, we used a random threshold uniformly chosen from [1, GS_Q]. This turned out to be very bad: with constant probability, the picked threshold is Ω(GS_Q), which makes these mechanisms as bad as the naive mechanism that adds noise of scale GS_Q. To achieve better results, as in R2T, we consider the powers of 2, that is, τ ∈ {2, 4, …, GS_Q}, as the possible choices. Similarly, NT and SDE need a truncation threshold on the degree, and we choose one from {2, 4, …, D} randomly.
All experiments were conducted on a Linux server with a 24-core 2.2GHz Intel Xeon CPU and 256GB of memory. Each program was allowed to use at most 10 threads, and we set a time limit of 6 hours for each run. Each experiment was repeated 100 times and we report the average running time. The errors are less stable due to the random noise, so we remove the best 20 and worst 20 runs, and report the average error of the remaining 60 runs. The failure probability β in R2T is set to 0.1. The default privacy parameter is ε = 0.8.
8.2 Experimental results.
The errors and running times of all mechanisms over the graph pattern-counting queries are shown in Table 2. These results indicate a clear superiority of R2T in terms of utility, offering order-of-magnitude improvements over other methods in many cases. What is more desirable is its robustness: In all 20 query-dataset combinations, R2T consistently achieves a relative error below 20%, and the error is below 10% in all but three cases. We also notice that, for a given query, R2T performs better on road networks than on social networks. This is because the error of R2T is proportional to DS_Q(I) by our theoretical analysis, so the relative error is proportional to DS_Q(I)/Q(I). Larger and sparser graphs, such as road networks, therefore lead to smaller relative errors.
Dataset | Deezer | | Amazon1 | | Amazon2 | | RoadNet-PA | | RoadNet-CA | |
---|---|---|---|---|---|---|---|---|---|---|
Result type | Relative error (%) | Time (s) | Relative error (%) | Time (s) | Relative error (%) | Time (s) | Relative error (%) | Time (s) | Relative error (%) | Time (s) |
Query result (edge counting) | 847,000 | 1.28 | 900,000 | 1.52 | 926,000 | 1.62 | 1,540,000 | 1.51 | 2,770,000 | 2.64 |
R2T | 0.535 | 12.3 | 0.557 | 15.6 | 0.432 | 16.2 | 0.0114 | 26.8 | 0.00635 | 48.7 | |
NT | 59.1 | 18.1 | 101 | 29.3 | 125 | 40.4 | 1,370 | 21.9 | 1,410 | 39.7 | |
SDE | 548 | 9,870 | 363 | 4,570 | 286 | 1,130 | 55.2 | 105 | 81.8 | 292 | |
LP | 14.3 | 16.9 | 5.72 | 14.7 | 6.75 | 14.4 | 3.6 | 28.3 | 3.02 | 54 | |
Query result (length-2 path counting) | 21,800,000 | 13.8 | 9,120,000 | 11.8 | 9,750,000 | 13.8 | 3,390,000 | 6.39 | 6,000,000 | 6.06 |
R2T | 6.64 | 356 | 12.2 | 170 | 9.06 | 196 | 0.0539 | 80.2 | 0.0352 | 145 | |
NT | 116 | 21.0 | 398 | 28.4 | 390 | 41.0 | 6,160 | 23.2 | 6,530 | 44.2 | |
SDE | 8,900 | 9,870 | 5,110 | 4,570 | 1,930 | 1,130 | 211 | 104 | 228 | 296 | |
LP | 35.9 | 8,820 | 23.2 | 3,600 | 27.8 | 461 | 11.1 | 148 | 13.3 | 404 | |
Query result (triangle counting) | 794,000 | 4.53 | 718,000 | 5.03 | 667,000 | 4.20 | 67,200 | 2.96 | 121,000 | 5.17 |
R2T | 5.58 | 17.3 | 1.27 | 18.8 | 2.03 | 19.9 | 0.102 | 4.21 | 0.061 | 7.5 | |
NT | 782 | 23.0 | 1,660 | 31.7 | 1,920 | 41.0 | 110,000 | 23.3 | 105,000 | 45.0 | |
SDE | 67,300 | 9,880 | 26,000 | 4,570 | 9,600 | 1,130 | 4,150 | 106 | 3,830 | 297 | |
LP | 24.6 | 131 | 12.8 | 18.2 | 14.2 | 18.3 | 0.104 | 3.95 | 0.0625 | 7.06 | |
RM | Over time limit | | Over time limit | | Over time limit | | 0.0388 | 1,280 | 0.0193 | 2,550 |
Query result (rectangle counting) | 11,900,000 | 74.3 | 2,480,000 | 21.6 | 3,130,000 | 15.6 | 158,000 | 4.50 | 262,000 | 10.1 |
R2T | 16.9 | 289 | 6.29 | 70.5 | 10.5 | 86.8 | 0.0729 | 8.18 | 0.0638 | 16.2 | |
NT | 3,750 | 57.6 | 30,700 | 35.8 | 26,100 | 50.6 | 319,000 | 24.8 | 368,000 | 45.0 | |
SDE | 6,970,000 | 9,930 | 11,400,000 | 4,580 | 202,000 | 1,140 | 10,300 | 108 | 9,130 | 300 | |
LP | 92.6 | 2,530 | 94.8 | 70.4 | 77.8 | 81.2 | 0.223 | 7.83 | 0.165 | 14.2 | |
RM | Over time limit | | Over time limit | | Over time limit | | 0.0217 | 10,500 | Over time limit | |
In terms of running time, all mechanisms are reasonable, except for RM and SDE. RM can only complete within the six-hour time limit in three cases, although it achieves very small errors in those cases. SDE is faster than RM but runs quite a bit slower than the others. It is also interesting to see that R2T sometimes runs even faster than LP, despite the fact that R2T needs to solve multiple LPs. This is due to the early-stop optimization: The running time of R2T is determined by the LP that corresponds to the near-optimal τ, which often happens to be one of the LPs that can be solved fastest. We also conducted experiments on how the privacy parameter ε affects the various mechanisms; see the full version of the paper. The results show that R2T retains high utility even for small ε.
Selection of τ. In the next set of experiments, we dive deeper to see how sensitive the utility is with respect to the truncation threshold τ. We tested the four queries on Amazon2 and measured the error of the LP-based mechanism20 with different τ. For each query, we tried various τ from 2 up to GS_Q and compared the errors with R2T. The results are shown in Table 3, where each LP row corresponds to one choice of τ, and the smallest error in each column corresponds to the optimal τ for that query. The results indicate that the error is highly sensitive to τ; more importantly, the optimal choice of τ depends closely on the query, and no fixed τ works well for all cases. On the other hand, the error of R2T is within a small constant factor (around 6) of the optimal choice of τ, which is exactly the promise of instance optimality.
Query | Edge counting | Length-2 path counting | Triangle counting | Rectangle counting |
---|---|---|---|---|
Query result | 926,000 | 9,750,000 | 667,000 | 3,130,000 | |
R2T | 4,000 | 883,000 | 13,500 | 328,000 | |
LP | 1,440 | 1,580,000 | 1,290,000 | 1,370,000,000 |
 | 2,100 | 181,000 | 157,000 | 140,000,000 |
 | 110,000 | 259,000 | 15,100 | 25,800,000 |
 | 645,000 | 1,260,000 | 2,790 | 2,630,000 |
 | 810,000 | 3,950,000 | 2,090 | 274,000 |
 | 911,000 | 7,580,000 | 92,300 | 48,700 |
 | 924,000 | 9,340,000 | 459,000 | 76,400 |
Average error | 62,500 | 2,710,000 | 94,900 | 2,430,000 |
9. More Discussions
Following this work, there have been many efforts put into query evaluation in relational databases under DP. For instance, Dong and Yi14 and Dong et al.8 improve the logarithmic factor in the error for self-join-free queries and self-join queries, while Fang et al.16 explores answering SPJA queries with Max
aggregation. In addition, Cai et al.5 and Dong et al.10 focus on answering multiple queries, while Dong et al.7,9 investigate SPJA queries over dynamic databases. For more details, please refer to this recent survey.13 Moreover, by integrating this work with Dong et al.10 and Fang et al.,16 we have developed a DP SQL system25 capable of answering a broad class of queries that include selection, projection, aggregation, join, and group by operations.
Acknowledgement
This work has been supported by HKRGC under grants 16201318, 16201819, and 16205420; by NTU-NAP startup grant 024584-00001; by the National Science Foundation under grant 2016393; and by DARPA and SPAWAR under contract N66001-15-C-4067.