Technical Perspective: Data Distribution For Fast Joins

When we talk about big data and data analytics, a big—some say, the biggest—component of it is what is known as data wrangling: extracting, integrating, querying, and otherwise preparing data for meaningful analytic algorithms to be applied. Data wrangling relies on well-known and trusted database technology, but many classical database questions now are posed in new settings. One reason for this is that parallel processing becomes very important for handling large amounts of data. This has given rise to a steady line of research on classical database problems in new environments where costs caused by massive parallelism dominate the usual I/O costs of the standard database environment. These new costs are primarily related to communication.

What is the most drastic way to reduce the cost of communication for parallel data processing algorithms, for example, query evaluation? If we could distribute data to servers in a single round of communication, let them do their work, and then collect the results to produce the answer to our query, that would be ideal. This is precisely the kind of questions studied in the following paper. It looks at join algorithms: the most common and important task in database query processing, and investigates conditions on joins that make one-round parallel algorithms produce correct results.

They are not the first to look at this problem. In 2010, Afrati and Ullman initiated the study of such multi-join algorithms. A refinement, Hypercube, algorithm was proposed in 2013 by Beame, Koutris, and Suciu. In those algorithms, the network topology is a hypercube. To evaluate a query, one replicates each tuple in several of its nodes and then lets each node perform its computation. While the hypercube is a rather natural distribution policy, it is certainly not the only one. But can we reason about single-round join evaluation under arbitrary distribution policies?

Also, distribution policies are query-dependent. While finding one policy for all scenarios is of course unrealistic, what about a more down-to-earth requirement: if we already know that a policy works for a query Q, perhaps we can use the same policy for another query Q‘, without redistributing data? This paper addresses these questions.

The formalism. It is very simple and elegant. A network is a set of node names; a distribution policy assigns each tuple in a relation to a set of nodes. This is the communication round. The query Q is then executed locally at each node. It is parallel correct if such a distributed evaluation gives the result of Q; that is, tuples in the answer to Q are exactly those produced locally at network nodes.

Next, if we have two queries Q and Q‘, and we know that each distribution policy that makes Q parallel-correct does the same for Q‘, we say that parallel-correctness transfers from Q to Q‘. In this case, the work done for Q in terms of looking for the right distribution policy need not be redone for Q‘.

The results, and what they tell us. This is a theory paper; the main results are about the complexity of checking parallel-correctness and parallel-transferability. It concentrates on the class of conjunctive queries, that is, slightly more general queries than multi-way joins.

The following paper looks at join algorithms: the most common and important task in database query processing.

Parallel-correctness, under mild assumptions, is -complete. That is, it is a bit harder than NP or coNP, but still well within polynomial space. In practice, this means that checking whether a distribution policy guarantees correctness for all databases can be done in exponential time. Note that this is a static analysis problem (the database is not an input), and exponential time is tolerable and in fact the expected best case for conjunctive queries (as their containment is NP-complete).

The authors then show the same problems for conjunctive queries with negations requires (modulo some complexity theory assumptions) double-exponential time, that is, is realistically unsolvable, which means one needs to restrict attention to simple joins.

Finally, transferability of parallel-correctness for conjunctive queries is solvable in exponential time (remember, this is a problem about queries, not about data), and importantly it is in NP for many classes of conjunctive queries, like multi-joins (which hints at the possibility of using efficient NP solvers to address this problem in practice).

To conclude, I would like to explain why I view this as a model database theory paper. Such a paper ought to have several key ingredients:

It should consider a real data management problem of interest in practice;
It should provide a clean and simple formalism that can be followed by theoreticians and practitioners alike;
It should provide theoretical results that give us insights about the original practical problem.

The paper ticks all these boxes: It provides an elegant theoretical investigation of a practically important problem, and gives a good set of results that delineate the feasibility boundary.

Footnotes

To view the accompanying paper, visit doi.acm.org/10.1145/3041063

Technical Perspective: Data Distribution For Fast Joins

Technical Perspective: Data Distribution For Fast Joins

DOI

March 2017 Issue

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.

Technical Perspective: Data Distribution For Fast Joins

DOI

March 2017 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.