Sign In

Communications of the ACM

Research highlights

Reasoning on Data Partitioning For Single-Round Multi-Join Evaluation in Massively Parallel Systems


View as: Print Mobile App ACM Digital Library Full Text (PDF) In the Digital Edition Share: Send by email Share on reddit Share on StumbleUpon Share on Hacker News Share on Tweeter Share on Facebook
Reasoning on Data Partitioning for Single-Round Multi-Join Evaluation in Massively Parallel Systems, illustration

Credit: iStockPhoto.com

Evaluating queries over massive amounts of data is a major challenge in the big data era. Modern massively parallel systems, such as, Spark, organize query answering as a sequence of rounds each consisting of a distinct communication phase followed by a computation phase. The communication phase redistributes data over the available servers, while in the subsequent computation phase each server performs the actual computation on its local data. There is a growing interest in single-round algorithms for evaluating multiway joins where data is first reshuffled over the servers and then evaluated in a parallel but communication-free way. As the amount of communication induced by a reshuffling of the data is a dominating cost in such systems, we introduce a framework for reasoning about data partitioning to detect when we can avoid the data reshuffling step. Specifically, we formalize the decision problems parallel-correctness and transfer of parallel-correctness, provide semantical characterizations, and obtain tight complexity bounds.

Back to Top

1. Introduction

The background scenario for this work is that of large-scale data analytics where massive parallelism is utilized to answer complex join queries over multiple database tables. For instance, as described by Chu et al.,7 data analytics engines face new kinds of workloads, where multiple large tables are joined, or where the query graph has cycles. Furthermore, recent in-memory systems (e.g., Refs.11, 13, 19, 23) can fit data in main memory by utilizing a multitude of servers. Koutris and Suciu12 introduced the Massively Parallel Communication (MPC) model to facilitate an understanding of the complexity of query processing on shared-nothing parallel architectures. For such systems, performance is no longer dominated by the number of I/O requests to external memory as in traditional systems but by the communication cost for reshuffling data during query execution. When queries need to be evaluated in several rounds, such reshuffling can repartition the whole database and can thus be very expensive.

While in traditional distributed query evaluation, multi-join queries are computed in several stages over a join tree possibly transferring data over the network at each step, we focus on query evaluation algorithms within the MPC model that only require one round of communication. Such algorithms consist of two phases: a distribution phase (where data is repartitioned or reshuffled over the servers) followed by a computation phase, where each server contributes to the query answer in isolation, by evaluating the query at hand over the local data without any further communication. We refer to such algorithms as generic one-round algorithms. Afrati and Ullman1 describe an algorithm that computes a multi-join query in a single communication round. The algorithm uses a technique that can be traced back to Ganguly et al.9 Beame et al.4, 5 refined the algorithm, named it HyperCube, and showed that it is a communication-optimal algorithm for single-round distributed evaluation of conjunctive queries.

The generic one-round HyperCube algorithm requires a reshuffling of the base data for every separate query. As the amount of communication induced by a reshuffling of the data can be huge, it is important to detect when the reshuffle step can be avoided. We present a framework for reasoning about data partitioning for generic one-round algorithms for the evaluation of queries under arbitrary distribution policies, not just those resulting from the HyperCube algorithm. To target the widest possible range of repartitioning strategies, the initial distribution phase is therefore modeled by a distribution policy that can be any mapping from facts to subsets of servers.

The optimization framework is motivated by two concrete scenarios. In the first scenario, we assume that the data is already partitioned over the servers and we want to know whether a given query can be evaluated correctly over the given data distribution without reshuffling the data. In the second scenario, the data distribution might be unknown or hidden, but it is known that it allowed the correct evaluation of the previous query. Here, we ask whether this knowledge guarantees that the given (next) query can be evaluated correctly without reshuffling. To this end, we formalize the following decision problems:

  • Parallel-Correctness: Given a distribution policy and a query, can we be sure that the corresponding generic one-round algorithm will always compute the query result correctlyno matter the actual data?
  • Parallel-Correctness Transfer: Given two queries

     

    No entries found