Technical Perspective: Distributing Your Data and Having It, Too

Interconnected Systems—the Internet, wireless networks, or sensor nets—embrace virtually all computing environments. Thus our data no longer needs to be stored, nicely organized, in centralized databases; it may instead span a great many heterogeneous locations and be connected through communication links. Records about stock-exchange transactions, for example, reside in broker firms and other financial entities around the world.

But data is worthless if it cannot be efficiently processed to yield useful information. Enter a data set’s basic aggregates, such as the maximum, mean, or median, which can be combined to evaluate statistical properties of the data and guide decisions and actions, for example, by detecting trends stock markets.

Some of these aggregates can be computed quickly by means of a common network infrastructure—namely, a "spanning tree"—that connects all nodes storing information. For example, a simple recursive algorithm can compute the mean: each node averages the means computed in each subtree rooted at a node; and the resulting mean, together with the count, is then forwarded to the parent, which in turn averages the results from its own subtrees.

Assuming a unit of time that allows for receiving messages from all children and processing them; each iteration takes a single time unit, and the number of time units is proportional to the depth of the spanning tree, or diameter of the network, denoted D.

Essentially, the same algorithm computes the maximal or minimal value and similar aggregates. But the mean is sensitive to outliers: when the data has large variations the mean may misrepresent the data. On days when stock rates are highly fluctuating, a single stock transaction at an extremely low quote can significantly sway the mean, rendering it irrelevant. The median, other quartiles, or in general the k^th element of the data set are more significant in these situations.

But although the simple algorithm described here is fine for the mean, it does not work for computing the median, given that the median at an interior node is not necessarily the median of the medians in its sub-trees. A simple divide-and-conquer approach can be used instead. This algorithm starts with the entire set of elements and in each round randomly chooses a single pivot element. The algorithm then counts the number of elements larger than the pivot, and it recourses in the corresponding subinterval. A fairly straightforward analysis shows that when the pivots are chosen uniformly at random and the element counts are exact, then the expected number of iterations is asymptotically bounded by the logarithm of m, the number of nodes in the tree.

Putting this sequential algorithm to work in a large heterogeneous network, as done by Kuhn, Locher, and Watten-hofer, illustrates the challenges facing designers of network algorithms today and the innovations they must come up with to tackle them.

The main barrier is in sampling the pivot from a vast and scattered data set. The three researchers sidestep this barrier, however, by instead sending a search expedition to look for the pivot within the data. The search takes a random walk down the parent spanning tree, starting from the root and randomly choosing to which child to proceed. The choice is biased by the size of the sub-trees rooted at the children, but a careful tuning of the biases allows a pivot to be picked uniformly at random within time 2D.

Instead of choosing one pivot at a time, the algorithm of Kuhn, Locher, and Wattenhofer finds D pivots within time O(D) by staggering the search for several pivots in overlapping time intervals. Using D pivots, the number of candidates reduces by a factor of D in each iteration, leading to a total time complexity of O (D log_D n), with n being the number of nodes in the network. Their paper then shows how to avoid the randomized pivot selection so as to make the algorithm deterministic, though at some cost.

Armed with these algorithms, the problem of computing data aggregates distributively is now largely solved. Data can be spread across a network, yet we can mine it to deduce important information.

And in case you were wondering: the answer is negative. No other algorithm can do better. The authors show that any algorithm for finding the median (and in general the k^th item) must take time proportional to D log_D n, even if it can flip coins. In the true tradition of the theory of distributed computing, the lower bound follows from the uncertainty inherent in the problem.

The idea is to construct many scenarios, each with a different median, and use an information-theoretic argument to show that a lot of time is needed in order to disambiguate among these scenarios.

Footnotes

DOI: http://doi.acm.org/10.1145/1378727.1378748

Technical Perspective: Distributing Your Data and Having It, Too

Technical Perspective: Distributing Your Data and Having It, Too

DOI

September 2008 Issue

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.

Technical Perspective: Distributing Your Data and Having It, Too

DOI

September 2008 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.