Home/Magazine Archive/September 2008 (Vol. 51, No. 9)/Technical Perspective: Distributing Your Data and.../Full Text

Research highlights
# Technical Perspective: Distributing Your Data and Having It, Too

**Interconnected Systems**the Internet, wireless networks, or sensor netsembrace virtually all computing environments. Thus our data no longer needs to be stored, nicely organized, in centralized databases; it may instead span a great many heterogeneous locations and be connected through communication links. Records about stock-exchange transactions, for example, reside in broker firms and other financial entities around the world.

But data is worthless if it cannot be efficiently processed to yield useful information. Enter a data set's basic aggregates, such as the maximum, mean, or median, which can be combined to evaluate statistical properties of the data and guide decisions and actions, for example, by detecting trends stock markets.

Some of these aggregates can be computed quickly by means of a common network infrastructurenamely, a "spanning tree"that connects all nodes storing information. For example, a simple recursive algorithm can compute the mean: each node averages the means computed in each subtree rooted at a node; and the resulting mean, together with the count, is then forwarded to the parent, which in turn averages the results from its own subtrees.

Assuming a unit of time that allows for receiving messages from all children and processing them; each iteration takes a single time unit, and the number of time units is proportional to the depth of the spanning tree, or diameter of the network, denoted *D*.

Essentially, the same algorithm computes the maximal or minimal value and similar aggregates. But the mean is sensitive to outliers: when the data has large variations the mean may misrepresent the data. On days when stock rates are highly fluctuating, a single stock transaction at an extremely low quote can significantly sway the mean, rendering it irrelevant. The median, other quartiles, or in general the *k*^{th} element of the data set are more significant in these situations.

But although the simple algorithm described here is fine for the mean, it does not work for computing the median, given that the median at an interior node is not necessarily the median of the medians in its sub-trees. A simple divide-and-conquer approach can be used instead. This algorithm starts with the entire set of elements and in each round randomly chooses a single *pivot* element. The algorithm then counts the number of elements larger than the pivot, and it recourses in the corresponding subinterval. A fairly straightforward analysis shows that when the pivots are chosen uniformly at random and the element counts are exact, then the expected number of iterations is asymptotically bounded by the logarithm of *m*, the number of nodes in the tree.

Putting this sequential algorithm to work in a large heterogeneous network, as done by Kuhn, Locher, and Watten-hofer, illustrates the challenges facing designers of network algorithms today and the innovations they must come up with to tackle them.

The main barrier is in sampling the pivot from a vast and scattered data set. The three researchers sidestep this barrier, however, by instead sending a search expedition to look for the pivot within the data. The search takes a random walk down the parent spanning tree, starting from the root and randomly choosing to which child to proceed. The choice is biased by the size of the sub-trees rooted at the children, but a careful tuning of the biases allows a pivot to be picked uniformly at random within time *2D*.

Instead of choosing one pivot at a time, the algorithm of Kuhn, Locher, and Wattenhofer finds *D* pivots within time *O*(*D*) by staggering the search for several pivots in overlapping time intervals. Using *D* pivots, the number of candidates reduces by a factor of *D* in each iteration, leading to a total time complexity of *O* (*D log _{D}* n), with

Armed with these algorithms, the problem of computing data aggregates distributively is now largely solved. Data can be spread across a network, yet we can mine it to deduce important information.

And in case you were wondering: the answer is negative. No other algorithm can do better. The authors show that any algorithm for finding the median (and in general the *k ^{th}* item) must take time proportional to

The idea is to construct many scenarios, each with a different median, and use an information-theoretic argument to show that a lot of time is needed in order to disambiguate among these scenarios.

**©2008 ACM 0001-0782/08/0900 $5.00**

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2008 ACM, Inc.

No entries found