
Research highlights
# Heavy Hitters via Cluster-Preserving Clustering

We develop a new algorithm for the turnstile heavy hitters problem in general turnstile streams, the EXPANDERSKETCH, which finds the approximate top-*k* items in a universe of size *n* using the same asymptotic *O*(*k* log *n*) words of memory and *O*(log *n*) update time as the COUNTMIN and COUNTSKETCH, but requiring only *O*(*k poly*(log *n*)) time to answer queries instead of the *O*(*n* log *n*) time of the other two. The notion of "approximation" is the same *l*_{2} sense as the COUNTSKETCH, which given known lower bounds is the strongest guarantee one can achieve in sublinear memory.

Our main innovation is an efficient reduction from the heavy hitters problem to a clustering problem in which each heavy hitter is encoded as some form of noisy spectral cluster in a graph, and the goal is to identify every cluster. Since every heavy hitter must be found, correctness requires that every cluster be found. We thus need a "cluster-preserving clustering" algorithm that partitions the graph into pieces while finding every cluster. To do this we first apply standard spectral graph partitioning, and then we use some novel local search techniques to modify the cuts obtained so as to make sure that the original clusters are sufficiently preserved. Our clustering algorithm may be of broader interest beyond heavy hitters and streaming algorithms.

Finding "frequent" or "top-*k*" items in a dataset is a common task in data mining. In the data streaming literature, this problem is typically referred to as the *heavy hitters problem*, which is as follows: a frequency vector *x* ∈ ℝ^{n} is initialized to the zero vector, and we process a stream of updates `update`(*i*, Δ) for Δ ∈ ℝ, with each such update causing the change *x*_{i} ← *x*_{i} + Δ. A query should then report the items *i* of largest weight.

Returning to technical definitions, we define item weights *w*_{i} := |*x*_{i}|. In the *l*_{p} *tail heavy hitters problem* with parameter *k*, a query must return a list *L* of *O*(*k*) items containing every *i* whose weight is at least as large, in the *l*_{p} sense, as the average weight of the items outside the top *k*; that is, every *i* with *w*_{i}^{p} ≥ (1/*k*) Σ_{j∉top-k} *w*_{j}^{p}.
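To make the model concrete, here is a minimal Python sketch of the turnstile update semantics together with the (memory-hungry) exact baseline that sketching data structures improve upon; the function names are ours, not from the paper.

```python
from collections import defaultdict

def process_stream(updates):
    """Maintain the frequency vector x of a turnstile stream exactly.

    This is the baseline the sketches improve on: it stores every
    touched coordinate instead of O(k log n) words.
    """
    x = defaultdict(float)
    for i, delta in updates:      # update(i, delta): x_i <- x_i + delta
        x[i] += delta
    return x

def exact_top_k(x, k):
    """Return the k indices of largest |x_i| (ties broken by index)."""
    return sorted(sorted(x), key=lambda i: -abs(x[i]))[:k]
```

Note that updates may drive a coordinate up and back down again, which is exactly what defeats sampling-based approaches later in the article.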

The earliest literature on the heavy hitters problem focused on the case of *insertion-only streams*, in which every `update`(*i*, Δ) has Δ = 1 (in contrast with the *general turnstile model*, which allows arbitrary Δ ∈ ℝ). This corresponds to seeing a stream of indices *i*_{1}, *i*_{2}, ..., *i*_{m} and wanting to find those items which occur frequently. The perhaps most well-studied form of the problem in insertion-only streams is the *l*_{1} version: report every item occurring at least *m*/(*k* + 1) times. One simple solution is to sample stream positions uniformly at random, since such frequent items are likely to appear repeatedly in a modest-size sample; a deterministic alternative using only *O*(*k*) counters is the classic `Frequent` algorithm of Misra and Gries.^{13}
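For intuition, the counter-decrement idea behind `Frequent` fits in a few lines of Python (a simplified rendering of Misra and Gries,^{13} with our own naming):

```python
def frequent(stream, k):
    """Misra-Gries 'Frequent': k counters, insertion-only streams.

    Every item occurring more than m/(k+1) times in a length-m stream
    survives among the counters; each count is an underestimate of the
    true frequency by at most m/(k+1).
    """
    counters = {}
    for i in stream:
        if i in counters:
            counters[i] += 1
        elif len(counters) < k:
            counters[i] = 1
        else:
            # decrement all counters; drop those that hit zero
            for j in list(counters):
                counters[j] -= 1
                if counters[j] == 0:
                    del counters[j]
    return counters
```

Each decrement step "pays for" k + 1 distinct stream items at once, which is where the m/(k + 1) threshold comes from.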

We now observe the following: consider the weighting function *f*(*x*_{i}) = |*x*_{i}|^{p}, and write *x*_{i} for the smallest top-*k* frequency and *x*_{i′} for the largest frequency outside the top *k*. Then we want that

*k* · *x*_{i}^{p} > (*n* − *k*) · *x*_{i′}^{p},

so that the combined weight of the top-*k* items dominates that of all remaining items. This inequality indeed holds for *p* sufficiently large; specifically it suffices that *p* > log((*n* − *k*)/*k*)/log(*x*_{i}/*x*_{i′}). One then may expect that, due to larger *p* forcing a "more exact" solution to the top-*k* problem, solving *l*_{p} tail heavy hitters should be harder as *p* grows.
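The arithmetic behind this threshold is easy to check numerically; the following sketch (with made-up frequencies of our choosing) verifies that the top-*k* weight dominates exactly when *p* crosses log((*n* − *k*)/*k*)/log(*x*_{i}/*x*_{i′}):

```python
import math

def p_threshold(n, k, x_topk_min, x_rest_max):
    """The p at which k * x_topk_min**p equals (n - k) * x_rest_max**p,
    i.e., beyond which the combined weight |x_i|^p of the top-k items
    dominates the combined weight of the other n - k items."""
    return math.log((n - k) / k) / math.log(x_topk_min / x_rest_max)

# e.g., n = 1000, k = 10, with a 2x frequency gap between the smallest
# top-k frequency (20.0) and the largest remaining frequency (10.0):
p_star = p_threshold(1000, 10, 20.0, 10.0)   # log(99)/log(2), about 6.63
```

Above *p*\*, raising frequencies to the *p*th power concentrates essentially all of the *l*_{p} mass on the top-*k* items.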

Thus far we have described previous results in the insertion-only model, and we now turn back to the turnstile model. Allowing both positive and negative Δ is important when one does not necessarily want the top *k* items in a single stream, but perhaps the top *k* items in terms of frequency *change* across two different streams (e.g., two time windows). For example, we may have two streams of query words to a search engine across two time windows and want to know which words had their frequencies change the most. If one interprets stream items in the first window as Δ = +1 updates and those in the second window as corresponding Δ = −1 updates, then *x*_{i} for any word *i* is exactly its change in frequency between the two windows, and the heavy hitters are precisely the words whose frequencies changed the most.

The complexity of the heavy hitters problem varies dramatically when one moves from the insertion-only model to the turnstile model. For example, it is clear that the first-mentioned algorithm of sampling the stream no longer works (consider for example the stream which does `update`(1, +1) *N* times followed by `update`(1, −1) *N* times for large *N*, followed by `update`(2, 1); only item 2 is a heavy hitter, but a sampling algorithm will likely not notice). It turns out that the Frequent, BPTree, and CountSieve data structures also do not have versions that can operate in the turnstile model; in fact a lower bound in the work of Jowhari et al.^{9} shows that any solution to *l*_{p} heavy hitters in the turnstile model must use Ω(*k* log *n*) words of memory, which the COUNTMIN sketch (for *l*_{1}) and COUNTSKETCH (for *l*_{2}) match up to a constant factor.

Given that the space complexity of the heavy hitters problem is resolved up to a constant factor in the turnstile model, what room for improvement is left? In fact, the quality of a data structure is not measured only along the axis of space complexity, but also update time, query time, and failure probability (all the algorithms mentioned thus far, even in the insertion-only model but with the exception of `Frequent`, are randomized and have a bounded failure probability of incorrectly answering a query). Using *O*(*k* log *n*) words of memory the COUNTMIN sketch and COUNTSKETCH each have *O*(log *n*) update time and failure probability 1/*poly*(*n*), which are both the best known, but they suffer from a query time of Θ(*n* log *n*). That is, even though the final output list *L* only has size *O*(*k*), the time to construct it is slightly superlinear in the size of the universe the items are drawn from. For *p* = 1 and only in the strict turnstile model, this was remedied by the so-called "dyadic trick" in the work of Cormode and Muthukrishnan^{7} and the "hierarchical COUNTSKETCH" in the work of Cormode and Hadjieleftheriou,^{6} which each could trade off space and update time for improved query time; see Figure 1 for detailed bounds. None of these solutions though was able to simultaneously achieve the best of all worlds, namely (1) working in the general turnstile model, (2) achieving the *l*_{2} guarantee, (3) achieving the *O*(*k* log *n*) space and *O*(log *n*) update time of the COUNTSKETCH, and (4) achieving the *k* · *poly*(log *n*) query time of the dyadic trick. That is, none was able to do so until the EXPANDERSKETCH.
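To see where the superlinear query time comes from, here is a toy CountSketch in Python (ours, heavily simplified from Charikar et al.^{5}): updates touch one counter per row and are fast, but the only generic way to produce the top-*k* list is to estimate all *n* coordinates.

```python
import random
random.seed(0)

class CountSketch:
    """Toy CountSketch: d rows of w counters with random sign/bucket hashes.

    Handles general turnstile updates in O(d) time, but query_top_k must
    estimate every i in [n] -- the ~n*d-time bottleneck the
    EXPANDERSKETCH removes.  (Truly random hash tables stand in for the
    limited-independence families used in the actual analysis.)
    """
    def __init__(self, d, w, n):
        self.table = [[0.0] * w for _ in range(d)]
        self.bucket = [[random.randrange(w) for _ in range(n)] for _ in range(d)]
        self.sign = [[random.choice((-1, 1)) for _ in range(n)] for _ in range(d)]

    def update(self, i, delta):
        for r in range(len(self.table)):
            self.table[r][self.bucket[r][i]] += self.sign[r][i] * delta

    def estimate(self, i):
        vals = sorted(self.sign[r][i] * self.table[r][self.bucket[r][i]]
                      for r in range(len(self.table)))
        return vals[len(vals) // 2]      # median over the d rows

    def query_top_k(self, n, k):         # Theta(n * d) work!
        return sorted(range(n), key=lambda i: -abs(self.estimate(i)))[:k]
```

The point of the EXPANDERSKETCH is to avoid this exhaustive scan over the universe while keeping the same space and update time.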

**Figure 1. Previous results for the turnstile *l*_{p} heavy hitters problem, stated for failure probability 1/*poly*(*n*). The "general" column states whether the algorithm works in the general turnstile model (as opposed to only strict turnstile). Memory consumption is stated in machine words. The Hierarchical COUNTSKETCH query time could be made *k* · *n*^{ε} for arbitrarily small constant ε > 0. The constant *c* in the log^{c} *n* term in the EXPANDERSKETCH query time can be made 3 + *o*(1) and most likely even 2 + *o*(1) using more sophisticated graph clustering subroutines, and in the strict turnstile model it can even be made 1 + *o*(1); see full paper.**

**1.1. Our main contribution**

We give the first data structure, the EXPANDERSKETCH, for general turnstile *l*_{2} tail heavy hitters using optimal *O*(*k* log *n*) words of memory with 1/*poly*(*n*) failure probability, fast *O*(log *n*) update time, and fast *k poly*(log *n*) query time.

We here provide an overview of our algorithm, EXPANDERSKETCH. Henceforth to avoid repetitiveness, we will often drop the "tail" in "tail heavy hitters"; all heavy hitters problems we consider henceforth are the tail version. Our first step is a reduction from arbitrary *k* to several heavy hitters problems for which *k* is "small". In this reduction, we initialize data structures (to be designed) *D*_{1}, ..., *D*_{q} for the *q* = Θ(*k*/log *n*) smaller problems, pick a random hash function *h* : [*n*] → [*q*], and feed each `update`(*i*, Δ) to *D*_{h(i)} alone. Whp each subproblem then receives only *O*(log *n*) of the heavy hitters, so it suffices to solve each small subproblem and output the union of the recovered lists.
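A toy version of this first reduction might look as follows (the concrete choice of the number of subproblems and the truly random hash are our simplifying assumptions; the paper uses suitable hash families):

```python
import math
import random
random.seed(1)

def split_into_subproblems(updates, n, k):
    """Route each update(i, delta) to one of q ~ k / log2(n) sub-streams
    via a random hash, so each subproblem whp contains only O(log n)
    heavy hitters.  Toy sketch: a table of truly random hash values
    stands in for the hash families used in the actual analysis."""
    q = max(1, k // max(1, int(math.log2(n))))
    h = [random.randrange(q) for _ in range(n)]
    subs = [[] for _ in range(q)]
    for i, delta in updates:
        subs[h[i]].append((i, delta))
    return subs, h
```

Since each update is routed to exactly one subproblem, total space and update time are preserved up to constants.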

There is an algorithm, the Hierarchical COUNTSKETCH (see Figure 1), which is almost ideal. It achieves *O*(*k* log *n*) space and *O*(log *n*) update time, but the query time is polynomial in the universe size instead of our desired polylogarithmic. Our main idea is to reduce the universe size to polylogarithmic so that this solution then becomes viable for fast query. We accomplish this reduction while maintaining *O*(log *n*) update time and optimal space by reducing our heavy hitter problem into *m* = Θ(log *n*/log log *n*) separate *partition heavy hitters problems*, which we now describe.

In the partition *k*-heavy hitters problem there is some partition *P* = {*S*_{1}, ..., *S*_{N}} of [*n*], known to the algorithm only through an oracle *O* : [*n*] → [*N*] with *O*(*i*) giving the index *j* of the part *S*_{j} containing *i*. Defining a vector *y* ∈ ℝ^{N} of part weights by *y*_{j} = (Σ_{i∈S_{j}} *x*_{i}^{2})^{1/2}, the goal is now to find the heavy *parts*, that is, to solve the *l*_{2} tail heavy hitters problem on *y* rather than on *x*. Note the usual heavy hitters problem is the special case where every part is a singleton.

We remark at this point that the *l*_{1} version of partition heavy hitters in the strict turnstile model is simple to solve (and for *l*_{1} strict turnstile, our algorithm already provides improvements over previous work and is thus worth delving into). In particular, one can simply use a standard strict turnstile *l*_{1} heavy hitters algorithm for an *N*-dimensional vector and translate every `update`(*i*, Δ) to `update`(*O*(*i*), Δ). This in effect treats *y*_{j} as Σ_{i∈S_{j}} *x*_{i}, which in the strict turnstile model (where every *x*_{i} is nonnegative at query time) equals the *l*_{1} weight of part *j*.
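The translation is literally one line; a sketch with our own naming:

```python
def translate_updates(updates, oracle):
    """l1 partition heavy hitters, strict turnstile: map update(i, delta)
    to update(O(i), delta) for any N-dimensional l1 heavy hitters
    structure.  With all x_i >= 0, part j accumulates
    y_j = sum_{i in S_j} x_i, its l1 mass, so heavy parts are preserved."""
    return [(oracle(i), delta) for i, delta in updates]
```

The same trick fails for *l*_{2}, since summing a part's coordinates does not preserve its *l*_{2} mass; that is what makes the general *l*_{2} partition problem genuinely harder.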

Now we explain how we make use of partition heavy hitters. We take the pedagogical approach of explaining a simple but flawed solution, then iteratively refine to fix flaws until we arrive at a working solution.

**Take 1.** Recall our overall plan: reduce the universe size so that (the partition heavy hitters version of the) Hierarchical COUNTSKETCH has fast query time. For each index *i* ∈ [*n*] we can write *i* in base *b* (for some *b* yet to be determined) so that it has digit expansion *d*_{m−1}*d*_{m−2} ... *d*_{0} with each *d*_{j} ∈ {0, ..., *b* − 1} and *m* = ⌈log_{b} *n*⌉. We then instantiate *m* partition heavy hitters data structures *P*_{0}, ..., *P*_{m−1}, where *P*_{j} partitions [*n*] according to the *j*th digit (so *N* = *b*). If *i* is a heavy hitter for *x*, then its *j*th digit indexes a heavy part for *P*_{j}, so by querying every *P*_{j} we learn the digits of *i* one level at a time and can concatenate them to recover *i* (see Figure 2).
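The digit decomposition of Take 1 is just base-*b* positional notation; a quick sketch:

```python
def digits(i, b, m):
    """Base-b expansion d_{m-1} ... d_0 of index i (Take 1's chunks)."""
    return [(i // b ** j) % b for j in reversed(range(m))]

def undigits(ds, b):
    """Concatenate recovered digits back into the index."""
    i = 0
    for d in ds:
        i = i * b + d
    return i
```

Each *P*_{j} only needs to distinguish *b* parts, so choosing *b* = *poly*(log *n*) shrinks each subproblem's universe to polylogarithmic size.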

**Figure 2. Simplified version of final data structure. The update is *x*_{29} ← *x*_{29} + Δ with *m* = 3, *t* = 2 in this example. Each *P*_{i} is a b-tree operating on a partition of size 2^{t}.**

There are of course two main problems: first, each *P*_{j} actually outputs a *list* *L*_{j} of potentially many heavy parts (there can be several heavy hitters, and the lists may also contain false positives), so we do not know which digits found at different levels belong to the same heavy hitter; and second, each *P*_{j} has some failure probability, so some reported digits may simply be wrong.

**Take 2.** We continue from where we left off in the previous take. The second issue, that some digits are likely to be wrong, is easy to fix. Specifically, for *b* = *poly*(log *n*) we have *m* = log_{b} *n* = *O*(log *n*/log log *n*). For this setting of *b*, *m*, using that the failure probability of each *P*_{j} can be made 1/*poly*(*b*) = 1/*poly*(log *n*), a Chernoff bound implies that whp at most a small constant fraction of the levels fail. Thus instead of writing the raw digits of *i* across the levels, we write the chunks of an encoding enc(*i*) of *i* under an error-correcting code that can correct a small constant fraction of errors (and which, like the codes of Spielman,^{16} supports fast encoding and decoding); the few wrong chunks can then be corrected at query time.

We now return to the first issue, which is the main complication: in general, there may be more than one heavy hitter (there may be up to Θ(log *n*) of them). Thus, even if we performed wishful thinking and pretended that every *P*_{j} succeeded, and furthermore that every *L*_{j} contained exactly the heavy hitter chunks and nothing else, we would still not know how to stitch chunks together across the levels: which chunk in *L*_{1} goes with which chunk in *L*_{2}?

To aid us in knowing which chunks to concatenate with which across the *L*_{j}, the attempt we describe now (which also does not quite work) is as follows. Define *m* hash functions *h*_{1}, ..., *h*_{m} : [*n*] → [*poly*(log *n*)], and have *P*_{j} partition [*n*] according to the concatenation *h*_{j}(*i*) ∘ enc(*i*)_{j} ∘ *h*_{j+1}(*i*) instead of enc(*i*)_{j} alone. The hash evaluations act as random names linking consecutive levels: the chunks of a heavy hitter *i* at levels *j* and *j* + 1 share the value *h*_{j+1}(*i*), so whp the chunks of each heavy hitter form a path through the levels which we can follow to reassemble enc(*i*) (Figure 3(a)). The problem is fragility: a single failed level disconnects the path (Figure 3(b), (c)), and hash collisions can merge distinct paths (Figure 3(d)).
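Under the wishful assumption that every level succeeds, the stitching step is a simple walk along matching names. This toy sketch (our own rendering) also exposes the fragility: a level with a missing or wrong name breaks the walk.

```python
def stitch_paths(levels):
    """Take-2 stitching.  Each level-j output is a (name_j, chunk,
    name_{j+1}) triple; starting from level 0, follow uniquely matching
    names downward.  Works when all levels succeed (Figure 3(a)); one
    bad level severs the path, motivating the expander-based Take 3."""
    paths = []
    for name, chunk, nxt in levels[0]:
        path = [chunk]
        for lvl in levels[1:]:
            match = [(c, nn) for (nm, c, nn) in lvl if nm == nxt]
            if len(match) != 1:          # broken or ambiguous link
                break
            chunk, nxt = match[0]
            path.append(chunk)
        if len(path) == len(levels):     # keep only complete paths
            paths.append(path)
    return paths
```

Deleting a single level's entry makes the corresponding heavy hitter unrecoverable, exactly the failure mode Take 3 is designed to survive.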

**Figure 3. Each vertex in row *j* corresponds to an element of *L*_{j}, that is, the heavy hitter chunks output by *P*_{j}. When indices in *P*_{j} are partitioned by *h*_{j}(*i*) ∘ enc(*i*)_{j} ∘ *h*_{j+1}(*i*), we connect chunks along paths. Case (a) is the ideal case, when all *j* are good. In (b) *P*_{2} failed, producing a wrong output that triggered incorrect edge insertions. In (c) both *P*_{2} and *P*_{3} failed, triggering an incorrect edge and a missing vertex, respectively. In (d) two heavy hitters collided under *h*_{3}, causing their vertices to have the same name thereby giving the appearance of a merged vertex. Alternatively, light items masking as a heavy hitter might have appeared in *L*_{3} with the same *h*_{3} evaluation as a heavy hitter but different *h*_{4} evaluation, causing the red vertex to have two outgoing edges to level 4.**

**Take 3.** The above consideration motivates our final scheme, which uses an expander-based idea first proposed in the work of Gilbert et al.^{8} in the context of "for all" *l*_{1}/*l*_{1} sparse recovery, a problem in compressed sensing. Although our precise motivation for the next step is slightly different than in the work of Gilbert et al.,^{8} and our query algorithm and our definition of "robustness" for a graph will be completely different, the idea of connecting chunks using expander graphs is very similar to an ingredient in that work. The idea is to replace the path in the last paragraph by a graph which is robust to a small fraction of edge insertions and deletions, still allowing the identification of a large connected component in such scenarios. Expander graphs will allow us to accomplish this. For us, "robust" will mean that over the randomness in our algorithm, whp each corrupted expander is still a spectral cluster (to be defined shortly). For Gilbert et al.,^{8} robustness meant that each corrupted expander still contains an induced small-diameter subgraph (in fact an expander) on a small but constant fraction of the vertices, which allowed them a recovery procedure based on a shallow breadth-first search. They then feed the output of this breadth-first search into a recovery algorithm for an existing list-recoverable code (namely Parvaresh-Vardy codes). Due to suboptimality of known list-recoverable codes, such an approach would not allow us to obtain our optimal results.

**2.1. An expander-based approach**

Let *F* be an arbitrary *D*-regular connected graph on the vertex set [*m*] for some *D* = *O*(1). For *j* ∈ [*m*], let Γ(*j*) ⊂ [*m*] be the set of neighbors of vertex *j*. We partition [*n*] according to *O*_{j}(*i*) = *z*(*i*)_{j} = *h*_{j}(*i*) ∘ enc(*i*)_{j} ∘ *h*_{Γ(j)_{1}}(*i*) ∘ ... ∘ *h*_{Γ(j)_{D}}(*i*), where Γ(*j*)_{k} is the *k*th neighbor of *j* in *F*. Given some such *z*, we say its *name* is the first *s* = *O*(log log *n*) bits comprising the *h*_{j} portion of the concatenation. Now, we can imagine a graph *G* created by drawing one vertex per element of each list *L*_{j} (identified by its level *j* together with its name) and adding an edge between a vertex at level *j* and a vertex at level *j*′ ∈ Γ(*j*) whenever the hash portions they record for each other agree.

Now, the intuitive picture is that *G* *should* be the vertex-disjoint union of several copies of the expander *F*, one for each heavy hitter, plus other junk edges and vertices coming from other non-heavy hitters in the *L*_{j}. Due to certain bad levels *j*, those where *P*_{j} failed or a hash collision occurred, each heavy hitter's copy of *F* may be corrupted: a small fraction of its vertices may be missing or misnamed, and spurious edges may connect it to junk or even to another heavy hitter's copy. The question is then how to identify these corrupted copies of *F* inside *G*, which brings us to the notion of a spectral cluster.

*Definition 1.* An ε-*spectral cluster* in an undirected graph *G* = (*V*, *E*) is a vertex set *W* ⊆ *V* of any size satisfying the following two conditions. First, only an ε-fraction of the edges incident to *W* leave *W*, that is, |∂(*W*)| ≤ ε · *vol*(*W*), where *vol*(*W*) is the sum of edge degrees of vertices inside *W*. Second, given any subset *A* of *W*, let *r* = *vol*(*A*)/*vol*(*W*) and *B* = *W*\*A*. Then the number of edges between *A* and *B* satisfies |*E*(*A*, *B*)| ≥ *r*(1 − *r*) · *vol*(*W*) − ε · *vol*(*W*).

Note *r*(1 − *r*) · *vol*(*W*) is the number of edges one would expect to see between *A* and *B* had *W* been a random graph with a prescribed degree distribution.

Roughly, the above means that (a) the cut separating *W* from the rest of *G* has *O*(ε) conductance (i.e., it is a very sparse cut), and (b) for any cut (*A*, *W*\*A*) within *W*, the number of edges crossing the cut is what is guaranteed from a spectral expander, minus *O*(ε) · *vol*(*W*). Our task then reduces to finding all ε-spectral clusters in a given graph. We devise a scheme CUTGRABCLOSE such that for each such cluster *W*, we are able to find a (1 − *O*(ε))-fraction of its volume with at most *O*(ε) · *vol*(*W*) erroneous volume from outside *W*. This suffices for decoding for ε a sufficiently small constant, since this means we find most vertices, that is, chunks of the encoding, of the heavy hitter.
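The quantities in Definition 1 are straightforward to compute for a concrete graph; this small helper (ours, for illustration only) measures the volume, boundary, and conductance of a candidate cluster given an adjacency-list representation:

```python
def vol(adj, W):
    """Sum of degrees of the vertices in W."""
    return sum(len(adj[v]) for v in W)

def boundary(adj, W):
    """Number of edges leaving W, i.e., |boundary(W)|."""
    W = set(W)
    return sum(1 for v in W for u in adj[v] if u not in W)

def conductance(adj, W):
    """|boundary(W)| / vol(W); small for a spectral cluster."""
    return boundary(adj, W) / vol(adj, W)
```

For instance, a 4-clique attached to the rest of the graph by a single edge has conductance 1/13, already a fairly sparse cut.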

For the special case of *l*_{1} heavy hitters in the strict turnstile model, we are able to devise a much simpler query algorithm that works; see the full paper for details. For this special case we also in the full paper devise a space-optimal algorithm with *O*(log *n*) update time, whp success, and *expected* query time *O*(*k* log *n*) (though unfortunately the variance of the query time may be quite high).

**2.2. Cluster-preserving clustering**

As mentioned, an ε-spectral cluster is a subset *W* of the vertices in a graph *G* = (*V*, *E*) such that (1) |∂(*W*)| ≤ ε · *vol*(*W*), and (2) for any *A* ⊆ *W* with *r* = *vol*(*A*)/*vol*(*W*), the number of edges between *A* and *W*\*A* is at least *r*(1 − *r*) · *vol*(*W*) − ε · *vol*(*W*). Our goal is a partitioning algorithm that, given a graph containing several vertex-disjoint ε-spectral clusters, recovers essentially all of every cluster.

Our algorithm CUTGRABCLOSE is somewhat similar to the spectral clustering algorithm of Kannan et al.^{10} (Section 4), but with local search. That algorithm is simple: find a low-conductance cut (e.g., a Fiedler cut) to split *G* into two pieces, then recurse on both pieces. Details aside, Fiedler cuts are guaranteed by Cheeger's inequality to find a cut of conductance *O*(√ε) as long as a cut of conductance at most ε exists in the graph. The problem with this basic recursive approach is shown in Figure 4 (in particular cut (b)): a cluster can be completely segmented after a few levels of recursion, so that a large portion of the cluster is never found.
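A Fiedler-style cut can be sketched in pure Python via power iteration on *c*·*I* − *L*, where *L* = *D* − *A* is the graph Laplacian: after projecting out the all-ones vector, the iteration converges to the Fiedler vector, whose sign pattern gives the cut. This is a toy stand-in for the eigensolvers one would actually use; function and parameter names are ours.

```python
def fiedler_cut(adj, iters=500):
    """Approximate Fiedler vector by power iteration on c*I - L (with the
    all-ones direction projected out each step); return the vertices with
    positive sign, which form one side of a low-conductance cut."""
    vs = sorted(adj)
    deg = {v: len(adj[v]) for v in vs}
    c = 2 * max(deg.values()) + 1          # makes c*I - L positive definite
    x = {v: (hash(v) % 97) - 48.0 for v in vs}   # arbitrary start vector
    for _ in range(iters):
        mean = sum(x.values()) / len(vs)   # project out the constant vector
        x = {v: x[v] - mean for v in vs}
        # multiply by c*I - L, where (L x)_v = deg(v) x_v - sum_{u~v} x_u
        y = {v: (c - deg[v]) * x[v] + sum(x[u] for u in adj[v]) for v in vs}
        norm = max(abs(t) for t in y.values()) or 1.0
        x = {v: y[v] / norm for v in vs}
    return {v for v in vs if x[v] > 0}
```

On a "dumbbell" of two 4-cliques joined by one edge, the sign pattern cleanly separates the two cliques, which is the good case (a) of Figure 4.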

**Figure 4. Each small oval is a spectral cluster. They are well-connected internally, with sparse cuts to the outside. The large oval is the rest of the graph, which can look like anything. Cut (a) represents a good low-conductance cut, which makes much progress (cutting the graph in roughly half) while not losing any mass from any cluster. Cut (b) is also a low-conductance cut as long as the number of small ovals is large, since then cutting one cluster in half has negligible effect on the cut's conductance. However, (b) is problematic since recursing on both sides loses half of one cluster forever.**

Our approach is as follows. Like the above, we find a low-conductance cut, then recurse on both sides. However, before recursing on both sides we make certain "improvements" to the cut. We say *A* ⊆ *V* is *closed* in *G* if there is no vertex *v* ∈ *G*\*A* with at least 5/9ths of its neighbors in *A*. Our algorithm maintains that all recursive subcalls are to closed subsets in *G* as follows. Suppose we are calling CUTGRABCLOSE on some set *A* which we inductively know is closed in *G*. We first try to find a low-conductance cut within *A*. If we do not find one, we terminate and let *A* be one of the sets in the partition. Otherwise, if we cut *A* into (*S*, *S̄*), then we close both *S*, *S̄* by finding vertices violating closure and simply moving them. It can be shown that if the (*S*, *S̄*) cut had sufficiently low conductance, then these local moves can only improve conductance further. Now both *S* and *S̄* are closed in *A* (which by a transitivity lemma we show, implies they are closed in *G* as well). We then show that if (1) some set *S* is closed, and (2) *S* has much more than half the volume of some spectral cluster *W* (e.g., a 2/3rds fraction), then in fact *S* contains a (1 − *O*(ε))-fraction of *W*. Thus after closing both *S*, *S̄*, we have that *S* either: (a) has almost none of *W*, (b) has almost all of *W*, or (c) has roughly half of *W* (between 1/3 and 2/3, say). To fix the latter case, we then "grab" all vertices in *S̄* with some Ω(1)-fraction, for example 1/6th, of their neighbors in *S* and simply move them all to *S*. Doing this some constant number of times implies *S* has much more than 2/3rds of *W* (and if *S* was in case (a), then we show it still has almost none of *W*). Then by doing another round of closure moves, one can ensure that both *S*, *S̄* are closed, and each of them has either an *O*(ε)-fraction of *W* or a (1 − *O*(ε))-fraction.
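A one-sided sketch of the closure moves follows (moving outside vertices in only; the algorithm in the paper also moves vertices in the other direction and interleaves the "grab" steps):

```python
def close(adj, S):
    """Closure moves, one direction only: repeatedly pull into S any
    outside vertex with at least 5/9 of its neighbors already in S,
    as in the closure condition of CUTGRABCLOSE (toy version)."""
    S = set(S)
    changed = True
    while changed:
        changed = False
        for v in adj:
            if v not in S and adj[v]:
                inside = sum(1 for u in adj[v] if u in S)
                if inside / len(adj[v]) >= 5 / 9:
                    S.add(v)
                    changed = True
    return S
```

On the two-clique example, starting from three vertices of one clique, closure pulls in the clique's fourth vertex (3 of its 4 neighbors are inside) but nothing from the other clique, so the cluster is repaired without leaking across the sparse cut.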
It is worth noting that our algorithm can make use of *any* spectral cutting algorithm as a black box and not just Fiedler cuts, followed by our grab and closure steps. For example, the algorithms of Orecchia et al.^{14,15} run in nearly linear time and either (1) report that no cut of conductance at most ε exists (in which case we could terminate), (2) find a *balanced* cut of conductance *O*(√ε) (where both sides have nearly equal volume), or (3) find an *O*(√ε)-conductance cut in which every *W* ⊆ *G* with *vol*(*W*) ≤ (1/2) · *vol*(*G*) and Φ(*W*) ≤ *O*(ε) has more than half its volume on the smaller side of the cut. Item (2), if it always occurred, would give a divide-and-conquer recurrence yielding nearly linear time for finding all clusters. It turns out item (3) though is even better! If the small side of the cut has half of every cluster *W*, then by grabs and closure moves we could ensure it is still small and has almost all of *W*, so we could recurse just on the smaller side.

As a result we complete the cluster-preserving clustering in time polynomial in the graph size, which is *poly*(log *n*), and this allows us to find the *k* heavy hitters whp in *O*(*k* · *poly*(log *n*)) time. For a full description of the algorithm, the reader is referred to Larsen et al.^{11}

We thank Noga Alon for pointing us to Alon and Chung^{1} (Lemma 2.3), Piotr Indyk for pointing out the reference Gilbert et al.,^{8} Yi Li for answering several questions about that work, Mary Wootters for making us aware of the formulation of the list-recovery problem in coding theory and its appearance in prior work in compressed sensing and group testing, and Fan Chung Graham and Olivia Simpson for useful conversations about graph partitioning algorithms.

1. Alon, N., Chung, F.R.K. Explicit construction of linear sized tolerant networks. *Discrete Math.* 72 (1988), 15–19.

2. Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D. An information statistics approach to data stream and communication complexity. *J. Comput. Syst. Sci.* 68, 4 (2004), 702–732.

3. Braverman, V., Chestnut, S.R., Ivkin, N., Nelson, J., Wang, Z., Woodruff, D.P. BPTree: An ℓ_{2} heavy hitters algorithm using constant memory. In *Proceedings of the 36th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS)* (2017), ACM, Chicago, IL, 361–376.

4. Braverman, V., Chestnut, S.R., Ivkin, N., Woodruff, D.P. Beating CountSketch for heavy hitters in insertion streams. In *Proceedings of the 48th STOC* (2016), ACM, Cambridge, MA.

5. Charikar, M., Chen, K., Farach-Colton, M. Finding frequent items in data streams. *Theor. Comput. Sci.* 312, 1 (2004), 3–15.

6. Cormode, G., Hadjieleftheriou, M. Finding frequent items in data streams. *PVLDB* 1, 2 (2008), 1530–1541.

7. Cormode, G., Muthukrishnan, S. An improved data stream summary: The count-min sketch and its applications. *J. Algorithms* 55, 1 (2005), 58–75.

8. Gilbert, A.C., Li, Y., Porat, E., Strauss, M.J. For-all sparse recovery in near-optimal time. In *Proceedings of the 41st ICALP* (2014), Springer, Copenhagen, Denmark, 538–550.

9. Jowhari, H., Sağlam, M., Tardos, G. Tight bounds for Lp samplers, finding duplicates in streams, and related problems. In *Proceedings of the 30th PODS* (2011), ACM, Athens, Greece, 49–58.

10. Kannan, R., Vempala, S., Vetta, A. On clusterings: Good, bad and spectral. *J. ACM* 51, 3 (2004), 497–515.

11. Larsen, K.G., Nelson, J., Nguyễn, H.L., Thorup, M. Heavy hitters via cluster-preserving clustering. *CoRR*, abs/1511.01111 (2016).

12. Metwally, A., Agrawal, D., El Abbadi, A. Efficient computation of frequent and top-k elements in data streams. In *Proceedings of the 10th ICDT* (2005), Springer, Edinburgh, UK, 398–412.

13. Misra, J., Gries, D. Finding repeated elements. *Sci. Comput. Program.* 2, 2 (1982), 143–152.

14. Orecchia, L., Sachdeva, S., Vishnoi, N.K. Approximating the exponential, the Lanczos method and an *Õ*(*m*)-time spectral algorithm for balanced separator. In *Proceedings of the 44th STOC* (2012), 1141–1160.

15. Orecchia, L., Vishnoi, N.K. Towards an SDP-based approach to spectral methods: A nearly-linear-time algorithm for graph partitioning and decomposition. In *Proceedings of the 22nd SODA* (2011), SIAM, San Francisco, CA, 532–545.

16. Spielman, D.A. Linear-time encodable and decodable error-correcting codes. *IEEE Trans. Information Theory* 42, 6 (1996), 1723–1731.

* Supported by Center for Massive Data Algorithmics, a Center of the Danish National Research Foundation, grant DNRF84, a Villum Young Investigator Grant and an AUFF Starting Grant.

Supported by NSF grant IIS-1447471 and CAREER award CCF-1350670, ONR Young Investigator award N00014-15-1-2388 and DORECG award N00014-17-1-2127, an Alfred P. Sloan Research Fellowship, and a Google Faculty Research Award.

Supported by NSF CAREER award CCF-1750716.

§ Supported by his Advanced Grant DFF-0602-02499B from the Danish Council for Independent Research and by his Investigator Grant 16582, Basic Algorithms Research Copenhagen (BARC), from the VILLUM Foundation.

The preliminary version of this paper was published in IEEE FOCS 2016.

**©2019 ACM 0001-0782/19/08**

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

