Computing Applications Latin America Regional Special Section: Hot Topics

Three Success Stories About Compact Data Structures

By Diego Arroyuelo, José Fuentes-Sepülveda, and Diego Seco

Posted Nov 1 2020

Article
References
Authors
Footnotes

Technology evolution is no longer keeping pace with the growth of data. We are facing problems storing and processing the huge amounts of data produced every day. People rely on data-intensive applications and new paradigms (for example, edge computing) to try to keep computation closer to where data is produced and needed. Thus, the need to store and query data in devices where capacity is surpassed by data volume is routine today, ranging from astronomy data to be processed by supercomputers, to personal data to be processed by wearable sensors. The scale is different, yet the underlying problem is the same.

Keeping data structures in fast memory has been the major workhorse of compact data structures (CDS), with enormous practical benefits, such as speeding up data processing and querying.¹⁰ This has situated CDS at the forefront of research in data structures over the last 20 years. Remarkable contributions have been made, many of them by Latin American researchers, forming an active and prolific community spread mainly over three Chilean universities, and branching from there to other universities in Chile and Latin America as well as to Europe, the U.S., Canada, East Asia, and Australia. The senior researcher at Universidad de Chile is Gonzalo Navarro; the groups at Universidad Tecnica Federico Santa María and Universidad de Concepción are integrated by Diego Arroyuelo, José Fuentes-Sepúlveda, and Diego Seco. Here, we describe three success stories with strong roots in the region.

The need to store and query data in devices where capacity is surpassed by data volume is routine today.

Trees (particularly ordinal, cardinal, and binary trees) are a paradigmatic example of CDS success. After Jacobson’s seminal work,⁹ several ordinal tree CDS were presented, most achieving the 2n-bit lower bound for an n-node tree: only 2-bits per node, up to 32x smaller than a pointer-based representation! Within this bound, researchers supported, progressively, more tree operations. The most outstanding result in this line supports a striking set of over 25 operations in constant time for static ordinal trees (sublogarithmic for dynamic trees).¹¹ The key was the ability to translate all tree operations into operations on a sequence of 2n balanced parentheses (as former approaches), which are in turn supported by a single data structure of sublinear (o(n)-bit) additional space: the so-called range min-max tree.¹¹ This represented remarkable progress, as former approaches typically needed a separate data structure per operation, which is impractical. Indeed, this approach is practical: about 2.1-bits per tree node on average,¹ with operation times within a few microseconds. For dynamic k-ary cardinal trees, the best result uses space close to the optimal, supporting a set of 13 operations in constant time for k = O(log n)² (sublogarithmic time for general k). Besides its independent theoretical interest, this motivated new results in related areas, for example, the construction of compressed text indexes.³ For dynamic binary trees, one can use optimal space and support tree operations in constant time (including insertions/deletions), while associating values to tree nodes space-efficiently.² Both features were not achieved jointly before. On a related line, the progress on balanced parenthesis sequences¹¹ led to compact representations of planar graphs of e edges using only 4e + o(e) bits,⁶ with direct applications in geographic information systems (GIS).⁷

The second success story is that of k²-trees,⁴ a compact representation of well-known quadtrees. Although originally designed to support fast navigation on Web graphs, their versatility allows one to support operations like variants of range queries on grid points, and use them in domains such as binary relations, RDF databases, raster and trajectory data on GIS, and OLAP cubes in data warehouses. Even though k²-trees do not provide good theoretical guarantees, they usually perform well in practice. In a nutshell, they exploit large empty areas that arise in the adjacency matrix of usual graphs (or any binary matrix, in general), requiring less space when the binary matrix is clustered. For Web graphs, they require 1–3-bits per graph edge (that is, per non-empty entry in the matrix), while supporting usual graph operations efficiently. A remarkable contribution was the support of direct and reverse neighbors within such space, whereas previous approaches needed to double the space. Conceptually, a k²-tree is obtained by recursively partitioning the matrix into k² submatrices, which for k=2 resembles the quadtree partitioning. The tree obtained from this hierarchical subdivision is represented level-wise using bitmaps, avoiding the use of pointers. Operations on the tree are simulated using fast operations on the bitmaps. This also facilitates the addition of data to the tree nodes, which is used to augment the basic structure with, for example, integer values of a raster or OLAP cube. k2-trees have been used recently to support worst-case optimal joins in graph databases.¹²

Figure. South American map graph.

The third success story regards the synergy between CDS and bioinformatics,⁵ evidenced by an international collaboration project,^a where Latin American researchers played a central role. In this context, the research on indexing highly repetitive text collections has provided outstanding results. These collections can be found naturally in genomic databases (storing genome sequences of individuals of the same species), versioned text collections (for example, Wikipedia), and software repositories (for example, GitHub). Classical compressed text indexes, such as FM-indexes, provide search functionalities using space close to the statistical entropy of the text collection, being insensitive to repetitiveness. Recently, a novel index, called r-index,⁸ was proposed to exploit text repetitiveness. It uses space close to r, the number of runs (maximal substring consisting of a single character) of the Burrows-Wheeler Transform¹⁰ of the text. A highly repetitive text yields a small r. Within O(r) space, the r-index supports a complete set of search functionalities in O(log (n/r)) time, n being the text length. Compared to the state-of-the-art short-read aligner Bowtie,^b the r-index typically uses less than 0.1 bits per base, whereas Bowtie uses 4-bits per base, both having comparable running times. The r-index is expected to be a breakthrough in the development of a new generation of space-efficient bioinformatic tools.

Today, space-efficient algorithms and data structures have become a must. Initially unsuspected results have been accomplished over the years. The introduction of the Internet of Things, scientific initiatives (for example, particle colliders and astronomical observatories), seismic/volcano monitoring systems, edge computing (which involves semi-autonomous devices with limited memory), and smartphone applications, among others, impose challenges to CDS. For instance, a new observatory in northern Chile is expected to generate 20 terabytes of data every night. This challenges the ability to build CDS for huge amounts of data and handle data in streaming mode (so it can be queried as it is received). Similarly, supporting multi-petabyte catalogs of satellite imagery and geospatial datasets, like the one by Google Earth Engine, is a challenge for raster representations based on CDS. Finally, since CDS aim at reducing memory transfers, they might help to design energy-aware algorithms. The maturity of the field will be key to face the data flood.

Submit an Article to CACM

CACM welcomes unsolicited submissions on topics of relevance and value to the computing community.

You Just Read

Three Success Stories About Compact Data Structures

View in the ACM Digital Library

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI

10.1145/3416971

November 2020 Issue

Published: November 1, 2020

Vol. 63 No. 11

Pages: 64-65

Table of Contents

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Explore More

BLOG@CACM Apr 26 2024

Optimizing Energy Efficiency in Datacenters with Advanced Cooling Technologies

Alex Williams

Architecture and Hardware

Credit: Getty Images Servers in snowy setting.

News Apr 23 2024

Maximizing Power Grid Security

R. Colin Johnson

Security and Privacy

News Apr 18 2024

Keeping AI Out of Elections

Bennie Mols

Artificial Intelligence and Machine Learning

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More

Three Success Stories About Compact Data Structures

DOI

November 2020 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.