Technology evolution is no longer keeping pace with the growth of data. We are facing problems storing and processing the huge amounts of data produced every day. People rely on data-intensive applications and new paradigms (for example, edge computing) to try to keep computation closer to where data is produced and needed. Thus, the need to store and query data in devices where capacity is surpassed by data volume is routine today, ranging from astronomy data to be processed by supercomputers, to personal data to be processed by wearable sensors. The scale is different, yet the underlying problem is the same.
Keeping data structures in fast memory has been the major workhorse of compact data structures (CDS), with enormous practical benefits, such as speeding up data processing and querying.10 This has situated CDS at the forefront of research in data structures over the last 20 years. Remarkable contributions have been made, many of them by Latin American researchers, forming an active and prolific community spread mainly over three Chilean universities, and branching from there to other universities in Chile and Latin America as well as to Europe, the U.S., Canada, East Asia, and Australia. The senior researcher at Universidad de Chile is Gonzalo Navarro; the groups at Universidad Tecnica Federico Santa María and Universidad de Concepción are integrated by Diego Arroyuelo, José Fuentes-Sepúlveda, and Diego Seco. Here, we describe three success stories with strong roots in the region.
The need to store and query data in devices where capacity is surpassed by data volume is routine today.
Trees (particularly ordinal, cardinal, and binary trees) are a paradigmatic example of CDS success. After Jacobson's seminal work,9 several ordinal tree CDS were presented, most achieving the 2n-bit lower bound for an n-node tree: only 2-bits per node, up to 32x smaller than a pointer-based representation! Within this bound, researchers supported, progressively, more tree operations. The most outstanding result in this line supports a striking set of over 25 operations in constant time for static ordinal trees (sublogarithmic for dynamic trees).11 The key was the ability to translate all tree operations into operations on a sequence of 2n balanced parentheses (as former approaches), which are in turn supported by a single data structure of sublinear (o(n)-bit) additional space: the so-called range min-max tree.11 This represented remarkable progress, as former approaches typically needed a separate data structure per operation, which is impractical. Indeed, this approach is practical: about 2.1-bits per tree node on average,1 with operation times within a few microseconds. For dynamic k-ary cardinal trees, the best result uses space close to the optimal, supporting a set of 13 operations in constant time for k = O(log n)2 (sublogarithmic time for general k). Besides its independent theoretical interest, this motivated new results in related areas, for example, the construction of compressed text indexes.3 For dynamic binary trees, one can use optimal space and support tree operations in constant time (including insertions/deletions), while associating values to tree nodes space-efficiently.2 Both features were not achieved jointly before. On a related line, the progress on balanced parenthesis sequences11 led to compact representations of planar graphs of e edges using only 4e + o(e) bits,6 with direct applications in geographic information systems (GIS).7
The second success story is that of k2-trees,4 a compact representation of well-known quadtrees. Although originally designed to support fast navigation on Web graphs, their versatility allows one to support operations like variants of range queries on grid points, and use them in domains such as binary relations, RDF databases, raster and trajectory data on GIS, and OLAP cubes in data warehouses. Even though k2-trees do not provide good theoretical guarantees, they usually perform well in practice. In a nutshell, they exploit large empty areas that arise in the adjacency matrix of usual graphs (or any binary matrix, in general), requiring less space when the binary matrix is clustered. For Web graphs, they require 1–3-bits per graph edge (that is, per non-empty entry in the matrix), while supporting usual graph operations efficiently. A remarkable contribution was the support of direct and reverse neighbors within such space, whereas previous approaches needed to double the space. Conceptually, a k2-tree is obtained by recursively partitioning the matrix into k2 submatrices, which for k=2 resembles the quadtree partitioning. The tree obtained from this hierarchical subdivision is represented level-wise using bitmaps, avoiding the use of pointers. Operations on the tree are simulated using fast operations on the bitmaps. This also facilitates the addition of data to the tree nodes, which is used to augment the basic structure with, for example, integer values of a raster or OLAP cube. k2-trees have been used recently to support worst-case optimal joins in graph databases.12
The third success story regards the synergy between CDS and bioinformatics,5 evidenced by an international collaboration project,a where Latin American researchers played a central role. In this context, the research on indexing highly repetitive text collections has provided outstanding results. These collections can be found naturally in genomic databases (storing genome sequences of individuals of the same species), versioned text collections (for example, Wikipedia), and software repositories (for example, GitHub). Classical compressed text indexes, such as FM-indexes, provide search functionalities using space close to the statistical entropy of the text collection, being insensitive to repetitiveness. Recently, a novel index, called r-index,8 was proposed to exploit text repetitiveness. It uses space close to r, the number of runs (maximal substring consisting of a single character) of the Burrows-Wheeler Transform10 of the text. A highly repetitive text yields a small r. Within O(r) space, the r-index supports a complete set of search functionalities in O(log (n/r)) time, n being the text length. Compared to the state-of-the-art short-read aligner Bowtie,b the r-index typically uses less than 0.1 bits per base, whereas Bowtie uses 4-bits per base, both having comparable running times. The r-index is expected to be a breakthrough in the development of a new generation of space-efficient bioinformatic tools.
Today, space-efficient algorithms and data structures have become a must. Initially unsuspected results have been accomplished over the years. The introduction of the Internet of Things, scientific initiatives (for example, particle colliders and astronomical observatories), seismic/volcano monitoring systems, edge computing (which involves semi-autonomous devices with limited memory), and smartphone applications, among others, impose challenges to CDS. For instance, a new observatory in northern Chile is expected to generate 20 terabytes of data every night. This challenges the ability to build CDS for huge amounts of data and handle data in streaming mode (so it can be queried as it is received). Similarly, supporting multi-petabyte catalogs of satellite imagery and geospatial datasets, like the one by Google Earth Engine, is a challenge for raster representations based on CDS. Finally, since CDS aim at reducing memory transfers, they might help to design energy-aware algorithms. The maturity of the field will be key to face the data flood.
12. Navarro, G., Reutter, J.L. and Rojas-Ledesma, J. Optimal joins using compact data structures. In Proceedings of the Intern. Conf. Database Theory Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020, 21:1–21:21.
a. BIRDS Project: Bioinformatics and Information Retrieval Data Structures Analysis and Design; http://www.birdsproject.eu.
b. Bowtie: An ultrafast memory-efficient short read aligner; http://bowtie-bio.sf.net/.
©2020 ACM 0001-0782/20/11
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from email@example.com or fax (212) 869-0481.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2020 ACM, Inc.
No entries found