Corporations today collect data at an unprecedented and accelerating scale, making the need to run queries on large datasets increasingly important. Technologies such as columnar block-based data organization and compression have become standard practice in most commercial database systems. However, the problem of best assigning records to data blocks on storage is still open. For example, today's systems usually partition data by arrival time into row groups, or range/hash partition the data based on selected fields. For a given workload, however, such techniques are unable to optimize for the important metric of the number of blocks accessed by a query. This metric directly relates to the I/O cost, and therefore performance, of most analytical queries. Further, they are unable to exploit additional available storage to drive this metric down further. In this paper, we propose a new framework called a query-data routing tree, or qd-tree, to address this problem, and propose two algorithms for their construction based on greedy and deep reinforcement learning techniques. Experiments over benchmark and real workloads show that a qd-tree can provide physical speedups of more than an order of magnitude compared to current blocking schemes, and can reach within 2X of the lower bound for data skipping based on selectivity, while providing complete semantic descriptions of created blocks.2020-06-11
The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take O(nłog n) work for two dimensions, sub-quadratic work for three or more dimensions, and can be computed approximately in linear work for any constant number of dimensions. However, existing parallel DBSCAN algorithms require quadratic work in the worst case. This paper bridges the gap between theory and practice of parallel DBSCAN by presenting new parallel algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the work bounds of their sequential counterparts, and are highly parallel (polylogarithmic depth). We present implementations of our algorithms along with optimizations that improve their practical performance. We perform a comprehensive experimental evaluation of our algorithms on a variety of datasets and parameter settings. Our experiments on a 36-core machine with two-way hyper-threading show that our implementations outperform existing parallel implementations by up to several orders of magnitude, and achieve speedups of up to 33x over the best sequential algorithms.2020-06-11
Locality-Sensitive Hashing (LSH) is one of the most popular methods for c-Approximate Nearest Neighbor Search (c-ANNS) in high-dimensional spaces. In this paper, we propose a novel LSH scheme based on the Longest Circular Co-Substring (LCCS) search framework (LCCS-LSH) with a theoretical guarantee. We introduce a novel concept of LCCS and a new data structure named Circular Shift Array (CSA) for k-LCCS search. The insight of LCCS search framework is that close data objects will have a longer LCCS than the far-apart ones with high probability. LCCS-LSH is LSH-family-independent, and it supports c-ANNS with different kinds of distance metrics. We also introduce a multi-probe version of LCCS-LSH and conduct extensive experiments over five real-life datasets. The experimental results demonstrate that LCCS-LSH outperforms state-of-the-art LSH schemes.2020-06-11
Posit is a recently proposed alternative to the floating point representation (FP). It provides tapered accuracy. Given a fixed number of bits, the posit representation can provide better precision for some numbers compared to FP, which has generated significant interest in numerous domains. Being a representation with tapered accuracy, it can introduce high rounding errors for numbers outside the above golden zone. Programmers currently lack tools to detect and debug errors while programming with posits.
This paper presents PositDebug, a compile-time instrumentation that performs shadow execution with high precision values to detect various errors in computation using posits. To assist the programmer in debugging the reported error, PositDebug also provides directed acyclic graphs of instructions, which are likely responsible for the error. A contribution of this paper is the design of the metadata per memory location for shadow execution that enables productive debugging of errors with long-running programs. We have used PositDebug to detect and debug errors in various numerical applications written using posits. To demonstrate that these ideas are applicable even for FP programs, we have built a shadow execution framework for FP programs that is an order of magnitude faster than Herbgrind.2020-06-11
In spatial query processing, the popular index R-tree may incur large storage consumption and high IO cost. Inspired by the recent learned index  that replaces B-tree with machine learning models, we study an analogy problem for spatial data. We propose a novel Learned Index structure for Spatial dAta (LISA for short). Its core idea is to use machine learning models, through several steps, to generate searchable data layout in disk pages for an arbitrary spatial dataset. In particular, LISA consists of a mapping function that maps spatial keys (points) into 1-dimensional mapped values, a learned shard prediction function that partitions the mapped space into shards, and a series of local models that organize shards into pages. Based on LISA, a range query algorithm is designed, followed by a lattice regression model that enables us to convert a KNN query to range queries. Algorithms are also designed for LISA to handle data updates. Extensive experiments demonstrate that LISA clearly outperforms R-tree and other alternatives in terms of storage consumption and IO cost for queries. Moreover, LISA can handle data insertions and deletions efficiently.2020-06-11
Similarity search is an important and challenging problem that is typically modeled as nearest neighbor search in high dimensional space, where objects are represented as high dimensional vectors and their (dis)similarity is evaluated using a distance measure such as the Euclidean distance.2020-06-11
Given an integer array A[1..n], the Range Minimum Query problem (RMQ) asks to preprocess A into a data structure, supporting RMQ queries: given a,b∈ [1,n], return the index i∈[a,b] that minimizes A[i], i.e., argmini∈[a,b] A[i]. This problem has a classic solution using O(n) space and O(1) query time by Gabow, Bentley, Tarjan (STOC, 1984) and Harel, Tarjan (SICOMP, 1984). The best known data structure by Fischer, Heun (SICOMP, 2011) and Navarro, Sadakane (TALG, 2014) uses 2n+n/(logn/t)t+Õ(n3/4) bits and answers queries in O(t) time, assuming the word-size is w=Θ(logn). In particular, it uses 2n+n/polylogn bits of space as long as the query time is a constant.
In this paper, we prove the first lower bound for this problem, showing that 2n+n/polylogn space is necessary for constant query time. In general, we show that if the data structure has query time O(t), then it must use at least 2n+n/(logn)Õ(t2) space, in the cell-probe model with word-size w=Θ(logn).2020-06-08
We address the problem of performing efficient spatial and topological queries on large tetrahedral meshes with arbitrary topology and complex boundaries. Such meshes arise in several application domains, such as 3D Geographic Information Systems (GISs), scientific visualization, and finite element analysis. To this aim, we propose Tetrahedral trees, a family of spatial indexes based on a nested space subdivision (an octree or a kD-tree) and defined by several different subdivision criteria. We provide efficient algorithms for spatial and topological queries on Tetrahedral trees and compare to state-of-the-art approaches. Our results indicate that Tetrahedral trees are an improvement over R*-trees for querying tetrahedral meshes; they are more compact, faster in many queries, and stable at variations of construction thresholds. They also support spatial queries on more general domains than topological data structures, which explicitly encode adjacency information for efficient navigation but have difficulties with domains with a non-trivial geometric or topological shape.2020-06-03
Self-monitoring devices are becoming increasingly popular in the support of physical activity experiences. These devices mostly represent on-screen data using numbers and graphs and in doing so, they may miss multi-sensorial methods for engaging with data. Embracing the opportunity for pleasurable interactions with one's own data through the use of different materials and digital fabrication technology, we designed and studied three systems that turn this data into 3D-printed plastic artifacts, sports drinks, and 3D-printed chocolate treats. We utilize the insights gained from associated studies, related literature, and our experiences in designing these systems to develop a conceptual framework, “Shelfie.” The “Shelfie” framework has 13 cards that convey key themes for creating material representations of physical activity data. Through this framework, we present a conceptual understanding of relationships between material representation and physical activity data and contribute guidelines to the design of meaningful material representations of physical activity data.2020-05-31