Many standard problems in computational geometry have been solved asymptotically optimally as far as comparison-based algorithms are concerned, but there has been little work focusing on improving the constant factors hidden in big-Oh bounds on the number of comparisons needed. In this paper, we consider orthogonal-type problems and present a number of results that achieve optimality in the constant factors of the leading terms, including:
• an algorithm for the 2D maxima problem that uses n lg h +O(n√lg h) comparisons, where h denotes the output size;
• a randomized algorithm for the 3D maxima problem that uses n lg h + O(n lg2/3 h) expected number of comparisons;
• a randomized algorithm for detecting intersections among a set of orthogonal line segments that uses n lg n + O(n√lg n) expected number of comparisons;
• a data structure for point location among 3D disjoint axis-parallel boxes that can answer queries in (3/2) lg n + O(lg lg n) comparisons;
• a data structure for point location in a 3D box subdivision that can answer queries in (4/3)lg n + O(√lgn) comparisons.
Some of the results can be adapted to solve nonorthogonal problems, such as 2D convex hulls and general line segment intersection.
Our algorithms and data structures use a variety of techniques, including Seidel and Adamy's planar point location method, weighted binary search, and height-optimal BSP trees.
2014-06-08Recently, Arya, da Fonseca, and Mount [STOC 2011, SODA 2012] made notable progress in improving the &epsis;-dependencies in the space/query-time tradeoffs for (1 + &epsis;)-factor approximate nearest neighbor search in fixed-dimensional Euclidean spaces. However, &epsis;-dependencies in the preprocessing time were not considered, and so their data structures cannot be used to derive faster algorithms for offline proximity problems. Known algorithms for many such problems, including approximate bichromatic closest pair (BCP) and approximate Euclidean minimum spanning trees (EMST), typically have factors near (1/&epsis;)d/2±O(1) in the running time when the dimension d is a constant.
We describe a technique that breaks the (1/&epsis;)d/2 barrier and yields new results for many well-known proximity problems, including:
• an O((1/&epsis;)d/3+O(1) n)-time randomized algorithm for approximate BCP,
• an O((1/&epsis;)d/3+O(1) n log n)-time algorithm for approximate EMST, and
• an O(n log n + (1/&epsis;)d/3+O(1) n)-time algorithm to answer n approximate nearest neighbor queries on n points.
Using additional bit-packing tricks, we can shave off the log n factor for EMST, and even move most of the &epsis;-factors to a sublinear term.
The improvement arises from a new time bound for exact "discrete Voronoi diagrams", which were previously used in the construction of &epsis;-kernels (or extent-based coresets), a well-known tool for another class of fundamental problems. This connection leads to more results, including:
• a streaming algorithm to maintain an approximate diameter in O((1/&epsis;)d/3+O(1)) time per point using O((1/&epsis;)d/2+O(1)) space, and
• a streaming algorithm to maintain an &epsis;-kernel in O((1/&epsis;)d/4+O(1)) time per point using O((1/&epsis;)d/2+O(1)) space.
2014-06-08Persuasive technology to motivate healthy behavior is a growing area of research within HCI and ubiquitous computing. The emergence of commercial wearable devices for tracking health- and fitness-related activities arguably represents the first widespread adoption of dedicated ubiquitous persuasive technology. The recent ubiquity of commercial systems allows us to learn about their value and use in truly "in the wild" contexts and understand how practices evolve over long-term, naturalistic use. We present a study with 30 participants who had adopted wearable activity-tracking devices of their own volition and had continued to use them for between 3 and 54 months. The findings, which both support and contrast with those of previous research, paint a picture of the evolving benefits and practices surrounding these emerging technologies over long periods of use. They also serve as the basis for design implications for personal informatics technologies for long-term health and fitness support.
2014-04-26Stencil computations are a common class of operations that appear in many computational scientific and engineering applications. Stencil computations often benefit from compile-time analysis, exploiting data-locality, and parallelism. Post-processing of discontinuous Galerkin (dG) simulation solutions with B-spline kernels is an example of a numerical method which requires evaluating computationally intensive stencil operations over a mesh. Previous work on stencil computations has focused on structured meshes, while giving little attention to unstructured meshes. Performing stencil operations over an unstructured mesh requires sampling of heterogeneous elements which often leads to inefficient memory access patterns and limits data locality/reuse. In this paper, we present an efficient method for performing stencil computations over unstructured meshes which increases data-locality and cache efficiency, and a scalable approach for stencil tiling and concurrent execution. We provide experimental results in the context of post-processing of dG solutions that demonstrate the effectiveness of our approach.
2013-11-17Document similarity tasks arise in many areas of information retrieval and natural language processing. A fundamental question when comparing documents is which representation to use. Topic models, which have served as versatile tools for exploratory data analysis and visualization, represent documents as probability distributions over latent topics. Systems comparing topic distributions thus use measures of probability divergence such as Kullback-Leibler, Jensen-Shannon, or Hellinger. This paper presents novel analysis and applications of the reduction of Hellinger divergence to Euclidean distance computations. This reduction allows us to exploit fast approximate nearest-neighbor (NN) techniques, such as locality-sensitive hashing (LSH) and approximate search in k-d trees, for search in the probability simplex. We demonstrate the effectiveness and efficiency of this approach on two tasks using latent Dirichlet allocation (LDA) document representations: discovering relationships between National Institutes of Health (NIH) grants and prior-art retrieval for patents. Evaluation on these tasks and on synthetic data shows that both Euclidean LSH and approximate k-d tree search perform well when a single nearest neighbor must be found. When a larger set of similar documents is to be retrieved, the k-d tree approach is more effective and efficient.
2013-09-29The aim of this work is to describe an exploratory study on the use of a SAX-based Multiresolution Motif Discovery method for Heart Sound Classification. The idea of our work is to discover relevant frequent motifs in the audio signals and use the discovered motifs and their frequency as characterizing attributes. We also describe different configurations of motif discovery for defining attributes and compare the use of a decision tree based algorithm with random forests on this kind of data. Experiments were performed with a dataset obtained from a clinic trial in hospitals using the digital stethoscope DigiScope. This exploratory study suggests that motifs contain valuable information that can be further exploited for Heart Sound Classification.
2013-07-10We answer a basic data structuring question (e.g., raised by Dietz and Raman [1991]): Can van Emde Boas trees be made persistent, without changing their asymptotic query/update time? We present a (partially) persistent data structure that supports predecessor search in a set of integers in {1, ..., U} under an arbitrary sequence of n insertions and deletions, with O(log log U) expected query time and expected amortized update time, and O(n) space. The query bound is optimal in U for linear-space structures and improves previous near-O((log log U)^{2}) methods.
The same method solves a fundamental problem from computational geometry: point location in orthogonal planar subdivisions (where edges are vertical or horizontal). We obtain the first static data structure achieving O(log log U) worst-case query time and linear space. This result is again optimal in U for linear-space structures and improves the previous O((log log U)^{2}) method by de Berg et al. [1995]. The same result also holds for higher-dimensional subdivisions that are orthogonal binary space partitions, and for certain nonorthogonal planar subdivisions such as triangulations without small angles. Many geometric applications follow, including improved query times for orthogonal range reporting for dimensions ≥ 3 on the RAM.
Our key technique is an interesting new van-Emde-Boas--style recursion that alternates between two strategies, both quite simple.
2013-06-01We study optimization in the presence of uncertainty such as noise in measurements, and advocate a novel approach of tackling it. The main difference to any existing approach is that we do not assume any knowledge about the nature of the uncertainty (such as for instance a probability distribution). Instead, we are given several instances of the same optimization problem as input, and, assuming they are typical w.r.t. the uncertainty, we make use of it in order to compute a solution that is good for the sample instances as well as for future (unknown) typical instances.
We demonstrate our approach for the case of two typical input instances. We first propose a measure of similarity of instances with respect to an objective. This concept allows us to assess whether instances are indeed typical. Based on this concept, we then choose a solution randomly among all solutions that are near-optimum for both instances. We show that the exact notion of near-optimum is intertwined with the proposed measure of similarity. Furthermore, we will show that our measure of similarity also allows us to derive formal statements about the expected quality of the computed solution: If the given instances are not similar, or are too noisy, our approach will detect this. We demonstrate for a few optimization problems and real world data that our approach works well not only in theory, but also in practice.
2013-01-09During the past decade, various GPS-equipped devices have generated a tremendous amount of data with time and location information, which we refer to as big spatio-temporal data. In this paper, we present the design and implementation of CloST, a scalable big spatio-temporal data storage system to support data analytics using Hadoop. The main objective of CloST is to avoid scan the whole dataset when a spatio-temporal range is given. To this end, we propose a novel data model which has special treatments on three core attributes including an object id, a location and a time. Based on this data model, CloST hierarchically partitions data using all core attributes which enables efficient parallel processing of spatio-temporal range scans. According to the data characteristics, we devise a compact storage structure which reduces the storage size by an order of magnitude. In addition, we proposes scalable bulk loading algorithms capable of incrementally adding new data into the system. We conduct our experiments using a very large GPS log dataset and the results show that CloST has fast data loading speed, desirable scalability in query processing, as well as high data compression ratio.
2012-10-29DNA fragment assembly problem is one of the crucial challenges faced by computational biologists where, given a set of DNA fragments, we have to construct a complete DNA sequence from them. As it is an NP-hard problem, accurate DNA sequence is hard to find. Moreover, due to experimental limitations, the fragments considered for assembly are exposed to additional errors while reading the fragments. In such scenarios, meta-heuristic based algorithms can come in handy. We analyze the performance of two swarm intelligence based algorithms namely Artificial Bee Colony (ABC) algorithm and Queen Bee Evolution Based on Genetic Algorithm (QEGA) to solve the fragment assembly problem and report quite promising results. Our main focus is to design meta-heuristic based techniques to efficiently handle DNA fragment assembly problem for noisy and noiseless data.
2012-07-07