Recent advancements in learned index structures propose replacing existing index structures, like B-Trees, with approximate learned models. In this work, we present a unified benchmark that compares well-tuned implementations of three learned index structures against several state-of-the-art "traditional" baselines. Using four real-world datasets, we demonstrate that learned index structures can indeed outperform non-learned indexes in read-only in-memory workloads over a dense array. We investigate the impact of caching, pipelining, dataset size, and key size. We study the performance profile of learned index structures, and build an explanation for why learned models achieve such good performance. Finally, we investigate other important properties of learned index structures, such as their performance in multi-threaded systems and their build times.2020-09-01
Clustering is a popular unsupervised data mining technique. It has been applied in various data mining and big data applications. Efficient clustering algorithms and implementation techniques are keys to cope with the scalability and performance requirements of big data analysis. This paper introduces the design and implementation of a density-based clustering algorithm that can deal with big data efficiently and effectively. We present a parallel Shared Nearest Neighbor (SNN) clustering algorithm using the k-dimensional tree (k-d tree) to reduce search time to improve efficiency. The proposed algorithm is implemented in a distributed environment using the Spark framework. The effectiveness of the proposed algorithm has been evaluated through a case study involving four data sets, Bristol Crime Stats, 911 call, Complex9, and TLC Trip datasets.2020-08-26
The massive amount of data and large variety of data distributions in the big data era call for access methods that are efficient in both query processing and index management, and over both practical and worst-case workloads. To address this need, we revisit two classic multidimensional access methods—the R-tree and the space-filling curve. We propose a novel R-tree packing strategy based on space-filling curves. This strategy produces R-trees with an asymptotically optimal I/O complexity for window queries in the worst case. Experiments show that our R-trees are highly efficient in querying both real and synthetic data of different distributions. The proposed strategy is also simple to parallelize, since it relies only on sorting. We propose a parallel algorithm for R-tree bulk-loading based on the proposed packing strategy and analyze its performance under the massively parallel communication model. To handle dynamic data updates, we further propose index update algorithms that process data insertions and deletions without compromising the optimal query I/O complexity. Experimental results confirm the effectiveness and efficiency of the proposed R-tree bulk-loading and updating algorithms over large data sets.2020-08-26
Wireless sensor networks play an important role in today's world: When measuring physical conditions, the quality of the sensor readings ultimately impacts the quality of various data analytical services. To maintain data correctness and quality, run-time measures such as anomaly detection techniques are gaining significance. In particular, the detection of threatening node anomalies caused by sensor node faults has become a crucial task.
The detection of faulty sensor nodes is a non-trivial task because wireless sensor networks typically consist of low-cost embedded systems with strictly limited resources, especially regarding their energy budget. Thus, efficient and lightweight approaches that meet the requirements of sensor networks are required.
In this SoK paper, we contribute with a novel taxonomy of anomaly detection approaches focused on wireless sensor networks and a meta-survey of related classification schemes. To the best of our knowledge, our taxonomy is a comprehensive super-set of all previously published taxonomies in this field. Based on the taxonomy, we present new insights in node-level anomaly detection approaches and the applicability of immune-inspired techniques, and we lay out related research challenges.2020-08-25
Containers are commonly used to run the data-intensive applications of different tenants in cloud infrastructures. The storage I/O of the colocated tenants is typically handled by the shared system kernel of the container host. When a data-intensive container competes with a noisy neighbor, the kernel I/O services can cause performance variability and slowdown. This is a challenging problem for which several approaches have already been explored. Although the dynamic resource allocation, kernel structure replication, and hardware-level virtualization are helpful, they incur costs of high implementation complexity and execution overhead. As a realistic cost-effective alternative, we isolate the I/O path of each tenant by running dedicated storage systems at user level on reserved resources. We introduce the libservices as a unified user-level storage abstraction to dynamically provision per tenant container root filesystems, application data filesystems and image repositories. We outline several examples of container storage systems whose clients and servers can be composed from libservices. With an early prototype, we successfully demonstrate that the libservices combine the required efficiency and flexibility to build isolated I/O services on multitenant hosts with superior performance over existing user-level or kernel-level systems.2020-08-24
In this paper, we report our recent practice at Tencent for user modeling based on mobile app usage. User behaviors on mobile app usage, including retention, installation, and uninstallation, can be a good indicator for both long-term and short-term interests of users. For example, if a user installs Snapseed recently, she might have a growing interest in photographing. Such information is valuable for numerous downstream applications, including advertising, recommendations, etc. Traditionally, user modeling from mobile app usage heavily relies on handcrafted feature engineering, which requires onerous human work for different downstream applications, and could be sub-optimal without domain experts. However, automatic user modeling based on mobile app usage faces unique challenges, including (1) retention, installation, and uninstallation are heterogeneous but need to be modeled collectively, (2) user behaviors are distributed unevenly over time, and (3) many long-tailed apps suffer from serious sparsity. In this paper, we present a tailored Auto Encoder-coupled Transformer Network (AETN), by which we overcome these challenges and achieve the goals of reducing manual efforts and boosting performance. We have deployed the model at Tencent, and both online/offline experiments from multiple domains of downstream applications have demonstrated the effectiveness of the output user embeddings.2020-08-23
One of the long-standing questions in search systems is the role of diversity in results. From a product perspective, showing diverse results provides the user with more choice and should lead to an improved experience. However, this intuition is at odds with common machine learning approaches to ranking which directly optimize the relevance of each individual item without a holistic view of the result set. In this paper, we describe our journey in tackling the problem of diversity for Airbnb search, starting from heuristic based approaches and concluding with a novel deep learning solution that produces an embedding of the entire query context by leveraging Recurrent Neural Networks (RNNs). We hope our lessons learned will prove useful to others and motivate further research in this area.2020-08-23
Accurately and globally mapping human infrastructure is an important and challenging task with applications in routing, regulation compliance monitoring, and natural disaster response management etc.. In this paper we present progress in developing an algorithmic pipeline and distributed compute system that automates the process of map creation using high resolution aerial images. Unlike previous studies, most of which use datasets that are available only in a few cities across the world, we utilizes publicly available imagery and map data, both of which cover the contiguous United States (CONUS). We approach the technical challenge of inaccurate and incomplete training data adopting state-of-the-art convolutional neural network architectures such as the U-Net and the CycleGAN to incrementally generate maps with increasingly more accurate and more complete labels of man-made infrastructure such as roads and houses. Since scaling the mapping task to CONUS calls for parallelization, we then adopted an asynchronous distributed stochastic parallel gradient descent training scheme to distribute the computational workload onto a cluster of GPUs with nearly linear speed-up.2020-08-23
A collection of related files often exhibits strong similarities among its constituents. These similarities, and the dual differences, may be used for both compressing the collection and for organizing it in a manner that reveals human-readable structure and relationships. This paper motivates and studies methods for such organizing and compression of file collections using inter-file differences. It presents an algorithm based on computing a minimum-weight spanning tree of a graph that has vertices corresponding to files and edges with weights corresponding to the size of the difference between the documents of its incident vertices. It describes the design and implementation of a prototype system called diboc (for difference-based organization and compression) that uses these methods to enable both compression and graphical organization and interactive exploration of a file collection. It illustrates the benefits of this system by presenting examples of its operation on a widely deployed and publicly available corpus of file collections (collections of PPD files used to configure the CUPS printing system as packaged by the Debian GNU/Linux distribution). In addition to these qualitative measures, some quantitative experimental results of applying the methods to the same corpus are also presented.2020-08-12
Sustainable tourism development is a key factor for tourism destination development, and evaluating the sustainable tourism development level can reflect the current situation, existing problems and deficiencies of tourism destinations. Based on constructing the evaluation index system of sustainable tourism development from four aspects, they are sustainability of tourism resources, economic feasibility, social acceptability and environmental rationality, this work applied AHP method to calculate the weight of each index, and takes Luojiang District, Deyang, Sichuan Province, China as research area. On the basis of data collection and collation, the sustainable tourism development level of Luojiang District between 2002 and 2016 was evaluated, the results show that from 2002 to 2016, the sustainable tourism development level of Luojiang District showed an overall upward trend, and can be divided into four stages: the first is the preparation stage of 2002-2011, the second is the preliminary stage of 2012-2013, the third is the basic stage of 2014-2015 and the fourth is the sustainable stage from 2016.2020-08-07