Communications of the ACM

Imbalanced big data classification: a distributed implementation of SMOTE

In machine learning, data quality is the most critical component for building good models. Predictive analytics is a branch of AI that predicts future events from historical data and is used in diverse fields such as detecting online fraud, oil slicks, intrusion attacks, and credit defaults, and the prognosis of disease cells. Unfortunately, in most of these cases, traditional learning models fail to produce the required results because the data are imbalanced. Here, imbalance means that only a small number of instances belong to the class under prediction, such as fraud instances among all online transactions. Prediction in imbalanced classification is further limited by factors such as small disjuncts, which are accentuated when data are partitioned for learning at scale. Synthetic generation of minority class data (SMOTE [1]) is a pioneering approach by Chawla [1] to offset these limitations and generate more balanced datasets. Although a standard implementation of SMOTE exists in Python, it is unavailable for distributed computing environments for large datasets. Bringing SMOTE to a distributed environment under Spark is the key motivation for our research. In this paper we present our algorithm, observations, and results for synthetic generation of minority class data under Spark using Locality Sensitivity Hashing (LSH). We successfully demonstrated a distributed version of SMOTE on Spark that generated quality artificial samples preserving spatial distribution.
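The core SMOTE idea mentioned in this abstract, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a minimal single-machine illustration only: it uses a brute-force neighbour search rather than the LSH-based search on Spark that the abstract describes, and the function name `smote`, its parameters, and the sample data are assumptions made for the example.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Hypothetical helper: brute-force neighbours, not the paper's LSH search."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Brute-force k nearest neighbours of `base` among the other samples.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Made-up minority-class samples in two dimensions.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=5)
```

Because every synthetic point lies on a line segment between two existing minority samples, the generated data stay inside the region the minority class already occupies.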

Tweeting live shows: a content analysis of live-tweets from three entertainment programs

In this paper, we explored whether (and if so, how) live-tweets vary across different entertainment television programs in terms of the tweets' content. Using the 2013 Oscars, the Season 3 finale of Downton Abbey, and the 2014 Super Bowl as case studies, we collected over 200,000 live tweets sent during these three live entertainment programs and performed a content analysis of 4,400 of them. Results indicated that live-tweets, in general, reflect the features of the entertainment programs in many ways, suggesting that practitioners should incorporate more tailored social media strategies to better engage audiences. Theoretical implications and limitations were discussed in detail.

Mobile Devices and Professional Equipment Synergies for Sport Summary Production

We present a novel approach for sport video summary production that leverages the best aspects of mobile devices and professional equipment. The proposed recording set-up and workflow, consisting of both types of devices, has two main advantages over conventional techniques. First, it reduces the cost of content production by reducing the cost of the equipment and crew required for content capture. Second, it reduces the time for content production by leveraging automation. We then present a tunable summary production approach for creating a multi-camera representation of a salient event. Incorporating cinematic rules creates an aesthetically pleasing viewing experience. Interactive production of the summary enables professional users as well as second-screen device (mobile, tablet, etc.) users to create a summary in which highly ranked salient events can be included based on their subjective viewing value. Furthermore, automation provides a framework for easy inclusion of crowdsourced content. The proposed hybrid production method is illustrated using basketball as an example.

On Higher Dimensional Window Query: Revisited using BITS-Tree

We present a dynamic index structure that uses a Balanced Inorder Threaded Segment Tree (BITS-Tree) to store orthogonal rectangles in R^2, together with an efficient method of finding rectangle intersections. The PR-tree is the best known data structure in the I/O model for this problem and performs a window query in O((N/B)^(1-1/d) + T/B) I/Os, where N is the number of d-dimensional rectangles, B is the disk block size, and T is the output size; however, dynamic insertion of rectangles requires reconstruction of the tree. In the pointer machine model, since the BITS-Tree permits dynamic updates of one-dimensional segments in logarithmic time, we store the horizontal and vertical segment information of rectangles in two separate BITS-Trees and solve the rectangle intersection problem in O(log n + k) time, where k is the maximum number of rectangles reported, with support for dynamic updates in O(log n) time. This is also extended to d-dimensional objects (d > 2) with a query time of O(d(log n + m)) where k is the maximum number of nodes of rectangles and an update time of O(d log n), which shows an improvement over the O(log^d n + m) query time of range trees, where m is the number of rectangles reported.
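The decomposition this abstract relies on, treating a rectangle as one horizontal and one vertical segment, rests on the fact that two axis-aligned rectangles intersect exactly when their segments overlap on every axis. The sketch below illustrates only that test with a linear scan; it is not the BITS-Tree, has none of its logarithmic bounds, and the names `Rect`, `segments_overlap`, and `window_query` are assumptions made for the example.

```python
from typing import NamedTuple

class Rect(NamedTuple):
    """Axis-aligned rectangle with corners (x1, y1) and (x2, y2), x1 <= x2, y1 <= y2."""
    x1: float
    y1: float
    x2: float
    y2: float

def segments_overlap(a_lo, a_hi, b_lo, b_hi):
    """True when the closed 1-D segments [a_lo, a_hi] and [b_lo, b_hi] share a point."""
    return a_lo <= b_hi and b_lo <= a_hi

def window_query(rects, window):
    """Report every rectangle whose horizontal AND vertical segments overlap the
    window's. Linear scan for illustration; an index would avoid visiting all rects."""
    return [
        r for r in rects
        if segments_overlap(r.x1, r.x2, window.x1, window.x2)
        and segments_overlap(r.y1, r.y2, window.y1, window.y2)
    ]

# Made-up rectangles and query window.
rects = [Rect(0, 0, 2, 2), Rect(3, 3, 5, 5), Rect(1, 1, 4, 4), Rect(6, 0, 7, 1)]
window = Rect(1.5, 1.5, 3.5, 3.5)
hits = window_query(rects, window)
```

Here `hits` contains the first three rectangles; the fourth fails the horizontal test, since its x-segment starts at 6, beyond the window's right edge at 3.5.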

CrowdPop: Leveraging Multi-Source Crowd-Contributed Data for App Evolutionary Pattern Analysis and Popularity Prediction

Predicting the popularity of mobile apps provides substantial value to a broad range of applications, from app development to targeted advertising. However, most previous studies approach this task by building regression models over impact factors or by using clustering and classification algorithms. These approaches do not fully investigate the process of popularity evolution and the reasons behind it. In this paper, we discuss and analyze the potential predictors, especially the impact of early evolutionary patterns on future popularity. To this end, we first explore six basic evolutionary patterns and six impact factors that are closely related to app popularity. After detailed analysis, we present CrowdPop, a popularity prediction model based on the Random Forest algorithm, which quantifies patterns and factors as predictors of popularity. Experimental results on a real-world dataset of 126 apps indicate that CrowdPop outperforms baseline methods in mobile app popularity prediction.

Design Opportunities in Three Stages of Relationship Development between Users and Self-Tracking Devices

Recently, self-tracking devices such as wearable activity trackers have become more available to end users. While these emerging products are imbued with new characteristics in terms of human-computer interaction, it is still unclear how to describe and design for user experience in such devices. In this paper, we present a three-week field study, which aimed to unfold users' experience with wearable activity trackers. Drawing from Knapp's model of interaction stages in interpersonal relationship development, we propose three stages of relationship development between users and self-tracking devices: initiation & experimentation, intensifying & integration, and stagnation & termination. We highlight the challenges in each stage and design opportunities for future self-tracking devices.

Efficient set intersection for inverted indexing

Conjunctive Boolean queries are a key component of modern information retrieval systems, especially when Web-scale repositories are being searched. A conjunctive query q is equivalent to a |q|-way intersection over ordered sets of integers, where each set represents the documents containing one of the terms, and each integer in each set is an ordinal document identifier. As is the case with many computing applications, there is tension between the way in which the data is represented, and the ways in which it is to be manipulated. In particular, the sets representing index data for typical document collections are highly compressible, but are processed using random access techniques, meaning that methods for carrying out set intersections must be alert to issues to do with access patterns and data representation. Our purpose in this article is to explore these trade-offs, by investigating intersection techniques that make use of both uncompressed “integer” representations, as well as compressed arrangements. We also propose a simple hybrid method that provides both compact storage, and also faster intersection computations for conjunctive querying than is possible even with uncompressed representations.
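A conjunctive query as a |q|-way intersection over sorted document identifiers can be sketched as follows. This is one common baseline (process the shortest lists first and binary-search each candidate in the next list) over uncompressed lists; it is not the hybrid method the abstract proposes, and the function names and posting lists are assumptions made for the example.

```python
from bisect import bisect_left

def intersect_two(short, long):
    """Intersect two sorted lists of document ids by searching `long`
    for each element of `short`, never looking behind the last position."""
    result = []
    lo = 0
    for doc_id in short:
        lo = bisect_left(long, doc_id, lo)  # search only the unvisited suffix
        if lo == len(long):
            break  # every remaining id in `short` is larger than all of `long`
        if long[lo] == doc_id:
            result.append(doc_id)
    return result

def conjunctive_query(postings):
    """Intersect all posting lists, shortest first, so the candidate
    set shrinks as early as possible."""
    if not postings:
        return []
    ordered = sorted(postings, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        result = intersect_two(result, plist)
        if not result:
            break  # no document contains every term
    return result

# Made-up posting lists for a three-term query.
postings = [[1, 3, 5, 7, 9, 11], [3, 4, 5, 9, 20], [2, 3, 9, 10]]
matches = conjunctive_query(postings)
```

The binary search is what requires random access into the lists, which is the tension with compressed representations that the abstract points to: a compressed list usually has to be decoded sequentially unless extra structure is stored.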

Fast development tools considered harmful

This article lays out an intentionally provocative case against modern fast software development tools. Modern compilers run in at most a few seconds when presented with programs of the size students typically encounter. Might this speed encourage programmers to program too much by trial and error? This article will argue that this may in fact be the case.

Personalized Behavior-Powered Systems for Guiding Self-Experiments

The goal of my research is to study how individuals perform self-experiments and to build behavior-powered systems that help them run such experiments. I have developed SleepCoacher, a sleep-tracking system that provides and evaluates the effect of actionable personalized recommendations for improving sleep. Going further, my aim is to expand beyond sleep and develop the first guided self-experimentation system, which educates users about health interventions and helps them plan and carry out their own experiments. My thesis aims to use self-experimentation to help people take better care of their well-being by uncovering the hidden causal relationships in their lives.