Crowdsourcing is a powerful new project management and procurement strategy that realizes the value of an "open call" to an unlimited pool of people, typically through Web-based technology. Our focus here is on an important form of crowdsourcing in which the crowd’s task is to generate or source data. Generally speaking, crowd-based data sourcing is invoked to obtain data, to aggregate and/or fuse data, to process data, or, more directly, to develop dedicated applications or solutions over the sourced data.
Wikipedia is probably the earliest and best-known example of crowdsourced data and an illustration of what can be achieved with a crowd-based data-sourcing model. Other examples include social tagging systems for images that harness millions of Web users to build searchable databases of tagged images, traffic information aggregators like Waze, and hotel and movie ratings like TripAdvisor and IMDb.
Crowd-based data sourcing democratizes the data-collection process, cutting companies’ and researchers’ reliance on stagnant, overused datasets, and can revolutionize our information world. But to work with the crowd, one must overcome several nontrivial challenges. These include dealing with users whose expertise and reliability vary and whose time, memory, and attention are limited, and handling data that is uncertain, subjective, and contradictory. Individual crowd platforms typically tackle these challenges in an ad hoc manner that is application-specific and rarely sharable. These challenges, along with the evident potential of crowdsourcing, have attracted the attention of the scientific community and called for the development of sound foundations and provably efficient approaches to crowdsourcing.
The crowd may be harnessed for various data-related tasks, which can generally be divided into two main types. First, the crowd can help in processing data already collected, by providing judgments and by comparing, cleaning, and matching data items. Second, the crowd can be engaged in harvesting new or missing data. An important contribution of the following paper is the observation that by using the crowd to collect new data, we depart from the classical closed world assumption that underlies traditional database systems, under which the database is considered complete at the time a query is posed. That is, it contains all the data needed to answer a user query. When the crowd can be enlisted to add new data during query processing, this assumption is violated, calling into question the meaning of even simple queries. In particular, a key question one needs to resolve when collecting data from the crowd to answer a query is: "Has all the data relevant for the query been gathered?" Consider, for example, a query that asks the crowd for the names of companies in California interested in green technology, or of child-friendly chef restaurants in New York. How (and when) can we decide that all relevant answers have indeed been collected? How can we estimate how many more answers are needed to complete the task?
A natural way to approach the problem is to view the collected crowd answers as a sample taken from some unknown underlying distribution of possible answers, and to use statistical methods to estimate the actual distribution. The authors demonstrate that when dealing with the crowd, the sampling process differs significantly from what traditional estimators for related problems assume. First, crowd members typically provide a list of answers without repetitions; in other words, each samples from an underlying distribution without replacement. Workers also might sample from different underlying distributions (for example, one might provide answers alphabetically, while others provide answers in a different order). Thus, the ordered stream of answers from the crowd may be viewed as with-replacement sampling among workers, each of whom is sampling a data distribution without replacement. Furthermore, when modeling the crowd, these distributions must account for common crowd behaviors: some workers do much more work than others; not all workers arrive at the same time; workers may have different opinions or biases. Moreover, when data is available on the Web, multiple workers may provide data in the same order (for example, by following the same directory of companies or chef restaurants in the example queries above).
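The two-level sampling process described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the authors' model: the function name `simulate_crowd_stream` and its parameters are hypothetical, every worker draws uniformly from the same item set, and all workers contribute the same number of answers. The skewed workloads, staggered arrivals, and shared answer orderings mentioned above are deliberately left out.

```python
import random


def simulate_crowd_stream(items, num_workers, answers_per_worker, seed=0):
    """Return an ordered stream of (worker_id, answer) pairs.

    Hypothetical illustration: workers are picked WITH replacement,
    while each worker's own answers are drawn WITHOUT replacement.
    Requires answers_per_worker <= len(items).
    """
    rng = random.Random(seed)

    # Each worker holds a private list of answers with no repetitions
    # (assumption: a uniform draw from the same underlying item set).
    worker_lists = [
        rng.sample(items, answers_per_worker) for _ in range(num_workers)
    ]
    positions = [0] * num_workers
    active = list(range(num_workers))
    stream = []

    while active:
        # The next answer comes from a randomly chosen active worker,
        # so the same worker can be selected many times.
        worker = rng.choice(active)
        stream.append((worker, worker_lists[worker][positions[worker]]))
        positions[worker] += 1
        if positions[worker] == answers_per_worker:
            active.remove(worker)  # this worker has no answers left

    return stream


if __name__ == "__main__":
    companies = [f"company_{i}" for i in range(20)]
    stream = simulate_crowd_stream(companies, num_workers=5, answers_per_worker=8)
    distinct = {answer for _, answer in stream}
    print(len(stream), "answers received,", len(distinct), "distinct")
```

Even in this simplified setting, no single worker ever repeats an answer, yet the combined stream contains duplicates across workers. The stream therefore matches neither a pure with-replacement sample nor a pure without-replacement sample.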
A key contribution of this paper is the development of a simple and elegant formalization of the crowdsourcing process for such queries, along with an effective technique to estimate result set size and query progress in the presence of crowd-specific behavior.