Content creation used to be an activity pursued individually or in closed circles of collaborators. Books, encyclopedias, and map collections had either a lone author or a group of authors who knew one another and worked together; it was simply too difficult to coordinate the work of geographically dispersed groups of people when the main form of communication was letters or telephone. The Internet changed all that; millions of people around the world are now able to collaborate. The first open-collaboration systems—wikis—focused on text content; the range of content that can be created collaboratively has since expanded to include video editing (such as MetaVid5), documents (such as Google Docs,a ZOHOb), architectural sketching (such as SketchUpc), and geographical maps (such as OpenStreetMap10 and MapMakerd).
- Reputation systems have both a descriptive role, providing information on content and user quality, and a prescriptive role, providing incentives for constructive behavior.
- Content-driven reputation systems derive feedback from analyzing content and interactions; they can scale to very large systems and can be made resistant to many types of bias.
- Finding “signal in the noise” of crowdsourced content is extremely challenging; a good reputation system allows users to zero in on the best content, spotting vandalism and attacks.
Open collaboration promises immense benefit, as shown by the history of Wikipedia, but is also a challenge to content creators and content consumers. For content creation, the ability and knowledge of contributors is likely to vary. Collaborative systems open to all will inevitably be subjected to spam, vandalism, and attempts to influence information. How can systems be built to encourage constructive interaction and minimize the consequences of vandalism and spam? How can construction of high-quality information be facilitated? For content consumption, visitors are presented with the outcome of a complex collaboration process. The content may result from weaving together many contributions, with authors usually not known to the visitor and possibly even anonymous. The corollary of “anybody can contribute” is “anybody could have contributed.” How can users judge how much trust to put in the information they see?
Reputation systems can help address these challenges, facilitating both content creation and content consumption. To support this claim, we describe the reputation systems we built at the University of California, Santa Cruz, and at Google for two major collaborative applications: writing articles for Wikipedia and editing business locations on Google Maps.
Even for successful sites, establishing a community of dedicated users and accumulating sufficient high-quality feedback can take years.
We describe these systems because they are designed for two well-known cooperative systems and because they represent opposite ends of the reputation-system-design spectrum. The Wikipedia reputation system WikiTrust relies on a chronological analysis of user contributions to articles, metering positive or negative increments of reputation whenever new contributions are made. Users obtain new identities at will, and there is no “ground truth” against which their contributions are compared. The reputation mechanism can be explained in simple terms to users while helping provide an incentive to provide good-quality contributions. The Google Maps system Crowdsensus compares the information provided by users on map business listings and computes both a likely reconstruction of the correct listing and a reputation value for each user. Unlike WikiTrust, users have a stable identity in the system, and their contributions can be compared with the ground truth of the real world, if desired. The reputation system operates largely in the background, working not chronologically but by iteratively refining joint estimates of user reputations and listing values.
Content-driven vs. user-driven reputation. Our systems for Wikipedia and Maps are both content-driven, relying on automated content analysis to derive user and content reputations. Other reputation systems (such as the eBay system for sellers and buyers and the Amazon and NewEgg systems of product reviews and ratings) are user-driven, based on explicit user feedback and ratings.
Content-driven systems derive their feedback from an analysis of all interactions and, consequently, get feedback from all users uniformly. User-driven systems often suffer from selection bias, as users who are particularly happy or unhappy are more likely to provide feedback or ratings. Moreover, in user-driven systems, users might do one thing and say another. Sellers and buyers might give each other high ratings simply to obtain high ratings in return, regardless of how they feel about the transaction.7 Content-driven reputation systems derive user feedback from user actions and can be more resistant to manipulation.4
Deploying user-driven and content-driven reputation systems presents different challenges. The success of a user-driven system depends on user feedback. Even for successful sites, establishing a community of dedicated users and accumulating sufficient high-quality feedback can take years. However, when useful feedback is extracted automatically from user interactions and data, content-driven reputation systems can deliver results immediately.
On the other hand, the algorithmic nature of content-driven reputation systems can play against their success, preventing users from understanding and trusting the reputation values they generate. Reading “Product A received 25 positive, 12 neutral, and two negative votes,” users understand the meaning and often trust (to some extent) the result, in spite of possible selection bias of voting users and manipulation schemes by malicious users. When an algorithm produces the answer for a Wikipedia page “This sentence has reputation 4 out of a maximum of 10,” users typically wonder how the reputation is computed and question the appropriateness of the algorithms. In reputation systems that make reputation values available to users, simpler is often better, even when the performance, in numerical terms, is worse; users need to understand the origin of reputation to be able to trust it.6,7
WikiTrust and Crowdsensus are examples of content-driven reputation systems; others analyze the wording of consumer reviews to extract reviewer and product reputation12,15 and compute Wikipedia content reputation.14 The algorithms PageRank13 and HITS11 constitute content-driven reputation systems for ranking Web pages. Beyond the Web, consumer-credit-rating agencies are an example of a content-driven reputation system in the financial world.
We developed WikiTrust,e a reputation system for wiki authors and content, to provide an incentive to give quality contributions to Wikipedia and indicate to Wikipedia visitors the quality of content. WikiTrust employs two reputation systems, one for users and one for content. Users gain reputation when they make edits that are preserved by subsequent authors and lose reputation when their work is partly or wholly undone. Text starts with no reputation and gains reputation when revised by high-reputation authors; text can lose reputation when disturbed by edits. While WikiTrust was designed for wikis, its principles are applicable to any content-management system in which the content evolves through a sequence of revisions, provided the difference between revisions is measurable.
WikiTrust is available via a Firefox browser extension. When a user visits a page of one of several Wikipedias, the browser extension displays an additional WikiTrust tab alongside the standard wiki tabs (such as edit and history). When users click on the WikiTrust tab, the extension contacts the back-end servers to obtain text-reputation information, which is visualized via the text background color: perfect-reputation text appears on a white background, and the background turns a darker shade of orange as the reputation of the text gets lower. The text color alerts viewers to content that might have been tampered with (see Figure 1). WikiTrust does not currently display user reputations, due to our desire not to alter the social experience of contributing to Wikipedia.
User reputation. User reputations are computed according to the quality and quantity of their contributions. A contribution is considered of good quality if the change it introduced is preserved in subsequent revisions.2,3,8 To evaluate the quality of a contribution that produced a revision b, WikiTrust compares b with two reference points: a past revision a and a future revision c. From the point of view of c, if b is closer than a, then the author of b did good work, since that author made changes that made the page more like how it will be in the future revision c (see Figure 2a). On the other hand, if b is farther away from c than a was, the change from a to b was not preserved in c (see Figure 2b). To capture this intuition, we define the quality q(b | a, c) of b with respect to a and c as the amount of improvement d(a, c) − d(b, c) divided by the amount of work d(a, b) involved in creating b. If the distance d satisfies the triangle inequality, then q(b | a, c) falls somewhere between −1 and +1; it is equal to −1 if a = c (so the change a → b was entirely reverted) and equal to +1 if the change a → b was entirely preserved.
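The quality measure can be sketched in a few lines of Python. The `dist` function below is only a hypothetical stand-in for WikiTrust's actual edit distance (which handles block moves and weighted replacements); any distance satisfying the triangle inequality yields the same −1 to +1 range.

```python
# Sketch of the contribution-quality measure q(b | a, c) described above.
import difflib

def dist(x: str, y: str) -> float:
    """Placeholder edit distance between revisions, based on word-level
    difflib matching; stands in for WikiTrust's real distance."""
    xw, yw = x.split(), y.split()
    matched = sum(m.size for m in
                  difflib.SequenceMatcher(a=xw, b=yw).get_matching_blocks())
    return len(xw) + len(yw) - 2 * matched

def quality(a: str, b: str, c: str) -> float:
    """q(b | a, c) = (d(a, c) - d(b, c)) / d(a, b): improvement per unit of work."""
    work = dist(a, b)
    if work == 0:
        return 0.0  # no change was made, so there is nothing to judge
    return (dist(a, c) - dist(b, c)) / work

# A fully preserved edit scores +1; a fully reverted one scores -1.
a = "the cat sat"
b = "the cat sat on the mat"
print(quality(a, b, b))  # edit preserved by c = b -> 1.0
print(quality(a, b, a))  # edit reverted, c = a -> -1.0
```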
Authors start with a small amount of reputation. When a new revision c is produced, it is used to judge the quality of several preceding revisions b, using as reference point revisions a that are either not too far in time from b and c or are by authors with solid reputations.4 For each such triple considered, the reputation of the author of b is increased by the amount q(b|a, c) · log(1 + rc), where rc is the reputation of the author of c. The dependence of the increment on the reputation of c's author ensures the judgment of higher-reputation authors carries more weight. A linear dependence would lead to an oligarchy in which longtime good-reputation users have overwhelming influence over new users, while new users can give no significant feedback in return. Assigning all users the same influence would lead to a completely democratic system, which would not be ideal in wikis, as the reputations of good users who participated in reversion wars with vandals would be too much at risk. The logarithmic factor balances oligarchy and democracy.
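The update rule is small enough to state directly in code; the numbers below follow purely from the q(b | a, c) · log(1 + rc) formula quoted above.

```python
import math

def reputation_increment(q: float, r_c: float) -> float:
    """Increment to the reputation of b's author when judged by the
    author of c, whose reputation is r_c: q(b | a, c) * log(1 + r_c)."""
    return q * math.log(1 + r_c)

# The logarithmic factor damps the influence of very-high-reputation
# judges: a judge with reputation 1000 carries roughly 3x the weight
# of one with reputation 10, not 100x as a linear rule would give.
print(reputation_increment(1.0, 10))    # ~2.40
print(reputation_increment(1.0, 1000))  # ~6.91
```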
Users judge other users via their actions (their edits) and are thus liable to be judged in turn, making the system resistant to manipulation. For instance, the only way user A can damage the reputation of user B is by reverting user B's edits. However, if subsequent users reinstate B's edits, A's reputation suffers most, as B's contribution is longer-lived than A's.
When developing a reputation system, it is essential for its designers to be able to evaluate its performance quantitatively; otherwise, it is impossible to tune the system or compare different algorithms. A powerful evaluation criterion is whether reputation can help predict the quality of future user contributions.2 This is a tough test; reputation is not only a badge earned through past work but an indicator of future behavior. If low-reputation users were as likely as high-reputation users to do good work, why pay attention to user reputation in the first place?
To evaluate the WikiTrust user-reputation system, we measured the precision and recall with which low reputation can predict reversions.2 For each revision b, we say the author of b has low reputation if his reputation is in the bottom 20% and that b has been reverted if its average quality is below −0.8. This is a proper evaluation, since the reputation of b's author depends only on the author's behavior before b, whereas b's quality depends on how b will be judged by future revisions (see Table 1). The recall is high, indicating high-reputation authors are unlikely to be reverted; the precision is lower because many novice authors make good-quality contributions. In measuring precision and recall, the system weighs each contribution according to the number of words added and deleted. The data is based on Wikipedia dumps ending in late 2009, except for the English Wikipedia, where the dump is from January 2008, augmented with updates through January 2010 for the 30,000 pages of the Wikipedia 0.7 project.f
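The evaluation itself reduces to a standard precision/recall computation over (low-reputation author, reverted) labels; this minimal sketch omits the per-word weighting described above.

```python
def precision_recall(revisions):
    """revisions: list of (author_low_rep, reverted) boolean pairs,
    one per revision. Low reputation is the 'prediction' that the
    revision will be reverted; reversion is the observed outcome."""
    tp = sum(1 for low, rev in revisions if low and rev)
    predicted = sum(1 for low, _ in revisions if low)
    actual = sum(1 for _, rev in revisions if rev)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall
```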
Content reputation. WikiTrust aims to signal the quality of Wikipedia content, alerting visitors to possible vandalism and content tampering. To this end, WikiTrust computes and displays the reputation of Wikipedia content at the granularity of individual words; see Figure 1 for an example in which the assertion on NP-completeness, having earned low reputation, is highlighted in orange.
WikiTrust computes content reputation according to the extent the content has been revised and the reputations of the users who revised it.1,14 When a new revision is created, the text affected directly by the edit is assigned a small fraction of the revision author’s reputation. The text left unchanged gains reputation, with the idea that the author, by leaving it unchanged, has implicitly expressed approval for it. The same idea can be applied to many types of content; all the system must do is identify when an edit occurs, what content is new or directly affected by the edit (it receives a fraction of the author’s reputation), and what content is unaffected and thus implicitly validated (it may gain reputation).
WikiTrust also makes the content-reputation system difficult to subvert. Since it is possible to alter the content of sentences by inserting new text, as well as by rearranging or deleting text, WikiTrust ensures that each such action leaves a mark on the content. Furthermore, the algorithm allows users to raise text reputation up to only their own reputation, so low-reputation users cannot erase the low-reputation marks they leave behind with more activity. To ensure that a single high-reputation user gone rogue cannot arbitrarily raise the reputation of text via repeated edits, the system associates with each individual word the identities of the last few users who raised the word's reputation and prevents users whose identity is associated with a word from again raising the word's reputation. The resulting content reputation system includes two properties:
Revisions. Content reputation is an indication of the extent to which content has been revised and of the reputation of the users who revised it; and
Consensus. High content reputation requires consensus, achieved only as a result of the approval of multiple distinct high-reputation users.
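A minimal sketch of such a content-reputation update might look as follows; the fraction and gain constants are illustrative assumptions, not WikiTrust's actual parameters.

```python
# Illustrative constants, not WikiTrust's real values.
NEW_TEXT_FRACTION = 0.2  # new text starts at a fraction of the author's reputation
GAIN_FRACTION = 0.1      # unchanged text moves a step toward the author's reputation

def update_word(word_rep, author_rep, edited, raisers, author):
    """Return (new_reputation, new_raisers) for one word after a revision.

    `raisers` records which users last raised this word's reputation, so
    no single user can raise a word's reputation twice (the consensus
    property), and no user can raise it above his own reputation."""
    if edited:
        # Text touched by the edit inherits a fraction of the author's reputation.
        return NEW_TEXT_FRACTION * author_rep, [author]
    if author_rep > word_rep and author not in raisers:
        # Implicit approval: move toward, but never above, the author's reputation.
        new_rep = word_rep + GAIN_FRACTION * (author_rep - word_rep)
        return min(author_rep, new_rep), raisers + [author]
    return word_rep, raisers
```

Note how high content reputation requires several distinct high-reputation users: each call by a new approving author adds only a fraction of the remaining gap.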
Evaluation. The system's predictive ability is a measure of its performance. Higher-quality content should be less likely to be deleted in future revisions. This evaluation is imperfect, as it disregards the fact that our content-reputation system aims not only for predictive value but also for warning value with respect to unrevised, possibly malicious edits. Our 2008 analysis of 1,000 articles selected at random from English Wikipedia articles with at least 200 revisions produced the following results1:
Recall of deletions. Only 3.4% of the content was in the lower half of the reputation range, yet this percentage corresponds to 66% of the text deleted from one revision to the next;
Precision of deletions. Text in the lower half of the reputation range had a probability of 33% of being deleted in the very next revision, unlike the 1.9% probability for general text. The deletion probability increases to 62% for text in the bottom 20% of the reputation range; and
Reputation as predictor of content longevity. Top reputation words had an expected lifespan 4.5 times longer than words with bottom reputation.
Lessons learned. The original reputation system described in Adler and de Alfaro2 was open to many attacks that allowed users to gain reputation while doing no useful work or, worse, damaging the system. For instance, users could gain reputation by first vandalizing a revision using an alternate "sacrificial" identity, then undoing the vandalism using their main identity. As we believed these attacks could have crippled the system, we took pains to prevent them before making it available to the public.4 However, neither users nor researchers providing us with feedback expressed concern over the robustness of the original design. We suspect the system would have been more successful by making WikiTrust available earlier and dealing with the security issues later, adopting the common (if hardly principled) approach of "security as an afterthought."
There was much interest in how the system measured contribution quality. Early in its development, we realized that if we relied on a standard edit distance between revisions, users whose contributions were later reworded would sometimes lose reputation in spite of their good work. We solved this by adopting an edit distance that accounts for block moves and differentiates between word insertions and deletions, which are both given a weight of 1, and word replacements, which are given a weight of 1/2; under this edit distance, authors of reworded contributions still receive partial credit for their work. We expected our choice of edit distance would remain an obscure detail buried in the codebase. However, we found ourselves explaining it many times to Wikipedia contributors; users care deeply how their reputation is computed, even when reputation is not displayed to anyone. Perceived fairness is a very important quality of a reputation system.
Table 2 outlines the design space for reputation systems for collaborative content. The first distinction focuses on the signals used for computing reputation; are they derived from explicit user feedback or inferred algorithmically from system events? The two types of systems can work side by side; for instance, sale and product-return information can be used to compute NewEgg product ratings, and WikiTrust users were given the option of voting explicitly for the correctness of Wikipedia revisions.
The second distinction concerns the visibility of the reputation system to the users. Many systems are useful even if they work behind the scenes, used to rank content, prevent abuse, fight spam, and more. Examples are Web-content ranking algorithms (such as PageRank13 and HITS11). Reputation systems working behind the scenes can make use of any signals available on users and content and use advanced algorithms and techniques (such as machine learning). On the other hand, if the goal of the system is to influence user behavior, its existence and the reputation values it computes must be revealed to the users. It is thus important that users are able to form some idea of how the reputation values are computed; they want to know the metrics used to judge them, and systems that cannot be understood are typically considered arbitrary, capricious, unfair, or downright evil.
The strength of the identity system is also a relevant factor in the design of reputation systems. In systems with weak identity, new users must be assigned the same amount of reputation as bad users. There is no benefit of the doubt; if new users could enjoy a reputation above the minimum, bad users could simply use a new identity whenever their reputation fell below that of new users.
The next distinction concerns the existence of a ground truth to which content should correspond in order to have perfect quality. No such ground truth exists for Wikipedia articles, which do not converge to a canonical form as they are edited but rather continually evolve as content is added and refined. For map-business listings, in contrast, a ground truth exists for many information fields; for example, there are one (or a few) correct values for the telephone number of each business. Likewise, in the eBay seller-rating system, it can be usefully assumed that each seller has an intrinsic "honesty," with buyer feedback processed to estimate it. The latter highlights how the existence of a ground truth matters not so much because data can be compared to the ground truth (often expensive or impossible) but because the assumption that a ground truth exists affects the type of algorithms that can be used.
Finally, reputation algorithms span a spectrum from chronological to global. At one extreme, purely chronological algorithms consider the stream of actions on the system, including contributions and comments, and for each action update the reputations of the participating users. The eBay reputation system is chronological, as is WikiTrust. At the other end of the spectrum are reputation systems based on global algorithms that operate at once on the whole network of recommendations, generally in batch mode. Each type of algorithm has advantages. Global algorithms can make use of the information in the graph topology; an example is the way PageRank and HITS propagate reputation along edges.11,13 However, global algorithms may require more computational resources, as they need to consider the whole system at once. Chronological algorithms can leverage the asymmetry between past and future to prevent attacks. In a chronological reputation system, new identities, including fake ones used for attacks, are assigned an initial reputation lower than that of established users. By making it difficult for users to gain reputation from users who are themselves of low reputation, WikiTrust is able to prevent many types of Sybil attacks.4
To illustrate how the characteristics of the design space could influence the structure of a reputation system, consider Crowdsensus, a reputation system we built in 2009 at Google to analyze user edits to Google Maps. Users can edit business listings on Google Maps, providing values for the title, phone, Web site, physical address, location, and business categories. The goal is to measure the accuracy of users contributing information and reconstruct, insofar as possible, the correct listing information for the businesses.
The design space of a reputation system for editing Google Maps business listings differs in several respects from the design space of a Wikipedia reputation system. First, for each business listing, there is, at least in first approximation, a ground truth; ideally, each business has exactly one correct phone number, Web site, and so forth. Reality is more complex, as there are businesses with multiple equivalent phone numbers, alternative Web sites, and more. Nevertheless, for our purpose here, we consider the simpler setting in which every listing attribute has exactly one correct value. We note, too, that it might be expensive to check the ground truth for each business listing, and, in the worst case, it might require sending someone on site. Crowdsensus does not require actually checking the ground truth; it simply relies on the existence of a ground truth, and user reputation is not visible to the users. Consequently, users need not understand the details of how reputation is computed, making it possible to use advanced algorithms and techniques to determine the correctness of the information. The identity notion is stronger in Google Maps than in Wikipedia; in particular, it is a practical nuisance for established users of Google products to open and use separate accounts for Maps editing. And the ample computational resources available at Google enable us to consider global reputation systems, in addition to chronological ones.
Reputation systems are the online equivalent of the body of laws regulating the real-world interaction of people.
These considerations led to our design for Crowdsensus that is very different from WikiTrust. The input to Crowdsensus is a sequence of statements that are triples of the form (u, a, v), meaning user u asserts that attribute a of some business has value v. Thus, Crowdsensus is set to solve what is called a "collective revelation problem,"9 even though some of the instruments by which such problems are solved (such as monetary payoffs and elaborate ways of revealing a user's information) are not available in Crowdsensus. Crowdsensus is structured as a fixpoint graph algorithm; the vertices of the graph are the users and business attributes. For each statement (u, a, v), we insert an edge from u to a labeled by v, and an edge from a back to u. Crowdsensus associates to each user vertex u a truthfulness value qu, representing the probability that u is telling the truth about the values of attributes; this value is initially set to an a priori default, then estimated iteratively.
Crowdsensus computation is structured in a series of iterations. At the beginning of each iteration, user vertices send their truthfulness value to the attributes. Each attribute vertex thus receives the list (q1, v1),…,(qn, vn) consisting of the values v1,…,vn proposed for the attribute, along with the (estimated) truthfulness q1,…,qn of the user who proposed them. An attribute inference algorithm is then used to derive a probability distributiong over the proposed values v1,…,vn. Crowdsensus then sends to each user vertex ui the estimated probability that vi is correct; on this basis, a truthfulness-inference algorithm estimates the user’s truthfulness, concluding the iteration. The algorithm employs multiple iterations, so the information about a user’s truthfulness gained from some statements is able to propagate to other statements.
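The fixpoint structure can be sketched as follows, with a simple truthfulness-weighted vote standing in as the attribute-inference step; the real system uses inference algorithms tailored to each attribute type.

```python
from collections import defaultdict

def crowdsensus(statements, iterations=10, default_truth=0.5):
    """statements: list of (user, attribute, value) triples.
    Returns (most likely value per attribute, truthfulness q_u per user).
    A sketch of the fixpoint loop only; the vote-based inference here
    stands in for Crowdsensus's per-attribute-type algorithms."""
    truth = defaultdict(lambda: default_truth)  # q_u, initially the a priori default
    dist = {}
    for _ in range(iterations):
        # Attribute inference: weight each proposed value by its proposers'
        # estimated truthfulness, then normalize to a probability distribution.
        by_attr = defaultdict(lambda: defaultdict(float))
        for u, a, v in statements:
            by_attr[a][v] += truth[u]
        dist = {}
        for a, votes in by_attr.items():
            total = sum(votes.values())
            dist[a] = {v: w / total for v, w in votes.items()}
        # Truthfulness inference: a user's q_u becomes the average estimated
        # probability that the values the user proposed are correct.
        scores = defaultdict(list)
        for u, a, v in statements:
            scores[u].append(dist[a][v])
        for u, ps in scores.items():
            truth[u] = sum(ps) / len(ps)
    return {a: max(d, key=d.get) for a, d in dist.items()}, dict(truth)
```

Because truthfulness learned from one attribute re-weights a user's votes on all others, agreement between reliable users propagates across the graph over the iterations.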
The attribute inference algorithm is the heart of Crowdsensus. We first used standard algorithms (such as Bayesian inference) but quickly realized they were suboptimal for the real case of Google Maps. First, users lack independent information on the correct value of attributes. Users typically have only a few ways to learn, say, the phone number of a restaurant; they can go there and ask or read it on a coupon, but 100 users providing us a phone number does not correspond to 100 independent ways of learning a phone number. We had to develop algorithms that could account for this lack of independence. Business attributes have different characteristics, and we found it very important to develop attribute-inference algorithms tailored to every type of attribute; for example, geographical positions (expressed as latitude-longitude pairs) have a natural notion of proximity (a distance), and it is essential to make use of it in the inference algorithms. Web sites also involve some notion of distance, at least insofar as two sites might belong to the same domain. Our implementation of Crowdsensus employs different inference algorithms for different types of attributes. The complete system is more complex in several respects, containing algorithms for attributes with multiple correct values, dealing with spam, and protecting the system from abuse. Furthermore, the Google Maps data pipeline comprises several interdependent algorithms and subsystems, and we designed Crowdsensus as one of the many components of the overall pipeline.
To illustrate the Crowdsensus algorithm, consider the case of N users and M attributes; the true value of each attribute is chosen uniformly at random from a set of K possible values. For each user u, we choose a probability pu uniformly at random in the [0, 1] interval; user u provides, with probability pu, the correct attribute value and, with probability 1 − pu, a value selected uniformly at random from the K possible values. Note that Crowdsensus is not informed of the probability pu of a user u; rather, Crowdsensus computes the truthfulness qu for u from the statements by u. For simplicity, we assume that for each attribute, J estimates are provided by J users selected at random.
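The synthetic setup just described can be reproduced with a short generator; the function name and defaults here are ours, for illustration only, not part of the published experiment.

```python
import random

def generate_statements(N=100, M=1000, K=10, J=10, seed=0):
    """Synthetic data per the setup above: N users with hidden accuracies
    p_u, M attributes with hidden true values among K choices, and J
    randomly chosen users reporting on each attribute."""
    rng = random.Random(seed)
    p = [rng.random() for _ in range(N)]               # hidden accuracy p_u
    true_value = [rng.randrange(K) for _ in range(M)]  # hidden ground truth
    statements = []
    for a in range(M):
        for u in rng.sample(range(N), J):
            if rng.random() < p[u]:
                v = true_value[a]        # truthful report
            else:
                v = rng.randrange(K)     # uniform random value
            statements.append((u, a, v))
    return statements, p, true_value
```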
We experimented using a standard Bayesian inference for attribute values. For M = 1,000, N = 100, K = 10, and J = 10, Crowdsensus has an error rate in the reconstruction of the correct value of each attribute of 2.8%. In contrast, a (noniterative) algorithm that performs Bayesian inference without using information on user reputation has an error rate of 7.9%. The roughly threefold reduction in error rate, from 7.9% to 2.8%, is due to the power of user reputation in steering the inference process. The statistical correlation between the true truthfulness pu and the reconstructed truthfulness qu over all users was 0.988, indicating Crowdsensus was able to precisely reconstruct the user's truthfulness. If we take J = 5, the error rate of Crowdsensus is 12.6%, compared with an error rate of 22% for standard Bayesian inference; the correlation between true and inferred truthfulness is 0.972.
We conclude on a note of optimism concerning the role of reputation systems in mediating online collaboration. Such systems are the online equivalent of the body of laws regulating the real-world interaction of people. As a larger fraction of people’s productive lives are carried out through online, computer-mediated interaction, we expect development of such an online body of algorithmic legislation to be a rich field of work and research, with important implications for society overall.
This work was supported in part by the Center for Information Technology Research in the Interest of Society and the Institute for Scalable Scientific Data Management (http://institute.lanl.gov/isti/issdm/). We thank Shelly Spearing and Scott Brandt for their enthusiastic support.