Eighty-three million Americans have bought products online and three million have used the Internet to rate a person, product or service.10 Yet users do not know the level of bias found in the ratings they use. On Web sites offering rating schemes, almost anyone visiting the site can enter ratings. There is little protection from certain users inundating ratings with their own opinions to fulfill personal agendas. For example, book authors and their friends and family may rate their own books offered online as excellent and their competitors’ as poor. Also, ratings are inherently subjective and voluntarily provided, resulting in a possible mismatch between the quality of the rated object and the rating given.4,5,6 Here, raters may use objects in inappropriate contexts, which results in poor perceived performance and, ultimately, low ratings. However ratings are used, they matter, and they influence what people do and how they will feel about their decisions.2,5,7 The purpose of this paper is to examine why people use biased ratings and ways to fix biased rating schemes.
Why people use ratings. The Internet stores vast amounts of information making it possible for users to locate trading partners, products, services, and information. Ratings are provided to help users determine which partners, products and services are best. Ratings are provided free of charge (i.e., unlike subscriptions to Consumer Reports) and typically come with explanations. Online consumers may simply be curious to see how others rated an object, need input for help with an immediate decision, or want to compare current beliefs to others’ to confirm their own beliefs are reasonable. Sometimes users find reading the comments provided with ratings entertaining especially when pointing out negative features.5,7 Ratings are typically conspicuous by being placed near their rated objects. The users’ desire for help in decision making combined with this easy access promotes the use of ratings.
Concerns with using ratings. Most users want to believe in the unselfish intentions of raters to help out in decision-making. Many users want to believe rating scheme administrators will monitor and provide honest ratings.3 In face-to-face environments, interacting parties develop trust based on cues of appearances, tone of voice and/or body language. Parties also trust or distrust one another based on input from those who have interacted with the party in the past. These cues help one determine how much trust to place in others.7,9 However, the online environment does not offer the same cues, making it more difficult for users to develop the appropriate level of trust. The online experience differs from the equivalent offline (face-to-face) experience as Internet users cannot depend on all five senses to make decisions. They must rely on limited representations such as graphics and text descriptions (i.e., the visual design). Through well-designed Web pages and features, Web sites can mask deficiencies in the rated object or mislead users into believing that the ratings they provide are reliable.
Types of Rating Schemes
Examples of rating schemes for people. One of the most well-known rating schemes about people involves participants on eBay. These rating schemes, sometimes called reputation systems, are closely related to the formation of trust between people in an exchange relationship. Every twenty-four hours, eBay hosts over 2.5 million auctions with over 250,000 new items.10 eBay attempts to build integrity into transactions through its Feedback Forum. Buyers and sellers rate each other based on the current transaction in order to inform future participants of the buyers’ and sellers’ current behaviors. Negative ratings may lead future transaction participants to steer clear of certain buyers or sellers. The Feedback Forum attempts to: (1) help buyers identify trustworthy sellers, (2) encourage sellers to be trustworthy, and (3) discourage participation from those who are not trustworthy.11
Examples of rating schemes for products and services. Rating schemes help users screen poor products and services, and find high-quality ones.1,9 These rating schemes, called recommender systems, are closely related to the identification of viable alternatives that match user preferences. Bizrate.com, Epinions.com, and Amazon.com encourage users to rate books, music and other products and services. Users issue numerical ratings along with limited-length explanations on products and services sold by the site. Other recommender systems, such as Slashdot.org and Usenet.com, allow users to rate the usefulness of comments about technical topics and to set a search option to retrieve only the most highly rated comments.
Rating Users and Rating Bias
A description of the objects being rated and the different types of ratings is illustrated in Table 1. Certain ratings are technical specifications or ratings that reflect objective measurements of people, products and services. People may have advanced degrees or substantial experience suggesting greater expertise in a particular domain. Products and services may have objective dimensions suggesting performance and service levels. For example, a car engine may have a horsepower of 220 or a help desk may have a mean-time-to-respond to incoming calls of 5 minutes. These measures are derived from some measurement tool or technique which may exhibit systematic measurement error regardless of the number of times it is used (i.e., sample size). People who need help with an immediate decision tend to gather information and consider the technical specifications of objects.
When people want to confirm their current beliefs about a certain rated object, they may want to compare their current position with the subjective personal opinions of others. Ratings based on personal opinions or ‘taste-dependent’ ratings can be inherently subjective and voluntarily provided. Bias is unintentional and inherent in these ratings as people may not know how to rate things. These ratings are inherently noisy (i.e., have randomness in addition to the rater’s true feelings), meaning deriving a rating that perfectly represents the rater’s opinion may never be possible.
People may use ratings based on technical specifications or personal opinions. For example, if the item rated is a digital camera, cell phone or PC, technical specifications or more objective ratings such as those provided by experts become more important. Alternatively, if the item rated is a CD, movie or restaurant, taste-dependent ratings are needed and personal opinion of the item is important. In this second situation, the rater simply rates based on ‘liking’ or ‘not liking’ the rated object and every opinion is equally valid because people have different tastes. Finally, some objects may have a combination of ratings for different aspects of the rated object (i.e., technical specifications about functionality and personal opinions about aesthetics).
Some people submitting ratings may manipulate them intentionally in order to influence others’ thinking or to enhance their own reputation (e.g., bad-mouthing or ballot-stuffing).4,5 Manipulations can be found as a narrow critique of one dimension of a rated object or as a general critique of many dimensions of a rated object. Manipulations of ratings about people can occur with a collusive agreement of rating levels between the transacting parties. Manipulations of ratings about products and services can occur when people submit high ratings for their own products/services and low ratings for competing products/services.
While rating bias in technical specifications may exhibit systematic measurement error regardless of the sample size, rating bias in subjective personal opinions may be sensitive to sample size. Some of the bias in personal opinions is unintentional and may be based on the rater’s background or type of object being rated.5,12 With a large enough sample size in the context of personal opinions, clustering raters with similar preference profiles and then reporting the average ratings per cluster may converge on a more ‘true, unbiased’ rating for an object within that cluster. However, when the bias in personal opinions is intentional, clustering may not help, as large sample sizes with honest ratings are needed to reduce the effects of the bias.
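The sample-size effect for intentional bias can be illustrated with simple arithmetic. The sketch below is only an illustration under stated assumptions: a 1-5 rating scale, a plain mean as the published rating, and made-up data. The function names `pooled_mean` and `manipulation_shift` are hypothetical and do not come from any cited system.

```python
def pooled_mean(honest, manipulated):
    """Mean rating when honest and manipulated ratings are averaged together."""
    pooled = list(honest) + list(manipulated)
    return sum(pooled) / len(pooled)


def manipulation_shift(honest, manipulated):
    """How far the manipulated ratings move the mean away from the honest mean."""
    honest_mean = sum(honest) / len(honest)
    return pooled_mean(honest, manipulated) - honest_mean


# Hypothetical data: ten ballot-stuffed 5-star ratings on a 1-5 scale.
stuffed = [5] * 10
few_honest = [3] * 10     # small honest sample
many_honest = [3] * 990   # large honest sample

print(round(manipulation_shift(few_honest, stuffed), 2))   # shift of 1.0 stars
print(round(manipulation_shift(many_honest, stuffed), 2))  # shift of 0.02 stars
```

The same ten manipulated ratings move the average by a full star against ten honest ratings, but by only about two hundredths of a star against 990 honest ratings.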
Manipulated ratings and a low sample size highlight why rating bias can be a problem. With only manipulated ratings to rely on, these dishonest ratings will influence users’ decisions. While rating scheme designers and users may expect personal opinions to be inherently biased by the raters’ own personal viewpoints, they also need to expect ratings manipulations to occur. Rating users may consider greater chances of bias when (1) raters have different goals with respect to the object than the user, (2) raters are coerced into submitting ratings (i.e., the disposition to the rated object may be impaired), and (3) the rated object is complex (i.e., rating may reflect limited attributes of a complex object).
Advanced Techniques to Reduce Bias and Improve User Confidence
Given raters’ uncertainty of rating bias levels, some advanced techniques to reduce bias and improve user confidence are warranted. A list of the types of bias and ways to mitigate or manage their effects is illustrated in Table 2.
Improving rating scheme designs to reduce bias. Ways to expose patterns of manipulation exist for rating schemes. Data mining and statistical analysis of past ratings are techniques to highlight unusual rating activity. Human experts can then review these for appropriateness. Probabilistic estimation techniques based on subsets of data may sort through the data to reduce the effects of manipulations.6 These techniques can use large amounts of information about a rater’s social standing, past behavior, and interaction habits to infer a rater’s reputation and to identify dishonest raters.1 These techniques can also examine multiple ratings to determine if one rater is trying to flood the system to his/her advantage.
Frequency, median, and cluster filtering can mitigate the impact of unfair raters who flood a rating scheme with their opinions. These techniques were shown in reputation rating systems to reduce the influence of manipulated ratings as long as the ratio of manipulating raters to total raters was less than a specific threshold.4,5 Once the number of manipulating raters is too great when compared to the total number of raters, it is difficult to eliminate their influence on users’ perceptions. Frequency filtering removes ratings from raters with a higher-than-average number of submissions, while median filtering reports the median of the ratings provided rather than their mean.6,9 Cluster filtering identifies the nearest neighbor set of raters based on the similarity of ratings on commonly rated objects.6,9 These filtering techniques have been shown to remove some unwanted rating bias although a small amount of bias may remain.
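Frequency and median filtering can be sketched in a few lines. This is a minimal illustration, not the procedure from the cited studies: it assumes ratings arrive as (rater, score) pairs, that "higher than average" means more submissions than the mean number per rater, and that the rater names and scores are invented. Cluster filtering is left out of the sketch.

```python
from collections import Counter
from statistics import median


def frequency_filter(ratings):
    """Keep only ratings from raters whose submission count is at or below
    the average number of submissions per rater."""
    counts = Counter(rater for rater, _ in ratings)
    average = sum(counts.values()) / len(counts)
    return [(rater, score) for rater, score in ratings if counts[rater] <= average]


def median_rating(ratings):
    """Report the median score instead of the mean."""
    return median([score for _, score in ratings])


# Hypothetical data: three ordinary raters and one rater flooding the scheme.
ratings = [("ann", 4), ("bob", 3), ("cai", 4)] + [("flooder", 1)] * 5

kept = frequency_filter(ratings)
print(kept)                    # only the ratings from ann, bob and cai remain
print(median_rating(kept))     # median of the filtered ratings
print(median_rating(ratings))  # median of the unfiltered ratings
```

In this made-up data the flooder supplies most of the ratings, so the median of the unfiltered ratings is dragged down to the flooder's score. This matches the observation that the techniques work only while manipulators stay below some share of all ratings. Frequency filtering removes the flood first, after which the median reflects the remaining raters.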
The techniques above assume a rater is not changing his/her identity. Other statistical algorithms are needed to compare ratings to identify patterns of common language or style between and among raters. These algorithms would determine and search for patterns of fraudulent or manipulation practices which human experts could then review.
For ratings about people, controlled rater anonymity may eliminate the rating manipulation of bad-mouthing.5 Here, the system publishes the average ratings but keeps the identity of the rated individual concealed or assigns pseudonyms which change from one usage to the next. In this way, users make decisions solely based on object attributes and published ratings.4,5 Bad-mouthing is avoided because raters can no longer identify their victims.
Another solution is to compare raters’ ratings to peers’ ratings and weigh them based on convergence which has been shown to induce honesty.11 In addition, rating schemes can punish those manipulating ratings for personal gain by locking them out of the system.3 Also, publicizing dishonesty and reminding raters of their responsibility to be honest may deter biased ratings.
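One way to picture weighting by convergence is to give each rater a weight that shrinks as the rating departs from the average of the other raters. The sketch below is an assumption-laden illustration, not the mechanism described in the cited work: the linear weight formula, the 1-5 scale, the function name `convergence_weighted_mean`, and the data are all hypothetical.

```python
def convergence_weighted_mean(scores, scale_range=4.0):
    """Weighted mean in which each rater's weight falls linearly with the
    distance between that rater's score and the mean of all other raters.

    scores: dict mapping rater id -> score; needs at least two raters.
    scale_range: width of the rating scale (4.0 for a 1-5 scale).
    """
    weights = {}
    for rater, score in scores.items():
        peers = [s for r, s in scores.items() if r != rater]
        peer_mean = sum(peers) / len(peers)
        weights[rater] = 1.0 - abs(score - peer_mean) / scale_range
    total = sum(weights.values())
    if total == 0:
        # Every rater sits at the opposite end of the scale from the peers:
        # fall back to the plain mean rather than divide by zero.
        return sum(scores.values()) / len(scores)
    return sum(weights[r] * scores[r] for r in scores) / total


# Hypothetical data: three raters agree, one rates far below them.
scores = {"a": 4, "b": 4, "c": 4, "d": 1}
print(sum(scores.values()) / len(scores))           # plain mean: 3.25
print(round(convergence_weighted_mean(scores), 2))  # weighted mean: 3.7
```

The outlying rating still counts, but with a quarter of full weight, so the published value moves toward the converging raters.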
A different solution is to change the process by applying higher standards for accepting ratings. Optimally, rating schemes should accept only ratings from qualified raters who have experience with the object. However, this may be too limiting, so one more solution is to change the process to provide structure and formalize how ratings are submitted. To help raters submit better ratings, systems could use a pre-established list of questions about the rated object. Instead of one overall value, ratings would be multi-faceted reflecting the raters’ assessment of various product attributes such as aesthetics, functionality, durability and performance.3,4,5 Also, online help functions and wizards could help those submitting rating values.
Rating scheme designers need to make it more difficult to change the raters’ name and identity. Raters can manipulate ratings and perform other dishonest acts, then change their identities and repeat their actions. The rating scheme could require people to use their real names or prevent them from creating multiple identities by endorsing a “once-in-a-lifetime” identity.11 One way to monitor and control identities is to require ratings to be submitted via email to a dedicated review staff. Staff members can require authentication information, which is reviewed and checked, and a rater could be assigned one unique identifier. Other approaches make reentry with a new identity unprofitable by imposing an upfront cost on a new rater.3 This cost can be imposed through an explicit fee or an implicit fee of having to go through a reputation building stage.3 The reputation building approach would assign the lowest possible reputation value to new raters to discourage raters from misbehaving and changing online names because they would have to re-build their reputation.11 Thus, new raters would be labeled as unreliable until they have proven themselves to be trustworthy by building a good reputation over time.
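The reputation building stage can be sketched as a small ledger in which every new identity starts at the lowest value. The class below is a hypothetical illustration: the point scale, the threshold of five verified ratings, and all names are assumptions rather than details of any deployed scheme.

```python
class ReputationLedger:
    """Tracks reputation points per rater identity."""

    LOWEST = 0  # every new identity starts at the lowest possible value

    def __init__(self, trusted_at=5):
        # trusted_at: assumed number of points needed to be labeled reliable
        self.trusted_at = trusted_at
        self.points = {}

    def register(self, rater_id):
        """Create a new identity with the lowest reputation."""
        self.points[rater_id] = self.LOWEST

    def record_verified_rating(self, rater_id):
        """Award one point for a rating judged to be honest."""
        self.points[rater_id] += 1

    def is_trusted(self, rater_id):
        return self.points[rater_id] >= self.trusted_at


ledger = ReputationLedger()
ledger.register("rater_a")
for _ in range(5):
    ledger.record_verified_rating("rater_a")

# Abandoning the old name means starting over from the lowest value.
ledger.register("rater_a_new_name")
print(ledger.is_trusted("rater_a"))           # True
print(ledger.is_trusted("rater_a_new_name"))  # False
```

Because reputation attaches to the identity rather than the person, a rater who switches names forfeits the accumulated points, which is the implicit fee described above.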
Rating schemes need to improve how they authenticate rater affiliation to the object. The rater could be specifically asked their affiliation to the object. Rating schemes should provide an outlet for those affiliated with the rated object to give feedback but designate it as such. The system should authenticate rater identity by confirming a phone number, credit card number, social security number, some personal questions, or various combinations of these.3 Regardless of identity, the system could authenticate a rater who has purchased the product or service by confirming an invoice number for the transaction or UPC product code, and then comparing it to a sales database.
Studies have shown the information source’s affiliation with the object matters.6 Certain consumers were more skeptical of ratings on a corporate Web site than an independent Web site.12 To increase the perceptions of honesty, rating schemes on non-independent Web sites could engage a third party to certify the accuracy of ratings, and furthermore make guarantees as to rating reliability and rater credibility. Much as financial auditors attest to the accuracy and completeness of financial statements, rating auditors could confirm ratings with the raters, then verify credentials and review rating policies. The rating auditor could provide an official seal of approval much like the WebTrust seal being used by accountants today.
Helping users manage rating bias. Rating scheme designers can adopt several approaches to help users manage rating bias. Systems that coordinate thousands of submissions by thousands of users reduce the effects of rating bias. With thousands of inputs, the majority count could give objects a more objective rating measure.9 Similar to rating schemes, manipulations are also a problem in online opinion forums. It has been argued the most effective way to increase the value of a forum for everyone is to encourage increased levels of active input because increased opinions decrease the effects of manipulations.9 Thus, to reduce the effects of manipulations, rating schemes need incentives to increase rating submissions. One way to increase the number of useful ratings is to develop industry standards for rating schemes across Web sites to allow users to combine information increasing the amount of rating inputs.
Designers can promote skepticism of ratings and warn users of bias and typical ways it manifests itself. Studies have shown that when people pay attention to whether a message is truthful or not they are more apt to judge it properly.9 In cases where subjective ratings are taste-dependent, rating schemes may need to personalize ratings using collaborative filtering techniques. Here the system provides a different estimate of the rating for each user, trying to compensate for each user’s individual taste profile. Taste profiles can be established from submitted ratings using collaborative filtering techniques which measure the extent to which users and/or raters agree in terms of taste.3 Using collaborative filtering concepts, hybrid rating schemes would find other users with similar taste profiles and direct the user to objects that similar users rated highly.1
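A personalized estimate of this kind can be sketched as a similarity-weighted average. The code below is a minimal sketch, assuming a 1-5 scale, an agreement measure based on the average gap between two profiles on commonly rated objects, and invented profiles. Production collaborative filtering systems typically use more elaborate similarity measures.

```python
def agreement(profile_a, profile_b, scale_range=4.0):
    """Taste agreement in [0, 1] computed over objects both profiles rated."""
    common = set(profile_a) & set(profile_b)
    if not common:
        return 0.0
    gap = sum(abs(profile_a[o] - profile_b[o]) for o in common) / len(common)
    return 1.0 - gap / scale_range


def personalized_rating(user, raters, target):
    """Estimate the user's rating of `target` as an average of other raters'
    scores, weighted by how closely each rater's taste matches the user's."""
    weighted_sum = 0.0
    weight_total = 0.0
    for profile in raters:
        if target not in profile:
            continue
        weight = agreement(user, profile)
        weighted_sum += weight * profile[target]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


# Hypothetical taste profiles (object -> score on a 1-5 scale).
user = {"jazz_cd": 5, "action_movie": 1}
similar = {"jazz_cd": 5, "action_movie": 1, "bistro": 5}
opposite = {"jazz_cd": 1, "action_movie": 5, "bistro": 1}

print(personalized_rating(user, [similar, opposite], "bistro"))  # 5.0
```

A plain average would report 3 for the bistro. The personalized estimate follows the rater whose past ratings match the user's, because the opposite-taste rater receives zero weight.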
Rating schemes should adopt alternative architectures using more objective ratings (e.g., Google.com rates web pages using the number of links other pages have pointing to that page) and decentralized architectures (peer-to-peer networks).3,6 Rating statistics could show the number of visits to a product page and could recommend links to other highly visited product pages related to the current product or disclose user behavior such as the average time spent reviewing that product page.3,8 Another idea for implicit and objective ratings is to collect and merge personal bookmarks of product pages to share among users to exploit others’ browsing activities.1
Rating credibility information and quality metrics are needed. Rating schemes specify who can participate, the types of information solicited, how that information gets aggregated, and how that information gets reported.3 Some rating schemes allow users to rate the usefulness of ratings by displaying average “usefulness” alongside each rating or by posting reputation scores next to rater names. These are social cues to gauge trustworthiness and expertise of the rater. Research is needed to determine what types and what formats of credibility information will help users determine rating bias.
To help users understand rater motivations, rating schemes need to require text explanations for raters to substantiate their rating decisions. Studies have found explanations help users determine when to rely on recommended courses of action.8 In addition to text explanations, rating schemes need to provide background information about raters. A rater expertise hierarchy could calculate rater reputation.4,9 This hierarchy takes into account that raters’ expertise may vary depending on the domain that is being rated and calculates a confidence interval associated with a rating. Hierarchy weights are calculated based on discrepancies between raters and rater groups.4,9 Rating schemes could also derive rater reputations from a variety of sources, including direct experience, feedback from third parties, and implicitly gathered data.3
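Reporting a confidence interval alongside a rating can be illustrated as follows. This sketch does not reproduce the cited expertise hierarchy. It only shows one conventional way to attach an interval to an average rating, assuming a normal approximation (a multiplier of 1.96 for roughly 95% coverage) and invented scores.

```python
from math import sqrt
from statistics import mean, stdev


def rating_interval(scores, z=1.96):
    """Approximate confidence interval for the mean rating.

    Uses the normal approximation: mean +/- z * s / sqrt(n).
    With very few ratings a t-based multiplier would be more appropriate.
    """
    center = mean(scores)
    margin = z * stdev(scores) / sqrt(len(scores))
    return center - margin, center + margin


# Hypothetical ratings for one object on a 1-5 scale.
low, high = rating_interval([4, 5, 3, 4, 4])
print(round(low, 2), round(high, 2))  # about 3.38 and 4.62
```

A wide interval signals that the published average rests on few or widely scattered ratings, which is the kind of credibility cue a user can weigh.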
Finally, people tend to trust the friend of a friend more than a stranger, suggesting networks of trust could be useful in rating schemes.8 Rating schemes could allow users to select those they want to be in their personal review group (e.g., linkedin.com). Users would have the ability to change the members of their review group by including only those who have historically proven to provide the most useful reviews in a particular product category.8 In addition, rating schemes should show measures of rater agreement and total rater credibility. If an aggregated rating is provided, rating schemes should provide the identity of each rater along with his/her level of contribution. In conclusion, this paper has discussed the use of, and biases in, rating schemes and suggested many advanced techniques to reduce bias and improve user confidence in ratings. These are all deemed critical to the future success of the Internet.