Search engines are among the most-used resources on the Internet. Google [2], for example, now hosts over eight billion items and returns answers to queries in a fraction of a second, thus realizing some of the more far-fetched predictions envisioned by the pioneers of the Web [1]. Here, we assess whether people are biased in their use of a search engine; specifically, whether people tend to click on those items presented as most relevant in the search engine's results list (the items listed at the top of the list). To test this bias hypothesis, we simulated the Google environment, systematically reversing Google's normal relevance ordering of the items presented to users. Our results show people do manifest some bias, favoring items at the top of results lists, though they also seek out high-relevance items listed further down a list. Whether this bias arises from people's implicit trust in search engines such as Google, or from some other effect, is a question taken up later in this article.
The Web provides access to an unparalleled volume of information at time costs that are orders of magnitude lower than those required for traditional media. The critical jumping-off point for this vast repository is typically provided by the results returned by search engines to user queries: Google [2], for example, returns results to over 100 million queries daily, queries that are typically two words long [12]. Like many search engines, Google uses the collective intelligence of the Web to rank-order pages of relevance to a particular query. Each page in this ordered list is typically summarized by a clickable title, some snippets of the page’s content (with highlighted matching content words) and a Web address link.
A rational searcher might be expected to assess each of these page summaries against their information need and click on the one that appears as the most relevant. People may not search in such a way, however. Instead, they may manifest biases—they may simply click on top-listed results without checking the results against what is on offer, for example. Such biases, if they exist, could be due to users coming to implicitly trust a search engine. That is, over time, as a search engine consistently delivers relevant pages toward the top of its results lists, users might come to assume the top results are indeed the best. Alternatively, such biases may arise because of a user’s tendency to “satisfice,” or stop at the first item that contains the sort of content sought, rather than looking for the “best” result among hundreds of relevant results.
The key manipulation in this study was to compare users' responses when they received results lists in their normal ordering versus a systematically reversed ordering. If people are biased in their search, they will not notice that the relevance rankings have been reversed; that is, they should respond identically to the normal and reversed lists, clicking on the results placed at the top of each list. If they are not biased, they should respond differently to the normal and reversed lists; specifically, they should hunt down through the reversed list to find the highly relevant items. To presage our results, the truth seems to lie somewhere between these two extremes. There is definite evidence of bias in people's Google searches; they tend to click on first-listed items, though they also sometimes seek out highly relevant results lower down the results list.
Method
Thirty science undergraduates at University College Dublin, Ireland, were paid to participate in the study. Participants were asked to answer 16 questions on computer science (for example, "Who invented Java?") by running as many queries as they liked on the simulated Google environment. The interface was designed to have the look and feel of Google, and all participants reported they thought it was Google they were using. The simulated system was built using comprehensive search logs from a live user trial in which a separate group was asked to answer the same 16 questions [4]. All of the queries used by this group were stored, as were all of the results lists returned to those queries. We then created a database linking each query to its appropriate results list. This database sat behind our simulated Google interface and was used to systematically counterbalance the presentation of results lists in either a normal or reversed ordering when a user entered a query.
The results lists returned to a given query were presented in either their original relevance ordering or a reversed ordering, counterbalanced across trials of the experiment. The order in which questions were presented was randomized for each participant to counteract any learning effects that might occur in the course of a trial. Participants were instructed to limit their queries to two terms or fewer, to parallel typical user behavior (the average Web user enters between one and three query terms [12]). For each question, we recorded the number and names of the queries entered and the search results clicked on by users. The timing of each transaction was also recorded. Participants were asked to complete a form detailing their answers to the questions, and sessions averaged 1.5 hours.
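For illustration, the following is a minimal sketch of how such a query-to-results store and counterbalancing scheme might be implemented. The data structure and function names here are our own assumptions for exposition, not the authors' actual system:

```python
import random

# Hypothetical query-to-results store harvested from the earlier live
# trial [4]: each logged query maps to its Google-ordered results list.
results_db = {
    "java inventor": ["result_1", "result_2", "result_3"],  # one entry per logged query
}

def serve_results(query, reverse_order):
    """Return the stored results for a query, either in Google's
    original relevance order or systematically reversed."""
    results = results_db.get(query, [])
    return list(reversed(results)) if reverse_order else results

def build_session(questions):
    """Randomize question order for a participant and counterbalance
    normal versus reversed presentation across the trials."""
    questions = list(questions)
    random.shuffle(questions)                        # counteract learning effects
    conditions = [i % 2 == 0 for i in range(len(questions))]
    random.shuffle(conditions)                       # half normal, half reversed
    return list(zip(questions, conditions))
```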
We also carried out a ranking post-test to see whether people agree with Google's relevance ordering of results. This post-test was carried out on a sample of the results lists using a new group of 14 students. These participants were asked to manually rank the presented results lists from the search experiment (on a 1-to-10 scale, from "most likely to select" to "least likely to select"). A sample of 16 results lists from the experiment was used, based on the results lists returned to the most frequently used query for each of the 16 questions. This sample thus covered the results lists that contributed most to any effects found in the experiment. Each participant received the results lists in a randomized order, and the results within each list were also randomized for every participant. This procedure was adopted to ensure an accurate assessment of people's relevance ranks, independent of any possible bias effect. People were given one hour to complete this ranking task, during which participants completely ranked only a subset of the presented results lists.
Results and Discussion
The dependent measure was the user's first click; that is, the first link chosen by a given user in the results list returned to a given query. The data was analyzed in a 2 (condition: normal versus reversed) × 10 (relevance rank: 1-10) design, treating queries as the random factor. That is, for each query we recorded the proportion of people who chose a particular ranked result, noting whether this occurred in a normal or reversed list. The two-way analysis of variance (ANOVA) on condition and relevance rank revealed a main effect of relevance rank [F(9,319) = 102.14, p < 0.01, MSe = 0.89] and a reliable interaction between condition and relevance rank [F(9,319) = 11.31, p < 0.01, MSe = 0.10]. Tukey's post-hoc comparisons of the interaction showed reliable differences between the first-click frequencies for the 1st, 9th, and 10th relevance ranks (see the figure here).
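A sketch of this analysis in modern terms, assuming a long-format table with one row per query × condition × rank cell (the file and column names are illustrative, and this fixed-effects formulation only approximates the original random-factor design):

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# One row per query x condition x rank cell; "prop" is the proportion
# of first clicks that landed on the result at that relevance rank.
first_clicks = pd.read_csv("first_clicks.csv")  # columns: query, condition, rank, prop

model = ols("prop ~ C(condition) * C(rank)", data=first_clicks).fit()
print(anova_lm(model, typ=2))  # main effects plus the condition x rank interaction
```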
These results clearly indicate people’s first clicks in the normal and reversed conditions are not identical, providing evidence that people are partially biased in their search activities. Items with the highest-relevance ranks (items ordered first by Google) are chosen 70% of the time in the normal condition, but this rate drops to 10% in the reversed condition. In contrast, the 9th and 10th relevance-ranked items are chosen more often (13% and 41%, respectively) in the reversed condition than in the normal one (2% and 2%, respectively). Intermediately ranked items are much the same across both conditions.
The significance of what is happening is readily apparent if one considers the data by position in the results lists. The accompanying figure shows that when lower-relevance-ranked items are positioned first and second in the results list (as they are in the reversed condition), they are chosen more often by users, despite their limited relevance. In contrast, when the highest-relevance items are positioned last in the results list (in the reversed condition), they are chosen considerably less often. In short, users are, in part, misled by the presented order of the items. However, people do sometimes deliberately hunt out the highly relevant items even when they are located at the very bottom of the returned list.
The post-test showed close agreement between people's rankings of the returned results and the rankings produced by Google. People's mean rankings of the sampled results lists correlate highly with the search engine's rankings (r² = 0.9124; t = 9.13; df = 8; p < 0.0001). This result shows that the items Google presents as the best are considered the best by people too. Interestingly, this finding holds even though people were given the results lists in a randomly re-ordered form, suggesting the highly relevant items in each results list were easily identifiable. The post-test also sheds light on another issue: the relevance topology of the results lists. One concern about the evidence is that the first 10 results in each list might be approximately equal in relevance, with the real relevance differences only beginning around the 100th or 200th ranked items. If this were the case, the search behavior observed here would apply only to results lists with flat relevance topologies. This concern is partly, but not fully, answered by the correlation reported here.
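The reported t statistic follows directly from the correlation itself, assuming it was computed over the n = 10 ranked positions (df = n − 2 = 8). A quick arithmetic check:

```python
import math

# Standard t-test for a Pearson correlation: t = r * sqrt(df / (1 - r^2))
r2, df = 0.9124, 8
t = math.sqrt(r2) * math.sqrt(df / (1 - r2))
print(round(t, 2))  # 9.13, matching the reported statistic
```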
To get a better idea of the actual relevance topology, we analyzed the rankings produced by people in the post-test in a different way. For each of the 16 results lists sampled, we noted the mean rating given by people to each result in the list. If the relevance topology were flat for these lists, these mean ratings should all be approximately equal (recall that order effects are controlled in these data by randomization). However, this is not what we found. There is a wide variety of topologies across the lists: a few have a single highly relevant item (with a mean rank of 1 or 2), others have several results with high mean ranks, and still others have a linearly increasing relevance topology. This finding suggests our random selection of questions for the experiment generated a random selection of relevance topologies, presumably representative of the topologies generated by Google. Furthermore, they are not all flat but hugely varied.
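One way to make this topology check concrete (a sketch under our own assumptions about how the ratings data is laid out):

```python
import numpy as np

def topology_profile(ratings):
    """ratings: a raters x positions array of 1-10 rankings for one
    results list. Returns the mean rating per position and its spread;
    a flat topology gives roughly equal means (spread near zero), while
    a single-peak topology gives one standout value."""
    means = np.asarray(ratings, dtype=float).mean(axis=0)
    return means, means.max() - means.min()
```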
General Discussion
Our study results clearly show people are partially biased in their search behavior. While it is known that people have a fondness for items at the beginning of written lists, the novelty of our study is that it demonstrates such effects within a search engine context (through our systematically controlled forward-reversed paradigm). So, given that we have evidence of such bias, the difficult question to answer is “Why?”
Recently, Joachims et al. [5] independently carried out an experiment similar to the present one using an eye-tracking paradigm, and interpreted their findings as reflecting people's development of an implicit trust in search engines. However, other recent work does not conclusively support this "trust hypothesis"; for instance, O'Brien and Keane [7] have found the bias exists even when search results are presented as simple text lists.
An alternative possibility is that the bias is a function of "satisficing" search heuristics on the part of users, whereby users seek satisfactory rather than optimal results. Our findings seem closer to this type of search behavior in that we find only a partial bias; that is, people do sometimes search to the bottom of the list to find the highly relevant items. O'Brien and Keane [7] have also observed different click patterns across different result distributions; for instance, a highly relevant result followed by many irrelevant results stands a greater chance of being chosen than the same highly relevant result preceded by other relatively relevant results.
O'Brien and Keane [8] have modeled users interacting with search results as adopting a satisficing strategy, a model that accommodates both the findings presented in the current study and eye-tracking evidence suggesting users tend to evaluate results sequentially, deciding immediately whether or not to click [6]. The model predicts users will, in general, tend to click on top results over results lower down the list, though this tendency should weaken when the relevance of the top results is weakened. O'Brien and Keane [8] demonstrate how the model, across a number of trials, approximates the aggregate search behavior of large numbers of users searching the Web.
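To illustrate the flavor of such a model (this is a minimal satisficing sketch with assumed threshold and noise parameters, not the O'Brien and Keane model itself):

```python
import random

def satisficing_click(relevances, threshold=0.6, noise=0.15):
    """Scan results top to bottom and click the first one whose
    perceived relevance clears the threshold; if nothing satisfices,
    fall back to the best-looking item."""
    perceived = [r + random.gauss(0, noise) for r in relevances]
    for position, value in enumerate(perceived):
        if value >= threshold:
            return position
    return max(range(len(perceived)), key=lambda i: perceived[i])
```

Run over many simulated users, a list with a strong top result concentrates first clicks at position one, while weakening the top results pushes clicks down the list, which is the qualitative pattern described above.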
On the whole, our results suggest search engines could misleadingly overpromote an initially popular page: having placed it at the top of the results list, the engine sees it clicked on preferentially by users, in turn increasing the likelihood of it being placed first, being clicked on, and so on (see also [3, 7, 8, 10]). This problem obviously applies to search engines that rely on histories of previous user choices (for example, [11]), but it could also apply to linkage-based algorithms such as Google's PageRank [9], because top-of-the-list pages are more likely to end up as chosen links on people's Web pages. Search engine designers may need to design systems that overcome such effects (for one solution, see [10]), and it is clear that future information delivery systems have much to learn from such detailed analyses of user search behavior.
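This rich-get-richer dynamic can be made concrete with a small simulation (entirely our own sketch; the position-bias click-through weights are assumed, not measured):

```python
import random

def simulate_popularity_loop(n_pages=10, n_rounds=1000):
    """Pages are re-ranked by accumulated clicks each round; clicks
    favor top positions, so an early leader tends to lock in at the top."""
    clicks = [1] * n_pages                          # uniform prior popularity
    position_bias = [0.30, 0.20, 0.12, 0.10, 0.08,
                     0.06, 0.05, 0.04, 0.03, 0.02]  # assumed per-position click rates
    for _ in range(n_rounds):
        ranking = sorted(range(n_pages), key=lambda p: -clicks[p])
        clicked_page = random.choices(ranking, weights=position_bias)[0]
        clicks[clicked_page] += 1
    return clicks
```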