Technical Perspective: Evaluating Sampled Metrics Is Challenging

Item recommendation algorithms rank the items in a catalogue from most relevant to least relevant for a given context (for example, a query) provided as input. Such algorithms are a key component of our daily interactions with digital systems, and their diffusion in society will only increase in the foreseeable future.

Given the diffusion of recommendation systems, comparing them is a crucial endeavor. Item recommendation algorithms are usually compared using some metric (for example, average precision) that depends on the positions of the truly relevant items in the ranking, produced by the algorithm, of all the items in the catalogue.
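
To make the role of such position-dependent metrics concrete, here is a minimal sketch (not from the paper) of how average precision is computed from a full ranking of the catalogue; the item names and relevance labels below are made up.

```python
def average_precision(ranked_items, relevant_items):
    """Average precision: mean of precision@k over the positions k at which
    a relevant item appears in the ranking."""
    relevant = set(relevant_items)
    hits, precisions = 0, []
    for k, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical example: a five-item catalogue ranked by a recommender,
# with items "a" and "d" truly relevant.
ranking = ["b", "d", "a", "e", "c"]
print(average_precision(ranking, {"a", "d"}))  # (1/2 + 2/3) / 2 ≈ 0.583
```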

The experimental evaluation and comparison of algorithms is far from easy. One reason is that there are several choices to be made, starting with the input instances for the evaluation: how many should be considered? Which ones? While ideally one would like to pick inputs that are representative of the instances on which the algorithms will run when deployed, several other factors play a role in such choices, such as the available resources (for example, time and memory). The impact of the choice of inputs is clear to every researcher and practitioner who has tried to compare algorithms and tools.

The following paper exposes another crucial aspect of the evaluation of algorithms and tools: the impact of using sampled metrics instead of exactly computed metrics. The catalogues ranked by item recommendation systems are often very large, with sizes ranging from tens of thousands to millions of items, depending on the application. For this reason, the evaluation of item recommendation algorithms is extremely laborious. A method recently adopted to speed up the evaluation is to use sampled metrics, which are metrics obtained by ranking the relevant items against a small set of irrelevant items sampled from the catalogue.
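
The sampled protocol can be illustrated with a short sketch, assuming scores for one relevant item and a synthetic catalogue of irrelevant items (all numbers below are hypothetical): instead of ranking the relevant item against every irrelevant item, it is ranked against m irrelevant items drawn uniformly from the catalogue.

```python
import numpy as np

rng = np.random.default_rng(0)

def exact_rank(relevant_score, irrelevant_scores):
    """1-based rank of the relevant item against every irrelevant item."""
    return 1 + int(np.sum(irrelevant_scores > relevant_score))

def sampled_rank(relevant_score, irrelevant_scores, m):
    """1-based rank of the relevant item against m irrelevant items
    sampled uniformly without replacement from the catalogue."""
    sample = rng.choice(irrelevant_scores, size=m, replace=False)
    return 1 + int(np.sum(sample > relevant_score))

# Hypothetical scores: one relevant item, 10,000 irrelevant catalogue items.
irrelevant = rng.normal(size=10_000)
relevant = 2.0
print(exact_rank(relevant, irrelevant))           # a rank in the full catalogue
print(sampled_rank(relevant, irrelevant, m=100))  # typically far smaller
```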


The authors study the use of sampled metrics in evaluating item recommendation algorithms. One of the main results of the paper is that sampled metrics can lead to conclusions that are inconsistent with their exact counterparts, meaning that an algorithm A that is inferior to algorithm B under an exact metric can appear superior under the sampled metric. Even more surprisingly, the paper shows that, for commonly used metrics, the relative ordering of three real-world recommenders changes as the sample size increases, so that opposite conclusions are drawn with different sample sizes.

One possible explanation for the inconsistent behavior of sampled metrics could be their variance, due to sampling. However, the paper shows that, for item recommendation algorithms, this is not the case: the sampled metrics have very low variance, and their inconsistency is due to an inherent bias in their estimates of the exact metrics. This rules out a common approach to obtaining stable results from sampling, namely repeating the estimation procedure multiple times. Moreover, the paper shows that relatively large samples are needed to obtain consistent results: when one third of the whole catalogue is used as a sample, the sampled metrics are consistent with the exact metrics. Unfortunately, in this case the speedup from sampling is limited.
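
The difference between variance and bias can be seen in a small simulation (mine, not the paper's): for hypothetical per-query exact ranks, the sampled recall@10 averaged over many repetitions settles on a value with low variance that is nevertheless far from the exact recall@10.

```python
import numpy as np

rng = np.random.default_rng(1)
n_irrelevant, m, k = 9_999, 100, 10   # catalogue negatives, sample size, cutoff

# Hypothetical exact ranks of the single relevant item for 1,000 test queries.
exact_ranks = rng.integers(1, 500, size=1_000)
exact_recall = np.mean(exact_ranks <= k)

def sampled_recall(exact_ranks, n_irrelevant, m, k):
    """Recall@k when each relevant item is ranked against m sampled negatives.
    The number of sampled negatives that outrank it is hypergeometric."""
    above = exact_ranks - 1                               # negatives scored higher
    sampled_above = rng.hypergeometric(above, n_irrelevant - above, m)
    return np.mean(sampled_above + 1 <= k)

estimates = [sampled_recall(exact_ranks, n_irrelevant, m, k) for _ in range(200)]
print(f"exact recall@10:   {exact_recall:.3f}")
print(f"sampled recall@10: {np.mean(estimates):.3f} ± {np.std(estimates):.3f} "
      "(averaged over 200 repetitions)")   # low variance, but heavily biased
```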

What is the source of the inconsistency and bias in the sampled metrics? As the authors show, it stems from a simple fact: when only a sample of the irrelevant items is used, the rank of a relevant item is an underestimate of its exact rank, the one obtained when all the irrelevant items are considered. Since the error in the estimate can be quantified, it can also be corrected, and another main result of the paper is that even a simple correction resolves most of the mistakes of the uncorrected sampled metrics. Therefore, while, as the authors suggest, sampling-based approaches should be avoided in evaluations whenever possible, they can still be employed when paired with a properly designed correction.
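
As an illustration of the idea (a sketch of one simple correction under the assumption of uniform sampling, not necessarily the exact estimator studied in the paper): each irrelevant item that outranks the relevant one appears in a sample of size m with probability m/(N-1), so rescaling the number of sampled items above the relevant one gives an approximately unbiased estimate of its exact rank, which can then be plugged into the metric.

```python
def corrected_rank(sampled_rank, n_items, m):
    """Estimate the exact rank of a relevant item from its sampled rank.

    `sampled_rank` is its 1-based rank among m irrelevant items sampled
    uniformly from the n_items - 1 irrelevant items in the catalogue.
    Each outranking irrelevant item lands in the sample with probability
    m / (n_items - 1), so rescaling the sampled count estimates the
    number of catalogue items that outrank it overall.
    """
    sampled_above = sampled_rank - 1
    estimated_above = sampled_above * (n_items - 1) / m
    return 1 + estimated_above

# Hypothetical numbers: sampled rank 3 among m=100 negatives in a
# catalogue of 10,000 items.
print(corrected_rank(3, n_items=10_000, m=100))  # ~201: the estimated exact rank
```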

One of the most important takeaways from the paper is clear: when sampling is used to estimate a quantity, understanding and analyzing the impact of the sampling procedure is crucial. This is a more general message than it may seem at first sight. In many applications one can rarely assume that the data at hand represents the whole system, population, or process under study; most commonly, the data is only a sample of it. Understanding the impact of sampling procedures on the results of algorithms, and how to properly account for them in the computation, is of paramount importance for drawing reliable and robust answers from data.
