Competition Makes Big Datasets the Winners

Measurement has driven research groups to home in on the most popular datasets, but that may change as metrics shift to real-world quality.

If there is one dataset that has become practically synonymous with deep learning, it is ImageNet. So much so that dataset creators routinely tout their offerings as “the ImageNet of …” for everything from chunks of software source code, as in IBM’s Project CodeNet, to MusicNet, the University of Washington’s collection of labeled music recordings.

The main aim of the team at Stanford University that created ImageNet was scale. The researchers recognized the tendency of machine learning models at the time to overfit relatively small training datasets, limiting their ability to handle real-world inputs well. Crowdsourcing the job by recruiting casual workers from Amazon’s Mechanical Turk website delivered a much larger dataset. At its launch at the 2009 Conference on Computer Vision and Pattern Recognition (CVPR), ImageNet contained more than three million categorized and labeled images, a number that rapidly expanded to almost 15 million.

The huge number of labeled images proved fundamental to the success of AlexNet, a model based on deep neural networks (DNNs) developed by a team led by Geoffrey Hinton, professor of computer science at the University of Toronto. In 2012, AlexNet won the third annual competition built around a subset of the ImageNet dataset, easily surpassing the results of traditional artificial intelligence (AI) models. Since then, increasingly accurate DNNs and ever-larger datasets have developed hand in hand.

Teams around the world have collected thousands of datasets designed for developing and assessing AI models, and released them to the academic world or the wider public. The Machine Learning Repository at the University of California, Irvine, for example, hosts more than 600 different datasets that range from abalone descriptions to wine quality. Google’s Dataset Search indexes some 25 million open datasets developed for general scientific use, not just machine learning. However, few of the datasets released into the wild achieve widespread use.

Bernard Koch, a graduate student at the University of California, Los Angeles, teamed with Emily Denton, a senior research scientist at Google, and two other researchers on work presented at the Conference on Neural Information Processing Systems (NeurIPS) last year; the team found a long tail of rarely used sources headed by a very small group of highly popular datasets. To work out how much certain datasets predominated, they analyzed five years of submissions to the Papers With Code website, which collates academic papers on machine learning along with their source data and software. Just eight datasets, including ImageNet, each appeared more than 500 times in the collected papers. Most datasets were cited in fewer than 10 papers.
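The analysis itself boils down to counting how often each dataset appears across papers. A minimal sketch of that counting step is shown below; the usage_records list is an invented stand-in for the Papers With Code data the researchers actually analyzed.

```python
# Hypothetical sketch of counting dataset usage across papers. The records
# below are invented placeholders, not the real Papers With Code data.
from collections import Counter

usage_records = [
    ("paper-0001", "ImageNet"),
    ("paper-0002", "ImageNet"),
    ("paper-0002", "SQuAD"),
    ("paper-0003", "SomeRarelyUsedDataset"),
    # ... in practice, hundreds of thousands of (paper, dataset) pairs ...
]

counts = Counter(dataset for _, dataset in usage_records)
head = [d for d, n in counts.items() if n > 500]  # the handful of dominant datasets
tail = [d for d, n in counts.items() if n < 10]   # the long tail of rarely used ones
print(f"{len(head)} heavily used datasets, {len(tail)} rarely used ones")
```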

Much of the focus on the most popular datasets revolves around competitions, which have contributed to machine learning’s rapid advancement, Koch says. “You make it easy for everybody to understand how far we’ve advanced on a problem.”

Groups release datasets in concert with competitions in the hope that the pairing will lead to more attention on their field. An example is the Open Catalyst Project (OCP), a joint endeavor between Carnegie Mellon University and Facebook AI Research that is trying to use machine learning to speed up the process of identifying materials that can work as chemical catalysts. It can take days to simulate their behavior, even using approximations derived from quantum mechanics formulas. AI models have been shown to be much faster, but work is needed to improve their accuracy.

Using simulation results for a variety of elements and alloys, the OCP team built a dataset they used to underpin a competition that debuted at NeurIPS 2021. Microsoft Asia won this round with a model that borrows techniques from the Transformers used in natural language processing (NLP) research, rather than the graph neural networks (GNNs) that had been the favored approach for AI models in this area.

“One of the reasons that I am so excited about this area right now is precisely that machine learning model improvements are necessary,” says Zachary Ulissi, a professor of chemical engineering at CMU who sees the competition format as one that can help drive this innovation. “I really hope to see more developments both in new types of models, maybe even outside GNNs and transformers, and incorporating known physics into these models.”

Real-world performance is at the heart of the OCP’s objectives, but problems can easily arise when the benchmarks themselves come to dominate research objectives. In NLP, the enormous capacity of the Transformer-based models built by industrial groups such as Google and OpenAI has called into question the widespread use of existing benchmarks and their datasets, such as RACE and SQuAD. As with ImageNet, the AI models often score better than humans on the benchmarks, but fail on experiments that probe performance more deeply. Investigation of the results has found that the models often rely on unintended hints in the benchmark tests themselves.

Similar problems emerged in ImageNet and other datasets, where it became apparent that models can rely more on cues provided by groupings of objects than on the target objects themselves. To keep costs down, images in visual datasets often are sourced from photo-sharing sites such as Flickr, and some categories inevitably will be poorly represented. Work presented at the 2017 Conference on Empirical Methods in Natural Language Processing by Jieyu Zhao of the University of Virginia and colleagues showed how the increased prevalence of women cooking in two common datasets made an uncorrected model far more likely to associate that task with women than with men. Princeton University Ph.D. student Angelina Wang and her supervisor, assistant professor of computer science Olga Russakovsky, showed in a paper presented at the 2021 International Conference on Machine Learning how models perform this kind of “directional bias amplification.”
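A rough way to see what such a measure captures is to compare how often a task co-occurs with a gender label in the training annotations versus in a model’s predictions; if the predicted association is stronger, the model has amplified the bias. The sketch below is a simplified illustration of that idea, not the exact directional metric defined by Wang and Russakovsky, and the toy data is invented.

```python
# Simplified illustration of bias amplification; not the exact directional
# metric from Wang and Russakovsky. All data below is invented toy data.

def association(examples, task="cooking", attribute="woman"):
    """Fraction of examples labeled with `task` that also carry `attribute`."""
    task_examples = [e for e in examples if task in e]
    with_attribute = [e for e in task_examples if attribute in e]
    return len(with_attribute) / len(task_examples) if task_examples else 0.0

ground_truth = [  # annotations in the training data
    {"cooking", "woman"}, {"cooking", "woman"}, {"cooking", "man"}, {"running", "man"},
]
predictions = [   # labels the trained model predicts for the same images
    {"cooking", "woman"}, {"cooking", "woman"}, {"cooking", "woman"}, {"running", "man"},
]

train_bias = association(ground_truth)  # 2 of 3 cooking images show women (0.67)
pred_bias = association(predictions)    # the model predicts women in 3 of 3 (1.00)
print(f"bias amplification: {pred_bias - train_bias:+.2f}")  # +0.33
```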


Developers of self-driving vehicles and other robots face a related problem. Most of the existing real-world footage they can use is uneventful and of little use for training systems to recognize potential problems. To train their systems to avoid accidents, they need far more unusual events, such as people running into the road or cases of dangerous driving by others in the scene. The solution to which the community has turned is simulation: creating a much wider range of scenarios than would be possible even with millions of miles of recorded driving. For image-recognition datasets that might replace ImageNet, generative adversarial networks (GANs) provide a way to create synthetic people and scenes that deliver more balanced training and evaluation datasets. The technology has limits, however; though GANs can generate convincing faces, creating more complex scenes remains challenging.
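Sampling synthetic images from a trained GAN amounts to feeding random noise through its generator network. The sketch below uses a minimal DCGAN-style generator in PyTorch purely for illustration; production systems rely on much larger pretrained models, and the architecture and names here are assumptions rather than any particular published system.

```python
# Minimal DCGAN-style generator sketch (illustrative only, untrained weights).
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a latent noise vector to a 64x64 RGB image."""
    def __init__(self, latent_dim=100, feature_maps=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, feature_maps * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(feature_maps * 8),
            nn.ReLU(True),
            nn.ConvTranspose2d(feature_maps * 8, feature_maps * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_maps * 4),
            nn.ReLU(True),
            nn.ConvTranspose2d(feature_maps * 4, feature_maps * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_maps * 2),
            nn.ReLU(True),
            nn.ConvTranspose2d(feature_maps * 2, feature_maps, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_maps),
            nn.ReLU(True),
            nn.ConvTranspose2d(feature_maps, 3, 4, 2, 1, bias=False),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

# Draw a batch of synthetic images, e.g. to augment an under-represented category.
generator = Generator()               # in practice: load pretrained weights
noise = torch.randn(16, 100, 1, 1)    # 16 random latent vectors
synthetic_images = generator(noise)   # tensor of shape (16, 3, 64, 64)
```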

As AI models and datasets have moved from being purely tools for research to production applications, some of which are now used for surveillance and policing, ethical issues and problems caused by biased data have become more pressing. Following an investigation by the Financial Times in 2019, Microsoft withdrew its MS Celeb dataset, originally created to support a facial-recognition contest at the 2017 International Conference on Computer Vision (ICCV). The dataset contained multiple images of 100,000 people, many of them scraped from publicly available online sources, and the newspaper’s investigation found the subjects it contacted had not given permission for their images to be used.

Amid concerns over the use of pictures of people in the dataset, the Stanford group faced the possibility of withdrawing ImageNet from use. Russakovsky, a member of the ImageNet team, says withdrawal would have proven near-impossible in practice for such a widely used dataset; for example, a workshop at the 2019 ICCV used a downsampled version of MS Celeb almost six months after Microsoft’s withdrawal announcement.

Russakovsky says the ImageNet group decided to take “small steps towards mitigating some of the concerns.” This was helped by the fact that ImageNet is focused on object recognition rather than on personal identification, as MS Celeb was. One change was to improve the privacy of people in the background of images by blurring their faces, while ensuring models would still be able to predict accurately whether “this is a photo of a barber’s chair, a Husky, or a beer bottle.”
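A minimal sketch of that kind of face blurring is shown below, assuming OpenCV and its bundled Haar-cascade detector; it illustrates the general technique only, not the ImageNet team’s actual pipeline, and the file names are placeholders.

```python
# Illustrative face-blurring sketch (not the ImageNet team's actual pipeline).
import cv2

def blur_faces(image_path, output_path):
    image = cv2.imread(image_path)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        # Replace each detected face region with a heavily blurred copy.
        image[y:y+h, x:x+w] = cv2.GaussianBlur(image[y:y+h, x:x+w], (51, 51), 0)
    cv2.imwrite(output_path, image)

blur_faces("street_scene.jpg", "street_scene_blurred.jpg")  # placeholder file names
```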

One way to moderate the influence of potentially harmful data in community sources is to limit how it is used, restricting its impact to pure research by removing the legal right of corporations and governments to use the datasets for production models. A number of researchers who have studied ethical issues in datasets have looked at other scientific fields where large datasets play an important role in research, to see which good practices should be carried over into machine learning.

A number of researchers have called for greater diversity in the creation and use of datasets for machine learning. However, the high cost of developing and maintaining the collections, particularly if greater supervision of crowdsourcing is needed to reduce the introduction of biased data, may lead to a further concentration of effort in institutions with the deepest pockets. The cost of higher-quality labeling and selection may be balanced by an increasing focus on data-centric AI, where the emphasis is far more on the quality of datasets than on their raw size.


Work in the data-centric AI community typically focuses on tuning datasets to the task at hand. That, in turn, may reduce the machine-learning community’s tendency to concentrate on a small number of dominant datasets, in favor of highly customized labeled data used in concert with better metrics, rather than an attempt to leverage an ImageNet of anything.

Further Reading

Denton, E., Hanna, A., Amironesei, R., Smart, A., and Nicole, H.
On the genealogy of machine learning datasets: a critical history of ImageNet, Big Data & Society, July–December 2021, 1–14.

Wang, A. and Russakovsky, O.
Directional bias amplification, Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021.

Koch, B., Denton, E., Hanna, A., and Foster, J.G.
Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research, Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Motamedi, M., Sakharnykh, N., and Kaldewey, T.
A data-centric approach for training deep neural networks with less data, arXiv preprint arXiv:2110.03613 (2021).

Papers With Code: The Latest in Machine Learning, https://paperswithcode.com/
