Data Quality May Be All You Need

Model size is not everything.

History has a lesson for the development of artificial intelligence (AI): when in doubt, make it bigger.

Many in the field point to a 2019 blog post by Richard Sutton, professor of computing science at Canada’s University of Alberta. In “The Bitter Lesson,” he argued that over its 70-year history, AI has succeeded when it has exploited available computing power. A series of papers published during the past decade that analyzed deep learning performance has confirmed the powerful effects of scaling up model size. This process accelerated in the wake of Google’s development of the Transformer architecture, used in its BERT large language models (LLMs). Model size, measured by the number of stored neural weights, ballooned in just five years. Starting from BERT’s 340 million parameters, today’s largest implementations, known as frontier models, such as OpenAI’s GPT-4, have pushed beyond a trillion.

Model size is not everything. In 2022, a team at Google subsidiary DeepMind declared that scaling up a model without a similar increase in training data is not a wise strategy. Their Chinchilla model was a quarter the size of the 280-billion-parameter Google model Gopher. They fed this smaller model four times more data. During testing, Chinchilla turned out to be 7% more accurate than Gopher.

Chinchilla’s training demanded an immense quantity of data: 1.4 trillion tokens (a token roughly equates to a short word or word fragment). A 2022 study by Pablo Villalobos and colleagues at research institute Epoch argued that the world is not generating enough textual data to keep pace with larger LLMs.

“When we talk about high-quality language data, we usually refer to stuff like research papers, where there’s roughly 1 trillion tokens. And then there’s another 1.8 trillion tokens in books,” says Niklas Muennighoff, research engineer at generative AI startup Contextual AI. By that estimate, a model half the size of Gopher, trained with Chinchilla’s ratio of data to parameters, would exhaust the stock of high-quality data.
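The arithmetic behind that estimate can be sketched as follows. This is a rough illustration rather than the researchers' own calculation: it assumes Chinchilla's ratio of training tokens to parameters is held fixed, and the function name `max_params_for` is hypothetical. All figures are the ones quoted above.

```python
# Figures quoted in the article.
GOPHER_PARAMS = 280e9                      # 280-billion-parameter Gopher
CHINCHILLA_PARAMS = GOPHER_PARAMS / 4      # "a quarter the size" of Gopher
CHINCHILLA_TOKENS = 1.4e12                 # 1.4 trillion training tokens
HIGH_QUALITY_TOKENS = 1.0e12 + 1.8e12      # research papers + books

# Assumption: the tokens-per-parameter ratio used for Chinchilla stays fixed.
tokens_per_param = CHINCHILLA_TOKENS / CHINCHILLA_PARAMS


def max_params_for(tokens: float, ratio: float = tokens_per_param) -> float:
    """Largest model (in parameters) a given token stock supports at this ratio."""
    return tokens / ratio


largest = max_params_for(HIGH_QUALITY_TOKENS)
print(f"tokens per parameter: {tokens_per_param:.0f}")
print(f"largest supportable model: {largest / 1e9:.0f}B parameters")
print(f"fraction of Gopher's size: {largest / GOPHER_PARAMS:.2f}")
```

Under these assumptions the ratio works out to about 20 tokens per parameter, so 2.8 trillion tokens would support roughly 140 billion parameters, half of Gopher's 280 billion.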

The stock of data available to future commercial models may not even be that large. During the early development of LLMs, researchers could use copyright exemptions to train their creations on textbooks and novels. However, the shift to commercial deployment means developers lose those exemptions and now face lawsuits from publishers and creators seeking to deny free access to their works.

For non-English-language material, there is even less usable data. A survey by an international team from 20 universities, led by Google researchers, found that of 205 Web-crawled datasets in non-English languages, 15 were unusable for pretraining. In almost 90 others, fewer than half the sentences were suitable. Even after problematic data has been filtered out, the information content of what remains creates other challenges. Maximizing data may be counterproductive.

Experiments have shown that duplicated information leads not just to unwanted biases; it also causes LLMs to memorize data instead of using it to generalize about the inputs. Personal data and copyrighted phrases can easily wind up being incorporated into the model verbatim. The prime cause seems to be near-copies that pass simple exact-match filters.

At the 2022 Annual Meeting of the Association for Computational Linguistics (ACL), Katherine Lee of Google DeepMind and Daphne Ippolito of the University of Pennsylvania reported on how a lot of content can pass simple deduplication checks because it quotes from other sources or has been partially rewritten. Before they applied a probabilistic filter scheme to remove these near-copies, one English sentence of 60 words appeared more than 60,000 times in a commonly used training dataset. In some cases, duplicated content even appeared in the test data used to check the trained model. Trained on the filtered set, the model emitted memorized text 10 times less often.
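One common probabilistic technique for catching near-copies is MinHash, which estimates how much two documents overlap without comparing them word for word. The sketch below is illustrative only and is not the researchers' actual pipeline; the function names, the three-word shingle size, the 64 hash functions, and the 0.5 threshold are all assumptions chosen for the example.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Break text into overlapping n-word fragments (shingles)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    """For each of num_hashes salted hashes, keep the smallest shingle hash."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


original = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
rewritten = "the quick brown fox jumps over the lazy dog near the quiet river bank tonight"
unrelated = "training data for large language models is scraped from many public web pages"

sig_original = minhash_signature(original)
near_copy_score = estimated_similarity(sig_original, minhash_signature(rewritten))
unrelated_score = estimated_similarity(sig_original, minhash_signature(unrelated))

THRESHOLD = 0.5  # hypothetical cut-off for flagging a near-duplicate
print("rewritten flagged:", near_copy_score > THRESHOLD)
print("unrelated flagged:", unrelated_score > THRESHOLD)
```

An exact-match filter would treat the first two sentences as different because one word has changed, whereas their signatures still agree in most positions. A production system would also need an index so that every document is not compared against every other.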

Publishing their work just a few weeks later, Danny Hernandez and colleagues at AI company Anthropic found that duplication degrades model quality overall. They claimed the core of the problem is that repeated data diverts a disproportionate share of the model’s capacity away from generalizing relationships in the data. Repeating just 0.1% of the data 100 times left a model with only the accuracy of one half its size trained on content that had not been artificially duplicated.
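A back-of-the-envelope calculation shows why so small a slice can have an outsized effect. The sketch below rests on one simplifying assumption that is not taken from the Anthropic paper: the remaining 99.9% of the data is seen exactly once. The function name `repeated_share` is hypothetical.

```python
def repeated_share(fraction: float, copies: int) -> float:
    """Share of all training tokens that come from the repeated slice.

    Assumes the slice (`fraction` of the unique data) is seen `copies` times
    and everything else is seen once.
    """
    repeated = fraction * copies
    seen_once = 1.0 - fraction
    return repeated / (repeated + seen_once)


share = repeated_share(0.001, 100)
print(f"{share:.1%} of training tokens come from 0.1% of the unique data")
```

On that assumption, a slice making up one-thousandth of the unique data accounts for roughly one in every eleven tokens the model sees during training.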

Although partial duplication of data leads to problems, controlled repetition can work well. Cornell Tech associate professor Alexander Rush says the repetition of complete data sets across multiple training cycles, or epochs, seems to have fallen out of favor. His work with Muennighoff and others has shown how repetition can deliver better models, albeit with diminishing returns over multiple epochs.

“Our paper encourages training on unique and good data, when possible,” Rush explains. But because of the shortage of original data, “We do not believe that this will be possible for future frontier models,” he adds.

The experiment also showed that filtering data using some basic metrics of quality and training across two epochs produced better results than training the full dataset over just one epoch.

One way to find new and seemingly high-quality data for huge models is to invent some of it. However, this introduces other problems. Because LLMs use probability distributions to deliver their answers, the synthetic content they generate can easily diverge from the human-generated material on which they were trained. Several studies already have shown this divergence translates into progressively worse results in a variety of AI models. Sina Alemohammad and colleagues at Rice University coined the term Model Autophagy Disorder (MAD) for the artifacts in the output from image generators trained on the outputs of their predecessors. Human faces developed unsightly and artificial crosshatch patterns in later-generation images.

This degradation from synthetic sources can become troublesome even when teams believe they are using real-world data. Content produced by ChatGPT and other LLMs already appears in Web-crawled data, and it will wind up in future training datasets. A team at the University of Cambridge used a mathematical model of the training process to argue that a growing proportion of material from LLMs will alter the knowledge they store and make them less useful. Such pollution could happen more quickly than expected. Several groups have found LLMs favor AI-generated content because it delivers better training metrics than human-generated data.

Sub-optimal data in supposedly high-quality collections may be degrading performance. So, researchers are now considering much deeper inspection of what they feed their models. They are building more rigorous tests of what matters and the degree to which synthetic data can be used safely. The results of this work may encourage LLM developers to hit the brakes on scaling. Such a move would be popular as concerns are growing over the economic and environmental sustainability of these models.

Samuel Albanie, assistant professor of applied machine learning at the U.K.’s University of Cambridge, sees a turning point in work Microsoft published in June last year. This paper claimed, in an echo of the title of the 2017 Google paper that introduced Transformers, “textbooks are all you need.”

The textbooks do not need to be human-written. For some of its training data, the Microsoft model Phi-1 relied on output from GPT-3.5, among other sources, in the form of high-quality synthetic textbook extracts for training a coding-focused model. Though not at the level of the much-larger GPT-4, Phi-1 outperformed LLMs 10 times its size, with one exception. It took just four days to train on just 7 billion tokens, focused on programming tasks written in Python. At the end of last year, the team followed up with Phi-2, trained on 1.4 trillion tokens and double the size of its predecessor.

One approach to determining how best to build and scale data may come from the world of open-source multimodal AI work. A major problem even among the open-source datasets used to build tools such as Stable Diffusion is that their contents are not well understood. LLM developers have accepted this up to now because their innovation has focused on model architecture. The DataComp initiative seeks to flip this around by keeping the model architecture fixed and then training on datasets that have been filtered and processed in innovative ways. The group has also designed the scheme to work on comparatively small models, making it more accessible to teams that lack the funding of large companies and research institutes.

A key question is whether this kind of work will scale up to larger models predictably. Sometimes the patterns are less than clear. Work by Stella Biderman and colleagues from the EleutherAI group, which is attempting to build an open-source equivalent of the GPT family of LLMs, found that experiments on small models do not predict what larger versions will memorize. In general, experiments have shown that size leads to increased robustness when LLMs are faced with types of data on which they have not been trained. This may ultimately prove to be a driving force for continued scaling-up of model size, even in a more data-centric environment. Some argue this apparent size advantage may simply come from more diverse training data, which could instead be obtained through smarter selection and synthetic generation.

But because scale has proven to be so important until now, it continues to be difficult for groups operating outside the tech giants to assess the real importance of size. The key issue is that the tech giants do not publish details of their datasets, making any comparisons of data quality much harder.

The other issue is the expense of training frontier models, although Muennighoff points to open-source efforts that are scaling up models thanks to resources such as the European research supercomputer LUMI. Recent efforts have scaled the parameter count of open-source models to close to the size of Google’s Chinchilla. The open-source efforts will find it difficult to close the gap on the frontier models, but they may be more successful in showing how important that extra scale really is.

Further Reading

  • Hoffmann, J., et al. An empirical analysis of compute-optimal large language model training, Proceedings of Advances in Neural Information Processing Systems 35 (NeurIPS), arXiv:2203.15556 (2022)
  • Villalobos, P., Sevilla, J., Heim, L., Besiroglu, T., Hobbhahn, M., and Ho, A. Will we run out of data? An analysis of the limits of scaling datasets in machine learning, arXiv:2211.04325 (2022)
  • Gunasekar, S., et al. Textbooks are all you need, arXiv:2306.11644 (2023)
  • Muennighoff, N., Rush, A.M., Barak, B., Le Scao, T., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C. Scaling data-constrained language models, Proceedings of Advances in Neural Information Processing Systems 36 (NeurIPS), arXiv:2305.16264 (2023)
