“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.”
—Richard Sutton, The Bitter Lessona
During the last seven decades, the field of artificial intelligence (AI) has seen astonishing developments—recent AI systems have demonstrated impressive, wide-ranging capabilities in areas from mathematics to biology,4,14 and have the potential to influence large fractions of the U.S. economy.6
How have these developments been possible? When we look at the numbers, the story becomes clear—the amounts of compute, data, and parameters in deep learning models have all grown exponentially. On top of these massive increases in scale, researchers are constantly searching for novel ways to improve algorithmic and hardware efficiency.7
While all these factors play a crucial role in AI progress, our focus in this Viewpoint is on training compute—the number of floating point operations (FLOP) performed by computer chips when training a machine learning (ML) model. Training compute has grown an astonishing amount since the start of ML in the 1950s—first doubling every 20 months in tandem with Moore’s Law, and then every six months after the advent of deep learning. For comparison, Rosenblatt’s Perceptron in 195815 was trained with 7e5 FLOP, whereas we estimate the recently unveiled GPT-414 was trained with up to 3e25 FLOP—almost 20 orders of magnitude more compute (see Figure 1).
Figure 1. Training compute budgets of notable ML systems, 1950–2022. (Original graph from Sevilla et al.,17 updated March 2023.)
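As a back-of-the-envelope check on these figures, the gap between the two estimates and the growth implied by the quoted doubling times can be worked out in a few lines of Python (a minimal sketch; the doubling times are the trend estimates quoted above, not exact fits):

import math

# Estimated training compute, in FLOP.
perceptron_flop = 7e5   # Rosenblatt's Perceptron, 1958
gpt4_flop = 3e25        # upper estimate for GPT-4

# Gap between the two systems, in orders of magnitude (base 10).
print(f"Gap: {math.log10(gpt4_flop / perceptron_flop):.1f} orders of magnitude")  # ~19.6

# Yearly growth factor implied by a given doubling time (in months).
def yearly_growth(doubling_time_months):
    return 2 ** (12 / doubling_time_months)

print(f"Pre-deep-learning era (20-month doubling): {yearly_growth(20):.1f}x per year")  # ~1.5x
print(f"Deep-learning era (6-month doubling): {yearly_growth(6):.1f}x per year")        # ~4.0x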
These enormous increases in compute investments have been accompanied by impressive increases in AI capabilities. Research into neural scaling laws suggests training compute is among the best predictors of model performance.10,11 Furthermore, training compute requirements inform budget allocations in large tech companies, who are willing to spend millions of U.S. dollars to train a single model.5,10
Evidently, compute is a very important ingredient for AI. But there is a problem—despite recent studies indicating compute is an essential ingredient for high AI performance, we fail to report it with precision. Case in point: none of the award-winning papers (or papers receiving honorable mentions) in ICLR 2021, AAAI-21, and NeurIPS 2021 explicitly reported their compute usage in unambiguous terms, such as the number of FLOP performed during training.b In fairness to the authors, there are challenges to estimating compute, and a lack of norms for publishing FLOP numbers. This has to change.
Why You Should Report Your Compute Usage
The reasons for reporting compute usage in unambiguous terms go beyond acknowledging the importance of compute—adopting compute reporting norms would support the science of ML, forecasts of AI impacts and developments, and evidence-based AI policy. We discuss each of these factors in turn.
Supporting reproducibility. Publishing precise experimental details about data and model architectures is expected of any rigorous ML paper. Should compute not be held to the same standard as these other experimental details? Understanding the required compute budget helps practitioners know what is needed—in terms of hardware or cloud-computing requirements, engineering effort, and so on—to reproduce or run a similar experiment.
Facilitating comparisons between models. Publishing compute figures allows researchers to better interpret the drivers of progress in ML. For instance, Erdil and Besiroglu7 investigate algorithmic improvements in image classifiers based on a dataset of training compute estimates.
Understanding who is able to train massive ML systems. We are entering a world in which only a few players have the resources to train competitive ML models.1 Transparency around this fact will help us better decide what we should do about it. In particular, we think that researchers reporting how much compute they used, and who provided it, will help us understand access to computing infrastructure in AI research.
Studying the relation between scale and performance. Empirically derived scaling laws hint at a certain regularity in the relationship between compute and performance.10,11 More systematic reporting of compute and performance can help us better map how one translates to the other, and how this ultimately corresponds to the capabilities of AI systems.
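As an illustration of what more systematic reporting would enable, one could fit a simple power law relating reported training compute to a performance metric and use it to project performance at larger budgets. A minimal sketch with made-up numbers (the functional form loss ≈ a · C^(−b) and the data points are assumptions for illustration, not results from the scaling-laws literature):

import numpy as np

# Hypothetical (training compute, test loss) pairs collected from papers that report FLOP.
compute_flop = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
test_loss = np.array([3.9, 3.2, 2.7, 2.3, 2.0])

# Fit loss ≈ a * C^(-b) by least squares in log-log space.
slope, intercept = np.polyfit(np.log10(compute_flop), np.log10(test_loss), deg=1)
a, b = 10 ** intercept, -slope
print(f"Fitted scaling law: loss ≈ {a:.1f} * C^(-{b:.3f})")

# Extrapolate one order of magnitude beyond the data.
print(f"Predicted loss at 1e23 FLOP: {a * (1e23) ** (-b):.2f}")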
Anticipating the emergence of capabilities in AI. Armed with the map between compute and performance, we might be able to extrapolate compute trends to venture guesses on what new capabilities may emerge, and when new capabilities can pose previously unseen risks. Anticipating these issues allows us to take preemptive action. We can, for instance, provide an early warning to AI safety and ethics researchers, helping prioritize their efforts on what might become pressing future issues.3,18
Regulating the deployment of advanced AI systems. Models trained on new regimes of compute must be treated with caution. The emergence of side effects and unexpected behavior plagues our field, where we are being led by bold experiments rather than careful deliberation. Language models can in fact be too big2 and AI ingenuity can produce unintended consequences.12 Being transparent about the levels of compute used in training can help us understand better whether there is precedent for the scale deployed, or if special precautions are warranted.
While capabilities and risks are difficult to assess, especially before training and deployment, compute used in training is a legible lever for external and internal regulation. We can draft policies targeting models exceeding a certain compute budget, for example, requiring special review and safety assurance processes for models exceeding a threshold of training compute.
In short, reporting compute usage in unambiguous terms will help us hold results to greater scientific scrutiny and manage the deployment of advanced AI. This must become a widespread norm, so we can make wiser decisions about this increasingly important part of our technological society.
How to Measure Compute
In a sense, it is not surprising that reporting ML training compute is not already a widespread norm. Currently, there are few standard methods or tools for estimating compute usage, and the ones that exist are often unreliable.c
Developers of ML frameworks such as PyTorch, TensorFlow, or JAX should consider incorporating a way to automatically and consistently measure compute usage into the training toolkit. Such a change would facilitate accurate measurement of training compute, which would benefit the entire field.
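As one illustration of what such built-in support might look like, recent PyTorch releases ship an experimental FLOP counter. The sketch below assumes that counter covers the model's operations, that every optimizer step costs roughly the same, and uses a toy model with a placeholder step count:

import torch
from torch import nn
from torch.utils.flop_counter import FlopCounterMode  # experimental API in recent PyTorch

# Toy model and batch standing in for a real training setup.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
batch = torch.randn(32, 1024)

# Count FLOP for a single forward + backward pass.
counter = FlopCounterMode(display=False)
with counter:
    model(batch).sum().backward()
flop_per_step = counter.get_total_flops()

# Scale by the planned number of optimizer steps to estimate training compute.
total_steps = 100_000  # hypothetical training length
print(f"Estimated training compute: {flop_per_step * total_steps:.2e} FLOP")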
Meanwhile, we can apply approximate methods. One method involves estimating the number of arithmetic operations the ML model performs over training, based on its architecture. Alternatively, we can estimate training compute via information on hardware performance,d and the training time. In general, the first method yields more accurate results, whereas the second can often be determined more easily for complex architectures16 (see Figure 2).
Figure 2. Two ways of applying approximate methods.
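To make the two approaches concrete, here is a minimal sketch of both in Python. The first uses a common approximation for dense transformer-style models of roughly 6 FLOP per parameter per training token (forward plus backward pass); the second multiplies the number of chips by their peak throughput, the training time, and an assumed utilization rate. All concrete values are placeholders, not measurements of any particular system:

# Method 1: count arithmetic operations from the architecture.
# For dense transformer-style models, a common approximation is
# roughly 6 FLOP per parameter per training token (forward + backward).
def compute_from_architecture(n_parameters, n_training_tokens):
    return 6 * n_parameters * n_training_tokens

# Method 2: infer compute from hardware performance and training time.
def compute_from_hardware(n_chips, peak_flop_per_second, training_days, utilization):
    training_seconds = training_days * 24 * 3600
    return n_chips * peak_flop_per_second * training_seconds * utilization

# Placeholder figures for a hypothetical large language model.
print(f"{compute_from_architecture(70e9, 1.4e12):.2e} FLOP")        # ~5.9e23
print(f"{compute_from_hardware(2048, 3.12e14, 21, 0.4):.2e} FLOP")  # ~4.6e23 (312 TFLOP/s peak per chip)

If the two estimates for the same training run land within a factor of a few of each other, that is usually a sign the inputs, especially the assumed utilization rate, are plausible.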
Ultimately, we want to use a consistent and unambiguous way of measuring compute. The metric favored by previous work and that we recommend using is the total number of FLOPe used to train the system.f A calculatorg is available to help with estimating FLOP.17
Conclusion
Ever-growing compute usage has become a central part of ML development. To keep track of this increasingly important metric, we need a norm to consistently report compute. And while no tools exist yet to measure this automatically and precisely, we have explained how you can approximate the result using freely available online tools.h
In the AI field there exists a commitment to transparency, accountability, and oversight, which has resulted in norms such as publishing reproducible code, model cards,13 and data cards.8 Reporting compute is a necessary extension of that commitment.
We hope this will help shift the norms of research in ML toward more transparency and rigor. And you can help us normalize this by acquainting yourself with the methods we have outlined and including FLOP estimates in your writing. Please report your compute.