
Communications of the ACM


The Growing Cost of Deep Learning for Source Code


Recent years have seen a steep increase in the use of artificial intelligence methods in software engineering (AI+SE) research. The combination of these two fields has unlocked remarkable new abilities: Lachaux et al.'s recent work on unsupervised machine translation of programming languages [15], for instance, learns to generate Java methods from C++ with over 80% accuracy, without curated examples. This would surely have sounded like a vision of a distant future just a decade ago, but such quick progress is indicative of the substantial and unique potential of deep learning for software engineering tasks and domains.

Yet these abilities come at a price. The "secret ingredient" is data, as epitomized by Lachaux et al.'s work, which uses 163 billion tokens across three programming languages. For perspective, this is not only nearly 100 times the size of virtually all prior datasets in the AI+SE field; the estimated cost of training this model also runs to tens of thousands of dollars. And even that is a drop in the bucket compared to what is next: training the new state of the art in language models, GPT-3 [2], costs on the order of millions of dollars. This may be a small price to pay for Facebook, where Lachaux et al.'s research was conducted, or OpenAI (GPT-3), but this explosive growth in the cost of reaching the state of the art has left the ability to train and test such models limited to a select few large technology companies, and far beyond the resources of virtually all academic labs. It is reasonable, then, to worry that a continuation of this trend will stifle some of the innovative capacity of academic labs and leave much of the future of AI-based SE research in the hands of elite industry labs. This Viewpoint is a call to action, in which we discuss the current trends and their importance for our field, and propose solutions.


The Case For Scaling

Training deep learners at a massive scale is increasingly essential: many new, more complex tasks and applications of deep learning are uniquely enabled by larger models and datasets. To use a recently popular example: OpenAI's GPT-3 [2] language model has over 100 billion parameters (approximately 1,000 times the size of typical models in the AI+SE academic field [6,12]). This unprecedented scale, the authors found, makes it remarkably adept at learning new tasks from just a few examples; no prior, smaller model could replicate this behavior. This reflects a common trend: innovations in AI increasingly focus on unstructured, complex tasks, because those better align with real-world goals of interest. For instance, learning subject-verb agreement is one component of enabling coherent dialogues with AI assistants. The former requires mostly basic parsing, for which labeled examples can be constructed by the thousands, and is thus quite easy to learn. The latter, meanwhile, involves many layered and interacting properties of communication, making it far harder to learn by imitation alone. This remains an active area of AI research.

Given that AI+SE research has historically adopted many innovations from Natural Language Processing (NLP) research,[a] it is unsurprising that these trends in the latter are now echoed in the former. The two fields have progressed in step over the years, trending toward ever-larger models. The latest popular family of neural networks, Transformers [20], yields especially steady performance improvements with increasing training budgets [13]. This makes them highly suitable for self-supervised pre-training settings, where abundant data (for example, all text from Wikipedia, all code from GitHub) is put to use to prime a network for success in a downstream task. Approaches such as BERT [5,17] and GPT-3 [2] are effectively (very) large Transformers trained with simple objectives on large datasets. These innovations were soon echoed in AI+SE; in the last two years, many such large Transformers were applied to modeling code, as seen in work by Microsoft [6,18], Google [12], and Facebook AI [15]. More recently, GitHub demonstrated large-scale code completion through their Copilot tool powered by OpenAI's Codex, which has 12B parameters and was trained on 100B tokens of code [4]. The figure here shows the (log-)scale and the approximate cost of training these models (in terms of cloud GPU cost), bracketed on the left by an example of a relatively large model recently trained in an academic lab and on the right by what is (likely) to come next from NLP.

Figure. Model and dataset sizes of state-of-the-art deep learners for source code in 2020 (note the log scaling), with approximate training costs for reference.

There is little doubt that their future scales align as well; as the AI+SE field has evolved, the application domain of AI models has grown from relatively simple objectives, like token-level code completion [9,11], to more semantically challenging tasks, such as type inference [8], bug detection [21], and program repair [19]. Lachaux et al.'s work indicates where this trend may be headed next: learning from larger data sources with relatively little supervision. Similarly, the way massive datasets are nearly irreplaceable for model quality [13] in NLP likely holds true for source code as well: Allamanis and Sutton showed a remarkably constant trend in language modeling quality for n-gram models,[b] which recent work suggests is also the case for deep models, and is complementary to model size [14]. In short, billions of tokens and parameters are the bar of the future.
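The constant trend reported for n-gram models (a linear improvement in per-token entropy as training data grows exponentially) can be written as a log-linear relation. This is only a sketch of that statement: the symbols H, N, H_0, and k are introduced here for illustration and are not taken from the cited work.

```latex
% H(N): per-token entropy of a model trained on N tokens.
% H_0 and k > 0 are illustrative constants, not values reported in the source.
\[
  H(N) \approx H_0 - k \log N
  \qquad\Longrightarrow\qquad
  H(cN) - H(N) \approx -k \log c \quad (c > 1).
\]
```

Under this assumed form, every multiplication of the dataset size by a fixed factor c buys the same fixed reduction in entropy, which is why each further improvement requires a multiplicatively larger dataset.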


The Cost of Scaling

What does it cost to scale like that? To train a single, large Transformer-based model takes multiple high-end GPUs working in parallel, often for days or weeks. The accompanying figure reflects that training just Lachaux et al.'s final reported model in the cloud would cost approximately $30,000, based on 32 GPUs working in parallel. Training GPT-3 (which took 80 V100 GPU years) [2] would cost more than $2 million. Those numbers reflect only the training of the published model; research in this field often involves a range of prototypes and ablations. Typically, a grid search is employed to select optimal hyperparameters (for example, number of layers, learning rate, and batch size) from many hundreds of candidates; for example, Hellendoorn et al. train more than 1,000 model variations on Google's infrastructure [10].
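The arithmetic behind these estimates can be sketched in a few lines. The GPU counts, GPU-years, and dollar totals below come from the text; the hourly rate of $3 per V100 and the hyperparameter grid are assumptions chosen for illustration, not figures from the cited papers.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # 8,760 hours

# ASSUMPTION: on-demand cloud price for one V100 GPU, in USD per hour.
ASSUMED_USD_PER_GPU_HOUR = 3.0


def cloud_cost(gpu_hours, usd_per_gpu_hour=ASSUMED_USD_PER_GPU_HOUR):
    """Cost in USD of renting GPUs for the given number of GPU-hours."""
    return gpu_hours * usd_per_gpu_hour


def grid_size(search_space):
    """Number of configurations in a full grid search over search_space."""
    return prod(len(options) for options in search_space.values())


# GPT-3: 80 V100 GPU-years, as quoted in the text.
gpt3_cost = cloud_cost(80 * HOURS_PER_YEAR)  # 2,102,400.0 at the assumed rate

# Lachaux et al.: about $30,000 on 32 GPUs implies this much wall-clock time
# at the assumed rate.
lachaux_days = 30_000 / (32 * ASSUMED_USD_PER_GPU_HOUR) / 24

# HYPOTHETICAL grid over the hyperparameters named in the text, plus hidden size.
search_space = {
    "num_layers": [2, 4, 6, 8, 10, 12],
    "learning_rate": [1e-3, 3e-4, 1e-4, 3e-5, 1e-5],
    "batch_size": [16, 32, 64, 128],
    "hidden_size": [128, 256, 512, 1024],
}
num_runs = grid_size(search_space)  # 6 * 5 * 4 * 4 = 480 configurations

print(f"GPT-3 training, assumed rate: ${gpt3_cost:,.0f}")
print(f"Lachaux et al. implied wall-clock time: {lachaux_days:.1f} days")
print(f"Grid search runs: {num_runs}")
```

At the assumed rate, the GPT-3 figure lands just above $2 million, in line with the estimate in the text. Even a modest grid multiplies the number of training runs by several hundred, which is why the cost of the final published model understates the cost of the research.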

This hidden cost in the discovery and evaluation of new models can quickly scale up costs, which is why most research labs purchase in-house machines rather than relying on cloud computing. That requires a large up-front cost that is well out of reach for almost any academic lab: a single server with 16 V100 GPUs, for instance, costs $400K from NVIDIA, and Lachaux et al. used twice that power. Meanwhile, the average U.S. NSF grant size is ~$630K over three years (median: $500K),[c] most of which is intended to cover students, which often leaves the equipment portion small enough that just training Lachaux et al.'s final model in the cloud would be challenging. As a result, academic communities have resorted to buying much cheaper GPUs in smaller quantities, but this limits training. Code inputs tend to be very large (files frequently span thousands of tokens), which plays poorly with limited GPU memory, so investigators must trade off model size, input size, or training batch size, and thereby their model's quality.
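The trade-off among model size, input size, and batch size can be illustrated with a deliberately crude estimate of Transformer activation memory. Everything in this sketch is an assumption made for illustration: the function name, the count of 16 stored tensors per layer, and the example configuration are hypothetical, and real memory use depends on the implementation, optimizer state, and precision.

```python
def approx_activation_gib(batch_size, seq_len, hidden_size, num_layers,
                          num_heads, bytes_per_value=4, tensors_per_layer=16):
    """Very rough activation-memory estimate for training a Transformer.

    ASSUMPTIONS (illustrative only):
      * each layer keeps about `tensors_per_layer` tensors of shape
        (seq_len, hidden_size) for the backward pass;
      * each layer also keeps one attention matrix of shape
        (seq_len, seq_len) per head;
      * values are stored as 32-bit floats (4 bytes).
    Parameters, gradients, and optimizer state are ignored.
    """
    per_token_values = tensors_per_layer * seq_len * hidden_size  # linear in seq_len
    attention_values = num_heads * seq_len * seq_len              # quadratic in seq_len
    total_values = batch_size * num_layers * (per_token_values + attention_values)
    return total_values * bytes_per_value / 2**30


# HYPOTHETICAL small model configuration.
config = dict(hidden_size=512, num_layers=6, num_heads=8)

short_inputs = approx_activation_gib(batch_size=16, seq_len=1024, **config)
long_inputs = approx_activation_gib(batch_size=16, seq_len=4096, **config)
long_small_batch = approx_activation_gib(batch_size=2, seq_len=4096, **config)

print(f"batch 16, 1,024 tokens: {short_inputs:.1f} GiB")
print(f"batch 16, 4,096 tokens: {long_inputs:.1f} GiB")
print(f"batch  2, 4,096 tokens: {long_small_batch:.1f} GiB")
```

Because the attention term grows quadratically with input length, quadrupling the input size more than quadruples the estimate. On a fixed memory budget, the only ways to accommodate longer files in this simple model are a smaller batch or a smaller network.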


How to Turn the Tide

Although the trend toward massive scaling in industry is pervasive, it is still relatively early for AI+SE. If we can mitigate these problems before this gap grows out of bounds, we may well avoid long-lasting damage to innovation in this space.

Similar issues plagued the field of supercomputing, where vast scale on limited budgets also presented insurmountable obstacles to individual labs. To overcome this cost barrier, groups of institutions, in conjunction with funding agencies and smaller companies, jointly funded large computing clusters (for example, Lawrence Livermore National Laboratory (U.S.) and the Swiss National Supercomputing Centre (Switzerland)) exclusively for academic research. In the discipline of AI research, the CCC and the AAAI jointly released a 20-year roadmap,[d] which also highlights the funding gap between industry and academia. One of their key proposals to address this is to create dedicated AI research facilities that can house up to 200 researchers and provide them with the computing infrastructure needed to advance the state of the art in AI.

Currently, in the U.S., just one large GPU-powered cluster, named COMET,[e] has been funded by the NSF. It contains 144 GPU nodes with four CPUs each, which is entirely insufficient to support so many different endeavors. One clear-cut path forward is to create large-scale collaborations across universities (and even countries) and funding agencies. Ideally, this also involves industrial funding, as tech companies benefit from high-quality academic research (on which the current wave of progress is largely built). Pooling the presently distributed resources to create large computing facilities with many thousands of GPUs would greatly accelerate research efforts and offer the significant benefit of leveraging unused capacity inherent in broad resource sharing.

At a smaller scale, there is precedent as well. With the increasing importance of mining software repositories research in recent years, funding was provided to create dedicated hardware and software resources that efficiently collect public datasets from open source projects. Several such projects have been funded, including Boa, BugSwarm, and Fasten,[f] all on the order of $1M-$1.4M. Now that deep learning for software is proving similarly pivotal, there is a need for such dedicated grants to invest in medium-to-large-scale hardware. Such funding will permit the centralized training of (very) large models that can be compressed and disseminated to smaller labs for fine-tuning on downstream tasks. We might also envision smaller grants, with budgets on the order of $250K, to enable individual research labs to purchase the machinery needed to replicate many of the experiments shown in the accompanying figure.

Finally, we would be remiss if we did not highlight the potential for research itself to help overcome these boundaries. Research groups worldwide are working on accelerating training and model compression [3,7,16], which can serve as major democratizing forces for this ubiquitous problem. In addition, many fruitful research pursuits in AI+SE require novel insights rather than substantial compute resources; smaller, more targeted tasks remain achievable with simpler models and do not call for a single large panacea model. Even so, the overall trend is clear: model growth is exponential and unrelenting, training speed-ups are not keeping up with this leading edge, and novel ideas increasingly go hand-in-hand with large models. We argue that a holistic research agenda empowers both pursuits.


Are We on Track?

In the U.S., the NSF is planning to scale up its investment in AI research from $297M (as of 2019) to $734M by 2022,[g] acknowledging this increase is necessary due to the high cost of computing resources. The European Union is planning to invest a (much larger) total of $20 billion per year, starting in 2020, specifically for advancing AI research across all fields. Although the current funding structure does not designate equipment funding as we proposed in this Viewpoint, we believe our suggestions can work well with these budget increases. We believe studies of software have a particularly timely need for such resources and hope our recommendations spur the necessary change.



References

1. Allamanis, M. and Sutton, C. Mining source code repositories at massive scale using language modeling. In 2013 10th Working Conference on Mining Software Repositories (MSR). IEEE, 207–216.

2. Brown, T.B. et al. Language models are few-shot learners. (2020); arXiv preprint arXiv:2005.14165.

3. Chen, B. Slide: In defense of smart algorithms over hardware acceleration for large-scale deep learning systems. (2019); arXiv preprint arXiv:1903.03129.

4. Chen, M. et al. Evaluating large language models trained on code. (2021); arXiv preprint arXiv:2107.03374.

5. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. (2018); arXiv preprint arXiv:1810.04805.

6. Feng, Z. et al. Codebert: A pre-trained model for programming and natural languages. (2020); arXiv preprint arXiv:2002.08155.

7. Goyal, P. et al. Accurate, large minibatch sgd: Training imagenet in 1 hour. (2017); arXiv preprint arXiv:1706.02677.

8. Hellendoorn, V.J. et al. Deep learning type inference. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. (2018), 152–162.

9. Hellendoorn, V.J. and Devanbu, P. Are deep neural networks the best choice for modeling source code? In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. (2017), 763–773.

10. Hellendoorn, V.J. et al. [n.d.]. Global Relational Models of Source Code.

11. Hindle, A. et al. On the naturalness of software. In 2012 34th International Conference on Software Engineering (ICSE). (2012) IEEE, 837–847.

12. Kanade, A. et al. Pre-trained contextual embedding of source code. (2019); arXiv preprint arXiv:2001.00059.

13. Kaplan, J. et al. Scaling laws for neural language models. (2020); arXiv preprint arXiv:2001.08361.

14. Karampatsis, R.-M. Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code. (2020); arXiv preprint arXiv:2003.07914.

15. Lachaux, M.-A. et al. Unsupervised translation of programming languages. (2020); arXiv preprint arXiv:2006.03511.

16. Li, Z. et al. Train large, then compress: Rethinking model size for efficient training and inference of transformers. (2020); arXiv preprint arXiv:2002.11794.

17. Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. (2019); arXiv preprint arXiv:1907.11692.

18. Svyatkovskiy, A. et al. IntelliCode compose: Code generation using transformer. (2020); arXiv preprint arXiv:2005.08025.

19. Vasic, M. et al. Neural program repair by jointly learning to localize and repair. (2019); arXiv preprint arXiv:1904.01720.

20. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems. (2017), 5998–6008.

21. Wang, S. et al. Automatically learning semantic features for defect prediction. In Proceedings of the 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE) (2016), 297–308.



Authors

Vincent J. Hellendoorn is an assistant professor of computer science at Carnegie Mellon University, Pittsburgh, PA, USA.

Anand Ashok Sawant is a Research Professional at Siemens Corporate Technology, Princeton, NJ, USA.



Footnotes

a. This is on account of the many parallels between code and natural language: for example, words and identifiers, sentences and statements, and even complex dependencies such as parse trees.

b. Specifically, a linear improvement in per-token entropy with exponentially increasing training data size [1].

c. NSF FY 2022 Budget Request to Congress, CISE.

d. A 20-Year Community Roadmap for Artificial Intelligence Research in the U.S.

e. See

f. NSF grant IDs 1513263, 1629976 and EU H2020 ID 825328 respectively.

g. NSF FY 2022 Budget Request to Congress, CISE.

Copyright held by authors.
Request permission to (re)publish from the owner/author

The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.

