History tells us that scientific progress is imperfect. Intellectual traditions and available tooling can prejudice scientists away from some ideas and towards others.24 This adds noise to the marketplace of ideas and often means there is inertia in recognizing promising directions of research. In the field of artificial intelligence (AI) research, this article posits that it is tooling which has played a disproportionately large role in deciding which ideas succeed and which fail.
Key Insights
- The term hardware lottery describes a research idea that wins due to its compatibility with available software and hardware, not its superiority over alternative research directions.
- We may be in the midst of a present-day hardware lottery. Hardware design has prioritized delivering on commercial use cases, while built-in flexibility to accommodate the next generation of ideas remains a secondary consideration.
- Any attempt to avoid future hardware lotteries must be concerned with making it cheaper and less time-consuming to explore different hardware/software/algorithm combinations.
What follows is part position paper and part historical review. I introduce the term “hardware lottery” to describe when a research idea wins because it is compatible with available software and hardware, not because the idea is superior to alternative research directions. Choices about software and hardware have often played a decisive role in determining the winners and losers throughout early computer science history.
These lessons are particularly salient as we move into a new era of closer collaboration between the hardware, software, and machine-learning research communities. After decades in which hardware, software, and algorithm were treated as separate choices, the catalysts for closer collaboration include changing hardware economics, a “bigger-is-better” race in the size of deep-learning architectures, and the dizzying requirements of deploying machine learning to edge devices.
Closer collaboration is centered on a wave of new-generation, “domain-specific” hardware that optimizes for the commercial use cases of deep neural networks. While domain specialization creates important efficiency gains for mainstream research focused on deep neural networks, it arguably makes it even more costly to veer off the beaten path of research ideas. An increasingly fragmented hardware landscape means that the gains from progress in computing will be increasingly uneven. While deep neural networks have clear commercial use cases, there are early warning signs that the path to the next breakthrough in AI may require an entirely different combination of algorithm, hardware, and software.
This article begins by acknowledging a crucial paradox: machine-learning researchers mostly ignore hardware despite the role it plays in determining which ideas succeed. It then considers the ramifications of the siloed evolution of hardware, software, and algorithm, illustrated with examples of early hardware and software lotteries. And, while today’s hardware landscape is increasingly heterogeneous, I posit that the hardware lottery has not gone away, and the gap between the winners and losers will grow. After unpacking these arguments, the article concludes with thoughts on how to avoid future hardware lotteries.
Separate Tribes
For the creators of the first computers, the program was the machine. Due to both the cost of the electronics and a lack of cross-purpose software, early machines were single use; they were not expected to be repurposed for a new task (Figure 1). Charles Babbage’s “difference engine” (1817) was solely intended to compute polynomial functions.9 IBM’s Harvard Mark I (1944) was a programmable calculator.22 Rosenblatt’s perceptron machine (1958) computed a stepwise single-layer network.48 Even the Jacquard loom (1804), often thought of as one of the first programmable machines, was, in practice, so expensive to re-thread that it was typically threaded once to support a pre-fixed set of input fields.36
Figure 1. Early computers were single use and were not expected to be repurposed. These machines could not be expected to run the variety of programs our modern-day machines do.
In the early 1960s, joint specialization of hardware and software went vertical. IBM was an early pioneer in the creation of instruction sets that were portable between its own computers. A growing business could install a small IBM 360 computer and not be forced to relearn everything when migrating to a bigger 360 machine. Competitors Burroughs, Cray, and Honeywell all developed their own systems, so programs could be ported between different machines from the same manufacturer but not to a competitor’s machines. The design itself remained siloed, with hardware and software developed jointly in-house.
Today, in contrast to the specialization necessary in computing’s very early days, machine-learning researchers tend to think of hardware, software, and algorithms as three separate choices. This is largely due to a period in computer science history that radically changed the type of hardware that was produced and incentivized the hardware, software, and machine-learning research communities to evolve in isolation.
The general-purpose computer era crystallized in 1965, when a young engineer named Gordon Moore penned an opinion piece in Electronics magazine titled, “Cramming More Components onto Integrated Circuits.”33 In it, Moore predicted that the number of transistors on an integrated circuit could be doubled every two years. The article and its subsequent follow-up were originally motivated by a simple desire: Moore thought it would sell more chips. However, the prediction held and motivated a remarkable decline in the cost of transforming energy into information over the next 50 years.
Moore’s law combined with Dennard scaling12 enabled a three-orders-of-magnitude increase in microprocessor performance from 1980 to 2010. The predictable increases in computing power and memory every two years meant hardware design became risk averse. Why experiment on more specialized hardware designs for an uncertain reward when Moore’s law allowed chip makers to lock in predictable profit margins? Even for tasks that demanded higher performance, the benefits of moving to specialized hardware could be quickly eclipsed by the next generation of general-purpose hardware with ever-growing computing power.
The emphasis shifted to universal processors that could solve myriad different tasks. The few attempts to deviate and produce specialized supercomputers for research were financially unsustainable and short-lived. A few very narrow tasks, such as mastering chess, were an exception to this rule because the prestige and visibility of beating a human adversary attracted corporate sponsorship.34
Treating the choice of hardware, software, and algorithm as independent has persisted until recently. It is expensive to explore new types of hardware, both in terms of time and capital required. Producing a next-generation chip typically costs $30–$80 million and takes two to three years to develop.14 These formidable barriers to entry have produced a hardware research culture that might feel odd or perhaps even slow to the average machine-learning researcher. While the number of machine-learning publications has grown exponentially in the last 30 years, the number of hardware publications has maintained a fairly even cadence.42 For a hardware company, leakage of intellectual property can make or break the survival of the firm. This has led to a much more closely guarded research culture.
In the absence of any lever with which to influence hardware development, machine-learning researchers rationally began to treat hardware as a sunk cost to work around rather than something fluid that could be shaped. However, just because we have abstracted hardware away does not mean it has ceased to exist. Early computer science history tells us there are many hardware lotteries where the choice of hardware and software has determined which ideas succeed and which fail.
The Hardware Lottery
The first sentence of Tolstoy’s Anna Karenina reads, “All happy families are alike; every unhappy family is unhappy in its own way.”47 Tolstoy is saying that it takes many different things for a marriage to be happy—financial stability, chemistry, shared values, healthy offspring. However, it takes only one of these aspects to be missing for a family to be unhappy. This has been popularized as the Anna Karenina principle: “a deficiency in any one of a number of factors dooms an endeavor to failure.”32
Despite our preference to believe algorithms succeed or fail in isolation, history tells us that most computer science breakthroughs follow the Anna Karenina principle. Successful breakthroughs are often distinguished from failures by benefiting from multiple criteria aligning serendipitously. For AI research, this often depends upon winning what I have termed the hardware lottery—avoiding possible points of failure in downstream hardware and software choices.
An early example of a hardware lottery is the analytical engine (1837). Charles Babbage was a computer pioneer who designed a machine that could be programmed, at least in theory, to solve any type of computation. His analytical engine was never built, in part because he had difficulty fabricating parts with the correct precision.25 The technology required to actually realize the theoretical foundations laid down by Babbage only surfaced during WWII. In the first part of the 20th century, electronic vacuum tubes were heavily used for radio communication and radar. During WWII, these vacuum tubes were repurposed to provide the computing power necessary to break the German Enigma code.10
As noted in the TV show Silicon Valley, often “being too early is the same as being wrong.” When Babbage passed away in 1871, there was no continuous path between his ideas and modern computing. The concepts of a stored program, modifiable code, memory, and conditional branching were rediscovered a century later because the right tools existed to empirically show that the ideas worked.
The Lost Decades
Perhaps the most salient example of the damage caused by not winning the hardware lottery is the delayed recognition of deep neural networks as a promising direction of research. Most of the algorithmic components needed to make deep neural networks work had already been in place for a few decades: backpropagation was invented in 1963,43 reinvented in 1976,29 and then again in 1988,39 and was paired with deep convolutional neural networks15 in 1989.27 However, it was only three decades later that deep neural networks were widely accepted as a promising research direction.
The gap between these algorithmic advances and empirical success is due in large part to incompatible hardware. During the general-purpose computing era, hardware such as central processing units (CPUs) was heavily favored and widely available. CPUs are very good at executing an extremely wide variety of tasks; however, processing so many different tasks can incur inefficiency. CPUs require caching intermediate results and are limited in the concurrency of tasks that can be run, which poses limitations for an operation such as matrix multiplication, a core component of deep neural-network architectures. Matrix multiplies are very expensive to run sequentially but far cheaper to compute when parallelized. The inability to parallelize on CPUs meant matrix multiplies quickly exhausted memory bandwidth, and it simply wasn’t possible to train deep neural networks with multiple layers.
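To make the parallelism argument concrete, here is a minimal sketch (my illustration, not from the article), assuming only Python and NumPy, whose matrix multiply dispatches to a vectorized and typically multithreaded BLAS. The same arithmetic becomes dramatically cheaper once the hardware and the library can exploit parallelism:

```python
# A minimal sketch: the same matrix multiply, computed strictly sequentially
# in pure Python versus handed to NumPy/BLAS, which vectorizes and parallelizes.
import time
import numpy as np

n = 128
A = np.random.rand(n, n)
B = np.random.rand(n, n)

def matmul_sequential(A, B):
    """Naive triple loop: every multiply-add runs one after another."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for t in range(k):
                acc += A[i, t] * B[t, j]
            C[i, j] = acc
    return C

start = time.perf_counter()
C_slow = matmul_sequential(A, B)
t_slow = time.perf_counter() - start

start = time.perf_counter()
C_fast = A @ B  # vectorized; BLAS spreads the work across SIMD lanes and cores
t_fast = time.perf_counter() - start

assert np.allclose(C_slow, C_fast)
print(f"sequential loop: {t_slow:.3f}s, vectorized/parallel: {t_fast:.5f}s")
```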
While domain specialization creates important efficiency gains for mainstream research focused on deep neural networks, it arguably makes it even more costly to veer off the beaten path of research ideas.
The need for hardware that supported tasks with lots of parallelism was pointed out as far back as the early 1980s in a series of essays titled, “Parallel Models of Associative Memory.”19 The essays argued persuasively that biological evidence suggested massive parallelism was needed to make deep neural-network approaches work.
In the late 1980s/90s, the idea of specialized hardware for neural networks had passed the novelty stage. However, efforts remained fractured due to a lack of shared software and the cost of hardware development. Without a consumer market, there was simply not the critical mass in end users to be financially viable. It would take a hardware fluke in the early 2000s, a full four decades after the first paper about backpropagation was published, for the insights about massive parallelism to be operationalized in a useful way for connectionist deep neural networks.
A graphics processing unit (GPU) was originally introduced in the 1970s as a specialized accelerator for video games and developing graphics for movies and animation. In the 2000s, GPUs were repurposed for an entirely unimagined use case—to train deep neural networks.7 GPUs had one critical advantage over CPUs: they were far better at parallelizing a set of simple, decomposable instructions, such as matrix multiplies. This higher number of effective floating-point operations per second (FLOPS), combined with clever distribution of training between GPUs, unblocked the training of deeper networks.
The number of layers in a network turned out to be the key. Performance on ImageNet jumped with ever-deeper networks. A striking example of this jump in efficiency is the now-famous 2012 Google research that required 16,000 CPU cores to classify cats; just a year later, a published paper reported the same task had been accomplished using only two CPU cores and four GPUs.8
Software Lottery
Software also plays a role in deciding which research ideas win and which ones lose. Prolog and LISP were two languages heavily favored by the AI community until the mid-90s. For most of this period, AI students were expected to actively master at least one, if not both. LISP and Prolog were particularly well suited to handling logic expressions, which were a core component of reasoning and expert systems.
For researchers who wanted to work on connectionist ideas, such as deep neural networks, no clearly suited language of choice existed until the emergence of MATLAB in 1992. Implementing connectionist networks in LISP or Prolog was cumbersome, and most researchers worked in low-level languages such as C++. It was only in the 2000s that a healthier ecosystem began to take root around software developed for deep neural-network approaches, with the emergence of LUSH and, subsequently, TORCH.
Machine-learning researchers mostly ignore hardware despite the role it plays in determining which ideas succeed.
Where there is a loser, there is also a winner. From the 1960s through the mid-1980s, most mainstream research focused on symbolic approaches to AI. Unlike deep neural networks, where learning an adequate representation is delegated to the model itself, symbolic approaches aimed to build up a knowledge base and use decision rules to replicate the ways in which humans would approach a problem. This was often codified as a sequence of ‘if-then’ logic statements that were well suited to LISP and Prolog.
Symbolic approaches to AI have yet to bear fruit, but the widespread and sustained popularity of this research direction for most of the second half of the 20th century cannot be seen as independent of how readily it fit into existing programming and hardware frameworks.
The Persistence of the Hardware Lottery
Today, there is renewed interest in collaboration between the hardware, software, and machine-learning communities. We are experiencing a second pendulum swing back to specialized hardware. Catalysts include changing hardware economics, prompted by both the end of Moore’s law and the breakdown of Dennard scaling; a “bigger is better” race in the number of model parameters;1 spiraling energy costs;20 and the dizzying requirements of deploying machine learning to edge devices.50
The end of Moore’s law means we are not guaranteed more computing power and performance; hardware will have to earn it. To improve efficiency, there is a shift from task-agnostic hardware, such as CPUs, to domain-specialized hardware that tailors the design to make certain tasks more efficient. The first examples of domain-specialized hardware released over the last few years—tensor processing units (TPUs),23 edge-TPUs,16 and Arm Cortex-M552—optimize explicitly for costly operations common to deep neural networks, such as matrix multiplies.
In many ways, hardware is catching up to the present state of machine-learning research. Hardware is only economically viable if the lifetime of the use case is longer than three years.11 Betting on ideas that have longevity is a key consideration for hardware developers. Thus, co-design efforts have focused almost entirely on optimizing an older generation of models with known commercial use cases. For example, ‘matrix multiplies’ are a safe target to optimize because they are here to stay—anchored by the widespread use and adoption of deep neural networks in production systems. Allowing for unstructured sparsity and weight-specific quantization is also a safe strategy because there is wide consensus that these will enable higher compression levels.
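As a simple illustration of why these are considered safe targets, here is a minimal sketch (my own, assuming NumPy, and not drawn from any particular production system) of the two compression strategies named above: unstructured sparsity via magnitude pruning and per-tensor int8 weight quantization.

```python
# A toy illustration of unstructured sparsity and weight quantization applied
# to a dense weight matrix; both are widely used to compress trained networks.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512)).astype(np.float32)  # a dense weight matrix

# Unstructured sparsity: zero out the 90% of weights with smallest magnitude.
threshold = np.quantile(np.abs(W), 0.9)
W_sparse = np.where(np.abs(W) >= threshold, W, 0.0)

# Uniform int8 quantization: map floats to integers with a single scale factor.
scale = np.abs(W).max() / 127.0
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_dequant = W_int8.astype(np.float32) * scale  # what inference would compute with

print("fraction of zeroed weights:", float((W_sparse == 0).mean()))
print("max quantization error:", float(np.abs(W - W_dequant).max()))
```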
There is still the separate question of whether hardware innovation is versatile enough to unlock or keep pace with entirely new machine-learning research directions. This question is difficult to answer; the data points here are limited, and it is hard to model the counterfactual of whether a given idea would have succeeded on different hardware. However, despite this task’s inherent challenge, there is already compelling evidence that domain-specialized hardware makes it more costly for research ideas that stray outside of the mainstream to succeed.
The authors of a 2019 published paper titled, “Machine Learning Is Stuck in a Rut”3 consider the difficulty of training a new type of computer-vision architecture called capsule networks42 on domain-specialized hardware. Capsule networks include novel components, such as squashing operations and routing by agreement. These architectural choices aim to address key deficiencies in convolutional neural networks (a lack of rotational invariance and of spatial-hierarchy understanding) but stray from the typical architecture of neural networks. As a result, while capsule network operations can be implemented reasonably well on CPUs, performance falls off a cliff on accelerators such as GPUs and TPUs, which have been overly optimized for matrix multiplies.
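For intuition about why these operations sit awkwardly on matmul-centric accelerators, here is a minimal NumPy sketch (my simplification of the squashing and routing-by-agreement steps described in the capsule-networks literature, not the authors’ implementation). The per-capsule vector norms and the data-dependent routing loop are exactly the kind of small, irregular computation that does not reduce to one large, dense matrix multiply.

```python
# A simplified sketch of capsule-network building blocks (not an accelerator-
# ready implementation): a squashing nonlinearity plus routing by agreement.
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink capsule vectors to length < 1 while preserving their direction."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def route(u_hat, num_iters=3):
    """Routing by agreement over prediction vectors u_hat: (lower, upper, dim)."""
    b = np.zeros(u_hat.shape[:2])                              # routing logits
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # softmax over upper capsules
        v = squash((c[..., None] * u_hat).sum(axis=0))         # candidate upper capsules
        b = b + (u_hat * v[None]).sum(axis=-1)                 # update by agreement
    return v

u_hat = np.random.default_rng(0).normal(size=(32, 10, 16))     # 32 lower, 10 upper capsules
print(route(u_hat).shape)                                      # (10, 16)
```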
Whether or not you agree that capsule networks are the future of computer vision, the authors say something interesting about the difficulty of trying to train a new type of image-classification architecture on domain-specialized hardware. Hardware design has prioritized delivering on commercial use cases, while built-in flexibility to accommodate the next generation of research ideas remains a distant secondary consideration.
While hardware specialization makes deep neural networks more efficient, it also makes it far more costly to stray from accepted building blocks. This prompts the questions: How much will researchers implicitly overfit to ideas that operationalize well on available hardware, rather than take a risk on ideas that are not currently feasible? Which of today’s apparent failures would we recognize as successes if we had the hardware to run them?
The Likelihood of Future Hardware Lotteries
There is an ongoing, open debate within the machine-learning community about how much future algorithms will differ from models such as deep neural networks. The risk you attach to depending on domain-specialized hardware is tied to your position on this debate. Betting heavily on specialized hardware makes sense if you think that future breakthroughs depend on pairing deep neural networks with ever-increasing amounts of data and computation.
Several major research labs are making this bet, engaging in a “bigger is better” race in the number of model parameters and collecting ever-more-expansive datasets. However, it is unclear whether this is sustainable. An algorithm’s scalability is often thought of as the performance gradient relative to the available resources. Given more resources, how does performance increase?
For many subfields, we are now in a regime where the rate of return for additional parameters is decreasing.46 The cost of throwing additional parameters at a problem is becoming painfully obvious. Perhaps more troubling is how far away we are from the type of intelligence humans demonstrate. Despite their complexity, human brains remain extremely energy efficient. While deep neural networks may be scalable, it may be prohibitively expensive to scale them to a regime of intelligence comparable to humans. An apt metaphor is that we appear to be trying to build a ladder to the moon.
Biological examples of intelligence differ from deep neural networks in enough ways to suggest it is a risky bet to say that deep neural networks are the only way forward. While general-purpose algorithms such as deep neural networks rely on global updates to learn a useful representation, our brains do not. Our own intelligence relies on decentralized local updates which surface a global signal in ways that are not well understood.5
In addition, our brains can learn efficient representations from far fewer labeled examples than deep neural networks (Figure 2). Humans have highly optimized and specific pathways developed in our biological hardware for different tasks.49 This suggests that the way a network is organized and its inductive biases may be as important as its overall size.18
Figure 2. Our own cognitive intelligence is inextricably both hardware and algorithm. We do not inhabit multiple brains over our lifetime.
For example, it is easy for a human to walk and talk at the same time. However, it is far more cognitively taxing to attempt to read and talk.44 Our brains are able to fine-tune and retain skills across our lifetimes.4 In contrast, deep neural networks that are trained on new data often evidence catastrophic forgetting, where performance deteriorates on the original task because the new information interferes with previously learned behavior.30
There are several highly inefficient assumptions about how we train models. For example, during typical training, the entire model is activated for every example, leading to a quadratic explosion in training costs. In contrast, the brain does not perform a full forward and backward pass for all inputs; it simulates what inputs are expected against incoming sensory data. What we see is largely virtual reality computed from memory.6
The point of these examples is not to convince you that deep neural networks are not the way forward but, rather, that there are clearly other models of intelligence, which suggests deep neural networks may not be the only way. It is possible that the next breakthrough will require a fundamentally different way of modeling the world with a different combination of hardware, software, and algorithm. We may very well be amid a present-day hardware lottery.
The Way Forward
Scientific progress occurs when there is a confluence of factors that allows scientists to overcome the “stickiness” of the existing paradigm. The speed at which paradigm shifts have happened in AI research has been disproportionately determined by the degree of alignment between hardware, software, and algorithm. Thus, any attempt to avoid hardware lotteries must be concerned with making the exploration of different hardware/software/algorithm combinations cheaper and less time-consuming.
This is easier said than done. Expanding the search space of possible hardware/software/algorithm combinations is a formidable goal. It is expensive to explore new types of hardware, both in terms of time and capital required. Producing a next-generation chip typically costs $30–$80 million and takes two to three years to develop.14 The fixed costs alone of building a manufacturing plant are enormous, estimated at $7 billion in 2017.45
Experiments using reinforcement learning to optimize chip placement (Figure 3) may help decrease costs.31 There is also renewed interest in reconfigurable hardware, such as field-programmable gate arrays (FPGAs)17 and coarse-grained reconfigurable arrays (CGRAs).37 These devices allow chip logic to be reconfigured to avoid being locked into a single use case. However, the tradeoff for this flexibility is lower peak performance (FLOPS) and the need for tailored software development. Coding even simple algorithms on FPGAs remains very painful and time-consuming.41
Figure 3. Hardware design remains risk averse due to the large amount of capital and time required to fabricate each new generation of hardware.
Hardware development in the short- to medium-term is likely to remain expensive and prolonged. The cost of producing hardware is important because it determines the amount of risk and experimentation hardware developers are willing to tolerate. Investment in hardware tailored to deep neural networks is assured because neural networks are a cornerstone of enough commercial use cases. The widespread profitability of downstream uses of deep learning has spurred a healthy ecosystem of hardware startups aiming to further accelerate deep neural networks and has encouraged large companies to develop custom hardware in-house.
The bottleneck will continue to be funding hardware for use cases that are not immediately commercially viable. These riskier directions include biological hardware, analog hardware with in-memory computation, neuromorphic computing, optical computing, and quantum computing-based approaches. There are also high-risk efforts to explore the development of transistors using new materials.
An interim goal is to provide better feedback loops to researchers about how our algorithms interact with the hardware we do have. Machine-learning researchers do not spend much time talking about how hardware chooses which ideas succeed and which fail, in large part because hardware’s influence is hard to quantify. At present, there are no easy, cheap-to-use interfaces to benchmark algorithm performance against multiple types of hardware at once. There are frustrating differences in the subset of software operations supported on different types of hardware, which prevents the portability of algorithms across hardware types.21 Software kernels are often overly optimized for a specific type of hardware, leading to huge lags in efficiency when used with different hardware.
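As a modest example of the kind of feedback loop that would help, here is a minimal sketch (my illustration, assuming a machine with JAX installed) that times the same jitted operation on every device the framework exposes. It falls far short of the multi-hardware benchmarking interfaces described above, but it at least makes the cost of a design choice on a given device visible.

```python
# A minimal sketch (assumes JAX is installed): time one operation on each
# device JAX can see, so hardware differences become visible to the researcher.
import time
import jax
import jax.numpy as jnp

def benchmark(fn, x, repeats=10):
    fn(x).block_until_ready()                # compile and warm up
    start = time.perf_counter()
    for _ in range(repeats):
        out = fn(x)
    out.block_until_ready()                  # wait for asynchronous execution
    return (time.perf_counter() - start) / repeats

x = jnp.ones((2048, 2048), dtype=jnp.float32)
matmul = jax.jit(lambda a: a @ a)

for device in jax.devices():                 # CPU, GPU, or TPU, whatever is present
    x_on_device = jax.device_put(x, device)  # commit the input to this device
    seconds = benchmark(matmul, x_on_device)
    print(f"{device}: {seconds * 1e3:.2f} ms per 2048x2048 matmul")
```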
These challenges are compounded by an ever-more-formidable and heterogeneous landscape of hardware.38 As the hardware landscape becomes increasingly fragmented and specialized, writing fast and efficient code will require more niche and specialized skills.28 This means that there will be increasingly uneven gains from progress in computer science research. While some types of hardware will benefit from a healthy software ecosystem, progress on other languages will be sporadic and often stymied by a lack of critical end users.45
One way to mitigate this need for specialized software expertise is through the development of domain-specific languages that focus on a narrow domain. While you give up expressive power, domain-specific languages permit greater portability across different types of hardware. They allow developers to focus on the intent of the code without worrying about implementation details.35 Another promising direction is autotuning, which automatically adjusts the algorithmic parameters of a program based upon the downstream choice of hardware. This facilitates easier deployment by tailoring the program to achieve good performance and load balancing on a variety of hardware.13
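To make the autotuning idea concrete, here is a minimal sketch (my own toy example, not any particular autotuning framework) that picks a single algorithmic parameter, the blocking size of a blocked matrix multiply, by measuring each candidate on the hardware at hand. Real autotuners search far larger parameter spaces, but the feedback loop is the same.

```python
# A toy autotuner: benchmark a few candidate block sizes and keep the fastest
# one for this particular machine.
import time
import numpy as np

def blocked_matmul(A, B, block):
    """Blocked matrix multiply; the best block size depends on the cache hierarchy."""
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(0, n, block):
        for j in range(0, n, block):
            for k in range(0, n, block):
                C[i:i+block, j:j+block] += A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
    return C

def autotune(candidates, n=1024):
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    timings = {}
    for block in candidates:
        start = time.perf_counter()
        blocked_matmul(A, B, block)
        timings[block] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    return best, timings

best, timings = autotune([64, 128, 256, 512])
print("best block size on this machine:", best, timings)
```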
In parallel, we need better profiling tools to empower researchers with a more informed opinion about how hardware and software should evolve. Ideally, software should surface recommendations about what type of hardware to use given the configuration of an algorithm. Registering what differs from our expectations remains a key catalyst in driving new scientific discoveries. Software needs to do more work, but it is also well positioned to do so. We have neglected efficient software throughout the era of Moore’s law, trusting that predictable gains in computing performance would compensate for inefficiencies in the software stack. This means there is plenty of low-hanging fruit as we begin to optimize for more efficient software.26
Conclusion
George Gilder, an American investor, powerfully described the computer chip as inscribing worlds on grains of sand. The performance of an algorithm is fundamentally intertwined with the hardware and software it runs on. This article proposes the term ‘hardware lottery’ to describe how these downstream choices determine whether a research idea succeeds or fails.
Today the hardware landscape is increasingly heterogeneous. This article posits that the hardware lottery has not gone away, and the gap between the winners and losers will grow. To avoid future hardware lotteries, we need to make it easier to quantify the opportunity cost of settling for the hardware and software we have.
Acknowledgments
Thank you to many of my wonderful colleagues and peers who took time to provide valuable feedback on earlier versions of this contributed article. In particular, I would like to acknowledge the valuable input of Utku Evci, Erich Elsen, Melissa Fabros, Amanda Su, Simon Kornblith, Aaron Courville, Hugo Larochelle, Cliff Young, Eric Jang, Sean McPherson, Jonathan Frankle, Carles Gelada, David Ha, Brian Spiering, Stephanie Sher, Jonathan Binas, Pete Warden, Lara Florescu, Jacques Pienaar, Chip Huyen, Raziel Alvarez, Dan Hurt, and Kevin Swersky. Thanks for the institutional support and encouragement of Natacha Mainville and Alexander Popper.