The following paper presents a research deployment of Field Programmable Gate Arrays (FPGAs) in a Microsoft Bing datacenter. The FPGAs enabled more efficient search processing directly at the digital logic level. This pioneering work is the first to demonstrate, at scale, that FPGAs can serve as an effective first-class component in cloud computing. Following this landmark effort, Microsoft has launched full-scale production deployment of FPGA accelerators in its new datacenter servers for a range of cloud services.
For most of its 30-year existence, FPGA technology has primarily served as an alternative to application-specific integrated circuits (ASICs), with only niche applications in computing. Today, besides Microsoft's activities, Intel and IBM are also adding FPGAs as programmable computing substrates to their product lines. At the root of the computing industry's current embrace of FPGAs is the same Power Wall struggle that brought about the transition to multicore microprocessors in the last decade.
In the decades prior, single-threaded microprocessors enjoyed a regular doubling of compute performance with each new Very Large Scale Integration (VLSI) scaling generation, taking advantage of the more numerous and faster transistors. However, each new generation of faster microprocessors also required more power. The Power Wall is less about supplying power than about removing the resulting heat fast enough. Given upper bounds on the cost, weight, size, and noise of the cooling apparatus, there is a limit to how fast heat can be extracted from a microprocessor die. Microprocessors remained well below market-set economical cooling limits until the 1990s. Despite the best concerted efforts, from software down to materials science, to rein in the power increase in the ensuing years, single-threaded microprocessors ran out of cooling headroom to sustain their performance improvements by the middle of the last decade.
In the present power-constrained design regime, whether the constraint is set by the cooling and packaging of a microprocessor, by the air-handling capacity of a datacenter, or, in battery-powered devices, by the supply side itself, getting more performance ("operations per second") requires a solution that can somehow expend less "energy per operation."
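To make the units concrete, here is a minimal back-of-the-envelope sketch in Python. The power budget and per-operation energy are made-up illustrative values, not figures from the paper; the point is only that watts are joules per second, so a fixed power budget caps throughput at the budget divided by the energy per operation.

```python
# Throughput under a fixed power budget is bounded by
#   ops/sec = (power budget in watts) / (energy per operation in joules),
# since a watt is one joule per second. All numbers below are
# assumptions chosen for illustration.

power_budget_w = 25.0      # watts the package can cool (assumed)
energy_per_op_j = 0.5e-9   # 0.5 nJ per operation (assumed)

ops_per_sec = power_budget_w / energy_per_op_j
print(f"{ops_per_sec:.2e} ops/sec")  # 5.00e+10, i.e., 50 GOPS
```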
Parallelism provided the first solution to running faster while using less energy in each step. As a rule of thumb, increasing the performance of a sequential task costs disproportionately more power. Therefore, when parallelism is available, one can use parallel execution to reduce power instead of for speedup. As an illustrative example, given a design A with throughput Throughput_A at power Power_A, and a lower-performing design B with Throughput_B = Throughput_A/N and Power_B < Power_A/N, we can use N copies of B to complete the same number of tasks per second as A at less power, or more than N copies of B to exceed the throughput of A within the same power budget Power_A. Both multicore microprocessors and GPGPUs apply this principle. FPGA computing takes it to a further extreme, delivering high overall performance using a sea of individually slow processing elements that are very energy efficient per operation. This route to energy efficiency is, of course, only applicable when the computing task of interest is amenable to a high degree of parallelization.
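The arithmetic behind the N-copies argument can be sketched with a simple cube-law model, in which a core's dynamic power grows roughly with the cube of its sequential speed under combined voltage and frequency scaling. The model and the numbers below are illustrative assumptions, not measurements from the paper.

```python
# Assumed rule-of-thumb model: a core's power grows roughly with the
# cube of its sequential throughput (voltage/frequency scaling), so a
# slower core is disproportionately cheaper in power.

def core_power(throughput: float) -> float:
    """Power (arbitrary units) of one core at the given throughput."""
    return throughput ** 3

# Design A: one fast core.
throughput_a = 1.0
power_a = core_power(throughput_a)    # 1.0 power unit

# Design B: a core running N times slower.
N = 4
throughput_b = throughput_a / N       # 0.25
power_b = core_power(throughput_b)    # 0.015625, well below power_a / N

# N copies of B match A's throughput at a fraction of the power...
print(N * throughput_b, N * power_b)  # 1.0 ops/sec at 0.0625 power units

# ...and spending A's whole budget on copies of B multiplies throughput.
copies = int(power_a / power_b)       # 64 copies fit in A's budget
print(copies * throughput_b)          # 16.0 ops/sec at power_a
```

Under this assumed model, Power_B = Power_A/N^3, which comfortably satisfies the Power_B < Power_A/N condition above.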
Relative to microprocessors, FPGA computing has another source of energy efficiency. The majority of the power in a microprocessor is consumed by the overhead of presenting the simplifying von Neumann abstraction to the programmer and of ensuring good performance across a wide range of program behaviors. FPGA computing avoids these abstraction overheads and at the same time gains unrestricted optimization freedom, in particular the ability to exploit all forms and granularities of parallelism in a task. In return, FPGA computing places a great burden on application developers, who must design at a low level of abstraction to meet functional and performance objectives. How to simplify application development, while retaining the FPGA's efficiency advantage, is an important and challenging problem.
Mapping computations to ASICs offers the same benefits as mapping them to FPGAs. In fact, an ASIC implementation can be significantly more energy efficient than an FPGA implementation because of the overhead of making the FPGA's logic substrate reprogrammable. But reprogrammability is very important to computing. As argued in the following paper, beyond the utility of repurposing a hardware investment over its multiyear ownership, a particular accelerated task of interest can evolve too quickly to be committed to silicon on ASIC development timescales and budgets.
Finally, it should be noted that FPGA computing is not the answer in every scenario or across all trade-offs. Ultimately, the quest for efficiency is a matter of using the right tool for the job.
Driven by the conflicting demands for more performance and better energy efficiency, parallel computing in the form of multicore microprocessors exploded into the commercial mainstream seemingly overnight in the last decade. With the Catapult effort, perhaps we are again seeing the start of something equally transformative in the pursuit of still higher performance and energy efficiency. As we watch the current exciting developments unfold across the computing industry, we should also recognize the multiple decades of prior work that led up to this pivotal time.