Communications of the ACM

News

Making Chips Smarter


Google's Tensor Processing Unit is a custom circuit designed specifically for AI applications such as speech processing and street-view mapping and navigation.

It is no secret that artificial intelligence (AI) and machine learning have advanced radically over the last decade, yet somewhere between better algorithms and faster processors lies the increasingly important task of engineering systems for maximum performance and producing better results.

The problem for now, says Nidhi Chappell, director of machine learning in the Datacenter Group at Intel, is that "AI experts spend far too much time preprocessing code and data, iterating on models and parameters, waiting for training to converge, and experimenting with deployment models. Each step along the way is either too labor- and/or compute-intensive."

The research and development community, spearheaded by companies such as Nvidia, Microsoft, Baidu, Google, Facebook, Amazon, and Intel, is now taking direct aim at the challenge. Teams are experimenting, developing, and even implementing new chip designs, interconnects, and systems to boldly go where AI, deep learning, and machine learning have not gone before. Over the next few years, these developments could have a major impact, even a revolutionary effect, on an array of fields: automated driving, drug discovery, personalized medicine, intelligent assistants, robotics, big data analytics, computer security, and much more. They could deliver faster and better processing for important tasks related to speech, vision, and contextual searching.

Specialized chips can significantly increase performance for fixed-function workloads, because they include everything needed specifically for the task at hand and nothing more. Yet, the task is not without its challenges.

For one thing, there's no clear idea about how to use silicon to accelerate AI. Most chip designs and systems are still in the early stages of research, development, or deployment.

For another, there's no single design, approach, or method that works well for every situation or AI-based framework.

One thing that is perfectly clear: AI and machine learning frameworks are advancing rapidly. Says Eric Chung, a researcher at Microsoft Research: "We're seeing an escalating, insatiable demand for this kind of technology."

Beyond the GPU

The quest for faster and better processing in AI is nothing new. In recent years, graphics processing units (GPUs) have become the technology of choice for running the neural networks that underpin AI, deep learning, and machine learning. The reason is simple, even if the underlying technology is complex: GPUs, which were originally invented to improve graphics processing on computers, execute specific tasks faster than conventional central processing units (CPUs). Yet, a specialized design is not ideal for every application or situation. For instance, a search engine such as Bing or Google has very different requirements than the speech processing used on a smartphone, or the visual processing that takes place in an automated vehicle or in the cloud. To varying degrees, systems must support both training and delivering real-time information and controls.

In the quest to boost performance in these systems, designers and engineers are leaving no idea unexamined. However, all the research revolves around a key goal: "Specialized AI chips will deliver better performance than either CPUs or GPUs. This will undoubtedly shift the AI compute [framework] moving forward," Chappell explains. In the real world, these boutique chips would greatly reduce training requirements in neural networks, in some cases from days or weeks to hours or minutes. This has the potential to not only improve performance but also slash costs for companies developing AI, deep learning, and machine learning systems. The result would be faster and better visual recognition in automated vehicles, or the ability to reprocess millions of scans for potentially missed markers in healthcare or pharma.

The focus on boutique chips and better AI computation is leading researchers down several avenues. These include improvements in GPUs as well as work on other technologies such as field programmable gate arrays (FPGAs), Tensor Processing Units (TPUs), and other chip systems and architectures that match specific AI and machine learning requirements. These initiatives, says Bryan Catanzaro, vice president of Applied Deep Learning Research at Nvidia, point in the same general direction: "The objective is to build computation platforms that deliver the performance and energy efficiency needed to build AI with a level of accuracy that isn't possible today."

GPUs, for instance, already deliver superior processor-to-memory bandwidth, and they can be applied to many tasks and workloads in the AI arena, including visual and speech processing. The appeal of GPUs revolves around delivering more floating-point operations per second (FLOPS) while using fewer watts of electricity, and the ability to extend the energy advantage by supporting 16-bit floating point numbers, which are more power- and energy-efficient than single-precision (32-bit) or double-precision (64-bit) floating point numbers. What is more, GPUs are quite scalable. The Nvidia Tesla P100 chip, which packs 15 billion transistors into a silicon chip, delivers extremely high throughput on AI workloads associated with deep learning.
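
To make the difference between the three formats concrete, here is a minimal Python sketch that stores one value at 16, 32, and 64 bits and reads it back. It relies only on the standard `struct` module; the sample value and the helper names `round_trip` and `bytes_per_value` are illustrative assumptions, not anything taken from Nvidia's hardware or software.

```python
import struct

# struct format codes for IEEE 754 floats:
# 'e' = half precision (16-bit), 'f' = single (32-bit), 'd' = double (64-bit)
FORMATS = {"half (16-bit)": "e", "single (32-bit)": "f", "double (64-bit)": "d"}


def round_trip(value: float, code: str) -> float:
    """Store value in the given format, then read it back as a Python float."""
    return struct.unpack("<" + code, struct.pack("<" + code, value))[0]


def bytes_per_value(code: str) -> int:
    """Number of bytes one value occupies in the given format."""
    return struct.calcsize("<" + code)


if __name__ == "__main__":
    weight = 0.1  # hypothetical network weight, chosen only for illustration
    for name, code in FORMATS.items():
        stored = round_trip(weight, code)
        error = abs(stored - weight)
        print(f"{name}: {bytes_per_value(code)} bytes, "
              f"stored as {stored!r}, error {error:.2e}")
```

A 16-bit value takes a quarter of the memory and bandwidth of a 64-bit one, at the cost of a larger rounding error. The sketch shows only the storage side of the trade-off; it says nothing about how much energy a particular chip saves.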

However, as Moore's Law reaches physical barriers, the technology must evolve further. For now, "There are a lot of ways to customize processor architectures for deep learning," Catanzaro says. Among these: improving efficiency on deep-learning-specific workloads, and better integration between throughput-oriented GPUs and latency-oriented CPUs. For instance, Nvidia has introduced a specialized server called DGX-1, which uses eight Tesla P100 processors to deliver 170 teraflops of compute for neural network training. The system also uses a fast interconnect between GPUs called NVLink, which the company claims allows up to 12 times faster data sharing than traditional PCIe interconnects.

"There is still an opportunity for considerable innovation in this space," he says.

New Models Emerge

Other approaches are also ushering in significant gains. For example, Google's Tensor Processing Unit (TPU) is a custom application-specific integrated circuit (ASIC) that is specifically designed for AI applications such as speech processing and street-view mapping and navigation. It has been used in Google's datacenters for more than 18 months. A big benefit is that the chip is optimized for reduced computational precision. This translates into fewer transistors per operation and the ability to squeeze more operations per second into the chip, which results in better-optimized performance per watt and an ability to use more sophisticated and powerful machine learning models while applying the results more quickly.

Another technology aimed at advancing AI and machine learning is Microsoft's Project Catapult, which uses field programmable gate arrays (FPGAs) that underpin the widely used Bing search engine, as well as the Azure cloud. This allows teams to implement algorithms directly onto hardware, rather than potentially less-efficient software. Chung says the FPGAs' performance exceeds that of CPUs while retaining flexibility and allowing production systems to operate at hyperscale. He describes the technology as "programmable silicon."

To be sure, energy-efficient FPGAs satisfy an important requirement when deploying accelerators at hyperscale in power-constrained datacenters. "The system delivers a scalable, uniform pool of resources independent from CPUs. For instance, our cloud allows us to allocate few or many FPGAs as a single hardware service," he explains. This, ultimately, allows Microsoft to "scale up models seamlessly to a large number of nodes. The result is extremely high throughput with very low latency."

FPGAs are, in fact, highly flexible chips that achieve higher performance and better energy efficiency with reduced numerical precision. "Each computational operation gets more efficient on the FPGA with the fewer bits you use," Chung explains. The current generation of these Intel chips, known as Stratix V FPGAs, will evolve into more advanced versions, including Arria 10 and Stratix 10, he notes. They will introduce additional speed and efficiencies.
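
The trade-off behind "the fewer bits you use" can be sketched in a few lines of Python. The example below maps a handful of floating-point weights onto signed integers of a chosen bit width and measures how much accuracy is lost. The uniform, symmetric scheme, the sample weights, and the function names are all assumptions made for illustration; the article does not describe the number formats Microsoft actually uses on its FPGAs.

```python
def quantize(values, bits):
    """Map floats to signed integer codes of the given bit width.

    Uses a uniform, symmetric scheme: the largest magnitude in `values`
    is mapped to the largest representable code.
    """
    levels = 2 ** (bits - 1) - 1  # for example, 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / levels if max_abs else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Convert integer codes back to approximate floats."""
    return [c * scale for c in codes]


def max_error(values, bits):
    """Largest absolute difference between a value and its quantized form."""
    codes, scale = quantize(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.75]  # hypothetical weights
    for bits in (16, 8, 4):
        print(f"{bits:2d} bits: max error {max_error(weights, bits):.5f}")
```

Each step down in bit width shrinks the storage and arithmetic needed per value but increases the rounding error, which is the balance a designer has to strike when choosing a precision.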

"With the technology, we can build custom pipelines that are tailored to specific algorithms and models," Chung says. In fact, Microsoft has reached a point where developers can deploy models rapidly, and without underlying technical expertise about the machine learning framework. "The appeal is the high level of flexibility. It can be reprogrammed for different AI models and tasks," Chung notes. In fact, the FPGAs can be reprogrammed on the fly to respond to advances in artificial intelligence or different datacenter requirements. A process that previously could take two years or more can now take place in minutes.

Finally, Intel is introducing Nervana, a technology that aims to "deliver unprecedented compute density and high bandwidth interconnect for seamless model parallelism," Chappell says. The technology will focus primarily on multipliers and local memory, and skip elements such as caches that are required for graphics processing but not for deep learning. It also features isolated pipelines for computation and data management, as well as High Bandwidth Memory (HBM) to accelerate data movement. Nervana, which Intel expects to introduce during the first half of this year, will "deliver sustained performance near theoretical maximum throughput," Chappell adds. It also includes 12 bidirectional high-bandwidth links, enabling multiple interconnected engines for seamless scalability, a key requirement for increased performance through scale.

Into the Future

An intriguing aspect of emerging chip designs for AI, deep learning, and machine learning is the fact that low-precision chip designs increasingly prevail. In many cases, reduced-precision processors conform better to neuromorphic compute platforms and accelerate the deployment and possibly training of deep learning algorithms. Simply put: they can produce similar results while consuming less power, in some cases by a factor of 100. While algorithms running on today's digital processors require high numerical precision, the same algorithms operating on low precision chips in a neural net excel, because these systems adapt dynamically by examining data in a more relational and contextual way (and are less sensitive to rounding errors).
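
A toy example hints at why rounding errors often matter little to a classifier. The Python sketch below computes class scores for a tiny, hand-picked linear model twice: once with ordinary double-precision weights and once with the weights rounded to 16-bit floats. The weights, inputs, and function names are invented for illustration, and a single example like this is not evidence for the general claim.

```python
import struct


def to_half(x: float) -> float:
    """Round a value to IEEE 754 half precision (16-bit) and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def scores(weights, inputs):
    """One score per class: the dot product of each weight row with the inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def predict(weights, inputs):
    """Index of the highest-scoring class."""
    s = scores(weights, inputs)
    return s.index(max(s))


# Hypothetical three-class model and one input vector.
WEIGHTS = [
    [0.12, -0.53, 0.98],
    [0.71, 0.33, -0.25],
    [-0.44, 0.86, 0.10],
]
INPUTS = [0.9, 0.2, 0.4]

# The same model with every weight rounded to 16 bits.
HALF_WEIGHTS = [[to_half(w) for w in row] for row in WEIGHTS]

if __name__ == "__main__":
    print("full precision:", scores(WEIGHTS, INPUTS), "->", predict(WEIGHTS, INPUTS))
    print("half precision:", scores(HALF_WEIGHTS, INPUTS), "->", predict(HALF_WEIGHTS, INPUTS))
```

The individual scores shift slightly after rounding, but the winning class stays the same because the gap between the scores is far larger than the rounding error.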

This makes the technology perfect for an array of machine learning tasks and technologies, including drones; automated vehicles; intelligent personal assistants such as Amazon's Alexa, Microsoft's Cortana, or Apple's Siri; photo and image recognition systems; and search engines, including general services like Bing and Google but also those used by retailers, online travel agencies, and others. It also supports advanced functionality like real-time speech-to-text transcription and language translations.

In the end, says Gregory Diamos, a senior researcher at Baidu, specialized machine learning chips have the potential to change the stakes and usher in an era of even greater breakthroughs. "Machine learning has already made tremendous progress," he says. "Specialized chips and systems will continue to close the gap between computers and human performance."

Further Reading

Caulfield, A., Chung, E., Putnam, A., Angepat, H., Fowers, J., Haselman, M., Heil, S., Humphrey, M., Kaur, P., Kim, J.Y., Lo, D., Massengill, T., Ovtcharov, K., Papamichael, M., Woods, L., Lanka, S., Chiou, D., and Burger, D.
A Cloud-Scale Acceleration Architecture, October 15, 2016. Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society. https://www.microsoft.com/en-us/research/publication/configurable-cloud-acceleration/

Samel, B., Mahajan, S., and Ingole, A.M.
GPU Computing and Its Applications, International Research Journal of Engineering and Technology (IRJET). Volume: 03 Issue: 04, Apr-2016. https://www.irjet.net/archives/V3/i4/IRJET-V3I4357.pdf

Shafiee, A., Nag, A., Muralimanohar, N., Balasubramonian, R., Strachan, J.P., Hu, M., Williams, S.R., and Srikumar, V.
ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 14-26, 2016, ISSN 1063-6897. http://ieeexplore.ieee.org/document/7446049/citations

Shirahata, K., Tomita, Y., and Ike, A.
Memory reduction method for deep neural network training, 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 2016, pp. 1-6. doi: 10.1109/MLSP.2016.7738869. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7738869&isnumber=7738802

Author

Samuel Greengard is an author and journalist based in West Linn, OR.

Figures

Figure. The design of the NVIDIA NVLink Hybrid Cube Mesh, which connects eight graphics processing units, each with 15 billion transistors.


©2017 ACM  0001-0782/17/05

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.


