
Communications of the ACM

ACM TechNews

HPC Technique Propels Deep Learning at Scale


In the ring all-reduce OpenMPI algorithm, all GPUs send data simultaneously.


Credit: HPCwire

Baidu's Silicon Valley Artificial Intelligence Lab (SVAIL) has released a modified implementation of the ring all-reduce OpenMPI algorithm for the deep-learning community, which will enable faster training of neural networks across graphics processing unit (GPU) nodes.

Unlike the OpenMPI version, the SVAIL modification avoids making extraneous copies between the central processing unit (CPU) and the GPU.

Although commonplace in high-performance computing, the technique has been underused within AI and deep learning, according to Baidu. Compared with using a single GPU, the ring all-reduce algorithm is about 31 times faster at 40 GPUs.
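The communication pattern behind the speedup can be illustrated with a minimal single-process simulation. This is an illustrative sketch only, not Baidu's released OpenMPI/CUDA code; the function name `ring_allreduce` and the list-based "nodes" are hypothetical. Each of n simulated GPUs splits its gradient vector into n chunks; a scatter-reduce phase accumulates partial sums around the ring, then an all-gather phase circulates the finished chunks.

```python
def ring_allreduce(grads):
    """Simulate ring all-reduce; grads[i] is node i's gradient vector.

    Assumes len(grads[0]) is divisible by the number of nodes.
    Returns the per-node buffers, each equal to the elementwise sum.
    """
    n = len(grads)
    chunks = [list(g) for g in grads]     # working copy of each node's buffer
    size = len(grads[0]) // n
    span = lambda c: slice(c * size, (c + 1) * size)  # index range of chunk c

    # Phase 1: scatter-reduce. At step s, node i passes chunk (i - s) % n to
    # its right neighbour, which adds it to its own copy. After n - 1 steps,
    # node i holds the fully reduced chunk (i + 1) % n.
    for s in range(n - 1):
        for i in range(n):
            c = span((i - s) % n)
            right = (i + 1) % n
            chunks[right][c] = [a + b for a, b in zip(chunks[right][c],
                                                      chunks[i][c])]

    # Phase 2: all-gather. The finished chunks circulate the ring; at step s,
    # node i forwards chunk (i + 1 - s) % n, which its neighbour overwrites.
    for s in range(n - 1):
        for i in range(n):
            c = span((i + 1 - s) % n)
            right = (i + 1) % n
            chunks[right][c] = chunks[i][c][:]

    return chunks


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    result = ring_allreduce(grads)
    print(result[0])  # → [28, 32, 36, 40], the elementwise sum on every node
```

Each node transmits roughly 2(n - 1)/n of the full buffer in total, which stays nearly constant as n grows; that bandwidth-optimal property is what allows every GPU to send simultaneously and the scheme to scale to many nodes.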

The algorithm has enabled the SVAIL team to get linear GPU scaling up to 128 GPUs and to parallelize the training of Deep Speech 2, its speech-recognition model.

Two years after the approach was initially developed, the researchers have issued two non-proprietary implementations, one for TensorFlow and one for more general applications.

From HPCwire

 

Abstracts Copyright © 2017 Information Inc., Bethesda, Maryland, USA


 

