High-performance computing (HPC) plays an important role in advancing scientific discovery, addressing grand-challenge problems, and promoting social and economic development. Over the past several decades, China has put significant effort into improving its own HPC through a series of key projects under its national research and development program. Development of supercomputing systems has advanced parallel applications in various fields in China, along with related software and hardware technology, and has contributed to China's technological innovation and social development.
To meet the requirements of multidisciplinary and multidomain applications, new challenges in architecture, system software, and application technologies must be addressed to help develop next-generation exascale supercomputing systems.
The first supercomputer developed in China was Yinhe-I in 1983, with 100MFlops peak performance, by the National University of Defense Technology (NUDT). China has since continued its supercomputer development.
Three major teams in China (Tianhe, Sunway, and Sugon, comparable to IBM, Cray, and Intel in the U.S.) have developed a series of domestic supercomputing systems, including Dawning 4000A (2005, 11.2TFlops); Tianhe-1A (2011, 4.7PFlops, number one in the TOP500); Sunway BlueLight (2011, 1PFlops); Tianhe-2 (2013, number one in the TOP500 six times); and Sunway TaihuLight (2016, number one in the TOP500 four times). Chinese supercomputers have adopted multiple architectures, including vector, SMP, ccNUMA, MPP, cluster, heterogeneous-accelerated, and many-core. Their developers have thus acquired rich knowledge of supercomputing hardware and software and trained a large number of engineers along the way.
In the years since the Yinhe-I system in 1983, China has achieved the leading position in supercomputer development worldwide. For example, Tianhe-2 and Sunway TaihuLight held the top position in the TOP500 from 2013 to 2017. At the same time, the number of HPC systems in China increased dramatically, exceeding the number of HPC systems in the U.S. as of the June 2018 TOP500 ranking. Chinese HPC manufacturers Lenovo, Sugon, Inspur, and others have also claimed significant shares of the market for HPC systems and high-end servers.
China's leading-class supercomputer systems. Two 100PF computers—Tianhe-2 and Sunway TaihuLight—were developed with support from the National High-Tech R&D Program in the country's 12th Five Year Plan. The first stage of Tianhe-2 was completed in early 2013, delivering peak performance of 55 petaflops and Linpack performance of 33.9 petaflops. It was a hybrid system consisting of Intel Xeon processors and Xeon Phi accelerators and claimed the top position in the TOP500 six consecutive times, from June 2013 to November 2015. In 2017, NUDT deployed the new 128-core Matrix 2000 processor and applied it to upgrading Tianhe-2. The upgraded system, called Tianhe-2A, delivered 100.68 petaflops peak performance and approximately 61 petaflops Linpack performance.
The second 100 petaflops system is the many-core-based Sunway TaihuLight that delivers 125 petaflops peak performance and 93 petaflops Linpack performance. Sunway TaihuLight was implemented with the homegrown many-core processor SW26010—a 3Tflops chip with 260 cores—ranking in first place four times, from 2016 to 2017, in the TOP500. Key technologies adopted by the TaihuLight system include highly scalable heterogeneous architecture, high-density system integration, a high-bandwidth multi-level network, highly efficient DC power supply, and customized water cooling. It also represents a new milestone in China's HPC history for being a 100PF system implemented completely with homegrown processors.
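The peak and Linpack figures quoted above for the two 100PF-class systems imply a Linpack efficiency for each. The following is a minimal sketch of that arithmetic; the dictionary layout and the function name are illustrative choices, and the Tianhe-2A Linpack value is only "approximately 61" in the text, so its result is approximate as well.

```python
# Peak and Linpack performance in petaflops, as quoted in the text.
systems = {
    "Tianhe-2": (55.0, 33.9),
    "Tianhe-2A": (100.68, 61.0),      # Linpack given as "approximately 61"
    "Sunway TaihuLight": (125.0, 93.0),
}


def linpack_efficiency(peak_pf, linpack_pf):
    """Fraction of theoretical peak sustained on the Linpack benchmark."""
    return linpack_pf / peak_pf


efficiencies = {name: linpack_efficiency(*pf) for name, pf in systems.items()}

for name, eff in efficiencies.items():
    print(f"{name}: {eff:.1%} of peak")
```

On these numbers, the two accelerator-based Tianhe configurations sustain a little over 60% of peak, while the many-core TaihuLight sustains roughly three-quarters.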
The two 100PF systems have been installed at national supercomputing centers: Tianhe-2A at the National Supercomputing Center in Guangzhou and Sunway TaihuLight at the National Supercomputing Center in Wuxi. Moreover, extra effort has gone toward increasing the user population and developing applications.
Efficient HPC software stack. With development of domestic supercomputing systems, China has established a self-controllable system software stack covering basic drivers, operating system, compilers, communication software, basic library, parallel programming environment, parallel file system, resource management, and scheduling system, thus providing a comprehensive capability for large-scale system construction and performance tuning.
The Tianhe-2 software stack consists of four components: a system environment, an application-development environment, a runtime environment, and a management environment. The system environment consists of the 64-bit Kylin OS and H2FS parallel file system and a resource-management system. Various job-scheduling policies and resource-allocation strategies have been implemented so system throughput and resource utilization can be enhanced. The application-development environment supports multiple programming languages, including C, C++, Fortran 77/90/95, a heterogeneous programming model called OpenMC, and the traditional OpenMP and MPI programming models. The THMPI is an updated version based on MPICH over the Tianhe Net communication protocols, able to deliver 12GB/s P2P bandwidth at the user level. The runtime environment consists of the parallel numerical toolkit for scientific applications, a scientific data-visualization system, and an HPC application service and cloud-computing platform. It provides runtime support to multiple fields, including scientific and engineering computing, big data processing, and high-throughput information service.
To support both HPC and big data applications, the Sunway TaihuLight also includes highly efficient scheduling and management tools and a rich set of parallel programming languages and development environments for application research and development. A two-level "MPI+X" approach helps devise the right parallelization scheme for mapping the target application onto the processes and threads that utilize more than 10 million of the system's cores. The 260-core SW26010 processor consists of four core groups (CGs), with each CG including one management processing element (MPE) and one computing processing element (CPE) cluster with eight-by-eight CPEs. Each CG usually corresponds to one MPI process. Within each CG, the system has two options: one is Sunway OpenACC, a customized parallel compilation tool that supports OpenACC 2.0 syntax targeting the CPE cluster; the other is a high-performance yet lightweight thread library called Athread that exploits fine-grain parallelism. With a byte-to-flop ratio five to 10 times lower than that of other top-five systems, the system needs extraordinary memory-related innovation to deal with the memory wall and scale its simulation capability to its 125 Pflops computing performance. It also needs software migration to such an architecture, with radical changes in both compute and memory hierarchy. For each CPE, instead of hardware L1 cache, the system includes user-controlled 64-KB local data memory (LDM) that completely changes the memory perspective for programmers.
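The core-group layout and the user-managed LDM described above can be illustrated with a short sketch. It first reproduces the 260-core count, then shows the kind of explicit tiling a programmer must do when there is no hardware cache. This is plain Python, not Athread or Sunway OpenACC code; the function names are hypothetical, the slices merely stand in for DMA transfers, and the sketch assumes the whole 64KB is available for data buffers, which the text does not state.

```python
CORE_GROUPS = 4
MPES_PER_CG = 1
CPES_PER_CG = 8 * 8
cores_per_chip = CORE_GROUPS * (MPES_PER_CG + CPES_PER_CG)  # 4 * 65 = 260

LDM_BYTES = 64 * 1024   # user-controlled local data memory per CPE
DOUBLE_BYTES = 8


def tile_length(arrays_in_flight, ldm_bytes=LDM_BYTES):
    """Doubles per buffer when `arrays_in_flight` buffers share the LDM.

    Assumption: the entire LDM is usable for data buffers.
    """
    return ldm_bytes // (DOUBLE_BYTES * arrays_in_flight)


def scaled_add_tiled(a, x, y):
    """Compute a*x + y one LDM-sized tile at a time."""
    n = len(x)
    tile = tile_length(arrays_in_flight=3)   # x tile, y tile, result tile
    out = [0.0] * n
    for start in range(0, n, tile):
        stop = min(start + tile, n)
        x_ldm = x[start:stop]                # stands in for a DMA get
        y_ldm = y[start:stop]                # stands in for a DMA get
        r_ldm = [a * xi + yi for xi, yi in zip(x_ldm, y_ldm)]
        out[start:stop] = r_ldm              # stands in for a DMA put
    return out


result = scaled_add_tiled(2.0, [1.0] * 6000, [3.0] * 6000)
print(cores_per_chip, tile_length(3), result[:3])
```

The point of the sketch is the loop structure: with three buffers in flight, each holds at most 2,730 doubles, so any array larger than that must be streamed through the LDM in pieces chosen by the programmer rather than by a cache.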
Effect on HPC industry. The rapid development of China's supercomputing systems has benefited from the continuous support of several national five-year plans, as well as the country's economic development and national strength. The systems support scientific research, technological breakthroughs, and an industrial revolution while promoting development and expansion of the IT server sector. Vendors Inspur, Lenovo, Huawei, and others have taken advantage of research and development of domestic HPC kernel technologies, including systems integration, storage architecture, interconnection technology, optimization techniques, system testing and benchmarks, and application technologies. At the same time, a large number of HPC hardware and software engineers have been trained for IT companies in China, including Alibaba, Baidu, and Tencent. The Chinese IT industry has also benefited from the technological innovation resulting from supercomputer development, including high-performance clusters, HPC-enabled cloud computing, distributed computing, and application optimization.
Along with the rapid development of hardware systems, major breakthroughs have been made in application development based on the new supercomputers, covering both traditional HPC domains like climate, seismology, computational fluid dynamics, and fusion and relatively new applications like big data and artificial intelligence (AI).
Atmospheric modeling. Large-scale simulation of the global atmosphere is one of the most computationally challenging problems in scientific computing. The Tianhe-1A hybrid CPU-GPU system launched a continuous development effort toward highly scalable atmospheric dynamic solvers on heterogeneous supercomputers, achieving sustained double-precision performance of 581 Tflops on Tianhe-1 by efficiently using the CPU and GPU resources on 3,750 nodes. The work was later extended to the Tianhe-2 system, scaling to the 8,644 nodes of Tianhe-2, achieving 3.74 Pflops performance with CPUs and MICs.
In 2016, the solver effort migrated to the Sunway TaihuLight supercomputer, with a highly scalable fully implicit solver for cloud-resolving atmospheric simulations. The solver supports fully implicit simulations with large time steps at extreme-scale resolutions and encapsulates novel domain decomposition, multigrid, and ILU factorization algorithms for massively parallel computing. With both algorithmic and optimization innovations, the solver scales to 10.5-million heterogeneous cores on Sunway TaihuLight at an unprecedented 488-m resolution with 770-billion unknowns, sustaining 7.95 PFLOPS performance in double-precision with 0.07 simulated-years-per-day. Considered a major breakthrough, it won the ACM Gordon Bell Prize in 2016, the first time in 29 years Chinese researchers were so recognized.
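A few derived quantities help put the solver figures above in perspective. The sketch below is simple arithmetic on the numbers quoted in the text; it assumes a 365-day year and compares the sustained rate against the full-machine peak of 125 petaflops quoted earlier, which is reasonable only because the run used nearly all of the machine's cores.

```python
DAYS_PER_YEAR = 365.0      # assumption: non-leap calendar year

sustained_pf = 7.95        # sustained double-precision performance
peak_pf = 125.0            # TaihuLight peak, quoted earlier in the text
unknowns = 770e9
cores = 10.5e6
sypd = 0.07                # simulated years per wall-clock day

fraction_of_peak = sustained_pf / peak_pf     # about 6.4% of peak
unknowns_per_core = unknowns / cores          # about 73,000 unknowns per core
sim_days_per_day = sypd * DAYS_PER_YEAR       # about 25.6 simulated days per day

print(f"{fraction_of_peak:.1%} of peak")
print(f"{unknowns_per_core:,.0f} unknowns per core")
print(f"{sim_days_per_day:.2f} simulated days per wall-clock day")
```

The single-digit fraction of peak is typical of memory-bound implicit solvers and is consistent with the low byte-to-flop ratio discussed for this machine.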
Earthquake simulation. Earthquake simulation is another traditional major challenge for supercomputers. Starting with AWP-ODC and CG-FDM codes, Chinese researchers have developed nonlinear earthquake simulation software on Sunway TaihuLight, winning the ACM Gordon Bell Prize in 2017. While TaihuLight delivers an unprecedented level of computing power (three times that of Tianhe-2 and five times that of Titan), its memory system is relatively modest. Total memory size is similar to that of other systems (such as Piz Daint and Titan, two GPU-based systems), but the byte-to-flop ratio is significantly lower: roughly 1/5 that of other heterogeneous systems and 1/10 that of the K Computer. Such a system represents both high potential and notable challenges for scaling scientific applications. For earthquake simulation in particular, which requires both a large amount of memory and high memory bandwidth, breaking the memory wall becomes the top challenge. To resolve this bandwidth constraint, Chinese researchers have performed three notable optimizations: a customized parallelization scheme that employs the 10 million cores efficiently at both the process level and the thread level (to address the scale challenge); an elaborate memory scheme that integrates on-chip halo exchange through register communication, optimized blocking configuration guided by an analytic model, and coalesced DMA access with array fusion (to alleviate the memory constraint); and on-the-fly compression that doubles the maximum problem size and further improves performance by 24% (to further address the memory wall). The extreme cases demonstrate sustained performance greater than 18.9 Pflops, enabling simulation of the Tangshan earthquake through an 18Hz scenario with eight-meter resolution.
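The text does not say which compression scheme the earthquake code uses. As a purely hypothetical illustration of how a lossy 2:1 scheme can double the problem size that fits in a fixed amount of memory, the sketch below stores single-precision values as IEEE 754 half-precision values using Python's standard `struct` module. The function names and sample data are invented for the example.

```python
import struct


def compress(values):
    """Pack floats as little-endian IEEE 754 half precision (2 bytes each).

    Lossy: about three significant decimal digits survive, and values
    beyond the half-precision range raise OverflowError.
    """
    return struct.pack(f"<{len(values)}e", *values)


def decompress(blob):
    """Unpack half-precision bytes back into Python floats."""
    count = len(blob) // 2
    return list(struct.unpack(f"<{count}e", blob))


# Hypothetical sample of a wavefield variable.
field = [0.0, 0.5, -1.25, 3.14159, 100.0]

single = struct.pack(f"<{len(field)}f", *field)   # 4 bytes per value
half = compress(field)                            # 2 bytes per value
restored = decompress(half)

print(len(single), len(half))
print(restored)
```

Halving the bytes per value both doubles the grid that fits in memory and halves the traffic per value moved, which is why compression can help with capacity and bandwidth at once. The cost is precision, so a production code would need to bound the error it introduces; that analysis is outside this sketch.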
Drug design. Virtual high-throughput screening is an established computational method for identifying drug candidates from a large collection of compound libraries, accelerating the drug-discovery process. When diseases and unknown viruses appear, it is especially useful for screening as many molecules as possible to help identify an effective treatment. The kernel algorithm is the Lamarckian Genetic Algorithm, a combined local search and genetic algorithm for efficient global-space coverage and local-search optimization. A typical data scale of 40 million molecules requires more than 800TB. With the need to handle approximately 40 million small files, the optimized design on Tianhe-2 takes advantage of the high throughput of the H2FS file system, I/O-congestion control, multi-stage task scheduling, task-pool management, asynchronous I/O, and communication to improve application scalability. The design is able to screen 35 million candidate drug molecules against the Ebola virus in 20 hours. The parallel efficiency from 500 to 8,000 nodes (1.6 million hybrid cores) is over 84%. Such computational capability demonstrates how Tianhe-2 is able to screen all known 40 million drug molecules against an unknown virus in a single day.
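The single-day claim above follows from the measured throughput, as the following back-of-the-envelope sketch shows. It uses only numbers quoted in the text; treating 84% as the exact efficiency (the text says "over 84%") and using decimal terabytes are simplifying assumptions.

```python
molecules_screened = 35e6
hours_taken = 20.0

rate_per_hour = molecules_screened / hours_taken   # 1.75 million per hour
per_day = rate_per_hour * 24                       # 42 million per day

# Scaling from 500 to 8,000 nodes at 84% parallel efficiency.
nodes_small, nodes_large = 500, 8000
efficiency = 0.84                                  # lower bound from the text
ideal_speedup = nodes_large / nodes_small          # 16x
achieved_speedup = efficiency * ideal_speedup      # about 13.4x

# Average data volume per molecule for the 800TB, 40-million-molecule case.
bytes_per_molecule = 800e12 / 40e6                 # 20 MB

print(f"{per_day / 1e6:.0f} million molecules per day")
print(f"{achieved_speedup:.2f}x speedup on {ideal_speedup:.0f}x the nodes")
print(f"{bytes_per_molecule / 1e6:.0f} MB per molecule")
```

At 42 million molecules per day, the measured rate leaves a small margin over the 40 million known molecules.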
Large-scale graph computing. With increasing demand for graph processing, both Sunway TaihuLight and Tianhe-2 have earned Graph500 breadth-first-search (BFS) scores. Sunway TaihuLight ranks second at 23,755.7 giga-traversed edges per second (GTEPS), and Tianhe-2 ranks tenth.
In addition, the graph-processing framework ShenTu was developed on the Sunway TaihuLight, allowing users to write vertex-centric graph-processing programs and scale out the computing to the whole Sunway TaihuLight machine. The framework can support such graph algorithms as PageRank, Shortest Path, BFS, and K-Core with just 20 lines of code. It can process graphs with 10 trillion edges in tens of seconds. For example, ShenTu can complete one round of page ranking in 21 seconds on a 12-trillion-edge real-world Web graph, an order-of-magnitude performance improvement on graphs that are one order of magnitude larger than prior work.
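To show what a vertex-centric program of roughly 20 lines looks like, here is a minimal single-machine PageRank sketch. It is not ShenTu's API, which the text does not describe; the function name, the adjacency-dictionary input, the damping factor, and the round count are all assumptions made for illustration. Every edge target must also appear as a key of the input dictionary.

```python
def pagerank(out_edges, damping=0.85, rounds=50):
    """Vertex-centric PageRank: each vertex scatters its rank along its
    out-edges, then every vertex combines what it received."""
    vertices = list(out_edges)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(rounds):
        incoming = {v: 0.0 for v in vertices}
        dangling = 0.0
        for v in vertices:                      # scatter phase
            targets = out_edges[v]
            if targets:
                share = rank[v] / len(targets)
                for t in targets:
                    incoming[t] += share
            else:
                dangling += rank[v]             # no out-edges: spread evenly
        rank = {                                # gather/apply phase
            v: (1 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in vertices
        }
    return rank


# Tiny hypothetical graph: vertex -> list of vertices it links to.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In a distributed framework the scatter phase becomes messages between machines, so the cost of one round is dominated by the number of edges. On the figures quoted above, one round over 12 trillion edges in 21 seconds corresponds to roughly 570 billion edges processed per second.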
Deep-learning applications. In addition to traditional applications, efforts are under way to explore the potential of training complex deep neural networks (DNNs) on these heterogeneous supercomputers. For example, swDNN is a highly efficient library on Sunway TaihuLight for accelerating deep-learning applications. By identifying the most suitable approach for mapping convolutional neural networks (CNNs) onto the 260 cores within the chip, swDNN achieves double-precision performance greater than 1.6Tflops for the convolution kernel, which is over 54% of the theoretical peak of the SW26010 processor. Parallel training is supported through swCaffe, a redesigned version of Caffe, for large-scale training on up to 1,000 Sunway nodes.
Some deep-learning applications run on Tianhe-2, including those for tumor diagnosis, video analysis, and intelligent transportation. One application called "trade business of Guangzhou" supports 900 million deals annually.
As of July 2018, the Summit supercomputer (powered by IBM POWER9 and Nvidia V100 processors) was ranked number one in the TOP500, achieving 122PF LINPACK performance and 3.3Exaops for data processing and AI applications at half precision. However, a number of planned systems will soon surpass it. In the past few years, several countries have targeted exascale computing, including ECP in the U.S., Post K in Japan, and EuroHPC in the E.U., aiming for breakthroughs in key technologies, including novel architecture, high energy efficiency, system software, and exascale applications. These efforts lead the way toward next-generation supercomputing systems.
China's exascale project. The key HPC project in China's 13th five-year research and development program was launched two years ago to pursue a two-step strategy for developing exaflops supercomputing. The first step aims to deploy three prototype exascale computers by the Tianhe team, the Sunway team, and the Sugon team, respectively, pursuing novel architectures, kernel technology breakthroughs, and possible technical approaches for implementing future exascale systems. Carried out from 2016 to 2018, the projects were completed by the end of June 2018. The second step is to select two of the three to develop exascale systems by the end of 2020.
The project aims to develop self-dependent and controllable kernel technologies for exascale computing and maintain China's leading position in global HPC; develop a number of critical HPC applications and build a national software center, establishing an HPC-application ecosystem; and build a national HPC environment with world-leading resources and services.
The Chinese exascale system will aim to achieve the following specification: peak performance of 1EFlops, node performance greater than 10TFlops, memory capacity greater than 10PB, storage capacity of 1EB, interconnection network bandwidth greater than 500Gbps, Linpack efficiency over 60%, and energy efficiency greater than 30GFlops/W. Moreover, the system should include an easy-to-use parallel programming environment, monitoring and fault-tolerance management, and support for large-scale applications.
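The targets above imply several system-level bounds that are worth making explicit. The sketch below derives them with simple arithmetic. The text does not say whether the energy-efficiency target is measured against peak or Linpack performance, so both readings are computed; the variable names are illustrative, and decimal prefixes are assumed throughout.

```python
peak_flops = 1e18           # 1 EFlops peak
node_flops_min = 10e12      # node performance "greater than 10TFlops"
memory_bytes_min = 10e15    # memory capacity "greater than 10PB"
linpack_eff_min = 0.60      # Linpack efficiency "over 60%"
energy_eff_min = 30e9       # "greater than 30GFlops/W"

# Faster nodes mean fewer of them, so the minimum node speed caps the count.
max_nodes = peak_flops / node_flops_min                 # 100,000 nodes

# Minimum sustained Linpack performance.
min_linpack_flops = linpack_eff_min * peak_flops        # 0.6 EFlops

# Power ceiling, under two readings of the efficiency target.
max_power_mw_peak = peak_flops / energy_eff_min / 1e6           # ~33.3 MW
max_power_mw_linpack = min_linpack_flops / energy_eff_min / 1e6  # ~20 MW

# Memory per unit of compute at the minimum capacity.
bytes_per_flop = memory_bytes_min / peak_flops          # 0.01 byte per flop/s

print(f"at most {max_nodes:,.0f} nodes")
print(f"power ceiling: {max_power_mw_linpack:.1f} to {max_power_mw_peak:.1f} MW")
print(f"{bytes_per_flop:.2f} bytes of memory per flop/s")
```

Read this way, the specification describes a machine of no more than about 100,000 nodes drawing a few tens of megawatts, with a memory capacity of about one byte per hundred flop/s.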
Our approach. Exascale computing must address unprecedented technical challenges worldwide, including the memory wall, communication wall, reliability wall, energy-consumption wall, and programming wall. A strategy of hardware and software co-design will thus be required. For example, new algorithms will be proposed and implemented with the target hardware features in mind. Resilience will be addressed through fault-tolerant hardware design and fast failure detection and recovery enabled by software.
China must also rely on self-controllable technologies, especially for basic hardware components like processors, memory, and interconnect networks, to build an exascale system. The Chinese microelectronics and IC industry is still relatively weak, thus calling for more basic research and technology development. Also, China must satisfy various complex application needs and deal with a huge and highly diverse market, thus calling for multiple design and development approaches. The current key HPC project relies on architectural innovation, technology breakthroughs, and hardware and software coordination to address these challenges. Novel architectures will be explored to address the requirement of the various applications. Engineering trade-offs will be necessary to balance metrics in power consumption, performance, programmability, and resilience. Technology breakthroughs will be pursued through comprehensive research efforts. Special attention will target application software.
The Chinese government is encouraging development of the kernel technologies, including high-performance processors and accelerators, novel memory devices, and interconnect networks. The effort toward self-controllable processor technologies includes Sunway's SW many-core processor, NUDT's FT series CPU and Matrix series accelerator, and Sugon's X86 AMD-licensed processor. NUDT has developed its proprietary interconnect network TH-Net with high bandwidth and low latency, making the TH-2 system efficient and scalable. The Sunway system also includes its own self-designed large-scale network, enabling the TaihuLight system to run efficiently on 10 million cores. More technology breakthroughs are still needed to support successful development of exascale systems.
The key HPC project also targets applications focusing on climate change, ocean simulation, combustion, electromagnetic-environment simulation, oil exploration, material science, astrophysics, and life science. New computational models and algorithms will be designed, and the efficiency, scalability, and reliability of the applications will be evaluated for future exascale systems.
The pervasive use of HPC has promoted development of large-scale parallel software. Chinese researchers are strengthening development of system software and application software for domestic hardware systems, aiming to establish the country's own HPC ecosystem.
Emerging big data and AI applications have also gained the attention of Chinese HPC research programs. The National Natural Science Foundation of China, the counterpart of the U.S. National Science Foundation, has launched an initiative in big-data science to research computational models, algorithms, and platforms for data analytics and processing. Related projects focus on such big-data-related fields as video processing, health and medicine, intelligent transport, finance, government administration, and intelligent education. And an upcoming national research initiative on AI will call for HPC support for AI applications. The scope of HPC applications will definitely broaden in the future.
Parallel computers and parallel applications have cross-pollinated each other in China for the past 15 years. The availability of leading-class supercomputers has stimulated the growth of parallel applications in a number of fields, an application- and technology-driven-growth trend that will continue into the future.
How to maintain sustainable development toward the next generation of supercomputing in China is an open question. Though significant progress has been made in recent years, China is still behind Western countries in HPC in many respects. A long-term national plan on HPC is needed that would allow more systematic deployment of HPC research. A mechanism that would ensure sustainable development of the national HPC infrastructure must be established so the supercomputing centers do not have to struggle to find the money needed to run the supercomputers.
Exascale computing projects are being implemented in the U.S., Japan, and Europe, aiming to deliver exaflops computers in three to five years. Their effort is like mountain climbing. Climbers can enjoy the magnificent scenery only when they get to the top following their arduous journey. The Chinese HPC community is willing to work with the international HPC community to pursue the goal of exascale computing, sharing the experience and jointly attacking the grand challenges. HPC should not be a new kind of arms race but technology that benefits all people.
Chinese researchers also need to be aware of new technologies and applications. The emergence of big data and AI brings new challenges and opportunities to HPC. Supporting big data and AI with HPC while being rewarded by big-data- and AI-enabled technologies for HPC should drive coordinated and converged development of all three. All should take this opportunity to embrace this new exciting era of supercomputing.
©2018 ACM 0001-0782/18/11