Academic Supercomputer Seeks a Top 10 Ranking

Stampede2 provides high-performance computing capabilities for thousands of researchers. Stampede2 at the Texas Advanced Computing Center is the world's fastest academic supercomputer.

Stampede2 is a second-generation academic supercomputer under development with $30 million in National Science Foundation (NSF) funding at the Texas Advanced Computing Center (TACC) at The University of Texas at Austin. TACC is building it in collaboration with Intel, which supplies the Xeon Phi and Xeon Skylake central processing units (CPUs), and Dell, which supplies the PowerEdge racks.

The supercomputer is being constructed in three phases, the first of which went online in July.

"Stampede2 Phase One has already been tested by users like Stephen Hawking's Cambridge University team doing visualizations of gravitational waves," said TACC executive director Dan Stanzione.

The new hardware serves the more than 13,000 academic and other NSF users nationwide who switched over from its predecessor, Stampede1. Its higher speed is also expected to attract new users, who will be able to run simulations on the 18-petaflop Stampede2 (when fully deployed in mid-2018) that the 9.6-petaflop Stampede1 could not handle. Stampede2's construction is also being assisted by Seagate Technology, and the supercomputer itself is operated by a team of cyberinfrastructure experts at TACC, The University of Texas at Austin, Clemson University, Cornell University, the University of Colorado at Boulder, Indiana University, and Ohio State University.

"Supercomputers are having such a large impact on both science and design. Stampede-2 will enable the NSF community to accomplish goals that they would not have been able to do without it—from more accurate weather predictions to curing cancer," said Patricia (Trish) Damkroger, vice president of Intel's Data Center Group and general manager of its Technical Computing Initiative for the Enterprise and Government. "Intel was instrumental in helping Dell bring up the system, but perhaps Intel's engineer's greatest contribution was code optimization for Stampede-2's particular configuration."

According to Damkroger, TACC's Stampede2, the Barcelona Supercomputing Center's MareNostrum in Spain, and Cineca's Marconi in Italy were all in Intel's Skylake early release program for pre-production Xeon Phi and Xeon Skylake CPUs, netting them the 12th, 13th, and 14th spots, respectively, on July's Top500 supercomputer ranking. By the next Top500 ranking in November, TACC hopes Stampede2's upgrades will lift it into the Top 10.

Currently, Stampede2's Phase One configuration uses 4,200 Intel Xeon Phi CPUs, each with 68 cores clocked at 1.4 GHz, 96 gigabytes of double-data-rate (DDR) random access memory (RAM), 16 gigabytes of high-speed multi-channel dynamic RAM (MCDRAM), and 512-bit-wide vector operations. That comes to a total of 285,600 Xeon Phi cores, drawing 1,890 kW.
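The Phase One totals follow from simple per-CPU arithmetic. The short Python sketch below reproduces the core count from the figures quoted above and adds the aggregate memory. The variable names are illustrative, and the peak-rate estimate rests on an assumption the article does not state: that each Xeon Phi core can complete 32 double-precision floating-point operations per cycle (two 512-bit vector units with fused multiply-add). Treat that last number as a rough theoretical ceiling, not a measured result.

```python
# Phase One figures quoted in the article
PHI_CPUS = 4_200
CORES_PER_CPU = 68
DDR_GB_PER_CPU = 96
MCDRAM_GB_PER_CPU = 16
CLOCK_GHZ = 1.4

# Assumption (not from the article): 2 vector units x 8 doubles x 2 ops (FMA)
FLOPS_PER_CORE_PER_CYCLE = 32

total_cores = PHI_CPUS * CORES_PER_CPU
total_ddr_tb = PHI_CPUS * DDR_GB_PER_CPU / 1_000        # GB -> TB
total_mcdram_tb = PHI_CPUS * MCDRAM_GB_PER_CPU / 1_000  # GB -> TB

# cores x GHz x flops/cycle gives gigaflops; divide by 1e6 for petaflops
peak_pflops = total_cores * CLOCK_GHZ * FLOPS_PER_CORE_PER_CYCLE / 1e6

print(f"Xeon Phi cores:        {total_cores:,}")
print(f"Aggregate DDR:         {total_ddr_tb:.1f} TB")
print(f"Aggregate MCDRAM:      {total_mcdram_tb:.1f} TB")
print(f"Estimated peak (assumed flops/cycle): {peak_pflops:.2f} petaflops")
```

The core count matches the 285,600 figure in the text; the memory totals and peak estimate are derived values, not numbers reported by TACC.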

Even with only Phase One completed, users are already raving about Stampede2's performance compared to that of Stampede1.

"We have worked with a ray-tracing application that models thermal energy radiation transfer in a new, much more efficient class of clean coal boilers being developed by GE," said Martin Berzins, a Stamepede-2 user and visiting professor from the University of Leeds (U.K.) currently at the Scientific Computing Institute at the University of Utah (Salt Lake City). "The computation results we have been able to calculate so far have shown that not only is Stampede-2, with its new Xeon Phi Knights Landing processors, much faster than the Xeon Phi Knights Corner processors in Stampede1, but that the fast on-chip memory (MCDRAM) makes it possible to get better performance for memory intensive applications, too."

Stampede2's Phase One is optimized for parallel workloads, but Phase Two will add 1,736 Intel Xeon Skylake 24-core CPUs (41,664 cores total) running at 2.1 GHz and performing 512-bit-wide vector operations. Each 24-core Xeon Skylake will consume more power than each 68-core Intel Xeon Phi, but will be almost 50% faster. The Intel Xeon Phi cores, optimized for parallel code, will provide 70% of Stampede2's computing power, with the smaller number of Intel Xeon Skylakes handling its serial analytic tasks. Phase Two is scheduled for completion before the end of this year.
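The Phase Two numbers can be checked the same way. The sketch below uses only figures quoted in the article; the variable names are illustrative. It also computes the ratio of the two clock speeds, which is one possible reading of the "almost 50% faster" comparison, although the article does not say which measure that figure refers to.

```python
# Figures quoted in the article
PHI_CPUS, PHI_CORES_EACH, PHI_GHZ = 4_200, 68, 1.4
SKX_CPUS, SKX_CORES_EACH, SKX_GHZ = 1_736, 24, 2.1

phi_cores = PHI_CPUS * PHI_CORES_EACH
skx_cores = SKX_CPUS * SKX_CORES_EACH
total_cores = phi_cores + skx_cores

# Per-core clock ratio: an interpretation of "almost 50% faster", not a stated fact
clock_ratio = SKX_GHZ / PHI_GHZ

# Share of cores (a different quantity from the share of computing power)
phi_core_share = phi_cores / total_cores

print(f"Xeon Skylake cores: {skx_cores:,}")
print(f"All cores:          {total_cores:,}")
print(f"Clock ratio:        {clock_ratio:.2f}x")
print(f"Xeon Phi share of cores: {phi_core_share:.1%}")
```

The Skylake core count agrees with the 41,664 total in the text. The share of cores is not the same quantity as the 70% share of computing power quoted above, which depends on per-core throughput as well as core count.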

By mid-2018, Phase Three will add Intel's Optane technology to Stampede2, using 3-D crosspoint memories composed of multiple 512-Gbyte non-volatile dual in-line memory modules (DIMMs).

"The 3-D crosspoint memories, which are halfway between memory and storage—faster than flash, but denser than DRAM—will be tried out in several use-cases," said Stanzione, "such as using a couple hundred terabytes for check points, for large memories on a single node, and for apps requiring extra-large scratch pad memories."

Stampede1 code will run unchanged on Stampede2, up to nine times faster, for applications from quantum simulations to artificial intelligence (AI), as well as bellwether apps including weather prediction, chemistry simulations, computational fluid dynamics simulations, astronomical visualizations, high-temperature internal combustion simulations, and fusion reactor research. Stampede2 will also run more machine learning apps, more life science simulations, more analytics, and more social science apps than Stampede1.

For Stampede2 user George Biros, leader of the Institute for Computational Engineering and Science Parallel Algorithms for Data Analysis and Simulation Group at The University of Texas at Austin, "the great advantage of Stampede2 is that running applications requires exactly the same workflow as in Stampede1. Of course, the performance on Stampede2 is much better; not only is it faster, but it has much more memory per node."

Biros said Stampede2 "provides tremendous computational resources in a completely transparent way to the researcher. Porting existing code is very easy and for some problems, getting good performance is also quite easy. But of course, for some applications, some further tuning is necessary to get good performance, such as when using Advanced Vector Extensions, as well as when carefully managing the memory hierarchy and the different vector ports on the Knights Landing [Xeon Phi] cores."

Stampede2's Intel Omni-Path interconnect, which runs at 100 gigabits per second and has six core switches, is similar to Stampede1's InfiniBand interconnect in that both use two fat-tree fabrics. However, Omni-Path communication is accelerated by integrating the fabric interfaces directly onto both the Xeon Phi and Xeon Skylake CPUs. Early users have already proven this high-performance Omni-Path Architecture (OPA) fabric interface on complex workloads, including molecular simulations of biological systems by biochemist Rommie Amaro at the University of California, San Diego, and magnetic resonance imaging (MRI) analyses of brain cancer by Biros.

Stampede2 also uses two dedicated high-performance Lustre file systems with a storage capacity of 31 petabytes. TACC's Stockyard-hosted Global Shared File System provides an additional 20 petabytes of Lustre storage.

Stampede2 is also protected by a proprietary suite of cybersecurity protocols. "We get thousands of attempts per hour to breach the system, but block them with our intrusion protection system," said Stanzione.

With the U.S., China, Japan, and several European nations increasing their funding of supercomputer development in a sprint to reach exascale computing (over 1,000 petaflops), Stanzione predicts exascale supercomputers will arrive by 2020, adding that TACC will not be able to afford their price tag, which he expects to reach $500 million.
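To put the exascale target in perspective, the sketch below compares the figures quoted in this article. The variable names are illustrative, and the cost comparison is only indicative: the $30 million is the NSF funding figure cited for Stampede2, which may not equal its total cost.

```python
# Figures quoted in the article
STAMPEDE1_PFLOPS = 9.6
STAMPEDE2_PFLOPS = 18.0          # when fully deployed in mid-2018
EXASCALE_PFLOPS = 1_000.0        # threshold for exascale computing
STAMPEDE2_FUNDING_MUSD = 30.0    # NSF funding, millions of dollars
EXASCALE_PRICE_MUSD = 500.0      # Stanzione's expected price tag

generation_speedup = STAMPEDE2_PFLOPS / STAMPEDE1_PFLOPS
exascale_gap = EXASCALE_PFLOPS / STAMPEDE2_PFLOPS
price_ratio = EXASCALE_PRICE_MUSD / STAMPEDE2_FUNDING_MUSD

print(f"Stampede2 vs. Stampede1: {generation_speedup:.2f}x")
print(f"Exascale vs. Stampede2:  {exascale_gap:.1f}x")
print(f"Price vs. NSF funding:   {price_ratio:.1f}x")
```

By these numbers, an exascale machine would need more than 50 times the performance of a fully deployed Stampede2, at an expected price well over 10 times Stampede2's NSF funding.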

R. Colin Johnson is a Kyoto Prize Fellow who has worked as a technology journalist for two decades.