
Petaflops Computing

PetaFLOPS computers—capable of performing a thousand trillion mathematical operations per second, 25 times faster than the largest supercomputers today—will open new doors to understanding the functions of biological molecules.

Imagine you wanted to model some important biological problem (such as understanding how each atom moves in the process of the folding of a protein) and suppose you had an unlimited budget. What obstacles would you face if you wanted to build a PetaFLOPS computer—defined as being able to perform one thousand trillion (10^15) FLoating point Operations Per Second—so you could model the folding behavior down to the level of individual atoms?

Creating a PetaFLOPS supercomputer would take 200,000 current microprocessors (assuming they operated at 2.5GHz and that each one performed two floating point operations per clock cycle). Just the microprocessors of such a computer would take about as much electricity as 10,000 typical U.S. homes. Current processor chips produce more heat per square inch than a nuclear power plant, so cooling a PetaFLOPS-class computer would itself involve a tremendous technical and financial effort. Moreover, some very large supercomputer systems must be housed in custom-made buildings with an entire floor dedicated to the computer and the floor immediately below dedicated to electrical power and cooling equipment.
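
The arithmetic behind these estimates is easy to reproduce. Here is a minimal sketch in Python; the per-chip and per-home power figures are illustrative assumptions chosen to match the article’s round numbers, not measured values:

```python
# Back-of-the-envelope sizing of a PetaFLOPS machine built from
# commodity microprocessors (circa-2004 assumptions from the text).

TARGET_FLOPS = 1e15          # one PetaFLOPS
CLOCK_HZ = 2.5e9             # 2.5GHz
FLOPS_PER_CYCLE = 2          # two floating point operations per clock

flops_per_chip = CLOCK_HZ * FLOPS_PER_CYCLE        # 5 GigaFLOPS
chips_needed = TARGET_FLOPS / flops_per_chip       # 200,000 chips

# Illustrative power figures (assumptions, not from the article):
WATTS_PER_CHIP = 60          # plausible desktop-class CPU of the era
WATTS_PER_HOME = 1200        # rough average U.S. household draw

total_watts = chips_needed * WATTS_PER_CHIP        # 12 megawatts
homes = total_watts / WATTS_PER_HOME               # ~10,000 homes

print(f"chips needed:    {chips_needed:,.0f}")
print(f"processor power: {total_watts / 1e6:.0f} MW (~{homes:,.0f} homes)")
```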

Even if you could solve all the physical problems, you’d still face the daunting computer science problem of getting 200,000 processors to work together effectively on a single program. Ever since the publication of the book The Mythical Man-Month: Essays on Software Engineering by Frederick P. Brooks more than 20 years ago, people have understood that human activities are not infinitely subdividable. What takes one programmer a month to do cannot generally be done by 10 programmers working for three days.

The same issue arises when trying to divide work among a large number of computer processors. Deciding how to do it is a critical area of research in parallel computing (the use of a large number of computers together in parallel to solve a single problem or task). How might a large amount of computation be divided up among many CPUs so each one has work to do and is not left idle waiting for information from other processors? What should be done about the tasks that simply cannot be subdivided? Amdahl’s Law describes the issue in the following way:

Speedup(N) = 1 / (S + (1 - S) / N)

where N is the number of processors and S is the fraction of a computer program that executes serially, where only one processor can be active at a time (such as at program initialization). The key point is that even a small serial portion limits how much the program can be sped up by adding processors. If as little as 1% of a program runs serially, then the maximum possible speedup is a factor of 100, no matter how many processors are used. A program whose execution time could in theory be reduced (as compared to a single processor) by a factor of 100,000 must be at least 99.999% parallel, a daunting proposition.
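
A few lines of Python make the limit concrete (a minimal sketch of the formula above; the serial fractions and processor counts are illustrative):

```python
def amdahl_speedup(serial_fraction: float, n_processors: int) -> float:
    """Maximum speedup under Amdahl's Law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# Even 1% serial code caps the speedup near 100x, no matter how many
# processors are added; a 100,000x speedup demands 99.999% parallelism.
for s in (0.01, 0.001, 0.00001):
    for n in (1_000, 200_000):
        print(f"S={s:<8} N={n:>7,}  speedup={amdahl_speedup(s, n):>10,.1f}x")
```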

The effort to create PetaFLOPS computing systems to study complex biological phenomena thus faces three critical challenges: money (never unlimited); the physical aspects of providing electricity and cooling for very large systems; and the algorithms needed to create programs that will run effectively on a PetaFLOPS-scale computer.


Specialized Strategies

Due to the challenges of developing applications that will scale to run on a PetaFLOPS computer, and due to the practical problems of managing power and cooling, current efforts to build PetaFLOPS computing systems take strategies other than simply making today’s large supercomputers bigger. For example, IBM’s Blue Gene project is leveraging low-power chip technologies developed for embedded microprocessors to achieve extremely high ratios of performance to power dissipated, while combining these CPUs with a highly specialized interconnect that greatly facilitates large-scale parallel programming. Meanwhile, the RIKEN Institute (Institute of Physical and Chemical Research) of Japan is taking a different approach, building specialized systems (such as the Molecular Dynamics Machine, or MDM) that will scale up to a PetaFLOPS using chips that perform certain calculations used in molecular dynamics very quickly but are not useful for other kinds of problems.

Blue Gene. The Blue Gene project is pursuing two goals: pushing back the boundaries of machine architecture and system software to enable peta-scale computing; and conducting research to advance the understanding of biological processes (such as protein folding) via large-scale simulation (see the article by David A. Bader in this section). When validated by physical experiment, simulation provides a detailed view of microscopic phenomena that may be relevant to folding-related diseases (such as cystic fibrosis) and to the dynamics of membrane-bound proteins, which represent a large fraction of the drug targets pursued by the pharmaceutical industry.

The first member of the Blue Gene family, called Blue Gene/L, is the result of a partnership between IBM and the Lawrence Livermore National Laboratory (LLNL). In its full configuration, due to be unveiled in 2005, Blue Gene/L will have 65,536 nodes organized in a 32 × 32 × 64 3D mesh connected as a torus (wrapped in three dimensions). Each node will include two PowerPC 440 processors with double floating-point units, 4MB of on-chip L3 cache, and 512MB of off-chip double data rate memory. The planned CPU frequency is 700MHz. With each processor capable of executing two fused multiply-add operations per cycle, the theoretical peak performance will be 360TeraFLOPS.
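
The quoted peak follows directly from these figures; a quick sketch of the arithmetic:

```python
# Theoretical peak of the full Blue Gene/L configuration.
NODES = 32 * 32 * 64     # 65,536 nodes in the 3D torus
CPUS_PER_NODE = 2        # two PowerPC 440 processors per node
FMAS_PER_CYCLE = 2       # fused multiply-adds per processor per cycle
FLOPS_PER_FMA = 2        # each multiply-add counts as two operations
CLOCK_HZ = 700e6         # planned 700MHz frequency

peak = NODES * CPUS_PER_NODE * FMAS_PER_CYCLE * FLOPS_PER_FMA * CLOCK_HZ
print(f"peak: {peak / 1e12:.0f} TeraFLOPS")   # ~367, quoted as 360
```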

Blue Gene/L has two communication networks:

  • A 3D torus. Each node is connected to its six nearest neighbors in a 3D array. The connections are wrapped in each dimension; that is, the node with the largest x-coordinate is connected to the node with the smallest x-coordinate (see the sketch after this list). Each of a node’s 12 links (six incoming, six outgoing) to and from its nearest neighbors in the torus network has 1.4Gbps of communications bandwidth; and
  • A global tree network. Physically distinct from the torus, the global tree enables fast one-to-all and all-to-all broadcasts, with fixed-point arithmetic operations implemented in hardware. One-way latency of a tree traversal is approximately two microseconds.
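
Here is a small sketch of how wraparound neighbors are found on such a torus (toy code for illustration, not Blue Gene/L’s actual addressing logic; the dimensions are those given above):

```python
# Coordinates of a node's six nearest neighbors on a 3D torus.
# Wrapping at the edges is just modular arithmetic in each dimension.
DIMS = (32, 32, 64)   # Blue Gene/L's full 3D mesh

def torus_neighbors(x: int, y: int, z: int):
    """Yield the six wraparound neighbors of node (x, y, z)."""
    for axis, size in enumerate(DIMS):
        for step in (-1, +1):
            coord = [x, y, z]
            coord[axis] = (coord[axis] + step) % size  # wrap around
            yield tuple(coord)

# A node on the "edge" of the mesh still has exactly six neighbors:
print(list(torus_neighbors(31, 0, 63)))
```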

IBM, LLNL, and other collaborators are assessing the suitability of Blue Gene for a range of applications, including climate modeling, computational fluid dynamics, quantum chemistry, and the analysis of radio-astronomy data. IBM’s own efforts emphasize development of a framework for classical biomolecular simulation, called Blue Matter, to support studies of protein/membrane systems and protein folding. Blue Matter was used on IBM SP hardware in published work on the folding kinetics of the β-hairpin, a small peptide that has been studied extensively as a model system and that captures many aspects of protein folding in more complicated systems. The Blue Gene team is currently using Blue Matter to run simulations of rhodopsin, a membrane protein involved in the perception of light by the eye, on a 512-node Blue Gene/L prototype.

The Blue Gene application effort is also exploring parallel decompositions and algorithms in the context of Blue Gene/L. One example of such an algorithm is a highly scalable 3D Fast Fourier Transform (FFT) developed for use in the particle-particle-particle-mesh Ewald (P3ME) technique, which is widely used for evaluating electrostatic forces in biomolecular simulations.
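
To see why an FFT appears here at all: the mesh part of an Ewald-style method solves Poisson’s equation for a periodic charge grid in Fourier space. The following is a minimal serial sketch using NumPy as a stand-in for the team’s scalable parallel 3D FFT; the grid size, units, and random charges are illustrative assumptions:

```python
import numpy as np

# Solve -laplacian(phi) = 4*pi*rho on a periodic mesh via the 3D FFT
# (Gaussian units). This is the "mesh" step of a P3ME-style method.
N = 64
rho = np.random.default_rng(0).normal(size=(N, N, N))  # toy charge mesh
rho -= rho.mean()                                      # enforce neutrality

k = 2 * np.pi * np.fft.fftfreq(N)
kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
k2 = kx**2 + ky**2 + kz**2
k2[0, 0, 0] = 1.0                    # avoid dividing by zero at k = 0

phi_hat = 4 * np.pi * np.fft.fftn(rho) / k2
phi_hat[0, 0, 0] = 0.0               # fix the mean of the potential
phi = np.fft.ifftn(phi_hat).real     # electrostatic potential on the mesh
print(phi.shape, phi.std())
```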

With the capabilities enabled by a 20,480-node Blue Gene/L system to be installed at the IBM T.J. Watson Research Center in 2005, Blue Gene researchers expect to break new ground in three directions: improving statistical sampling in simulations; studying larger systems (protein and protein/membrane); and routinely accessing very long timescales (hundreds of nanoseconds to microseconds) in simulations of biomolecular systems.

Prototype Blue Gene/L machines have appeared on the top 500 supercomputer list (www.top500.org), which ranks supercomputers by performance on the LINPACK benchmark. For example, on the June 2004 list, a 4,096-node 500MHz prototype occupied the fourth position with a measured performance of 11.68TeraFLOPS, and a 2,048-node 700MHz BG/L prototype was eighth at 8.655TeraFLOPS.

Molecular Dynamics Machine. MDM is a super high-speed computer (78TeraFLOPS peak performance) developed for molecular dynamics simulations of biomolecules at RIKEN’s Advanced Computing Center. Molecular dynamics simulations treat atoms as classical particles interacting through three forces: bonding, van der Waals, and Coulomb. In molecular dynamics calculations, almost all the computing time is consumed in the calculation of the pair-wise van der Waals and Coulomb forces. MDM calculates these forces using three components: MD-GRAPE-2 (Molecular Dynamics GRAvity PipE-2) accelerator chips for the van der Waals force and the real space part of the Coulomb force (using the Ewald method); WINE-2 (Wave Integrator for the Ewald Method-2) chips for the wave number space portion of the Coulomb force; and a host computer.
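
That the pair forces dominate is easy to see in code. Below is a minimal O(N²) sketch of the Lennard-Jones (van der Waals) plus Coulomb loop that pipelines like MD-GRAPE-2 implement in hardware; the parameters and units are illustrative, and production codes add cutoffs, neighbor lists, and the Ewald splitting described above:

```python
import numpy as np

def pair_energy(pos, q, epsilon=1.0, sigma=1.0):
    """O(N^2) Lennard-Jones + Coulomb energy over all particle pairs.

    This all-pairs loop is where molecular dynamics codes spend nearly
    all of their time, and it is what MD-GRAPE-style pipelines accelerate.
    """
    n = len(pos)
    energy = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(pos[i] - pos[j])
            sr6 = (sigma / r) ** 6
            energy += 4 * epsilon * (sr6 * sr6 - sr6)  # van der Waals
            energy += q[i] * q[j] / r                  # Coulomb
    return energy

rng = np.random.default_rng(1)
pos = rng.uniform(0.0, 10.0, size=(50, 3))   # 50 particles in a toy box
q = rng.choice([-1.0, 1.0], size=50)         # illustrative unit charges
print(pair_energy(pos, q))
```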

The MD-GRAPE chip is an application-specific large-scale integration (LSI) chip with four pipelines for pairwise force calculation. The GRAPE family of processors has a distinguished history, including multiple Gordon Bell prizes for peak performance awarded at the annual ACM/IEEE Supercomputing conferences. GRAPE systems do not appear on the Top500 lists because they do not run the LINPACK benchmark. The WINE-2 chip has eight pipelines for the (inverse) discrete Fourier transformation used for the wave number space portion of the Coulomb force.

The MDM research team recorded an effective speed of 8.61TeraFLOPS in May 2001 and has since performed large-scale simulations of biomolecules with a million atoms. It first simulated two hydrophobic walls (surfaces not attracted to water molecules) separated by a narrow gap. As the gap grows narrower, the water molecules between the walls are evacuated, forming a vacuum bubble. Hydrophobic forces are thought to play important roles in the folding and structural change of protein molecules.

The MDM team has also studied the structural changes of protein molecules, including Scytalone Dehydratase and Importin-Beta, with and without a ligand molecule (a small molecule that binds to a protein and affects its shape and function). In these simulations, the proteins showed significant structural variability without a ligand while remaining stable with one. Obtaining this result would have been impossible experimentally; understanding this molecular behavior is possible only through simulation.

After completing MDM, RIKEN began developing the still-faster MD-GRAPE-3 chip as part of Protein 3000, a Japanese national project that aims to create a PetaFLOPS system for the study of proteins.

The MD-GRAPE-3 goal is a peak performance of 165GigaFLOPS per chip, based on the use of 20 pipelines. Each chip has sufficient memory to store the coordinates of 32,768 particles. With this special-purpose architecture, the chip will be able to perform the equivalent of 660 floating-point operations per cycle, far more than a conventional microprocessor.

The MDM team also plans to build a PetaFLOPS system by 2006, integrating 6,144 MD-GRAPE-3s and a cluster of 512 CPUs. A PetaFLOPS system will open a new era in computational chemistry, solving important protein-folding problems and accelerating drug design.
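
These figures are mutually consistent; a quick sketch of the arithmetic (the 250MHz clock is inferred from the stated numbers, not quoted in the text):

```python
# Sanity-checking the MD-GRAPE-3 system figures.
OPS_PER_CYCLE = 660        # per chip, from the text
PEAK_PER_CHIP = 165e9      # 165 GigaFLOPS per chip
CHIPS = 6_144              # planned number of MD-GRAPE-3 chips

clock_hz = PEAK_PER_CHIP / OPS_PER_CYCLE   # implies a 250MHz clock
system_peak = CHIPS * PEAK_PER_CHIP        # ~1.01e15: a PetaFLOPS

print(f"implied clock: {clock_hz / 1e6:.0f} MHz")
print(f"system peak:   {system_peak / 1e15:.2f} PetaFLOPS")
```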


Figures

UF1 Figure. Protein embedded in a cell wall. Top view of rhodopsin, a protein involved in the human perception of light, embedded in a lipid bilayer (the cell wall) with cholesterol surrounding it. M. Pitman and F. Suits, IBM Thomas J. Watson Research Center, Yorktown Heights, NY.
