acm-header
Sign In

Communications of the ACM

Contributed articles

The Rise and Fall of High Performance Fortran


concentric circles

Credit: missyuan.com

Parallelismthe execution of multiple tasks at the same timeis fundamental in computer design. Even very early computer systems employed low-level parallelism (such as overlapping input-output with computing). Later designs moved parallelism to higher levels, including vector processors in the 1970s and shared-memory and distributed-memory parallel supercomputers in the 1980s. Software support for these machines never fully reached the standard of commercial software development, due mainly to the specialized market for high-performance computing and the difficulty related to users' extreme focus on target-code performance. Here, we trace the history of High Performance Fortran (HPF), a high-level data-parallel language standardized by a group of industry and academic leaders, 19911993. It was initially well received by the community, generating a great deal of interest in the language and its implementation, but never succeeded in building a strong user base. We concentrate on the technical and social process of designing HPF, as well as the factors that contributed to its successes and its shortcomings.

Back to Top

Key Insights

ins01.gif

HPF was one of many efforts to develop parallel programming techniques active since the late 1980s. Here, we do not treat these efforts comprehensively or review all work prior to HPF; more on these topics can be found in the conference version of this article15 and in the History of Programming Languages conferences (http://www.acm.org/sigplan/hopl). Rather, we provide background on the parallel computers and programming systems available when HPF was developed, then turn to the process of defining HPF, the features of the language itself, and its successes and failures. We conclude with a discussion of lessons to be learned by developers and users of other languages.

Back to Top

Parallel Computers and Languages

The milieu in which HPF was created (see the figure here) guides much of our discussion, with rectangles representing classes of computers and attached circles representing the most common programming languages for each class. A line from a language L to an architecture A represents the compiler/runtime or transformation technology needed to implement L on A. For completeness, we include sequential computers and FORTRAN 77.

The data-parallel computation model is characterized by the property that sequences of operations or statements can be performed in parallel on each element of a collection of data. This model was in part motivated by the work on the NESL language.4 Two common implementations were vector processors (featuring highly pipelined CPUs operating on vector registers) and single instruction multiple data (SIMD) computers (featuring thousands of simple processors operating under a single control unit). Due to this fine-grain control, the most natural languages for these machines provided array arithmetic and array functions mirroring the vector and matrix operations in many scientific codes.

Following years of development, vectorizing compilers were able to identify and exploit this level of parallelism in FORTRAN 77 programs. The key enabling techniques were dependence analysis, which identified conflicting operations on the same data that would prevent parallel execution, and compiler transformations, which reorganized operations in loops to enhance the opportunities for vector operation. The result was that many users of vector machines continued programming in a sequential language.

However, due to the need to identify and exploit much higher levels of parallelism, purely automatic parallelization was not as successful on SIMD machines. Programmers on these architectures wrote in explicitly data-parallel languages (such as NESL and Connection Machine Fortran, or CMF22). Later, Fortran 901 represented an important step toward data-parallel languages by standardizing array operations, array assignments, and intrinsic data-parallel functions (such as reductions like SUM). HPF was later based on this programming model.

Multiple instruction multiple data (MIMD) computers allowed more general and larger-grain parallel execution by providing each processor its own instruction stream and the ability to operate asynchronously to all other processors.

MIMD computers could be classified as either shared-memory or distributed-memory architectures. In shared-memory MIMD machines all processors were connected to a single shared main memory, making it simple to access common data and communicate results between processors. Languages for these machines featured a global view of data and multiple ways to create parallel activities (such as parallel loops in which iterations could execute concurrently) and synchronization operations for enforcing order between activities on different processors. The Parallel Computing Forum promulgated a standard set of FORTRAN extensions for this programming model18 to allow easy programming while enabling access to the performance advantages of shared memory over other machines.

However, two sets of problems arose with these architectures: First, the inability to hide latency with conventional processors at large scale and the hardware overhead of implementing cache coherence limiting their scalability; and second, the possibility of data races, so if two parallel threads access the same memory location, with at least one of them writing to that location, then a lack of proper synchronization could lead to different results for different parallel schedules.

These problems led to development of distributed-memory MIMD machines in which processors were interconnected via networks, with each processor having its own local memory. This organization allowed the machines to scale to around 1,000 processors by the early 1990s. The price of this architectural simplicity was software complexity; when two processors would have to share data, they would have to exchange messages, an expensive operation. Reducing this cost relied on correctly placing the data to minimize the required communication, and placing the message-passing calls in the most appropriate program locations. By the time the HPF effort had begun in the early 1990s, message-passing libraries (such as PVM and PARMACS) were already being used for programming MIMD systems in connection with standard sequential languages. A standardization activity for message passing began soon thereafter, patterned after and in close connection with the HPF effort.11 The resulting Message Passing Interface (MPI)20 library provided efficient explicit control of locality and communication. Despite the greater complexity of this programming paradigm, it quickly became popular due to its ability to exploit the performance of distributed-memory computers.


The HPF goal was a common programming model that could execute efficiently on all classes of parallel machines.


The HPF goal was a common programming model that could execute efficiently on all classes of parallel machines. Developers had to answer whether the advantages of shared memory, even a single thread of control, can be simulated on a distributed-memory machine; also how parallelism can be made to scale to hundreds or thousands of processors. Such issues were addressed through data-parallel languages in which the large data structures of applications are part of a global name space that can be laid out across the memories of a distributed-memory machine, or a "data distribution." The data elements mapped to the local memory of a processor in this way are said to be "owned" by the processor. Program execution is modeled by a single thread of control, but the components of distributed data structures can be operated on in parallel. The data distribution controls how work is to be allocated among the processors. Compilation techniques (such as those discussed later) allowed MIMD architectures to relax the lockstep semantics of the SIMD model to improve efficiency.

The HPF design was based largely on experience with the language designs and implementations of the early data-parallel languages. Here, we focus on the data-parallel approach embodied in HPF. To be sure, it was not universally embraced by either the compiler research community or the application community. Other programming paradigms, including functional and dataflow languages, also had substantial intellectual merit, passionate user communities, and/or both.

Back to Top

HPF Standardization Process

At the Supercomputing conference in 1991 in Albuquerque, NM, Ken Kennedy of Rice University and Geoffrey Fox of Indiana University met with a number of commercial vendors to discuss the possibility of standardizing the data-parallel versions of Fortran. These vendors included Thinking Machines (then producing SIMD machines), Intel and IBM (then producing distributed-memory MIMD machines), and Digital Equipment Corp. (then interested in producing a cross-platform Fortran system for both SIMD and MIMD machines).

Kennedy and Fox organized a birds-of-a-feather session with academic and industrial representatives who agreed to explore a more formal process through a group that would come to be known as the High Performance Fortran Forum (HPFF), with Kennedy serving as chair and Charles Koelbel as executive director. With support from the Center for Parallel Computation Research (CRPC) at Rice University, a meeting with nearly 100 participants organized in Houston in January 1992 concluded with a business session in which more than 20 companies committed to a process for drafting the new standard.

They agreed the process should produce a result in approximately one year. It turned out this tight schedule would affect the language, leading to adoption of the rule that HPF would include only features that had been demonstrated in at least one language and compiler, including research compilers. This limited some of the features that could be considered, particularly in the realm of advanced data distributions.

The 30 to 40 active HPFF participants then met for two days every six weeks or so, mostly in a hotel in Dallas. Besides Kennedy and Koelbel, the participants serving as editors of the standard document included Marina Chen, then at Yale, Bob Knighten, then at Intel, David Loveman, then at DEC, Rob Schreiber, then at NASA, Marc Snir, then at IBM, Guy Steele, then at Thinking Machines, Joel Williamson, then at Convex, and Mary Zosel, then at Lawrence Livermore National Laboratory. Attendees represented five government labs, 12 universities, and 18 companies (vendors and users); they also hailed from at least five countries. HPFF was a consensus-building process, not a single-organization project.

It dealt with numerous difficult technical and political issues, many due to the tension between the need for high-level language functionality supporting a broad range of applications and the need to generate efficient target code. They were compounded by limited experience with the compilation of data-parallel languages. The research compilers were mostly academic prototypes that had been applied to few applications of any size. On the other hand, the industrial-strength language CMF was relatively new and lacked some advanced features of the research languages. Many decisions thus had to be made without a full understanding of their effect on compiler complexity.

The result was a new language finalized in early 199312 and presented later that year at the Supercomputing conference in Portland, OR. An additional set of HPFF meetings was held in 1994, aiming to address corrections and clarifications of the existing standard and consider new features for the support of additional applications. Ken Kennedy was again chair, while Mary Zosel was now executive director. The attendees were mostly the same as in the 19921993 meetings, with notable additions Ian Foster of Argonne National Laboratory and Joel Saltz, then at the University of Maryland, each leading new subgroups. The corrections were included in the HPF 1.1 standard, a slight revision of the 1993 document presented at the 1994 Supercomputing conference in Washington, D.C. Some new features and clarifications discussed but not included in HPF 1.1 were collected in the HPF Journal of Development and later served as a starting point for the HPF 2.0 standardization effort, 19951996.

This effort would incorporate new features, most notably the ability to perform accumulations in the INDEPENDENT loop construct, advanced data distributions, and the ON clause, which provided control over computation mapping by identifying the processor to execute each iteration of a parallel loop. HPF 2.0 also defined "approved extensions" considered too complex to be required initially in all compilers. HPFF had intended that vendors would implement all "core language" features at first, then prioritize extensions based on customer demand. Not surprisingly, the result was confusion, as the feature sets offered by different vendors diverged.

Back to Top

HPF Language

The goals established for HPF were fairly straightforward: provide a high-level, portable programming model for scalable computer systems based (primarily) on data-parallel operations in a (conceptually) shared memory and produce code with performance comparable to the best hand-coded native language code on a given machine. To achieve them, HPF 1.0 defined a language with several novel characteristics. For simplicity, we do not discuss the enhancements in HPF 1.1 or HPF 2.0, though they were in much the same vein.

First, the language was based on Fortran 90,1 with extensions defined as a set of directives in the form of structured comments. These directives could be interpreted by HPF compilers as advice concerning how to produce a parallel program. On a scalar machine, an HPF program could be executed without change simply by ignoring the directives, assuming the machine had sufficient memory. This device permitted the HPF parallelism extensions to be separated cleanly from the underlying Fortran 90 program; the same program would work on both sequential and parallel systems. Sequential-to-parallel portability is a huge advantage in debugging a code.

The principal additions to the language were a set of distribution directives. Sets of arrays could be aligned with one another, then distributed across the processors through built-in distributions. These directives could be used to assign the individual rows or columns of an array to processors in contiguous blocks or in round-robin fashion. This feature is illustrated by showing the application of directives to two simple Fortran arrays

ins02.gif

Now suppose we want to split the first dimension (and with it, the computation over that dimension) across the processors of a parallel machine. Moreover, suppose because corresponding elements of A and B are often accessed together, they should always have the same distribution. Both effects can be accomplished through the directives

ins03.gif

HPF also provides the CYCLIC distribution, in which elements are assigned to processors in round-robin fashion, and CYCLIC(K), in which blocks of K elements are assigned round-robin to processors. Generally speaking, BLOCK is the preferred distribution for computations with nearest-neighbor elementwise communication, whereas the CYCLIC variants allow finer load balancing of some computations.

Data distribution of subroutine arguments was particularly complex. Formal subroutine arguments could be associated with ALIGN and DISTRIBUTE directives. If they fully specified a data distribution, then the corresponding actual arguments would be redistributed when the call was made and redistributed back to their original distribution on return. If the caller and callee distributions matched, it was expected that the compiler or runtime system would forego the copying needed in the redistribution. HPF also defined a system of "inherited" distributions in which the distribution of the formal arguments would be identical to the actual arguments. This declaration required an explicit subroutine interface (such as a Fortran 90 INTERFACE block). In this case, copying could be avoided, but code generation for the subroutine would have to handle all possible incoming distributions. The complexity was so great that to our knowledge no compiler fully implemented it.

In addition to distribution directives, HPF provides directives to assist identification of parallelism, illustrated with a simple smoothing operation on the arrays described earlier

ins04.gif

Using Fortran 90 array notation produces this code

ins05.gif

HPF also provided a FORALL statement, taken from CMF, as an alternative means of expressing array assignment. The inner DO loop in our relaxation example could be written as

ins06.gif

The explicit indexing allowed FORALL to conveniently express a wider range of array shapes and computations than the standard array assignment.

HPF included the ability to specify that the iterations of a loop should be executed in parallel using the INDEPENDENT directive. While in this inner loop the compiler could easily derive this property, and subscripted subscripts would in general require runtime analysis without the explicit assertion provided by the INDEPENDENT directive. The programmer often has application-specific knowledge that would allow such loops to be executed in parallel, like this

ins07.gif

HPF was also one of the first language specifications to include an associated library, the HPF Library, as a part of its definition, adding power to the language by providing parallel operations on global arrays (such as sum reduction, gather/scatter, and partial prefix operations).

Finally, HPF included features designed to improve compatibility and facilitate interoperation with other programming languages and models. In particular, the EXTRINSIC interface made it possible to invoke subprograms written in other languages (such as scalar Fortran and C). Of particular importance was the ability to call subroutines written in MPI in a way that made it possible to recode HPF subprograms for more efficiency.

Back to Top

HPF Compilation Technology

The concepts underlying the HPF language design and related compilation technology were pioneered by the early data-parallel languages preceding HPF, particularly CMF,22 Fortran D,13 and Vienna Fortran.6 The key approach common to them was adoption of a high-level notation for specifying data distribution, thus relegating the task of explicit generation of communication to the compiler and runtime system. Fortran D and Vienna Fortran initially adopted the owner-computes rule for compilation to distributed-memory MIMD machines. Accordingly, the compiler generates code for each statement in such a way that all computations are performed on the processors owning the computation output. Later, this strategy was modified to perform computations on any processors that would minimize communication. The rule is complemented by a general execution model in which each processor scheduled to perform a computation performs the following three steps: generate messages to send data it owns to all other processors that need them; await arrival of messages from other processors containing required nonlocal data; and complete the computation using its local data and non-local data communicated from other processors. The point-to-point messages generated through this paradigm are all the synchronization required for a distributed-memory MIMD implementation. This idea was so successful that, when Thinking Machines introduced its MIMD CM-5 computer in 1993, it retained the data-parallel CM Fortran language. Similar techniques could generate the synchronization needed for shared-memory MIMD computers.

Here, we outline the major phases in an HPF compiler that performs a source-to-source transformation:3,14

Following the compiler front end, the data-distribution phase analyzes HPF mapping directives to determine ownership of all data objects. Reaching distribution analysis serves to associate all occurrences of arrays with distribution information, including dummy arrays. If ownership cannot be resolved statically, code is generated to determine this information at runtime.

The work-distribution phase generates an explicitly parallel single-program-multiple-data (SPMD) program by determining the distribution of work and inserting communication. In it, the owner-computes paradigm is enforced, and for potential nonlocal accesses, communication primitives are inserted that transfer nonlocal data into private variables.

A key goal of the optimization phase is reduction of communication overhead via a range of techniques, including execution of communication in parallel to computation, elimination of redundant communication, and exploitation of collective communication primitives whenever possible. Overlap analysis detects simple communication patterns (such as stencils), using this information to improve local data management, as well as the organization of communication. Gupta et al.10 presented a general framework for optimizing communication in data-parallel programs. In the final code-generation phase, the optimized parallel program is transformed into a Fortran program with explicit message passing.

The inspector-executor paradigm17 is an important method for runtime optimization of parallel loops not amenable to static analysis due to irregular array accesses (such as through subscripted subscripts).

Back to Top

Experience with the Language

The initial response to HPF can be characterized as cautious enthusiasm. A large part of the high-performance user community was hopeful that the high-level abstractions provided by the language would make parallel programs portable and efficient without requiring explicit control of message passing. On the other hand, vendors hoped HPF would expand the market for scalable parallel computing. Several major vendors, including DEC, IBM, and Thinking Machines, initiated independent compiler efforts; others offered OEM versions of compilers produced by independent software companies (such as Applied Parallel Research and the Portland Group, Inc.). At its peak, 17 vendors offered HPF products and more than 35 major applications were written in HPF, at least one with more than 100,000 lines of code.

Much HPF experience was reported at meetings of the HPF Users Group in Santa Fe, NM (1997), Porto, Portugal (1998), and Tokyo (2000).16 The Tokyo meeting was notable for demonstrating the strong interest in HPF in Japan, later reemphasized by the Earth Simulator19 featuring a high-functionality HPF implementation supporting the HPF/JA extensions. The IMPACT-3D fluid simulation for fusion science on the Earth Simulator achieved 40% of its peak speed and was awarded a Gordon Bell Prize at the Supercomputing conference in 2002 in Baltimore.


At its peak, 17 vendors offered HPF products and more than 35 major applications were written in HPF, at least one with more than 100,000 lines of code.


As the language was applied to a more diverse set of application programs it became clear that in many cases its expressive power allowed the formulation of scientific codes in a clearer, shorter, less error-prone way than was possible based on explicit message passing. HPF did best on simple, regular problems (such as dense linear algebra and partial differential equations on regular meshes). However, for some data-parallel applications it was difficult to achieve the all-important goal of high target-code performance. As a result, frustrated users, particularly in the U.S., switched to MPI. This migration significantly reduced demand for HPF, leading compiler vendors to reduce, even abandon, their development efforts. The end result was HPF never achieved a sufficient level of acceptance among leading-edge users of high-performance parallel computing systems.

Here, we explore the main reasons for this development:

Missing language features. To achieve high performance on a variety of applications and algorithms, a parallel programming model must support a range of different kinds of data distributions. The original HPF specification included three main classes of such distributionsBLOCK, CYCLIC, and CYCLIC(K)motivated largely by the requirements of regular algorithms operating on dense matrices and well adapted to the needs of linear algebra. However, many relevant applications need more general distributions that deal efficiently with dynamic and irregular data structures. Examples include multiblock and multigrid codes, finite element codes (such as crash simulations operating on dynamically changing unstructured grids), spectral codes (such as weather forecasting), and distributed sparse matrix codes. Algorithmic strategies in these codes were difficult or even impossible to express within HPF without sacrificing too much performance. Though this problem was addressed to a certain degree in HPF 2.0 through basic support for irregular data distributions, the damage was done.

Another language issue with HPF was limited support for task parallelism. Users wanted more powerful strategies beyond the parallel loop feature offered in HPF. This was corrected in HPF 2.0 but too late for a revival of HPF. Including features like the OpenMP set of directives could have helped.

Immature compiler technology. HPF was defined on top of Fortran 90, the first major upgrade to the Fortran standard since 1977. Among the new features of Fortran 90 were explicit interfaces requiring array descriptors, recursion, dynamic storage allocation, and pointer-based data structures. Building a Fortran 90 compiler meant a substantial implementation effort compared with FORTRAN 77a huge obstacle on the way to HPF.

Moreover, the features in HPF, including the HPF library, demanded new compilation strategies that in 1993 at the time of the release of HPF 1.0 had been implemented only in research compilers and the CM Fortran compiler. Proper compilation of HPF requires extensive global analysis of distributions, partitioning of computation, generation of communication, and optimizations (such as overlapping communication and computation).

Finally, efficient implementation of HPF programs required special attention to locality on individual processors. Since many of the processors used in distributed-memory systems had complex cache hierarchies, advanced transformation strategies (such as loop tiling) were becoming essential to generate efficient code. At the time of the release of HPF 1.0, these techniques were beginning to be understood by compiler developers, but most commercial compilers had not yet incorporated them. As a result the first compilers were immature, providing disappointing performance on many codes.

Barriers to portable performance. A key goal of HPF was enabling a single version of a parallel program to achieve a significant fraction of the possible performance on a variety of architectures but was difficult to achieve for two main reasons:

Different optimizations. Different vendors focused on different optimizations in their implementations, causing a single HPF application to deliver dramatically different performance on machines from different vendors. In turn, programmers were forced to recode their applications repeatedly to take advantage of the strengths (and avoid the weaknesses) of each vendor's implementation, thwarting the original goal of portability.


The end result was HPF never achieved a sufficient level of acceptance among leading-edge users of high-performance parallel systems.


No reference implementation. Second, the HPF Library could have been used to address some of the usability and performance problems described earlier, but there was no open-source reference implementation for the library, so each compiler project had to implement its own version. Due to the number and complexity of library components, this was a significant burden. The end result was that the implementations were inconsistent and often exhibited poor performance; users were again forced to code differently for different target machines.

Difficulty of performance tuning. Every HPF compiler we know of translated the HPF source to Fortran plus MPI. In this process, a large number of transformations were carried out, making the relationship between the original source program and the corresponding target program less than obvious to the programmer, making it very difficult to identify and correct performance problems. Though research groups were eventually able to apply MPI-based performance tools to identify bottlenecks in HPF programs, the tuning problem was still not addressed. That is, the user might well understand what was causing the performance problem but have no idea how to change the HPF source to address it.

Back to Top

Significant Influence

HPF also had significant influence on development of high-level parallel languages. In 2007, the CiteSeer database listed 827 citations for the language definition,12 making it the 21st most-cited document in computer science at the time. In addition, more than 1,500 publications in CiteSeer refer to "High Performance Fortran," many including approaches to implementing or improving the language, reflecting a great deal of intellectual activity within the academic community.

Fortran and its variants. Fortran 95. While the meetings leading to HPF 1.1 were under way, the X3J3 committee of the American National Standards Institute (ANSI, http://ansi.org/) developed the Fortran 95 standard. When formally adopted in 1996, it included the HPF FORALL and PURE features nearly verbatim. The HPF contributions to Fortran have been retained in all Fortran standards since then.

HPF/JA. In 1999, the Japan Association for High Performance Fortran, a consortium of Japanese companies, including Fujitsu, Hitachi, and NEC, released HPF/JA,19 with features found in previous programming languages on parallel-vector machines from Hitachi and NEC. An important source contributing to HPF/JA was the HPF+ language2 implemented in a European project led by the Vienna group, with NEC as a project partner. HPF+, based on an analysis of advanced industrial codes, provided a REUSE clause for independent loops asserting reusability of the communication schedule computed during the first iteration of the loop. In the same context, its HALO construct (renamed the REFLECT directive in HPF/JA) allowed the functional specification of non-local data accesses in processors and programmer control of the copying of such data to region boundaries. The LOCAL directive could be used to specify that communication was not needed for data access in some situations, a fact a compiler might be unable to recognize. These and other features allowed for better control over locality in HPF/JA programs. HPF/JA was later implemented on the Earth Simulator.19

OpenMP. OpenMP8 was proposed at the end of 1997 as an extension of Fortran, C, and C++, providing a set of directives that support a shared-memory programming interface, extending earlier work by the Parallel Computing Forum as part of the X3H5 standardization committee18 and SGI directives for shared-memory programming.

As in the Parallel Computing Forum directives, OpenMP provides a means to generate threads for the processors of a shared-memory machine and control access to shared data in an efficient manner. However, OpenMP does not provide features to control locality like HPF's distribution directives. Attempts were made to address this shortcoming by integrating OpenMP with language features that allow locality-aware programming. However, the current OpenMP standard does not reflect any of them.

HPCS languages. Key HPF ideas are still finding their way into newer programming languages, particularly in DARPA's High Productivity Computing Systems (HPCS) program, which aims to ease the burden of programming leading-edge systems based on innovative hardware and software architectures. From HPCS came three new language proposals: Chapel,5 Fortress,21 and X10.7 All are object-oriented languages supporting a range of features for programmability, parallelism, and safety, along with a global name space, explicit multithreading, and explicit mechanisms for dealing with data parallelism and locality. Each execution of a program is bound to a set of virtual locality units mapped by the operating system to physical entities (such as computational nodes). This gives the programmer a way to distribute data collections across locality units, align different collections of data, and establish affinity between computational threads and the data on which they operate. This approach represents a generalization of key elements in HPF, while the data-parallel features in the languages address many of the shortcomings described earlier. In particular, they offer a wider range of constructs for expressing parallelism and synchronization and do not rely nearly as much on advanced compiler optimizations.

The evolution of parallel loop constructs also illustrates the transition from HPF to HPCS languages. For example, the HPF INDEPENDENT construct asserts that the loop does not contain loop-carried dependences, thus excluding data races and allowing correct parallel execution. Chapel distinguishes between a sequential for loop and a parallel forall loop that iterates over the elements of an index domain without restriction on loop-carried dependences. Thus, programmers are responsible for avoiding dependences that lead to data races. The Fortress for-loop is parallel by default, and if a loop iterates over a distributed dimension of an array the iterations are grouped onto processors according to the distributions. A special "sequential" distribution can be used to serialize a for-loop. X10 distinguishes two kinds of parallel loops: the foreach loop, which is restricted to a single locality unit, and the ateach loop, which allows iteration over multiple locality units. Again, the programmer must use the construct correctly to avoid data races.

Parallel scripting languages. Though the Fortran and C communities were willing to tolerate the difficulty of writing MPI code for scalable parallel machines, it seems unlikely that the large group of programmers of high-level scripting languages (such as Matlab, Python, and R) would be willing to do the same. Simplicity is part of the reason for the popularity of these languages.

Nevertheless, there is substantial interest in being able to write parallel code in the languages. As a result, a number of research and commercial projects have explored strategies for parallel programming, particularly in Matlab. Most replace the standard Matlab array representation with a global distributed array and provide replacements for all standard operators performing distributed operations on these arrays. Though it is in the spirit of HPF, the overhead of producing operation and communication libraries by hand limits the number of different distributions such systems are able to support. Many current implementations are therefore restricted to a variant of the general block distributions in HPF 2.0.

The goal of these efforts is to provide the scripting-language community a simple way to obtain scalable parallelism with minimal change to their programs. If they succeed, they will not only vindicate the HPF vision but dramatically increase the community of application developers for scalable parallel machines.

Back to Top

Lessons Learned

HPF aimed to introduce a high-level programming notation to enhance portability and programmer productivity while offering target code performance meeting the expectations of a highly sophisticated user community. HPF did succeed in the first of these goals but failed to achieve performance competitive with MPI-based programs for many important applications. This shortfall became apparent, particularly after MPICH (http://www.mcs.anl.gov/research/projects/mpich2/), a portable, efficient reference implementation of MPI, became available.

The dominant technical reasons for failing to achieve performance competitive with MPI-based programs involved language design, compiler technology, and tool support. First, as discussed earlier, the data distributions provided by HPF could not adequately support large classes of important applications. No particular set of built-in distributions can satisfy all these requirements; rather, what is needed is a general mechanism for generating programmer-defined distributions via an abstraction that can be seen in analogy to the control abstraction created by function definitions. Such a capability could be used to extend a library of standard distributions, with distribution libraries that exploit the specific data structures and access patterns reflecting properties of an application domain or architecture. An approach along these lines was analyzed by Diaconescu and Zima9 in the context of Chapel language development.

Second, compiler technology for generating high-performance target codes was not mature enough at the beginning of the HPF effort. This relates to the absence of industrial-strength implementations of Fortran 90 at that time but also to not understanding how to generate efficient target code for many HPF language features. Despite considerable research addressing these features, there was also a significant lack of practical experience. One important aspect of it was the HPF library, originally envisioned as a pillar of support for efficient HPF coding but never completely implemented. The result was the programmer could not rely on the HPF Library if efficiency was a concern, thus reducing the language's overall usability.

A third set of problems involved tools for performance tuning. Collaboration between compiler and tool developers makes it possible to map performance information back to the HPF source. However, a programmer had only limited ways to exercise fine-grain control over the code generated once the source of performance bottlenecks was identified, other than using the EXTRINSIC interface to drop into MPI. The HPF/JA extensions ameliorated this problem by providing more control over locality. However, additional language features were needed to override compiler actions if necessary.

Finally, some general circumstances have also worked against HPF. First, the HPC market is fairly small; as a consequence, it has always been difficult to obtain funding for the development of industrial-strength compilers and runtime systems. Second, Fortran use seems to be shrinking, so HPF may have suffered due to its being bound to Fortran. And third, developers of large-scale applications make investments that are supposed to be useful for decades; unlike MPI and its broad user and vendor support, long-term HPF viability were indeed less assured from the very beginning.

Back to Top

Conclusion

HPF has succeeded over the past 20 years in the sense it had a significant influence on language and compiler technology for parallel architectures. It included ideas and technologies that have become part of the next generation of high-performance computing languages. These include a global view of data and control, high-level specification of data distributions, interoperability with other language models, and an extensive library of primitive operations for parallel computing. Many of the original implementation impediments were resolved as a result of advanced compiler and runtime technology developed since HPF inception. Nevertheless, for both technical and nontechnical reasons, the language failed to be adopted by a broad user community. Perhaps the time is right today for the high-performance community to again embrace new parallel-programming models, like the one supported by HPF.

Back to Top

Acknowledgments

We thank the referees for their thoughtful, detailed, constructive comments that made a significant contribution to improving the final version of this article.

Back to Top

References

1. American National Standards Institute X3J3/S8.115. Fortran 90, June 1990; http://www.ansi.org

2. Benkner, S., Lonsdale, G., and Zima, H.P. The HPF+ project: Supporting HPF for advanced industrial applications. In Proceedings EuroPar Parallel Processing, Vol. 1685 of Lecture Notes in Computer Science. Springer-Verlag, 1999.

3. Benkner, S. and Zima, H. Compiling High Performance Fortran for distributed-memory architectures. Parallel Computing 25 (1999), 17851825.

4. Blelloch, G.E. NESL: A Nested Data-Parallel Language, Version 3.1. Technical Report CMU-CS-95-170. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, Sept. 1995.

5. Chamberlain, B.L., Callahan, D., and Zima, H.P. Parallel programmability and the Chapel language. Special issue on high-productivity languages and models. International Journal of HPC Applications 21, 3 (2007), 291312.

6. Chapman, B., Mehrotra, P., and Zima, H. Programming in Vienna Fortran. Scientific Programming 1, 1 (Fall 1992), 3150.

7. Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., von Praun, C., and Sarkar, V. X10: An object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object Oriented Programming, Systems, Languages, and Applications. ACM Press, New York, 2005, 519538.

8. Dagum, L. and Menon, R. OpenMP: An industry-standard API for shared-memory programming. Computational Science and Engineering 5, 1 (1998), 4655.

9. Diaconescu, R.L. and Zima, H.P. An approach to data distributions in Chapel. Special issue on high-productivity programming languages and models. International Journal of High Performance Computing Applications 21, 3 (2007), 313335.

10. Gupta, M., Schonberg, E., and Srinivasan, H. A unified framework for optimizing communication in data-parallel programs. IEEE Transactions on Parallel and Distributed Systems 7, 7 (July 1996), 689704.

11. Hempel, R. and Walker, D.W. The emergence of the MPI message passing standard for parallel computing. Computer Standards & Interfaces 21, 1 (1999), 5162.

12. High Performance Fortran Forum. High Performance Fortran language specification. Scientific Programming 2, 12 (1993), 1170; see also CRPC-TR92225.

13. Hiranandani, S., Kennedy, K., and Tseng, C.-W. Compiling Fortran D for MIMD distributed-memory machines. Commun. ACM 35, 8 (Aug. 1992), 6680.

14. Joisha, P.G. and Banerjee, P. The efficient computation of ownership sets in HPF. IEEE Transactions on Parallel and Distributed Systems 12, 8 (Aug. 2001), 769788.

15. Kennedy, K., Koelbel, C., and Zima, H. The rise and fall of High Performance Fortran: An historical object lesson. In Proceedings of the Third ACM SIGPLAN Conference on the History of Programming Languages (San Diego). ACM Press, New York, 2007.

16. Kennedy, K. and Seo, Y., Eds. Concurrency and computation: Practice and experience. Special issue: High Performance Fortran 14, 89 (2002).

17. Koelbel, C., Mehrotra, P., Saltz, J., and Berryman, H. Parallel loop on distributed machines. In Proceedings of the Fifth Distributed Memory Computing Conference (Apr. 1990), 10971104.

18. Leasure, B. Parallel Processing Model for High-Level Programming Languages. Draft proposed American national standard for information processing systems, Apr. 1994.

19. Shimasaki, M. and Zima, H.P., Eds. The Earth Simulator, special edition. Parallel Computing 30, 12 (Dec. 2004).

20. Snir, M., Otto, S., Huss-Lederman, S., Walker, D., and Dongarra, J. MPI: The Complete Reference: The MPI Core, Second Edition. MIT Press, Cambridge, MA, 1998.

21. Sun Microsystems, Inc. The Fortress Language Specification, Version 0.707, Burlington, MA, July 2005.

22. Thinking Machines Corp. CM Fortran Reference Manual, Version 1.0, Cambridge, MA, Feb. 1991.

Back to Top

Authors

Ken Kennedy was the John and Ann Doerr Professor in the Department of Computer Science at Rice University, Houston, TX.

Charles Koelbel (chk@cs.rice.edu) is a research scientist in the Computer Science Department of Rice University, Houston, TX.

Hans Zima (zima@jpl.nasa.gov) is a principal scientist at the Jet Propulsion Laboratory, California Institute of Technology, and a professor emeritus at the University of Vienna, Austria.

Back to Top

Figures

UF1Figure. Parallel computers, languages, and the search for portable programming (circa 1990).

Back to top


©2011 ACM  0001-0782/11/1100  $10.00

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2011 ACM, Inc.


 

No entries found