acm-header
Sign In

Communications of the ACM

Communications of the ACM

Enabling Java For High-Performance Computing


View as: Print Mobile App ACM Digital Library Full Text (PDF) Share: Send by email Share on reddit Share on StumbleUpon Share on Hacker News Share on Tweeter Share on Facebook

Java has become increasingly popular as a general-purpose programming language. Current Java implementations focus mainly on the portability and interoperability required for Internet-centric client/server computing. Key to Java's success is its intermediate "bytecode" representation, which can be exchanged and executed by Java Virtual Machines (JVMs) on almost any computing platform. However, along with that popularity has come an increasing need for an efficient execution mode. For sequential execution, just-in-time compilers improve application performance [4]. But high-performance computing applications typically require multiple-processor systems, so efficient interprocessor communication is also needed, in addition to efficient sequential execution.

As an OO language, Java uses method invocation as its main communication concept; for example, inside a single JVM, concurrent threads of control can communicate through synchronized method invocations. On a multiprocessor system with shared memory (SMP), this approach allows for some limited form of parallelism by mapping threads to different physical processors. For distributed-memory systems, Java offers the concept of a remote method invocation (RMI). With RMI, the method invocation, along with its parameters and results, is transferred across a network to and from the serving object on a remote JVM (see the sidebar "Remote Method Invocation").

With these built-in concepts for concurrency and distributed-memory communication, Java provides a unique opportunity for a widely accepted general-purpose language with a large base of existing code and programmers to also suit the needs of parallel (high-performance) computing. Unfortunately, Java is not yet widely perceived by programmers as such, due to the inefficiency of early implementations based on bytecode interpretation. Here, we provide evidence of Java's usefulness for parallel computing by describing efficient implementation techniques, showing the combination of powerful compilers and efficient runtime systems yields Java execution environments able to exploit the computational power of distributed-memory parallel computers scaling to system sizes unreachable in pure shared-memory approaches.

Back to Top

Suited for Parallel Computing

A major Java advantage is that it provides communication mechanisms inside the language environment, whereas other languages, such as Fortran and C++, require external mechanisms like message passing. Recently, however, although programming interfaces between Java and the Message Passing Interface (MPI) standard were developed [5], the MPI message-passing style of communication is difficult to integrate cleanly with Java's OO model, especially as MPI assumes a single-program, multiple-data (SPMD) programming model quite different from Java's multithreading model. One of our purposes here is to show that, with efficient compilers and runtime systems, pure Java is a platform well suited for parallel computing. We explore two approaches for achieving this goal.

The first allows truly parallel execution of multithreaded Java programs on distributed memory platforms. This idea is implemented in the Hyperion system,1 which compiles multithreaded bytecode for execution on a "distributed virtual machine," allowing threads on different nodes to share objectsthe purest approach to Java-based parallel computing on distributed-memory systems. Hyperion completely hides the distributed-memory environment from the application programmer, allowing access of any object from any machine.

The second approach gives programmers the explicit notion of shared objects, so they indicate which objects will be shared among multiple threads; communication between threads is reduced to method invocations on the shared objects. This approach is implemented in the Manta system.2 Manta statically compiles Java source code to executable programs; its runtime system provides highly efficient RMI, as well as a similar mechanism called replicated method invocation (RepMI), allowing for more efficient use of object locality.

The basic implementation techniques we outline here for both Hyperion and Manta ultimately yield efficient parallel execution of Java programs on distributed-memory platforms. We also provide performance data for the respective communication operations, discuss the suitability of the approaches to parallel programming, and compare their promise and limitations for Java-centric parallel computing.

Hyperion: Transparent distributed multithreading. Hyperion allows a Java programmer to view a cluster of processors as executing a single JVM [1]. In Java, concurrency is exposed to the programmer through threads sharing a common address space. The standard library provides facilities to perform various functions, including start a thread and switch control between threads; the Java memory model specifies how threads might interact with the common memory (see the sidebar "Java Multithreading"). Thus, it is possible to map a multithreaded Java program directly onto a cluster. The threads are spread across the processing nodes to deliver actual concurrent execution and load balancing. The Java memory model is implemented by a distributed shared memory (DSM) substrate, so the language's original semantic model remains unchanged.

For efficient sequential execution, Hyperion compiles Java bytecode into optimized native code through a two-step process: Java bytecode is first translated to C, and a C compiler is then used to generate native code for the processors of the cluster. Using a C compiler for generating native code delivers the benefits of platform-specific compiler optimizations while keeping the system itself platform-independent.

Portability between any type of networked system has always been a major objective in the Hyperion design. Therefore, the runtime system is built on top of a portable environment called DSM-PM2, extending the multithreaded library PM2 [9] with a DSM facility. DSM-PM2 provides lightweight multithreading, high-performance communication and page-based DSM. It is portable across a spectrum of high-performance networks, including the Scalable Coherent Interface (SCI), Myrinet, and Gigabit-Ethernet, and can be used with most common communication interfaces, including the standard TCP protocol, MPI, and the Virtual Interface Architecture (VIA).

The central aspect of Hyperion's design is the management of distributed-object memory (see Figure 1). Hyperion's programming model has to provide the illusion of a uniformly accessible, shared-object memory independent of physical object locations. According to the specification of the Java memory model, each thread in Hyperion is conceptually equipped with a local cache, interacting with a common main memory. Caching greatly improves performance if the application exhibits temporal locality, accessing a cached object multiple times before the cache is invalidated.

Hyperion is the only system providing a fully transparent cluster implementation of Java. Similar systems include Java/DSM [12], cJVM [2], and JESSICA [6]. However, these other systems are based on interpreted bytecode, rather than on native compilation, resulting in less efficient program execution. While these systems differ in their approaches to implementing the Java Memory Model, as well as in the type of their target applications, together they collectively demonstrate the potential of using Java to efficiently utilize clusters.

The net result of Hyperion's implementation techniques is efficient execution of unmodified Java programs on a range of distributed clusters. This flexibility is a major incentive for Java users in search of high performance; Table 1 lists the timings of local, cached, and remote elementary DSM operations, measured with Hyperion on a cluster of 200MHz Pentium Pros communicating over Myrinet.

The first two lines in the table are time in microseconds to access local, cached, and remote objects on this platform. Remote access times include the costs of transferring the page containing the object; page size is 4,096B. A remote read operation includes: detecting the absence of the object and transmitting the request (114 microseconds, 30%); transferring the page across the network (134 microseconds, 37%); and additional Hyperion-level processing (122 microseconds, 33%). Writing to a cached copy of an object involves recording the modifications for later consistency updates, adding 0.48 microseconds to reading. Writing to a remote object is more expensive in terms of runtime than reading it because a remote write must transmit the modification back to the home location.

The last line in the table is the time to perform remote and local synchronization, or the time to enter and exit a Java monitor. In the remote case, the lock being accessed by the monitor is on a different node.

Manta: Efficiently shared objects. Instead of providing a shared-memory programming model, as in the philosophy behind Hyperion, Manta requires the programmer to store shared data in remote objects that can be accessed using Java's RMI. Manta's programming model is the same as the one in standard RMI, along with a simple extension allowing the programmer to improve locality by indicating which objects should be replicated. Manta implements this programming model in a highly efficient way, using a completely new implementation of Java and RMI [8]. Manta uses a native offline Java compiler that statically compiles Java source code into executable binary code. Using a static compiler allows aggressive, time-consuming optimizations. Manta's fast RMI implementation consists of three components:

  • Lightweight RMI protocol. This protocol is implemented completely in C, avoiding the layering overhead of other RMI systems that invoke low-level C routines from Java code via the slow Java Native Interface (JNI). It minimizes the overhead of thread switching, buffer management, data format conversions (byte swapping), and copying.
  • Object serialization. The Manta compiler generates specialized serialization routines for serializable argument classes, avoiding the overhead for the runtime type inspection typical of most other Java systems.
  • Efficient communication software. Manta is implemented on top of the Panda communication library [3, 8], which provides message passing, remote procedure call (RPC), and broadcasting. On Myrinet, Panda uses a highly optimized low-level communication substrate; on Ethernet, Panda uses the standard UDP protocol.

The RMI implementation described here is compatible with the Java language specification but uses a different communication protocol. Manta uses additional mechanisms to interoperate with other JVMs [8] (see Figure 2). A parallel Java program compiled with Manta's native compiler runs on a Pentium-based cluster. Application processes use Manta's fast RMI protocol to communicate with one another. They can also communicate with applications running on standard JVMs using the standard RMI protocol. They can even exchange bytecode with the applications, as required for polymorphic RMIs [11]. For the purpose of bytecode exchange in Java programs, the Manta compiler also generates bytecode, which can be sent to remote JVMs; the Manta runtime system contains a compiler to process incoming bytecode from a JVM. The result is that Manta provides efficient sequential code, fast communication, interoperability with JVMs, and polymorphic RMIs. Manta's RMI combines the efficiency of a C-like RPC and the flexibility of Java RMI.

The JavaParty project [10] implemented similar optimizations to Java RMI but without interoperability to standard JVMs. Because JavaParty's RMI is implemented in pure Java, it is also less efficient than Manta's RMI.

Even with all the optimizations performed by Manta, method invocations on shared objects are much slower than sequential method invocation, or invocations on normal Java objects not declared to be remote. Even within the same address space, accessing a remote object is costly in terms of runtime overhead. Manta addresses this problem through the concept of replicated method invocation (RepMI) [7], whereby shared objects are replicated across the processes of a parallel application. The RepMI advantage is that methods that don't modify a replicated object (read-only methods) can be performed on the local copy. Such methods are recognized by the Manta compiler and executed without communication, resulting in completion times close to sequential method invocation. Manta also provides a mechanism to replicate collections of objects, including trees and other graphs.

To obtain high performance, RepMI implements methods that modify a replicated object (write methods) using an update protocol with function shippingthe same approach used in the Orca shared-object language and system [3]. This protocol updates all copies of a replicated object by broadcasting the write operation and performing the operation on all replicas. The broadcast protocol is provided by the Panda library [3] using totally ordered broadcasting so all replicas are updated consistently.

Table 2 lists the timings of local, remote, and replicated method invocations, measured with Manta on a Myrinet cluster with 200MHz Pentium Pros. The remote write method costs 41 microseconds of runtime. Calling a remote read method requires additional serialization of the result data and costs a total of 42 microseconds to complete. A parameterless invocation of the underlying Panda RPC protocol takes 31 microseconds.

Example applications. We evaluated Hyperion and Manta with two small (in terms of code size) example applications measured on two clusters with identical processors (200MHz Pentium Pros) and networks (Myrinet). We compared application execution runtimes against a state-of-the-art commercial just-in-time compilerthe IBM JIT 1.3.0. The first application is the All-pairs Shortest Paths, or ASP, program, computing the shortest path between any pair of nodes in a graph using a parallel version of Floyd's algorithm, which computes shortest paths. ASP uses a distance matrix divided row-wise among the available processors. At the beginning of iteration k, all processors need the value of the kth row of the matrix.

For the shared-memory version of ASP used by Hyperion, a single thread is allocated on each processor. Each thread owns a contiguous block of rows in the graph's shared distance matrix. On each iteration in ASP, each thread fetches the necessary row, updates its own rows, then synchronizes to wait for all other threads to finish the iteration. Figure 3 (top) shows the program performs well on small clusters with limited numbers of nodes; the cluster available to Hyperion has only eight nodes. However, having all threads request the current row separately is likely to limit the ability of the cluster to use more nodes efficiently. This situation might best be addressed by extending Hyperion's programmer interface to include methods for collective communication among the threads of a thread group.

In the RMI version, each row of the distance matrix implements the interface java.rmi.Remote, making it accessible for threads on remote nodes. The processor owning the row for the next iteration stores it in its remotely accessible object. Because each machine has to fetch each row for itself, each row has to be sent across the network multiple times (as with Hyperion), causing high overhead on the machine that owns the row. The replicated ASP implementation uses replicated objects for the rows. Whenever a processor writes a row into its object, the new row is forwarded to all machines. Each processor can then read the row locally. Figure 3 (top) shows that the RMI version performs well on up to 16 nodes. On more nodes, the overhead for sending the rows becomes prohibitive. With 64 nodes, the RMI version completes after 38 seconds, while the RepMI variant needs only 18 seconds. This difference is due to the efficient broadcast of Manta's runtime system.

The other example application is called the Traveling Salesperson Problem (TSP), which computes the shortest path among all cities in a given set. We use a branch-and-bound algorithm, pruning large parts of the search space by ignoring partial routes already longer than the current best solution. The program is parallelized by dynamically distributing the search space over the different nodes. It tracks the best solution found so far. Each node needs an up-to-date copy of this solution to prevent it from doing unnecessary work, thus causing it to read the value frequently. In contrast, updates happen infrequently.

The Hyperion shared-memory version, as in ASP, uses a single thread per node. The object containing the hitherto best solution is protected by a monitor. The program scales well on small clusters due to Hyperion's lightweight implementation of its DSM primitives and the application's favorable ratio of local computation to remote data access (see Figure 3, bottom).

In an RMI version of TSP, the overhead involved in frequently reading a single, remote Minimum object, continuing the hitherto best solution, would result in poor performance. Instead, a manually optimized version has to be used in which the threads read the Minimum value from a local variable. When a thread finds a better Minimum value, it invokes an updating RMI on all peer threads, which, for this purpose, have to be remote objects. In contrast, the replicated version of TSP is straightforward and intuitive, whereby the global Minimum object implements the replication interface. All changes to this object are forwarded automatically. Each node can locally invoke the read method of the object only slightly slower than reading a variable directly. While being as simple as the Hyperion version, the replicated version on 64 nodes completes in 31 seconds, almost as quickly as the very complex, manually optimized RMI version, which needs 28 seconds.

Back to Top

Conclusion

With efficient implementations like the ones provided by Hyperion and Manta, Java represents an unprecedented opportunity for exploiting a widely accepted general-purpose language for high-performance computing. Moreover, Java provides a unique way of rapidly prototyping parallel applications. Starting with a single JVM, parallel applications can be developed based on multithreading. On a small scale (four to eight nodes), a JVM enables truly parallel thread execution on a multiprocessor machine with shared memory. For larger numbers of CPUs, Hyperion-like systems provide transparent execution of multithreaded programs on distributed systems.

However, allowing Hyperion programmers to view a cluster as a black box is a two-edged sword. They can abstract from the internal details of the cluster's individual nodes with private memories. But efficient parallel execution is provided only if each thread predominantly references data that is local or locally cached. If this local-data property does not hold, the communication costs of accessing remote data severely limit the performance improvement obtainable by spreading the threads across a cluster's multiple nodes. Multithreaded Java programs can then be converted into programs making explicit use of shared objects and replicated objects. This conversion requires the programmer to determine which objects will be shared or replicated and adapt the program to use RMI to access them. Given a high-performance implementation of RMI, as with Manta, such programs can obtain high efficiencies even on large-scale, distributed-memory machines.

Back to Top

References

1. Antoniu, G., Bougé, L., Hatcher, P., MacBeth, M., McGuigan, K., and Namyst, R. The Hyperion system: Compiling multithreaded Java bytecode for distributed execution. Parallel Comput. 27, 10 (Sept. 2001), 12791297.

2. Aridor, Y., Factor, M., Teperman, A., Eilam, T., and Schuster, A. Transparently obtaining scalability for Java applications on a cluster. J. Parall. Distrib. Comput. 60, 10 (Oct. 2000), 11591193.

3. Bal, H., Bhoedjang, R., Hofman, R., Jacobs, C., Langendoen, K., Rühl, T., and Kaashoek, M. Performance evaluation of the Orca shared object system. ACM Trans. Comput. Syst. 16, 1 (Feb. 1998), 140.

4. Burke, M., Choi, J.-D., Fink, S., Grove, D., Hind, M., Sarkar, V., Serrano, M., Sreedhar, V., Srinivasan, H., and Whaley, J. The Jalapeño dynamic optimizing compiler for Java. In Proceedings of the ACM 1999 Java Grande Conference (San Francisco, June). ACM Press, New York, 1999, 129141.

5. Carpenter, B., Getov, V., Judd, G., Skjellum, A., and Fox, G. MPJ: MPI-like message passing for Java. Concurrency: Pract. Exper. 12, 11 (Sept. 2000),10191038.

6. Ma, M., Wang, C.-L., and Lau, F. JESSICA: Java-enabled single-system-image computing architecture. J. Parall. Distrib. Comput. 60, 10 (Oct. 2000), 11941222.

7. Maassen, J., Kielmann, T., and Bal, H. Efficient replicated method invocation in Java. In Proceedings of the ACM 2000 Java Grande Conference (San Francisco, June). ACM Press, New York, 2000, 8896.

8. Maassen, J., van Nieuwpoort, R., Veldema, R., Bal, H., and Plaat, A. An efficient implementation of Java's remote method invocation. In Proceedings of the 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'99) (Atlanta, May). ACM Press, New York, 1999, 173182.

9. Namyst, R., and Méhaut, J.-F. PM2 Parallel multithreaded machine: A computing environment for distributed architectures. In Proceedings of Parallel Computing (ParCo'95) (Ghent, Belgium, Sept.). Elsevier Science Publishers, Amsterdam, The Netherlands, 1995, 279285.

10. Philippsen, M., Haumacher, B., and Nester, C. More efficient serialization and RMI for Java. Concurrency: Pract. Exper. 12, 7 (May 2000), 495518.

11. Waldo, J. Remote procedure calls and Java remote method invocation. IEEE Concurrency 6, 3 (JulySept. 1998), 57.

12. Yu, W. and Cox, A. Java/DSM: A platform for heterogeneous computing. Concurrency: Pract. Exper. 9, 11 (Nov. 1997), 12131224.

Back to Top

Authors

Thilo Kielmann (kielmann@cs.vu.nl) is an assistant professor in the Division of Mathematics and Computer Science of Vrije Universiteit, Amsterdam, The Netherlands.

Philip Hatcher (pjh@cs.unh.edu) is a professor of computer science in the Computer Science Department of the University of New Hampshire, Durham, NH.

Luc Bougé (Luc.Bouge@ens-lyon.fr) is a professor of computer science in the LIP Computer Science Laboratory of the Ecole Normale Superieure de Lyon, France.

Henri E. Bal (bal@cs.vu.nl) is a professor of computer science and physics in the Division of Mathematics and Computer Science of Vrije Universiteit, Amsterdam, The Netherlands.

Back to Top

Footnotes

The work on Manta is supported in part by a Universitair Stimulerings Funds grant from Vrije Universiteit, Amsterdam, The Netherlands.

The Hyperion-PM2 collaboration is supported by the U.S. National Science Foundation and French Institut National de Recherche en Informatique et en Automatique through the U.S.-France Cooperative Research program based at the University of New Hampshire and the ENS Lyon, France.

1Hyperion was designed and implemented at the University of New Hampshire by a group led by Phil Hatcher, in collaboration with Mark MacBeth and Keith McGuigan. The Hyperion-PM2 interface was built at ENS Lyon, France, by a group led by Luc Bougé, in collaboration with Gabriel Antoniu, who is also the principal author of PM2's DSM support; other members of the PM2 team included Raymond Namyst, Jean-François Méhaut, Olivier Aumage, and Vincent Danjean.

2Manta was designed and implemented at Vrije Universiteit, Amsterdam, The Netherlands, by a group led by Henri Bal, in cooperation with Thilo Kielmann, Jason Maassen, Rob van Nieuwpoort, Ronald Veldema, Rutger Hofman, Ceriel Jacobs, and Aske Plaat. The lower-level Panda and Myrinet communication software was developed by Raoul Bhoedjang, Tim Rühl, Rutger Hofman, Ceriel Jacobs, and Kees Verstoep.

Back to Top

Figures

F1Figure 1. Management of distributed-object memory in Hyperion.

F2Figure 2. Interoperability of Manta and Sun RMI.

F3Figure 3. Application exectution times with Hyperion, RMI, and RepMI; the cluster available to Hyperion has only eight nodes.

Back to Top

Tables

T1Table 1. Completion times (in microseconds) of elementary DSM operations on a Pentium Pro/Myrinet cluster.

T2Table 2. Completion times of read and write operations (in microsedconds) on a Pentium Pro/Myrinet cluster.

Back to Top

Back to Top


©2001 ACM  0002-0782/01/1000  $5.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2001 ACM, Inc.


 

No entries found