The NINJA Project

The authors' Numerically INtensive JAva programming environment shows that Java can be made to produce Fortran- and C-like performance for applications involving high-performance numerical computing.
  1. Introduction
  2. Sources of Java Performance Difficulties
  3. Java Performance Solutions
  4. Implementation and Results
  5. Conclusion
  6. References
  7. Authors
  8. Figures
  9. Sidebar: Array Package for Java
  10. Sidebar: Numerical Linear Algebra in Java

When the Java programming language was introduced by Sun Microsystems in 1995, there was a perception (properly founded at the time) that its many benefits came at a significant performance cost. The deficiencies were especially apparent in numerical computing: our own measurements in 1997 with second-generation Java Virtual Machines (JVMs) found performance differences of up to one hundredfold relative to C and Fortran. This initial experience with poor performance caused many developers of high-performance numerical applications to reject Java out of hand as a platform for their applications.

Despite the more recent progress in Java optimization, the performance of commercially available Java platforms is still not on par with state-of-the-art Fortran and C compilers. Programs using complex arithmetic exhibit particularly poor performance. Moreover, today’s Java platforms are incapable of automatically applying important optimizations to numerical code, including loop transformations and automatic parallelization [12]. Nevertheless, we find no technical barriers to high-performance computing in Java. To prove this thesis, we developed a prototype Java environment, called Numerically INtensive JAva, or NINJA, which demonstrates that Java can obtain Fortran-like performance on a variety of problems in scientific and technical computing. NINJA addresses such high-performance programming issues as dense and irregular matrix computations, calculations with complex numbers, automatic loop transformations, and automatic parallelization. The NINJA techniques are straightforward to implement and allow reuse of existing optimization components already deployed by software vendors for other languages [9], thus lowering the economic barriers to Java’s acceptance in numerically intensive applications.

The next challenge for numerically intensive computing in Java is convincing developers and managers in this domain that Java’s benefits can be obtained with performance comparable to highly tuned Fortran and C. Once they accept that Java performance is only an artifact of particular implementations of the language and that there are no technical barriers to achieving excellent numerical performance, NINJA-derived techniques will allow vendors and researchers to quickly deliver high-performance Java platforms to program developers.

Sources of Java Performance Difficulties

Among the many difficulties associated with optimizing numerical code in Java, we’ve identified three characteristics that are, in a way, unique to the language: the lack of truly rectangular multidimensional arrays; exception checks for null-pointer and out-of-bounds array accesses (combined with a precise exception model); and weak support for complex numbers and other arithmetic systems.

Arrays in Java. Unlike Fortran and C, Java has no direct support for truly rectangular multidimensional arrays; it allows only a simulation of them through “arrays of arrays.” Figure 1(a) shows an array of arrays used to simulate a rectangular 2D array. In this case, all rows have the same length, but arrays of arrays can be used to construct far more complicated structures, as in Figure 1(b); the only way to determine the minimum length of a row is to examine all rows. In contrast, determining the size of a true rectangular array, as in Figure 1(c), requires looking at only a few parameters. Moreover, the shape of an array of arrays can change during computation. While there are other possible solutions, the simplest by far is a data structure that makes the fixed rectangular shape explicit, as in the rectangular 2D arrays in Figure 1(c).
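
The difference is easy to see in code. The following fragment (our own illustration, not taken from the article’s figures) builds an array of arrays whose rows differ in length and even alias one another:

    // An array of arrays: each row is an independent object.
    double[][] A = new double[3][];
    A[0] = new double[4];
    A[1] = new double[2];   // rows need not have the same length
    A[2] = A[0];            // rows may even alias one another
    // The only way to find the minimum row length is to scan every row.
    int min = Integer.MAX_VALUE;
    for (int i = 0; i < A.length; i++)
        min = Math.min(min, A[i].length);

The assignment A[2] = A[0] creates exactly the intra-array aliasing discussed next.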

Arrays of arrays may also have complicated aliasing patterns, with both intra- and inter-array aliasing, as in Figure 1(b). Alias analysis, which is important for compiler optimizations, is extremely difficult for arrays of arrays but is much easier for true multidimensional arrays like those in Figure 1(c) (Z and T). There is no intra-array aliasing for true multidimensional arrays; inter-array aliasing can be determined through simple tests [12].


Java’s performance difficulties can be solved through a careful combination of language and compiler techniques.


The Java exception model. Java requires all array accesses to be checked to ensure they are not null and within bounds. Java’s exception model states that when the execution of a piece of code causes an exception, all effects of instructions prior to the exception must be “visible,” while effects of instructions after the exception should not be visible [6]. This requirement harms performance in two ways. First, checking the validity of array references contributes to runtime overhead. Second, code reordering in general—and loop iteration reordering in particular—is prohibited, thus preventing almost all optimizations for numerical codes.
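
A small example (our own, for illustration) shows how this inhibits a classic transformation such as loop interchange:

    // A Fortran compiler would interchange these loops whenever doing so
    // improves locality. In Java, if some row of B is shorter than the
    // rest, the access B[j][i] throws part-way through the iteration
    // space; interchanging the loops would change which elements of A
    // have been updated when the exception surfaces, violating the
    // precise exception model.
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            A[i][j] = B[j][i];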

Complex numbers in Java. From a numerical perspective, Java has direct support only for real numbers. Fortran and C++ both efficiently support other arithmetic systems, including complex numbers; this efficiency comes from lightweight data structures that can be allocated on the stack or in registers. Java, in contrast, requires that any nonprimitive data type be represented as a full-fledged object. Complex numbers are typically implemented as objects of a class Complex, so every time an arithmetic operation generates a new complex value, a new Complex object is allocated and the old value is invalidated and subject to garbage collection.

These three difficulties—arrays, exceptions, and complex numbers—at the core of Java’s performance deficiencies prevent the application of mature compiler optimization technology to Java and thus prevent Java from being truly competitive with more established languages, including Fortran and C.

Java Performance Solutions

Performance can be improved through a careful combination of language and compiler techniques. In our work, we developed new class libraries that enrich the language, together with compiler techniques that take advantage of these new constructs to perform automatic optimizations. Above all, we maintain full Java portability across all virtual machines.

The Array package and semantic expansion. To attack the absence of truly rectangular multidimensional arrays in Java, we defined an Array package with multidimensional arrays (denoted here as Arrays with a capital A) of various types and ranks. For example, doubleArray2D implements a 2D Array of double-precision floating-point numbers, whereas ComplexArray3D implements a 3D Array of complex numbers (see the sidebar “Array Package for Java”). Several access and manipulation operations are defined for the Array data types. The Arrays have an immutable rectangular and dense shape that simplifies testing for aliases and facilitates the optimization of runtime checks. The Array classes are written in fully compliant Java code and can be run on any JVM, ensuring the portability of programs written using the Array package.

Array elements are accessed via the get and set element operations. For example, A.get(i,j) retrieves the value of element (i,j) of a 2D Array A; similarly, A.set(i,j,x) sets the value of that element to the value of x. The runtime overhead of a method invocation for each element access would be unacceptable for high-performance computing. This problem is avoided through a compiler technique known as semantic expansion, in which the compiler looks for specific method calls and substitutes efficient code for each call. For the Array get and set operations, semantic expansion yields code as efficient as that generated for multidimensional arrays in Fortran and C. Programs using the Array package thus achieve high performance when executed on JVMs that recognize the package.
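
For example, a matrix multiplication written with the Array package looks as follows. (This is a sketch; we assume a constructor of the form doubleArray2D(rows, cols), along the lines of the sidebar “Array Package for Java.”)

    doubleArray2D A = new doubleArray2D(m, n);
    doubleArray2D B = new doubleArray2D(n, p);
    doubleArray2D C = new doubleArray2D(m, p);
    for (int i = 0; i < m; i++)
        for (int j = 0; j < p; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A.get(i, k) * B.get(k, j);  // expanded inline by the compiler
            C.set(i, j, s);
        }

After semantic expansion, each get and set compiles to a direct indexed load or store, just as in the equivalent Fortran loop nest.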

The Complex class and semantic expansion. We also defined a complex number class as part of the Array package, along with methods implementing arithmetic operations on complex numbers. Again, semantic expansion is used to convert calls to these methods into code that directly manipulates complex number values, instead of full-fledged Complex objects. Values are converted to objects in a lazy manner upon encountering an operation that may require OO functionality. Thus, the programmer continues to treat complex numbers as objects (maintaining the clean semantics of the original language), while the compiler transparently transforms them into values for efficiency.
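
A fragment in this style might read as follows (the constructor and the method names plus and times are our assumptions about the interface, not a definitive API):

    // Without semantic expansion, each operation below allocates a new
    // Complex object; with it, the real and imaginary parts flow through
    // registers as ordinary floating-point values.
    Complex u = new Complex(1.0, 2.0);
    Complex v = new Complex(3.0, -1.0);
    Complex w = u.times(v).plus(u);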

Versioning for safe and alias-free regions. The Array package allows a compiler to perform simple transformations that eliminate the performance problems caused by Java’s runtime checks and exception model. The idea is to create regions of code guaranteed to be free of exceptions. Once these exception-free, or “safe,” regions are created, the compiler can apply code-reordering optimizations [12]. The safe regions are created by versioning of loop nests: for each optimized loop nest, the compiler creates two versions (safe and unsafe), guarded by a runtime test. The test is constructed so that if all Arrays in the loop nest are valid (not null), and if all the indexing operations inside the loop generate in-bounds accesses, then the test evaluates to true. In that case, the safe version of the loop is executed; otherwise, the unsafe version is executed. Since the safe version cannot cause an exception, explicit runtime checks are omitted from its code.

We take the versioning approach a step further. Application of automatic loop transformation (and parallelization) techniques by a compiler generally requires alias disambiguation among the various arrays (single or multidimensional) referenced in a loop nest. With the Array package, it is easy to determine whether two arrays are distinct. Using these concepts, we can further specialize the safe version of a loop nest into two variants—one in which all arrays are guaranteed to be distinct (no aliasing), and one in which there may be aliasing between arrays. Mature loop optimization techniques are easily applied to the safe and alias-free regions.

Figure 2 shows an example of the versioning transformations for creating safe and alias-free regions. Figure 2(a) illustrates the original code, explicitly showing all null-pointer and array-bounds runtime checks being performed. (The operations checknull and checkbounds are implicit at the Java source and bytecode levels; they are shown explicitly here for illustration.) Figure 2(b) illustrates the versioned code. A simple test of the A and B pointers, together with a comparison of loop bounds against array extents, determines whether the loop will be free of exceptions.
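
At the source level, the versioned code has roughly the following shape. (This is our paraphrase of Figure 2, using ordinary 1D Java arrays and folding in the alias test described above.)

    if (A != null && B != null && n <= A.length && n <= B.length) {
        if (A != B) {
            // Safe and alias-free: checks are omitted, and loop
            // transformations and parallelization may be applied freely.
            for (int i = 0; i < n; i++) A[i] = A[i] + B[i];
        } else {
            // Safe, but A and B are the same array: checks are omitted,
            // yet reordering must respect the dependence.
            for (int i = 0; i < n; i++) A[i] = A[i] + B[i];
        }
    } else {
        // Unsafe: executed with all null and bounds checks in place.
        for (int i = 0; i < n; i++) A[i] = A[i] + B[i];
    }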

Libraries for numerical computing. Optimized libraries are an important vehicle for achieving high performance in numerical applications. One approach is to make existing native libraries available to Java programmers through the Java Native Interface [3]. Another approach is to develop new libraries entirely in Java (see the sidebar “Numerical Linear Algebra in Java”).
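
As a sketch of the native-library approach (the class name, library name, and signature below are hypothetical), a Java wrapper declares a native method that a thin C layer binds to a tuned routine such as DGEMM:

    // Hypothetical JNI wrapper around a highly tuned native BLAS.
    public class NativeBlas {
        static {
            System.loadLibrary("blaswrap");  // hypothetical native library
        }
        // Implemented in C via JNI; forwards its arguments to DGEMM.
        public static native void dgemm(int m, int n, int k,
                                        double[] a, double[] b, double[] c);
    }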

Implementation and Results

We have implemented these ideas in the NINJA prototype, which is based on the IBM XL family of compilers. In these compilers, front-ends for different languages transform programs into a common intermediate representation, called W-Code. The Toronto Portable Optimizer (TPO) W-Code-to-W-Code transformer performs classical optimizations, including constant propagation and dead code elimination, as well as high-level loop transformations based on aggressive dataflow analysis. Finally, the transformed W-Code is converted into optimized machine code by an architecture-specific back-end. Semantic expansion of the Array package methods [1] is implemented within the IBM compiler for the Java front-end [11]. Safe region creation and alias versioning are implemented in TPO.

NINJA translates machine-independent Java bytecode into machine-specific executable code in a separate step, prior to execution. Nothing prevents the techniques described here from being used in a dynamic compiler. Moreover, by using the quasi-static dynamic compilation model [10], the more expensive (in terms of compiler overhead) optimization and analysis techniques can be performed offline, sharply reducing the effect of compilation overhead.

To evaluate the performance effects of these techniques, we recently used a suite of benchmarks and a production data mining application [1, 8]. We compared the performance produced by the NINJA compiler with that of the IBM Development Kit for Java version 1.1.6 and the IBM XLF Fortran compiler on a variety of platforms.

Sequential execution results. Figure 3(a) summarizes results for eight real arithmetic benchmarks when running in strictly sequential (single-threaded) mode. The numbers at the top of the bars indicate actual Mflops. For the Java 1.1.6 version, arrays are implemented as double[][]. The NINJA version uses doubleArray2D Arrays from the Array package, optimized with semantic expansion. For six of the benchmarks (matmul, microdc, lu, cholesky, bsom, and shallow), the performance of the Java version (with the Array package and NINJA compiler) is 80% or more of the performance of the Fortran version. This high performance is due to well-known loop transformations, enabled by our techniques, that enhance data locality.


The impediments to widespread adoption of Java for numerically intensive computing are economic and social, not technical.


Results for complex arithmetic benchmarks. Figure 3(b) summarizes results for five complex arithmetic benchmarks (fft, matmul, lu, cfd, and microac). For the Java 1.1.6 version, complex arrays are represented using a Complex[][] array of Complex objects. The NINJA version uses ComplexArray2D Arrays from the Array package and semantic expansion. In all cases, we observed significant performance improvement between the Java 1.1.6 and NINJA versions ranging from a factor of 35 (1.7 to 60.5Mflops for cfd) to a factor of 75 (1.2 to 89.5Mflops for matmul). We achieved Java performance ranging from 55% (microac) to 85% (fft and cfd) of fully optimized Fortran code.

Parallel execution results. Loop parallelization is another important transformation enabled by safe region creation and alias versioning. We applied automatic loop parallelization to the eight real arithmetic Java benchmarks; Figure 3(c) shows the resulting speedups (relative to single-processor performance) of the parallel code optimized with NINJA. The compiler was able to parallelize loops in each of the eight benchmarks, and significant speedups (better than 50% efficiency on four processors) were obtained in six of them (matmul, microdc, lu, shallow, bsom, and fft).

Results for parallel libraries. To further evaluate the effectiveness of our solutions, we applied NINJA to a production data mining application. In this case, we used a parallel version of the Array package that uses multithreading to exploit parallelism within the Array operations. The user application itself was strictly sequential code, and all parallelism was exploited transparently to the application programmer; results are in Figure 3(d). The conventional (Java arrays) version of the application achieves only 26Mflops, compared to 120Mflops for the Fortran version. The single-processor Java version with the Array package (bar Array x 1) achieves 109Mflops. Moreover, when run on a multiprocessor, the performance of the Array package version scales with the number of processors (bars Array x 2, Array x 3, and Array x 4 for execution on two, three, and four processors, respectively), achieving almost 300Mflops on four processors.

Conclusion

There are no serious technical impediments to adopting Java as a major language for numerically intensive computing. The techniques we’ve developed and presented here are straightforward to implement and allow the reuse of existing compiler optimization technology. Java itself has many features, including simpler pointers and flexibility in choosing object layouts, that facilitate such optimizations.

The impediments to Java’s widespread adoption for numerically intensive computing are instead economic and social, including vendor unwillingness to commit the resources to developing product-quality compilers; application developer reluctance to make the transition to new languages for developing new codes; and the widespread view that Java is simply not suited for technical computing. The consequences of this situation include a large pool of programmers being underutilized and millions of lines of code being developed using languages inherently more difficult and less safe to use than Java. Maintaining these non-Java programs will be a burden on scientists and application developers for decades to come.

We hope the concepts and results presented here help overcome these impediments and accelerate acceptance of Java to the benefit of the general computing community.

Figures

Figure 1. Examples of (a) array of arrays simulating a 2D array; (b) array of arrays in a more irregular structure; and (c) rectangular 2D array.

Figure 2. Creation of safe and alias-free regions through automatic compiler versioning.

Figure 3. Performance results of applying our Java optimization techniques to various cases.

References

    1. Artigas, P., Gupta, M., Midkiff, S., and Moreira, J. Automatic loop transformations and parallelization for Java. In Proceedings of the 2000 International Conference on Supercomputing (Santa Fe, NM, May 8–11). ACM Press, New York, 2000, 1–10.

    2. Boisvert, R., Dongarra, J., Pozo, R., Remington, K., and Stewart, G. Developing numerical libraries in Java. In Proceedings of the ACM 1998 Workshop on Java for High-Performance Network Computing (Stanford University, Palo Alto, CA, Feb. 28–Mar. 1). ACM Press, New York, 1998; see www.cs.ucsb.edu/conferences/java98.

    3. Casanova, H., Dongarra, J., and Doolin, D. Java access to numerical libraries. In Proceedings of the ACM 1997 Workshop on Java for Science and Engineering Computation (Las Vegas, June 21). ACM Press, New York, 1997; see www.npac.syr.edu/users/gcf/03/javaforsc/acmspecissue/latestpapers.html.

    4. Chatterjee, S., Jain, V., Lebeck, A., Mundhra, S., and Thottethodi, M. Nonlinear array layouts for hierarchical memory systems. In Proceedings of the 1999 International Conference on Supercomputing (Rhodes, Greece, June 20–25). ACM Press, New York, 1999, 444–453.

    5. Dongarra, J., Duff, I., Sorensen, D., and van der Vorst, H. Solving Linear Systems on Vector and Shared Memory Computers. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1991.

    6. Gosling, J., Joy, B., and Steele, G. The Java Language Specification. Addison-Wesley, Reading, MA, 1996.

    7. Gustavson, F. Recursion leads to automatic variable blocking for dense linear algebra algorithms. IBM J. Res. Develop. 41, 6 (Nov. 1997), 737–755.

    8. Moreira, J., Midkiff, S., Gupta, M., Artigas, P., Snir, M., and Lawrence, R. Java programming for high-performance numerical computing. IBM Syst. J. 39, 1 (2000), 21–56.

    9. Sarkar, V. Automatic selection of high-order transformations in the IBM XL Fortran compilers. IBM J. Res. Develop. 41, 3 (May 1997), 233–264.

    10. Serrano, M., Bordawekar, R., Midkiff, S., and Gupta, M. Quicksilver: A quasi-static compiler for Java. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA'00) (Minneapolis, MN, Oct. 15–20). ACM Press, New York, 2000, 66–82.

    11. Seshadri, V. IBM high-performance compiler for Java. AIXpert Mag. (Sept. 1997); see www.developer.ibm.com/library/aixpert.

    12. Wolfe, M. High-Performance Compilers for Parallel Computing. Addison-Wesley, Reading, MA, 2000.
