Programming in a functional style can often lead to surprisingly elegant solutions to complicated problems. This arises in part from abstracting away from locations and state and thinking instead in terms of values and functions, in a mathematical style. Also, importantly, the lack of side effects means that components are easily composable. This is particularly important for parallel programs, since code without side effects can run in parallel while retaining a deterministic sequential semantics. Because the functional style focuses on values rather than state, it abstracts away from the notion of memory and location. This can be viewed as a failure, or as an opportunity.
On the one hand, it prevents the user from controlling how memory is laid out or how operations are ordered during the computation. This rules out many optimizations by the user that are crucial for performance on modern hardware, for example laying out structures adjacently so they share a cache line, or avoiding the levels of indirection often referred to as boxing.
On the other hand, it is an opportunity for smart compilers or runtime systems to perform these optimizations for the user. The compiler has the advantage that it can be customized for different machines and can potentially have a more accurate model of a machine's costs. Compilers are also far more capable of searching large parameter spaces; it is surely rare that any human still does register allocation by hand. On this front, compilers for typed functional languages have taken large steps toward generating code that can sometimes match or even beat optimized low-level human-written code. Such compilers include the MLton compiler for Standard ML and the Glasgow Haskell Compiler (GHC) for Haskell. Both are very proficient at unboxing and hence at avoiding levels of indirection. GHC also performs stream fusion, which can avoid generating intermediate results that are expensive to write out and read back. The following paper, by Mainland, Leshchinskiy, and Peyton Jones, points out, however, that stream fusion by itself is not well suited for generating bulk instructions such as vector or SIMD instructions.
As an example, the authors consider a simple vector dot product. A dot product is expressed naturally, and compositionally, as an element-wise product of the two vectors followed by a sum of the elements of the resulting vector, or, in functional parlance, a zip-with multiply followed by a reduce with plus. This is elegant and high level because it does not directly specify the order in which the element-wise multiplies, or the additions in the reduction, are applied.
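As a concrete illustration, here is a minimal Haskell sketch of that compositional definition, assuming the Data.Vector.Unboxed module from the vector package (the function name dotp is ours, chosen for illustration):

    -- A minimal sketch of the compositional dot product: an element-wise
    -- multiply (zipWith) followed by a sum. Data.Vector.Unboxed comes from
    -- the vector package; the name dotp is illustrative.
    import qualified Data.Vector.Unboxed as U

    dotp :: U.Vector Double -> U.Vector Double -> Double
    dotp v w = U.sum (U.zipWith (*) v w)

    main :: IO ()
    main = print (dotp (U.fromList [1,2,3,4]) (U.fromList [5,6,7,8]))  -- prints 70.0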
Naïvely, however, such a dot product creates an intermediate vector containing all the element-wise products. This is inefficient, since writing out the intermediate vector and reading it back will end up being a significant portion of the cost. It can be much more efficient to multiply each pair of elements and immediately add the product into a running sum, as one would likely write a loop in an imperative language such as C or Java. The translation from the zip-reduce form to such a loop can be done automatically by the Haskell compiler using stream fusion. The resulting code, however, is inherently sequential, as the C or Java code would be, and inhibits the use of bulk operations, or vector instructions. Instead, the target code needs to chunk (or block) the vectors into pieces to which the bulk operations or vector instructions can be applied.
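Conceptually, fusion collapses the two passes into a single accumulating loop. The following hand-written sketch only illustrates that loop shape (it is not the code GHC actually produces); note that it consumes one element per iteration, which is precisely what leaves no obvious opening for SIMD instructions:

    {-# LANGUAGE BangPatterns #-}
    -- Illustrative only: roughly the single sequential loop that fusion
    -- yields for the dot product. No intermediate vector is built, but the
    -- loop handles one element per step, so there is nothing to vectorize.
    import qualified Data.Vector.Unboxed as U

    dotpLoop :: U.Vector Double -> U.Vector Double -> Double
    dotpLoop v w = go 0 0
      where
        n = min (U.length v) (U.length w)
        go !i !acc
          | i >= n    = acc
          | otherwise = go (i + 1) (acc + (v U.! i) * (w U.! i))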
The authors propose a solution to this chunking problem. Their approach recognizes that no one representation is useful in all situations and instead maintains multiple representations of a stream in what they refer to as a bundle. Maintaining multiple representations might seem inherently inefficient due to redundancy, but within the stream framework only the representation demanded by the consumer need actually be generated by the producer; the unevaluated remaining ones can simply be discarded. Making this all work raises several other challenges, which the paper discusses. Ultimately, the paper provides a variety of results showing the approach can lead to Haskell code that outperforms C on certain benchmarks, even when the Haskell code uses the high-level vector library.
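To give a feel for the idea, the sketch below is our own illustration of a bundle-like type; the names and fields are assumptions for exposition, not the paper's actual definitions from the vector library. It pairs an element-at-a-time view with a chunked view of the same data, and lazy evaluation ensures that a consumer forces only the view it actually uses:

    -- A hedged sketch of the bundle idea: one logical stream offered under
    -- several representations. The producer defines all of them, but only
    -- the view demanded by the consumer is ever evaluated; the rest remain
    -- unevaluated thunks and are discarded.
    import qualified Data.Vector.Unboxed as U

    data Bundle a = Bundle
      { elems  :: [a]            -- element-at-a-time view (a plain stream)
      , chunks :: [U.Vector a]   -- chunked view, amenable to bulk/SIMD kernels
      }

    -- A consumer that can exploit bulk operations picks the chunked view ...
    sumBulk :: Bundle Double -> Double
    sumBulk b = sum (map U.sum (chunks b))

    -- ... while a purely element-wise consumer falls back to the other view.
    sumScalar :: Bundle Double -> Double
    sumScalar b = sum (elems b)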
The holy grail for compilers of functional languages, namely always outperforming hand-tuned code, has certainly not yet been achieved, but compilers for typed functional languages continue to take big steps.