Home/Magazine Archive/May 2017 (Vol. 60, No. 5)/Exploiting Vector Instructions with Generalized Stream.../Full Text

Research highlights
## Exploiting Vector Instructions with Generalized Stream Fusion

Ideally, a program written as a composition of concise, self-contained components should perform as well as the equivalent hand-written version where the functionality of what was many components has been manually combined into a monolithic implementation. That is, programmers should not have to sacrifice code clarity or good software engineering practices to obtain performancewe want compositionality without a performance penalty. This work shows how to attain this goal for high-level Haskell in the domain of sequence-processing functions, which includes applications such as array processing.

Prior work on stream fusion^{3} shows how to automatically transform some high-level sequence-processing functions into efficient implementations. It has been used to great effect in Haskell libraries for manipulating byte arrays, Unicode text, and unboxed vectors. However some operations, like vector append, do not perform well within the stream fusion framework. Others, like SIMD computation using the SSE and AVX instructions available on modern x86 chips, do not seem to fit in the stream fusion framework at all. We describe generalized stream fusion, which solves these issues through a careful choice of stream representation. Benchmarks show that high-level Haskell code written using our compiler and libraries can produce code that is faster than both compiler- and hand-vectorized C.

It seems unreasonable to ask a compiler to be able to turn numeric algorithms expressed as high-level Haskell code into tight machine code. The compiler must cope with boxed numeric types, handle lazy evaluation, and eliminate intermediate data structures. However the Glasgow Haskell Compiler has become "sufficiently smart" that, in many domains, Haskell libraries for expressing numerical computations no longer have to sacrifice speed at the altar of abstraction.

The key development that made this sacrifice unnecessary is *stream fusion.*^{3} Algorithms over sequenceswhether they are lists or vectors (arrays)are expressed naturally in a functional language using operations such as folds, maps, and zips. Although highly modular, these operations produce unnecessary intermediate structures that lead to inefficient code. Eliminating these intermediate structures is termed deforestation, or fusion. Equational laws, such as map f map g map (f g), allow some of these intermediate structures to be eliminated; finding more general rules has been the subject of a great deal of research.

Stream fusion, based on the observation that recursive structures can be transformed into non-recursive co-structures for which fusion is relatively straightforward, was the first truly general solution. Instead of working directly with lists or vectors, stream fusion works by re-expressing these functions as operations over streams, each represented as a state and a step function that transforms the state while potentially yielding a single value. Alas, different operations need different stream representations, and no single representation works well for all operations (Section 2.2). Furthermore, for many operations it is not obvious what the choice of representation should be.

We solve this problem with a new *generalized stream fusion* framework where the primary abstraction used to express operations on vectors is a *bundle* of streams. The streams are chosen so that for any given high-level vector operation there is a stream in the bundle whose representation leads to an efficient implementation. The bundle abstraction has no run-time cost because standard optimizations performed by the Glasgow Haskell Compiler (GHC) eliminate intermediate bundle structures. We describe the generalized stream framework as well as a stream representation that leads to efficient vectorized code. Our benchmarks compare to the very best C and C++ compilers and libraries that we could find. Remarkably, our benchmarks show that choosing the proper stream representations can result in machine code that beats compiler-vectorized C and is competitive with hand-tuned assembly.

We begin by providing the background necessary for understanding stream fusion. There is no new material hereit is all derived from Coutts et al.^{3} However, we describe fusion for functions of *vectors* of unboxed values, as implemented in the vector^{10} library, rather than fusion for functions over *lists.* Some of the implementation details are elided, but the essential aspects of stream fusion as we describe them are faithful to the implementation.

The big idea behind stream fusion is to rewrite recursive functions, which are difficult for a compiler to automatically optimize, as non-recursive functions. The abstraction that accomplishes this is the Stream data type:

dataStream awhere

Stream (s Step s a) s Int Stream a

dataStep s a = Yield a s

| Skip s

| Done

A stream is a triple of values: an internal (existentially quantified) state, represented by the type variable s in the above definition, a size, and a step function that, when given a state, produces a Step. A Step may be Done, indicating that there are no more values in the Stream, it may Yield a value and a new state, or it may produce a new state but Skip producing a value. The presence of Skip allows us to easily express functions like filter within the stream fusion framework.

To see concretely how this helps us avoid recursive functions, let us write map for vectors using streams

map (a b) Vector a Vector b

map f = unstream map_{s}f stream

The functions stream and unstream convert a Vector to and from a stream. A Vector is converted to a stream whose state is an integer index and whose step function yields the value at the current index, which is incremented at each step. To convert a stream back into a Vector, unstream allocates memory for a new vector and writes each element to the vector as it is yielded by the streamunstream embodies a recursive loop. Though imperative, the allocation and writing of the vector are safely embedded in pure Haskell using the ST monad.^{9}

The real work is done by map_{s}, which is happily non-recursive:

map

_{s}(a b) Stream a Stream b

map_{s}f (Stream step s) = Stream step's

where

step's =casestep sof

Yield x s' Yield (f x) s'

Skip s' Skip s'

Done Done

With this definition, the equational rule mentioned in the Introduction, map f map g map (f g), falls out automatically. To see this, let us first inline our new definition of map in the expression map f map g:

map f map g

unstream map_{s}f stream unstream map_{s}

g stream

Given this form, we can immediately spot where an intermediate structure is formedby the composition stream unstream. This composition is, in effect, the identity function, so we should be able to eliminate it entirely. GHC's rewrite rules enable programmers to express algebraic identities such as stream unstream `=`

id in a form that GHC can understand and automatically apply. Stream fusion relies *critically* on this ability, and the `vector`

library includes exactly this rule. With the rule in place, GHC transforms our original composition of maps into

map f map g

unstream map_{s}f map_{s}g stream

Conceptually, stream fusion pushes all recursive loops into the final consumer. The two composed invocations of map become a composition of two *non-recursive* calls to map_{s}. The inliner is now perfectly capable of combining map_{s} f map_{s} g into a single Stream function. Stream fusion gives us the equational rule map f map g map (f g) *for free.*

**2.1. Fusing the vector dot product**

The motivating example we will use for the rest of the paper is the vector dot product. A high-level implementation of this function in Haskell might be written as follows:

dotp Vector Double Vector Double Double

dotp v w = sum (zipWith (*) v w)

It seems that this implementation will suffer from severe inefficiencythe call to zipWith produces an unnecessary intermediate vector that is immediately consumed by the function sum. In expressing dotp as a composition of collective operations, we have perhaps gained a bit of algorithmic clarity, but in turn we have incurred a performance hit.

We have already seen how stream fusion eliminates intermediate structures in the case of a composition of two calls to map. Previous fusion frameworks could handle that example but were stymied by the presence of a zipWith. However stream fusion has no problem fusing zipWith, which we can see by applying the stream transformations we saw earlier to dotp.

The first step is to re-express each Vector operation as the composition of a Stream operation and appropriate conversions between Vectors and Streams at the boundaries. The functions zipWith and sum are expressed in this form as follows:

zipWith (a b c) Vector a Vector b

Vector c

zipWith f v w = unstream (zipWith_{s}f (stream v)

(stream w))

sum Num a Vector a a

sum v = foldl's 0 (+) (stream v)

It is now relatively straightforward to transform dotp to eliminate the intermediate structure:

dotp Vector Double Vector Double Double

dotp v w sum (zipWith (*) v w)

foldl'_{s}0 (+) (stream (unstream

(zipWiths (*) (stream v) (stream w))))

foldl'_{s}0 (+)

(zipWiths (*) (stream v) (stream w))

This transformation again consists of inlining a few definitions, something that GHC can easily perform, and rewriting the composition stream unstream to the identity function. After this transformation, the production (by zipWith) and following consumption (by sum) of an intermediate Vector becomes the composition of non-recursive functions on streams.

We can see how iteration is once again pushed into the final consumer by looking at the implementations of foldl'_{s} and zipWith_{s}. The final consumer in dotp is foldl'_{s}, which is implemented by an explicit loop that consumes stream values and combines the yielded values with the accumulator z using the function f (the call to seq guarantees that the accumulator is strictly evaluated):

foldl'

_{s}(a b a) a Stream b a

foldl'_{s}f z (Stream step s) = loop z s

where

loop z s = z 'seq'

casestep sof

Yield x s' loop (f z x) s'

Skip s' loop z s'

Done z

However, in zipWith_{s} there is no loopthe two input streams are consumed until either both streams yield a value, in which case a value is yielded on the output stream, or until one of the input streams is done producing values. The internal state of the stream associated with zipWith_{s} contains the state of the two input streams and a one-item buffer for the value produced by the first input stream:

zipWith

_{s}(a b c) Stream a Stream

b Stream c

zipWith_{s}f (Stream stepa sa na) (Stream stepb sb nb) =

Stream step (sa, sb, Nothing) (min na nb)

where

step (sa, sb, Nothing) =

casestepa saof

Yield x sa' Skip (sa', sb, Just x)

Skip sa' Skip (sa', sb, Nothing)

Done Done

step (sa, sb, Just x) =

casestepb sbof

Yield y sb' Yield (f x y) (sa, sb', Nothing)

Skip sb Skip (sa, sb, Just x)

Done Done

Given these definitions, GHC's call-pattern specialization in concert with the standard inliner suffice to transform dotp into a single loop that does not produce an intermediate structure. If there is any doubt that this results in efficient machine code, we give the actual assembly language inner loop output by GHC using the LLVM back end. Stream fusion preserves the ability to write compositionally without sacrificing performance:

`.LBB2__3:`

`movsd (%rcx), %xmm0`

`mulsd (%rdx), %xmm0`

`addsd %xmm0, %xmm1`

`addq $8, %rcx`

`addq $8, %rdx`

`decq %rax`

`jne .LBB2__3`

**2.2. Stream fusion inefficiencies**

Though stream fusion does well for the examples we have shown, it still does not produce efficient implementations for many other operations. In particular, the inadequacy of the single-value-at-a-time nature of streams becomes particularly problematic when attempting to opportunistically utilize the SIMD instructions available on many current architectures, for example, SSE on x86 and NEON on ARM. These instructions operate in parallel on data values that contains two (or four or eight, depending on the hardware architecture) floating point numbers at a time. To avoid notational confusion, we call these *multi-values*, or sometimes just *multis.*

To enable sum to use SIMD instructions, we would like a stream representation that yields multi-values (rather than scalars), with perhaps a bit of scalar "dribble" at the end of the stream when the number of scalar values is not divisible by the size of a multi.

Although a stream of scalar values is useless for SIMD computation, a stream of multi-values is not quite right either, because of the "dribble" problem. Perhaps, we could get away with a stream that yielded *either* a scalar or a multi at each step, but this would force all scalar-only operations to handle an extra case, complicating the implementations of *all* operations and making them less efficient. There is a better way!

We have seen that different stream operations work best with different stream representations. In this section, we describe how to incorporate multiple stream representations into the stream fusion framework, elaborate on the details of a representation that enables SIMD computation with vectors, and show how to use our framework to transparently take advantage of SIMD instructions in Data Parallel Haskell programs.

The idea underlying generalized stream fusion is straightforward but its effects are wide-ranging: instead of transforming a function over vectors into a function over streams, transform it into a function over a *bundle* of streams. A bundle is a collection of streams, each semantically identical but with a different cost model, allowing each stream operation to choose the most advantageous stream representation in the bundle. We give a simplified version of the Bundle data type here:

dataBundle a = Bundle

{sSize Size

, sElems Stream a

, sChunks Stream (Chunk a)

, sMultis Multis a

}

The sElems field of the Bundle data type contains the familiar stream of scalar values that we saw in Section 2. The stream of Chunks contained in the sChunks field of the record enables the efficient use of bulk memory operations, like vector append, which we do not describe here. We next describe the representation contained in the sMultis field of the record, which enables the efficient use of SSE instructions.

**3.1. A stream representation fit for SIMD computation**

Modifying the stream fusion framework to accommodate SIMD operations opens up the possibility of dramatically increased performance for a wide range of numerical algorithms but requires a more thoughtful choice of representation. We focus on SIMD computations using 128-bit wide vectors and SSE instructions on x86/x64 since that is what our current implementation supports, although the approach generalizes.

Our implementation represents SIMD values using the type family Multi. We have chosen the name to avoid confusion with the Vector type, which represents arrays of arbitrary extent. In contrast, a value of type Multi a is a short vector containing a *fixed* number of elementsknown as its *multiplicity*of type a. On a given platform, Multi a has a multiplicity that is appropriate for the platform's SIMD instructions. For example, on x86, a Multi Double, will have multiplicity 2 since SSE instructions operate on 128-bit wide vectors, whereas a Multi Float will have multiplicity 4. Multi is implemented as an associated type^{1} in the MultiType type class; their definitions are shown in Figure 1. MultiType includes various operations over Multi values, such as replicating a scalar across a Multi and folding a function over the scalar elements of a Multi. These operations are defined in terms of new primitives we added to GHC that compile directly to SSE instructions.

Given a value of type Vector Double, how can we operate on it efficiently using SSE instructions within the generalized stream fusion framework? An obvious first attempt is to include a stream of Multi Doubles in the stream bundle. However, this representation is insufficient for a vector with an odd number of elements since we will have one Double not belonging to a Multi at the endthe "dribble" mentioned earlier. Let us instead try this instead: a stream that can contain *either* a scalar or a Multi. We call this stream a MultisP because the *producer* chooses what will be yielded at each step:

dataEither a b = Left a | Right b

typeMultisP a = Stream (Either a (Multi a))

Now we can implement summation using SIMD operations. Our strategy is to use two accumulating parameters, one for the sum of the Multi values yielded by the stream and one for the sum of the scalar values. Note that (+) is overloaded: we use *SIMD* (+) to add summ and y, and *scalar* (+) to add suml and x:

msumP

_{s}(Num a, Num (Multi a)) MultisP a a

msumP_{s}(Stream step s __) = loop 0.0 0.0 s

where

loop summ suml s =

casestep sof

Yield (Left x) s' loop summ (sum1 + x) s'

Yield (Right y) s' loop (summ + y) sum1 s'

Skip s' loop summ sum1 s'

Done multifold (+) sum1 summ

When the stream is done yielding values, we call the multifold member of the MultiType type class to fold the addition operator over the components of the Multi.

This implementation strategy works nicely for folds. However, if we try to implement the SIMD equivalent of zipWith_{s}, we hit a roadblock. A SIMD version of zipWith_{s} requires that at each step either *both* of its input streams yield a Multi or they *both* yield a scalarif one stream were to yield a scalar while the other yielded a Multi, we would have to somehow buffer the components of the Multi. And if one stream yielded *only* scalars while the other yielded only Multis, we would be hard-pressed to cope.

Instead of a stream representation where the producer chooses what is yielded, let us instead choose a representation where the stream *consumer* is in control:

dataMultisC awhere

MultisC (s Step s (Multi a))

(s Step s a)

s

MultisC a

The idea is for a MultisC a to be able to yield either a value of type Multi a or a value of type athe stream consumer chooses, which by calling one of the two step functions. Note that the existential state is quantified over both step functions, meaning that the same state can be used to yield either a single scalar or a Multi. If there is not a full Multi available, the first step function will return Done. The remaining scalars will then be yielded by the second step function. This representation allows us to implement a SIMD version of zipWith_{s}.

Regrettably, a MultisC *still* is not quite what we need. Consider appending two vectors of Doubles, each of which contains 41 elements. We cannot assume that the two vectors being appended are laid out consecutively in memory, so even though the stream that results from appending them together will contain 82 scalars, this stream is forced to yield a scalar in the middle of the stream. One might imagine an implementation that buffers and shifts partial Multi values, but this leads to very inefficient code. The alternative is for append_{s} to produce a stream in which either a scalar or a Multi is yielded at each stepbut that was the original representation we selected and then discarded because it was not suitable for zips!

The final compromise is to allow eitherbut not bothof these two representations. We cannot allow bothhence there is only one new bundle member rather than twobecause while we can easily convert a MultisC a into a MultisP a, the other direction is not efficiently implementable. The final definition of the Multis type alias is therefore:

type Multis a = Either (MultisC a) (MultisP a)

Each stream function that can operate on Multi values consumes the Multis a in the sMultis field of the stream bundle. It must be prepared to accept either a MultisC or a MultisP, which is a "mixed" stream of scalars and Multi's. However, we always try to produce a MultisC and only fall back to a MultisP as a last resort. Even operations that can work with either representation are often worth specializing for the MultisC form. In the case of msum_{s} above, this allows us to gobble up as many Multi values as possible and only then switch to consuming scalars, thereby cutting the number of accumulating parameters in half and reducing register pressure.

One could imagine attempting a representation that somehow guarantees longer runs of Multis, but this would add complexity and we doubt it would have any advantage over the MultisC representation, which has a distinct "phase shift" between operations on Multi and operations on scalars. For operations like zip that operate on multiple streams, we would need to guarantee that *both* streams have the same structureit simply does not do to have one stream in the pair yield a scalar while the other yields a Multi. The MultiC/MultiP distinction neatly captures this requirement by framing it in terms of who has control over what is yielded next, consumers or producers.

**3.2. A SIMD version of dotp**

With a stream representation for SIMD computation in hand, we can write a SIMD-ized version of the dot product from Section 2:

dotp_simd Vector Double Vector Double Double

dotp_simd v w = msum (mzipWith (*) v w)

The only difference with respect to the scalar implementation in Section 2.1 is that we use variants of foldl' and zipWith specialized to take function arguments that operate on values that are members of the Num type class. While we could have used versions of these functions that take two function arguments (our library supports both options), one for scalars and one for Multis, the forms that use overloading to allow the function argument to be used at both the type a a a and Multi a Multi a Multi a are a convenient shorthand:

mfold' (PackedVector Vector a, Num a, Num (Multi a) )

(b.Num b b b b)

a Vector a a

mzipWith :: (PackedVector Vector a, Num a, Num

(Multi a) )

(b.Num b b b b)

Vector a Vector a Vector a

msum (PackedVector Vector a, Num a, Num (Multi a) )

Vector a a

msum = mfold' (+) 0

The particular fold we use here, mfold', maintains two accumulators (a scalar and a Multi) when given a MultisP a and one accumulator when given a MultisC a. The initial value of the scalar accumulator is the third argument to mfold' and the initial value of the Multi accumulator is formed by replicating this scalar argument across a Multi. The result of the fold is computed by combining the elements of the Multi accumulator and the scalar accumulator using the function multifold from Figure 1. Note that the first argument to mfold' must be associative and commutative. The PackedVector type class constraint ensures both that the type a is an instance of MultiType and that elements contained in the vector are contiguous so that they can be extracted a Multi a at a time.

The stream version of mfold', mfold'_{s}, can generate efficient code no matter what representation is contained in a Multis a. On the other hand, the stream version of mzip-With, mzipWith_{s}, requires that both its vector arguments have a MultisC representation. Since there is no good way to zip two streams when one yields a scalar and the other a Multi, if either bundle argument to mzipWith_{s} does not have a MultisC representation available, mzipWith_{s} falls back to an implementation that uses only scalar operations.

**3.3. Automatic parallelization**

Using SIMD instructions does not come entirely for free. Consider mapping over a vector represented using multis:

mmap (PackedVector Vector a)

(a a)

(Multi a Multi a)

Vector a Vector a

To map efficiently over the vector, it does not suffice to pass a function of type (a a), because that does not work over multis. We must also pass a semantically equivalent multi-version of the function. For simple arithmetic, matters are not too bad:

foo Vector Float Vector Float

foo v = mmap (x y x + y * 2) (x y x + y * 2) v

The two lambdas are at different types, but Haskell's overloading takes care of that. We could attempt to abstract this pattern like this:

mmap (PackedVector Vector a)

(a.Num a a a)

Vector a Vector a

But that attempt fails if you want operations in class Floating, say, rather than Num. What we want is a way to *automatically multi-ize scalar functions* (such as (x y x + y * 2) above), so that we get a pair of a scalar function and a multi function, which in turn can be passed to map.

The programmer has to use mmap, which is a bit inconvenient. However, in separate work,^{2, 13} the Data Parallel Haskell project has shown how to automatically vectorize programs; the target there was turning nested data parallelism into flat data parallelism, but it turns out that we can use the same technology to turn element-wise data parallelism into SIMD multi-style data parallelism. Putting together DPH and the ideas of this paper gives the best of both worlds: programmers can write data parallel programs without considering SIMD, and the compiler will automatically exploit the vector instructions if they are present. Better still, DPH allows us to take advantage of *multiple cores* as well as the SIMD units in each core.

We updated DPH to use our modified `vector`

library. Because DPH programs are vectorized by the compiler so that all scalar operations are turned into operations over wide vectors, by implementing these wide vector operations using our new SIMD functions like msum, programs written using DPH automatically and transparently take advantage of SSE instructionsno code changes are required of the programmer. The full version of the paper includes benchmarks for our modified implementation of DPH.

**3.4. How general is generalized stream fusion?**

We do not mean to suggest that the representations we have chosen for our Bundle data type are complete in any sense except that they allow us to take advantage of bulk memory operations and SIMD instructions, which was our original goal. Generalized stream fusion is not "general" because we have finally hit upon the full set of representations one could possibly ever need, but because the frameworks we have put forth admit multiple new, specialized representations. The key features of generalized stream fusion are (1) the ability to add new specialized stream representations, notably without requiring the library writer to rewrite the entire library; (2) leveraging the compiler to statically eliminate all intermediate Bundle structures and leave behind the single representation that is actually necessary to perform the desired computation; and (3) not requiring the end user to know about the details of Bundles, or even that they exist.

Generalized stream fusion provides a representation and algebraic laws for rewriting operations over this representation whose usefulness extends beyond Haskell. Although we have implemented generalized stream fusion as a library, it could also be incorporated into a compiler as an intermediate language. This was not necessary in our implementation because GHC's generic optimizer is powerful enough to eliminate all intermediate structures created by generalized stream fusion. In other words, GHC is such a good partial evaluator that we were able to build generalized stream fusion as a library rather than incorporating it into the compiler itself. Writing high-level code without paying an abstraction tax is desirable in any language, and compilers other than GHC could also avoid this tax by using the ideas we outline in this paper, although perhaps only by paying a substantial one-time implementation cost.

There are three substantial components of our implementation. We first modified GHC itself to add support for SSE instructions. This required modifying GHC's register allocator to allow overlapping register classes, which was necessary to allow SSE vectors to be stored in registers. We then added support for fully unboxed primitive SIMD vector types and primitive operations over these types to GHC's dialect of Haskell. The STG and C-intermediate languages as well as GHC's LLVM code generator, were also extended to support compiling the new Haskell SIMD primitives. Boxed wrappers for the unboxed primitives and the MultiType type class and its associated Multi type complete the high-level support for working directly with basic SIMD data types. Because the SIMD support we added to GHC utilizes the LLVM back-end, it should be relatively straightforward to adapt our modifications for other CPU architectures, although at this time only x86-64 is supported.

Second, we implemented generalized stream fusion in a modified version of the `vector`

library^{10} for computing with efficient unboxed vectors in Haskell. We replaced the existing stream fusion implementation with an implementation that uses the Bundle representation and extended the existing API with functions such as mfold' and mzipWith that enable using SIMD operations on the contents of vectors. The examples in this paper are somewhat simplified from the actual implementations. For example, the actual implementations are written in monadic form and involve type class constraints that we have elided. Vectors whose scalar elements can be accessed in SIMD-sized groups, that is, vectors whose scalar elements are laid out consecutively in memory, are actually represented using a PackedVector type class. These details do not affect the essential design choices we have described, and the functions used in all examples are simply type-specialized instances of the true implementations.

Third, we modified the DPH libraries to take advantage of our new `vector`

library. The DPH libraries are built on top of the stream representation from a previous version of the `vector`

library, so we first updated DPH to use our bundle representation instead. We next re-implemented the primitive wide-vector operations in DPH in terms of our new SIMD operations on bundles. While we only provided SIMD implementation for operations on double-precision floating point values, this part of the implementation was quite small, consisting of approximately 20 lines of code not counting #`ifdefs`

. Further extending SIMD support in DPH will be easy now that it is based on bundles rather than streams.

Our support for SSE and AVX instructions is part of the standard GHC distribution, and our modifications to the `vector`

and DPH libraries are available in a public git repository.

Our original goal in modifying GHC and the `vector`

library was to make efficient use of SSE instructions from high-level Haskell code. The inability to use SSE operations from Haskell and its impact on performance is a deficiency that was brought to our attention by Lippmeier and Keller.^{11} The first step we took was to write a small number of simple C functions utilizing SSE intrinsics to serve as benchmarks. This gave us a very concrete goalto generate machine code from Haskell that was competitive with these C implementations. It is not a coincidence that one of the first such C functions that we wrote was an implementation of the vector dot product, in both a scalar version and a version using compiler intrinsics for manual SSE support. We omit the C versions, but repeat the definition of the Haskell implementation here:

ddotp Vector Double Vector Double Double

ddotp v w = mfold' (+) 0 (mzipWith (*) v w)

Though not exactly onerous, the C version with SSE support is already unpleasantly more complex than the scalar version. The Haskell version, consisting of a single line of code (not including the optional type signature), is certainly the simplest. Also note that the Haskell programmer can think compositionallyit is natural to think of dot product as pairwise multiplication followed by summation. The C programmer, on the other hand, must manually fuse the two loops into a single multiply-add. Furthermore, as well as being constructed compositionally, the Haskell implementation can itself be *used* compositionally. That is, if the input vectors to ddotp are themselves the results of vector computations, generalized stream fusion will potentially fuse *all* operations in the chainnot just the dot product's zip and foldinto a single loop. In contrast, the C programmer must manifest the input to the C implementation of `ddotp`

as concrete vectors in memorythere is no potential for automatic fusion with other operations in the C version.

Figure 2 compares the single-threaded performance of several implementations of the dot product, including C and Haskell versions that only use scalar operations as well as the implementation provided by GotoBLAS2 1.13.^{5, 6} Times were measured on a 3.40 GHz Intel i7-2600K processor, averaged over 100 runs. To make the relative performance of the various implementations clearer, we show the execution time of each implementation relative to the scalar C version, which is normalized to 1.0, in Figure 3.

Surprisingly, both the naive scalar C implementation and the version written using SSE intrinsics perform approximately the same. This is because GCC automatically vectorizes the scalar implementation. However, the Haskell implementation is almost always faster than both C versions; it is 520% slower for very short vectors (those with fewer than about 16 elements) and 12% slower just at the point where the working set size exceeds the capacity of the L1 cache. Not only does Haskell outperform C on this benchmark, but it outperforms GCC's vectorizer. Once the working set no longer fits in L3 cache, the Haskell implementation is even neck-and-neck with the implementation of `ddotp`

from GotoBLAS, a collection of highly tuned BLAS routines hand-written in assembly language that is generally considered to be one of the fastest BLAS implementation available.

**5.1. Prefetching and loop unrolling**

Why is Haskell so fast? Because in addition to leveraging loop fusion and a careful choice of representation, we have also exploited the high-level stream-fusion framework to embody two additional optimizations: *loop unrolling* and *prefetching.*

The generalized stream fusion framework allowed us to implement the equivalent of loop unrolling by adding under 200 lines of code to the `vector`

library. We changed the MultisC data type to incorporate a *leap*, which is a Step that contains multiple values of type Multi a. We chose Leap to contain four valuesso loops are unrolled four timessince on x86-64 processors this tends not to put too much register pressure on the register allocator. Adding multiple Leaps of different sizes would also be possible. MultisC consumers may choose not to use the Leap stepping function, in which case loops will not be unrolled:

data Leap a = Leap a a a a

data MultisC a where

MultisC (s Step s (Leap (Multi a) ) )

(s Step s (Multi a) )

(s Step s a)

s

MultisC a

Prefetch instructions on Intel processors allow the program to give the CPU a hint about memory access patterns, telling it to prefetch memory that the program plans to use in the future. In our library, these prefetch hints are implemented using prefetch primitives that we added to GHC. When converting a Vector to a MultisC, we know exactly what memory access pattern will be usedeach element of the vector will be accessed in linear order. The function that performs this conversion, stream, takes advantage of this knowledge by executing prefetch instructions as it yields each Leap. Only consumers using Leaps will compile to loops containing prefetch instructions, and stream will only add prefetch instructions for vectors whose size is above a fixed threshold (currently 8192 elements), because for shorter vectors the extra instruction dispatch overhead is not amortized by the increase in memory throughput. A prefetch distance of 128 * 12, based on the line fill buffer size of 128 bytes, was chosen empirically. Loop unrolling and prefetching produce an inner loop for our Haskell implementation of ddotp that is shown in Figure 4.^{a}

Not only can the client of our modified `vector`

library write programs in terms of boxed values and directly compose vector operations instead of manually fusing operations without paying an abstraction penalty, but he or she can transparently benefit from low-level prefetch "magic" baked into the library. Of course the same prefetch magic could be expressed manually in the C version. However, when we originally wrote the C implementation of dot product using SSE intrinsics, we did not know about prefetching. We suspect that many C programmers are in the same state of ignorance. In Haskell, this knowledge is embedded in a library, and clients benefit from it automatically.

Wadler^{16} introduced the problem of deforestation, that is, of eliminating intermediate structures in programs written as compositions of list transforming functions. A great deal of follow-on work^{4, 7, 8, 12, 14, 15} attempted to improve the ability of compilers to automate deforestation through program transformations. Each of these approaches to fusion has severe limitations. For example, Gill et al.^{4} cannot fuse left folds, such as that which arises in sum, or zipWith, and Takano and Meijer^{14} cannot handle nested computations such as mapping a function over concatenated lists. Our work is based on the stream fusion framework described by Coutts et al.,^{3} which can fuse all of these use cases and more. The `vector`

library uses stream fusion to fuse operations on vectors rather than lists, but the principles are the same.

Generalized stream fusion is a strict improvement on stream fusion; by re-casting stream fusion to operate on bundles of streams, each vector operation or class of operations can utilize a stream representation tailored to its particular pattern of computation. Though we focused on leveraging SSE instructions in this article, our implementation also adds support for efficient use of bulk memory operations in vector operations. As part of our work, we added support for low-level SSE instructions to GHC and incorporated generalized stream fusion into the `vector`

library. Using our modified library, programmers can write compositional, high-level programs for manipulating vectors without loss of efficiency. Benchmarks show that these programs can perform competitively with hand-written C.

Although we implemented generalized stream fusion in a Haskell library, the bundled stream representation could be used as an intermediate language in another compiler. Vector operations would no longer be first class in such a formulation, but it would allow a language to take advantage of fusion without requiring implementations of the general purpose optimizations present in GHC that allow it to eliminate the intermediate structures produced by generalized stream fusion.

1. Chakravarty, M.M.T., Keller, G., Peyton Jones, S., Marlow, S. Associated types with class. In *Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages*, POPL, 05 (New York, NY, USA, 2005). ACM, New York, NY USA, 113.

2. Chakravarty, M.M.T., Leshchinskiy, R., Peyton Jones, S., Keller, G., Marlow, S. Data parallel Haskell: A status report. In *Proceedings of the 2007 Workshop on Declarative Aspects of Multicore Programming*, DAMP, 07 (Nice, France, 2007). ACM, New York. NY, USA, 1018.

3. Coutts, D., Leshchinskiy, R., Stewart, D. Stream fusion: From lists to streams to nothing at all. In *Proceedings of the 12th ACM SIGPLAN International Conference on Functional Programming* (Freiburg, Germany, 2007). ACM, New York, NY, USA, 315326.

4. Gill, A., Launchbury, J., Peyton Jones, S.L. A short cut to deforestation. In *Proceedings of the Conference on Functional Programming Languages and Computer Architecture*, FPCA, 93 (1993). ACM, New York, NY, USA, 223232.

5. Goto, K., van de Geijn, R.A. Anatomy of high-performance matrix multiplication. *ACM Trans. Math. Softw. 34*, 3 (2008), 125.

6. Goto, K., van de Geijn, R. High-performance implementation of the Level-3 BLAS. *ACM Trans. Mathem. Softw. 35*, 1 (2008), 114.

7. Hamilton, G.W. Extending higher-order deforestation: Transforming programs to eliminate even more trees. In *Proceedings of the Third Scottish Functional Programming Workshop*, Hammond, K. and Curtis, S., eds. (Exeter, UK Aug. 2001). Intellect Books, 2536.

8. Johann, P. Short cut fusion: Proved and improved. In *Proceedings of the 2nd International Workshop on Semantics, Applications, and Implementation of Program Generation.* Volume 2196 of *Lecture Notes in Computer Science* (Florence, Italy, 2001), 4771.

9. Launchbury, J., Peyton Jones, S.L. State in Haskell. *Lisp Symb. Comput.* 8, 4 (1995), 293341.

10. Leshchinskiy, R. Vector: Efficient arrays, Oct 2012. http://hackage.haskell.org/package/vector.

11. Lippmeier, B., Keller, G. Efficient parallel stencil convolution in Haskell. In *Proceedings of the 4th ACM Symposium on Haskell*, Haskell, 11 (2011). ACM, New York, NY, USA, 5970.

12. Marlow, S., Wadler, P. Deforestation for higher-order functions. In *Proceedings of the 1992 Glasgow Workshop on Functional Programming* J. Launchbury and P. Sansom, eds. Workshops in Computing (1993). Springer-Verlag, London, UK, 154165.

13. Peyton Jones, S., Leshchinskiy, R., Keller, G., Chakravarty, M.M.T. Harnessing the multicores: Nested data parallelism in Haskell. In *IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science*, Hariharan, R., Mukund, M., and Vinay, V., eds. Volume 2 of *Leibniz International Proceedings in Informatics (LIPIcs)* (Dagstuhl, Germany, 2008) Schloss Dagstuhl -Leibniz-Zentrum fuer Informatik, 383414.

14. Svenningsson, J. Shortcut fusion for accumulating parameters & zip-like functions. In *Proceedings of the Seventh ACM SIGPLAN International Conference on Functional Programming*, ICFP, 02 (Pittsburgh, PA, 2002). ACM, New York, NY, USA, 124132.

15. Takano, A., Meijer, E. Shortcut deforestation in calculational form. In *Proceedings of the Seventh International Conference on Functional Programming and Computer Architecture*, FPCA, 95 (1995). ACM, New York, NY, USA, 306313.

16. Wadler, P. Deforestation: Transforming programs to eliminate trees. *Theor. Comput. Sci. 73*, 2 (1990), 231248.

* This work was performed while the author was at Microsoft Research Ltd.

a. The prefetch constant 1600 in the listing is 128 * 12 + 64 since the loop index in the generated assembly is offset by -64 bytes.

The original version of this paper was published in *Proceedings of the 18th SIGPLAN International Conference on Functional Programming* (Boston, MA, 2013), 3748.

Figure 1. The MultiType type class and its associated type, Multi.

Figure 2. Single-threaded performance of double-precision dot product implementations. C implementations were compiled using GCC 4.8.1 and compiler options `-03 -msse4.2 -ffast-math -ftree-vectorize -funroll-loops`

. Sizes of the L1, L2, and L3 caches are marked.

Figure 3. Relative performance of single-threaded ddot implementations. All times are normalized relative to the handwritten, compiler-vectorized, and C implementation.

**©2017 ACM 0001-0782/17/05**

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2017 ACM, Inc.

No entries found