Domain-specific languages (DSLS) make programs shorter and easier to write. They can be stand-alone—for example, LaTeX, Makefiles, and SQL—or they can be embedded in a host language. You might think that DSLs embedded in high-level languages would be abstract or mathematically oriented, far from the nitty-gritty of low-level programming. This is not the case. This article demonstrates how high-level embedded DSLs (EDSLs) really can ease low-level programming. There is no contradiction.
A gentle introduction to EDSLs can be found in the previous article in this series: “Domain-specific Languages and Code Synthesis Using Haskell,” in which Andy Gill considers the pros and cons of implementing a deeply embedded DSL, compared with a stand-alone compiler. Here, the story continues, posing a slightly different question: If you need to produce high-performance, low-level code in a language such as C or CUDA, is it worthwhile to use (or create) a code-generating EDSL, or should you just buckle down and write the low-level code by hand? Our answer is the EDSL may well be a reasonable option.
In 2010, one of this article’s authors was tasked with porting high-performance benchmarks to run on a new computer architecture. One of these benchmarks was a parallel sort designed to use vector instructions6—SSE (streaming SIMD extensions) to be precise. To accomplish vectorization, this sort bottoms out to a bitonic sorting network once the arrays reach a small size. Sorting networks such as these are directly vectorizable. Of course, benchmarks of this kind have no reputation for maintaining clean abstractions in the code. Indeed, this particular benchmark had hard-coded sorting networks with 16, 32, and 64 inputs, using many pages of repetitive and inscrutable code:
This representation of the program has obvious drawbacks. For one, how can it be ported to a new architecture with different vector instructions? It’s impossible—you have to write a new version. Yet, a glance at the bitonic sorter Wikipedia entry (http://en.wikipedia.org/wiki/Bitonic_sorter) shows the algorithm is a simple and elegant recursive function. Why can’t that essential structure be captured, removing the repetition from the example? Repetition can at least be reduced using C preprocessor macros:
SORT16
can then be reused to define SORT32
. Still, C preprocessor macros do not offer the necessary compile-time programmability. Rather, the best solution is to write a simple program generator script that prints the repetitive code here to a file as output.
The lesson here is that software engineers should not hesitate to write programs that generate programs. If you can recognize regularities in your programs, then program generation is probably an option. There are plenty of sophisticated technologies to help with this metaprogramming (MetaOCaml, Scheme macros, Template Haskell, C++ template metaprogramming, among others). Indeed, because the DSLs discussed here are embedded, their host programs are in fact program generators (that is, metaprograms). These DSLs, however, also impose additional structure that yields several benefits.
DSLs offer safety. The script alluded to before is very primitive; it emits each generated program directly as a string. An EDSL instead emits an abstract syntax tree for the DSL program, and with that you can do whatever you like (see Andy Gill’s article on page 42 for more details). In fact, if structured properly, an EDSL used to generate code can offer safety guarantees in the generated code; a well-typed EDSL program should guarantee well-typed generated code. This is a great way to find errors early. For example, if the EDSL is embedded in Haskell, the Haskell type checker finds errors that would have led to broken low-level programs. Finding those errors by debugging the low-level generated code could be much slower—for example, in embedded systems where build and debug times are long.
You can go even further and restrict the DSL to particular idioms so as to gain extra safety properties. For example, you can guarantee only memory-safe generated programs, as in Galois Inc.’s C code generation in its Ivory DSL (http://smaccmpilot.org/languages/).
DSLs enable smarter compilers. These days, DSLs for high performance focus on parallelism, but the degree to which parallel implementation details are exposed to the user varies among DSLs. Some, such as SQL and very high-level array DSLs,2,3,4,11,16,21,22 hide everything but abstract data transformations, which are naturally parallel. These compilers can achieve good performance by limiting the kinds of communication allowed between parallel code regions, enforcing structured data-processing patterns such as map and reduce, and removing features that make auto-parallelization difficult (for example, aliasing, pointers, and arbitrary control flow). By leveraging these restrictions, you can often apply radical code transformations. An example of this would be fusion, the removal of intermediate data structures (for example, replacing map f (map g arr)
with map (f ° g) arr
).
When this approach works well—that is, when the compiler can map the latent parallelism in the declarative specification onto the target architecture efficiently—it is an attractive option. Sometimes, though, experimentation is necessary to find a good parallel decomposition that matches the target, which might be a GPU or FPGA, for example. The user then wants not only fine control over the generated code, but also easy ways to change it. For example, Kansas Lava, the hardware description language described in the aforementioned article by Andy Gill, is expressly intended for design-space exploration, and case studies in forward error correctors show how this leads to high-performance designs. At a higher level of abstraction, the Bluespec (http://www.bluespec.com/high-level-synthesis-tools.html) behavioral language allows algorithm experts to generate high-performance FPGA accelerators. A major selling point is that it is quick and easy to do architectural exploration, particularly using highly parameterized DSL code.
These same benefits apply to low-level software domains, as well as hardware. In general, not all performance tweaking and tuning can be automated, which creates a need for DSLs that enable systematic user-guided performance exploration.
DSLs abstract code-generation tactics. How should we give the programmer control over the generated code, along with easy ways to vary it? Because EDSL programs perform arbitrary computation at program-generation time, they can also encapsulate reusable code-generation tactics, in the guise of regular functions and data types. Let’s take a look at an example: deferred array representations,1,5,7,10 which represent the ability to generate elements (in this, they are similar to the concepts of generators or iterators found in many languages, but the specifics are quite different), rather than elements already residing in memory.
To explain deferred arrays, let’s start with a simple program using the map
and reduce
functions from before:
Does arr2
require memory for its storage (and, therefore, memory traffic to read and write it)? In most traditional languages the answer is an unqualified yes. In high-level array DSLs—which may or may not use deferred arrays—fusion optimizations may eliminate the intermediate array. This is possible because in many of these DSLs,2,4,11,16 arrays are immutable and the DSLs themselves may even be side-effect free, making fusion opportunities easy to recognize in the compiler. Thus, very often a map
operation can fuse into a reduce
operation, which means the (parallel or sequential) loop that is eventually created for reduce
contains the map
‘d function inside that loop.
In the example, however, arr2
is used twice. Most array DSLs will choose not to inline arr2
if it means duplicating work, or they will use a heuristic based on a static cost estimate for f
to decide whether work duplication is worth it to enable fusion. Explicitly deferred arrays, on the other hand, leave this decision up to the programmer. They represent an array in terms of the computation that generates it, and support a fusion-by-default approach. The map
in the previous example would be duplicated and fused into each reduce
, unless the user explicitly makes an array “real” in memory by applying a function called force
. Moreover, deferred arrays gain an additional benefit in the embedded DSL context: they never suffer function-call overhead for the functions used to represent arrays, because they are inlined as the EDSL host program performs code generation.
Deferred-array data types is one example of encapsulating code-generation tactics with a nice API. They provide programmers with high-level array operations, while retaining control over memory usage. They can also help avoid bounds checks without risk, when composing known producer and consumer functions from the provided library (for example, map
and reduce
). Later, we describe a language that provides even more control, with different control-flow patterns encapsulated in different variants of deferred arrays (push and pull). Finally, the article shows how deferred arrays can be used to make operations such as map
highly generic (for example, a map
operation that, depending on the context in which it is applied, generates a sequential loop or a parallel loop, and, if parallel, can operate at one of multiple scales inside a hierarchical parallel architecture, generating very different code in each case).
DSLs can mix it up (design exploration). That last point—that high-level data operators can generate different code based on the context in which they are used—moves closer to the goal of design exploration. In what other ways can a DSL allow programmers to explore different implementation strategies, while changing many fewer lines of code than if they had to convert the generated code manually? One way is to take advantage of program generation to build highly parameterized designs. In a hardware-synthesis language, such as Bluespec or Lava, one wants designs that are flexible and do not commit to a specific circuit area. Likewise, later in this article, you will see GPU programs that are parameterized over the number of GPU hardware threads to use, sometimes changing the generated program structure (injecting sequential loops) if not enough threads are provided. This just-in-time determination of program structure would be difficult to accomplish in handwritten code. With it, we might accept a list of GPU programs in the host program and provision the GPU to run all those computations continuously and simultaneously, with each program adapting to the number of threads it is allotted.
Finally, with increased design-exploration power, a common desire is to auto-tune by generating many variants of a program, testing them all, and selecting the winner. Several DSLs internally perform such an auto-tuning process,8,14,15 though that is beyond the scope of this article. Instead, this article aims to help you weigh the pros and cons of embedded DSLs that generate low-level code. It does so by presenting one concrete example: Obsidian, an embedded DSL that generates CUDA for GPU programming. First, let’s review the CUDA programming model, before describing Obsidian in the remainder of this article.
A CUDA Primer
CUDA is a C dialect for GPU programming that NVIDIA provides. (Readers with previous experience with CUDA can skip this section.)
GPUs began as hardware accelerators for generating computer graphics, but have become increasingly general purpose, driven both by graphics programmers who want more flexibility and by a new breed of programmer, motivated not by graphics but by a desire to accelerate the data-parallel parts of non-graphics applications.
How should we give the programmer control over the generated code, along with easy ways to vary it?
A typical computer system containing a CUDA-capable GPU is split into two parts: the host, which refers to the CPU and main memory of the computer; and the device, which consists of the GPU and its associated memory. The device usually comes in the form of a PCI Express card holding the GPU and memory. The GPU consists of a number of multiprocessors (MPs), each containing some number of functional units (CUDA cores in NVIDIA terminology). Each MP also contains local shared memory, which can be regarded as a programmer-managed cache. The CUDA cores within an MP can communicate via this shared memory.
Massive parallelism. A GPU is capable of managing a large number of threads in flight. It thrives when thousands of threads (doing mostly identical work) are launched together. It has a hierarchical structure managing all these threads. Onto each MP are launched several blocks of threads, each consisting of up to 1,024 threads. Threads within a block (and only within the block) can communicate using the shared memory of the MP. The collection of blocks launched onto the GPU is called the grid. A group of 32 consecutive threads within a block is a warp. Warps execute in lock-step; all threads perform the same instruction at any given moment. Threads within a warp that do not partake in the computation are turned off.
Scalable architecture. GPUs may contain as little as one or as many as 15 MPs. A CUDA program should be able to run on any of the GPUs along the scale. The CUDA programming model mirrors the GPU hierarchy. The programmer writes an SPMD (single-program multiple-data) program that is executed by the threads within a block. Many instances of this program also execute across all blocks launched. All instances are independent and can be launched in any order, or in parallel, across available MPs.
The upshot of all this is that programming a GPU really doesn’t much resemble programming a multicore machine with two, four, or eight cores. Programmers must launch thousands of threads and figure out how they should communicate within the constraints of the GPU. This means they need to think about warps, blocks, and grids, like it or not. (As will be seen, however, a DSL can still make it easier to map abstract data transforms onto the hierarchy.)
The references at the end of this article include links to further reading on CUDA programming and GPU architecture.12,13
The Obsidian Language
In CUDA, parallel for
loops are implicit; the computation is described at the element level. Which elements to access (where to load and store data) is expressed as a function of a thread’s identity—that is, threads must ask “where am I?” and answer the question by computing with blockIdx, blockDim
, and threadIdx
. Obsidian programs, in contrast, describe array-to-array computations at an aggregate level, replacing indexing arithmetic with collective operations. For example:
The vecAdd
function takes two arrays, v1
and v2
, as input and performs elementwise addition using the zipWith
library function (zipWith
is just a map
over two arrays; see Figure 1 for a selection of library functions). vecAdd
is not a complete Obsidian program, however; information about how to map this program onto the GPU hierarchy is needed and will be provided in the program’s types.
To give precise types to operations such as vecAdd
requires an understanding of deferred arrays described in the first section. Specifically, there are push and pull array variants. The vecAdd
program operates on pull arrays. Its type is:
That is, vecAdd
takes two pull arrays of EFloat
values as input and returns a pull array. (Values with types prefixed by E
are actually expression trees, as we will explain.) Pull arrays are implemented as a length plus a function from index to value:
This representation of arrays provides fusion of operations for free. For example, with the above definition of a pull array, the following equations show how a function mapped over a pull array is actually absorbed (composed) into the function-based representation it already uses:
No intermediate data is written to memory for the output of g
, but in some cases it is desirable to write the elements to memory—for example, revisiting the example from the first section, but now assuming that f2
is expensive in the following code:
Here arr2
is used twice, and the programmer should have the power to ensure that arr2
is computed and fully stored in memory. This is where force
functions come in:
The reason for switching to Haskell’s do
notation is that forcePull
is a monadic function. Monads are Haskell’s way of encoding side effects, and in this program the monad in question is Obsidian’s Program
monad, which roughly corresponds to CUDA code. The return type of forcePull
is Program t (Pull a)
, where t
is a parameter designating a level in the GPU hierarchy (Thread, Warp, Block
, or Grid
).
With the Program nomad, we can define push arrays as well:
Like pull arrays, push arrays consist of a length and a function. The difference is that the function in the push array (the filler function) results in a program that encodes an iteration schema at level t
in the hierarchy. The argument to the filler function is itself a function (the receiver function). A pull array fulfills requests for specific elements, whereas a push array only allows a bulk request to push all elements via a receiver function. This relationship is pictured in Figure 2. When invoked, the filler function creates the loop structure, but it inlines the code for the receiver inside that loop. A push array with its elements computed by f
and receiver rcv
might generate: for(i∈[1, N]) {rcv(i,f(i));}
More specifically, when forcing a push array to memory, each invocation of rcv
would write one memory location, A[i] = f(i)
.
The reason Obsidian has both pull and push arrays is related to performance. Some operations are very easy to express on pull arrays and result in good code, such as map, zipWith
and permutations. An operation such as append
, however, which takes two pull arrays and concatenates them, leads to a conditional within the lookup function. When forcing such a concatenated array, this conditional will be executed in every iteration of the generated for
loop (or by each thread). Push arrays, on the other hand, encode their own iteration schema, so rather than executing a conditional, when concatenating push arrays their inherent loop schemas are executed in sequence, neither loop executing a conditional in each iteration.
Push and Pull Conversion
The result of force
(for push arrays) and forcePull
(for pull arrays) is always a pull array. That means a push array can be converted to a pull array by application of force
. Applying force
always has a cost in memory, however, and it realizes the full cost of computing all elements.
Going in the other direction—converting from pull array to push array—requires a function called push
. This function does not have any cost directly associated with it. It merely changes the representation (though switching to the push representation means losing the ability to compute only part of the array). Note, however, that pull arrays do not have the t
parameter like push arrays—they are hierarchy-level agnostic. Using push
on a pull array and fixing the t
parameter of the result locks the array in a given level of the hierarchy, as in the following:
Now the vecAdd
program is complete: push
converts the result array to a push array, and the result type is fixed as Push Grid
, so the iteration becomes Grid
-level parallel.
Design-space exploration. Even a simple operation such as reduction requires design-space exploration for good performance on the GPU. Sequential reductions per thread can be combined with parallel binary-tree-shaped reduction over multiple threads in a block. Both the degree of sequential work and the number of threads used for the parallel work can be varied. This section describes a local (on-chip, shared-memory) reduce
algorithm. It performs one reduction per GPU block. This can serve as a building block in a full-scale reduction algorithm over all the GPU threads.
The code shown in Figure 3 implements a recursive parallel reduction. In each step, the input array is split in the middle, and the elements from the two resulting arrays are added together pairwise, resulting in an intermediate array half as long. The intermediate array is forced to memory using forcePull
, writing to shared memory using one thread per element (in this case, at the Block
level).
Next, many instances of the block-level reduction algorithm are combined to form a grid computation. The code noted in Figure 4 distributes the computation over multiple blocks using map
and pConcat
. The pConcat
function is for nested parallelism; it concatenates the results computed in each block and is in charge of the parallel distribution of work across blocks. (A related function is sConcat
, which performs sequential work within one block or one thread.)
The forcePull
function prevents fusion of operations, so leaving it out in some stages of the reduction is a way of trading off sequential and parallel execution. An optForce
n function could be written that forces only arrays shorter than n elements:
The optForce
function could replace forcePull
in reduceLocal
. The n
parameter in optForce
is one possible tuning parameter to add to reduceLocal;
another is the splitUp
factor (4096
above). By parameterizing on the force cutoff and the splitUp
factor, a family of different reduction codes can be generated from the same EDSL description.
Obsidian provides yet another tuning parameter by default: the number of real GPU threads to allow per block. This parameter is given by the Obsidian user to the code generator and can range over the number of threads per block supported by CUDA (1–1,024). Given the execution model of GPUs, however, only multiples of the warp size (32) make sense.
Allowing the parameters to vary over the following ranges:
splitUp
factor: s∈{32,64, …, 8192}optForce
cut-off: c ∈{32,64, …, s}- Real Threads: t ∈{32,64,96, …, min(1024,s/2)}
would create 929 variants of the reduction code for auto-tuning. For benchmark results for reduction kernels, refer to the authors’ recent technical report.20
Generating CUDA Code from the EDSL
Up to this point we have seen the programmer’s view of Obsidian, with dips into implementation details here and there. But how does it all work? The first article in this series (Gill) shows much of the needed infrastructure: overloading, reification, and expression data types. Obsidian is no different; the EWord32
and EFloat
types seen throughout this article correspond to expression trees. The Program
monad is reified into abstract syntax representing a program in an imperative, CUDA-like language.
Even a simple operation such as reduction requires design-space exploration for good performance on the GPU.
One new concept presented here, compared with Gill’s article, is the use of pull and push arrays. These deferred arrays rely on extracting program fragments as abstract syntax (that is, deep embedding), but the functions that represent push and pull arrays are not themselves captured in abstract syntax. Rather, they are like macros that desugar during Haskell execution, before the Obsidian CUDA-emitting compiler is invoked (shallow embedding).17
An outline of CUDA code generation would include:
- Function reification. EDSL functions (such as
vecAdd)
are applied to a symbolic array, causing it to be evaluated as a Haskell program and yielding its result. The functions Obsidian can reify have push arrays as result. After the application to a symbolic array, therefore, you have a push array. The push array is a function that generates aProgram
, and applying that push array’s filler function to yet another symbolic input extracts a final, completeProgram
. - Monadic reification. Because the
Program
type is a monad, a monad reification technique is applied (see Gill), and the result is an AST (abstract syntax tree) describing a program in a CUDA-like imperative language. This AST, however, has explicit parallel for loops rather than CUDA’s implicit ones. - Shared memory analysis. All intermediate arrays (resulting from
force)
have unique names in the CUDA-like AST, but they still must be laid out in shared memory. Liveness analysis finds the first and last use of each array, and a memory map is generated. Named arrays in the AST are replaced by exact locations in shared memory. - CUDA generation and thread virtualization. This is the final step in code generation. Most aspects of the CUDA-like AST translate directly into CUDA source. The explicit parallel loops, which may have larger iteration spaces than the number of actual CUDA threads, need some further conversion. If the AST contains a loop, “
parFor(i∈[1,512]) {body}),"
but the code generator was instructed to use only 256 threads, then thisparFor
will happen in two stages, implemented by injecting an additional sequential loop.
Conclusion
This article has highlighted a number of potential benefits of DSL technology. If you are going to give embedded DSLs a try, however, there are some downsides to watch out for as well. First, error messages can be problematic, as these are expressed in the terminology of the host language. If Obsidian programmers try to use some feature at the incorrect level in the GPU hierarchy (for example, a too-deep nesting of parallel for
loops by applying too many pConcats
), then the error message will likely state there are missing instances of some Haskell class, rather than explaining why what they tried cannot be done on a GPU. Furthermore, new DSLs naturally lack the library and tooling ecosystems of large established languages.
Using an EDSL such as Obsidian, which exposes underlying architectural structure, allows control over the details that determine performance. Choosing exactly what to abstract and what to expose, however, remains important. For example, in Obsidian programmers can trade off sequential and parallel work and use shared memory (via force
), but the details of shared-memory layout of arrays, or managing their lifetimes in that memory, are automated. Further, just as important is what the DSL prevents the user from doing. In Obsidian, programmers are gently, but firmly, prevented from writing DSL code that would produce CUDA that does not match what the GPU can cope with, such as the too-deep nestings just mentioned.
Writing CUDA code and hand-tuning for performance is often a tedious task. Design decisions made early may be difficult to revert and require much of the code already written to be replaced. It is thus appealing to generate variants of CUDA code from a highly parameterized description in a DSL, and then select the best performing, empirically. Our experiments in previous work20 show the best variant of particular code may differ from one GPU to another.
How much work does it take to produce a code-generating DSL such as Obsidian? In Obsidian’s case there have been years of experiments and redesigns—leading to a Ph.D.!18 But do not let that put you off. The current version is relatively small (2,857 lines in the language implementation and 1,629 lines in the compiler part), and it is freely available to study and copy.19 Further, modern libraries for parsing, code generation from templates (for example, http://hackage.haskell.org/package/language-c-quote), and compiler construction9 make constructing DSLs easier than expected.
So, the next time you find yourself with repetitive low-level code, why not consider an embedded code-generating DSL instead?
Acknowledgments
Push arrays were invented by Koen Claessen. We thank Andy Gill for his valuable feedback on this article. Work on Feldspar and Obsidian was supported by the Swedish Foundation for Strategic Research in the Resource Aware Functional Programming (RAW FP) project and by the National Science Foundation award 1337242.
Related articles
on queue.acm.org
Purpose-Built Languages
Mike Shapiro
http://queue.acm.org/detail.cfm?id=1508217
The Ideal HPC Programming Language
Eugene Loh
http://queue.acm.org/detail.cfm?id=1820518
Creating Languages in Racket
Matthew Flatt
http://queue.acm.org/detail.cfm?id=2068896
Join the Discussion (0)
Become a Member or Sign In to Post a Comment