Good software engineering practice demands generalization and abstraction, whereas high performance demands specialization and concretization. These goals are at odds, and compilers can only rarely translate expressive high-level programs to modern hardware platforms in a way that makes best use of the available resources.
Generative programming is a promising alternative to fully automatic translation. Instead of writing down the target program directly, developers write a program generator, which produces the target program as its output. The generator can be written in a high-level, generic style and can still produce efficient, specialized target programs. In practice, however, developing high-quality program generators requires a very large effort that is often hard to amortize.
We present lightweight modular staging (LMS), a generative programming approach that lowers this effort significantly. LMS seamlessly combines program generator logic with the generated code in a single program, using only types to distinguish the two stages of execution. Through extensive use of component technology, LMS makes a reusable and extensible compiler framework available at the library level, allowing programmers to tightly integrate domain-specific abstractions and optimizations into the generation process, with common generic optimizations provided by the framework.
Building and managing complex software systems is only possible by generalizing functionality and abstracting from particular use cases. Achieving performance, on the other hand, requires concretizing configurations and specializing code to its particular environment. Generative programming can bridge this gap by translating away abstraction overhead and effectively specializing generic programs.
Generative programming can be broadly classified as static or dynamic. Static code generation happens at compile time; a widely used example is the C++ template language or macro systems in other languages. Dynamic code generation that takes place at program runtime brings additional flexibility, because code can be specialized with respect to parameters not yet available at compile time.
Many computations can naturally be separated into stages distinguished by frequency of execution or availability of information. Staging transformations aim at executing certain pieces of code less often or at a time when performance is less critical. In the context of program generation, multistage programming (MSP, staging for short) as established by Taha and Sheard22 allows programmers to explicitly delay evaluation of a program expression to a later stage (thus, staging an expression). The present stage effectively acts as a code generator that composes (and possibly executes) the program of the next stage. A nice property of this approach is that generator and generated code are expressed in a single program, with the aim that programmers can construct a multistage program from a naive implementation of the same algorithm by adding staging annotations in a selective way.
Basic mechanisms for composing program fragments at runtime have existed for a long time in the form of Lisp's "code is data" model and its use of quasi-quotation: syntactic annotations to denote expressions that should remain unevaluated and to mark holes within them, to be filled in with expressions computed elsewhere. Dedicated MSP languages such as MetaML22 and MetaOCaml1 add well-scoping and well-typing guarantees about the generated code. Despite these advances, incorporating dynamic code generation into larger systems in the form of self-optimizing "active" libraries or adding compilation to embedded domain-specific languages (DSLs) that are implemented as libraries remains a significant challenge.
In this paper, we present lightweight modular staging (LMS), a dynamic code generation approach that further reduces the effort to develop sophisticated program generators. The presentation uses the Scala programming language, but the core concepts are not tied to any language in particular.
The classical introductory staging example is to specialize the power function for a given exponent, assuming that a program will take many different numbers to the same power. Considering the usual implementation,
we want to turn the base
b into a staged expression. Following Carette et al.2 and Hofer et al.,8 and in contrast to the quasi-quotation approach of syntactic annotations, the central idea of LMS is to use types to distinguish the computational stages. In particular, we change
b's declared type from
Rep [Double]. The meaning of having type
Rep [Double] is that
b represents a computation that will yield a
Double in the next stage. We also change the function's return type accordingly.
Now we need to regain the ability to do arithmetic on
b, which is no longer a plain
Double. The second idea of LMS is to package operations on staged types as components. To make its required functionality explicit, we wrap the power function itself up as a component (a trait):
In Scala, traits are similar to classes but they can be used in mix-in composition, a limited form of multiple inheritance.15 The self-type annotation
this: Arith denotes a "requires" relationship: whenever an instance of
Power is created, an instance of a concrete (but unspecified) subclass of
Arith (see Figure 3) that provides staged double arithmetic must be mixed in, too.
At the same time, LMS differs from previous approaches in the following important ways:
The last item deserves more explanation, as it is arguably the most prevalent difficulty that programmers face with other staging approaches. Since freely composing code fragments voids any evaluation order guarantees (such as call-by-value semantics), adding quasi-quotation annotations can easily change the result of a program in unforeseen ways or slow down the program by duplicating computation. In practice, quasi-quotation is therefore often used as a low-level tool, like an "assembly language" for code generation, and combined with application-specific front-end layers that apply high-level code optimizations and ensure correct evaluation order. LMS provides a predictable execution order and makes the composition of staged expressions extensible, and thus removes the need for extra front ends.
LMS is a key constituent of Delite,13, 18 an open-source framework for high-performance parallel DSLs that has been used to develop DSLs for demanding application areas such as machine learning, graph processing or mesh-based partial differential equation solvers. In this context, LMS enables "abstraction without regret." DSL programmers can use arbitrary Scala features to structure their programs in the generator stage, with the comforting knowledge that LMS guarantees to remove these abstractions during staging and no runtime price will need to be paid. Coupled with advanced compiler optimizations such as data parallel loop fusion and architecture-specific data structure transformations, Delite generates efficient code for a variety of parallel platforms like multi-core CPUs and GPUs.
The rest of this paper is structured as follows. Section 2 presents an end-to-end example, turning a naive algorithm into an efficient code generator, while introducing the major LMS components on the way. Section 2.1 discusses representations of staged code, Section 2.2 is concerned with optimizations, Section 2.3 is concerned with target code generation, and Section 2.4 concludes the example by showing how the generated code can be integrated in larger programs. Section 3 describes how additional features can be added, in particular functions and recursion. Section 4 discusses related work. An earlier version of this paper appeared at GPCE 2010. The original version contains a few additional examples and a more thorough discussion of how effectful statements are represented, while the present version has been updated with some new material in Section 3.
In the same way as the simple power function shown above, we can stage far more interesting and practically relevant programs, for example, the fast Fourier transform (FFT). A staged FFT, implemented in MetaOCaml, has been presented by Kiselyov et al.12 Their work is a very good showcase for how staging allows one to transform a simple, unoptimized algorithm into an efficient program generator. Achieving this in the context of MetaOCaml, however, required restructuring the program into monadic style and adding a front-end layer for performing symbolic rewritings. Using our approach of just adding
Rep types, we can go from the naive textbook-algorithm to the staged version (shown in Figure 1) by changing literally two lines of the code:
All that is needed is adding the self-type annotation to import arithmetic and trigonometric operations and changing the type of the real and imaginary components of complex numbers from
Merely changing the types will not provide us with the desired optimizations yet. We see below how we can add the transformations described by Kiselyov et al. to generate the same fixed-size FFT code, corresponding to the famous FFT butterfly networks (see Figure 2). Despite the seemingly naive algorithm, this staged code is free from branches, intermediate data structures, and redundant computations. The important point here is that we can add these transformations without any further changes to the code in Figure 1, just by mixing in the trait
FFT with a few others.
In the remainder of this section, we present the LMS framework that is used to generate the code in Figure 2 from the algorithm in Figure 1. Before considering specific optimizations, let us have a closer look at the definition of
Rep and the traits
Trig. The definitions are given in Figure 3. In trait Base, the declaration
type Rep [+T] defines an abstract type constructor (also called a higher kinded type)
Rep which we take to range over possible representations of staged expressions. Since
Rep is abstract, no concrete representation is defined yet; the declaration merely postulates the existence of some representation.
Arith extends trait
Base and contains only abstract members. These postulate the existence of an implicit lifting of
Doubles to staged values and the usual arithmetic operations on staged expressions of type
Rep [Double]. The restriction to
Doubles is just to keep the presentation concise. Any suitable means to abstract over numeric types, such as the "type class"
Numeric from the Scala standard library, could be used to define
Arith in a generic way for a range of numeric types. Analogous to
Double, we could define arithmetic on matrices and vectors and implement optimizations on those operations in exactly the same way. Trait
Trig is similar to
Arith but defines trigonometric operations.
One way to look at
Base, Arith, and
Trig is as the definition of a typed embedded language (or its syntax). The embedding is tagless (i.e., method resolution happens at compile time without additional runtime dispatch overhead)2 and polymorphic,8 in the sense that we are free to pick any suitable concrete implementation that fulfills the given interface.
From a safety point of view, keeping the actual representation inaccessible to the program generator is very important. Otherwise, the program generator could execute a different code depending on the exact structure of a staged expression. Optimizations that replace staged code with simpler but semantically equivalent expressions would risk changing the meaning of the generated program.
2.1. Representing staged code
With the aim of generating code, we could represent staged expressions directly as strings. But for optimization purposes we would rather have a structured intermediate representation that we can analyze in various ways.
We choose a representation on the basis of expression trees or, more exactly, a "sea of nodes" representation that is in fact a directed (and for the moment, acyclic) graph but can be accessed through a tree-like interface. The necessary infrastructure is defined in trait
Expressions, shown in Figure 4.
There are three categories of objects involved: expressions, which are atomic (subclasses of
Exp: constants and symbols; with a "gensym" operator
fresh to create fresh symbols), definitions, which represent composite operations (subclasses of
Def, to be provided by other components), and blocks, which model nested scopes.
The guiding principle is that each definition has an associated symbol and refers to other definitions only via their symbols. In effect, this means that every composite value will be named, similar to administrative normal form (ANF). The trait
Expressions provides methods to find a definition given a symbol, or vice versa. The extractor object
Def allows one to pattern-match on the definition of a given symbol, a facility that is used for implementing rewritings (see below).
The framework ensures that the code that contains staging operations will always be executed within the dynamic scope of at least one invocation of
reifyBlock, which returns a block object and takes as call-by-name argument the present-stage expression that will compute the staged block result. Block objects can be part of definitions, for example, for loops or conditionals.
The implicit conversion method
toAtom converts a definition to an atomic expression and links it to the scope being built up by the innermost enclosing
reifyBlock call. When the definition is known to be side-effect free,
toAtom will search the already encountered definitions for a structurally equivalent one. If a matching previous definition is found, its symbol will be returned, possibly moving the definition to a parent scope to make it accessible. If the definition has side effects or it is seen for the first time, it will be associated with a fresh symbol and saved for future reference. This simple scheme provides a powerful global value numbering (common subexpression elimination) optimization that effectively prevents generating duplicate code. More complicated optimization schemes can be added, too.
Since all operations in interface traits such as
Rep types, defining
Rep [T] = Exp [T] in trait
BaseExp (see Figure 5) means that conversion to symbols will take place within the constructor methods (e.g.,
infix_*). This fact is important because it establishes our correspondence between the evaluation order of the program generator and the evaluation order of the generated program: at the point where the generator calls
toAtom, the composite definition is turned into an atomic value, that is, its evaluation is recorded now and played back later in the same relative order with respect to others within the closest
We observe that there are no concrete definition classes provided by the trait
Expressions. Providing meaningful data types is the responsibility of other traits that implement the interfaces defined previously (
Base and its descendants). For each interface trait, there is one corresponding core implementation trait. Shown in Figure 5, we have traits
BaseExp, ArithExp, and
TrigExp for the functionality required by the FFT example. The trait
BaseExp installs atomic expressions as the representation of staged values by defining
Rep [T] = Exp [T]. Traits
TrigExp define one definition class for each operation defined by
Trig, respectively, and implement the corresponding interface methods to create instances of those classes.
2.2. Implementing optimizations
Some profitable optimizations, such as the global value numbering described above, are very generic. Other optimizations apply only to specific aspects of functionality, for example, particular implementations of constant folding (or more generally, symbolic rewritings) such as replacing computations like
x * 1.0 with
x. Yet, other optimizations are specific to the actual program being staged. In the FFT case, Kiselyov et al.12 describe a number of rewritings that are particularly effective for the patterns of code generated by the FFT algorithm but not as much as for other programs.
What we want to achieve again is modularity, so that optimizations can be combined in a way that is most useful for a given task. To implement a particular rewriting rule (whether specific or generic), say,
x * 1.0 → x, we can provide a specialized implementation of
infix_* (overriding the one in trait
ArithExp) that will test its arguments for a particular pattern. How this can be done in a modular way is shown by the traits
ArithExpOptFFT, which implement some generic and program-specific optimizations (see Figure 6). Note that the use of
x*y within the body of
infix_* will apply the optimization recursively.
In essence, we are confronted with the classic "expression problem" of independently extending a data model to new data variants and new operations. There are many solutions to this problem but most of them are rather heavyweight. More lightweight implementations are possible in languages that support multi-methods, that is, dispatch method calls dynamically based on the actual types of all the arguments. Figure 6 shows how we can achieve essentially the same (plus deep inspection of the arguments) using pattern matching and mix-in composition, making use of the fact that composing traits is subject to linearization.15 We package each set of arithmetic optimizations into its own trait that inherits from
ArithExp and overrides the desired methods (e.g.,
infix_*). When the arguments do not match the rewriting pattern, the overridden method will invoke the "parent" implementation using
super. When several such traits are combined, the super calls will traverse the overridden method implementations according to the linearization order of their containing traits.
Implementing multi-methods in a statically typed setting usually poses three problems: separate type checking/compilation, ensuring nonambiguity, and ensuring exhaustiveness. The described encoding supports separate type-checking and compilation as far as traits do. Ambiguity is ruled out by always following the linearization order and the first-match semantics of pattern matching. Exhaustiveness is ensured at the type level by requiring a default implementation, although no guarantees can be made that the default will not choose to throw an exception at runtime. In the particular case of applying optimizations, the default is always safe as it will just create an expression object.
2.3. Generating code
Code generation is an explicit operation. For the common case where generated code is to be loaded immediately into the running program, the trait
Compile provides a suitable interface in form of the abstract method
compile (see Figure 8). The contract of
compile is to "unstage" a function from staged to staged values into a function operating on present-stage values that can be used just like any other function object in the running program.
For generating Scala code, an implementation of the compilation interface is provided by trait
CompileScala. This trait extends another trait,
ScalaGenBase, whose subclasses are responsible for traversing nested blocks and generating Scala code for individual definition nodes. Of course, the traversal code can also be factored out and shared by multiple code generation targets. Subclasses of
ScalaGenBase are structured in a similar way as those of
Base, that is, one for each unit of functionality (see Figure 9). Code generation can omit side-effect free graph nodes that are unreachable from the final result, effectively performing a dead code elimination optimization.
The overall compilation logic of
CompileScala is relatively simple: emit a class and
apply-method declaration header, emit instructions for each definition node according to the schedule, close the source file, invoke the Scala compiler, load the generated class file, and return a newly instantiated object of that class.
2.4. Putting it all together
In the previous sections, we have discussed the major ingredients of lightweight modular staging, focusing mostly on individual components. Figure 7 shows an overview of the traits encountered so far and their relationships.
Using the staged FFT implementation as part of some larger Scala program is straightforward but requires us to interface the generic algorithm with a concrete data representation. The algorithm in Figure 1 expects an array of
Complex objects as input, each of which contains fields of type
Rep [Double]. The algorithm itself has no notion of staged arrays but uses arrays only in the generator stage, which means that it is agnostic to how data is stored. The enclosing program, however, will store arrays of complex numbers in some native format which we will need to feed into the algorithm. A simple choice of representation is to use
Array [Double] with the complex numbers flattened into adjacent slots. When applying
compile, we will thus receive input of type
Rep [Array [Double]]. Figure 10 shows how we can extend the trait
FFTC to obtain compiled FFT implementations that realize the necessary data interface for a fixed input size.
We can then define a code that creates and uses compiled FFT "codelets" by extending
Constructing an instance of this subtrait (mixed in with the appropriate LMS traits) will execute the embedded code:
We can also use the compiled methods from outside the object:
Providing an explicit type in the definition
val OP: TestFFC = ... ensures that the internal representation is not accessible to the outside, but only the members defined by
TestFFC are accessible.
Many features can be added in a way that is analogous to what we have seen above, but some require a bit more thought. In this section, we take a closer look at staged functions. Basic support for staged function definitions and function applications can be defined in terms of a simple higher order abstract syntax (HOAS)16 representation, similar to those of Carette et al.2 and Hofer et al.8 (see Figure 11). The idea is to provide a
lambda operation that transforms present-stage functions over staged values (type
Rep [A] =>
Rep [B]) to staged function values (type
Rep [A=>B]). Note how this is similar to the signature of compile. To give an example, the staged recursive factorial function will look like this:
As opposed to the earlier power example, an invocation
fac (m) will not inline the definition of
fac but will result in an actual function call in the generated code.
However, the HOAS representation has the disadvantage of being opaque: there is no immediate way to "look into" a Scala function object. If we want to treat functions in the same way as other program constructs, we need a way to transform the HOAS encoding into our graph representation. We can implement
lambda (f) to call
which will unfold the function definition into a Block that represents the entire computation defined by the function. But eagerly expanding function definitions is problematic. For recursive functions, the result would be infinite, that is, the computation will not terminate. What we would like to do instead is to detect recursion and generate a finite representation that makes the recursive call explicit. However, this is difficult because recursion might be very indirect:
Each incarnation of
foo creates a new function
f; unfolding will thus create unboundedly many different function objects.
To detect cycles, we have to compare those functions. This, of course, is undecidable in the general case of taking equality to be defined extensionally, that is, saying that two functions are equal if they map equal inputs to equal outputs. By contrast, the standard reference equality, which compares memory addresses of function objects, is too weak for our purpose:
However, we can approximate extensional equality by intensional (i.e., structural) equality, which is sufficient in most cases because recursion will cycle through a well-defined code path in the program text. Testing intensional equality amounts to checking whether two functions are defined at the same syntactic location in the source program and whether all data referenced by their free variables is equal. Fortunately, the implementation of first-class functions as closure objects offers (at least in principle) access to a "defunctionalized" data type representation on which equality can easily be checked. A bit of care must be taken though, because the structure can be cyclic. On the java virtual machine (JVM), there is a particularly neat trick. We can serialize the function objects into a byte array and compare the serialized representations:
With this method of testing equality, we can implement controlled unfolding. Unfolding functions only once at the definition site and associating a fresh symbol with the function being unfolded allows us to construct a block that contains a recursive call to the symbol we created. Thus, we can create the expected representation for the factorial function above.
3.1. Guarantees by construction
Making staged functions explicit through the use of
lambda enables tight control over how functions are structured and composed. For example, functions with multiple parameters can be specialized for a subset of the parameters. Consider the following implementation of Ackermann's function:
ack (m)(n) will produce a set of mutually recursive functions, each specialized to a particular value of
ack (2)(n), for example, we get
In essence, this pattern implements what is known as "polyvariant specialization" in the partial evaluation community. But unlike automatic partial evaluation, which might or might not be able to discover the right specialization, the use of staging provides strong guarantees about the structure of the generated code. In this case, we are guaranteed that specialization will happen for each value of m (but never for n), statically evaluating tests on values of m and inserting constants for all occurrences of m in the generated code.
Other strong guarantees can be achieved by restricting the interface of function definitions. Being of type
Rep [A=>B], the result of
lambda is a first-class value in the generated code that can be stored or passed around in arbitrary ways. However, we might want to avoid higher order control flow in generated code for efficiency reasons, or to simplify subsequent analysis passes. In this case, we can define a new function constructor
fundef as follows:
The use of
fundef instead of
lambda produces a restricted function that can only be applied but not passed around in the generated code (type
Rep [A]=> Rep [B]). At the same time, a result of
fundef is still a first class value in the code generator. If we do not expose
apply at all to client code, we obtain a guarantee that each function call site unambiguously identifies the function definition being called and no closure objects will need to be created at runtime.
Static program generation approaches include C++ templates, and Template Haskell.20 Building on C++ templates, customizable generation approaches are possible through Expression Templates.23 An example of runtime code generation in C++ is the TaskGraph framework. Active libraries were introduced by Veldhuizen24 and telescoping languages by Kennedy at al.11 Specific toolkits using domain-specific code generation and optimization include FFTW,6 SPIRAL,17 and ATLAS.25
This paper draws a lot of inspiration from the work of Kiselyov et al.12 on a staged FFT implementation. Performing symbolic rewritings by defining operators on lifted expressions and performing common subexpression elimination on the fly is also central to their approach. LMS takes these ideas one step further by making them a central part of the staging framework itself.
Immediately related works on embedding typed languages include those of Carette et al.2 and Hofer et al.8 Lee et al.13, 18 describe how LMS is used in the development of DSLs for high-performance parallel computing on heterogenous platforms.
Multistage Programming Languages such as MetaML22 and MetaOCaml1 have been proposed as a disciplined approach to building code generators. These languages provide three syntactic annotations: brackets, escape, and run, which together provide a syntactic quasi-quotation facility that is similar to that found in Lisp but statically scoped and statically typed.
MSP languages make writing program generators easier and safer, but they inherit the essentially syntactic notion of combining program fragments, which incurs the risk of duplicating or reordering computation.3, 21 Code duplication can be avoided systematically by writing the generator in continuation-passing or monadic style, using appropriate combinators to insert
let-bindings in strategic places. However, this is often impractical since this style significantly complicates the generator code. Another possibility is to make extensive use of side effects in the generator part, but side effects invalidate some of the static guarantees of MSP languages. This dilemma has been described as an "agonizing trade-off," because of which one "cannot achieve clarity, safety, and efficiency at the same time."10
By contrast, lightweight modular staging prevents code duplication by handling the necessary side effects inside the staging primitives, which are semantic combinators instead of syntactic expanders. Therefore, code generators can usually be written in purely functional direct style and are much less likely to invalidate safety assurances.
Another characteristic of some MSP languages is that staged code cannot be inspected because of safety considerations. Thus, domain-specific optimizations must happen on an external intermediate representation, before using the MSP primitives to generate code.12 The burden of choosing and implementing a suitable representation is on the programmer, and it is not clear how different representations can be combined or reused.
Lightweight modular staging provides a systematic interface for adding symbolic rewritings. Safety is maintained by exposing the internal code structure only to rewriting modules but keeping it hidden from the client generator code.
Compiled embedded DSLs, as studied by Leijen and Meijer14 and Elliott et al.,5 can also be implemented using MSP languages by writing an explicit interpreter and adding staging annotations in a second step.4, 7, 19 This is simpler than writing a full compiler, but compared to constructing explicit interpreters, "purely" embedded languages that are implemented as plain libraries have many advantages.9 LMS allows a simpler approach, by starting with a pure embedding instead of an explicit interpreter. In many cases, adding some type annotations in strategic places is all that is needed to get to a staged embedding. If domain-specific optimizations are needed, new AST classes and rewriting rules are easily added.
The development of lightweight modular staging has benefited greatly from the work on Delite in collaboration with the Stanford Pervasive Parallelism Lab, in particular Arvind Sujeeth, Hassan Chafi, Kevin Brown, HyoukJoong Lee and Kunle Olukotun. A number of members of the authors' group at EPFL have applied LMS to interesting use cases and contributed new insights or valuable extensions to the framework.
1. Calcagno, C., Taha, W., Huang, L., Leroy, X. Implementing multi-stage languages using asts, gensym, and reflection. GPCE, F. Pfenning and Y. smaragdakis, eds. Volume 2830 of Lecture Notes in Computer Science (2003). Springer, 5776.
3. Cohen, A., Donadio, S., Garzarán, M.J., Herrmann, C.A., Kiselyov, O., Padua, D.A. In search of a program generator to implement generic transformations for high-performance computing. Sci. Comput. Program. 62, 1 (2006), 2546.
11. Kennedy, K., Broom, B., Cooper, K.D., Dongarra, J., Fowler, R.J., Gannon, D., Johnsson, S.L., Mellor-Crummey, J.M., Torczon, L. Telescoping languages: A strategy for automatic generation of scientific problem-solving systems from annotated libraries. J. Parallel Distrib, Comput. 61, 12 (2001), 18031826.
17. Püschel, M., Moura, J.M.F., Singer, B., Xiong, J., Johnson, J., Padua, D.A., Veloso, M.M., Johnson, R.W. Spiral: A generator for platform-adapted libraries of signal processing alogorithms. IJHPCA, 18, 1 (2004), 2145.
A previous version of this paper was published in the Proceedings of the 9th International Conference on Generative Programming and Component Engineering (Oct. 1013, Eindhoven, The Netherlands).
©2012 ACM 0001-0782/12/0600 $10.00
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from firstname.lastname@example.org or fax (212) 869-0481.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2012 ACM, Inc.
No entries found