Research and Advances
Computing Applications Research highlights

Lightweight Modular Staging: A Pragmatic Approach to Runtime Code Generation and Compiled DSLs

  1. Abstract
  2. 1. Introduction
  3. 2. An End-to-End Example
  4. 3. Adding More Features
  5. 4. Related Work
  6. Acknowledgments
  7. References
  8. Authors
  9. Footnotes
  10. Figures
Read the related Technical Perspective
flow chart illustration

Good software engineering practice demands generalization and abstraction, whereas high performance demands specialization and concretization. These goals are at odds, and compilers can only rarely translate expressive high-level programs to modern hardware platforms in a way that makes best use of the available resources.

Generative programming is a promising alternative to fully automatic translation. Instead of writing down the target program directly, developers write a program generator, which produces the target program as its output. The generator can be written in a high-level, generic style and can still produce efficient, specialized target programs. In practice, however, developing high-quality program generators requires a very large effort that is often hard to amortize.

We present lightweight modular staging (LMS), a generative programming approach that lowers this effort significantly. LMS seamlessly combines program generator logic with the generated code in a single program, using only types to distinguish the two stages of execution. Through extensive use of component technology, LMS makes a reusable and extensible compiler framework available at the library level, allowing programmers to tightly integrate domain-specific abstractions and optimizations into the generation process, with common generic optimizations provided by the framework.

LMS is well suited to develop embedded domain-specific languages (DSLs) and has been used to develop powerful performance-oriented DSLs for demanding domains such as machine learning, with code generation for heterogeneous platforms including GPUs. LMS has also been used to generate SQL for embedded database queries and JavaScript for web applications.

Back to Top

1. Introduction

Building and managing complex software systems is only possible by generalizing functionality and abstracting from particular use cases. Achieving performance, on the other hand, requires concretizing configurations and specializing code to its particular environment. Generative programming can bridge this gap by translating away abstraction overhead and effectively specializing generic programs.

Generative programming can be broadly classified as static or dynamic. Static code generation happens at compile time; a widely used example is the C++ template language or macro systems in other languages. Dynamic code generation that takes place at program runtime brings additional flexibility, because code can be specialized with respect to parameters not yet available at compile time.

Many computations can naturally be separated into stages distinguished by frequency of execution or availability of information. Staging transformations aim at executing certain pieces of code less often or at a time when performance is less critical. In the context of program generation, multistage programming (MSP, staging for short) as established by Taha and Sheard22 allows programmers to explicitly delay evaluation of a program expression to a later stage (thus, staging an expression). The present stage effectively acts as a code generator that composes (and possibly executes) the program of the next stage. A nice property of this approach is that generator and generated code are expressed in a single program, with the aim that programmers can construct a multistage program from a naive implementation of the same algorithm by adding staging annotations in a selective way.

Basic mechanisms for composing program fragments at runtime have existed for a long time in the form of Lisp’s “code is data” model and its use of quasi-quotation: syntactic annotations to denote expressions that should remain unevaluated and to mark holes within them, to be filled in with expressions computed elsewhere. Dedicated MSP languages such as MetaML22 and MetaOCaml1 add well-scoping and well-typing guarantees about the generated code. Despite these advances, incorporating dynamic code generation into larger systems in the form of self-optimizing “active” libraries or adding compilation to embedded domain-specific languages (DSLs) that are implemented as libraries remains a significant challenge.

*  1.1. Contributions

In this paper, we present lightweight modular staging (LMS), a dynamic code generation approach that further reduces the effort to develop sophisticated program generators. The presentation uses the Scala programming language, but the core concepts are not tied to any language in particular.

The classical introductory staging example is to specialize the power function for a given exponent, assuming that a program will take many different numbers to the same power. Considering the usual implementation,


we want to turn the base b into a staged expression. Following Carette et al.2 and Hofer et al.,8 and in contrast to the quasi-quotation approach of syntactic annotations, the central idea of LMS is to use types to distinguish the computational stages. In particular, we change b‘s declared type from Double to Rep [Double]. The meaning of having type Rep [Double] is that b represents a computation that will yield a Double in the next stage. We also change the function’s return type accordingly.

Now we need to regain the ability to do arithmetic on b, which is no longer a plain Double. The second idea of LMS is to package operations on staged types as components. To make its required functionality explicit, we wrap the power function itself up as a component (a trait):


In Scala, traits are similar to classes but they can be used in mix-in composition, a limited form of multiple inheritance.15 The self-type annotation this: Arith denotes a “requires” relationship: whenever an instance of Power is created, an instance of a concrete (but unspecified) subclass of Arith (see Figure 3) that provides staged double arithmetic must be mixed in, too.

  • LMS shares many benefits with earlier staging approaches:
  • Code generators and generated code are expressed in the same program.
  • Objects that are live within the generator’s heap can be accessed from generated code if the code is invoked directly (cross-stage persistence).
  • Staged expressions inherit the static scope of the generator, and if the generator is well-typed, so is the generated code.
  • Data types representing staged expressions are inaccessible to the program itself (making optimizations that preserve only semantic but not structural equality safe).

At the same time, LMS differs from previous approaches in the following important ways:

  • Staging is driven entirely by types, and no special syntax is required.
  • Given a sufficiently expressive programming language, the whole framework can be implemented as a library (hence lightweight).
  • Staged code fragments are composed through explicit operations, in particular, lifted variants of the usual operators and control flow statements extended with optimizing symbolic rewritings (semantic composition instead of syntactic expansion).
  • Using component technology, operations on staged expressions, data types to represent them, and optimizations (both generic and domain specific) can be extended and composed in a flexible way (hence modular).
  • Different code generation targets can easily be supported; their implementations can share common code.
  • The relative evaluation order of expressions is preserved across stage boundaries. There is no danger of accidentally omitting, reordering, or duplicating computation.

The last item deserves more explanation, as it is arguably the most prevalent difficulty that programmers face with other staging approaches. Since freely composing code fragments voids any evaluation order guarantees (such as call-by-value semantics), adding quasi-quotation annotations can easily change the result of a program in unforeseen ways or slow down the program by duplicating computation. In practice, quasi-quotation is therefore often used as a low-level tool, like an “assembly language” for code generation, and combined with application-specific front-end layers that apply high-level code optimizations and ensure correct evaluation order. LMS provides a predictable execution order and makes the composition of staged expressions extensible, and thus removes the need for extra front ends.

LMS is a key constituent of Delite,13, 18 an open-source framework for high-performance parallel DSLs that has been used to develop DSLs for demanding application areas such as machine learning, graph processing or mesh-based partial differential equation solvers. In this context, LMS enables “abstraction without regret.” DSL programmers can use arbitrary Scala features to structure their programs in the generator stage, with the comforting knowledge that LMS guarantees to remove these abstractions during staging and no runtime price will need to be paid. Coupled with advanced compiler optimizations such as data parallel loop fusion and architecture-specific data structure transformations, Delite generates efficient code for a variety of parallel platforms like multi-core CPUs and GPUs.

LMS has also been used to generate SQL statements for queries embedded in Scala programs and to generate JavaScript from staged Scala fragments, allowing Web applications to execute parts of their logic in the client’s browser.

*  1.2. Organization

The rest of this paper is structured as follows. Section 2 presents an end-to-end example, turning a naive algorithm into an efficient code generator, while introducing the major LMS components on the way. Section 2.1 discusses representations of staged code, Section 2.2 is concerned with optimizations, Section 2.3 is concerned with target code generation, and Section 2.4 concludes the example by showing how the generated code can be integrated in larger programs. Section 3 describes how additional features can be added, in particular functions and recursion. Section 4 discusses related work. An earlier version of this paper appeared at GPCE 2010. The original version contains a few additional examples and a more thorough discussion of how effectful statements are represented, while the present version has been updated with some new material in Section 3.

Back to Top

2. An End-to-End Example

In the same way as the simple power function shown above, we can stage far more interesting and practically relevant programs, for example, the fast Fourier transform (FFT). A staged FFT, implemented in MetaOCaml, has been presented by Kiselyov et al.12 Their work is a very good showcase for how staging allows one to transform a simple, unoptimized algorithm into an efficient program generator. Achieving this in the context of MetaOCaml, however, required restructuring the program into monadic style and adding a front-end layer for performing symbolic rewritings. Using our approach of just adding Rep types, we can go from the naive textbook-algorithm to the staged version (shown in Figure 1) by changing literally two lines of the code:


All that is needed is adding the self-type annotation to import arithmetic and trigonometric operations and changing the type of the real and imaginary components of complex numbers from Double to Rep [Double].

Merely changing the types will not provide us with the desired optimizations yet. We see below how we can add the transformations described by Kiselyov et al. to generate the same fixed-size FFT code, corresponding to the famous FFT butterfly networks (see Figure 2). Despite the seemingly naive algorithm, this staged code is free from branches, intermediate data structures, and redundant computations. The important point here is that we can add these transformations without any further changes to the code in Figure 1, just by mixing in the trait FFT with a few others.

In the remainder of this section, we present the LMS framework that is used to generate the code in Figure 2 from the algorithm in Figure 1. Before considering specific optimizations, let us have a closer look at the definition of Rep and the traits Arith and Trig. The definitions are given in Figure 3. In trait Base, the declaration type Rep [+T] defines an abstract type constructor (also called a higher kinded type) Rep which we take to range over possible representations of staged expressions. Since Rep is abstract, no concrete representation is defined yet; the declaration merely postulates the existence of some representation.

Trait Arith extends trait Base and contains only abstract members. These postulate the existence of an implicit lifting of Doubles to staged values and the usual arithmetic operations on staged expressions of type Rep [Double]. The restriction to Doubles is just to keep the presentation concise. Any suitable means to abstract over numeric types, such as the “type class” Numeric from the Scala standard library, could be used to define Arith in a generic way for a range of numeric types. Analogous to Double, we could define arithmetic on matrices and vectors and implement optimizations on those operations in exactly the same way. Trait Trig is similar to Arith but defines trigonometric operations.

One way to look at Base, Arith, and Trig is as the definition of a typed embedded language (or its syntax). The embedding is tagless (i.e., method resolution happens at compile time without additional runtime dispatch overhead)2 and polymorphic,8 in the sense that we are free to pick any suitable concrete implementation that fulfills the given interface.

From a safety point of view, keeping the actual representation inaccessible to the program generator is very important. Otherwise, the program generator could execute a different code depending on the exact structure of a staged expression. Optimizations that replace staged code with simpler but semantically equivalent expressions would risk changing the meaning of the generated program.

*  2.1. Representing staged code

With the aim of generating code, we could represent staged expressions directly as strings. But for optimization purposes we would rather have a structured intermediate representation that we can analyze in various ways.

We choose a representation on the basis of expression trees or, more exactly, a “sea of nodes” representation that is in fact a directed (and for the moment, acyclic) graph but can be accessed through a tree-like interface. The necessary infrastructure is defined in trait Expressions, shown in Figure 4.

There are three categories of objects involved: expressions, which are atomic (subclasses of Exp: constants and symbols; with a “gensym” operator fresh to create fresh symbols), definitions, which represent composite operations (subclasses of Def, to be provided by other components), and blocks, which model nested scopes.

The guiding principle is that each definition has an associated symbol and refers to other definitions only via their symbols. In effect, this means that every composite value will be named, similar to administrative normal form (ANF). The trait Expressions provides methods to find a definition given a symbol, or vice versa. The extractor object Def allows one to pattern-match on the definition of a given symbol, a facility that is used for implementing rewritings (see below).

The framework ensures that the code that contains staging operations will always be executed within the dynamic scope of at least one invocation of reifyBlock, which returns a block object and takes as call-by-name argument the present-stage expression that will compute the staged block result. Block objects can be part of definitions, for example, for loops or conditionals.

The implicit conversion method toAtom converts a definition to an atomic expression and links it to the scope being built up by the innermost enclosing reifyBlock call. When the definition is known to be side-effect free, toAtom will search the already encountered definitions for a structurally equivalent one. If a matching previous definition is found, its symbol will be returned, possibly moving the definition to a parent scope to make it accessible. If the definition has side effects or it is seen for the first time, it will be associated with a fresh symbol and saved for future reference. This simple scheme provides a powerful global value numbering (common subexpression elimination) optimization that effectively prevents generating duplicate code. More complicated optimization schemes can be added, too.

Since all operations in interface traits such as Arith return Rep types, defining Rep [T] = Exp [T] in trait BaseExp (see Figure 5) means that conversion to symbols will take place within the constructor methods (e.g., cos or infix_*). This fact is important because it establishes our correspondence between the evaluation order of the program generator and the evaluation order of the generated program: at the point where the generator calls toAtom, the composite definition is turned into an atomic value, that is, its evaluation is recorded now and played back later in the same relative order with respect to others within the closest reifyBlock invocation.

We observe that there are no concrete definition classes provided by the trait Expressions. Providing meaningful data types is the responsibility of other traits that implement the interfaces defined previously (Base and its descendants). For each interface trait, there is one corresponding core implementation trait. Shown in Figure 5, we have traits BaseExp, ArithExp, and TrigExp for the functionality required by the FFT example. The trait BaseExp installs atomic expressions as the representation of staged values by defining Rep [T] = Exp [T]. Traits ArithExp and TrigExp define one definition class for each operation defined by Arith and Trig, respectively, and implement the corresponding interface methods to create instances of those classes.

*  2.2. Implementing optimizations

Some profitable optimizations, such as the global value numbering described above, are very generic. Other optimizations apply only to specific aspects of functionality, for example, particular implementations of constant folding (or more generally, symbolic rewritings) such as replacing computations like x * 1.0 with x. Yet, other optimizations are specific to the actual program being staged. In the FFT case, Kiselyov et al.12 describe a number of rewritings that are particularly effective for the patterns of code generated by the FFT algorithm but not as much as for other programs.

What we want to achieve again is modularity, so that optimizations can be combined in a way that is most useful for a given task. To implement a particular rewriting rule (whether specific or generic), say, x * 1.0 → x, we can provide a specialized implementation of infix_* (overriding the one in trait ArithExp) that will test its arguments for a particular pattern. How this can be done in a modular way is shown by the traits ArithExpOpt and ArithExpOptFFT, which implement some generic and program-specific optimizations (see Figure 6). Note that the use of x*y within the body of infix_* will apply the optimization recursively.

In essence, we are confronted with the classic “expression problem” of independently extending a data model to new data variants and new operations. There are many solutions to this problem but most of them are rather heavyweight. More lightweight implementations are possible in languages that support multi-methods, that is, dispatch method calls dynamically based on the actual types of all the arguments. Figure 6 shows how we can achieve essentially the same (plus deep inspection of the arguments) using pattern matching and mix-in composition, making use of the fact that composing traits is subject to linearization.15 We package each set of arithmetic optimizations into its own trait that inherits from ArithExp and overrides the desired methods (e.g., infix_*). When the arguments do not match the rewriting pattern, the overridden method will invoke the “parent” implementation using super. When several such traits are combined, the super calls will traverse the overridden method implementations according to the linearization order of their containing traits.

Implementing multi-methods in a statically typed setting usually poses three problems: separate type checking/compilation, ensuring nonambiguity, and ensuring exhaustiveness. The described encoding supports separate type-checking and compilation as far as traits do. Ambiguity is ruled out by always following the linearization order and the first-match semantics of pattern matching. Exhaustiveness is ensured at the type level by requiring a default implementation, although no guarantees can be made that the default will not choose to throw an exception at runtime. In the particular case of applying optimizations, the default is always safe as it will just create an expression object.

*  2.3. Generating code

Code generation is an explicit operation. For the common case where generated code is to be loaded immediately into the running program, the trait Compile provides a suitable interface in form of the abstract method compile (see Figure 8). The contract of compile is to “unstage” a function from staged to staged values into a function operating on present-stage values that can be used just like any other function object in the running program.

For generating Scala code, an implementation of the compilation interface is provided by trait CompileScala. This trait extends another trait, ScalaGenBase, whose subclasses are responsible for traversing nested blocks and generating Scala code for individual definition nodes. Of course, the traversal code can also be factored out and shared by multiple code generation targets. Subclasses of ScalaGenBase are structured in a similar way as those of Base, that is, one for each unit of functionality (see Figure 9). Code generation can omit side-effect free graph nodes that are unreachable from the final result, effectively performing a dead code elimination optimization.

The overall compilation logic of CompileScala is relatively simple: emit a class and apply-method declaration header, emit instructions for each definition node according to the schedule, close the source file, invoke the Scala compiler, load the generated class file, and return a newly instantiated object of that class.

*  2.4. Putting it all together

In the previous sections, we have discussed the major ingredients of lightweight modular staging, focusing mostly on individual components. Figure 7 shows an overview of the traits encountered so far and their relationships.

Using the staged FFT implementation as part of some larger Scala program is straightforward but requires us to interface the generic algorithm with a concrete data representation. The algorithm in Figure 1 expects an array of Complex objects as input, each of which contains fields of type Rep [Double]. The algorithm itself has no notion of staged arrays but uses arrays only in the generator stage, which means that it is agnostic to how data is stored. The enclosing program, however, will store arrays of complex numbers in some native format which we will need to feed into the algorithm. A simple choice of representation is to use Array [Double] with the complex numbers flattened into adjacent slots. When applying compile, we will thus receive input of type Rep [Array [Double]]. Figure 10 shows how we can extend the trait FFT to FFTC to obtain compiled FFT implementations that realize the necessary data interface for a fixed input size.

We can then define a code that creates and uses compiled FFT “codelets” by extending FFTC:


Constructing an instance of this subtrait (mixed in with the appropriate LMS traits) will execute the embedded code:


We can also use the compiled methods from outside the object:


Providing an explicit type in the definition val OP: TestFFC = … ensures that the internal representation is not accessible to the outside, but only the members defined by TestFFC are accessible.

Back to Top

3. Adding More Features

Many features can be added in a way that is analogous to what we have seen above, but some require a bit more thought. In this section, we take a closer look at staged functions. Basic support for staged function definitions and function applications can be defined in terms of a simple higher order abstract syntax (HOAS)16 representation, similar to those of Carette et al.2 and Hofer et al.8 (see Figure 11). The idea is to provide a lambda operation that transforms present-stage functions over staged values (type Rep [A] => Rep [B]) to staged function values (type Rep [A=>B]). Note how this is similar to the signature of compile. To give an example, the staged recursive factorial function will look like this:


As opposed to the earlier power example, an invocation fac (m) will not inline the definition of fac but will result in an actual function call in the generated code.

However, the HOAS representation has the disadvantage of being opaque: there is no immediate way to “look into” a Scala function object. If we want to treat functions in the same way as other program constructs, we need a way to transform the HOAS encoding into our graph representation. We can implement lambda (f) to call


which will unfold the function definition into a Block that represents the entire computation defined by the function. But eagerly expanding function definitions is problematic. For recursive functions, the result would be infinite, that is, the computation will not terminate. What we would like to do instead is to detect recursion and generate a finite representation that makes the recursive call explicit. However, this is difficult because recursion might be very indirect:


Each incarnation of foo creates a new function f; unfolding will thus create unboundedly many different function objects.

To detect cycles, we have to compare those functions. This, of course, is undecidable in the general case of taking equality to be defined extensionally, that is, saying that two functions are equal if they map equal inputs to equal outputs. By contrast, the standard reference equality, which compares memory addresses of function objects, is too weak for our purpose:


However, we can approximate extensional equality by intensional (i.e., structural) equality, which is sufficient in most cases because recursion will cycle through a well-defined code path in the program text. Testing intensional equality amounts to checking whether two functions are defined at the same syntactic location in the source program and whether all data referenced by their free variables is equal. Fortunately, the implementation of first-class functions as closure objects offers (at least in principle) access to a “defunctionalized” data type representation on which equality can easily be checked. A bit of care must be taken though, because the structure can be cyclic. On the java virtual machine (JVM), there is a particularly neat trick. We can serialize the function objects into a byte array and compare the serialized representations:


With this method of testing equality, we can implement controlled unfolding. Unfolding functions only once at the definition site and associating a fresh symbol with the function being unfolded allows us to construct a block that contains a recursive call to the symbol we created. Thus, we can create the expected representation for the factorial function above.

*  3.1. Guarantees by construction

Making staged functions explicit through the use of lambda enables tight control over how functions are structured and composed. For example, functions with multiple parameters can be specialized for a subset of the parameters. Consider the following implementation of Ackermann’s function:


Calling ack (m)(n) will produce a set of mutually recursive functions, each specialized to a particular value of m. For ack (2)(n), for example, we get


In essence, this pattern implements what is known as “polyvariant specialization” in the partial evaluation community. But unlike automatic partial evaluation, which might or might not be able to discover the right specialization, the use of staging provides strong guarantees about the structure of the generated code. In this case, we are guaranteed that specialization will happen for each value of m (but never for n), statically evaluating tests on values of m and inserting constants for all occurrences of m in the generated code.

Other strong guarantees can be achieved by restricting the interface of function definitions. Being of type Rep [A=>B], the result of lambda is a first-class value in the generated code that can be stored or passed around in arbitrary ways. However, we might want to avoid higher order control flow in generated code for efficiency reasons, or to simplify subsequent analysis passes. In this case, we can define a new function constructor fundef as follows:


The use of fundef instead of lambda produces a restricted function that can only be applied but not passed around in the generated code (type Rep [A]=> Rep [B]). At the same time, a result of fundef is still a first class value in the code generator. If we do not expose lambda and apply at all to client code, we obtain a guarantee that each function call site unambiguously identifies the function definition being called and no closure objects will need to be created at runtime.

Back to Top

4. Related Work

Static program generation approaches include C++ templates, and Template Haskell.20 Building on C++ templates, customizable generation approaches are possible through Expression Templates.23 An example of runtime code generation in C++ is the TaskGraph framework. Active libraries were introduced by Veldhuizen24 and telescoping languages by Kennedy at al.11 Specific toolkits using domain-specific code generation and optimization include FFTW,6 SPIRAL,17 and ATLAS.25

This paper draws a lot of inspiration from the work of Kiselyov et al.12 on a staged FFT implementation. Performing symbolic rewritings by defining operators on lifted expressions and performing common subexpression elimination on the fly is also central to their approach. LMS takes these ideas one step further by making them a central part of the staging framework itself.

Immediately related works on embedding typed languages include those of Carette et al.2 and Hofer et al.8 Lee et al.13, 18 describe how LMS is used in the development of DSLs for high-performance parallel computing on heterogenous platforms.

Multistage Programming Languages such as MetaML22 and MetaOCaml1 have been proposed as a disciplined approach to building code generators. These languages provide three syntactic annotations: brackets, escape, and run, which together provide a syntactic quasi-quotation facility that is similar to that found in Lisp but statically scoped and statically typed.

MSP languages make writing program generators easier and safer, but they inherit the essentially syntactic notion of combining program fragments, which incurs the risk of duplicating or reordering computation.3, 21 Code duplication can be avoided systematically by writing the generator in continuation-passing or monadic style, using appropriate combinators to insert let-bindings in strategic places. However, this is often impractical since this style significantly complicates the generator code. Another possibility is to make extensive use of side effects in the generator part, but side effects invalidate some of the static guarantees of MSP languages. This dilemma has been described as an “agonizing trade-off,” because of which one “cannot achieve clarity, safety, and efficiency at the same time.”10

By contrast, lightweight modular staging prevents code duplication by handling the necessary side effects inside the staging primitives, which are semantic combinators instead of syntactic expanders. Therefore, code generators can usually be written in purely functional direct style and are much less likely to invalidate safety assurances.

Another characteristic of some MSP languages is that staged code cannot be inspected because of safety considerations. Thus, domain-specific optimizations must happen on an external intermediate representation, before using the MSP primitives to generate code.12 The burden of choosing and implementing a suitable representation is on the programmer, and it is not clear how different representations can be combined or reused.

Lightweight modular staging provides a systematic interface for adding symbolic rewritings. Safety is maintained by exposing the internal code structure only to rewriting modules but keeping it hidden from the client generator code.

Compiled embedded DSLs, as studied by Leijen and Meijer14 and Elliott et al.,5 can also be implemented using MSP languages by writing an explicit interpreter and adding staging annotations in a second step.4, 7, 19 This is simpler than writing a full compiler, but compared to constructing explicit interpreters, “purely” embedded languages that are implemented as plain libraries have many advantages.9 LMS allows a simpler approach, by starting with a pure embedding instead of an explicit interpreter. In many cases, adding some type annotations in strategic places is all that is needed to get to a staged embedding. If domain-specific optimizations are needed, new AST classes and rewriting rules are easily added.

Back to Top


The development of lightweight modular staging has benefited greatly from the work on Delite in collaboration with the Stanford Pervasive Parallelism Lab, in particular Arvind Sujeeth, Hassan Chafi, Kevin Brown, HyoukJoong Lee and Kunle Olukotun. A number of members of the authors’ group at EPFL have applied LMS to interesting use cases and contributed new insights or valuable extensions to the framework.

Back to Top

Back to Top

Back to Top

Back to Top


F1 Figure 1. FFT code. Only the real and imaginary components of complex numbers need to be staged.

F2 Figure 2. Computation graph for size-4 FFT. Autogenerated from staged code in Figure 1.

F3 Figure 3. Interface traits defining staged operations. For simplicity, operations are defined for Double only.

F4 Figure 4. Expression representation (method implementations have been omitted).

F5 Figure 5. Implementing the interface traits from Figure 3 using the expression types from Figure 4.

F6 Figure 6. Extending the implementation from Figure 5 with generic (top) and specific (bottom) optimizations (analog of TrigExp has been omitted).

F7 Figure 7. Component architecture. Arrows denote extends relationships and dashed boxes represent units of functionality.

F8 Figure 8. Code generation interface and skeleton of Scala compilation component.

F9 Figure 9. Scala code generation for selected expressions.

F10 Figure 10. Extending the FFT component from Figure 1 with explicit compilation.

F11 Figure 11. Representation of λ-abstractions as Scala function values (higher order abstract syntax).

Back to top

    1. Calcagno, C., Taha, W., Huang, L., Leroy, X. Implementing multi-stage languages using asts, gensym, and reflection. GPCE, F. Pfenning and Y. smaragdakis, eds. Volume 2830 of Lecture Notes in Computer Science (2003). Springer, 57–76.

    2. Carette, J. Kiselyov, O., chieh Shan, C., Finally tagless, partially evaluated: Tagless staged interpreters for simpler typed languages. J. Funct. Program, 19, 5 (2009), 509–543.

    3. Cohen, A., Donadio, S., Garzarán, M.J., Herrmann, C.A., Kiselyov, O., Padua, D.A. In search of a program generator to implement generic transformations for high-performance computing. Sci. Comput. Program. 62, 1 (2006), 25–46.

    4. Czarnecki, K., O'Donnell, J.T., Striegnitz, J., Taha, W., Dsl implementation in metaocaml, template haskell, and c++. In Domain-Specific Program Generation (2003), 51–72.

    5. Elliott, C., Finne, S., de Moor, O. Compiling embedded languages. J. Funct, Program, 13, 3 (2003), 455-481.

    6. Frigo, M. A fast Fourier transform compiler. In PLDI (1999), 169–180.

    7. Guerrero, M., Pizzi, E., Rosenbaum, R., Swadi, K.N., Taha, W. Implementing dsls in metaocaml. OOPSLA Companion, J. M. Vlissides and D. C. Schmidt, eds. (2004). ACM, 41–42.

    8. Hofer, C., Ostermann, K., Rendel, T., Moors, A. Polymorphic embedding of dsls. GPCE, Y. Smaragdakis and J.G. Siek, eds. (2008). ACM, 137–148.

    9. Hudak, P. Modular domain specific languages and tools. In Proceedings of Fifth International Conference on Software Reuse (June 1998), 134–142.

    10. Kameyama, Y., Kiselyov, O., chieh Shan, C. Shifting the stage: staging with delimited control. PEPM, G. Puebla and G. Vidal, eds. (2009). ACM, 111–120.

    11. Kennedy, K., Broom, B., Cooper, K.D., Dongarra, J., Fowler, R.J., Gannon, D., Johnsson, S.L., Mellor-Crummey, J.M., Torczon, L. Telescoping languages: A strategy for automatic generation of scientific problem-solving systems from annotated libraries. J. Parallel Distrib, Comput. 61, 12 (2001), 1803–1826.

    12. Kiselyov, O., Swadi, K.N., Taha, W. A methodology for generating verified combinatorial circuits. EMSOFT, G.C. Buttazzo, ed. (2004). ACM, 249–258.

    13. Lee, H., Brown, K.J., Sujeeth, A.K., Chafi, H., Rompf, T., Odersky, M., Olukotun, K. Implementing domain-specific languages for heterogeneous parallel computing. IEEE Micro 31, 5 (2011), 42–53.

    14. Leijen, D., Meijer, E. Domain specific embedded compilers. In DSL (1999), 109–122.

    15. Odersky, M., Zenger, M. Scalable component abstractions. OOPSLA, R. E. Johnson and R.P. Gabriel, eds. (2005). ACM, 41–57.

    16. Pfenning, F., Elliott, C. Higher-order abstract syntax. In PLDI (1988), 199–208.

    17. Püschel, M., Moura, J.M.F., Singer, B., Xiong, J., Johnson, J., Padua, D.A., Veloso, M.M., Johnson, R.W. Spiral: A generator for platform-adapted libraries of signal processing alogorithms. IJHPCA, 18, 1 (2004), 21–45.

    18. Rompf, T., Sujeeth, A.K., Lee, H., Brown, K.J., Chafi, H., Odersky, M., Olukotun, K. Building-blocks for performance oriented dsls. In DSL (2011), 93–117.

    19. Sheard, T., Benaissa, Z.E.A., Pasalic, E. Dsl implementation using staging and monads. In DSL (1999), 81–94.

    20. Sheard, T., Jones, S.L.P. Template meta-programming for haskell. SIGPLAN Not. 37, 12 (2002), 60–75.

    21. Swadi, K.N., Taha, W., Kiselyov, O., Pasalic, E. A monadic approach for avoiding code duplication when staging memoized functions. PEPM, J. Hatcliff and F. Tip, eds. (2006). ACM, 160–169.

    22. Taha, W., Sheard, T. Metaml and multi-stage programming with explicit annotations. Theor, Comput, Sci. 248(1–2) (2000), 211–242.

    23. Veldhuizen. T "Expression Templates," in Stanley B. Lippman (Editor), C++ Gems (SIGS Books and Multimedia, 1996), pp. 475–488

    24. Veldhuizen, T.L. Active libraries and universal languages. PhD thesis, Indiana University Computer Science (May 2004).

    25. Whaley, R.C., Petitet, A., Dongarra, J. Automated empirical optimizations of software and the atlas project. Parallel Comput. 27(1–2) (2001), 3–35.

    A previous version of this paper was published in the Proceedings of the 9th International Conference on Generative Programming and Component Engineering (Oct. 10–13, Eindhoven, The Netherlands).

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More