### Abstract

Optimizing programs to run efficiently on modern parallel hardware is hard but crucial for many applications. The predominantly used imperative languages force the programmer to intertwine the code describing functionality and optimizations. This results in a portability nightmare that is particularly problematic given the accelerating trend toward specialized hardware devices to further increase efficiency.

Many emerging domain-specific languages (DSLs) used in performance-demanding domains such as deep learning attempt to simplify or even fully automate the optimization process. Using a high-level—often functional—language, programmers focus on describing functionality in a declarative way. In some systems such as Halide or TVM, a separate *schedule* specifies how the program should be optimized. Unfortunately, these schedules are not written in well-defined programming languages. Instead, they are implemented as a set of *ad hoc* predefined APIs that the compiler writers have exposed.

In this paper, we show how to employ functional programming techniques to solve this challenge with elegance. We present two functional languages that work together—each addressing a separate concern. RISE is a functional language for expressing computations using well-known data-parallel patterns. ELEVATE is a functional language for describing optimization strategies. A high-level RISE program is transformed into a low-level form using optimization strategies written in ELEVATE. From the rewritten low-level program, high-performance parallel code is automatically generated. In contrast to existing high-performance domain-specific systems with scheduling APIs, in our approach programmers are not restricted to a set of built-in operations and optimizations but freely define their own computational patterns in RISE and optimization strategies in ELEVATE in a composable and reusable way. We show how our holistic functional approach achieves competitive performance with the state-of-the-art imperative systems such as Halide and TVM.

### 1. Introduction

As Moore’s Law and Dennard scaling are coming to an end,^{7} performance and energy efficiency gains no longer come for free for software developers, who have to optimize for an increasingly diverse set of hardware by exploiting subtle details of the architecture. The accelerating trend toward specialized hardware, which offers extreme benefits for performance and energy efficiency if the optimized software exploits it, emphasizes the need for performance portability.

The predominant imperative and low-level programming approaches such as C, CUDA, or OpenCL force programmers to intertwine the code describing the program’s functional behavior with optimization decisions, making them—by design—non-performance-portable. As an alternative, higher-level domain-specific approaches have emerged allowing programmers to declaratively describe the functional behavior without committing to a specific implementation. Prominent examples are machine learning systems such as TensorFlow^{1} or PyTorch,^{10} where the compilers and runtime systems are responsible for optimizing the computations expressed as data flow graphs. Large teams of engineers provide fast implementations for the most common hardware, for TensorFlow including Google’s specialized TPU hardware. This labor-intensive support of new hardware is currently only sustainable for the biggest companies—and even they struggle, as highlighted by two of the original authors of TensorFlow.^{3}

TVM^{4} and Halide^{11} are two state-of-the-art high-performance domain-specific code generators used in machine learning and image processing. Both attempt to tackle the performance portability challenge by separating the program into two parts: schedules and algorithms. A *schedule* describes the optimizations to apply to an *algorithm* that defines the functional behavior of the computation. Schedules are implemented using a set of predefined *ad hoc* APIs that expose a fixed set of optimization options. TVM’s and Halide’s authors describe these APIs as a scheduling *language*, but they lack many desirable properties of a programming language. Most crucially, programmers are not able to define their own abstractions. Even the composition of existing optimization primitives can be unintuitive due to the lack of precise semantics and implicit behavior limiting experts’ control. Furthermore, for some desirable optimizations, it is not sufficient to change the schedule, but programmers must redefine the algorithm itself—violating the promise of separating algorithm and schedule. To overcome the innovation obstacle of manually optimizing for specialized hardware and for achieving automated performance portability, we will need to rethink how we separate, describe, and apply optimizations in a more principled way.

In this paper, we describe a more principled but still practical holistic functional approach to high-performance code generation (Figure 1). We combine RISE, a data-parallel functional language for expressing computations, with a functional strategy language called ELEVATE. RISE provides well-known functional data-parallel patterns for expressing computations at a high level. ELEVATE enables programmers to define their own abstractions for building optimization strategies in a composable and reusable way. As we will see in our experimental results, our approach provides competitive performance compared with the state of the art while being built with and leveraging functional principles, resulting in an elegant and composable design.

**Figure 1. Overview of our holistic functional approach to achieving high performance: Computations are expressed as High-Level Programs written in the data-parallel language RISE. These programs are rewritten following the instructions of an Optimization Strategy expressed in the strategy language ELEVATE. From the rewritten Low-Level Programs that encode optimizations explicitly, High-Performance Code is generated.**

### 2. Motivation and Background

We motivate the need for more principled optimizations with a study of TVM, the state of the art in high-performance domain-specific compilation for machine learning.

**2.1. Scheduling languages for high-performance code generation**

Halide^{11} proposed decoupling a program into the *algorithm*, describing the functional behavior, and the *schedule*, specifying how the compiler should optimize. This has inspired similar approaches such as TVM^{4} in deep learning.

Listings 1 and 2 show TVM Python code^{15} for optimizing matrix multiplication. Listing 1 shows a simple version where lines 2–6 define the computation: *A* and *B* are multiplied by performing the dot product for each coordinate pair (*x*, *y*). The dot product is expressed as pairwise multiplications and summation using the `tvm.sum` operator (line 6). Line 8 instructs the compiler to use the default schedule, generating code that computes the output matrix sequentially.

**Listing 1. Matrix–matrix multiplication in TVM. Lines 2–6 define the computation A × B; line 8 instructs the compiler to use the default schedule computing the output matrix sequentially in row-major order.**

**Listing 2. Optimized matrix–matrix multiplication in TVM. Lines 2–8 define an optimized version of the algorithm in Listing 1, the other lines define a schedule specifying the optimizations for targeting CPUs.**

**Modifications for optimizing performance.** Listing 2 shows an optimized version. The schedule in lines 10–23 specifies multiple optimizations including tiling (line 12), vectorization (line 19), and loop unrolling (line 18) for optimizing the performance on multicore CPUs. However, in order to optimize the memory access pattern, the algorithm has to be changed: a copy of the *B* matrix (`pB`) is introduced in line 5 (and used in line 8) whose elements are reordered depending on the tile size. This optimization is not expressible with scheduling primitives and, therefore, requires modifying the algorithm—violating the promise of separating algorithm and schedule. Generally, the separation between algorithm and schedule is blurred because both share the same Python identifiers and must live in the same scope. This blurred separation limits the reuse of schedules across algorithms.

The optimized parallel schedule uses built-in optimization primitives. Some are hardware-specific (like `vectorize`), some are algorithmic optimizations useful for many applications (like `tiling` to increase data locality), and others are low-level optimizations (like `unroll` and `reorder` that transform loop nests). However, TVM’s scheduling language is not easily extensible, as adding optimization primitives requires extending the underlying compiler.

The behavior of some primitives is not intuitive, and only informal documentation is provided, for example, for `cache_write`: *“Create a cache write of original tensor, before storing into tensor.”* Reasoning about schedules is difficult due to the lack of clear descriptions of the optimization primitives.

If no schedule is provided (as in Listing 1), the compiler employs a set of implicit default optimizations that are out of reach for the user’s control. The implicit optimizations sometimes lead to the surprising behavior that algorithms without a schedule perform better (e.g., due to auto-vectorization) than ones with a provided schedule.

**2.2. The need for a principled way to separate, describe, and apply optimizations**

Out of the shortcomings of scheduling APIs, we identify desirable features for a more principled way to separate, describe, and apply optimizations for high-performance code generation. Our holistic functional approach aims to do the following:

- *Separate concerns*: Computations should be expressed at a high abstraction level only. They should not be changed to express optimizations.
- *Facilitate reuse*: Optimization strategies should be defined separately from the computational program, facilitating reusability of programs and strategies.
- *Enable composability*: Computations *and* strategies should be written as compositions of user-defined (possibly domain-specific) building blocks; *both languages* should facilitate the creation of higher-level abstractions.
- *Allow reasoning*: Computational patterns, but especially strategies, should have a precise, well-defined semantics allowing reasoning about them.
- *Be explicit*: Implicit default behavior should be avoided to empower users to be in control.

*Fundamentally, we argue that a more principled high-performance code generation approach should be holistic by considering computation and optimization strategies equally important. As a consequence, a strategy language should be built with the same standards as a language describing computation.*

In this paper, we present such an approach combining two functional languages: RISE and ELEVATE.

Figure 2 shows an example of a RISE program defining the matrix multiplication computation as a composition of well-known data-parallel functional patterns. Below it is an ELEVATE strategy that defines one possible optimization by applying the well-known tiling optimization. The optimization strategy is defined as a sequential composition (`;`) of user-defined strategies that are themselves defined as compositions of simple rewrite rules, giving the strategy precise semantics. We do not use implicit behavior and instead generate low-level code according to the optimization strategy.

**Figure 2. Matrix-matrix multiplication in RISE (top) and the tiling optimization strategy in ELEVATE (bottom).**

### 3. Rise: A Language for Expressing Data-Parallel Computations

RISE is a functional programming language with data-parallel patterns for expressing computations over multidimensional arrays. RISE is a spiritual successor of LIFT,^{12,13,14} which has demonstrated that functional, high-performance code generation is feasible for different domains, including dense linear algebra, sparse linear algebra, and stencil computations.^{6}

The top of Figure 2 shows an example of a RISE program using the usual functional constructs of function abstraction (written `fun(x, e)`), application (written with parentheses), identifiers, and literals (underlined). We use the following syntactic sugar: we write reverse function application as `e |> f` (equivalent to `f(e)`); we write function composition as `g << f` and in the reverse form as `f >> g`, both meaning `f` is applied before `g`. The type system symbolically tracks the length of arrays using a restricted form of dependent types. Grammar and typing rules are given elsewhere.^{5,2}

RISE defines a set of high-level primitives that describe computations over multidimensional arrays. These primitives are well known in the functional programming community: `fst`, `snd`, and the binary functions `add` and `mult` have their obvious meaning. `map` and `reduce` are the well-known higher-order functions operating on arrays and allowing for easy parallelization. `zip`, `split`, `join`, and `transpose` shape multidimensional array data in various ways.
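To make the semantics of these patterns concrete, the following sketch (our illustration, not part of RISE; plain Scala collections stand in for the patterns) composes `zip`, `map`, and `reduce` into a dot product and a matrix multiplication, mirroring the structure of the RISE program in Figure 2:

```scala
// Illustration only: Scala collections as stand-ins for RISE patterns.
def dot(xs: Seq[Float], ys: Seq[Float]): Float =
  xs.zip(ys)                      // zip: pair up corresponding elements
    .map { case (x, y) => x * y } // map: pairwise multiplication
    .foldLeft(0.0f)(_ + _)        // reduce: summation with add

// Matrix multiplication: one dot product per (row, column) pair.
def mm(a: Seq[Seq[Float]], b: Seq[Seq[Float]]): Seq[Seq[Float]] =
  a.map(row => b.transpose.map(col => dot(row, col)))
```

Here `foldLeft` plays the role of `reduce` with `add` and the initial value 0, and `transpose` turns the columns of `b` into rows so they can be zipped with each row of `a`.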

**A functional representation of hardware features.** Besides the high-level primitives, RISE offers low-level primitives to indicate how to exploit the underlying hardware. Generally, programmers do not directly use these primitives; instead, they are introduced by rewrite rules. The `mapSeq` and `mapPar` patterns indicate whether a function is applied to an array using a sequential or a parallel loop. Similarly, `reduceSeq` and `reduceSeqUnroll` indicate whether the sequential reduction loop should be unrolled. RISE does not provide a parallel reduction as a building block because it is expressible using other low-level primitives such as `mapPar`. `toMem(a)(fun(x, b))` indicates that `a` is stored in memory and accessible in `b` with the name `x`. Three more low-level patterns, `mapVec`, `asVector`, and `asScalar`, enable the use of SIMD-style vectorization. The low-level primitives presented here are OpenMP-specific for expressing parallelization on CPUs; a similar set of low-level primitives exists for targeting the OpenCL programming language for GPUs.

**Strategy-preserving code generation from RISE.** The compilation of RISE programs is slightly unusual. A high-level program is rewritten using a set of rewrite rules into the low-level patterns. From the low-level representation, imperative parallel code is generated. This design leads to a clear separation of concerns—one of the key aims that we set out for our approach. All optimization decisions must be made in the rewriting stage before code generation. Atkey et al.^{2} describe a code generation process that is guaranteed to be *strategy preserving*, meaning that no implicit implementation decisions are made. Instead, the compiler respects the implementation and optimization decisions explicitly encoded in the low-level RISE program.

### 4. Elevate: A Language for Describing Optimization Strategies

ELEVATE is a functional language for describing optimization strategies with a standard feature set, including recursion, algebraic data types, and pattern matching. It is heavily inspired by earlier work on strategy languages,^{8} for example, Stratego,^{16,18} and complements our functional language RISE that describes computations. Our current implementation is a shallow-embedded DSL in Scala, and we use Scala-like notation for ELEVATE strategies in this paper.

A *strategy* is the fundamental building block of ELEVATE, encoding a program transformation as a function with the type:

```scala
type Strategy[P] = P => RewriteResult[P]
```

Here, `P` is the type of the rewritten program, such as `Rise` for RISE programs. A `RewriteResult` is an applicative error monad encoding the success or failure of applying a strategy:

```scala
RewriteResult[P] = Success[P](p: P)
                 | Failure[P](s: Strategy[P])
```

In case of success, `Success` contains the rewritten program; otherwise, `Failure` contains the unsuccessful strategy.

The simplest examples of strategies are those that always succeed (`id`) and always fail (`fail`):

```scala
def id[P]: Strategy[P] = (p: P) => Success(p)
def fail[P]: Strategy[P] = (p: P) => Failure(fail)
```
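As a concrete rendering of this encoding, here is a minimal self-contained sketch (Scala 3-style top-level definitions; an illustration rather than the actual ELEVATE implementation):

```scala
// Minimal sketch of the ELEVATE core encoding (illustration only).
sealed trait RewriteResult[P]
case class Success[P](p: P) extends RewriteResult[P]
case class Failure[P](s: P => RewriteResult[P]) extends RewriteResult[P]

type Strategy[P] = P => RewriteResult[P]

def id[P]: Strategy[P] = p => Success(p)      // always succeeds, program unchanged
def fail[P]: Strategy[P] = p => Failure(fail) // always fails, carrying itself
```

For example, `id(42)` evaluates to `Success(42)` for programs of type `Int`.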

**4.2. Rewrite rules as strategies**

In ELEVATE, rewrite rules are also strategies, that is, functions satisfying the type given above. The left-hand side of the well-known `mapFusion` rule (Figure 3) is expressed in RISE as:

**Figure 3. Rewrite rules of high-level RISE expressions used for optimizations in this paper.**

```scala
val p: Rise = fun(xs, map(f)(map(g)(xs)))
```

The fusion rule is implemented in ELEVATE as follows:

```scala
def mapFusion: Strategy[Rise] = p => p match {
  case app(app(map, f), app(app(map, g), xs)) =>
    Success(map(fun(x, f(g(x))))(xs))
  case _ => Failure(mapFusion)
}
```

We are mixing RISE (i.e., `map(f)`) and ELEVATE expressions and use `app(f, x)` to pattern-match the function application that we write as `f(x)` in RISE. The expression nested inside `Success` is the result of the rewrite rule application. Figure 3 shows all rewrite rules used as basic building blocks for expressing optimizations such as tiling, discussed later.
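The semantics of this rule is easy to validate with ordinary Scala collections (an illustration, not the RISE encoding): mapping `g` and then `f` over an array gives the same result as a single map of the composed function:

```scala
val xs = Vector(1, 2, 3)
val f  = (x: Int) => x + 1
val g  = (x: Int) => x * 2

val unfused = xs.map(g).map(f)    // two traversals over the array
val fused   = xs.map(f compose g) // one traversal: map(fun(x, f(g(x))))
// both evaluate to Vector(3, 5, 7)
```

The fused version avoids materializing the intermediate array, which is the motivation for applying this rule during optimization.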

An idea that ELEVATE inherits from Stratego^{17} is to describe strategies as compositions—one of our key aims. Therefore, we introduce strategy combinators.

The `seq` combinator composes two strategies `fs` and `ss` sequentially by applying the first strategy to the input program `p` and then applying the second strategy to the result.

```scala
def seq[P]: Strategy[P] => Strategy[P] => Strategy[P] =
  fs => ss => p => fs(p) >>= (q => ss(q))
```

The `seq` strategy is successful when both strategies apply successfully in succession; otherwise, `seq` fails. In our combinator implementations, we use the monadic interface of `RewriteResult` with the standard Haskell operators `>>=` for monadic bind, `<|>` for mplus, and `<$>` for fmap.

The `lChoice` combinator is given two strategies and applies the second one only if the first one failed.

```scala
def lChoice[P]: Strategy[P] => Strategy[P] => Strategy[P] =
  fs => ss => p => fs(p) <|> ss(p)
```

We use `<+` as an infix operator for `lChoice` and `;` for `seq`. Using these basic combinators, we define others, such as `try`, which applies a strategy and, in case of failure, applies the identity strategy. Therefore, `try` never fails.

```scala
def try[P]: Strategy[P] => Strategy[P] =
  s => p => (s <+ id)(p)
```

`repeat` applies a strategy until it is no longer applicable.

```scala
def repeat[P]: Strategy[P] => Strategy[P] =
  s => p => try(s `;` repeat(s))(p)
```
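The combinators above can be collected into a small runnable sketch (again an illustration, not the actual ELEVATE implementation; since `try` is a reserved word in Scala, the sketch names it `tryS` and spells out the monadic operators with pattern matching):

```scala
sealed trait RewriteResult[P]
case class Success[P](p: P) extends RewriteResult[P]
case class Failure[P](s: P => RewriteResult[P]) extends RewriteResult[P]
type Strategy[P] = P => RewriteResult[P]

def id[P]: Strategy[P] = p => Success(p)

// seq: monadic bind, spelled out with pattern matching
def seq[P](fs: Strategy[P])(ss: Strategy[P]): Strategy[P] = p =>
  fs(p) match {
    case Success(q)     => ss(q)
    case f @ Failure(_) => f
  }

// lChoice: try the first strategy, fall back to the second on failure
def lChoice[P](fs: Strategy[P])(ss: Strategy[P]): Strategy[P] = p =>
  fs(p) match {
    case Failure(_) => ss(p)
    case ok         => ok
  }

// `try` is a keyword in Scala, so the sketch calls it tryS
def tryS[P](s: Strategy[P]): Strategy[P] = p => lChoice(s)(id[P])(p)

def repeat[P](s: Strategy[P]): Strategy[P] = p => tryS(seq(s)(repeat(s)))(p)

// Example on "programs" of type Int: decrement until no longer applicable.
def dec: Strategy[Int] = n => if (n > 0) Success(n - 1) else Failure(dec)
```

For example, `repeat(dec)(5)` applies `dec` five times and returns `Success(0)`.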

**4.4. Traversals as strategy transformers**

In the implementation of the `mapFusion` strategy, the **match** statement will try to pattern-match its argument—the entire program. This means that a strategy on its own is hard to reuse in different circumstances.

In addition, a strategy is often applicable at multiple places within the same program or only applicable at a specific location. For example, the `mapFusion` strategy is applicable twice in the following RISE program:

```scala
val maps3 = fun(xs, map(f)(map(g)(map(h)(xs))))
```

We may fuse the first or last two `map`s, as shown in Figure 4.

**Figure 4. Two possible locations for applying the map-fusion rule within the same program.**

In ELEVATE, we use *traversals* to describe the exact location at which a strategy is applied. Luttik and Visser^{9} proposed two basic traversals encoded as strategy transformers:

```scala
type Traversal[P] = Strategy[P] => Strategy[P]
def all[P]: Traversal[P]
def one[P]: Traversal[P]
```

Traversal `all` applies a given strategy to all sub-expressions of the current expression and fails if the strategy is not applicable to all sub-expressions. `one` applies a given strategy to exactly one sub-expression and fails if the strategy is not applicable to any sub-expression.

In ELEVATE, we view these basic traversals as a type class: an interface that has to be implemented for each program type `P`. The implementation for RISE is straightforward. RISE programs are represented by ASTs such as the one in Figure 4; therefore, `all` and `one` correspond to the obvious implementations on the tree-based representation.

To fuse the first two `map`s in Figure 4, we use the `one` traversal: `one(mapFusion)(maps3)`. This applies the `mapFusion` strategy, not at the root of the AST, but one level down: first trying to apply the strategy (unsuccessfully) to the function parameter and then (successfully) to the function body highlighted in the upper-right blue box.

To fuse the last two `map`s, we use the `one` traversal twice: `one(one(mapFusion))(maps3)`. This successfully applies the fusion strategy to the expression highlighted in the lower-right purple box in Figure 4.

The traversals we have discussed so far are not specific to RISE. They are flexible but offer only limited control: for `one`, the selection of sub-expressions is either non-deterministic or implementation-dependent (as it is for RISE). Particularly in the context of program optimization, it rarely makes sense to apply a strategy to `all` sub-expressions.

In ELEVATE, one can easily specify programming-language-specific traversals. As we have seen in the previous section, RISE is a functional language using the λ-calculus as its representation. Therefore, it makes sense to introduce traversals that navigate the two core concepts of λ-calculus: `fun`ction abstraction and `app`lication. To apply a strategy to the body of a function abstraction, we define the `body` traversal, which applies a strategy to the function body; if successful, a function is built around the transformed body. Similarly, we define the traversals `function` and `argument` for function applications.

For the RISE program shown in Figure 4, we can now describe a precise path in the AST. To fuse the first two `map`s, we may write `body(mapFusion)(maps3)`, and to fuse the others, we write `body(argument(mapFusion))(maps3)`.

All traversal primitives introduced so far apply their given strategies only to immediate sub-expressions. Using strategy combinators and traversals, we can define recursive strategies that traverse entire expressions:

```scala
def topDown[P]: Traversal[P] =
  s => p => (s <+ one(topDown(s)))(p)

def bottomUp[P]: Traversal[P] =
  s => p => (one(bottomUp(s)) <+ s)(p)

def tryAll[P]: Traversal[P] =
  s => p => (all(tryAll(try(s))) `;` try(s))(p)
```

`topDown` and `bottomUp` are useful strategies traversing an expression either from the top or from the bottom, trying to apply a strategy at every sub-expression and stopping at the first successful application. If the strategy is not applicable at any sub-expression, `topDown` and `bottomUp` fail. The `tryAll` strategy is often more useful as it wraps its given strategy in a `try` and thus never fails but applies the strategy wherever possible. Also, note that the `tryAll` strategy traverses the AST bottom-up instead of top-down.

When implementing rewrite rules such as the `mapFusion` rule as strategies, the match statement expects the input expression to be in a particular syntactic form. For a functional language like RISE, we might, for example, expect that expressions are fully β-reduced. To ensure that expressions satisfy a *normal form*, we define:

```scala
def normalize[P]: Strategy[P] => Strategy[P] =
  s => p => repeat(topDown(s))(p)
```

The `normalize` strategy repeatedly applies a given strategy to every possible sub-expression until it cannot be applied anymore. Therefore, after `normalize` successfully finishes, it is not possible to apply the given strategy to any sub-expression anymore.

For RISE specifically, we use in the following two normal forms, whose implementations are explained in the original paper^{5}:

- the Beta-Eta-Normal-Form (`BENF`) transforms a RISE expression such that the standard λ-calculus transformations β-reduction and η-reduction are no longer applicable, and
- the Data-Flow-Normal-Form (`DFNF`) ensures a particular syntactic structure for higher-order RISE primitives like `map`. Specifically, `DFNF` ensures that a function abstraction is present in every higher-order primitive and that higher-order primitives are fully applied.

### 5. Optimizations as Strategies

In this section, we explore the use of ELEVATE to encode high-performance optimizations by leveraging its ability to define custom abstractions. We use TVM^{4} as a comparison point: a state-of-the-art imperative optimizing deep learning compiler with a scheduling API implemented in Python. TVM allows expressing computations using an EDSL (in Python) and controlling the application of optimizations using a separate scheduling API.

We start by expressing basic scheduling primitives such as `parallel` in ELEVATE. Then, we explore the implementation of more complex scheduling primitives like `tile` by composition in ELEVATE, whereas it is a built-in optimization in TVM. Following our functional approach, we express sophisticated optimization strategies as compositions of a small set of general rewrite rules, resulting in a more principled and even more powerful design. Specifically, the tiling optimization strategy in ELEVATE can tile arbitrarily many dimensions instead of only two, while being composed of only five RISE-specific rewrite rules.

**5.1. Basic scheduling primitives as strategies**

TVM’s scheduling primitives `parallel`, `split`, and `unroll` specify loop transformations. We implement those as rewrite rules for RISE. The `parallel` primitive indicates the parallel computation of a particular loop. In RISE, this is indicated by transforming a high-level `map` into its low-level `mapPar` version, as expressed in the following ELEVATE strategy:

```scala
def parallel: Strategy[Rise] = p => p match {
  case map => Success(mapPar)
  case _   => Failure(parallel)
}
```

We define a strategy for the sequential `mapSeq` similarly.

TVM’s `split` primitive implements loop blocking. In RISE, this is achieved by rewriting the computation over an array expressed by `map(f)`: first, the input is split into a two-dimensional array using `split(n)`; then `f` is `map`ped twice to apply the function to all elements of the now nested array; finally, the resulting array is flattened into the original one-dimensional form using `join`.

```scala
def split(n: Int): Strategy[Rise] = p => p match {
  case app(map, f) =>
    Success(split(n) >> map(map(f)) >> join)
  case _ => Failure(split(n))
}
```
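The effect of this rewrite can be mimicked with plain Scala collections (a sketch of the semantics, not the RISE implementation): `grouped(n)` plays the role of `split(n)` and `flatten` the role of `join`, and the blocked computation returns the same result as the original `map`:

```scala
// Illustration only: Scala collections as stand-ins for RISE primitives.
def plain(xs: Seq[Int], f: Int => Int): Seq[Int] = xs.map(f)

def blocked(xs: Seq[Int], n: Int, f: Int => Int): Seq[Int] =
  xs.grouped(n).toSeq // split(n): view the array as blocks of size n
    .map(_.map(f))    // map(map(f)): apply f inside each block
    .flatten          // join: restore the one-dimensional array
```

The rewrite changes only the loop structure (a nested loop over blocks) while preserving the computed result.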

The `unroll` strategy rewrites the high-level `map` and `reduce` primitives into specific RISE low-level primitives that will be unrolled by the RISE compiler during code generation.

**5.2. Multidimensional tiling as a strategy**

Tiling is a crucial optimization improving the cache hit rate by exploiting locality within a small neighborhood of elements. TVM’s `tile`

is a more complicated scheduling primitive to implement because it is essentially a combination of two traditional loop transformations: loop blocking and loop interchange. In fact, `tile`

in TVM is a built-in combination of `split`

for loop blocking and `reorder`

for loop interchange. We already saw how to implement `split`

using `ELEVATE`

. We will now discuss how to implement a `tile`

strategy using a combination of rules, normal-forms, and domain-specific traversals. Where TVM only implements 2D tiling, our generalized strategy tiles an arbitrary number of dimensions.

We require five basic rules to express our multidimensional tiling strategy: `splitJoin`, `addId`, `idToTranspose`, `transposeMove`, and `mapFission` (all shown in Figure 3). In addition, we require three standard λ-calculus-specific transformations: η- and β-reduction, and η-abstraction. We implement these rules as basic ELEVATE strategies.

Listing 3 shows the ELEVATE implementation of the tiling optimization. The multidimensional `tileND` strategy expects a list of tile sizes, one per tiled dimension. The intuition for the implementation is simple: first, we ensure that the required rules are applicable to the input expression by normalizing it using the `DFNF` normal form. Then, we apply the previously introduced `split` strategy to every `map` to be blocked, recursively going from innermost to outermost, as explained below. Finally, the `interchange` strategy rearranges the blocked loops into the expected final order; this strategy is explained in detail in the original paper.^{5}

**Listing 3. ELEVATE strategy implementing tiling recursively for arbitrary dimensions.**

To recursively apply the loop blocking strategy to nested maps, we make use of the RISE-specific traversal `fmap`:

```scala
def fmap: Traversal[Rise] = s =>
  function(argOf(map, body(s)))
```

`fmap` traverses to the function argument of a `map` primitive and applies the given strategy `s`. Note that the strategy requires the function argument of a `map` primitive to be a function abstraction, which we can assume because we normalize the expression using `DFNF`. The `fmap` strategy is useful because it can be nested to “push” the application of the given strategy inside a `map`-nest. For example,

`fmap(fmap(split(n)))(DFNF(map(map(map(f)))))`

skips two `map`s and applies loop blocking to the innermost `map`. In line 4 of Listing 3, we use `fmap` to recursively call `tileND`, applying loop blocking to the inner `map`s before the outer ones.

**Listing 4. ELEVATE blocking strategy.**

**5.3. Abstractions for describing locations in RISE**

In TVM, named identifiers describe locations at which optimizations should be applied; for example, TVM’s `split` is invoked with an argument specifying the loop to block.

Using identifiers ties schedules to computational expressions and makes reusing schedules hard. ELEVATE does not use names to identify locations but instead uses the traversals defined in Section 4. This is another example of how we facilitate reuse—one of the key aims of our approach.

By using ELEVATE’s traversal strategies, we apply the basic scheduling strategies in a much more flexible way: for example, `topDown(parallel)` traverses the AST from top to bottom and will thus always parallelize the outermost `map`, corresponding to the outermost for loop. `tryAll(parallel)` traverses the whole AST instead, and all `map`s are parallelized.

In order to apply optimizations on large ASTs, it is often insufficient to use the `topDown` or `tryAll` traversals. For example, we might want to block a specific loop in a loop nest. None of the traversals introduced so far conveniently describe a precise location, such as a particular loop, as required for these kinds of optimizations. *Strategy predicates* allow us to describe locations in a convenient way. A strategy predicate checks the program for a syntactic structure and returns `Success` without changing the program if the structure is found. Two simple examples of strategy predicates are `isReduce` and `isApp`, which check whether the current node is a `reduce` primitive or an applied function, respectively. These predicates can be composed with the regular traversals to define precise locations. The `@` strategy allows us to describe the application of strategies at precise locations conveniently:

```scala
def `@`[P](s: Strategy[P], t: Traversal[P]) = t(s)
```

We write this function in infix notation. The left-hand side of the `@` operator specifies the strategy to apply, and the right-hand side specifies the precise location as a traversal. This nicely separates the strategy to apply from the traversal describing the location. This abstraction is especially useful for complex optimization strategies with nested location descriptions. For RISE, we specify multiple useful traversals and predicates, which can be extended as needed. Two useful ones are `outermost` and `mapNest`, defined as follows:

`def outermost: Strategy[Rise] => Traversal[Rise]`
`  = pred => s => topDown(pred ';' s)`
`def mapNest(d: Int): Strategy[Rise] = p =>`
`  (d match {`
`    case x if x == 0 => Success(p)`
`    case x if x < 0  => Failure(mapNest(d))`
`    case _ => fmap(mapNest(d-1))(p) })`

`outermost` traverses from top to bottom, visiting nested primitives from outermost to innermost and trying to apply the predicate. Where the predicate is applied successfully, it applies the given strategy `s`. Similarly, we define the function `innermost`, which uses `bottomUp` instead. The `mapNest` predicate recursively traverses a `DFNF`-normalized `map` nest of a given depth using nested `fmap` traversals. If the traversal is successful, a `map` nest of depth `d` has been found.
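The separation that `@` provides can be mimicked in a short Python sketch, with Python's own `@` operator standing in for ELEVATE's infix `@` (the names and the list-based "programs" below are hypothetical, chosen only for illustration):

```python
class Strategy:
    # wraps a program-transforming function so that `s @ t` reads as
    # "apply strategy s at the location described by traversal t"
    def __init__(self, f):
        self.f = f

    def __call__(self, p):
        return self.f(p)

    def __matmul__(self, traversal):
        # mirrors def '@'(s, t) = t(s): the traversal turns a local
        # strategy into one acting on the whole program
        return traversal(self)

def at_index(i):
    # toy traversal over list programs: apply the strategy to element i
    def t(s):
        return Strategy(lambda p: p[:i] + [s(p[i])] + p[i + 1:])
    return t

double = Strategy(lambda x: x * 2)
print((double @ at_index(1))([1, 2, 3]))  # [1, 4, 3]
```

Error handling (`Success`/`Failure`) is elided here to keep the focus on how `@` separates the *what* (the strategy) from the *where* (the traversal).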

By combining these abstractions, we elegantly describe applying the tiling optimization to the two outermost loops in `ELEVATE`:

`(tile(32,32) '@' outermost(mapNest(2)))(mm)`

### 6. Experimental Evaluation

In this section, we evaluate our functional approach to high-performance code generation. We use `ELEVATE` strategies to describe optimizations that are equivalent to TVM schedules, using matrix-matrix multiplication as our case study. We compare the performance achieved using code generated by the `RISE` compiler and code generated by TVM. The original paper^{5} also includes a performance comparison with Halide.

**6.1. Optimizing matrix-matrix multiplication**

For our case study of matrix-matrix multiplication, we follow a tutorial from the TVM authors that discusses seven progressively optimized versions: *baseline, blocking, vectorized, loop permutation, array packing, cache blocks,* and *parallel.* For each TVM schedule, we developed an equivalent `ELEVATE` strategy using the TVM-like scheduling abstractions and the traversal utilities. The *vectorized, loop permutation,* and *cache blocks* versions are not discussed here for brevity but are discussed in the original paper^{5}; we discuss the rest here.

**Baseline.** For the *baseline* version, TVM uses a default schedule, whereas `ELEVATE` describes the implementation decisions explicitly, as shown in Figure 5, one of the key aims that we set out for our approach.
**Figure 5. RISE matrix multiplication expression (top) and baseline strategy in ELEVATE (bottom).**

The TVM algorithm shown in Listing 1 computes the dot product in a single statement in line 6. The `RISE` program shown at the top of Figure 5 describes the dot product with separate `map` and `reduce` primitives, which are fused as described in the `ELEVATE` program below it using the `fuseReduceMap` rewrite rules from Figure 3. The `lowerToC` strategy rewrites `map` and `reduce` into their sequential versions. Both systems generate equivalent C code: two nested loops iterating over the output matrix and a single nested reduction loop performing the dot product.
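The effect of fusing `map` and `reduce` can be illustrated with a small Python sketch, with plain Python functions standing in for the RISE primitives (the function names here are hypothetical):

```python
from functools import reduce

def unfused_dot(a, b):
    # separate map and reduce: materializes an intermediate list
    products = list(map(lambda xy: xy[0] * xy[1], zip(a, b)))
    return reduce(lambda acc, x: acc + x, products, 0)

def fused_dot(a, b):
    # fused form: a single reduction loop with no intermediate storage,
    # mirroring what the fuseReduceMap rewrite achieves on RISE programs
    return reduce(lambda acc, xy: acc + xy[0] * xy[1], zip(a, b), 0)

print(unfused_dot([1, 2, 3], [4, 5, 6]))  # 32
print(fused_dot([1, 2, 3], [4, 5, 6]))    # 32
```

Both compute the same dot product, but the fused version avoids allocating and traversing the intermediate array of products, which is why the generated C code needs only a single nested reduction loop.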

**Blocking.** For the *blocking* version, we reuse the `baseline` and `lowerToC` strategies, but first we use the abstractions built in the previous sections to emulate the TVM schedule, as shown in Listings 4 and 5. We first apply `tile`, then `split`, and then `reorder`, just as specified in the TVM schedule. To `split` the reduction, we need to fission the fused map and reduce primitives again using `fissionReduceMap`. We describe locations using `outermost` and `innermost`, applying `tile` to the outermost `map`s and `split` to the nested reduction. In contrast to TVM, for `reorder` we identify dimensions by index rather than by name. For convenience, we introduce the `;;` combinator, denoting that we apply `DFNF` to normalize intermediate expressions between each step.
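The idea behind `;;` can be sketched in a few lines of Python, encoding `Success`/`Failure` as plain tuples (the `normalize` function stands in for `DFNF` and, like the other names here, is an assumption of this sketch):

```python
def seq(fs, ss):
    # ';' composition: run fs, then ss on its result; propagate failure
    def go(p):
        r = fs(p)
        return ss(r[1]) if r[0] == "ok" else r
    return go

def seq_normalized(fs, ss, normalize):
    # ';;' composition: like ';' but normalizes the intermediate program
    return seq(fs, seq(normalize, ss))

# toy strategies on integer "programs"
inc = lambda p: ("ok", p + 1)
norm = lambda p: ("ok", p * 10)  # stand-in for DFNF normalization
print(seq_normalized(inc, inc, norm)(0))  # ('ok', 11)
```

Defining `;;` as ordinary composition with a normalization step in between keeps each scheduling strategy free to assume its input is in normal form.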
**Listing 5. TVM blocking schedule**

**Array packing.** As already discussed in the motivation section, some optimizations are not expressible in TVM's scheduling API without changing the algorithm, clearly violating the separation between specifying computations and optimizations. Specifically, the *array packing* optimization permutes the elements of matrix B in memory to improve memory access patterns, introducing an additional computation `pB` in Listing 2 line 6 before using it in line 8.
**Listing 6. ELEVATE parallel strategy**

For our implementation of the *array packing* optimization, we are not required to change the `RISE` program; instead we define and apply the array packing of matrix *B* simply as another rewrite step in `ELEVATE` (Listing 6 line 10). Our `arrayPacking` strategy is itself composed of other strategies, for example, `storeInMemory` and `loopPerm`, which are omitted here for brevity but explained in detail in the original paper.^{5}

**Parallel.** The TVM version *parallel* changes the algorithm yet again to introduce a temporary buffer (`CC` in Listing 2 line 11) for the accumulation along the K-dimension, improving the cache writing behavior, and unrolls the inner reduction loop. For expressing the *parallel* version in `ELEVATE` (Listing 6 line 12), we reuse the `arrayPacking` strategy (line 13) while adding strategies for unrolling the innermost reduction (line 16) and parallelizing the outermost loop (line 14).
**6.2. Rewriting overhead and performance**

We now investigate the scalability, overhead, and performance of our functional rewrite-based approach.

**Scalability and overhead.** To evaluate scalability and the overhead of rewriting, we count the number of successfully applied rewrite steps performed when applying a strategy to the `RISE` matrix multiplication expression. We count every intermediate step, including traversals, as these are implemented as rewrite steps too. Figure 6 shows the number of rewrites for each version. No major optimizations are applied to the *baseline* version, and 657 rewrite steps are performed. However, as soon as interesting optimizations are applied, we reach about 40,000 steps for the next three versions and about 63,000 for the most complicated optimizations. These high numbers clearly show the scalability of our compositional approach, in which complex optimizations are composed from a small set of fundamental building blocks. They also show that abstractions are required to control this many rewrite steps: the high-level strategies encode practical optimizations and hide the massive number of individual rewrite steps performed. Applying the strategies to the `RISE` expression took less than two seconds per version on a commodity laptop, demonstrating the moderate overhead of our unoptimized implementation.

**Figure 6. Total number of successful rewrite steps when applying different optimization strategies.**

**Performance comparison against TVM.** Finally, we are interested in the performance achieved when optimizing `RISE` programs with `ELEVATE` compared with TVM. Ideally, the `RISE` code optimized with `ELEVATE` should be similar to the TVM-generated code and achieve competitive performance. We performed measurements on an Intel multicore CPU; for a detailed description of the experimental setup, see the original paper.^{5}

Figure 7 shows the performance of `RISE`- and TVM-generated code. The code generated by `RISE` under the control of the `ELEVATE` optimization strategies performs competitively with TVM. Our experiment indicates a matching trend across versions compared with TVM, showing that our `ELEVATE` strategies encode the same optimizations used in the TVM schedules. The most optimized parallel `RISE`-generated version improves performance over the baseline by about 110x. The strategies, developed in an extensible style by composing individual rewrite steps using `ELEVATE`, are practically usable and provide competitive performance for important high-performance code optimizations.

**Figure 7. Performance of TVM- vs. RISE-generated code that has been optimized by ELEVATE strategies.**

### 7. Conclusion

In this paper, we presented a holistic functional approach to high-performance code generation. We presented two functional languages: `RISE` for describing computations as compositions of data-parallel patterns and `ELEVATE` for describing optimization strategies as compositions of rewrite rules. We showed that our approach successfully: *separates concerns* by truly separating the computation and strategy languages; *facilitates reuse* of computational patterns as well as rewrite rules; *enables composability* by building programs as well as rewrite strategies as compositions of a small number of fundamental building blocks; *allows reasoning* about programs and strategies with well-defined semantics; and *is explicit* by empowering users to control the optimization strategy, which our compiler respects. In contrast to existing imperative systems with scheduling APIs such as Halide and TVM, programmers are not restricted to applying a set of built-in optimizations but define their own optimization strategies. Our experimental evaluation demonstrates that our holistic functional approach achieves competitive performance compared with the state-of-the-art code generator TVM.
