## Thursday, August 3, 2017

### Building Haskell projects with GHC

For some reason, building Haskell projects directly with GHC/GHCi is usually much faster than going through Stack or cabal-install. This post explains how to do it.

UPDATE: A more accurate title would be: Building Haskell projects with ghc --make and cabal-cargs.

UPDATE: A more accurate description of the problem is:

• By default, Stack and cabal-install will compile with optimizations, linking, etc. to produce efficient libraries or executables. If we are only interested in checking the code or simple testing without performance concerns, we can speed up the process by turning these things off when calling GHC.
• Even if we modify the Cabal file with ghc-options to turn off optimizations, etc., two problems remain with Stack/cabal-install:
  • To my knowledge, they are not able to build just a single file and its dependencies. In contrast, ghc --make can target a single file just fine.
  • When a package consists of a main library and targets such as test suites that depend on the library, Stack/cabal-install have to build and install the whole main library before starting to build the test suite. In contrast, ghc --make can view the whole project as a single unit and only compile the necessary files.

# Prerequisites

Install the command line tool cabal-cargs.

GHC can be used as a fast tool for checking syntax/types/etc. of a file in the project. Here is one way to do that:

    ghc --make -O0 -no-link -dynamic -hidir .ghc-temp -odir .ghc-temp $(cabal-cargs) file.hs

Some things to note:

• We are not interested in generating an optimized executable, so we turn off optimizations and linking by passing -O0 -no-link to GHC.
• We put temporary files in the directory .ghc-temp to avoid sprinkling temporary files in the project directory.
• We use cabal-cargs to get all required additional flags from the project’s Cabal file.

The first time the above command is issued for file.hs, GHC will also build the modules on which file.hs depends. In subsequent runs, only dependencies that have changed will be rebuilt.

The way cabal-cargs is used in the above command, it will set the flags for the whole project, including all targets (library, test suites, executables, etc.). In order to build only files from a particular target, pass the flag --sourcefile to cabal-cargs:

    ghc --make -O0 -no-link -dynamic -hidir .ghc-temp -odir .ghc-temp $(cabal-cargs --sourcefile=test/file.hs) test/file.hs

This option allows cabal-cargs to find the target that includes test/file.hs and only set the flags for this target. Note that if, for example, test/file.hs is part of a test suite that depends on the main library, you need to install the main library before running the above command.

# Using GHCi

In order to load a project file in GHCi, we just change some flags to GHC:

    ghc --interactive -hidir .ghc-temp -odir .ghc-temp $(cabal-cargs) file.hs

Note that by keeping the -hidir and -odir flags, we allow GHCi to make use of previously generated intermediate files. This means that it only has to load (and interpret) files that have changed since they were previously built. Thus, using ghc --make before ghc --interactive can significantly improve the time it takes for the interactive mode to load.

# Interaction with Stack and Cabal sandboxes

Users of Stack or Cabal sandboxes will need to set the environment that tells GHC where to find installed packages before running the above commands. A convenient way to do this is to use stack exec or cabal exec; for example:

    stack exec -- ghc --make -O0 -no-link -dynamic -hidir .ghc-temp -odir .ghc-temp $(cabal-cargs) file.hs

or

    cabal exec -- ghc --make -O0 -no-link -dynamic -hidir .ghc-temp -odir .ghc-temp $(cabal-cargs) file.hs

# Convenient wrapper scripts

I use the following Bash scripts as convenient wrappers around the commands shown in this text:

    $ cat ghcc
    #!/bin/bash
    exec ghc --make -O0 -no-link -dynamic -hidir .ghc-temp -odir .ghc-temp $(cabal-cargs) "$@"

    $ cat ghcci
    #!/bin/bash
    exec ghc --interactive -hidir .ghc-temp -odir .ghc-temp $(cabal-cargs) "$@"

    $ cat ghcc1
    #!/bin/bash
    exec ghc --make -O0 -no-link -dynamic -hidir .ghc-temp -odir .ghc-temp $(cabal-cargs --sourcefile="$1") "$1"

    $ cat ghcci1
    #!/bin/bash
    exec ghc --interactive -hidir .ghc-temp -odir .ghc-temp $(cabal-cargs --sourcefile="$1") "$1"

## Wednesday, November 30, 2016

### Benchmarking FFT in Feldspar

This post is a report on some adventures in benchmarking and improving the FFT implementation included as an example in RAW-Feldspar. Feldspar is a Haskell-embedded domain-specific language for digital signal processing and embedded systems. It has a compiler that generates C code.

The main purpose of this experiment was to back up the claim that RAW-Feldspar can generate high-performance code. Additionally, we will see some of the advantages RAW-Feldspar provides when it comes to quickly exploring design alternatives. (In the remainder of the text we will refer to RAW-Feldspar as just Feldspar unless the context requires more specificity.)

In order to get a solid baseline for the benchmarks, we will compare against FFTW 3.x, which is among the fastest FFT implementations out there. We should note, however, that the point is not to compete with FFTW (which is unrealistic given the tremendous amount of work and sophistication that has gone into its development), but rather to get an absolute reference for comparison. If absolute performance is needed, FFTW can be called from Feldspar.

The following files are relevant to this post:

• FFT.hs – the final version of the FFT
• FFT_bench.hs – the benchmark code for Feldspar’s FFT
• FFT_bench.c – the generated C code for the FFT benchmark for input size $2^{16}$. Note that it’s a standalone program that has the FFT inlined into the main function.
• FFTW.hs – a Feldspar wrapper for FFTW and code to benchmark it

In order to run the code, make sure to first install RAW-Feldspar:

cabal install raw-feldspar

# FFT in Feldspar

Feldspar uses the iterative FFT described by Axelsson and Sheeran (2012). Readers who want to understand the implementation in detail are encouraged to read the paper. However, there are some differences due to the fact that the paper was written for a different version of Feldspar. In particular, the FFT in RAW-Feldspar has a more explicit treatment of memory.

The core of the Feldspar FFT is given by the function fftCore (here slightly simplified):

fftCore
    :: ( Pully ts (Data (Complex a))
       , Manifestable Run vec (Data (Complex a))
       , Finite vec
       , RealFloat a
       , PrimType a
       , PrimType (Complex a)
       )
    => Store (Data (Complex a))  -- ^ Double-buffered storage
    -> ts                        -- ^ Twiddle factors
    -> Data Length               -- ^ Number of stages
    -> vec                       -- ^ Vector to transform
    -> Run (DManifest (Complex a))
fftCore st ts n vec = do
    let step i = return . twids ts n i (length vec) . bfly i
    loopStore st (n,-1,Excl 0) (step . subtract 1) vec
        >>= bitRev st n

Let us start by explaining the type:

• The first argument is a double-buffered store used to hold intermediate results. The FFT does not allocate any memory on its own.
• The second argument is a vector of twiddle factors. Taking this vector as argument allows the twiddle factors to be pre-computed and even shared across multiple FFT runs.
• The third argument – n – is the number of stages needed in the FFT. It must hold that $l = 2^n$ where $l$ is the length of the input vector. Taking n as an argument avoids recomputing this value when running multiple FFTs of the same size.
• The fourth argument is the vector to be transformed.
• The somewhat scary class context has to do with the fact that we want fftCore to work for different types of vectors (type variables ts and vec) and different types of elements (type variable a). The overloading of ts will come in handy below when we move between inlined and pre-computed twiddle factors.
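The relation between the length and the number of stages, $l = 2^n$, can be sketched in plain Haskell. RAW-Feldspar's own ilog2 works on Feldspar expressions; the following Int version is just for illustration:

```haskell
import Data.Bits (countLeadingZeros, finiteBitSize)

-- Integer base-2 logarithm of a power of two, so that n = ilog2 l
ilog2 :: Int -> Int
ilog2 l = finiteBitSize l - 1 - countLeadingZeros l
```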

The body of the function consists of a loop of n iterations followed by a bit-reversal permutation. The loop body is given by the step function, which comprises one “stage” of the FFT. Each stage is a butterfly (bfly) followed by twiddle factor multiplication (twids).
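The final bit-reversal permutation is standard; here is a list-based plain-Haskell sketch (the real bitRev works in place on Feldspar arrays) that reverses the lowest n bits of each index:

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))

-- Bit-reversal permutation for a list of length 2^n (toy sketch)
bitRevPerm :: Int -> [a] -> [a]
bitRevPerm n xs = [ xs !! rev i | i <- [0 .. length xs - 1] ]
  where
    -- reverse the lowest n bits of i
    rev i = foldl (\r b -> (r `shiftL` 1) .|. ((i `shiftR` b) .&. 1)) 0 [0 .. n - 1]
```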

Here is a typical way to call fftCore for an input vector vec:

    do ...
       st <- newStore (length vec)
       n  <- shareM (ilog2 (length vec))
       ts <- manifestFresh $ Pull (twoTo (n-1)) (\i -> tw (twoTo n) i)
       fftCore st ts n vec
       ...

The first line allocates a new double-buffered store of the appropriate length, and the next line computes the number of stages using the integer base-2 logarithm. The third line computes the vector of twiddle factors.

## Twiddle factor alternatives

The twiddle vector is constructed as a “pull” vector. In general, Pull l f constructs a vector of length l where the element at index i is given by f i. The function manifestFresh writes this vector to a fresh array, which is then passed as the ts argument to fftCore. This means that the twiddle factors are computed right away, and fftCore can access them cheaply by just looking them up in the array.

Interestingly, we can also choose to skip manifestFresh and pass the pull vector directly to fftCore (remember the overloading of the arguments of fftCore):

       ...
       ts <- return $ Pull (twoTo (n-1)) (\i -> tw (twoTo n) i)
       fftCore st ts n vec
       ...

Pull vectors, unless explicitly written to memory, will always fuse with the computation that uses them, so the above change has the effect of inlining the twiddle factor computation into the FFT loop. With just a simple change on one line we are thus able to move back and forth between a naive FFT and one with pre-computed twiddle factors.
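To make the fusion behavior concrete, here is a toy model of pull vectors in plain Haskell (hypothetical, much-simplified types; RAW-Feldspar's real Pull ranges over Feldspar expressions, not Haskell values):

```haskell
-- Toy model: a pull vector is a length plus an index function
data Pull a = Pull Int (Int -> a)

-- Mapping composes with the index function: no intermediate array (fusion)
instance Functor Pull where
    fmap f (Pull l ix) = Pull l (f . ix)

-- Manifesting forces the elements into a concrete structure
manifest :: Pull a -> [a]
manifest (Pull l ix) = map ix [0 .. l - 1]
```

Skipping manifest thus means that each element is recomputed at every use site, which is exactly what happens when the twiddle factors are inlined into the FFT loop.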

## Loop unrolling

The function fftCore has two main parts: the loop with bfly and twids, and the bitRev permutation. Both of these parts are structured as doubly-nested loops: the outer loop runs over the $\log_2(l)$ stages, and each stage is a loop over the vector being transformed.

A common technique for increasing instruction-level parallelism is to unroll bodies of inner loops. This optimization is often performed automatically by the C compiler at high optimization levels, but there is no general guarantee that it will be applied in a given situation. For this reason, it is interesting to be able to express unrolling manually.

Feldspar provides a function for unrolling the computation described by a pull vector:

unroll :: (Pully vec a, MonadComp m)
    => Length  -- ^ Number of steps to unroll
    -> vec
    -> Push m a

The first argument is a Haskell integer (i.e. an integer known at compile time) that gives the number of steps to unroll. The result is a “push” vector. A push vector is similar to a pull vector in that it can fuse with computations at the use site. However, in contrast to pull, push captures a strategy for writing its elements to a destination (normally a memory array). For example, a push vector can use a loop that writes several elements in each iteration.
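The pull/push distinction can be modeled in plain Haskell (hypothetical, simplified types): a push vector is a program that, given a write action for (index, element) pairs, emits all its elements in whatever order and grouping it chooses:

```haskell
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Toy model: a push vector knows how to write itself to a destination
newtype Push m a = Push (((Int, a) -> m ()) -> m ())

-- Any pull-style vector (length + index function) can be converted to push
pushFrom :: Monad m => Int -> (Int -> a) -> Push m a
pushFrom l ix = Push $ \write -> mapM_ (\i -> write (i, ix i)) [0 .. l - 1]

-- "Writing to memory" just runs the push vector with a real write action
toListIO :: Push IO a -> IO [(Int, a)]
toListIO (Push run) = do
    ref <- newIORef []
    run (\p -> modifyIORef ref (++ [p]))
    readIORef ref
```

An unrolling combinator in this model would be a variant of pushFrom whose loop body performs several writes per iteration.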

To unroll the inner loop of fftCore by, say, two steps, we simply insert unroll 2 after the twids stage:

    let step i = return . unroll 2 . twids ts n i (length vec) . bfly i

Similarly, we can insert unroll in the definition of bitRev.

Here is the inner loop of the generated code without unrolling (local variable declarations omitted for brevity):

    for (v10 = 0; v10 < 65536; v10++) {
        ...
        let11 = (bool) (v10 & 1 << (int32_t) (uint32_t) v9);
        let12 = store2[v10 ^ 1 << (int32_t) (uint32_t) v9];
        let13 = store2[v10];
        let14 = let11 ? let12 - let13 : let13 + let12;
        store3[v10] = let11
            ? a5[(v10 & ~(4294967295 << (int32_t) (uint32_t) v9)) <<
                 (int32_t) r4 - 1 - (int32_t) (uint32_t) v9] * let14
            : let14;
    }

And the same loop unrolled by two steps:

    for (v10 = 0; v10 < 65536; v10 = v10 + 2) {
        ...
        r11 = v10;
        let12 = (bool) (r11 & 1 << (int32_t) (uint32_t) v9);
        let13 = store2[r11 ^ 1 << (int32_t) (uint32_t) v9];
        let14 = store2[r11];
        let15 = let12 ? let13 - let14 : let14 + let13;
        store3[r11] = let12
            ? a5[(r11 & ~(4294967295 << (int32_t) (uint32_t) v9)) <<
                 (int32_t) r4 - 1 - (int32_t) (uint32_t) v9] * let15
            : let15;
        r16 = v10 + 1;
        let17 = (bool) (r16 & 1 << (int32_t) (uint32_t) v9);
        let18 = store2[r16 ^ 1 << (int32_t) (uint32_t) v9];
        let19 = store2[r16];
        let20 = let17 ? let18 - let19 : let19 + let18;
        store3[r16] = let17
            ? a5[(r16 & ~(4294967295 << (int32_t) (uint32_t) v9)) <<
                 (int32_t) r4 - 1 - (int32_t) (uint32_t) v9] * let20
            : let20;
    }

# Evaluation

We have seen the interface to Feldspar’s FFT and how to explore different options regarding twiddle pre-computation and loop unrolling. The next step is to benchmark the FFT to see how the design decisions affect performance and how the FFT is doing compared to state of the art.

We will roughly follow the benchmark methodology advocated for FFTW. In particular, we present the fictive “MFLOPS” count in order to get a comparable efficiency metric that is independent of input size.
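FFTW's benchmark convention credits a size-n complex transform with $5 n \log_2 n$ floating-point operations (the asymptotic radix-2 count), regardless of how many operations are actually performed; as a small Haskell sketch:

```haskell
-- FFTW-style "mflops" for a complex transform of size n that took
-- timeUs microseconds: 5 n log2(n) / timeUs
mflops :: Double -> Double -> Double
mflops n timeUs = 5 * n * logBase 2 n / timeUs
```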

The benchmarks were run on a 3.00GHz Intel Core 2 Duo CPU (E8400) with 6MB L2 cache, using GCC 5.4.0 as compiler. Optimization level -O3 was used.

## FFTW

As a baseline for comparison, we first benchmark FFTW version 3.3.4. The benchmark is implemented as a Feldspar program that calls out to the fftw3 C library. The initialization or “planning” phase is done before the measurements start, as advocated by the FFTW benchmark methodology. Note that planning in FFTW includes pre-computing twiddle factors (Frigo and Johnson 2005).

The result is an efficiency ranging from around 8000 MFLOPS down to 3500 MFLOPS for input sizes from $2^4$ to $2^{17}$ (see plot below). The result is more or less consistent with the results presented here.

## Feldspar FFT

The program for benchmarking the Feldspar FFT uses the same principle as the one for FFTW. We pre-compute the twiddle factors and the number of stages, and this is done before the measurement starts.

The results are shown in the diagram below. The Feldspar FFT runs steadily at around 600 MFLOPS without unrolling and at an average of around 700 MFLOPS with unrolling. We chose to unroll the main loop by four steps and the bitRev loop by eight steps. The average speed gain from unrolling is 21%.

The plot does not include the measurements from inlining the twiddle factors. We simply note that inlining makes the FFT more than 7 times slower.

# Conclusion

## Performance

Our benchmarks showed several interesting things:

• The Feldspar FFT runs significantly slower than FFTW: from 1/9 of the speed for smaller inputs to 1/5 of the speed for larger inputs.
• Pre-computing the twiddle factors has a major impact on performance.
• Unrolling the inner loops has a noticeable but small impact on performance in the Feldspar FFT.

It is hard to know how big a difference compared to FFTW is reasonable for a relatively straightforward FFT implementation such as the one in Feldspar. Our working assumption is that the implementations included in FFTW’s own benchmark (see e.g. the results for the 3.0 GHz Intel Core Duo) are all reasonable in the sense that they would not compare against naive toy implementations. For example, Mix-FFT, which runs at around 1/6 of the speed of FFTW, is claimed to be a “very fast … FFT routine” on its home page.

Seeing that the Feldspar implementation compares rather well with Mix-FFT and other implementations in the FFTW benchmark, we make the hand-wavy conclusion that Feldspar and its FFT indeed perform reasonably well.

We have not done any evaluation of memory use. But manual inspection of the C code generated from the Feldspar FFT shows no signs of excessive memory use: three arrays are allocated (all on the stack) – one for the input, one for intermediate results, and one for the twiddle factors.

## Design exploration

One advantage of using Feldspar compared to programming directly in C is the ease by which alternative implementations can be explored. We saw two examples of such exploration in this post:

• Inlined vs. pre-computed twiddle factors
• Unrolling of inner loops

Both of these decisions can be altered by minimal changes to the existing code. In contrast, the changes would require much more work in C. Furthermore, the changes done in Feldspar (switching between return and manifestFresh, and inserting unroll) are always safe¹, whereas the corresponding changes in C are easy to get wrong.

# References

Axelsson, Emil, and Mary Sheeran. 2012. “Feldspar: Application and Implementation.” In Central European Functional Programming School: 4th Summer School, CEFP 2011, Budapest, Hungary, June 14-24, 2011, Revised Selected Papers, 402–39. Springer Berlin Heidelberg. doi:10.1007/978-3-642-32096-5_8.

Frigo, Matteo, and Steven G. Johnson. 2005. “The Design and Implementation of FFTW3.” Proceedings of the IEEE 93 (2): 216–31. doi:10.1109/JPROC.2004.840301.

1. Technically, unroll s is only safe to insert for vectors whose length is a multiple of s. Feldspar automatically emits an assert statement to check this requirement. It is possible to turn off such checks, and this is what we did for our benchmarks; see the flag compilerAssertions = select [] in FFT_bench.hs.

## Wednesday, March 30, 2016

### Compilation as a Typed EDSL-to-EDSL Transformation

[arXiv:1603.08865]

# Abstract

This article is about an implementation and compilation technique used in RAW-Feldspar, a complete rewrite of the Feldspar embedded domain-specific language (EDSL) (Axelsson et al. 2010). Feldspar is a high-level functional language that generates efficient C code to run on embedded targets.

The gist of the technique presented in this post is the following: rather than writing a back end that converts pure Feldspar expressions directly to C, we translate them to a low-level monadic EDSL, from which C code is then generated. This approach has the following advantages:

1. The translation is simpler to write than a complete C back end.
2. The translation is between two typed EDSLs, which rules out many potential errors.
3. The low-level EDSL is reusable and can be shared between several high-level EDSLs.

Although the article contains a lot of code, most of it is in fact reusable. As mentioned in Discussion, we can write the same implementation in less than 50 lines of code using generic libraries that we have developed to support Feldspar.

# Introduction

There has been a recent trend to implement low-level imperative EDSLs based on monads (Persson, Axelsson, and Svenningsson 2012; J. D. Svenningsson and Svensson 2013; Hickey et al. 2014; Bracker and Gill 2014). By using a deep embedding of monadic programs, imperative EDSL code can be translated directly to corresponding code in the back end (e.g. C or JavaScript code).

A deep embedding of a monad with primitive instructions gives a representation of the statements of an imperative language. But in order for a language to be useful, there also needs to be a notion of pure expressions. Take the following program as an example:

prog = do
    i <- readInput
    writeOutput (i+i)

In order to generate code from this program we need a representation of the whole syntax, including the expression i+i. And it is usually not enough to support such simple expressions. For example, in RAW-Feldspar, we can write the following,

prog = do
    i <- fget stdin
    printf "sum: %d\n" $ sum $ map square (0 ... i)
  where
    square x = x*x

and have it generate this C code:

#include <stdint.h>
#include <stdio.h>

int main()
{
    uint32_t v0;
    uint32_t let1;
    uint32_t state2;
    uint32_t v3;

    fscanf(stdin, "%u", &v0);
    let1 = v0 + 1;
    state2 = 0;
    for (v3 = 0; v3 < (uint32_t) (0 < let1) * let1; v3++) {
        state2 = v3 * v3 + state2;
    }
    fprintf(stdout, "sum: %d\n", state2);
    return 0;
}

Note that the pure expression sum $ map square (0 ... i) was turned into several statements in the C code (everything from let1 = to the end of the loop). What happened here was that sum $ map square (0 ... i) was first transformed to a more low-level (but still pure) expression that involves a let binding and a looping construct. This expression was then translated to C code.

However, the story is actually more interesting than that: Instead of translating the pure expression directly to C, which can be a tedious and error-prone task, it is first translated to a typed low-level EDSL that only has simple expressions and a straightforward translation to C. Going through a typed EDSL makes the translation simple and robust, and this low-level EDSL has the advantage of being generic and reusable between different high-level EDSLs.

This post will show a simplified version of the technique used in RAW-Feldspar. For the full story, see the RAW-Feldspar implementation (tag article-2016-03). The low-level EDSL that is used for compiling Feldspar to C is provided by the package imperative-edsl.

# Outline of the technique

Our technique is based on a deep monadic embedding of imperative programs:

data Prog exp a where ...

instance Monad (Prog exp)

By parameterizing Prog on the expression type exp, we can have imperative programs with different representations of expressions.

Next, we define two expression languages:

data LowExp a  where ...
data HighExp a where ...

LowExp is a simple language containing only variables, literals, operators and function calls. HighExp is a richer language that may contain things like let binding, tuples and pure looping constructs.

Since LowExp is so simple, we can easily – once and for all – write a back end that translates Prog LowExp to C code:

lowToCode :: Prog LowExp a -> String

Once this is done, we can implement the compiler for HighExp simply as a typed translation function between two EDSLs:

translateHigh :: Prog HighExp a -> Prog LowExp a

compile :: Prog HighExp a -> String
compile = lowToCode . translateHigh

If we want to make changes to HighExp, we only need to update the translateHigh function. And if we want to implement an EDSL with a different expression language, we simply write a different translation function for that language.

# A deep embedding of imperative programs

The code is available as a literate Haskell file.

We make use of the following GHC extensions:

{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE Rank2Types #-}
{-# LANGUAGE UndecidableInstances #-}

And we also need a few helper modules:

import Control.Monad.State
import Data.Int
import Data.IORef

We are now ready to define our monadic deep embedding. Our EDSL will be centered around the monad Prog (inspired by the Program monad in Operational and the work of J. D. Svenningsson and Svensson (2013)):

data Prog exp a where
    Return :: a -> Prog exp a
    (:>>=) :: Prog exp a -> (a -> Prog exp b) -> Prog exp b
    CMD    :: CMD exp a -> Prog exp a

instance Functor (Prog exp) where
    fmap f m = m >>= return . f

instance Applicative (Prog exp) where
    pure  = return
    (<*>) = ap

instance Monad (Prog exp) where
    return = Return
    (>>=)  = (:>>=)

As seen in the Monad instance, Return and (:>>=) directly represent uses of return and (>>=), respectively. CMD injects a primitive instruction from the type CMD a which will be defined below.

The attentive reader may object that Prog does not actually satisfy the monad laws, since it allows us to observe the nesting of (>>=) in a program. This is, however, not a problem in practice: we will only interpret Prog in ways that are ignorant of the nesting of (>>=).

(The Operational package honors the monad laws by only providing an abstract interface to Program, preventing interpretations that would otherwise be able to observe the nesting of (>>=).)

## Primitive instructions

The CMD type contains the primitive instructions of the EDSL. It has instructions for mutable references, input/output and loops:

data CMD exp a where
    -- References:
    InitRef :: Type a => exp a -> CMD exp (Ref a)
    GetRef  :: Type a => Ref a -> CMD exp (Val a)
    SetRef  :: Type a => Ref a -> exp a -> CMD exp ()

    -- Input/output:
    Read     :: CMD exp (Val Int32)
    Write    :: exp Int32 -> CMD exp ()
    PrintStr :: String -> CMD exp ()

    -- Loops:
    For :: exp Int32 -> (Val Int32 -> Prog exp ()) -> CMD exp ()

The instructions InitRef, GetRef and SetRef correspond to newIORef, readIORef and writeIORef from Data.IORef.

Read and Write are used to read integers from stdin and write integers to stdout. PrintStr prints a static string to stdout.

For n body represents a for-loop of n iterations, where body gives the program to run in each iteration. The argument to body ranges from 0 to n-1.

As explained in Outline of the technique, exp represents pure expressions, and we have it as a parameter in order to be able to use the same instructions with different expression types.

Now we have three things left to explain: Type, Val and Ref.

The Type class is just used to restrict the set of types that can be used in our EDSL. For simplicity, we will go with the following class:

class    (Eq a, Ord a, Show a) => Type a
instance (Eq a, Ord a, Show a) => Type a

Val is used whenever an instruction produces a pure value (e.g. GetRef). It would make sense to use exp a instead of Val a. However, that would lead to problems later when we want to translate between different expression types. We can think of Val a as representing a value in any expression type.

In order to interpret a program, we must have concrete representations of the types returned by the primitive instructions. For example, interpreting the program CMD Read :>>= \a -> ... requires creating an actual value of type Val Int32 to pass to the continuation. However, depending on the interpretation, we may need different representations for this value.

In a standard interpretation (e.g. running the program in Haskell’s IO monad), we want Val a to be represented by an actual value of type a. But in a non-standard interpretation such as compilation to C code, we want Val a to be a variable identifier. To reconcile these different needs, we define Val as a sum type:

data Val a
    = ValRun a       -- Concrete value
    | ValComp VarId  -- Symbolic value

-- Variable identifier
type VarId = String

For the same reason, Ref is defined to be either an actual IORef or a variable identifier:

data Ref a
    = RefRun (IORef a)  -- Concrete reference
    | RefComp VarId     -- Symbolic reference

Unfortunately, having sum types means that our interpretations will sometimes have to do incomplete pattern matching. For example, the standard interpretation of GetRef r will assume that r is a RefRun (a variable identifier has no meaning in the standard interpretation), while the interpretation that compiles to C will assume that it is a RefComp. However, the EDSL user is shielded from this unsafety. By making Val and Ref abstract types, we ensure that users’ programs are unaware of the interpretation which is applied to them. As long as our interpreters are correct, no-one will suffer from the unsafety due to Val and Ref being sum types.

## Interpretation

The following function lifts an interpretation of instructions in a monad m to an interpretation of programs in the same monad (corresponding to interpretWithMonad in Operational):

interpret :: Monad m
    => (forall a . CMD exp a -> m a)
    -> Prog exp a
    -> m a
interpret int (Return a) = return a
interpret int (p :>>= k) = interpret int p >>= interpret int . k
interpret int (CMD cmd)  = int cmd

Using interpret, we can define the standard interpretation that runs a program in the IO monad:

runIO :: EvalExp exp => Prog exp a -> IO a
runIO = interpret runCMD

runCMD :: EvalExp exp => CMD exp a -> IO a
runCMD (InitRef a)           = RefRun <$> newIORef (evalExp a)
runCMD (GetRef (RefRun r))   = ValRun <$> readIORef r
runCMD (SetRef (RefRun r) a) = writeIORef r (evalExp a)
runCMD Read                  = ValRun . read <$> getLine
runCMD (Write a)             = putStr $ show $ evalExp a
runCMD (PrintStr s)          = putStr s
runCMD (For n body)          = mapM_ (runIO . body . ValRun) [0 .. evalExp n - 1]

Most of the work is done in runCMD, which gives the interpretation of each instruction. Note the use of ValRun and RefRun on both sides of the equations. As said earlier, ValComp and RefComp are not used in the standard interpretation.

Since running a program also involves evaluating expressions, we need the constraint EvalExp exp that gives access to the function evalExp (for example in the interpretation of InitRef). EvalExp is defined as follows:

class EvalExp exp where
    -- Evaluate a closed expression
    evalExp :: exp a -> a

## Front end

We are now ready to define the smart constructors that we will give as the interface to the EDSL user. For many instructions, the smart constructor just takes care of lifting it to a program using CMD, for example:

initRef :: Type a => exp a -> Prog exp (Ref a)
initRef = CMD . InitRef

setRef :: Type a => Ref a -> exp a -> Prog exp ()
setRef r a = CMD (SetRef r a)

But instructions that involve Val will need more care. We do not actually want to expose Val to the user, but rather use exp to hold values. To achieve this, we introduce a class for expressions supporting injection of constants and variables:

class FreeExp exp where
    -- Inject a constant value
    constExp :: Type a => a -> exp a
    -- Inject a variable
    varExp :: Type a => VarId -> exp a

Using FreeExp, we can make a general function that converts Val to an expression – independent of interpretation:

valToExp :: (Type a, FreeExp exp) => Val a -> exp a
valToExp (ValRun a)  = constExp a
valToExp (ValComp v) = varExp v

Now we can define the rest of the front end, using valToExp where needed:

getRef :: (Type a, FreeExp exp) => Ref a -> Prog exp (exp a)
getRef = fmap valToExp . CMD . GetRef

readInput :: FreeExp exp => Prog exp (exp Int32)
readInput = valToExp <$> CMD Read

writeOutput :: exp Int32 -> Prog exp ()
writeOutput = CMD . Write

printStr :: String -> Prog exp ()
printStr = CMD . PrintStr

for :: FreeExp exp
    => exp Int32 -> (exp Int32 -> Prog exp ()) -> Prog exp ()
for n body = CMD $ For n (body . valToExp)

As a convenience, we also provide a function for modifying a reference using a pure function:

modifyRef :: (Type a, FreeExp exp)
    => Ref a -> (exp a -> exp a) -> Prog exp ()
modifyRef r f = setRef r . f =<< getRef r

# Pure expressions

Now that we have a front end for the imperative EDSL, we also need to define an expression type before we can use it for anything. We begin with a very simple expression type that only has variables, literals and some primitive functions:

data LowExp a where
    LVar :: Type a => VarId -> LowExp a
    LLit :: Type a => a -> LowExp a
    LAdd :: (Num a, Type a) => LowExp a -> LowExp a -> LowExp a
    LMul :: (Num a, Type a) => LowExp a -> LowExp a -> LowExp a
    LNot :: LowExp Bool -> LowExp Bool
    LEq  :: Type a => LowExp a -> LowExp a -> LowExp Bool

As is common in Haskell EDSLs, we provide a Num instance to make it easier to construct expressions:

instance (Num a, Type a) => Num (LowExp a) where
    fromInteger = LLit . fromInteger
    (+) = LAdd
    (*) = LMul

This instance allows us to write things like 5+6 :: LowExp Int32. The implementation of LowExp is completed by instantiating the EvalExp and FreeExp classes:

instance EvalExp LowExp where
    evalExp (LLit a)   = a
    evalExp (LAdd a b) = evalExp a + evalExp b
    evalExp (LMul a b) = evalExp a * evalExp b
    evalExp (LNot a)   = not $ evalExp a
    evalExp (LEq a b)  = evalExp a == evalExp b

instance FreeExp LowExp where
    constExp = LLit
    varExp   = LVar

## First example

We can now write our first imperative example program:

sumInput :: Prog LowExp ()
sumInput = do
    r <- initRef 0
    printStr "Please enter 4 numbers\n"
    for 4 $ \i -> do
        printStr " > "
        n <- readInput
        modifyRef r (+n)
    printStr "The sum of your numbers is "
    s <- getRef r
    writeOutput s
    printStr ".\n"

This program uses a for-loop to read four numbers from the user. It keeps the running sum in a reference, and prints the final sum before quitting. An example run can look as follows:

*EDSL_Compilation> runIO sumInput
Please enter 4 numbers
 > 1
 > 2
 > 3
 > 4
The sum of your numbers is 10.

## Code generation

The code generator is defined using interpret with the target being a code generation monad Code:

code :: ShowExp exp => Prog exp a -> Code a
code = interpret codeCMD

lowToCode :: Prog LowExp a -> String
lowToCode = runCode . code

The definition of codeCMD is deferred to Appendix: Code generation. In this article, the code generator only produces simple imperative pseudo-code. Here is the code generated from the sumInput example:

*EDSL_Compilation> putStrLn $ lowToCode sumInput
    r0 <- initRef 0
    printStr "Please enter 4 numbers\n"
    for v1 < 4
        printStr " > "
        v2 <- readInput
        v3 <- getRef r0
        setRef r0 (v3 + v2)
    end for
    printStr "The sum of your numbers is "
    v4 <- getRef r0
    writeOutput v4
    printStr ".\n"

# High-level expressions

The expression language LowExp is really too simple to be very useful. For example, it does not allow any kind of iteration, so any iterative program has to be written using the monadic for-loop.

To fix this, we introduce a new expression language, HighExp:

    data HighExp a where
      -- Simple constructs (same as in LowExp):
      HVar :: Type a => VarId -> HighExp a
      HLit :: Type a => a -> HighExp a
      HAdd :: (Num a, Type a) => HighExp a -> HighExp a -> HighExp a
      HMul :: (Num a, Type a) => HighExp a -> HighExp a -> HighExp a
      HNot :: HighExp Bool -> HighExp Bool
      HEq  :: Type a => HighExp a -> HighExp a -> HighExp Bool

      -- Let binding:
      Let  :: Type a
           => HighExp a                 -- value to share
           -> (HighExp a -> HighExp b)  -- body
           -> HighExp b

      -- Pure iteration:
      Iter :: Type s
           => HighExp Int32            -- number of iterations
           -> HighExp s                -- initial state
           -> (HighExp s -> HighExp s) -- step function
           -> HighExp s                -- final state

    instance (Num a, Type a) => Num (HighExp a) where
      fromInteger = HLit . fromInteger
      (+) = HAdd
      (*) = HMul

    instance FreeExp HighExp where
      constExp = HLit
      varExp   = HVar

HighExp contains the same simple constructs as LowExp, but has been extended with two new constructs: let binding and iteration. Both of these constructs use higher-order abstract syntax (HOAS) (Pfenning and Elliot 1988) as a convenient way to introduce local variables.
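To see the HOAS idea in isolation, here is a hedged standalone sketch (names are illustrative, not the article's): the body of a let binding is an ordinary Haskell function, so the host language takes care of variable binding and substitution.

```haskell
{-# LANGUAGE GADTs #-}

-- A tiny HOAS expression language with a let construct
data E a where
  ELit :: a -> E a
  EAdd :: Num a => E a -> E a -> E a
  ELet :: E a -> (E a -> E b) -> E b  -- the body is a Haskell function

evalE :: E a -> a
evalE (ELit a)      = a
evalE (EAdd a b)    = evalE a + evalE b
evalE (ELet a body) = evalE (body (ELit (evalE a)))  -- call-by-value sharing

-- Share the result of (1 + 2) and use it twice
example :: E Int
example = ELet (EAdd (ELit 1) (ELit 2)) (\x -> EAdd x x)

main :: IO ()
main = print (evalE example)  -- prints 6
```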

The following program uses HighExp to represent expressions:

    powerInput :: Prog HighExp ()
    powerInput = do
        printStr "Please enter two numbers\n"
        printStr " > "; m <- readInput
        printStr " > "; n <- readInput
        printStr "Here's a fact: "
        writeOutput m; printStr "^"; writeOutput n; printStr " = "
        writeOutput $ Iter n 1 (*m)
        printStr ".\n"

It asks the user to input two numbers m and n, and prints the result of m^n. Note the use of Iter to compute m^n as a pure expression.

# Compilation as a typed EDSL translation

Now we are getting close to the main point of the article. This is what we have so far:

• A way to generate code from low-level programs (of type Prog LowExp a).
• The ability to write high-level programs (of type Prog HighExp a).

What is missing is a function that translates high-level programs to low-level programs. Generalizing the problem a bit, we need a way to rewrite a program over one expression type to a program over a different expression type – a process which we call “re-expressing”. Since Prog is a monad like any other, we can actually express this translation function using interpret:

    reexpress :: (forall a . exp1 a -> Prog exp2 (exp2 a))
              -> Prog exp1 b
              -> Prog exp2 b
    reexpress reexp = interpret (reexpressCMD reexp)

    reexpressCMD :: (forall a . exp1 a -> Prog exp2 (exp2 a))
                 -> CMD exp1 b
                 -> Prog exp2 b
    reexpressCMD reexp (InitRef a)  = CMD . InitRef =<< reexp a
    reexpressCMD reexp (GetRef r)   = CMD (GetRef r)
    reexpressCMD reexp (SetRef r a) = CMD . SetRef r =<< reexp a
    reexpressCMD reexp Read         = CMD Read
    reexpressCMD reexp (Write a)    = CMD . Write =<< reexp a
    reexpressCMD reexp (PrintStr s) = CMD (PrintStr s)
    reexpressCMD reexp (For n body) = do
        n' <- reexp n
        CMD $ For n' (reexpress reexp . body)

The first argument to reexpress translates an expression in the source language to an expression in the target language. Note the type of this function: exp1 a -> Prog exp2 (exp2 a). Having a monadic result type means that source expressions do not have to be mapped directly to corresponding target expressions – they can also generate imperative code!

The ability to generate imperative code from expressions is crucial. Consider the Iter construct. Since there is no corresponding construct in LowExp, the only way we can translate this construct is by generating an imperative for-loop.
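The point can be made concrete with a small standalone sketch (illustrative names, not the article's API): translating an expression returns the code for its value, but may also emit statements along the way when a construct has no pure counterpart in the target.

```haskell
import Control.Monad.State

-- Source expressions; Square has no pure counterpart in the target here
data Expr = Lit Int | Add Expr Expr | Square Expr

-- Translation monad: fresh-name counter plus emitted statements
type Gen = State (Int, [String])

fresh :: Gen String
fresh = do
  (n, ss) <- get
  put (n + 1, ss)
  return ("v" ++ show n)

emit :: String -> Gen ()
emit s = modify (\(n, ss) -> (n, ss ++ [s]))

-- Note the monadic result type: translation can generate statements
trans :: Expr -> Gen String
trans (Lit n)    = return (show n)
trans (Add a b)  = do
  a' <- trans a
  b' <- trans b
  return ("(" ++ a' ++ " + " ++ b' ++ ")")
trans (Square a) = do            -- realized with an imperative assignment
  a' <- trans a
  v  <- fresh
  emit (v ++ " := " ++ a' ++ " * " ++ a')
  return v

compileE :: Expr -> [String]
compileE e =
  let (res, (_, ss)) = runState (trans e) (0, [])
  in  ss ++ ["result = " ++ res]

main :: IO ()
main = mapM_ putStrLn (compileE (Add (Lit 1) (Square (Lit 3))))
```

Running main prints the assignment for the square followed by `result = (1 + v0)`, mirroring how Iter below must be realized by emitting a for-loop.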

## Translation of HighExp

Now that we have a general way to “re-express” programs, we need to define the function that translates HighExp to LowExp. The first cases just map HighExp constructs to their corresponding LowExp constructs:

    transHighExp :: HighExp a -> Prog LowExp (LowExp a)
    transHighExp (HVar v)   = return (LVar v)
    transHighExp (HLit a)   = return (LLit a)
    transHighExp (HAdd a b) = LAdd <$> transHighExp a <*> transHighExp b
    transHighExp (HMul a b) = LMul <$> transHighExp a <*> transHighExp b
    transHighExp (HNot a)   = LNot <$> transHighExp a
    transHighExp (HEq a b)  = LEq  <$> transHighExp a <*> transHighExp b

It gets more interesting when we get to Let. There is no corresponding construct in LowExp, so we have to realize it using imperative constructs:

    transHighExp (Let a body) = do
        r  <- initRef =<< transHighExp a
        a' <- CMD $ GetRef r
        transHighExp $ body $ valToExp a'

The shared value a must be passed to body by-value. This is achieved by initializing a reference with the value of a and immediately reading out that value. Note that we cannot use the smart constructor getRef here, because it would return an expression of type LowExp while body expects a HighExp argument. By using CMD $ GetRef r, we obtain a result of type Val, which can be translated to any expression type using valToExp.

Exercise: Change the translation of Let to use call-by-name semantics instead.

Translation of Iter is similar to that of Let in that it uses a reference for the local state. The iteration is realized by an imperative for-loop. In each iteration, the state variable r is read, then the next state is computed and written back to r. After the loop, the content of r is returned as the final state.

    transHighExp (Iter n s body) = do
        n' <- transHighExp n
        r  <- initRef =<< transHighExp s
        for n' $ \_ -> do
            sPrev <- CMD $ GetRef r
            sNext <- transHighExp $ body $ valToExp sPrev
            setRef r sNext
        getRef r
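As a sanity check on this realization, the following standalone sketch (using an IORef in place of the EDSL's references; names are illustrative) compares pure n-fold iteration with the loop-over-a-reference version:

```haskell
import Control.Monad (replicateM_)
import Data.IORef

-- Pure semantics of Iter: apply the step function n times
iterPure :: Int -> s -> (s -> s) -> s
iterPure n s f = iterate f s !! n

-- Imperative realization: a loop that updates a reference
iterLoop :: Int -> s -> (s -> s) -> IO s
iterLoop n s f = do
  r <- newIORef s
  replicateM_ n (modifyIORef r f)  -- the for-loop body
  readIORef r

main :: IO ()
main = do
  x <- iterLoop 10 1 (* 2)
  print (iterPure 10 (1 :: Int) (* 2), x)  -- prints (1024,1024)
```

Both versions compute m^n (here 2^10), which is exactly the equivalence the Iter translation relies on.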

Exercise: Change the translation of Iter by unrolling the body twice when the number of iterations is known to be a multiple of 2:

    transHighExp (Iter (HMul n (HLit 2)) s body) = do
        ...

All that is left now is to put the pieces together and define the compile function:

    translateHigh :: Prog HighExp a -> Prog LowExp a
    translateHigh = reexpress transHighExp

    compile :: Prog HighExp a -> String
    compile = lowToCode . translateHigh

To see that it works as expected, we compile the powerInput example:

    *EDSL_Compilation> putStrLn $ compile powerInput
    printStr "Please enter two numbers\n"
    printStr " > "
    v0 <- readInput
    printStr " > "
    v1 <- readInput
    printStr "Here's a fact: "
    writeOutput v0
    printStr "^"
    writeOutput v1
    printStr " = "
    r2 <- initRef 1
    for v3 < v1
        v4 <- getRef r2
        setRef r2 (v4 * v0)
    end for
    v5 <- getRef r2
    writeOutput v5
    printStr ".\n"

Note especially the for-loop resulting from the pure expression Iter n 1 (*m).

# Discussion

The simplicity of the presented technique can be seen by the fact that most of the code is reusable:

• Prog, interpret and reexpress are independent of the expression language.
• LowExp and its associated functions can be used as target for many different high-level languages.

The only (non-trivial) parts that are not reusable are HighExp and transHighExp. This shows that in fact very little work is needed to define an EDSL complete with code generation.

## Generalized implementation

In fact, most of the code presented in this article is available in generic libraries:

• Prog, interpret and reexpress are available in operational-alacarte (under the names Program, interpretWithMonad and reexpress).
• Variants of the instructions in CMD are available as separate composable types in the package imperative-edsl (which, in turn, depends on operational-alacarte).

As a proof of concept, we provide a supplementary file that implements the HighExp compiler using the generic libraries. The whole compiler, not counting imports and the definition of HighExp, is just 30 lines of code! What is more, it emits real runnable C code instead of the pseudo-code used in this article.

## Richer high-level languages

Part of the reason why transHighExp was so easy to define is that HighExp and LowExp support the same set of types. Things get more complicated if we want to extend HighExp with, for example, tuples.
Readers interested in how to implement more complex expressions using the presented technique are referred to the RAW-Feldspar source code (tag article-2016-03).

# Appendix: Code generation

Here we define the missing parts of the code generator introduced in Code generation.

## Code generation of instructions

    -- Code generation monad
    type Code = WriterT [Stmt] (State Unique)

    type Stmt   = String
    type Unique = Integer

    runCode :: Code a -> String
    runCode = unlines . flip evalState 0 . execWriterT . indent

    -- Emit a statement in the generated code
    stmt :: Stmt -> Code ()
    stmt s = tell [s]

    -- Generate a unique identifier
    unique :: String -> Code VarId
    unique base = do
        u <- get; put (u+1)
        return (base ++ show u)

    -- Generate a unique symbolic value
    freshVar :: Type a => Code (Val a)
    freshVar = ValComp <$> unique "v"

    -- Generate a unique reference
    freshRef :: Type a => Code (Ref a)
    freshRef = RefComp <$> unique "r"

    -- Modify a code generator by indenting the generated code
    indent :: Code a -> Code a
    indent = censor $ map ("    " ++)
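The same WriterT-over-State pattern can be tried out in isolation. The sketch below is a simplified, self-contained variant (indentation via censor is omitted; all names are illustrative) showing how statements are collected and names kept unique:

```haskell
import Control.Monad.State
import Control.Monad.Writer

-- WriterT collects statements; State supplies a fresh-name counter
type CodeM = WriterT [String] (State Integer)

runCodeM :: CodeM a -> [String]
runCodeM = flip evalState 0 . execWriterT

-- Generate a unique identifier from a base name
freshName :: String -> CodeM String
freshName base = do
  u <- get
  put (u + 1)
  return (base ++ show u)

-- Emit a statement in the generated code
stmtM :: String -> CodeM ()
stmtM s = tell [s]

demo :: CodeM ()
demo = do
  v <- freshName "v"
  stmtM (v ++ " <- readInput")
  w <- freshName "v"
  stmtM ("writeOutput (" ++ v ++ " + " ++ w ++ ")")

main :: IO ()
main = mapM_ putStrLn (runCodeM demo)
```

Running main prints two pseudo-code statements with distinct names v0 and v1, which is the essence of what the Code monad does for codeCMD.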

    -- Code generation of instructions
    codeCMD :: ShowExp exp => CMD exp a -> Code a
    codeCMD (InitRef a) = do
        r <- freshRef
        stmt $ unwords [show r, "<- initRef", showExp a]
        return r
    codeCMD (GetRef r) = do
        v <- freshVar
        stmt $ unwords [show v, "<- getRef", show r]
        return v
    codeCMD (SetRef r a) = stmt $ unwords ["setRef", show r, showExp a]
    codeCMD Read = do
        v <- freshVar
        stmt $ unwords [show v, "<- readInput"]
        return v
    codeCMD (Write a)    = stmt $ unwords ["writeOutput", showExp a]
    codeCMD (PrintStr s) = stmt $ unwords ["printStr", show s]
    codeCMD (For n body) = do
        i <- freshVar
        stmt $ unwords ["for", show i, "<", showExp n]
        indent $ code (body i)
        stmt "end for"

## Showing expressions

    instance Show (Val a) where show (ValComp a) = a
    instance Show (Ref a) where show (RefComp r) = r

    class ShowExp exp where
      showExp :: exp a -> String

    bracket :: String -> String
    bracket s = "(" ++ s ++ ")"

    instance ShowExp LowExp where
      showExp (LVar v)   = v
      showExp (LLit a)   = show a
      showExp (LAdd a b) = bracket (showExp a ++ " + " ++ showExp b)
      showExp (LMul a b) = bracket (showExp a ++ " * " ++ showExp b)
      showExp (LNot a)   = bracket ("not " ++ showExp a)
      showExp (LEq a b)  = bracket (showExp a ++ " == " ++ showExp b)

# Acknowledgement

This work is funded by the Swedish Foundation for Strategic Research, under grant RAWFP.

# References

Axelsson, Emil, Koen Claessen, Gergely Dévai, Zoltán Horváth, Karin Keijzer, Bo Lyckegård, Anders Persson, Mary Sheeran, Josef Svenningsson, and András Vajda. 2010. “Feldspar: A Domain Specific Language for Digital Signal Processing Algorithms.” In 8th IEEE/ACM International Conference on Formal Methods and Models for Codesign (MEMOCODE 2010), 169–78. IEEE. doi:10.1109/MEMCOD.2010.5558637.

Bracker, Jan, and Andy Gill. 2014. “Sunroof: A Monadic DSL for Generating JavaScript.” In Practical Aspects of Declarative Languages, edited by Matthew Flatt and Hai-Feng Guo, 8324:65–80. Lecture Notes in Computer Science. Springer International Publishing. doi:10.1007/978-3-319-04132-2_5.

Hickey, Patrick C., Lee Pike, Trevor Elliott, James Bielman, and John Launchbury. 2014. “Building Embedded Systems with Embedded DSLs.” In Proceedings of the 19th ACM SIGPLAN International Conference on Functional Programming, 3–9. ICFP ’14. New York, NY, USA: ACM. doi:10.1145/2628136.2628146.

Persson, Anders, Emil Axelsson, and Josef Svenningsson. 2012. “Generic Monadic Constructs for Embedded Languages.” In Implementation and Application of Functional Languages, edited by Andy Gill and Jurriaan Hage, 7257:85–99. Lecture Notes in Computer Science. Springer Berlin Heidelberg. doi:10.1007/978-3-642-34407-7_6.

Pfenning, Frank, and Conal Elliot. 1988. “Higher-Order Abstract Syntax.” In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, 199–208. PLDI ’88. New York, NY, USA: ACM. doi:10.1145/53990.54010.

Svenningsson, Josef David, and Bo Joel Svensson. 2013. “Simple and Compositional Reification of Monadic Embedded Languages.” In Proceedings of the 18th ACM SIGPLAN International Conference on Functional Programming, 299–304. ICFP ’13. New York, NY, USA: ACM. doi:10.1145/2500365.2500611.