This post is a report on some adventures in benchmarking and improving the FFT implementation included as an example in RAW-Feldspar. Feldspar is a Haskell-embedded domain-specific language for digital signal processing and embedded systems. It has a compiler that generates C code.

The main purpose of this experiment was to back up the claim that RAW-Feldspar can generate high-performance code. Additionally we will see some of the advantages provided by RAW-Feldspar when it comes to quickly exploring design alternatives. (In the remainder of the text we will refer to RAW-Feldspar as just Feldspar unless the context requires more specificity.)

In order to get a solid baseline for the benchmarks, we will compare against FFTW 3.x, which is among the fastest FFT implementations out there. We should note, however, that the point is not to compete with FFTW (which is unrealistic given the tremendous amount of work and sophistication that has gone into its development), but rather to get an absolute reference for comparison. If absolute performance is needed, FFTW can be called from Feldspar.

The following files are relevant to this post:

FFT.hs – the final version of the FFT
FFT_bench.hs – the benchmark code for Feldspar’s FFT
FFT_bench.c – the generated C code for the FFT benchmark for input size $2^16$ . Note that it’s a standalone program that has the FFT inlined into the main function.
FFTW.hs – a Feldspar wrapper for FFTW and code to benchmark it

In order to run the code, make sure to first install RAW-Feldspar:

cabal install raw-feldspar

FFT in Feldspar

Feldspar uses the iterative FFT described by Axelsson and Sheeran (2012). Readers who want to understand the implementation in detail are encouraged to read the paper. However, there are some differences due to the fact that the paper was written for a different version of Feldspar. In particular, the FFT in RAW-Feldspar has a more explicit treatment of memory.

The core of the Feldspar FFT is given by the function fftCore (here slightly simplified):

fftCore
    :: ( Pully ts (Data (Complex a))
       , Manifestable Run vec (Data (Complex a))
       , Finite vec
       , RealFloat a
       , PrimType a
       , PrimType (Complex a)
       )
    => Store (Data (Complex a))  -- ^ Double-buffered storage
    -> ts                        -- ^ Twiddle factors
    -> Data Length               -- ^ Number of stages
    -> vec                       -- ^ Vector to transform
    -> Run (DManifest (Complex a))

fftCore st ts n vec = do
    let step i = return . twids ts n i (length vec) . bfly i
    loopStore st (n,-1,Excl 0) (step . subtract 1) vec
      >>= bitRev st n

Let us start by explaining the type:

The first argument is a double-buffered store used to hold intermediate results. The FFT does not allocate any memory on its own.
The second argument is a vector of twiddle factors. Taking this vector as argument allows the twiddle factors to be pre-computed and even shared across multiple FFT runs.
The third argument – n – is the number of stages needed in the FFT. It must hold that $l = 2^n$ where $l$ is the length of the input vector. Taking n as an argument avoids recomputing this value when running multiple FFTs of the same size.
The fourth argument is the vector to be transformed.
The somewhat scary class context has to do with the fact that we want fftCore to work for different types of vectors (type variables ts and vec) and different types of elements (type variable a). The overloading of ts will come in handy below when we move between inlined and pre-computed twiddle factors.

The body of the function consists of a loop of n iterations followed by a bit-reversal permutation. The loop body is given by the step function, which comprises one “stage” of the FFT. Each stage is a butterfly (bfly) followed by twiddle factor multiplication (twids).

Here is a typical way to call fftCore for an input vector vec:

    do ...
       st <- newStore (length vec)
       n  <- shareM (ilog2 (length vec))
       ts <- manifestFresh $ Pull (twoTo (n-1)) (\i -> tw (twoTo n) i)
       fftCore st ts n vec
       ...

The first line allocates a new double-buffered store of the appropriate length, and the next line computes the number of stages using the integer base-2 logarithm. The third line computes the vector of twiddle factors.

Twiddle factor alternatives

The twiddle vector is constructed as a “pull” vector. In general Pull l f constructs a vector of length l where the element at index i is given by f i. The function manifestFresh writes this vector to a fresh array which is then passed as the ts argument to fftCore. This means that the twiddle factors are computed right away, and fftCore can access them cheaply by just looking them up in the array.

Interestingly, we can also choose to skip manifestFresh and pass the pull vector directly to fftCore (remember the overloading of the arguments of fftCore):

       ...
       ts <- return $ Pull (twoTo (n-1)) (\i -> tw (twoTo n) i)
       fftCore st ts n vec
       ...

Pull vectors, unless explicitly written to memory, will always fuse with the computation that uses them, so the above change has the effect of inlining the twiddle factor computation into the FFT loop. With just a simple change on one line we are thus able to move back and forth between a naive FFT and one with pre-computed twiddle factors.

Loop unrolling

The function fftCore has two main parts: the loop with bfly and twids, and the bitRev permutation. Both of these parts are structured as doubly-nested loops: the outer loop is the $log_2(l)$ stages, and each stage is a loop over the vector being transformed.

A common technique for increasing instruction-level parallelism is to unroll bodies of inner loops. This optimization is often performed automatically by the C compiler at high optimization levels, but there is no general guarantee that it will be applied in a given situation. For this reason, it is interesting to be able to express unrolling manually.

Feldspar provides a function for unrolling the computation described by a pull vector:

unroll :: (Pully vec a, MonadComp m)
    => Length  -- ^ Number of steps to unroll
    -> vec
    -> Push m a

The first argument is a Haskell integer (i.e. an integer known at compile time) that gives the number of steps to unroll. The result is a “push” vector. A push vector is similar to pull in that it can fuse with computations at the use site. However, in contrast to pull, push captures a strategy for writing its element to a destination (normally a memory array). For example, a push vector can use a loop that writes a number of elements in each iteration.

To unroll the inner loop of fftCore by, say, two steps, we simply insert unroll 2 after the twids stage:

    let step i = return . unroll 2 . twids ts n i (length vec) . bfly i

Similarly, we can insert unroll in the definition of bitRev.

Here is the inner loop of the generated code without unrolling (local variable declarations omitted for brevity):

  for (v10 = 0; v10 < 65536; v10++) {
      ...
      let11 = (bool) (v10 & 1 << (int32_t) (uint32_t) v9);
      let12 = store2[v10 ^ 1 << (int32_t) (uint32_t) v9];
      let13 = store2[v10];
      let14 = let11 ? let12 - let13 : let13 + let12;
      store3[v10] = let11 ? a5[(v10 & ~(4294967295 <<
                                        (int32_t) (uint32_t) v9)) <<
                               (int32_t) r4 - 1 -
                               (int32_t) (uint32_t) v9] *
          let14 : let14;
  }

And the same loop unrolled by two steps:

 for (v10 = 0; v10 < 65536; v10 = v10 + 2) {
      ...
      r11 = v10;
      let12 = (bool) (r11 & 1 << (int32_t) (uint32_t) v9);
      let13 = store2[r11 ^ 1 << (int32_t) (uint32_t) v9];
      let14 = store2[r11];
      let15 = let12 ? let13 - let14 : let14 + let13;
      store3[r11] = let12 ? a5[(r11 & ~(4294967295 <<
                                        (int32_t) (uint32_t) v9)) <<
                               (int32_t) r4 - 1 -
                               (int32_t) (uint32_t) v9] *
          let15 : let15;
      r16 = v10 + 1;
      let17 = (bool) (r16 & 1 << (int32_t) (uint32_t) v9);
      let18 = store2[r16 ^ 1 << (int32_t) (uint32_t) v9];
      let19 = store2[r16];
      let20 = let17 ? let18 - let19 : let19 + let18;
      store3[r16] = let17 ? a5[(r16 & ~(4294967295 <<
                                        (int32_t) (uint32_t) v9)) <<
                               (int32_t) r4 - 1 -
                               (int32_t) (uint32_t) v9] *
          let20 : let20;
  }

Evaluation

We have seen the interface to Feldspar’s FFT and how to explore different options regarding twiddle pre-computation and loop unrolling. The next step is to benchmark the FFT to see how the design decisions affect performance and how the FFT is doing compared to state of the art.

We will roughly follow the benchmark methodology advocated for FFTW. In particular, we present the fictive “MFLOP” count in order to get a comparable efficiency metric independent of input size.

The benchmarks were run on a 3.00GHz Intel Core 2 Duo CPU (E8400) with 6MB L2 cache, using GCC 5.4.0 as compiler. Optimization level -O3 was used.

FFTW

As a baseline for comparison, we first benchmark FFTW version 3.3.4. The benchmark is implemented as a Feldspar program that calls out to the fftw3 C library. The initialization or “planning” phase is done before the measurements start, as advocated by the FFTW benchmark methodology. Note that planning in FFTW includes pre-computing twiddle factors (Frigo and Johnson 2005).

The result is an efficiency ranging from around 8000 MFLOPS to 3500 MFLOPS for input sizes from $2^4$ to $2^17$ (see plot below). The result is more or less consistent with the results presented here.

Feldspar FFT

The program for benchmarking the Feldspar FFT uses the same principle as the one for FFTW. We pre-compute the twiddle factors and the number of stages, and this is done before the measurement starts.

The results are shown in the diagram below. The Feldspar FFT runs steadily at around 600 MFLOPS without unrolling and at an average of around 700 MFLOPS with unrolling. We chose to unroll the main loop by four steps and the bitRev loop by 8 steps. The average speed gain from unrolling is 21%.

The plot does not include the measurements from inlining the twiddle factors. We simply note that inlining makes the FFT more than 7 times slower.

Conclusion

Performance

Our benchmarks showed several interesting things:

The Feldspar FFT runs significantly slower than FFTW: from 1/9 of the speed for smaller inputs to 1/5 of the speed for larger inputs.
Pre-computing the twiddle factors has a major impact on performance.
Unrolling the inner loops has a noticeable but small impact on performance in the Feldspar FFT.

It is hard to know how big a difference compared to FFTW is reasonable for a relatively straightforward FFT implementation such as the one in Feldspar. Our working assumption is that the implementations included in FFTW’s own benchmark (see e.g. the results for 3.0 GHz Intel Core Duo) are all reasonable in the sense that they would not compare against naive toy implementations. For example Mix-FFT which runs at around 1/6 of the speed of FFTW is claimed to be a “very fast … FFT routine” on its home page.

Seeing that the Feldspar implementation compares rather well with Mix-FFT and other implementations in the FFTW benchmark, we make the hand-wavy conclusion that Feldspar and its FFT indeed perform reasonably well.

We have not done any evaluation of memory use. But manual inspection of the C code generated from the Feldspar FFT shows no signs of excessive memory: three arrays are allocated (all on the stack) – one for the input, one for intermediate results and one for the twiddle factors.

Design exploration

One advantage of using Feldspar compared to programming directly in C is the ease by which alternative implementations can be explored. We saw two examples of such exploration in this post:

Inlined vs. pre-computed twiddle factors
Unrolling of inner loops

Both of these decisions can be altered by minimal changes to the existing code. In contrast, the changes would require much more work in C. Furthermore, the changes done in Feldspar (switching between return and manifestFresh, and inserting unroll) are always safe¹, whereas the corresponding changes in C are easy to get wrong.

References

Axelsson, Emil, and Mary Sheeran. 2012. “Feldspar: Application and Implementation.” In Central European Functional Programming School: 4th Summer School, CEFP 2011, Budapest, Hungary, June 14-24, 2011, Revised Selected Papers, 402–39. Springer Berlin Heidelberg. doi:10.1007/978-3-642-32096-5_8.

Frigo, Matteo, and Steven G. Johnson. 2005. “The Design and Implementation of FFTW3.” In Proceedings of the IEEE, 93:216–31. 2. doi:10.1109/JPROC.2004.840301.

Technically, unroll s is only safe to insert for vectors whose length is a multiple of s. Feldspar automatically emits an assert statement to check this requirement. It is possible to turn off such checks, and this is what we did for our benchmarks; see the flag compilerAssertions = select [] in FFT_bench.hs.↩

Functional Discoveries

Wednesday, November 30, 2016

Benchmarking FFT in Feldspar