============================
REDUCERON MEMO 33
Fourth quarterly review
Matthew N, 20 September 2009
============================

Since the second quarterly review (Memo 21), we have enhanced the FPGA
implementation in three ways:

  1. The arity-limit has been lifted from 4 to 7.

  2. The constraint that case expressions can only appear in the
     spines of function bodies has been lifted, allowing much greater
     opportunity for inlining.

  3. A two-space copying garbage collector has been implemented.  A
     one-space collector was considered (Memo 28), which also takes
     time linear in the size of the live heap, but has not been
     implemented.

We made following two performance improvements, discussed in Memo 25:

  1. Improved update-avoidance by tagging each application with a bit
     sating whether or not it is a normal form.  Applications which
     are already in normal form do not need to be updated.  The result
     is that, on average, 87% of updates are avoided instead of 60%.

  2. Inlining all non-recursive functions whose bodies contain just a
     single application gives an average 18% speed-up.  The reason for
     this speed-up is discussed in Memos 2 and 25.

All of the above modifications were made without lowering the maximum
clock frequency of the design.

We made the following measurements.

  1. We carried out some parameter-tuning experiments (Memo 25),
     using the machine semantics (implemented in Haskell).  We
     measured the effect of max application length, max spine length,
     and max instantiations per cycle on execution time.  We found
     that the values 4, 6, and 2 respectively were good choices, but
     that there was a mild-to-moderate benefit in increasing max
     instantiations per cycle to 3.

  2. We measured the performance of the FPGA implementation (Memo 30)
     and found that, in terms of number of clock-cycles, it is over
     four times faster than Matthew's thesis implementation.  On the new
     FPGA, both the new and the thesis implementations clock around
     the same frequency of 120MHz.

  3. We profiled the FPGA implementation, and found that the average
     time taken by garbage collector is now 2%, compared to
     11% in the thesis implementation (Memo 30).  Explanations for
     this include spinelessness and a larger heap.  We also found that
     on average 22% of time is now taken performing swaps and
     primitive reductions.

Consequently, we have decided not to take the concurrent garbage
collector route, at least for now.  On the other hand, speculative
evaluation of primitive redexes now looks very worthwhile, and we
sketched a possible design in Memo 31.  Once implemented, we expect
the percentage time spent doing function unfolding to increase
significantly, in which case increasing the max instantiation limit
will become a much bigger win.

We have implemented an efficient emulator for the Reduceron in C,
allowing us to quickly explore ideas.

Our very own Lava implementation has been polished and released on
Hackage.  There is haddock documentation, a range of examples, and a
feature overview (Memo 23).  One notable improvement is that the
Recipe library for behavioural description has been rewritten.  It is
more concise, and supports static analyses (including an optimiser)
and shared procedure calls.

We documented the design of the Caching Octostack in Memo 27, an
important part of the FPGA implementation.  A preliminary experiment
showed that the caching version has a significantly lower logic delay
than the non-caching version, as expected.

We wrote a compiler from F-lite to C, allowing F-lite programs to
be run on an FPGA soft-core, which could be useful for comparative
purposes.

We added several more F-lite programs, including CountDown, Cichelli,
Mate, and KnuthBendix.

We started thinking about the design of a twin Reduceron (Memo 32).

What next?
----------

Lots to do:

  1. Implement "speculative evaluation of primitive redexes", first on
     emulator, then FPGA.

  2. Write a technical report, detailing the new Reduceron.

  3. Prepare talk for Fun in the Afternoon, Cambridge, 26 November,
     assuming my talk offer is accepted.

  4. Improve compiler, by more adventurous inlining and
     body-splitting, and maybe even incorporate Jason's supercompiler
     pass.

  5. Prototype a twin Reduceron.