=========================
REDUCERON MEMO 14
First quarterly review
Matthew N, 9 January 2009
=========================

Three months after starting on the Reduceron project, this memo
summarises our main results and thoughts to date.

1. F-lite
---------

Quite early on, Matthew decided to re-implement the Reduceron compiler
from scratch.  There were a few motivations, the main two being (1)
compiling full Haskell'98, as we previously attempted, is a lot of
work and not necessary for the research, and (2) we wanted a simple,
standalone compiler that does not rely on a large and complex codebase
such as Yhc.  So we developed F-lite, a simple functional language,
and implemented a compiler for it.  (See Memo 9 for details.)

2. Inlining
-----------

One of the first problems we considered was that of how to widen
function bodies, to take advantage of the Reduceron's fast
block-instantiation capability.

We observed that functions which perform case analysis are often split
into several smaller functions by the case-elimination algorithm, one
new function being introduced for each case alternative.  This
splitting has the effect of turning directly-recursive functions
into sets of mutually-recursive ones.

We found a very simple yet effective inlining rule that is to some
extent able to reunite the split-up functions, and make them directly
recursive again.  Somewhat counter-intuitively, the trick was to
inline the case analysis function into the case alternatives, rather
than the other way around!  (See Memo 2 for details.)

Subsequently, Colin observed that nested case analysis results in a
set of mutually-recursive functions being introduced, which are not
made directly-recursive again after applying the inlining rule [Memo 5].

After translating our benchmark programs to F-lite, and adding
some new ones, we performed some measurements.  We found that our
simple inlining rule gave an 9%-20% improvement on all nine programs,
a mean speed-up of 13% [Memo 10].

3. Arity-raising
----------------

Colin observed the potential of eta-expansion to increase
opportunities for inlining [Memo 3].  We implemented this
transformation, but after benchmarking we soon realised that it
doesn't always preserve sharing, causing a big slow-down in some
programs.  However, in some programs there is indeed a good speed-up
[Memo 10], so Matthew started to think about using a sharing analysis
to determine cases in which arity-raising is safe [Memo 11].  But this
line of work has been suspended due to more pressing matters discussed
below.

4. Single-cycle reads
---------------------

In the current Reduceron, reading the heap takes two clock-cycles
because of the delay introduced by the cascading multiplexor and the
rotation logic used to implement quad-word memories.  Matthew proposed
to abandon quad-word memories in favour of quadrupling the word-size,
and also to cascade block RAMs differently.  Together these ideas will
allow single-cycle heap-reads, and save a lot of logic.  Single-cycle
reads will in turn reduce the time taken to unwind and unfold by one
cycle each.  Another benefit of having a fixed application size is
that updating can be easily done by overwriting the root of the redex
rather than introducing indirections.  The downside is that there will
be some wasted capacity, both in terms of memory and processor
utilisation, when the lengths of applications are not multiples of
four.  (See Memo 7 for details.)

5. Parallel stack accesses
--------------------------

Also in Memo 7, we recalled that the way in which the Reduceron
implements parallel access to the top eight stack elements has the
limitation that functions cannot take more than eight arguments.  We
proposed to overcome the problem using an eight-port stack, where any
eight stack elements can be read in parallel, not just the top eight
[Memo 7].  After some more thought, we realised this approach has a
clock-cycle overhead associated with it, putting the original approach
in better light.  To overcome the arity-limitation, Matthew modified
Dijkstra's abstraction algorithm to transform functions into ones that
take no more than eight arguments.  (See Memo 12 for details.)

6. Spinelessness
----------------

In a spineless machine, the spine of a function body is written only
to the heap.  In the Reduceron, the spine is written to the heap and
the stack, but time is saved by performing these two writes in
parallel.  Nevertheless, there are two potential benefits in making
the Reduceron a spineless machine.  

First of all, not having to write the spine to the heap and not having
to update the heap on every application frees up the heap, allowing
the information needed for the next reduction to be fetched one
clock-cycle earlier.  But the cost of updating must be taken into
account as a distinct process.  Depending on how often updates are
avoided (see next section), there will be an overall saving of 0 to 1
cycles per unfold.  Combine this with single-cycle reads and we open
up the possibility of single-cycle unfolding (whenever the update can
be avoided, and the body contains two applications or less).  In the
current Reduceron, unfolding always takes at least three cycles.

Furthermore, since a spineless machine builds up a stack of updates,
it may be possible to perform two updates per cycle, exploiting the
dual-port heap.

The second benefit is that a spineless machine does not need to
introduce indirections in any circumstances.  This is attractive
because long indirection chains are sometimes built by the current
Reduceron [Memo 10].

7. Unshared applications
------------------------

To implement an efficient spineless machine, Peyton Jones proposes to
distinguish between two types of applications: unshared and
possibly-shared.  Updating can then be avoided after evaluating an
unshared application to head normal form.

Determining which applications are unshared can be done by statically
or dynamically.  The latter, which Peyton Jones calls "dashing" is
more precise (less conservative), but Peyton Jones writes "we strongly
suspect that the cost of dashing may greatly outweigh the advantage of
precision".  However, in hardware, we strongly suspect that dashing
has no (time) cost.

This technique could also be useful in the current Reduceron (not
spineless), for example to reduce indirection chains.

8. Compiling case expressions
-----------------------------

Matthew proposed an alternative way to compile case expressions.  The
current Reduceron may require time to (1) write closures for each case
alternative onto the stack --- possibly several cycles when there are
many case alternatives, (2) select one of the case alternatives --- 3
cycles, and (3) fetch one of these closures when selected --- 2 cycles.
The new method opens up the possibility for zero-cycle case-alternative
selection, and removes each of the above overheads.  The downside is
that some implementation simplicity is sacrificed.  (See Memo 13 for
details; Memo 6 presents an earlier view of a similar idea.)

9. Speculative evaluation
-------------------------

Colin proposes to do speculative evaluation of primitive redexes
during instantiation of function bodies [Memo 8].  For example, if we
are to instantiate 'x+y' in the body of 'f' where 'x' and 'y' are
already-evaluated arguments of 'f', then we can perform the addition
speculatively.  Colin thinks it will be best to use a static analysis
to discover such redexes at compile-time, whereas Matthew thinks that
it will be cheap to discover such redexes at run-time.  Either way,
the idea looks promising.

10. Garbage collection
----------------------

Garbage collection is one area in particular that needs more thought.
It would be nice to do garbage collection in parallel with the
mutator.  With this aim, Colin is interested in on-the-fly collection
(coarse-grained parallelism), whereas Matthew is more interested in a
hardware implementation of reference-counting (fine-grained
parallelism).