===============================
REDUCERON MEMO 31
Design proposal for speculative
evaluation of primitive redexes
Matthew N, 16 September 2009
===============================

Almost a year ago (in Memo 8) Colin proposed the idea of reducing
primitive redexes during instantiation of function bodies.  This idea
has become very attractive, now that, after improvements to the
Reduceron, primitive reductions account for large portions of the
runtimes of many programs, even "symbolic-intensive" ones.

There appear to be two main ways to do this: by static or dynamic
analysis.  In the static approach, primitive redexes are identified at
compile time.  The analysis is conservative and needs to be defined,
although Colin does say "I think this analysis should be fairly
simple".

In the dynamic approach, the analysis really is simple: just look at
the arguments in a primitive application; if they are evaluated apply
the primitive; if not, build the application on the heap as is
currently done.

One possible advantage of the static approach is that the circuitry
required to implement it will have a smaller logic delay than that for
the dynamic approach.  This is because all speculative evaluations are
decided at compile time - no runtime decisions are made.  However, it
is not clear how significant this saving is: it is perhaps just one
level of logic in the form of a multiplexer, and what's more, it is
not even on the current critical path.

Therefore, I propose we first try the simpler dynamic approach.  Below
I outline a fairly conservative design, but nevertheless, one that
should be quite effective.

Syntax
------

In the Reduceron, applications are sequences of atoms.

  data App = APP Bool [Atom]

The boolean flag states whether or not the application is in normal
form, and is used for update-avoidance.  Atoms are defined as follows.

  data Atom =
      INT Int           -- Primitive integer
    | ARG Shared Int    -- Argument index, can only occur in code memory 
    | PTR Shared Int    -- Pointer to an application
    | CON Arity Index   -- Constructor id
    | FUN Arity Int     -- Function id
    | PRI Arity PrimOp  -- Primitive id

  type Shared = Bool

  type Arity = Int

  type PrimOp = String

Note that ARG atoms can only appear in code memory, not on the heap.

To support evaluation of primitive applications during instantiation,
I propose to add a special register file to the machine, for storing
the results of speculative evaluations.  Other applications in the
body of the function may then refer to these results as many times as
required.  Remember, there may be many references to the result of a
primitive application, if it is let-bound.

  type RegId = Int

  data Atom = ... | REG RegId

Like ARG atoms, REG atoms can only appear in code memory.

Furthermore, I propose to add an optional register field to
applications.

  data App = APP (Maybe RegId) Bool [Atom]

Note that applications can appear both in code memory and in the heap.
However, when in the heap, the new register field always contains
Nothing.  When in code memory, the new register field *may* specify a
register that should be updated with the result of an *attempted*
on-the-fly speculative reduction of the application.

Semantics
---------

We must now modify the machine to deal with this new kind of
application.  As it can only occur in code memory, we need only
decide how to instantiate it.  There are two cases.

  1. If the application is that of a primitive to some arguments and
     all the arguments are fully-evaluated integers, then apply the
     primitive and store the resulting atom in the stated register.

  2. Otherwise, the application should be instantiated onto the heap
     as normal, and the stated register assigned to an atom
     that points to this application.

We must also define the semantics of the register file.  There are two
choices.

  1. Assignments to a register take one clock-cycle to come into
     effect.

  2. Assignments to a register come into effect immediately.

Option (2) allows applications being instantiated in the same clock
cycle as a primitive redex to refer to the result of the primitive
redex.  However, option (2) has a much more significant logic delay
than option (1), and I suspect it is not feasible from a clock-rate
point of view.

When a primitive redex is reduced speculatively, the result is
unboxed, that is, inserted directly in place of any reference to it.
Consequently:

  1. No unwinding step is required to fetch the result if is
     later required.

  2. A number of swaps may be avoided, which are otherwise be necessary
     to force evaluation of the primitive's arguments.

  3. There is increased scope for further speculative evaluation: the
     more unboxed values in the graph, the greater the opportunity for
     further speculative evaluations - it is a self-feeding process.

  4. The order in which speculative evaluations are attempted should
     be carefully decided at compile-time, to maximise the
     self-feeding effect.

Discussion
----------

The is all pretty much as Colin originally proposed in Memo 8, with
the register file taking care of primitive bindings.  One difference
is that the decision to perform speculative evaluation is decided at
runtime.  A few new details are:

  1. A finite register file is required - its maximum capacity is
     another machine parameter, like, for example, maximum application
     length.

  2. The results of primitive redexes can only be accessed *after* the
     clock-cycle in which they are reduced.

  3. If the maximum number of applications that can be instantiated
     per cycle is limited to 2, then so is the maximum number of
     speculative reductions.

  4. When a primitive redex is detected, the heap bandwidth allocated
     to instantiate that redex is not needed, and hence wasted.

On the positive side, when a primitive redex is detected we are likely
to avoid 2 swaps, 2 unwinds, and a primitive reduction - that's five
clock cycles - not to mention a possible avoided update.

Some open issues:

  1. What percentage of speculative reductions are in fact needed
     subsequently?

  2. Compiler should identify applications which cannot possibly be
     primitive redexes, in order to avoid possible overhead.  An
     example such application would be "1 + sum xs".

  3. Feedback for programmers: modify the emulator to tell the
     programmer which primitive applications could not be performed
     speculative (and how many there were).

  4. What about speculative evaluation of semi-strict operators such
     (&&) and (||)?