==============================
REDUCERON MEMO 25
Bounded template instantiation
Matthew N, 24 June 2009
==============================

This memo measures the effect of maximum spine length, maximum
application length, and maximum instantiations-per-cycle on reduction
speed and heap wastage in the Reduceron.  The precise semantics of the
Reduceron used to perform the measurements is not (yet) presented
here.

Templates
---------

In spineless template instantiation, a template is essentially a tuple
of the form

  (Pop, Push, App_1, ..., App_n)

The 'Pop' field states how many elements should be popped off the
stack.  The remaining fields are all applications, which may refer to
each other relatively (but may not refer to 'Push').  In other words,
together these applications represent a term graph rooted by the
'Push' application.  The difference between 'Push' and 'App_1' to
'App_n' is that 'Push' is instantiated and pushed onto the stack,
whereas 'App_1' to 'App_n' are instantiated and appended to the heap.

EXAMPLE 1. The following combinator handles the 'Cons' case in the
definition of 'map'.

  mapCons x xs f = Cons (f x) (map f xs)

As a template, it could be represented as

  mapCons = (3, Cons +0 +1, @3 @1, map @3 @2)

where '+i' is a pointer to the ith application beyond the current heap
pointer, and '@i' is the ith element from the top of the stack.

Our aim in the Reduceron is to instantiate whole templates in a single
clock-cycle.  On an FPGA, this is made possible by parallel RAMs which
can be built with more-or-less arbitrary bus widths.  But these bus
widths must be decided statically, before the FPGA is programmed.  In
particular, the following values must be bounded:

  1. The maximum length of the 'Push' application, referred to as
     "maximum spine length".  For example, the length of the
     application 'Cons (f x) (map f xs)' is 3.

  2. The maximum length of the applications in 'App_1 ... App_n',
     referred to as "maximum application length".  This need not be
     the same as the maximum spine length.

  3. The number of applications in 'App_1 ... App_n'.  This number
     (plus one to account for the spine application, if it has length
     larger than one) is referred to as "maximum instantiations per cycle".

No matter how we bound the length and number of applications, the
programmer will always be able to write a program that exceeds them.
The bounds should be relaxed enough to allow high speed reduction, but
constrained enough to avoid poor memory and hardware utilisation.
Below, we try to find "good" bounds by taking a quantitative approach.

Application splitting
---------------------

It is straightforward to split long applications in order to meet
maximum application length bound.  To illustrate, simply observe that
the expression 'f a b c d' which contains one application, is
equiavalent, for example, to '(f a b) c d' which contains two
applications.  The maximum spine length bound can also be enforced
using this principle.

Template splitting
------------------

Template splitting can be used when the bound on the number of
applications is exceeded.  One big template is split into a number of
smaller ones.

EXAMPLE 2.  If the maximum number of applications in a
template is 1, then 'mapCons' could be represted as follows.

  mapCons  = (0, mapCons2, @3 @1)
  mapCons2 = (3, Cons -1 +0, map @3 @2)

Here, 'mapCons' instantiates the '@3 @1' application on the heap and
replaces the top stack element with 'mapCons2', which instantiates the
rest.  Now '-1', not '+0', is used to refer to the application '@3 @1',
because the heap pointer has been incremented as a result of
instantiating 'mapCons'.

Measurements 1: two improvements to Memo 19
-------------------------------------------

First, we measure the impact of two improvements to the Reduceron
since Memo 19.  The first improvement is to take into account normal
forms in the update avoidance scheme; every application is tagged with
an extra bit stating whether or not it is a normal form, and when a
normal form is unwound, an update marker is not pushed onto the update
stack.  The second improvement is to inline all non-recursive
functions whose bodies contain just a single application, which often
avoids a function call in recursive case-analysing functions (recall
Memo 2).

PROGRAM     % UPDATES AVOIDED       % TIME SAVED DUE TO INLINING
            -----------------       ----------------------------
Adjoxo                     97                                8.3
Clausify                   85                               15.5
MSS                        96                               12.3
OrdList                    89                               29.8
Parts                      82                               17.9
PermSort                   87                               22.6
Queens                     95                               15.1
Queens2                    66                               21.8
Sudoku                     71                               16.6
While                      99                                6.8
            -----------------       ----------------------------
AVERAGE                    87                               17.8

According to Memo 15 (that is, before taking into account normal
forms), 60% of updates were avoided by the update-avoidance scheme.

Measurements 2: effect of bounds on reduction speed
---------------------------------------------------

The following tables contain "speed-up factors" that result from using
various values for the different bounds.  The speed-up is relative to
the time take by the Reduceron with a maximum spine length of 2, a
maximum application length of 2, a maximum number of instantiations
per cycle of 1.

        SPEED-UP      |        APPLICATION LENGTH
   (SPINE LENGTH = 3) |     2     3     4     5     6
  --------------------+------------------------------
                    1 |  1.20  1.58  1.65  1.71  1.72
   INSTANTIATIONS   2 |  1.45  1.87  1.93  1.99  2.00
      PER CYCLE     3 |  1.56  1.98  2.05  2.10  2.10
                    4 |  1.62  1.99  2.06  2.11  2.11
                    5 |  1.65  2.01  2.07  2.12  2.12

        SPEED-UP      |        APPLICATION LENGTH
   (SPINE LENGTH = 4) |     2     3     4     5     6
  --------------------+------------------------------
                    1 |  1.35  1.78  1.82  1.88  1.88
   INSTANTIATIONS   2 |  1.64  2.11  2.15  2.20  2.20
      PER CYCLE     3 |  1.76  2.23  2.27  2.33  2.33
                    4 |  1.82  2.25  2.30  2.34  2.34
                    5 |  1.84  2.26  2.30  2.36  2.36

        SPEED-UP      |        APPLICATION LENGTH
   (SPINE LENGTH = 5) |     2     3     4     5     6
  --------------------+------------------------------
                    1 |  1.45  1.86  1.89  1.94  1.94
   INSTANTIATIONS   2 |  1.73  2.17  2.20  2.25  2.25
      PER CYCLE     3 |  1.84  2.30  2.32  2.38  2.38
                    4 |  1.91  2.33  2.35  2.41  2.41
                    5 |  1.94  2.34  2.37  2.42  2.42

        SPEED-UP      |        APPLICATION LENGTH
   (SPINE LENGTH = 6) |     2     3     4     5     6
  --------------------+------------------------------
                    1 |  1.50  1.93  1.96  2.03  2.03
   INSTANTIATIONS   2 |  1.79  2.27  2.30  2.37  2.37
      PER CYCLE     3 |  1.92  2.41  2.44  2.50  2.50
                    4 |  2.00  2.44  2.47  2.53  2.53
                    5 |  2.03  2.45  2.48  2.54  2.54

        SPEED-UP      |        APPLICATION LENGTH
   (SPINE LENGTH = 7) |     2     3     4     5     6
  --------------------+------------------------------
                    1 |  1.52  1.96  2.00  2.06  2.06
   INSTANTIATIONS   2 |  1.83  2.31  2.35  2.41  2.41
      PER CYCLE     3 |  1.96  2.46  2.49  2.56  2.56
                    4 |  2.03  2.49  2.53  2.58  2.58
                    5 |  2.07  2.50  2.54  2.60  2.60


Measurements 3: effect of bounds on heap wastage
------------------------------------------------

The following table contains "deflation factors", that is, the number
of times smaller the heap.  The factor is relative to the heap
consumption (in words) of the Reduceron with a maximum spine length of
2, a maximum application length of 2, a maximum number of
instantiations per cycle of 1.

    DEFLATION FACTOR  |        APPLICATION LENGTH
                      |     2     3     4     5     6
  --------------------+------------------------------
                    3 |  1.31  1.30  1.06  0.91  0.76
                    4 |  1.58  1.63  1.29  1.01  0.91
    SPINE-LENGTH    5 |  1.76  1.78  1.38  1.17  0.98
                    6 |  1.90  1.95  1.51  1.30  1.09
                    7 |  1.97  2.04  1.60  1.37  1.14

Conclusions
-----------

Bear in mind that although the speed-up factors are quite small, they
are relative to something which is already a fair bit faster (at least
a factor of 2 faster) than the old Reduceron, which is at the moment
our main reference point. 

The speed-up increases quite readily with spine-length; a spine-length
of 6 has a comfortable advantage over 3, 4, and 5, but a spine-length
of 7 only has a small advantage over 6.

There is not a big win in having applications of lengths larger than
3.  Furthermore, applications of length 3 result in very good memory
utilisation.

There is reasonable win in having 3 instantiations per cycle rather
than 2, but there is only a small win in moving to 4 or 5.
Implenenting 2 is easy due to dual-port RAMs; implementing 3 is
certainly possible, but may incur memory wasting due to alignment
problems.

Appendix: note about the arity bound
------------------------------------

Another important bound, not considered above, is the maxmimum arity
of a function.  This note simply points out that a similar technique
to template splitting *can* be used when the arity bound is exceeded.

EXAMPLE 3.  If the maximum number of applications in a template is 2
and the maximum arity is 2, the template 'f'

  f = (4, g +0 +1 +2, h1 @1 @3, h1 @2 @3, h2 @4)

can be split as follows.

  f  = (2, f1, h1 @1, h1 @2)
  f1 = (0, f2, -2 @1, -1 @1)
  f2 = (2, g -2 -1 +0, h2 @2)

Here, 'f' deals with the two references to arguments 1 and 2, and pops
them off the stack; 'f1' and 'f2' deal with the remaining references
to arguments 3 and 4.

In the worst case, resorting to binary applications allows any arity
bound larger than 0 to be handled by template splitting.

However, there is a slight complication when using this method to do
with updating.  Normal forms may still arise which are longer that the
maximum application length, in which they must be split into chunks,
and written to the heap chunk-wise.  The only difficulty here is the
increased complexity of update operation.

An alternative way to deal with the arity bound is to use an
abstraction algorithm - see Memo 12.