===========================================================
REDUCERON MEMO 34
Expected clock-tick run-times for flite-reduceron programs:
a discussion based on micro-examples
Version 2 (note revised title)
Colin Runciman, 1 October 2009
===========================================================

As we examine the performance of the Reduceron for benchmark programs we
naturally want to include substantial examples of symbolic computing,
such as the Knuth-Bendix completion program that I recently added to
our collection of test programs.

However, when running a complex program, it is hard to say how many
clock ticks we *expect* a computation to take.  We may be pleased that it
takes fewer clock-ticks than before, or now runs faster on the Reduceron
than when compiled using ghc! Yet we may have no clear idea whether the
run-time in clock-ticks is now as good as we could reasonably hope for,
or is still some way short of that ideal.

Another class of benchmark tests, of complementary value, uses
micro-examples of simple computational tasks.  Because the task is
simple, we can more readily estimate how many clock cycles we think
it should need.  Or, at least, we can reason about plausible lower and
upper bounds for the number of cycles.

Examples where actual computation time is well above the estimated lower
bound may motivate and suggest improvements.  And any where it is well
above the estimated upper bound even more so!

The rest of the memo discusses three examples, then draws brief
conclusions.  As we are currently very interested in the potential impact
of primitive redex speculation, it will be considered as we go along.
As "primitive redex speculation" is a weighty phrase to use repeatedly,
I shall instead use the acronym PRS.

Example 1 (list length)
-----------------------

Consider this program:

{
length  =  lengthPlus 0 ;
 
lengthPlus a Nil          =  a ;
lengthPlus a (Cons x xs)  =  lengthPlus ((+) a 1) xs ;

main  =  emitInt (length "quick brown fox jumps over the lazy dog") 0 ;
}

How might the flite-reduceron programmer reason about the expected
performance of this program?

(1) The application of length will be in-lined in main.
(2) Reducing to the correct case alternative for the 2nd argument of
    lengthPlus will take just one cycle as even the Cons alternative
    contains just two small applications.
(3) The primitive applications ((+) a 1) will each require 3 cycles
    (1 unwind, 1 swap, 1 prim). 
(4) There are then no off-spine applications, so the recursive call
    will take just one cycle.
(5) A few extra cycles are needed to reduce main and emitInt.
(6) So if the string contains N characters the computation should
    take something like 5N+few cycles.
(7) The test string contains 39 characters, so perhaps the program will
    run in a bit more than 200 clock-cycles (or 1 microsecond on the
    current Xilinx implementation).

Indeed, if we compile to reduceron code by

$ Flite -i1 -r6:4:2:1 stringlength.hs > stringlength.red

and run stringlength.red using the C emulator, the report is as
follows.

==== EXECUTION REPORT ====
Result      =            0
Ticks       =          219
Swap        =          18%
Prim        =          18%
Unwind      =          35%
Update      =           0%
Apply       =          27%
==========================

The combinator code (with the big string constant in main elided) is:

length                    =  lengthPlus 0;
lengthPlus v0 v1          =  v1 [lengthPlus#1,lengthPlus#2] v0;
lengthPlus#1 v0 v1 v2 v3  =  v1 [lengthPlus#1,lengthPlus#2] ((+) v3 1);
lengthPlus#2 v0 v1        =  v1;
main                      =  emitInt (lengthPlus 0 (...))

The length definition has indeed been in-lined and it is dead code.
The core computation using lengthPlus needs 5 cycles per recursive step,
much as expected.  it takes 1 cycle to examine the case subject, 1 to
look up and apply the appropriate alternative (lengthPlus#1 every time
except the last), and 3 more cycles for the primitive (+) application
it contains (1 unwind, 1 swap, 1 prim).  The additional 20 cycles or
so are needed to apply main, including heap instantiation of the large
constant list of characters.

When PRS is implemented, we expect the cost per (+) application to fall
from 3 cycles to 1 cycle.  So the program will run in about 140 cycles,
about 37% less -- an attractive speed-up!

Example 2 (string comparison)
-----------------------------

Here's another program:

{
con True  x  =  x ;
con False x  =  False ;

equalStrings Nil         Nil          =  True ;
equalStrings Nil         (Cons y ys)  =  False ;
equalStrings (Cons x xs) Nil          =  False ;
equalStrings (Cons x xs) (Cons y ys)  =  con ((==) x y) (equalStrings xs ys) ;

bit False  =  0 ;
bit True   =  1 ;

main  =  emitInt (bit (equalStrings "quick brown fox jumps over the lazy dog"
                                    "quick brown fox jumps over the lazy don")) 0 ;
}

This one is a bit more tricky, but here's some outline reasoning:

* The applications of con and bit will be in-lined.
* Examining the two equalStrings arguments, each as a case subject, will take
  2 clock-cycles.
* In the double-Cons recursive case, the reduction of the (==) application
  without PRS will require 4 cycles (including 2 swaps because neither
  argument is syntactically constant).
* It will take another 1 cycle to examine the subject of the case inlined
  from con.
* Instantiating the recursive call alternative takes another 1 cycle.
* So it's something like N*8 + constant, where N is around 40 and the
  constant is roughly double the previous example (as there are two large
  string constants rather than one).
* In all, we might expect the program to run in around 360 clock-cycles.

If we compile to Reduceron code and run it using redemu, the report
shows that it takes rather longer than estimated:

==== EXECUTION REPORT ====
Result      =            0
Ticks       =          471
Swap        =          16%
Prim        =           8%
Unwind      =          32%
Update      =           0%
Apply       =          41%
==========================

It seems that there are 3 more clock cycles needed for each recursive
step that were not included in the reasoning for estimate.  The current
combinator code for this program, with both string constants in main
elided, is as follows.

con v0 v1                      =  v0 [con#1,con#2] v1;
con#1 v0 v1                    =  False;
con#2 v0 v1                    =  v1;
equalStrings v0 v1             =  v0 [equalStrings#5,equalStrings#6] v1;
equalStrings#1 v0 v1 v2 v3 v4  =  (==) v3 v0 [con#1,con#2]
                                             (v [equalStrings#5,equalStrings#6] v1);
equalStrings#2 v0 v1 v2        =  False;
equalStrings#3 v0 v1 v2        =  False;
equalStrings#4 v0              =  True;
equalStrings#5 v0 v1 v2 v3     =  v3 [equalStrings#1,equalStrings#2] v0 v1;
equalStrings#6 v0 v1           =  v1 [equalStrings#3,equalStrings#4];
bit v0                         =  v0 [bit#1,bit#2];
bit#1 v0                       =  0;
bit#2 v0                       =  1;
main                           =  emitInt (... [equalStrings#5,equalStrings#6]
                                           ... [bit#1,bit#2]) 0;

The definitions of con, bit and equalStrings itself are all dead code as
they are everywhere inlined.  In the equalStrings computation, alternative
#1 is the common-case choice rather than #2 (and less significantly
#5 rather than #6).  The body of equalStrings#1 is sufficiently large
that its instantiation could account for 2 more clock cycles. Perhaps
equalStrings#5 also takes another 1 cycle?

When PRS is implemented, the (==) applications should take 1 cycle rather
than 4, and so the total of 11 cycles per recursive step can be expected
to fall to 8.

Example 3 (nfib)
----------------

Finally, a classic example from the early days of functional-language
implementation.  The result of nfib n is the number of nfib applications
required to compute it.  Also, nfib n = 2 * fib n - 1.

{
nfib n  =  case (<=) n 1 of {
           True  -> 1 ;
           False -> (+) 1 ((+) (nfib ((-) n 2)) (nfib ((-) n 1))) ;
           } ;
main    =  emitInt (nfib 20) ;
}

The result of nfib 20 is 21891.  In nfib applications involved, the True
alternative is taken 10946 times, and the False alternative 10945 times.
So what preliminary estimate of the number of clock-cycles can we make?

* For each nfib call we need 1 cycle to unwind it (never in spine
  position) and 1 to instantiate it (small enough with case alternatives
  split off).  Then there's 2 further cycles (1 swap, 1 prim) to evaluate
  the (<=) comparison at the top of the spine, and another 1 cycle to
  dispatch on it as the case subject.

* In 10946 calls there is no further cost.

* In 10945 calls there is the cost of instantiating a large body: perhaps
  as much as 5 clock cycles if primitives applications are handled one
  argument at a time.

* Then in these same calls, there is the cost of 3 cycles for each of
  the two subtractions, 4 cycles for the inner (+) and 3 cycles for the
  outer one.  That's 13 cycles for primitive arithmetic!

* So in all we have 21891 * 5 + 10945 * 18 = 306465 cycles plus a few
  more for evaluation of main.  The total should be around 306470.

Compiling with Flite and running using redemu as before, we find
this preliminary estimate is not far out, but once again it is an
underestimate:

==== EXECUTION REPORT ====
Result      =            0
Ticks       =       339298
Swap        =          22%
Prim        =          19%
Unwind      =          32%
Update      =           6%
Apply       =          19%
==========================

The combinator code for this program, compiled with -i1, is as follows:

nfib v0       =  (<=) v0 1 [nfib#1,nfib#2] v0;
nfib#1 v0 v1  =  let { v2 = (-) v1 2; v3 = (-) v1 1 }
                 in (+) 1 ((+) ((<=) v2 1 [nfib#1,nfib#2] v2)
                               ((<=) v3 1 [nfib#1,nfib#2] v3));
nfib#2 v0 v1  =  1;
main          =  emitInt ((<=) 20 1 [nfib#1,nfib#2] 20) 0;

The definition of nfib itself is dead code, and its in-lining everywhere
is the factor that the initial estimate didn't take into account.
The double inlining of nfib makes an even more sizable body for nfib#1
than the estimate allowed for, accounting for another 3 cycles to build
each instance.  Another consequence of this inlining in nfib#1 is that
the (<=) applications are no longer in spine positions, adding another
2 cycles each time.  These factors together add around 55000 extra
clock cycles. The inlining of every nfib call only saves around 22000
cycles, or perhaps 44000 cycles at most.  So right now, it seems that
the inlining of nfib is not a win!

However, consider the situation as it will be with PRS.

Without nfib inlined: Each comparison takes 1 cycle rather than 2,
and the two subtractions together take 1 cycle rather than 6.  Also,
with the elimination of the curried (-) applications, the size of the
instantiated body for the larger alternative drops from 10 applications
to 6; so the number of clock cycles to form it fall from 5 to 3.  In all,
we can expect the nfib cycle count to reduce to 21891 * 4 + 10945 * 11 =
207959, a saving of 33%.

With nfib inlined: The two (-) applications and the two (<=) applications
in nfib#1 together take just 1 cycle rather than 12, and instantation of
the nfib#1 body requires just 4 cycles rather than 8.  So we expect to
avoid 10945 * 15 = 164175 cycles of work, 49% of the original run-time,
giving an even lower final cycle count of 175123.  (That's around 17
million "nfibs per second" with a 120MHz clock.)

Conclusions
-----------

Micro-examples can be instructive.

It would be nice to have a further option for the flite-reduceron
compiler that gives brief reports on the cost of applying each combinator.
For example, it might say "in-lined", or "clock-cycles per application
(including primitives in body): min <m>, max <M>".

For all three examples, we can expect significant gains from PRS.

Even greater benefits could be obtained, in all three examples, where
PRS gives boolean results that are case subjects.  This is a common form:
it corresponds to conditionals, and it also results from the inlining of
flite equivalents of (&&) and (||).  Could the primitive simplifying rules

False [alt#1,alt#2] ==> alt#1
True  [alt#1,alt#2] ==> alt#2

be built-in along with the PRS machinery?  And if alt#1 or alt#2 are
themselves True or False, could the effect be cascaded?

In Examples 2 and 3, instantiation of the largest (and most frequently
applied) bodies takes several clock cycles.  Even with PRS, instantiation
of nfib#1 will require 4 cycles.  And these are *micro*-examples.
Widening this bottleneck a little further could be all the more beneficial,
in terms of the fraction of run-time saved, after PRS.