================================================
REDUCERON MEMO 37
Determining the number of clock cycles needed to
instantiate a combinator body.
Version 2
Colin Runciman, 9 October 2009
================================================
Memos 34 and 36 made assumptions about the cost of instantiating
combinator bodies. Subsequent white-board discussion with Matt has
corrected some assumptions that were wrong or imprecise.
This memo gives a simple way to determine the number of clock cycles
per instantiation of a given combinator body, which should be accurate
for the current implementation.
Case-count, length and weight of an application
-----------------------------------------------
The case-count of an application e1 ... en is the number of expressions
ei that are case-alternative tables.
The length of a application e1 ... en is n minus the case-count of
the application if the application is spinal, or n if the application
is interior (non-spinal).
The weight of an application of a primitive binary function, with
neither of the first two arguments constant, is 2. The weight of all
other applications is 1.
Parameters
----------
There are four key parameters, corresponding to the Flite -r compile-time
values, which together limit the size and complexity of expressions that
can be instantiated in a single cycle:
SLMAX = maximum spinal application length
ILMAX = maximum interior (non-spinal) application length
CTMAX = maximum number of case-alternative tables
APMAX = maximum number of applications
The values in these parameters in the current Reduceron implementation
are SLMAX = 6, ILMAX = 4, CTMAX = 1, APMAX = 2.
The Method
----------
(1) Consider the spinal application of the combinator body. If the
application length exceeds SLMAX or the case-count exceeds CTMAX,
bracket the maximal partial application of length at most ILMAX and
case-count at most CTMAX, and proceed from (1).
(2) Consider each non-spinal application. If any length exceeds IMAX or
any case-count exceeds CTMAX, in each such case bracket the maximal
partial application of length at most ILMAX and case-count at most
CTMAX and proceed from (2).
(3) Let WSUM be the sum of the weights of all applications in the body.
Clock cycles per instantiation = ceiling (WSUM / APMAX).
Examples
--------
Consider two of the intermediate-stage combinators for the kingInCheckFrom
function, as in Memo 36.
kICF#5 v0 ... v6 = Pair Rook (Pair v1 v2) [kICF#9]
v6 v5 [dis#1,dis#2]
(Pair Bishop (Pair v1 v2) [kICF#9] v6 v5);
kICF#6 v0 ... v6 = (==) v1 v3 [con#1,con#2]
(v5 [emptyAtAll#1] (filePath v3 v2 v4)) [dis#1,dis#2]
((==) v2 v4 [con#1,con#2]
(v5 [emptyAtAll#1] (rankPath v4 v1 v3)));
(1) The length of the spinal application in #5 is 6, and in #6 it is 5.
But the case-count of both spinal applications is 2, exceeding CTMAX.
So we introduce brackets in each case:
kICF#5 v0 ... v6 = (Pair Rook (Pair v1 v2) [kICF#9] v6 v5)
[dis#1,dis#2]
(Pair Bishop (Pair v1 v2) [kICF#9] v6 v5);
kICF#6 v0 ... v6 = ((==) v1 v3 [con#1,con#2]
(v5 [emptyAtAll#1] (filePath v3 v2 v4)))
[dis#1,dis#2]
((==) v2 v4 [con#1,con#2]
(v5 [emptyAtAll#1] (rankPath v4 v1 v3)));
Now the spinal length is 2 in each case, and the spinal case-count is 1.
(2) In #5 both Pair applications have length 6, exceeding ILMAX, so
brackets must be introduced. In #6 both (==) applications have length
5, exceeding ILMAX, so more brackets are needed. In both #5 and #6, all other
application lengths are no more than ILMAX, and there is no problem
with case-counts.
kICF#5 v0 ... v6 = ((Pair Rook (Pair v1 v2) [kICF#9]) v6 v5)
[dis#1,dis#2]
((Pair Bishop (Pair v1 v2) [kICF#9]) v6 v5);
kICF#6 v0 ... v6 = (((==) v1 v3 [con#1,con#2])
(v5 [emptyAtAll#1] (filePath v3 v2 v4)))
[dis#1,dis#2]
(((==) v2 v4 [con#1,con#2])
(v5 [emptyAtAll#1] (rankPath v4 v1 v3)));
(3) The sum of the application weights in #5 is 7, so 4 cycles are needed
to instantiate it. The sum of the application weights in #6 is 11,
so 6 cycles are needed to instantiate it.
Let expressions
---------------
The instantiation costs for applications defined in let bodies need only
be paid once, of course. In applications where the let variable is used,
the costs are the same as for argument variables.
Impact of PRS
-------------
These se instantiation costs will be unaltered by the PRS scheme Matt
is currently implementing. The primitive applications with simple
arguments are *candidates* for evaluation. Each is only evaluated to
its result during instantiation if it is indeed found to be a redex.
So the instantiation timing must allow for instances of candidate
primitive applications that are non-redexes.
Conclusions
-----------
It is not so hard to formulate an exact clock-count for instantiation,
based on a combinatory form that is still quite readable for programmers.
It would be helpful if the Flite options -d and -r could be used together.
After each combinator dumped by -d there could be a report of the number
of clock cycles per instantiation, assuming the -r parameter values.
Currently the most severe of the *MAX limiting parameter values is
CTMAX = 1. In both examples, the spinal application had to be split
solely because it included two case-table arguments. In discussion,
Matt has said there should be no difficulty in raising CTMAX to 2.
The parameter APMAX = 2 is also a significant limiting factor. As Matt
showed in Memo 25, a significant speed-up could be expected if APMAX were
raised from 2 to 3. This is hardly surprising given the role of APMAX as
divisor in step (3): with APMAX = 3 the clock-cycles per instantiation
fall from 4 to 3 for #5 and from 6 to 4 for #6. With the introduction
of PRS the case for increasing APMAX is still stronger: the number of
primitive-redex evaluations per instantiation cycle is also limited
by APMAX. If the redesign at the Reduceron level would take a bit of
work, would it be worth doubling APMAX 4 to get a larger pay-off from
the effort, or would that risk lengthening the critical path?