========================
REDUCERON MEMO 21
Second quarterly review
Matthew N, 28 April 2009
========================

Memo 15 formalised the semantics of the "Spineless Reduceron",
including a new heap layout allowing single-cycle access, and the
dynamic update avoidance technique outlined by Peyton Jones in his
"Spineless G-Machine" paper.  The memo predicted that dynamic update
avoidance would be cheap to implement in hardware, that 60% of updates
could be avoided, and that a factor of 2.8 speed-up should be possible
over the PhD Reduceron.

Subsequently, we felt that Memo 15 was slightly misleading: the true
source of the speed-up is not spinelessness.  So Memo 18 defined a new
spineful machine for comparison.  It predicted that the spineless
machine would be 15% faster, and suggested that spinelessness brings a
number of small benefits rather than one big benefit.  (The
possibility of dynamic update avoidance was not considered there, but
it should be useful to both spineless and spineful machines.)

Reduceron II
------------

With Memo 15 as a basis, implementation of the "Reduceron II" on FPGA
began.  A somewhat rushed first cut was finished in time for HFL'09.

The new heap layout was straightforward to implement, but we
discovered that a similar technique could not be used to obtain
single-cycle access to the stack - unlike the heap, the stack cannot
contain gaps.  So we implemented the stack using registers connected
to a wide RAM via spill-handling hardware.  The resulting stack design
is quite tricky; we did manage to find a fairly nice design, but it
does contain the critical path.

Dynamic update avoidance is indeed simple and cheap to implement in
hardware.  The new method of evaluating case expressions (Memo 13) was
also implemented without difficulty.  Indeed, it is possible, with
added complexity, to do zero-cycle case matching using a parallel
case-table stack.

In the rush, some short-cuts were taken.  In particular, there is no
garbage collector, and the maximum arity of functions is just 4!  On
the positive side, the new method of handling case expressions greatly
reduces the arity of 'case-alternative' functions, and all the
benchmark programs met the restriction.

The new implementation runs 3 times faster than the PhD Reduceron, in
terms of clock cycles, and its clock frequency is roughly the same.
Using the new FPGA, both designs clock about 30% faster.

The new implementation is written in Blarney, a new Lava clone
supporting multi-output primitives, RAMs, easy creation of new
primitives and back-ends, behavioural description, and sized vectors.
(To-do: write a memo about Blarney, perhaps by way of a short
tutorial.)
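As an aside, the spill-handling idea behind the new stack is easy to
model in software: the top few stack elements live in a register
window, and a whole block of elements is transferred to or from a wide
RAM whenever the window overflows or runs dry.  The Haskell model
below is only an illustrative sketch; the names and the constants
blockWidth and maxRegs are assumed for the example and do not
correspond to the actual Blarney description.

  -- A purely functional model of a stack whose top elements live in a
  -- small register window, spilling to (and refilling from) a wide
  -- RAM one block at a time.  Elements are kept newest-first.
  module SpillStack where

  type Elem = Int

  blockWidth, maxRegs :: Int
  blockWidth = 4   -- elements moved per spill/refill (assumed)
  maxRegs    = 8   -- capacity of the register window (assumed)

  data Stack = Stack
    { regs :: [Elem]    -- register window, top of stack first
    , ram  :: [[Elem]]  -- spilled blocks, most recently spilled first
    } deriving Show

  empty :: Stack
  empty = Stack [] []

  -- Push: if the window is full, first spill the oldest blockWidth
  -- elements to RAM in one block-wide write.
  push :: Elem -> Stack -> Stack
  push x (Stack rs blocks)
    | length rs < maxRegs = Stack (x : rs) blocks
    | otherwise           = Stack (x : keep) (spilt : blocks)
    where
      keep  = take (maxRegs - blockWidth) rs
      spilt = drop (maxRegs - blockWidth) rs

  -- Pop: if the window is empty, refill it with one block-wide read.
  pop :: Stack -> Maybe (Elem, Stack)
  pop (Stack []       []          ) = Nothing
  pop (Stack []       (b : blocks)) = pop (Stack b blocks)
  pop (Stack (x : rs) blocks      ) = Just (x, Stack rs blocks)

For example, foldl (flip push) empty [1..10] causes one spill, and
repeated pops still return 10 down to 1 in LIFO order.  In hardware,
each spill or refill would be a single wide RAM access, the aim being
single-cycle stack access in the common case.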
Talk at HFL 2009
----------------

Memo 19 contains the slides I presented at HFL'09 in March.  The talk
was entitled "Three improvements to the Reduceron".  My overall
feeling is that the talk didn't go well.  Judging by the nodding
heads, I had the audience's attention at the beginning, when
discussing how and why the Reduceron widens the memory bottleneck.
But after that I think I went too far into the gory details, and some
of my points were a bit laboured.  I don't think I dealt well with
audience negativity in two areas:

1. The use of block-RAMs rather than large off-chip memories.

2. The Reduceron still does not beat GHC-compiled programs running on
   a PC.

I recall that the following two suggestions were made:

1. It would be interesting to see if the Reduceron can exploit the
   fine-grained parallelism that is abundant in functional programs,
   especially as conventional hardware has struggled here.

2. It would be interesting to compare the Reduceron with GHC-compiled
   programs running on an FPGA soft-core.

Invited talk at Birmingham University
-------------------------------------

Memo 20 contains the slides of my invited talk at Birmingham
University in April.  This talk went a bit better.  I started out
motivating the Reduceron and discussing its potential advantages,
which went fine.  I introduced FPGAs, and this led to some useful
discussion.  I gave a very quick overview of lazy evaluation, which I
think was worthwhile, although it did cause one guy to fall asleep!
The slides about widening the memory bottleneck again went down well.
This time, I followed up naturally with a few punchier points about
the parallel update-stack and case-table memories.  I then discussed
two dynamic analyses: update avoidance and speculative evaluation of
primitive redexes.  Perhaps the update avoidance slides were still a
bit laboured.  At the end, there was plenty of discussion and even the
snoozer woke up to ask lots of questions.  There were some positive
comments about the ideas, the talk, and even the slides.  The worries
about poor memory capacity and performance arose again, but at least
this time I was better prepared to discuss them.  My general feeling
was that they liked the work but were disappointed by the current
results.

What next?
----------

To complete and document:

1. Expand the arity limit of 4.
2. The stack design.
3. The Reduceron II.
4. Performance measurements & comparison.
5. Blarney.

To explore:

1. Make the sequential reducer use only one port of the heap, freeing
   the other port for garbage collection and parallel reduction.
2. A garbage collector.
3. A parallel garbage collector.
4. A hybrid parallel garbage collector and parallel reducer.