============================= REDUCERON MEMO 15 Towards a spineless Reduceron Matthew N, 13 January 2009 ============================= We define the semantics of a spineless template-instantiation machine with vectored applications, and its compiler. We distinguish between non-shared and possibly-shared applications by a dynamic analysis, allowing updates of definitly non-shared applications to be avoided. We measure the number of updates saved accross a range of programs, and estimate the performance of a hardware implementation of the machine. Finally, we point out the main ways in which we intend to improve the machine and compiler. We begin with a few imports. > import List > import System.IO.Unsafe(unsafePerformIO) Use of 'unsafePerformIO' is confined to the implementation of the 'emit' and 'emitInt' primitives of F-lite. Semantics --------- The heap is modelled as a list of applications, and can be indexed by a heap-address. > type Heap = [App] > type HeapAddr = Int An application is a list of nodes. > type App = [Node] Typically, function application is represented as a binary operator. A list is a useful generalisation because we can make efficient use of a machine with a wide word size. A node is an application pointer, an integer, a function, a primitive function, or an argument. > data Node = > APP Shared HeapAddr > | INT Int > | FUN Arity [Instr] > | PRI PrimOp > | ARG Shared Int An application may be unshared or possibly-shared, represented by False and True respectively. > type Shared = Bool Similarly, arguments can be unshared or shared, depending on how many times they are referenced in the body of a function. Argument nodes should never appear on the heap, only in template memory. > type Arity = Int Function nodes contain an arity and a list of instructions to execute when the function is applied. So template memory is implicit. An instruction either builds an application on the heap, or pushes an application on the stack. > data Instr = > MKAP App > | PUSH App For example, the function 's' defined by s f g x = f x (g x) would be represented by the following node. FUN 3 [ MKAP [ARG False 1, ARG True 2] , PUSH [ARG False 0, ARG True 2, APP False 0] ] The application node 'APP False 0' is a relative address, referring to the application at address 'hp+0' where 'hp' is the value of the heap pointer before entering the code for 's'. The only difference between an application made using MKAP and one made using PUSH is that the latter application represents the spine of the function body. In a spineless machine, this application need only be built on the stack, not the heap. In the compilation scheme presented below, function nodes always contain exactly one PUSH instruction. However, in a machine with a finite application size, it may be useful to perform several PUSHes. For efficiency, it is useful to perform PUSHes after MKAPs as they do not access the heap. Consequently, if the new top-of-stack is an application node, then the application can be pre-fetched from the heap, saving a clock-cycle. What's more serious is that performing PUSHes before MKAPs would destroy the stack before the MKAPs have fetched the necessary arguments. F-lite supports the following 7 primitive operations: (+), (-), (==), (/=), (<=), emit, and emitInt. > type PrimOp = String The stack is modelled as a list of nodes, and can be indexed by a stack-address. > type Stack = [Node] > type StackAddr = Int A spineless machine also contains an update-stack. > type UStack = [Update] An update is a stack-address/heap-address pair. When a function demands arguments beyond the stack-address of the top update, the value at the corresponding heap-address is updated. See below for details. > data Update = Update StackAddr HeapAddr Here is a small-step operational semantics for the machine. It operates over a triple containing the stack, the update-stack and the heap. It assumes that the expression to be evaluated is of type integer. > eval :: Stack -> UStack -> Heap -> Int > eval (APP sh addr:s) u h = eval (dashApp sh (h !! addr) ++ s) u' h > where u' = if sh then Update (length s+1) addr:u else u > eval s (Update saddr haddr:u) h > | updateCheck s saddr = eval s' u h' > where (s', h') = performUpdate s h saddr haddr > eval [INT i] u h = i > eval (INT i:a:s) u h = eval (a:INT i:s) u h > eval (PRI p:s) u h = eval (applyPrim p s) u h > eval (FUN arity code:s) u h = eval s' u h' > where (args, rest) = splitAt arity s > (s', h') = foldl (exec args (length h)) (rest, h) code The following sections elaborate on each of the rewrite rules. Unfolding --------- To unfold a function, we execute its instructions. To do this, we need to know the arguments to the function, the heap pointer (to convert relative addresses in template-memory into absolute ones on the heap), the argument-less stack (on which the spine application will pushed) and the heap (where the other applications will be written to). During execution, the stack and the heap are modified. > exec :: [Node] -> HeapAddr -> (Stack, Heap) -> Instr -> (Stack, Heap) > exec args base (s, h) (MKAP app) = (s, h') > where h' = h ++ [map (inst args base) app] > exec args base (s, h) (PUSH app) = (map (inst args base) app ++ s, h) > inst :: [Node] -> HeapAddr -> Node -> Node > inst args base (APP shared offset) = APP shared (base + offset) > inst args base (ARG shared index) = dash shared (args !! index) > inst args base n = n When substituting for a shared argument we must ensure that, if the substituted value is an application, then it is marked as shared. Peyton Jones calls this operation "dashing". > dash :: Bool -> Node -> Node > dash True (APP _ addr) = APP True addr > dash _ n = n Unwinding --------- Unwinding an application is not simply a case of copying it from the heap to the stack. If it is a shared application, we must dash each node in order to propagate the sharing information. > dashApp :: Bool -> App -> App > dashApp b = map (dash b) Furthermore, if it is a shared application, we must push an update on the update-stack. This allows us to detect when evaluation of the application has reached head normal form, and tells us the address to write the result to. Updating -------- An update-check is performed before every reduction (but not before unwinding). The check returns true if the arity of the function to reduce is larger that the difference between the current stack-pointer and the stack-address of the top update on the update-stack. > updateCheck :: Stack -> StackAddr -> Bool > updateCheck s@(top:_) saddr = arity top > length s - saddr > arity :: Node -> Int > arity (FUN arity code) = arity > arity (PRI p) = 2 > arity (INT i) = 1 All primitives have an arity of two. Perhaps surprisingly, an integer has an arity of one, but indeed it is useful to treat integers as unary functions as we'll see. To perform the update, we compute the length of the normal-form, read it from the stack, and write it to the heap. Quite subtly, we must also dash the normal-form on the stack, because the process of writing it to the heap could duplicate an otherwise unshared reference. > performUpdate :: Stack -> Heap -> StackAddr -> HeapAddr -> (Stack, Heap) > performUpdate s h saddr haddr = (app ++ drop n s, update haddr app h) > where n = 1 + length s - saddr > app = dashApp True (take n s) > update :: Int -> a -> [a] -> [a] > update i a xs = [if i == j then a else x | (j, x) <- zip [0..] xs] Primitives ---------- To force evaluation of the arguments to strict binary-arithmetic primitives, primitive applications occuring in the F-lite program are transformed by the rule a x y --> y (x a) where 'a' is an arithmetic primitive and 'x' and 'y' are expressions. Emit primitives, which are binary operations requiring only their first arguments to be evaluated, are handled by the similar rule e x k --> x e k where 'e' is an emit primitive and 'x' and 'k' are expressions. These transformations treat integers as unary functions. Specifically, an integer 'i' is reduced by the following rule i k --> k i which corresponds to a swap operation in the reduction machine. Applying a primitive function is now rather straightforward because we know the arguments must be evaluated. > applyPrim :: PrimOp -> Stack -> Stack > applyPrim "(+)" (INT a:INT b:s) = INT (a + b):s > applyPrim "(-)" (INT a:INT b:s) = INT (a - b):s > applyPrim "(==)" (INT a:INT b:s) = bool (a == b):s > applyPrim "(/=)" (INT a:INT b:s) = bool (a /= b):s > applyPrim "(<=)" (INT a:INT b:s) = bool (a <= b):s > applyPrim "emit" (INT a:k:s) = emitStr [toEnum a] k:s > applyPrim "emitInt" (INT a:k:s) = emitStr (show a) k:s We assume that data constructors and case expressions have been compiled using Jansen's encoding. So the constructors 'True' and 'False' are represented by the following nodes. > bool :: Bool -> Node > bool False = FUN 2 [PUSH [ARG False 0]] > bool True = FUN 2 [PUSH [ARG False 1]] The 'emitStr' function takes two arguments and returns its second. It has the side-effect of printing its first argument. > emitStr :: String -> a -> a > emitStr s k = unsafePerformIO (putStr s) `seq` k This completes the small-step semantics. Now we define a compiler for the machine. Compilation ----------- Programs contain function definitions, > type Prog = [Defn] and a function definition consists of a function identifier, its arity, a list of let-bindings, and its body. > type Defn = (Id, Int, [Bind], Exp) > type Id = String > type Bind = (Id, Exp) Expressions are defined as follows. > data Exp = > Apply [Exp] > | Fun Id > | Prim Id > | Var Id > | Arg Shared Int > | Int Int Function arguments are referred to by position using the 'Arg' constructor; we assume they are already marked as unshared or shared. One reason for this is that Jansen's encoding may hide the true linearity of variables in case alternatives, so if Jansen's encoding is used, the sharing analysis, albeit simple, needs to be done prior to the removal of case expressions. In this memo, we do not wish to consider how best to implement case expressions, but see Memo 13 for some thoughts on this. Variables refer only to local let-bindings in a function's definition. The compiler turns a program into a node which, when evaluated, represents the result of the program. > compile :: Prog -> Node > compile p = head [FUN arity code | ("main", arity, code) <- p'] > where p' = [(f, arity, comp p' bs e) | (f, arity, bs, e) <- p] > comp p bs e = compSpine p binds code e > where (vs, es) = unzip bs > binds = zip vs ns > (code, ns) = mapAccumL (compExp p binds) [] es > compSpine p binds code (Apply es) = fst (compApp p binds True code es) > compSpine p binds code e = compSpine p binds code (Apply [e]) > compExp p binds code (Int i) = (code, INT i) > compExp p binds code (Var v) = (code, n) > where n = head [dash True n | (w, n) <- binds, v == w] > compExp p binds code (Arg shared i) = (code, ARG shared i) > compExp p binds code (Prim f) = (code, PRI f) > compExp p binds code (Fun f) = (code, n) > where n = head [FUN arity is | (g, arity, is) <- p, f == g] > compExp p binds code (Apply es) = compApp p binds False code es > compApp p binds spine code es = > (code' ++ [if spine then PUSH ns else MKAP ns], APP False (length code')) > where (code', ns) = mapAccumL (compExp p binds) code es Running a program ----------------- The result of a program is defined by the following function. > run :: Prog -> Int > run p = eval [compile p] [] [] Performance measurements: updates avoided ----------------------------------------- Here are the numbers of updates performed, applications unwound, instructions executed, swaps and primitives reductions accross a range of benchmark programs. PROGRAM UPDATES UNWINDS INSTRS SWAPS PRIMS Fib 16718 33471 75316 25092 20913 Sudoku 16839 39274 94298 8138 5391 OrdList 40400 227358 611949 0 1 Queens 55220 169355 98500 76546 43435 Queens2 70339 146681 374976 17122 9044 PermSort 65515 139635 364867 17325 8674 MSS 96805 187826 266116 83388 41726 Clausify 77848 243064 769995 24964 14317 While 68815 160735 447688 31451 15726 To see the impact of the dynamic sharing analysis, look at the differences between the numbers of updates and unwinds. Without any sharing analysis, these numbers would be the same. On average, the dynamic sharing analysis avoids 60% of updates. Performance estimate -------------------- How well would such a machine perform in hardware? We make the following estimate and compare it with the Reduceron from Matthew's PhD. PROGRAM PHD REDUCERON SPINELESS REDUCERON Fib 230135 129666 Sudoku 435996 97397 OrdList 1614678 455078 Queens 962788 443056 Queens2 1100675 373775 PermSort 932722 353090 MSS 1485660 468804 Clausify 1662671 639097 While 1019827 453986 What assumptions do we make about the hardware implementation of the spineless Reduceron? Well, many of these have been discussed in previous memos, but here is a recap: 1. Unwinding takes a single cycle, due to of memory-aligned applications. Indeed, in the spineless experiments, all applications have been transformed so they have a maximum length of 4. 2. Executing an instruction takes a single cycle, but two instructions in the same instruction sequence can be executed in parallel, thanks to the dual-port heap. Again, all PUSH and MKAP instructions deal with applications of maximum length 4. 3. A swap and a primitive reduction both take one cycle each. 4. Updating takes '1 + n `div` 4' cycles where 'n' is the length of the normal-form. Future developments ------------------- Before attempting a hardware implementation of this machine, there are three main improvements we'd like to explore: handling of case expressions, speculative evaluation of primitives, and garbage collection.