========================== REDUCERON MEMO 18 Spineless versus Spinefull Matthew N, 23 January 2008 ========================== In Memo 15, we presented a semantics for a spineless Reduceron and, at the end, some figures suggesting that it will be more efficient than the existing Reduceron. However, we did not intend to suggest that the efficiency improvement is entirely due to moving from a spinefull machine to a spineless one. The real cause of the speed-up is surely the use of memory-aligned applications, which is not specific to a spineless machine. Peyton Jones states that the spineless G-machine is more efficient that the original (spinefull) G-machine, but he is concerned with the PC architecture. If the heap and stack can be accessed in parallel (as in the Reduceron), then the spine can be written to both at the same time, absorbing the main inefficiency of a spinefull machine. So the question is: which is better, a spineless or a spinefull Reduceron? This memo presents a semantics and a compiler for spinefull Reduceron with memory-aligned applications, discusses its strengths and weaknesses compared to the spineless Reduceron, and finally presents some performance figures. > import List > import System.IO.Unsafe(unsafePerformIO) Semantics ========= > type Heap = [App] > type HeapAddr = Int > type App = [Node] The first difference from the spineless machine is that application and argument nodes do not contain a 'possibly-shared' flag. However, it does seem possible to utilise such sharing information in a spinefull machine, and we comment on this below. > data Node = > APP HeapAddr > | INT Int > | FUN Arity [Instr] > | PRI PrimOp > | ARG Int > type Shared = Bool > type Arity = Int > type PrimOp = String > data Instr = > MKAP App > | PUSH App > type Stack = [Node] > type StackAddr = Int The update stack is a still list of updates like in the spineless machine. But the update stack now stores the heap address of every node on the update stack. The update stack in the spineless machine contained only update records for the possibly-shared applications currently under evaluation. > type UStack = [Update] An update record now contains the heap address of the application to update, and an offset within that application. > data Update = Update HeapAddr Int The offset is needed to correctly update an over-saturated application. For example, if the application 'f a b' at address 'n' is unwound onto the stack, then the stack and address stack (which both grow downwards) look as follows. STACK USTACK b n,3 a n,2 f n,1 If 'f a' reduces to 'z', we must update the 'f a b' at address 'n' with 'z b'. This can be done efficiently if the nodes of an application are stored in reverse order on the heap. The spinefull evaluator differs in several places from the spinless one: 1. To unwind an application of length 'n', 'n' nodes must be written both onto the node and address stacks. To do this efficiently, we need a wide update stack as well as a wide node stack. In the spineless machine, we didn't need a wide update stack. 2. No "update-check" is required. Instead, an update is performed on every reduction, including primitive and non-primitive function application. 3. When executing a PUSH instruction, an update is performed, utilising one of the ports on the heap. If the number of instructions to execute is even, then both ports of the heap are in use on the final clock-cycle, so we must delay for a clock-cycle to read the next application. In the spineless machine, PUSH instructions never access the heap. 4. Although the spine can be written to the heap and stack in parallel, heap space is wasted when unnecessarily writing the spine to the heap. 5. Like in the spineless machine, updating is done by overwriting the redex with the result (not with an indirection). However, there is a special case where an indirection is required: when an application contained within an over-saturated application has been reduced. Such an update requires two heap accesses: one to write the indirection, and one to write the application. If 'possibly-shared' applications were implemented, as in the spineless machine, the spinefull machine could in sometimes avoid updating, freeing up the heap for other uses. > eval :: Stack -> UStack -> Heap -> Int > eval (APP addr:s) (_:u) h = eval (ap ++ s) (us ++ u) h > where ap = h !! addr > us = map (Update addr) [1..length ap] > eval [INT i] u h = i > eval (INT i:e:s) u h = eval (e:INT i:s) u h > eval (PRI p:s) u h = eval s' u' h' > where res = applyPrim p s > (s', u', h') = performUpdate (drop 2 s) (drop 3 u) h (u !! 2) [res] > eval (FUN arity code:s) u h = eval s' u'' h' > where (args, rest) = splitAt arity s > root = u !! arity > u' = drop (arity+1) u > (s', u'', h') = foldl (exec args (length h) root) (rest, u', h) code > exec args base root (s, u, h) (MKAP app) = (s, u, h') > where h' = h ++ [map (inst args base) app] > exec args base root (s, u, h) (PUSH app) = performUpdate s u h root app' > where app' = map (inst args base) app > performUpdate s u h (Update a i) app > | length app <= i = let addrs = map (Update a) [1..length app] in > (app ++ s, addrs ++ u, update a app h) > | otherwise = let addrs = map (Update (length h)) [1..length app] in > (app ++ s, addrs ++ u, update a [APP (length h)] h ++ [app]) > update :: Int -> a -> [a] -> [a] > update i a xs = [if i == j then a else x | (j, x) <- zip [0..] xs] > inst :: [Node] -> HeapAddr -> Node -> Node > inst args base (APP offset) = APP (base + offset) > inst args base (ARG index) = args !! index > inst args base n = n > applyPrim :: PrimOp -> Stack -> Node > applyPrim "(+)" (INT a:INT b:s) = INT (a + b) > applyPrim "(-)" (INT a:INT b:s) = INT (a - b) > applyPrim "(==)" (INT a:INT b:s) = bool (a == b) > applyPrim "(/=)" (INT a:INT b:s) = bool (a /= b) > applyPrim "(<=)" (INT a:INT b:s) = bool (a <= b) > applyPrim "emit" (INT a:k:s) = emitStr [toEnum a] k > applyPrim "emitInt" (INT a:k:s) = emitStr (show a) k > bool :: Bool -> Node > bool False = FUN 2 [PUSH [ARG 0]] > bool True = FUN 2 [PUSH [ARG 1]] > emitStr :: String -> a -> a > emitStr s k = unsafePerformIO (putStr s) `seq` k Compiler ======== > type Prog = [Defn] > type Defn = (Id, Int, [Bind], Exp) > type Id = String > type Bind = (Id, Exp) > data Exp = > Apply [Exp] > | Fun Id > | Prim Id > | Var Id > | Arg Shared Int > | Int Int > compile :: Prog -> Node > compile p = head [FUN arity code | ("main", arity, code) <- p'] > where p' = [(f, arity, comp p' bs e) | (f, arity, bs, e) <- p] > comp p bs e = compSpine p binds code e > where (vs, es) = unzip bs > binds = zip vs ns > (code, ns) = mapAccumL (compExp p binds) [] es > compSpine p binds code (Apply es) = fst (compApp p binds True code es) > compSpine p binds code e = compSpine p binds code (Apply [e]) > compExp p binds code (Int i) = (code, INT i) > compExp p binds code (Var v) = (code, n) > where n = head [n | (w, n) <- binds, v == w] > compExp p binds code (Arg shared i) = (code, ARG i) > compExp p binds code (Prim f) = (code, PRI f) > compExp p binds code (Fun f) = (code, n) > where n = head [FUN arity is | (g, arity, is) <- p, f == g] > compExp p binds code (Apply es) = compApp p binds False code es > compApp p binds spine code es = > (code' ++ [if spine then PUSH ns else MKAP ns], APP (length code')) > where (code', ns) = mapAccumL (compExp p binds) code es > run :: Prog -> Int > run p = eval [compile p] [Update 0 1] [undefined] Estimated performance ===================== Number of clock ticks taken by spineless and spinefull implementations on a range of benchmarks: +-----------+------------+-----------+ | PROGRAM | SPINELESS | SPINEFULL | +-----------+------------+-----------+ | Fib | 129666 | 158955 | | Sudoku | 97397 | 133665 | | OrdList | 455078 | 531051 | | Queens | 443056 | 473734 | | Queens2 | 373775 | 392223 | | PermSort | 353090 | 358934 | | MSS | 468804 | 516212 | | Clausify | 639097 | 793645 | | While | 453986 | 512452 | +-----------+------------+-----------+ Conclusion ========== The spineless Reduceron appears to have a few small advantages over a spinefull one. 1. It is slightly faster. 2. It uses less heap space. 3. It doesn't require sometimes introducing indirections when updating a redex. 4. Its update stack is simpler; only one element ever needs to be pushed or popped at a time. Its update stack also doesn't grow as large. 5. It is more modular, as it separates updating from application, making application quite a bit simpler.