=================================== REDUCERON MEMO 13 (Subsumes Memo 6) Compiling case expressions Matthew N, 8 January 2009 =================================== We consider case-elimination, a transformation that removes data constructors and case expressions from a program. We observe three weaknesses, and propose a new transformation in which these weaknesses are removed. The new version has a weakness of its own, but it also opens up a new possibility: the selection of a case alternative in zero clock-cycles. Standard case-elimination ------------------------- Case expressions can be removed by defining data constructors as functions and transforming case expressions to function applications. Specifically, for each constructor 'c_i' of a data type with 'n' constructors, a function definition c_i v_1 .. v_(#c_i) w_1 .. w_n = w_i v_1 v_(#c_i) is introduced, where '#c' denotes the arity of the constructor 'c'. Now case expressions of the form case e of c_1 v_1 .. v_(#c_1) -> e_1 .. c_n v_1 .. v_(#c_n) -> e_n are transformed to e (\v_1 .. v_(#c_1) -> e_1) .. (\v_1 .. v_(#c_n) -> e_n) Even the lambda expressions can be eliminated by introducing a function for each alternative alt_1 x_1_1 .. x_1_* v_1 .. v_(#c_1) = e_1 .. alt_n x_n_1 .. x_n_* v_1 .. v_(#c_n) = e_n and transforming the case expression to e (alt_1 x_1_1 .. x_1_*) .. (alt_n x_n_1 .. x_n_*) where 'x_i_1 .. x_i_*' are the non-pattern-bound variables in occurring in 'e_i'. The resulting programs contain no data constructors, no case expressions, and no lambda expressions (assuming they had no lambda expressions to begin with). Functional language implementors now only need to worry about how to do function application efficiently. EXAMPLE 1. The standard definition of list concatenation, append xs ys = case xs of Nil -> ys Cons x xs -> Cons x (append xs ys) is transformed to Nil n c = n Cons x xs n c = c x xs append xs ys = xs ys (consCase ys) consCase ys x xs = Cons x (append xs ys) A new function does not need to be introduced for the nil case because the data constructor 'Nil' has no arguments. EXAMPLE 2. Given the data type data Exp = X | Y | Neg Exp | Add Exp Exp | Sub Exp Exp the evaluator, eval x y X = x eval x y Y = y eval x y (Neg n) = 0 - eval x y n eval x y (Add n m) = eval x y n + eval x y m eval x y (Sub n m) = eval x y n - eval x y m is transformed to X x y neg add sub = x Y x y neg add sub = y Neg n m x neg add sub = neg n Add n m x y neg add sub = add n m Sub n m x y neg add sub = sub n m eval x y e = e x y (negAlt x y) (addAlt x y) (subAlt x y) negAlt x y n = 0 - eval x y n addAlt x y n m = eval x y n + eval x y m subAlt x y n m = eval x y n - eval x y m Weaknesses ---------- We have three main complaints about the above transformation. 1. Linear variables may become non-linear. Look at the definition of 'append' in Example 1; originally, it is clearly linear in 'ys', but not after the transformation is performed. Although 'append' passes 'ys' down to both case alternatives, we know that only one case alternative will be chosen at run-time. This knowledge is lost by the transformation. 2. The same value may be written onto the heap many times. We have just seen that if a variable is referenced in more than one case alternative, then it is passed down to each one separately. Look at the definition of 'eval' in Example 2; 'x' and 'y' are each referenced four times in the right-hand-side. It takes time to write these values onto the heap, space to store them, and more time to unwind them onto the stack when there are needed. 3. Case alternatives are pushed onto the stack. Look at the definition of 'eval' in Example 2; the spine contains a 6-node application. It takes time to write such a long application (containing the case alternatives) onto the stack, and stack space to store it. The new approach ---------------- For the case alternatives, let us instead introduce the functions alt_1 v_1 .. v_(#c_1) alts x_1 .. x_m = e_1 .. alt_n v_1 .. v_(#c_n) alts x_1 .. x_m = e_n and transform case expressions to e alts x_1 .. x_m where 'x_1 .. x_m' is the union of all the non-pattern-bound variables in occurring in each 'e_i' and 'alts' is the lookup table defined by alts = [| alt_1, .., alt_n |] When a constructor 'c_i' appears on top of the stack 's', it is reduced to s[sp+#c_i][i] assuming the stack 's' grows from index 0 onwards and 'sp' is the index of the top stack element. (Recall that '#c' denotes the arity of the constructor 'c'.) EXAMPLE 1 (REVISITED). The definition of 'append' is now transformed to append xs ys = xs appendAlts ys appendAlts = [| nilCase, consCase |] nilCase alts ys = ys consCase x xs alts ys = Cons x (append xs ys) EXAMPLE 2 (REVISITED). The definition of 'eval' is now transformed to eval x y e = e evalAlts x y evalAlts = [| xCase, yCase, negCase, addCase, subCase |] xCase x y alts = x yCase x y alts = y negCase x y alts n = 0 - eval x y n addCase x y alts n m = eval x y n + eval x y m subCase x y alts n m = eval x y n - eval x y m Comparison ---------- The new approach removes the three weaknesses of the old one. 1. Linear variables remain linear. All the variables needed in the case alternatives are pushed onto the stack once, and the case alternative simply picks out the variables it requires. Look at the definition of 'append'; it refers to 'ys' once. 2. The same variable is no longer written onto the heap once for each case alternative that refers to it. Look at the definition of 'eval'; it refers to 'x' and 'y' once. Also, no unwinding is needed when fetching a case alternative. 3. Each case alternative is no longer pushed onto the stack. The case alternatives are stored in the 'alts' table (for example, see 'evalAlts'), and are never written to the stack or the heap. There are also some potential drawbacks. 1. The biggest drawback is that functions are introduced even for case alternatives that have no pattern-bound variables. Look at 'xCase' and 'yCase' in the definition of 'eval'. It will take time to apply these functions, which aren't introduced in the original transformation. 2. The new approach requires new evaluation machinery, namely the special treatment of constructors. A double array lookup is not a big implementation challenge, but still, the obvious implementation of a double array lookup consumes two clock cycles. Can we do better? The next section suggests that it can be done in zero cycles. Zero-cycle case selection ------------------------- Recall that when a constructor 'c_i' appears on top of the stack 's', it is reduced to s[sp+#c_i][i] Here we essentially have a double indirection to follow: 1. Determine the address of the lookup table for the case alternatives by indexing the stack. 2. Determine the address of the appropriate case alternative by indexing the lookup table. Each of these steps consumes a clock-cycle. First observe that the second indirection can be removed. Instead of storing a lookup table of code addresses (for the case alternatives), we can align the case alternatives in code memory so that the Reduceron can jump straight to appropriate code. However, case alternatives may have different body sizes, so we must either: 1. Require that all the case alternatives occupy the same amount of space in code memory, or 2. Put 'next' pointers at appropriate places in code memory. At the moment the Reduceron reads the next block of nodes from code memory while instantiating the current block; the only difference would be that 'next' is not 'address of current block+1' but 'whatever the current block states next is'. The reason for the first indirection is that the address of the lookup table must come after the data construction. As different constructors have different arities, the constructor must be known before the offset of the lookup table on the stack can be determined. One way to remove this indirection is to introduce a new stack, let's call it the LUT stack, for storing the addresses of the lookup tables. Since we always have the top of LUT stack to hand, we don't need to index the main stack anymore. Now when a constructor 'c_i' appears on top of the main stack, it can be reduced to the code address 'lut+i' where 'lut' is the value on top of the LUT stack. Such a simple addition can be performed without consuming any clock-cycles.