=======================
REDUCERON MEMO 27
Design of the Octostack
Matthew N, 6 July 2009
=======================

This memo presents various hardware designs of a stack data structure
allowing:

  * the top N elements of the stack to be observed;
  * up to N elements to be popped off the stack;
  * and up to N elements to be pushed onto the stack.

In the Reduceron, where N is defined to be 8, this data structure is
referred to as the "Octostack".  The intention is that, on FPGA, the
pushing and popping operations each take a single clock-cycle to
perform (and can be performed in parallel), and the top elements can
always be observed in zero clock-cycles.

> import List
> import Maybe
> import Array

Interface
---------

The stack is parameterised by N, now referred to as 'blockSize'.

> blockSize :: Int
> blockSize = 8

A block is a container storing exactly 'blockSize' elements, although
this constraint is not enforced by the type system.

> type Block a = [a]

The data structure provides four operations.

> class Stack s where
>   empty  :: s a
>   size   :: s a -> Int
>   tops   :: s a -> Block (Maybe a)
>   update :: Int -> Block (Maybe a) -> s a -> s a

The 'Maybe' type is used to represent partial-definedness.  If the top
block of stack elements is requested and the stack contains fewer than
'blockSize' elements, then the returned block will contain some
undefined elements.  Similarly, when writing a new block of
elements to the stack, only the defined elements in the block are
written, not the undefined ones.

The 'update' operation is the only slightly unusual one.  It takes
three arguments:

  * an integer offset in the range '-blocksize' to 'blocksize';
  * a block of elements to write onto the top of the stack;
  * and a stack.

And it returns a new stack whose size is equal to the size of the
input stack plus the offset, and whose top 'blockSize' elements are
overwritten with the defined values in the write-block.

Example
-------

To illustrate the stack interface, let us start with the empty stack
and perform a few pushes and pops.

> s1, s2, s3, s4, s5 :: Stack s => s Int
> s1 = empty

We will use the following shorthands for 'Nothing' and 'Just'.

> n :: Maybe a
> n = Nothing

> j :: a -> Maybe a
> j a = Just a

To push the elements 1 to 5:

> s2 = update 5 [j 1,j 2,j 3,j 4,j 5,n,n,n] s1

To push the elements 10 to 13:

> s3 = update 4 [j 10,j 11,j 12,j 13,n,n,n,n] s2

To pop two elements:

> s4 = update (-2) [n,n,n,n,n,n,n,n] s3

To pop three elements, and in parallel, push the element 20:

> s5 = update (-2) [j 20,n,n,n,n,n,n,n] s4

Notice that the offset (passed as the first argument to 'update') is
in general equal to the number of pushes minus the number of pops.

Implementation 1
----------------

Since we are targeting FPGAs, we will make use of RAMs to implement
the stack.  The fact that the locations in a RAM may be uninitialised
is captured by the 'Maybe' type.

> type RAM a = Array Int (Maybe a)

It is important to account for the fact that RAMs have a capacity.

> ramCapacity :: Int
> ramCapacity = 1024

The actual value doesn't really matter here, but it is needed to
compute correct array indexes.  In particular, if a negative number,
say -n, is placed on the address bus of a RAM then it is reasonable to
expect the nth element from the end of the RAM appear on the data bus.
This is because, on FPGA, RAM addresses are in twos complement form.

> wrap :: Int -> Int
> wrap n = (if n < 0 then n + ramCapacity else n) `mod` ramCapacity

We represent a stack by a size/contents pair.

> data Stack1 a =
>   Stack1 {
>     size1 :: Int
>   , contents1 :: RAM a
>   }

Given the size of a stack, the following function computes the
addresses of the top 'blockSize' elements.

> addresses1 :: Int -> [Int]
> addresses1 n = [wrap (n-i) | i <- [1..blockSize]]

Now for the implementation.

> instance Stack Stack1 where
>   empty = Stack1 { size1 = 0, contents1 = ram }
>     where ram = listArray (0, ramCapacity-1) (replicate ramCapacity Nothing)
>
>   size s = size1 s
>
>   tops s = map (contents1 s !) (addresses1 (size1 s))
>
>   update offset writes s =
>     Stack1 { size1 = newSize, contents1 = contents1 s // updates }
>     where
>      newSize = size1 s + offset
>      updates = [u | u@(_, Just _) <- zip (addresses1 newSize) writes]

> example1 = tops (s5 :: Stack1 Int)

As a quick test, 'example1' has the following value.

  [Just 20,Just 2,Just 3,Just 4,Just 5,Nothing,Nothing,Nothing]

Nasty property of Implementation 1
----------------------------------

If the top 'blockSize' elements are to be read/written in a single
clock-cycle, then the above design requires a RAM with 'blockSize'
read/write ports.  Implementing a RAM with an arbitrary number of read
ports is easy on an FPGA, but to implement one with more than two
write ports is very difficult - in fact, it seems only possible using
flip-flops, which is not very scalable.

Nice property of Implementation 1
---------------------------------

An attractive property is that the same addresses are used both when
reading from and writing to the top 'blockSize' locations of the RAM.
This need not necessarily be the case: to write three elements onto
the stack, we only need to set three addresses correctly; it doesn't
matter what the other five addresses are set to.

However, if the addresses are the same when reading as when writing
then reading and writing can be done in parallel.  In particular,
after writing to the stack, we do not have to wait until the next
clock cycle in order to schedule a read of the new top elements.  We
can, for example, pop four elements from the stack, push two new ones
and, *at the same time*, schedule a read the top eight elements.  

(All this is thanks to the fact that FPGA block RAMS have separate
data busses for reading and writing, with a "WRITE_FIRST" semantics.)

Rotation functions
------------------

The following general-purpose functions will be useful in the next
stack implementations.

Rotate a list 'n' elements to the left; 'n' must be non-negative.

> left :: Int -> [a] -> [a]
> left n xs = drop n xs ++ take n xs

Rotate a list 'n' elements to the right; 'n' must be non-negative.

> right :: Int -> [a] -> [a]
> right n = reverse . left n . reverse

A few algebraic properties.

    reverse . right n
  = { definition of right }
    reverse . reverse . left n . reverse
  = { reverse . reverse = id }
    left n . reverse

Observing that

  left n = reverse . right n . reverse

we also have 

  reverse . left n = right n . reverse

Rotate a list 'n' elements to the right; if 'n' is negative the list is
rotated '|n|' elements to the left.

> rotate :: Int -> [a] -> [a]
> rotate n = if n < 0 then left (-n) else right n

(See Memo 26 for circuit-level implementations of these functions.)

Implementation 2
----------------

To avoid the need for a single RAM with 'blockSize' ports, we now use
'blockSize' RAMs, each with a single port.

> data Stack2 a =
>   Stack2 {
>     size2 :: Int
>   , contents2 :: Block (RAM a)
>   }

If the RAMs are numbered 0 to 7 then the RAM numbered 'n' stores the
elements with the following stack indices.  (Note about the term
'stack index': the element at the bottom of the stack has index 0, and
the element at the top has index 'm-1' where 'm' is the number of
elements on the stack.)

  n, n+blockSize, n+2*blockSize, n+3*blockSize, ...

For example, when 'blockSize' = 8:

  RAM   Stack indices of elements stored
  0     0,  8, 16, ...
  1     1,  9, 17, ...
  2     2, 10, 18, ...
  3     3, 11, 19, ...
  4     4, 12, 20, ...
  5     5, 13, 21, ...
  6     6, 14, 22, ...
  7     7, 15, 23, ...

Given the index of an element on the stack, we can determine which RAM
it is stored in.

> lower :: Int -> Int
> lower i = i `mod` blockSize

We can also determine the RAM location where an element is stored.

> upper :: Int -> Int
> upper i = i `div` blockSize

Given the size of the stack, the following function computes, for each
RAM 0 to 7, the address of that RAM's topmost element.

> addresses2 :: Int -> [Int]
> addresses2 n = [upper (wrap (n-i)) | i <- [1..blockSize]]

> instance Stack Stack2 where
>   empty = Stack2 { size2 = 0, contents2 = replicate blockSize ram }
>     where ram = listArray (0, ramCapacity-1) (replicate ramCapacity Nothing)
>
>   size s = size2 s

For example, considering a stack of size 19, the addresses of the
topmost element in RAMs 0 to 7 are 2,2,2,1,1,1,1,1 respectively.
Indexing the RAMs by these addresses gives the stack elements with
indices 16,17,18,11,12,13,14,15.  To obtain these top elements in
order, we rotate them by 'n' positions (here 3) to the left, where 'n'
is the number of the RAM holding of the top element.  More formally,
'n = lower m' where 'm' is the size of the stack.  Finally, we can
reverse this sequence to give the top elements in reverse order, so
that the top stack element comes first.  Recall that rotating left and
then reversing is equivalent to reversing and then rotating right.  We
prefer the latter.

>   tops s = right (lower (size2 s)) (reverse ramOuts)
>     where ramOuts = zipWith (!) (contents2 s) (addresses2 (size2 s))

A similar rotation method is used when updating the stack.  The main
difference is that instead of reversing and rotating the RAM outputs,
we reverse and rotate the RAM inputs.

>   update offset writes s =
>     Stack2 { size2 = newSize, contents2 = zipWith (//) (contents2 s) uss }
>     where
>       newSize = size2 s + offset
>       writes' = reverse (left (lower newSize) writes)
>       updates = zip (addresses2 newSize) writes'
>       uss     = [[u | isJust x] | u@(_,x) <- updates]

> example2 = tops (s5 :: Stack2 Int)

Implementation 3
----------------

The problem with the above implementation is that there are large
logic delays on the input and output busses of the RAMs --- observing
and updating the top stack elements incurs the RAM-access delay plus
the delay of the rotation logic.  One way to reduce this delay is to
store the top elements in registers.  The following is a wrapper
around the above implementation which caches the top 'blockSize'
elements in registers.

The cache is a block of registers, storing the top elements, with the
top element coming first.

> type Cache a = Block (Maybe a)

> data Stack3 a =
>   Stack3 {
>     cache :: Cache a
>   , stack :: Stack2 a
>   }

> instance Stack Stack3 where
>   empty = Stack3 { cache = replicate blockSize Nothing, stack = empty }
>
>   size s = size (stack s)
>
>   tops s = cache s

To update the stack:

  * the defined elements in the write-block are written into the cache
    registers;

  * if 'offset' is positive, then the values of the upper 'offset'
    cache registers are pushed onto the RAM stack;

  * if 'offset' is negative, then the top '|offset|' values on the RAM
    stack are popped and written into the upper '|offset|' cache
    registers;

  * any cache register 'n' not modified by the above three cases is
    updated to contain the value of cache register 'wrap (n+offset)'.

>   update offset writes s =
>     Stack3 {
>       cache = newCache
>     , stack = update offset push (stack s)
>     }
>     where
>       rotated  = rotate offset (cache s)
>       push     = [ if i <= offset then x else Nothing
>                  | (i, x) <- zip [1..blockSize] rotated ]
>       pushMask = [isJust x | x <- writes]
>       popMask  = reverse [(-i) >= offset | i <- [1..blockSize]]
>       ramTops  = rotate offset (tops (stack s))
>       newCache = zipWith5 choose writes pushMask ramTops popMask rotated

> choose push pushBit pop popBit rot
>   | pushBit   = push
>   | popBit    = pop
>   | otherwise = rot

> example3 = tops (s5 :: Stack3 Int)

Fusion
------

If we inline the 2nd implementation into the 3rd, two fusion
possibilities arise.

  1. The list 'ramTops' is defined to be the top RAM elements rotated
     by 'offset' positions.  The values returned by 'tops' in the 2nd
     implementation are rotated right by 'n' positions, where 'n' is
     the size of the stack.  These two rotations may be fused into a single
     rotation by 'n + offset' positions.

  2. The list 'writes'' is computed by left rotating a list 'writes'
     by 'n+offset' positions.  And the list 'writes' is computed by
     right rotating the cache by 'offset' positions.  The left and
     right rotations to some extent cancel each other out; the same
     result can be obtained by simply left rotating the cache by 'n'
     positions.
  
Testing
-------

> test :: Stack s => [(Int, Block (Maybe a))] -> s a
> test = foldl f empty 
>   where f s (offset, writes) = update offset writes s

> block :: [a] -> Block (Maybe a)
> block xs = take blockSize (map Just xs ++ repeat Nothing)

> test1 :: Stack s => s Char
> test1 = test [ (3, block "one")        -- Push 3
>              , (3, block "two")        -- Push 3
>              , (5, block "three")      -- Push 5
>              , (4, block "four")       -- Push 4
>              , (4, block "five")       -- Push 4
>              , (3, block "six")        -- Push 3
>              , (-7, block "")          -- Pop 7
>              , (-5, block "12345")     -- Pop 5
>              , (1, block "ok!")        -- Pop 2, Push 3
>              , (4, block "I am")       -- Push 4
>              , (-1, block "XY")        -- Pop 2, Push 1
>              ]

Final remarks
-------------

The FPGA implementation of the Octostack closely follows the above
code for 'Stack3' - see 'CachingOctostack.hs' in the Reduceron source
code.

For high performance, the user of the stack should compute the
'offset' argument to 'update' with as little delay as possible.  This
is because the 'offset' input is connected directly to the address
busses of the internal block RAMs.  In our experience, this is where
the critical path in a design often lies.

In future, it may be helpful to draw a diagram of this stack design.
A diagram of Implementation 2 is however already available in section
4.3 of [1].

References
----------

[1] Matthew Naylor and Colin Runciman.  Widening the von Neumann
Bottlneck for Graph Reduction using an FPGA.  In Implementation and
Application of Functional Languages (Revised Selected Papers),
Freiberg, Germany, 2007.