INCREMENTAL ARCHIVING OF REDEX TRAILS
ART Memo 3 (Version 1)
Colin Runciman, 15 June 2000, revised 3 October 2000

INTRODUCTION

This memo is about incrementally archiving redex trails to file during
a traced computation. Such archiving is desirable for three main reasons:

1. The size of redex trails is no longer limited by the capacity of heap
   memory.
   
2. Redex trails stored in files can be browsed as often as required without
   repeating the execution of slow trail-building programs.

3. Archived trails should be more prtable than the programs that
   generate them.
   
There may even be a slight gain in speed for the trail-building programs:
taking trails out of heap should greatly reduce GC time in the copying
collector, and this may more than offset the additional time for writing to
file.

IMPLEMENTATION METHODS

There are at least two alternative approaches to the implementation
of archiving:

1. Archive trail structure to file at each garbage collection.

2. Change the trail-building combinators into trail-filing combinators.

This memo concentrates on the first alternative.  If archiving is viewed
as a way to relieve memory pressure, it is natural to incorporate it as
part of garbage collection.  As we shall see, the required mechanisms
can be obtained by adapting the pruning routines already implemented.

The alternative of trail-filing combinators has the apparent advantage
of working at a higher level, and therefore being more portable.  But I
think the combinators would be hard to get right (there are very subtle
issues of laziness that do not sit easily with the need for sequential
I/O) and it would be harder to make them efficient.

USING THE PRUNING MECHANISM

There is already a routine in the garbage collector used for pruning
redex trails.  For a given parameter k, this routine prunes away all parts
of the trail to which the shortest path is longer than k.  Each pruned
section of the trail is replaced by a nullary trail constructor `Pruned'.

A variant of this pruning routine could perform much of the work required
to write trails to an archive file. In outline:

* The routine is called with k=0, so that at each GC all the  trail structure
  built since the last GC is archived.

* All the pruned sections of trail are traversed, writing an archival copy to
  file. 

* Instead of the constructor Pruned, the residual trails for archived sections
  are new constructions `Archive fp' where `fp' is a byte-offset pointer into
  the archive file.

* At the end of the computation, a special final call to the garbage collector
  is made. The trails for any expressions still under evaluation (in the case
  of interruption or failure) are closed with an `Evaluating' construction --
  displayed as undefined by the browser.

FORWARD AND BACKWARD POINTERS

The `Trace' type defined for redex trails is recursive: nodes in the
trail may incorporate links to other nodes.  Backward links (ie. links
from new nodes to older ones) do not pose any problems for archiving.
But forward links do pose a problem: it is hardly attractive to update
links in nodes already archived to file in order to track movements or
updates in heap memory!

Happily, almost all links in redex trails go backwards.  The one
exception is the `Sat' construction representing a saturated application
not yet reduced.  But this construction is transient, rather like
an indirection. When the result of the application it represents is
available, a `Sat' node is cut out of the trail by the garbage collector:
indeed, at each garbage collection, most `Sat' nodes encountered can be
eliminated in this way.

HANDLING UNSHORTABLE SATS

But what about the ones that cannot be eliminated?  These can be handled
by recording them in a filed-Sat table (FST), analogous to the exception
tables used for old-to-new pointers in generational garbage collection.

1. If an archiving collection encounters a trail node `Sat app res' that
   cannot be shorted (because the `res' trail is not yet available) then
   a representation of the Sat node is written to file as part of the
   archived trace.  Let fps be the position of the Sat in the file.

2. In addition, an entry (fps, res) is added to the FST.  

3. All such `res' pointers in the FST must be treated as GC-roots. They
   point to heap-expressions for unevaluated result trails.  If the
   heap-expression is moved, or reduced, the `res' pointer must continue
   to refer to it.
   
4. If a `res' trail becomes evaluated, it will itself be archived, to
   some file position fpr.  Its associated entry in the FST can now be
   flushed to a sat-resolution file (SRF) which records indirections of
   the form fps->fpr.

5. When the computation is over, a post-process should apply the SRF
   as a transaction file to the archived trail as master file.
   
The form in which Sats are written to the archive file in the first
place should allow unresolved Sats to be interpreted appropriately.
Otherwise when the computation ends, any unresolved entries in the
FST would have to be flushed to the SRF as explicit markers (fps->?),
and it would be more awkward to support interrupts and resumptions.

It may be most convenient to have the FST itself allocated in heap,
so that it can grow by need, and attaches to a single extra root.

If the pruning distance used for archiving was great than zero, then
Sats might be more likely to have resolved before they are archived to
file.  But pruning back to zero seems simpler.
  
ARCHIVING PARTIAL TRAILS

Because some functions are trusted and others untrusted, it will actually be
necessary to have two distinct `Archive' constructors. By keeping this
distinction in the residual trail stub, trustedness can be tested without
reading back from the file.  (There should really be two distinct
`Pruned' constructors for similar reasons.)

It is not clear how archiving and k-bounded pruning could be used together to
construct an archive of a pruned trail. But file archiving should make current
uses of k-bounded pruning unnecessary.

AUXILIARY TABLES

Apart from the trail structure itself, other pieces of information are
need as part of a trace:

1. a table of source references;

2. a table of identifiers;

3. a table of strings;

4. program output partitioned into sections with distinct trails;

5. any other initial points giving access into the trail, such as a run-time
   error, or interruption point.

Tables 1--3 are almost static (identifers have one dynamic attribute
indicating whether they are trusted).  It should not be necessary to
write static table entries to file every time a traced program is run.
They can be stored in a separate file to which shared reference is made
by dynamic trace files.  At the moment these tables are added by the
compiler to the end of byte-code files.  The format of the information
for each module allows the following linked structure to be constructed
in C at the start of program execution:

MODULE_<name>:
  char *name; char *sourcefile; impentry *imptable; identry *idtable;
  
The `imptable' for a module is a null-terminated sequence of MODULE_<>
references for each of its imported modules.

The `idtable' is a null-terminated sequence of identifier details:

  Trust/Susp; char *name; char *modulename; SR *srcref;
  
Redex trails are hybrid Haskell/C data structures. A four-attribute
entry in the C identifier table lies behind the Haskell type `NameInfo'
occurring in some constructions of type `Trace'.

The `modulename' is a shared reference to a string in the relevant
MODULE.  A `srcref' points to a one of a null-terminated sequence of
descriptions in a source reference table

  char *srcfile;  int pos;
  
where the `srcfile' name is again a shared reference into the MODULE
description.

As the byte-code itself is not a necessary part of the trace structure,
the information for this linked structure of tables could usefully
be split off into a separate file. This should be kept with program
sources, as source references are among the static identifier and string
attributes.

DYNAMIC INFORMATION

Items 4 and 5, the program output and access points to the trail, are
clearly dynamic. It is necessary to write this dynamic information to
file at run-time.

It is not appropriate to store program output in the same file as the
trail structure.  Information such as the partitioned output could
usefully be held in a format people can read, whereas trail structure
archives would be best represented in a compressed binary format.  Indeed,
the output from a program would be best held exactly as it is produced by
a normal untraced program.  Partitioning and trail reference information
can go in separate files.  This scheme generalises easily to named files.
(We must also remember to address the problem of IO actions that generate
empty output.)

BROWSING ARCHIVED TRAILS

The redex trail browser (rtb) is a Java program that works in conjunction
with a halted  Haskell executable.  There is a stream interface between
the two: requests from the browser are met 1-1 by responses from routines
in the run-time system for the halted Haskell program.  Requests and
responses resemble the S-expressions of LISP and incorporate reference
numbers for nodes.  For example, the browser might issue the request (G
5 3) meaning `get me details of node 5, with information to depth 3'.
Both the run-time system and the browser maintain their own caches of
node information to avoid needless repetition.

When trails are archived the halted executable should no longer be needed. 
There are two alternatives:

1. Write a new program to provide the interface between archive files and
   browser requests. One advantage of this approach is that the browser can
   remain just as it is; another is that if the new program is in Haskell it
   might be possible to support programming over the trail structure.  The
   disadvantage is that yet another program is needed in addition to the
   trail-building executable and the browser.
      
2. Change the browser to access the archive file directly instead of putting
   out requests. The advantage of this approach is its minimal architecture: a
   single program reads and displays trail information.  The disadvantage is
   that the browser is more complex (particularly if it continues to support
   message-passing mode as well) and that it is much harder to support any
   programming over the trail as a Haskell data structure.
   
The first alternative seems preferable for implementation reasons.
The user could still be given the impression of a simple architecture:
`connecting' the browser to an archived trail could automatically
spawn the necessary archive reader.