INCREMENTAL ARCHIVING OF REDEX TRAILS ART Memo 3 (Version 1) Colin Runciman, 15 June 2000, revised 3 October 2000 INTRODUCTION This memo is about incrementally archiving redex trails to file during a traced computation. Such archiving is desirable for three main reasons: 1. The size of redex trails is no longer limited by the capacity of heap memory. 2. Redex trails stored in files can be browsed as often as required without repeating the execution of slow trail-building programs. 3. Archived trails should be more prtable than the programs that generate them. There may even be a slight gain in speed for the trail-building programs: taking trails out of heap should greatly reduce GC time in the copying collector, and this may more than offset the additional time for writing to file. IMPLEMENTATION METHODS There are at least two alternative approaches to the implementation of archiving: 1. Archive trail structure to file at each garbage collection. 2. Change the trail-building combinators into trail-filing combinators. This memo concentrates on the first alternative. If archiving is viewed as a way to relieve memory pressure, it is natural to incorporate it as part of garbage collection. As we shall see, the required mechanisms can be obtained by adapting the pruning routines already implemented. The alternative of trail-filing combinators has the apparent advantage of working at a higher level, and therefore being more portable. But I think the combinators would be hard to get right (there are very subtle issues of laziness that do not sit easily with the need for sequential I/O) and it would be harder to make them efficient. USING THE PRUNING MECHANISM There is already a routine in the garbage collector used for pruning redex trails. For a given parameter k, this routine prunes away all parts of the trail to which the shortest path is longer than k. Each pruned section of the trail is replaced by a nullary trail constructor `Pruned'. A variant of this pruning routine could perform much of the work required to write trails to an archive file. In outline: * The routine is called with k=0, so that at each GC all the trail structure built since the last GC is archived. * All the pruned sections of trail are traversed, writing an archival copy to file. * Instead of the constructor Pruned, the residual trails for archived sections are new constructions `Archive fp' where `fp' is a byte-offset pointer into the archive file. * At the end of the computation, a special final call to the garbage collector is made. The trails for any expressions still under evaluation (in the case of interruption or failure) are closed with an `Evaluating' construction -- displayed as undefined by the browser. FORWARD AND BACKWARD POINTERS The `Trace' type defined for redex trails is recursive: nodes in the trail may incorporate links to other nodes. Backward links (ie. links from new nodes to older ones) do not pose any problems for archiving. But forward links do pose a problem: it is hardly attractive to update links in nodes already archived to file in order to track movements or updates in heap memory! Happily, almost all links in redex trails go backwards. The one exception is the `Sat' construction representing a saturated application not yet reduced. But this construction is transient, rather like an indirection. When the result of the application it represents is available, a `Sat' node is cut out of the trail by the garbage collector: indeed, at each garbage collection, most `Sat' nodes encountered can be eliminated in this way. HANDLING UNSHORTABLE SATS But what about the ones that cannot be eliminated? These can be handled by recording them in a filed-Sat table (FST), analogous to the exception tables used for old-to-new pointers in generational garbage collection. 1. If an archiving collection encounters a trail node `Sat app res' that cannot be shorted (because the `res' trail is not yet available) then a representation of the Sat node is written to file as part of the archived trace. Let fps be the position of the Sat in the file. 2. In addition, an entry (fps, res) is added to the FST. 3. All such `res' pointers in the FST must be treated as GC-roots. They point to heap-expressions for unevaluated result trails. If the heap-expression is moved, or reduced, the `res' pointer must continue to refer to it. 4. If a `res' trail becomes evaluated, it will itself be archived, to some file position fpr. Its associated entry in the FST can now be flushed to a sat-resolution file (SRF) which records indirections of the form fps->fpr. 5. When the computation is over, a post-process should apply the SRF as a transaction file to the archived trail as master file. The form in which Sats are written to the archive file in the first place should allow unresolved Sats to be interpreted appropriately. Otherwise when the computation ends, any unresolved entries in the FST would have to be flushed to the SRF as explicit markers (fps->?), and it would be more awkward to support interrupts and resumptions. It may be most convenient to have the FST itself allocated in heap, so that it can grow by need, and attaches to a single extra root. If the pruning distance used for archiving was great than zero, then Sats might be more likely to have resolved before they are archived to file. But pruning back to zero seems simpler. ARCHIVING PARTIAL TRAILS Because some functions are trusted and others untrusted, it will actually be necessary to have two distinct `Archive' constructors. By keeping this distinction in the residual trail stub, trustedness can be tested without reading back from the file. (There should really be two distinct `Pruned' constructors for similar reasons.) It is not clear how archiving and k-bounded pruning could be used together to construct an archive of a pruned trail. But file archiving should make current uses of k-bounded pruning unnecessary. AUXILIARY TABLES Apart from the trail structure itself, other pieces of information are need as part of a trace: 1. a table of source references; 2. a table of identifiers; 3. a table of strings; 4. program output partitioned into sections with distinct trails; 5. any other initial points giving access into the trail, such as a run-time error, or interruption point. Tables 1--3 are almost static (identifers have one dynamic attribute indicating whether they are trusted). It should not be necessary to write static table entries to file every time a traced program is run. They can be stored in a separate file to which shared reference is made by dynamic trace files. At the moment these tables are added by the compiler to the end of byte-code files. The format of the information for each module allows the following linked structure to be constructed in C at the start of program execution: MODULE_: char *name; char *sourcefile; impentry *imptable; identry *idtable; The `imptable' for a module is a null-terminated sequence of MODULE_<> references for each of its imported modules. The `idtable' is a null-terminated sequence of identifier details: Trust/Susp; char *name; char *modulename; SR *srcref; Redex trails are hybrid Haskell/C data structures. A four-attribute entry in the C identifier table lies behind the Haskell type `NameInfo' occurring in some constructions of type `Trace'. The `modulename' is a shared reference to a string in the relevant MODULE. A `srcref' points to a one of a null-terminated sequence of descriptions in a source reference table char *srcfile; int pos; where the `srcfile' name is again a shared reference into the MODULE description. As the byte-code itself is not a necessary part of the trace structure, the information for this linked structure of tables could usefully be split off into a separate file. This should be kept with program sources, as source references are among the static identifier and string attributes. DYNAMIC INFORMATION Items 4 and 5, the program output and access points to the trail, are clearly dynamic. It is necessary to write this dynamic information to file at run-time. It is not appropriate to store program output in the same file as the trail structure. Information such as the partitioned output could usefully be held in a format people can read, whereas trail structure archives would be best represented in a compressed binary format. Indeed, the output from a program would be best held exactly as it is produced by a normal untraced program. Partitioning and trail reference information can go in separate files. This scheme generalises easily to named files. (We must also remember to address the problem of IO actions that generate empty output.) BROWSING ARCHIVED TRAILS The redex trail browser (rtb) is a Java program that works in conjunction with a halted Haskell executable. There is a stream interface between the two: requests from the browser are met 1-1 by responses from routines in the run-time system for the halted Haskell program. Requests and responses resemble the S-expressions of LISP and incorporate reference numbers for nodes. For example, the browser might issue the request (G 5 3) meaning `get me details of node 5, with information to depth 3'. Both the run-time system and the browser maintain their own caches of node information to avoid needless repetition. When trails are archived the halted executable should no longer be needed. There are two alternatives: 1. Write a new program to provide the interface between archive files and browser requests. One advantage of this approach is that the browser can remain just as it is; another is that if the new program is in Haskell it might be possible to support programming over the trail structure. The disadvantage is that yet another program is needed in addition to the trail-building executable and the browser. 2. Change the browser to access the archive file directly instead of putting out requests. The advantage of this approach is its minimal architecture: a single program reads and displays trail information. The disadvantage is that the browser is more complex (particularly if it continues to support message-passing mode as well) and that it is much harder to support any programming over the trail as a Haskell data structure. The first alternative seems preferable for implementation reasons. The user could still be given the impression of a simple architecture: `connecting' the browser to an archived trail could automatically spawn the necessary archive reader.