INCREMENTAL ARCHIVING OF REDEX TRAILS ART Memo 3 (Version 0) Colin Runciman, 15 June INTRODUCTION This memo is about incrementally archiving redex trails to file during a traced computation. Such archiving is desirable for two main reasons: 1. The size of redex trails is no longer limited by the capacity of heap memory. 2. Redex trails stored in files can be browsed as often as required without repeating the execution of slow trail-building programs. There may even be a slight gain in speed for the trail-building programs: taking trails out of heap should greatly reduce GC time in the copying collector, and this may more than offset the additional time for writing to file. USING THE PRUNING MECHANISM Since archiving can be seen primarily as a way to relieve memory pressure, it is natural to incorporate it as part of garbage collection. There is already a routine in the garbage collector used for pruning redex trails. For a given parameter k, this routine prunes away all parts of the trail to which the shortest path is longer than k. Each pruned section of the trail is replaced by a nullary trail constructor `Pruned'. A variant of this pruning routine could perform much of the work required to write trails to an archive file. In outline: * The routine is called with k=0, so that at each GC all the trail structure built since the last GC is archived. * All the pruned sections of trail are traversed, writing an archival copy to file. * Instead of the constructor Pruned, the residual trails for archived sections are new constructions `Archive fp' where `fp' is a byte-offset pointer into the archive file. * At the end of the computation, a special final call to the garbage collector is made. The trails for any expressions still under evaluation (in the case of interruption or failure) are closed with an `Evaluating' construction -- displayed as undefined by the browser. FORWARD AND BACKWARD POINTERS The `Trace' type defined for redex trails is recursive: nodes in the trail may incorporate links to other nodes. Backward links from new nodes to older ones do not pose any problems for archiving. But forward links do pose a problem: it is hardly attractive to update links in nodes already archived to file in order to track movements or updates in heap memory! Happily, almost all links in redex trails go backwards. The one exception is the `Sat' construction representing a saturated application not yet reduced. But this construction is transient, rather like an indirection. When the result of the application it represents is available, a `Sat' node is cut out of the trail by the garbage collector: indeed, at each garbage collection, most `Sat' nodes encountered can be eliminated in this way. So in a normal archiving increment, we cannot appropriately archive the trail structure in its entirety. Unshortable `Sat' nodes should be retained in heap memory. However, in the final archiving GC the whole trail must be dealt with, `Sat's and all; for example by recording the `Sat' result as `Evaluating'. ARCHIVING PARTIAL TRAILS Because some functions are trusted and others untrusted, it will actually be necessary to have two distinct `Archive' constructors. By keeping this distinction in the residual trail stub, trustedness can be tested without reading back from the file. (There should really be two distinct `Pruned' constructors for similar reasons.) It is not clear how archiving and k-bounded pruning could be used together to construct an archive of a pruned trail. But file archiving should make current uses of k-bounded pruning unnecessary. AUXILIARY TABLES Apart from the trail structure itself, other pieces of information are need as part of a trace: 1. a table of source references; 2. a table of identifiers; 3. a table of strings; 4. program output partitioned into sections with distinct trails; 5. any other initial points giving access into the trail, such as a run-time error, or interruption point. Tables 1--3 are almost static (identifers have one dynamic attribute indicating whether they are trusted). It should not be necessary to write static table entries to file every time a traced program is run. They can be stored in a separate file to which shared reference is made by dynamic trace files. At the moment these tables are added by the compiler to the end of byte-code files. The format of the information for each module allows the following linked structure to be constructed in C at the start of program execution: MODULE_: char *name; char *sourcefile; impentry *imptable; identry *idtable; The `imptable' for a module is a null-terminated sequence of MODULE_<> references for each of its imported modules. The `idtable' is a null-terminated sequence of identifier details: Trust/Susp; char *name; char *modulename; SR *srcref; Redex trails are hybrid Haskell/C data structures. A four-attribute entry in the C identifier table lies behind the Haskell type `NameInfo' occurring in some constructions of type `Trace'. The `modulename' is a shared reference to a string in the relevant MODULE. A `srcref' points to a one of a null-terminated sequence of descriptions in a source reference table char *srcfile; int pos; where the `srcfile' name is again a shared reference into the MODULE description. As the byte-code itself is not a necessary part of the trace structure, the information for this linked structure of tables could usefully be split off into a separate file. This should be kept with program sources, as source references are among the static identifier and string attributes. Items 4 and 5, the program output and access points to the trail, are clearly dynamic. It is necessary to write this dynamic information to file at run-time, but it may be better not to use the same file as for trail structure. Information such as the partitioned output could usefully be held in a format people can read, whereas trail structure archives would be best represented in a compressed binary format. BROWSING ARCHIVED TRAILS The redex trail browser (rtb) is a Java program that works in conjunction with a halted Haskell executable. There is a stream interface between the two: requests from the browser are met 1-1 by responses from routines in the run-time system for the halted Haskell program. Requests and responses resemble the S-expressions of LISP and incorporate reference numbers for nodes. For example, the browser might issue the request (G 5 3) meaning `get me details of node 5, with information to depth 3'. Both the run-time system and the browser maintain their own caches of node information to avoid needless repetition. When trails are archived the halted executable should no longer be needed. There are two alternatives: 1. Write a new program to provide the interface between archive files and browser requests. One advantage of this approach is that the browser can remain just as it is; another is that if the new program is in Haskell it might be possible to support programming over the trail structure. The disadvantage is that yet another program is needed in addition to the trail-building executable and the browser. 2. Change the browser to access the archive file directly instead of putting out requests. The advantage of this approach is its minimal architecture: a single program reads and displays trail information. The disadvantage is that the browser is more complex (particularly if it continues to support message-passing mode as well) and that it is much harder to support any programming over the trail as a Haskell data structure.