THE TROUBLE WITH TRUSTING -- PART I: UNREACHABLE NAMES ART Memo 18 (Version 0) Colin Runciman, 4 April 2001 INTRODUCTION The intention of `trusting' some modules, such as the Prelude, is that details from computations within them need not be recorded in a trace. This sounds good, but in practice the current scheme has a number of deficiencies. This memo focuses on the problem of unreachable names -- names from the bodies of trusted functions that still give rise to nodes in the trace, even though they are unreachable in the final graph. WHY ARE THERE UNREACHABLE NAMES? Although the recording of non-constructor applications is suppressed in the bodies of trusted functions, nodes are still generated for constructors and names, because these may form part of the trusted function's result which must be traceable. ARE THERE MANY OF THEM? In short, yes! Here are the reachability statistics for one of the larger trace files we have been working with -- for the computation of the mate program with the Ellerman problem as input. % Reach Number Description Bytes % Space 100.0 1 header 16 0.0 50.0 4517849 TR application nodes 99089474 38.4 37.0 8809048 TR name nodes 114517624 44.4 49.5 638699 TR indirection nodes 5748291 2.2 9.6 857522 TR hidden nodes 4287610 1.7 5.1 1295800 TR type A SAT nodes 6479000 2.5 0.0 5 TR type B SAT nodes 25 0.0 52.2 5049701 TR type C SAT nodes 25248505 9.8 0.0 109 MD nodes 2093 0.0 42.9 502103 NT nodes 2492260 1.0 66.2 1277 SR nodes 11493 0.0 ------------------------------------------ 40.8 21672114 whole trace file 257876391 100.0 Names take up almost half the space but little more than a third of them are reachable. That is, approaching one-third of the space is occupied by unreachable names! Although the generation of names from trusted function bodies does not account for all these unreachable names, there is some evidence that it is a major factor. Because names in trusted function bodies may be incorporated as part of the trusted function's result, unless they occur in a projective context (forming the entire result) their parents are `hidden' nodes with the trusted applications as parents. Hidden nodes are only generated in trusted function bodies, and fewer than 10% of them are reachable. DETECTING UNREACHABLE NAMES If only we could determine which names from a function body can never be incorporated in the function's result then the generation of nodes for these names could be suppressed. Consider a simple example: f x = if x < 0 then x + 1 else x At the time of writing, I believe that fresh names are generated for every application of f for each of the symbols <, 0,+ and 1, although none of them will be reachable from the trace of the result. A. BY DATA-FLOW ANALYSIS AT COMPILE-TIME? Determining the result-incorporation property is a data-flow problem that could be solved by static analys at compile-time. But that could be quite a complicated business. B. BY EXCLUDING APPLIED FUNCTIONS? Olaf has an improved scheme that could eliminate the < and the + on the grounds that they occur only as the function symbols in reducible applications. But what about the 0 and the 1? C. BY CONSIDERING TYPES? With a moment's thought it should be clear that no name in the boolean condition of an if-expression will be incorporated in its value, so there is no need for names corresponding to instances of 0 in our example. Significantly, we can see this just by examining the *type* of if-then-else considered as a function: if-then-else :: Bool -> a -> a -> a No part of the result type is an instance of any part of the Bool argument type. So for conditionals in a trusted context names in the condition argument need not be recorded. This principle extends to the types of all trusted functions. D. BY DISTINGUISHING PRIMITIVES? What about the 1 in our example? We know that this will not occur in the result because + is (I am assuming) a primitive operation, for which results are generated in C-land. E. BY ONCE-ONLY HAND DATA-FLOW ANALYSIS OF STANDARD FUNCTIONS? The commonest trusted functions are those in the prelude -- and standard libraries. Although examples such as take :: Int -> [a] -> [a] yield to the type-inspection approach (so where take is applied in a trusted context, there is no need to record names in the Int argument), other examples such as (\\) :: Eq a => [a] -> [a] -> [a] do not. We know from the abstract specification of (\\) that no part of its second argument should occur in the result, but the type does not show this. It may be worth doing the once-only job of adding simple data-flow annotations for prelude and standard library functions. The annotations could be just sets of argument positions. SUMMARY * Unreachable names occupy up to a third of current .hat files. * Many of these names are unreachable because they correspond to occurrences in trusted function bodies of names that are not part of the trusted function's result. * A combination of comparatively simple techniques could eliminate many of these unreachable names without the development and running costs of compile-time data-flow analysis.