SIGDA, Super Compendium, ISSS 1999, Abstracts

ISSS'99 ABSTRACTS


Session 1: Invited Talks

Chair: N. Dutt

Design of a Set-Top Box System on a Chip [p. 2]

E. Foster

This presentation will review system-level issues associated with integrating the major blocks of a Set-Top Box onto a single die. In addition to the challenges of merging several powerful functions into a single chip, the goal of integration is to yield a composite design that is not only more cost effective but also provides more function than the sum of discrete parts. This is accomplished through consolidated and shared memory, improved system bandwidth and efficiency, and additional inter-macro signals to facilitate improved communication.

On the Rapid Prototyping and Design of a Wireless Communication System on a Chip [p. 3]

B. Kelley

The evolutionary convergence of computing, integrated circuit technology, and advances in wireless communications has led to an explosive growth of personal communication devices and services (PCS). In fact, the dramatic "Moore's Law" shrinkage of IC devices has itself led to an unprecedented ability to place increasingly complex systems on a chip (SoC). In a wireless communication environment, the integration task is made more difficult by the need to integrate RF, mixed signal, and digital systems. Furthermore, the digital system design task generally requires a mapping of heterogeneous stacks of software processes onto a similarly diverse collection of digital signal processors, microprocessors and application-specific integrated circuits. In this presentation, we give an overview of a modern wireless communication device and describe advanced system-level design methodologies utilized for rapid prototyping and design of current and next generation systems.

Session 2: Embedded Tutorial: Java Compilation Technology

Embedded Java: Techniques and Applications [p. 6]

B. Barry, J. Duimovich

Java is an ideal language for developing embedded applications. However, most Java implementations and tools were designed for workstations and have limitations due to that heritage. Special tools are required to support deployment and effect better integration with target hardware. This talk will be in two parts. The first part will provide an overview of pervasive computing with a special focus on embedded Java, and describe typical applications drawn from several different market segments. The second half will delve more deeply into the architecture of an embedded Java runtime and discuss technical issues relating to dynamic compilation, optimization, and deployment.

Panel: System-Level Design: Designers' Wish List vs. Reality

Organizers/Moderators: D. Gajski, R. Bergamaschi
Panelists: M. Franz, G. Hellestrand, A. Horak, J. Kunkel, W. Lee, G. Martin, K. Vissers

Panel Statement [p. 8]

System level design has brought together a number of formidable challenges, such as methodology, software and hardware design and design automation, to name a few. More than ever, the successful design of a system requires all these challenges to be addressed by both the designers and the design automation tools. Designers, better than anyone else, know what the problems are. Design automation companies claim to know how to solve them and have the products to prove it. Is this really true? Are the design automation tools really solving the hard problems or skimming over the real challenges? This panel addresses exactly that by confronting the views of distinguished designers and tools developers. The panelists belong to two teams. The designer team will present the main problems in doing system design including verification, IP use, integration and synthesis among others, and try to show that many of the real problems are not being addressed by current tools. The tools team will explain how the tools are indeed tackling the real problems and how the designers can make the best use out of them. The attendees can expect a very interesting, informative and technical debate. At the end, the audience will be the judge and a verdict will be passed on what the real problems are, which ones can be solved with existing tools, and what needs to be done in the future to address the system design challenges.

Session 3: Invited Talk

Microelectromechanical Systems (MEMS): Miniaturization Beyond Microelectronics [p. 10]

N. Maluf

The concept of a "system-on-a-chip" quickly invokes in our minds integrated microelectronic circuits. It took the semiconductor industry over 30 years to reach this level of integration, putting many millions of transistors on the same chip to perform extremely complex digital functions. Throughout this "system-level" revolution, the core element remained the MOS transistor, and the interface between the electronics and society has changed little. The same microfabrication methods of the electronics industry are now being adapted to design components and systems that integrate multiple physical functions including mechanics, fluid flow, optics, biology, etc., on the same substrate. The net result is systems for complex non-digital applications, and sophisticated interfacing with the "real world." This technology is only beginning to emerge and the level of integration is in its early stages, yet the enabled functionality has already been phenomenal. For example, the integration of a small accelerometer and gyroscope with electronics is now at the center of modern vehicle stability systems. In another example, the DMD™ display contains nearly one million little mirrors to control the intensity of individual pixels. In biology and biochemistry, on-going efforts aim at miniaturizing genetic analysis and diagnostic systems. In this presentation, we will review the fundamentals of microelectromechanical systems. The presentation will also include a brief survey of existing microsystems and provide a peek into the future.

Session 4: Embedded Tutorial

Middleware Techniques and Optimizations for Real-Time, Embedded Systems [p. 12]

D. Schmidt

Due to constraints on footprint, performance, and weight/power consumption, real-time, embedded system software development has historically lagged mainstream software development methodologies. As a result, real-time, embedded software systems are costly to evolve and maintain. Moreover, they are often so specialized that they cannot adapt readily to meet new market opportunities or technology innovations. To further exacerbate matters, a growing class of real-time, embedded systems require end-to-end support for various quality of service (QoS) aspects, such as bandwidth, latency, jitter, and dependability. These applications include telecommunication systems (e.g., call processing and switching), avionics control systems (e.g., operational flight programs for fighter aircraft), and multimedia (e.g., Internet streaming video and wireless PDAs). In addition to requiring support for stringent QoS requirements, these systems are often targeted at highly competitive markets, where deregulation and global competition are motivating the need for increased software productivity and quality. Requirements for increased software productivity and quality motivate the use of Distributed Object Computing (DOC) middleware [1]. Middleware resides between client and server applications and services in complex software systems. The goal of middleware is to integrate reusable software components to decrease the cycle-time and effort required to develop high-quality real-time and embedded applications and services.

Session 5: Real-Time and Low Power System Design

Chair: P. Chou

Event-Driven Power Management of Portable Systems [p. 18]

T. Simunic, G. De Micheli, L. Benini

The policy optimization problem for dynamic power management has received considerable attention in the recent past. We formulate policy optimization as a constrained optimization problem on continuous-time Semi-Markov decision processes (SMDP). SMDPs generalize the stochastic optimization approach based on discrete-time Markov decision processes (DTMDP) presented in the earlier work by relaxing two limiting assumptions. In SMDPs, decisions are made at each event occurrence instead of at each discrete time interval as in DTMDP, thus saving power and giving higher performance. In addition, SMDPs can have general inter-state transition time distributions, allowing for greater generality and accuracy in modeling real-life systems where transition times between power states are not geometrically distributed.

Real-Time Task Scheduling for a Variable Voltage Processor [p. 24]

T. Okuma, T. Ishihara, H. Yasuura

This paper presents a real-time task scheduling technique for a variable voltage processor which can vary its supply voltage dynamically. Using such a processor, running tasks with a low supply voltage leads to drastic power reduction. However, reducing the supply voltage may violate real-time constraints. In this paper, we propose a scheduling technique which simultaneously assigns both CPU time and a supply voltage to each task so as to minimize total energy consumption while satisfying all real-time constraints. Experimental results demonstrate the effectiveness of the proposed technique.
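
The voltage/deadline trade-off in this abstract can be illustrated with a minimal sketch. It is not the authors' scheduling technique: it handles a single task, assumes energy per cycle scales with the square of the supply voltage, and uses a hypothetical table of (voltage, clock frequency) operating points. The names `pick_voltage` and `LEVELS` are illustrative only.

```python
def pick_voltage(cycles, deadline_s, levels):
    """Return (voltage, relative_energy) for the lowest-energy operating
    point that still meets the deadline, or None if no point is feasible.

    levels: list of (supply_voltage_V, clock_frequency_Hz) pairs.
    Assumption: energy is proportional to cycles * V^2 (relative units).
    """
    best = None
    for volt, freq in levels:
        exec_time = cycles / freq
        if exec_time > deadline_s:
            continue  # this operating point would miss the deadline
        energy = cycles * volt ** 2
        if best is None or energy < best[1]:
            best = (volt, energy)
    return best


# Hypothetical operating points, for illustration only.
LEVELS = [(5.0, 50e6), (3.3, 33e6), (2.5, 25e6)]

if __name__ == "__main__":
    # A 1M-cycle task with a 35 ms deadline: 2.5 V is too slow,
    # so 3.3 V is the cheapest feasible choice.
    print(pick_voltage(1_000_000, 0.035, LEVELS))
```

With several tasks sharing one processor, the choice for each task also depends on how much CPU time the others receive, which is why the abstract describes assigning CPU time and supply voltage simultaneously.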

Path-Based Edge Activation for Dynamic Run-Time Scheduling [p. 30]

V. Mooney III

We present a tool that performs real-time analysis and dynamic execution of software tasks in a mixed hardware-software system with a custom run-time scheduler. The tasks in hardware and software have control-flow constraints (precedence and alternative execution), resource constraints, relative timing constraints, and rate constraints. The custom run-time scheduler dynamically executes tasks in different orders, based on the conditional execution path, such that a hard real-time rate constraint can be predictably met. We describe the task modelling, run-time scheduler implementation, and real-time analysis. We introduce the concept of path-based edge activation utilizing conditional edges. We show how our approach fits into an overall tool flow and target architecture. Finally, we conclude with a sample application of the system to a design example.

Session 6: Performance Issues in System Design

Chair: Loganath Ramachandran

Optimized System Synthesis of Complex RT Level Building Blocks from Multirate Dataflow Graphs [p. 38]

J. Horstmannshoff, H. Meyr

In order to cope with the ever increasing complexity of today's application specific integrated circuits, a building block based design methodology is established. The system is composed of high level building blocks of which some are reused from previous designs while others might have been created by behavioral synthesis. In data flow oriented designs, these blocks usually have complex non-matching interface properties, making it necessary to generate additional interfacing and controlling hardware to integrate them into an operable system. In this paper, an RTL-HDL code generation from synchronous data flow representations is introduced that efficiently automates the generation of the required additional hardware. While existing code generation approaches impose strong limitations on the building block interfacing properties, our method enables the integration of components that access their ports periodically with arbitrary patterns. In order to reduce interface register cost, a minimum-area retiming approach, which is known to have polynomial time complexity, is taken to determine optimum building block activation times. The code generation methodology is compared to an existing approach using a simple case study.

RTGEN: An Algorithm for Automatic Generation of Reservation Tables from Architectural Descriptions [p. 44]

P. Grun, A. Halambi, N. Dutt, A. Nicolau

Reservation Tables (RTs) have long been used to detect conflicts between operations that simultaneously access the same architectural resource. Traditionally, these RTs have been specified explicitly by the designer. However, the increasing complexity of modern processors makes the manual specification of RTs cumbersome and error-prone. Furthermore, manual specification of such conflict information is infeasible for supporting rapid architectural exploration. In this paper we present an algorithm to automatically generate RTs from a high-level processor description, with the goal of avoiding manual specification of RTs, resulting in more concise architectural specifications and also supporting faster turn-around time in Design Space Exploration. We demonstrate the utility of our approach on a set of experiments using the TI C6201 VLIW DSP and DLX processor architectures, and a suite of multimedia and scientific applications.
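
To make the role of a reservation table concrete, here is a minimal sketch of the conflict check that RTs support. It does not reproduce RTGEN or its generation from a processor description; the tables below are written by hand, and the operation and resource names (`MUL`, `ADD`, `fetch`, `mult`, `alu`) are hypothetical.

```python
def conflicts(rt_a, rt_b, distance):
    """Return True if operation B, issued `distance` cycles after
    operation A, needs a resource in a cycle where A still holds it.

    rt_a, rt_b: reservation tables as dicts mapping a cycle offset
    (relative to issue) to the set of resources used in that cycle.
    """
    for cycle_b, resources_b in rt_b.items():
        # Cycle `cycle_b` of B coincides with cycle `cycle_b + distance` of A.
        resources_a = rt_a.get(cycle_b + distance, set())
        if resources_a & resources_b:
            return True
    return False


# Hypothetical tables: a multiply that occupies a non-pipelined
# multiplier for two cycles, and a single-cycle add.
MUL = {0: {"fetch"}, 1: {"mult"}, 2: {"mult"}}
ADD = {0: {"fetch"}, 1: {"alu"}}

if __name__ == "__main__":
    print(conflicts(MUL, MUL, 1))  # both want the multiplier in the same cycle
    print(conflicts(MUL, MUL, 2))  # far enough apart
    print(conflicts(MUL, ADD, 1))  # different execution resources
```

A scheduler can call such a check for every candidate issue slot, which is why hand-maintained tables become a bottleneck when the architecture changes often during exploration.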

Pre-fetching for Improved Core Interfacing [p. 51]

R. Lysecky, F. Vahid, R. Patel, T. Givargis

Reuse of cores can reduce design time for systems-on-a-chip. Such reuse is dependent on being able to easily interface a core to any bus. To enable such interfacing, many propose separating a core's interface from its internals. However, this separation can lead to a performance penalty when reading a core's internal registers. We introduce pre-fetching, which is analogous to caching, as a technique to reduce or eliminate this performance penalty, involving a tradeoff with power and size. We describe the pre-fetching technique, classify different types of registers, describe our initial pre-fetching architectures and heuristics for certain classes of registers, and highlight experiments demonstrating the performance improvements and size/power tradeoffs.
Keywords: Cores, system-on-a-chip, interfacing, on-chip bus, intellectual property.

Compressed Code Execution on DSP Architectures [p. 56]

P. Centoducatte, R. Pannain, G. Araujo

Decreasing program size has become an important goal in the design of embedded systems targeted at mass production. This problem has led to a number of efforts aimed at designing processors with shorter instruction formats (e.g. ARM Thumb and MIPS16), or that can execute compressed code (e.g. IBM CodePack PowerPC). Much of this work has been directed towards RISC architectures, though. This paper proposes a solution to the problem of executing compressed code on embedded DSPs. The experimental results reveal an average compression ratio of 75% for typical DSP programs running on the TMS320C25 processor. This number includes the size of the decompression engine. Decompression is performed by a state machine that translates codewords into instruction sequences during program execution. The decompression engine is synthesized using the AMS standard cell library and a 0.6 µm 5 V technology. Gate level simulation of the decompression engine reveals minimum operation frequencies of 150 MHz.
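
The idea of translating codewords into instruction sequences can be sketched in software as a dictionary lookup. This is only an illustration under assumptions: the paper describes a hardware state machine, the dictionary contents and mnemonics below are made up, and the ratio here counts the dictionary storage rather than the size of a decompression engine.

```python
def decompress(codewords, dictionary):
    """Expand each codeword into the instruction sequence it stands for."""
    program = []
    for codeword in codewords:
        program.extend(dictionary[codeword])
    return program


def compression_ratio(codewords, dictionary, codeword_bits, instr_bits):
    """Compressed size divided by original size (smaller is better).

    Assumption: compressed size = codeword stream + dictionary storage.
    """
    original_instrs = sum(len(dictionary[cw]) for cw in codewords)
    dictionary_instrs = sum(len(seq) for seq in dictionary.values())
    compressed_bits = len(codewords) * codeword_bits + dictionary_instrs * instr_bits
    return compressed_bits / (original_instrs * instr_bits)


# Hypothetical dictionary of instruction sequences, for illustration only.
DICTIONARY = {0: ["LAC x", "ADD y"], 1: ["SACL z"]}
STREAM = [0, 1, 0]

if __name__ == "__main__":
    print(decompress(STREAM, DICTIONARY))
    print(compression_ratio(STREAM, DICTIONARY, codeword_bits=8, instr_bits=16))
```

On a program this small the dictionary overhead dominates; the scheme pays off only when sequences repeat often enough to amortize it.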

Session 7: Memory Design for Embedded Systems

Chair: Walid Najjar

Loop Scheduling and Partitions for Hiding Memory Latencies [p. 64]

F. Chen, E. Sha

Partition Scheduling with Prefetching (PSP) is a memory latency hiding technique which combines the loop pipelining technique with data prefetching. In PSP, the iteration space is first divided into regular partitions. Then two parts of the schedule, the ALU part and the memory part, are produced and balanced to produce an overall schedule with high throughput. These two parts are executed simultaneously, and hence the remote memory latencies are overlapped. We study the optimal partition shape and size so that a well balanced overall schedule can be obtained. Experiments on DSP benchmarks show that the proposed methodology consistently produces optimal or near optimal solutions.

Loop Alignment for Memory Accesses Optimization [p. 71]

A. Fraboulet, G. Huard, A. Mignotte

Portable and embedded systems today support increasingly complex applications such as multimedia. These applications and submicron technologies have made power consumption a crucial criterion. We propose new techniques that optimize the behavioral description of an integrated system before hardware/software partitioning (codesign). These transformations are performed on the "for" loops that constitute the main parts of the multimedia code handling arrays. We present in this paper two new (polynomial) techniques for minimizing memory accesses in loop nests by data temporal locality optimization.

A Buffer Merging Technique for Reducing Memory Requirements of Synchronous Dataflow Specifications [p. 78]

P. Murthy, S. Bhattacharyya

Synchronous Dataflow, a subset of dataflow, has proven to be a good match for specifying DSP programs. Because of the limited amount of memory in embedded DSPs, a key problem during software synthesis from SDF specifications is the minimization of the memory used by the target code. We develop a powerful formal technique called buffer merging that attempts to overlay buffers in the SDF graph systematically in order to drastically reduce data buffering requirements. We give a polynomial-time algorithm based on this formalism, and show that code synthesized using this technique results in more than a 60% reduction of the buffering memory consumption compared to existing techniques.
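
For context, the sketch below computes the quantity that buffer merging sets out to reduce: the buffer memory needed when every edge of an SDF graph gets its own buffer. It does not implement buffer merging. It is restricted to a simple chain of actors, assumes a flat schedule in which each actor fires all of its repetitions before the next one starts, and uses hypothetical function names.

```python
from fractions import Fraction
from math import gcd


def chain_repetitions(rates):
    """Smallest positive integer repetition counts for a chain of actors.

    rates: one (tokens_produced, tokens_consumed) pair per edge, in order.
    Solves the balance equations q[i] * produced == q[i + 1] * consumed.
    """
    q = [Fraction(1)]
    for produced, consumed in rates:
        q.append(q[-1] * produced / consumed)
    # Scale the rational solution to the smallest integer solution.
    scale = 1
    for value in q:
        scale = scale * value.denominator // gcd(scale, value.denominator)
    ints = [int(value * scale) for value in q]
    common = 0
    for value in ints:
        common = gcd(common, value)
    return [value // common for value in ints]


def flat_buffer_sizes(rates):
    """Tokens buffered on each edge under the flat schedule described above."""
    q = chain_repetitions(rates)
    return [q[i] * produced for i, (produced, _consumed) in enumerate(rates)]


if __name__ == "__main__":
    rates = [(2, 3), (1, 2)]  # chain A -> B -> C with hypothetical rates
    print(chain_repetitions(rates))
    print(flat_buffer_sizes(rates), sum(flat_buffer_sizes(rates)))
```

Overlaying buffers whose contents are never live at the same time lets the total fall below this per-edge sum, which is the kind of saving the abstract reports.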

Exploration and Synthesis of Dynamic Data Sets in Telecom Network Applications [p. 85]

C. Ykman-Couvreur, J. Lambrecht, D. Verkest, F. Catthoor, H. De Man

We present a new exploration and optimization method to select customized implementations for dynamic data sets, as encountered in telecom network, database and multimedia applications. Our method fits the context of embedded system synthesis for such applications, and makes it possible to further raise the abstraction level of the initial specification, where dynamic data sets can be specified without low-level details. Our method is suited for hardware and software implementations. In this paper, it mainly aims at minimizing the memory power consumption, although it can also be driven by other cost functions such as area or performance. Compared with existing methods, it can save up to 2/3 of the memory power consumption and 3/4 of the memory area.

Session 8: Architectural Synthesis

Chair: Lev Markov

A Graph Theoretic Approach for Design and Synthesis of Multiplierless FIR Filters [p. 94]

K. Muhammad, K. Roy

We present a novel approach which can be used to obtain multiplierless implementations of finite impulse response (FIR) digital filters. The main idea is to reorder filter coefficients such that an implementation based on differential coefficients requires only a few adders. We represent this problem using a graph in which vertices represent the coefficients and edges represent the resources required when the differential coefficient corresponding to the edge is used in a computation. We also present a graph model for an implementation based on second-order coefficient differences. The optimal solution to the coefficient reordering problem is the well known problem of finding the Hamiltonian path of smallest weight in this graph. We use two approaches to find the smallest weight Hamiltonian cycle: a greedy approach and the heuristic algorithm proposed by Lin and Kernighan. The power and potential of this approach are demonstrated by presenting results for large filters (lengths up to > 300) which show that, in general, for 16-bit coefficients, the total number of adders required per coefficient is less than 2. Hence, high performance and/or low power filters can be designed and synthesized using the proposed approach.
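
The coefficient reordering idea can be sketched with a greedy nearest-neighbour walk over the coefficient graph. This is an illustration under assumptions, not the authors' implementation: the edge weight used here, the number of nonzero bits in the absolute coefficient difference, is a stand-in for their resource cost, and the coefficient values are made up.

```python
def edge_cost(a, b):
    """Assumed edge weight: nonzero bits in |a - b|, a rough proxy for
    the adders needed to apply the differential coefficient."""
    return bin(abs(a - b)).count("1")


def greedy_order(coeffs):
    """Greedy path: start at the coefficient with the fewest nonzero
    bits, then repeatedly move to the cheapest unvisited coefficient."""
    remaining = list(coeffs)
    current = min(remaining, key=lambda c: bin(abs(c)).count("1"))
    remaining.remove(current)
    order = [current]
    while remaining:
        nxt = min(remaining, key=lambda c: edge_cost(current, c))
        remaining.remove(nxt)
        order.append(nxt)
        current = nxt
    return order


def path_cost(order):
    """Total assumed cost of visiting the coefficients in this order."""
    return sum(edge_cost(a, b) for a, b in zip(order, order[1:]))


if __name__ == "__main__":
    coeffs = [7, 64, 9, 66]  # hypothetical integer filter coefficients
    order = greedy_order(coeffs)
    print(order, path_cost(order), "vs original order", path_cost(coeffs))
```

A greedy walk gives no optimality guarantee, which is presumably why the abstract also mentions the Lin and Kernighan heuristic as a second approach.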

Efficient Scheduling of DSP Code on Processors with Distributed Register Files [p. 100]

B. Mesman, C. Alba Pinto, K. van Eijk

Code generation methods for digital signal processors are increasingly hampered by the combination of tight timing constraints imposed by the algorithms and the limited capacity of the available register files. Traditional methods that schedule spill code to satisfy storage capacity have difficulty satisfying the timing constraints. The method presented in this paper analyses the combination of limited register file capacity, resource- and timing constraints during scheduling. Value lifetimes are serialized until all capacity constraints are guaranteed to be satisfied after scheduling. Experiments in the FACTS environment show that we efficiently obtain high quality instruction schedules for inner-most loops of DSP algorithms.

Automatic Architectural Synthesis of VLIW and EPIC Processors [p. 107]

S. Aditya, B. Ramakrishna Rau, V. Kathail

This paper describes a mechanism for automatic design and synthesis of very long instruction word (VLIW) processor architectures and their generalization, explicitly parallel instruction computing (EPIC) processor architectures, starting from an abstract specification of their desired functionality. The processor architecture design makes concrete decisions regarding the number and types of functional units, number of read/write ports on register files, the datapath interconnect, the instruction format, its decoding hardware, and the instruction unit datapath. The processor design is then automatically synthesized into a detailed RTL-level structural model in VHDL along with an estimate of its area. The system also generates the corresponding detailed machine description and instruction format description that can be used to re-target a compiler and an assembler respectively. All this is part of an overall design system, called Program-In-Chip-Out (PICO), which has the ability to perform automatic exploration of the architectural design space while customizing the architecture to a given application and making intelligent, quantitative, cost-performance tradeoffs.

Bit-Width Selection for Data-Path Implementations [p. 114]

C. Carreras, J. López, O. Nieto-Taladriz

Specifications of data computations may not necessarily describe the ranges of the intermediate results that can be generated. However, such information is critical to determine the bit-widths of the resources required for a data-path implementation. In this paper, we present a novel approach based on interval computations that provides not only guaranteed range estimates that take into account dependencies between variables, but also estimates of their probability density functions that can be used when some truncation must be performed due to constraints in the specification. Results show that interval-based estimates are obtained in reasonable times and are more accurate than those provided by independent range computation, thus leading to substantial reductions in area and latency of the corresponding data-path implementation.
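
As a point of comparison, the sketch below shows plain independent range propagation, the baseline this abstract says its method improves on, together with the bit-width it implies. It is not the authors' approach: it ignores dependencies between variables and says nothing about probability densities. The class `Interval` and the function `signed_bits` are hypothetical names.

```python
class Interval:
    """Closed integer range [lo, hi], propagated without dependency tracking."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [a * b for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
        return Interval(min(products), max(products))


def signed_bits(iv):
    """Smallest two's-complement width that can hold every value in iv."""
    n = 1
    while not (-(1 << (n - 1)) <= iv.lo and iv.hi <= (1 << (n - 1)) - 1):
        n += 1
    return n


if __name__ == "__main__":
    x = Interval(-8, 7)
    y = Interval(0, 15)
    print(signed_bits(x + y), signed_bits(x * y))
    # x - x is always 0, but independent propagation yields [-15, 15].
    diff = x - x
    print(diff.lo, diff.hi, signed_bits(diff))
```

The `x - x` case shows the overestimation that arises when the same variable is treated as two unrelated ranges, which is the kind of dependency the abstract says its estimates take into account.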

Session 9: System Design Methodologies

Chair: Giovanni De Micheli

Catalyst: A DSIP Design Flow Development in Industry [p. 122]

W. De Rammelaere, K. Eckert, E. Hilkens, T. Lawell, R. McGarity, P. Le Moenner, F. Steininger

The Motorola System on Chip Design Technologies (SoCDT) team aims at providing a system design environment for its customers. The Toulouse branch concentrates on design efforts incorporating DSP functionality. This is referred to as the Catalyst methodology. We found that in current systems very often the software development cycle is longer than that of the silicon development. To ease the software burden, we have changed the silicon architecture and its flow to permit the DSP software to be written in the C language instead of assembler code, as is normally done. The resulting architecture is domain specific; it is smaller, has a reduced design cycle and is simpler to implement because it is tuned to the application software we are providing. This paper describes the methodology which we are developing to create domain specific architectures, and shows one example architecture and the aspects which are critical for industry acceptance.

System Synthesis of Synchronous Multimedia Applications [p. 128]

G. Qu, M. Mesarina, M. Potkonjak

Modern system design is being increasingly driven by applications such as multimedia and wireless sensing and communications, which all have intrinsic quality of service (QoS) requirements, such as throughput, error-rate, and resolution. One of the most crucial QoS guarantees that the system has to provide is the timing constraints among the interacting media (synchronization) and within each media (latency). We have developed the first framework for systems design with timing QoS guarantees, latency and synchronization. In particular, we address how to design system-on-chip with minimal silicon area to meet timing constraints. We propose a two-phase design methodology. In the first phase, we select an architecture which facilitates the needs of synchronous low latency applications well. In the second phase, for a given processor configuration, we use our new scheduler in such a way that storage requirements are minimized. We have developed scheduling algorithms that solve the problem optimally for a-priori specified applications. The algorithms have been implemented and their effectiveness demonstrated on a set of simulated MPEG streams from popular movies.

A Framework for Scheduling and Context Allocation in Reconfigurable Computing [p. 134]

R. Maestre, M. Fernandez, R. Hermida, N. Bagherzadeh

Reconfigurable computing is emerging as a viable design alternative to implement a wide range of computationally intensive applications. The scheduling problem becomes a critical issue in achieving the high performance that these kinds of applications demand. This paper describes the different aspects of the scheduling problem in a reconfigurable architecture. We also propose a general strategy for performing, at compilation time, a scheduling that includes all possible optimizations regarding context (configuration) and data transfers. In particular, we focus on the methodology and mechanisms to solve the context scheduling. Some experimental results are presented to validate our assumptions. Finally, the problem of data transfers is only formulated and will be addressed in future work.