SIGDA, Super Compendium, CODES 2001, Abstracts

CODES'01 ABSTRACTS


"s" indicates short paper

System Modeling and Specification [p. 4]

The Usage of Stochastic Processes in Embedded System Specifications [p. 5]

Axel Jantsch, Ingo Sander, Wenbiao Wu

We review the use of nondeterminism and identify two different purposes. The descriptive purpose handles uncertainties in the behaviour of existing entities. The constraining purpose is used in specifications to constrain implementations. For the specification of embedded systems, we suggest a stochastic process sigma instead of nondeterminism. It serves mostly the descriptive purpose but can also be used to constrain the system. We carefully distinguish how these concepts are interpreted by the different design activities: simulation, synthesis and verification.

Modeling and Evaluation of Hardware/Software Designs [p. 11]

Neal K. Tibrewala, JoAnn M. Paul, Donald E. Thomas

We introduce the foundation of a system modeling environment targeted at capturing the anticipated interactions of hardware and software behaviors, not just their co-execution. Key to our approach is the separation of external and internal design testbenches. We use a frequency interleaved scheduling foundation, which is ideally suited to our approach because it allows unrestricted hardware and software modeling, a mix of untimed and timed software, and a layered approach that uses software schedulers and protocols to resolve software to resource time budgets. We illustrate our approach by discussing how architectural corner cases that arise from interacting hardware and software behaviors can be a meaningful digital modeling concept. In addition to characterizing the response of a system viewed as a black box, we characterize the response of the design to anticipated design changes. We include examples and simulation results.
Keywords: Hardware/Software Codesign, Computer System Modeling and Simulation, Digital System Design

SystemC: A Homogenous Environment to Test Embedded Systems [p. 17]

Alessandro Fin, Franco Fummi, Maurizio Martignano, Mirko Signoretto

The SystemC language is becoming a new standard in the EDA field, and many designers are starting to use it to model complex systems. SystemC has mainly been adopted to define abstract models of hardware/software components, since they can be easily integrated for rapid prototyping. However, it can also be used to describe modules at a higher level of detail, e.g., RT-level hardware descriptions and assembly software modules. Thus, one can imagine a SystemC-based design flow in which the system description is translated from one abstraction level to the next, always using SystemC representations. As shown in this paper, the adoption of a SystemC-based design flow would be particularly efficient for testing purposes. In fact, it allows the definition of a homogeneous testing procedure, applicable to all design phases, based on the same error model and the same test generation strategy. Moreover, test patterns are applied indifferently to hardware and software components, which makes the proposed testing methodology particularly suitable for embedded systems. Test patterns are generated on the SystemC description modeling the system at one abstraction level; they are then used to validate the translation of the system to a lower abstraction level. New test patterns are then generated for the lower abstraction level to improve the quality of the test set, and this process is iterated for each translation (synthesis) step.
Keywords: functional testing, C++ models, embedded systems verification

"s" Embedded UML: a merger of real-time UML and co-design [p. 23]

Grant Martin, Luciano Lavagno, Jean Louis-Guerin

In this paper, we present a proposal for a UML profile called "Embedded UML". Embedded UML represents a synthesis of various ideas in the real-time UML community and concepts drawn from the hardware-software co-design field. Embedded UML first selects, from among the competing real-time UML proposals, the set of ideas that best allow specification and analysis of mixed HW-SW systems. It then adds the necessary concept of an underlying deployment architecture, which UML currently lacks in complete form, using the notion of an embedded HW-SW "platform". It supplements this with the concept of a "mapping", a platform-dependent refinement mechanism that allows efficient generation of an optimised implementation of the executable specification in both HW and SW. Finally, it provides an approach that supports the development of automated analysis, simulation, synthesis and code generation tool capabilities, which can be provided for design usage even while the Embedded UML standardisation process takes place.
Keywords: UML, embedded systems, real-time systems, HW-SW co-design, function-architecture co-design, platforms

Hardware/Software Partitioning and Design Environments [p. 29]

Hardware/Software Partitioning of embedded system in OCAPI-xl [p. 30]

G. Vanmeerbeeck, P. Schaumont, S. Vernalde, M. Engels, I. Bolsens

The implementation of embedded networked appliances requires a mix of processor cores and HW accelerators on a single chip. When designing such complex and heterogeneous SoCs, the HW/SW partitioning decision conventionally has to be made before the system description is refined. With OCAPI-xl, we developed a methodology in which the partitioning decision can be made anywhere in the design flow, even just before code generation for both HW and SW. This is made possible by a refinable, implementable, architecture-independent system description. The OCAPI-xl model was used to develop a stand-alone networked camera with an onboard GIF engine and network layer.

HW/SW Partitioning of an Embedded Instruction Memory Decompressor [p. 36]

Shlomo Weiss, Shay Beren

We introduce a new PLA-based decoder architecture for random-access run-time decompression of compressed instruction memory in embedded systems. The compression method employs class-based coding. We show that this new decoder architecture can be extended to provide high throughput decompression. The design of the decompressor is based on the following HW/SW tradeoff: decoding is done in hardware to provide high throughput, yet the codebook used for decompression is fully programmable.
Keywords: embedded systems, compressed instruction memory.

"s" MAGELLAN: Multiway Hardware-Software Partitioning and Scheduling for Latency Minimization of Hierarchical Control Dataflow Task Graphs [p. 42]

Karam S. Chatha, Ranga Vemuri

The paper presents MAGELLAN, a heuristic technique for mapping hierarchical control-dataflow task graph specifications onto heterogeneous architecture templates. The architecture can consist of multiple hardware and software processing elements, as specified by the user. The objective of the technique is to minimize the worst-case latency of the task graph subject to the area constraints on the architecture. The technique uses an iterative approach consisting of a closely linked hardware-software partitioner and scheduler. Both the partitioner and the scheduler operate on the task graph in a hierarchical, top-down manner. The technique optimizes deterministic loop constructs by applying clustering, unrolling and pipelining, and it considers speculative execution for conditional constructs. The number of actual hardware/software implementations of a function in the task graph is also optimized. The effectiveness of the technique is demonstrated by a case study of an image compression algorithm.

A Practical Toolbox for System Level Communication Synthesis [p. 48]

Denis Hommais, Frédéric Pétrot, Ivan Augé

This paper presents a practical approach to communication synthesis for hardware/software systems specified as tasks communicating through lossless blocking channels. It relies on a limited set of templates that characterize the way data are exchanged between tasks realized either in software or in hardware. The templates are highly portable because their software part is implemented using the POSIX thread functions, and their hardware part is a hand-crafted synthesizable module with a System VCI interface. These Interface Modules allow simple Virtual Component reuse, since they implement not only protocol compatibility, through the use of the System VCI/OCB standard, but also system-level communication, through semantics widely accepted in the design community.

"s" System Canvas: A New Design Environment for Embedded DSP and Telecommunication Systems [p. 54]

Praveen K. Murthy, Etan G. Cohen, Steve Rowland

We present a new design environment, called System Canvas, targeted at DSP and telecommunication system designs. Our environment uses an easy-to-use block-diagram syntax to specify systems at a very high level of abstraction. The block-diagram syntax is based on formal semantics and uses a number of different models of computation, including cyclo-static dataflow, dynamic dataflow, and a discrete-event model. A key feature of our tool is that the user does not need to be aware of which model is being used; the models can be freely mixed and matched, and a simulation can consist of an arbitrary combination of models. The blocks are written in "C" or "C++", and it is straightforward to write custom blocks and incorporate them into custom libraries. Other key features include the ability to control simulations via language-neutral scripts, and a powerful optimization engine that enables optimization of the system over arbitrarily specified parameters, constraints, and cost functions. Fixed-point analysis capability allows any signal or variable in the system to be set to any type of number system before the simulation proceeds. The tool is available on the Windows NT platform and offers the modern, ubiquitous Windows GUI look and feel.

Architectures for Co-Design [p. 60]

Designing Domain-Specific Processors [p. 61]

Marnix Arnold, Henk Corporaal

We present a semi-automated method for the detection and exploitation of application-domain-specific instruction set extensions for embedded (VLIW) processors. It consists of three steps: first, frequently occurring operation patterns are detected; second, the patterns are grouped and implemented in a number of Special Function Units (SFUs); and third, the custom operations are incorporated into the code generation process. Experimental results show that the SFUs generated and exploited with our methodology can result in architectures that perform up to 30% better than architectures of the same cost without SFUs.
Keywords: Instruction Set Synthesis, Design Space Exploration

RS-FDRA: A Register Sensitive Software Pipelining Algorithm for Embedded VLIW Processors [p. 67]

Cagdas Akturan, Margarida F. Jacome

The paper proposes a novel software-pipelining algorithm, the Register Sensitive Force Directed Retiming Algorithm (RS-FDRA), suitable for optimizing compilers targeting embedded VLIW processors. The key difference between RS-FDRA and previous approaches is that our algorithm can handle code size constraints along with latency and resource constraints. This capability enables the exploration of Pareto "optimal" points with respect to code size and performance. RS-FDRA can also minimize the increase in "register pressure" typically incurred by software pipelining. This ability is critical, since the need to insert spill code may result in significant performance degradation. Extensive experimental results demonstrate that the extended set of optimization goals and constraints supported by RS-FDRA enables a thorough compiler-assisted exploration of trade-offs among performance, code size, and register requirements for time-critical segments of embedded software components.
Keywords: Software pipelining, optimizing compilers, embedded systems, VLIW processors, retiming

A Novel Parallel Deadlock Detection Algorithm and Architecture [p. 73]

Pun Hang Shiu, Yudong Tan, Vincent John Mooney III

A novel deadlock detection algorithm and its hardware implementation are presented in this paper. The hardware deadlock detection algorithm has a run-time complexity of O(min(m,n)), where m and n are the number of processors and resources, respectively. Previous algorithms based on a Resource Allocation Graph have a worst-case run-time complexity of O(m×n) in software. We simulate a realistic example in which the hardware deadlock detection unit is applied, and demonstrate that the hardware implementation of the novel deadlock detection algorithm reduces deadlock detection time by 99.5%. Furthermore, in a realistic example, total execution time is reduced by 68.9%.
Keywords: Deadlock Detection, Parallel Algorithm, Hardware/Software Codesign, Real-time Operating System.
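
For comparison with the hardware unit, the conventional software approach can be sketched as Resource Allocation Graph reduction. The Python below is a minimal sketch under stated assumptions, not the paper's O(min(m,n)) hardware algorithm: it assumes single-unit resources, and the function name detect_deadlock and the holds/requests dictionaries are hypothetical. A process is reduced (assumed to finish and release what it holds) whenever none of the resources it requests is held by another unreduced process; whatever cannot be reduced is deadlocked.

```python
from typing import Dict, Set


def detect_deadlock(holds: Dict[str, Set[str]],
                    requests: Dict[str, Set[str]]) -> Set[str]:
    """Return the set of deadlocked processes (empty set if there is none).

    holds[p]: single-unit resources currently allocated to process p
    requests[p]: resources process p is waiting for
    """
    # Current owner of every allocated resource.
    owner = {r: p for p, rs in holds.items() for r in rs}
    active = set(holds) | set(requests)
    progress = True
    while progress:
        progress = False
        for p in sorted(active):  # sorted() iterates over a copy, so removal is safe
            # p can run to completion if no requested resource is owned by
            # another process that is still active.
            blocked = any(owner.get(r) not in (None, p) for r in requests.get(p, set()))
            if not blocked:
                # Reduce p: it finishes and releases everything it holds.
                for r in holds.get(p, set()):
                    owner.pop(r, None)
                active.remove(p)
                progress = True
    return active
```

With holds = {"P1": {"R1"}, "P2": {"R2"}} and requests = {"P1": {"R2"}, "P2": {"R1"}}, each process waits for the other and the function returns {"P1", "P2"}; if P2 requested nothing, P2 would be reduced first, freeing R2 for P1, and the result would be empty.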

Towards Effective Embedded Processors in Codesigns: Customizable Partitioned Caches [p. 79]

Peter Petrov, Alex Orailoglu

This paper explores an application-specific customization technique for the data cache, one of the foremost area- and power-consuming and performance-determining microarchitectural features of modern embedded processors. The automated methodology we propose for customizing the processor microarchitecture results in increased performance, reduced power consumption and improved determinism of critical system parts, while the fixed design ensures processor standardization. The resulting improvements help to enlarge the role of embedded processors in modern hardware/software codesign techniques by leading to increased processor utilization and reduced hardware cost. A novel methodology for static analysis and a field-reprogrammable implementation of a customizable cache controller that implements a partitioned cache structure are proposed. The simulation results show a significant decrease in miss ratio compared to traditional cache organizations.
Keywords: embedded processors, data cache, reprogrammable customizations

Design Space Exploration and Evaluation Techniques [p. 85]

Development Cost and Size Estimation Starting from High-Level Specifications [p. 86]

William Fornaciari, Fabio Salice, Umberto Bondi, Edi Magini

This paper addresses the problem of estimating the cost and development effort of a system, starting from its complete or partial high-level description. In addition, some modifications for evaluating the cost-effectiveness of reusing VHDL-based designs are presented. The proposed approach has been formalized using a strategy similar to COCOMO analysis, enhanced by a project size prediction methodology based on a VHDL function point metric. The proposed design size estimation methodology has been validated on a significant benchmark, the LEON-1 microprocessor, whose VHDL description is in the public domain.
Categories for CODES'01 reviewers: System development processes, Applications
Keywords: Concurrent engineering, process management, project size estimation, design reuse, VHDL
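
The size-to-effort step of a COCOMO-style estimate can be illustrated with the Basic COCOMO equation E = a·KLOC^b person-months. The sketch below is an assumption-laden illustration, not the paper's calibrated model: it uses the standard Basic COCOMO coefficients for the three project modes and converts a function-point count to size through a hypothetical loc_per_fp factor that would have to be fitted to VHDL projects.

```python
from typing import Dict, Tuple

# Basic COCOMO coefficients (a, b) for effort E = a * KLOC**b person-months.
COCOMO_MODES: Dict[str, Tuple[float, float]] = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def estimate_effort(function_points: float, loc_per_fp: float,
                    mode: str = "embedded") -> float:
    """Estimate development effort in person-months from a function-point count.

    loc_per_fp is a hypothetical calibration factor (source lines per function
    point); it is not taken from the paper and would need to be fitted.
    """
    a, b = COCOMO_MODES[mode]
    kloc = function_points * loc_per_fp / 1000.0
    return a * kloc ** b
```

For instance, estimate_effort(250, 40) treats the design as 10 KLOC in embedded mode and yields about 57 person-months; the exponent b > 1 is what makes effort grow faster than size.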

Exploring Design Space of Parallel Realizations: MPEG-2 Decoder Case Study [p. 92]

Basant K. Dwivedi, Jan Hoogerbrugge, Paul Stravers, M. Balakrishnan

Many applications lend themselves to parallelism at different levels of granularity. We first identify the key issues involved in creating a parallel model of an application. This is done with a view to estimating performance and exploring the "parallel" design space to select a suitable design point. The framework presented provides an opportunity to perform this exploration in both a target-architecture-independent and a target-architecture-dependent manner. We present an MPEG-2 decoder model in YAPI that exposes more parallelism and achieves improved performance. This model has further been mapped onto the SpaceCAKE architecture to study its architectural parameters. Detailed results obtained with YAPI simulation (target-architecture-independent) and TSS simulation (after process-processor binding) on the MPEG-2 decoder application establish the effectiveness of our approach.
Keywords: MPEG-2 Decoder, YAPI, Parallel realization, Process, Thread, FIFO

"s" Source-Level Execution Time Estimation of C Programs [p. 98]

Carlo Brandolese, William Fornaciari, Fabio Salice, Donatella Sciuto

In this paper a comprehensive methodology for software execution time estimation is presented. The methodology is supported by rigorous mathematical models of C statements in terms of elementary operations. The deterministic contribution is combined with a statistical term accounting for all those aspects that cannot be quantified exactly. The methodology has been validated by realizing a complete prototype toolset, used to carry out the experiments.

"s" STARS of MPEG decoder: a case study in worst-case analysis of discrete-event systems [p. 104]

Felice Balarin

STARS (STatic Analysis of Reactive Systems) is a methodology for worst-case analysis of discrete systems. Theoretical foundations of STARS have been laid down [1,2,3], but no implementation has been presented so far. We introduce an implementation of STARS as an extension of YAPI, a programming interface used to model signal processing applications as process networks [7]. We apply STARS to a YAPI model of an MPEG decoder. We show that worst-case bounds computed by STARS are quite close to simulated values (within 15%). We also show that the additional effort required of the designer to build STARS models is very small compared to the effort of building the YAPI simulation model, and that the run times of STARS are negligible compared to the simulation run times.
Keywords: system verification, worst-case analysis, static analysis

"s" Evaluating Register File Size in ASIP Design [p. 109]

Manoj Kumar Jain, Lars Wehmeyer, Stefan Steinke, Peter Marwedel, M. Balakrishnan

Interest in the synthesis of Application Specific Instruction Set Processors (ASIPs) has increased considerably, and a number of methodologies have been proposed for ASIP design. A key step in ASIP synthesis involves deciding architectural features based on application requirements and constraints. In this paper we study the effect of changing the register file size on performance as well as on power and energy consumption. Detailed data are generated and analyzed for a number of application programs. Results indicate that the choice of an appropriate number of registers has a significant impact on performance.
Keywords: Register file, Synthesis, Instruction set, Instruction power model, Register spill, Application specific instruction set processor

Synthesis and Transformation Techniques [p. 115]

Generating Mixed Hardware/Software Systems from SDL Specifications [p. 116]

Frank Slomka, Matthias Dörfel, Ralf Münzenberger

A new approach for the translation of SDL specifications to a mixed hardware/software system is presented. Based on the computational model of communicating extended finite state machines (EFSM), the control flow is separated from the data flow of the SDL process. Hence, for the first time, it is possible to generate a mixed hardware/software implementation of an SDL process. This technique also reduces the complexity for high-level and register-transfer synthesis tools for the hardware parts of the system. The advantage of this methodology is shown by a design example of a wireless communication chip.

Area-Efficient Buffer Binding Based on a Novel Two-Port FIFO Structure [p. 122]

Kyoungseok Rha, Kiyoung Choi

In this paper, we address the problem of minimizing the buffer storage requirement in buffer binding for SDF (Synchronous Dataflow) graphs. First, we propose a new two-port FIFO buffer structure that can be efficiently shared by two producer/consumer pairs. Then we propose a buffer binding algorithm based on this two-port buffer structure for minimizing the buffer size requirement. Experimental results demonstrate a 9.8% to 37.8% improvement in buffer requirement compared to conventional approaches.
Keywords: Buffer binding, buffer sharing, SDF, scheduling
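
As background for the buffer sizes being minimized, the sketch below computes the storage needed by a single, delay-free SDF edge in isolation. It is a minimal Python illustration, not the paper's two-port FIFO structure or binding algorithm: the hypothetical function lazy_schedule derives the repetition counts from the balance equation p·q_A = c·q_B and fires the consumer whenever it has enough tokens. For the edges tried here, the peak occupancy equals p + c - gcd(p, c).

```python
from math import gcd
from typing import List, Tuple


def lazy_schedule(p: int, c: int) -> Tuple[List[str], int]:
    """One period of a two-actor SDF edge A -> B without initial tokens.

    A produces p tokens per firing and B consumes c tokens per firing.
    B fires whenever it can; A fires only when B cannot.
    Returns the firing sequence and the peak number of buffered tokens.
    """
    g = gcd(p, c)
    q_a, q_b = c // g, p // g  # smallest solution of the balance equation p*q_a == c*q_b
    tokens = peak = 0
    fired_a = fired_b = 0
    schedule: List[str] = []
    while fired_a < q_a or fired_b < q_b:
        if fired_b < q_b and tokens >= c:
            tokens -= c
            fired_b += 1
            schedule.append("B")
        else:
            tokens += p
            fired_a += 1
            schedule.append("A")
            peak = max(peak, tokens)
    return schedule, peak
```

For example, lazy_schedule(2, 3) returns the schedule A A B A B with a peak of 4 tokens. Sharing buffers between producer/consumer pairs, as the paper proposes, targets the total of such per-edge allocations.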

Deriving Hard Real-Time Embedded Systems Implementations directly from SDL Specifications [p. 128]

J.M. Alvarez, M. Diaz, L. Llopis, E. Pimentel, J.M. Troya

Object-oriented methodologies together with Formal Description Techniques (FDTs) are a promising way to deal with the increasing complexity of hard real-time embedded systems. However, FDTs do not take into account non-functional aspects such as real-time constraints. Based on a new real-time execution model for the FDT SDL proposed in previous work, a way to derive implementations of hard real-time embedded systems directly from SDL specifications is presented. To achieve this, we propose a middleware that supports this model to organize the execution of the tasks generated from the SDL system specification. Additionally, a worst-case real-time analysis, including the middleware overhead, is presented. Finally, an example of generating the implementation from the SDL specification is developed, together with a performance study.
Keywords: SDL, real-time, scheduler, embedded system

A Trace Transformation Technique for Communication Refinement [p. 134]

Paul Lieverse, Pieter van der Wolf, Ed Deprettere

Models of computation like Kahn and dataflow process networks provide convenient means for modeling signal processing applications. This is partly due to the abstract primitives that these models offer for communication between concurrent processes. However, when mapping an application model onto an architecture, these primitives need to be mapped onto architecture level communication primitives. We present a trace transformation technique that supports a system architect in performing this communication refinement. We discuss the implementation of this technique in a tool for architecture exploration named SPADE and present examples.

"s" A Systematic Approach to Software Peripherals for Embedded Systems [p. 140]

Dimitrios Lioupis, Apostolos Papagiannis, Dionysia Psihogiou

The continued growth of microprocessor performance and the need for better CPU utilization have led to the introduction of the software peripherals approach: by this term we refer to software modules that can successfully emulate peripherals that, until now, were traditionally implemented in hardware. Software implementations offer great flexibility in product design and in functional upgrades, while contributing significantly to optimizing the cost/performance ratio. We focus on embedded applications, where cost and short time to market are the leading issues. In this paper, we study the hardware and software requirements for developing a generic microprocessor with support for software peripherals. Additionally, we present three software peripherals, a Universal Asynchronous Receiver Transmitter (UART), a keypad controller and a dot matrix LCD controller, and we analyze their impact on CPU occupation. Finally, we explore the impact of using a software UART on system power dissipation.
Keywords: Software peripherals, embedded processors, reconfigurable architectures
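
To make the CPU cost of a software UART concrete, the sketch below shows the framing a bit-banged transmitter has to produce, assuming the common 8N1 format (one start bit, eight data bits sent least significant bit first, no parity, one stop bit), together with a rough load estimate. The names uart_frame, uart_decode and cpu_load, and the per-bit cycle count, are hypothetical; the paper's actual peripherals may be organized differently.

```python
from typing import List


def uart_frame(byte: int) -> List[int]:
    """Line levels for one 8N1 frame: start bit, 8 data bits (LSB first), stop bit."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte out of range")
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]  # start bit is low (0), stop bit is high (1)


def uart_decode(bits: List[int]) -> int:
    """Inverse of uart_frame; raises ValueError on a framing error."""
    if len(bits) != 10 or bits[0] != 0 or bits[9] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))


def cpu_load(baud: int, cycles_per_bit: int, cpu_hz: int) -> float:
    """Fraction of CPU time spent if every bit period costs cycles_per_bit cycles."""
    return baud * cycles_per_bit / cpu_hz
```

For example, if each bit period costs a hypothetical 100 cycles, cpu_load(9600, 100, 20_000_000) gives 0.048, i.e. roughly 5% of a 20 MHz CPU spent on transmission alone; this per-bit servicing is the CPU occupation a software peripheral trades against dedicated hardware.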

Scheduling Techniques [p. 146]

A Constructive Algorithm for Memory-Aware Task Assignment and Scheduling [p. 147]

Radoslaw Wlodzimierz Szymanek, Krzysztof Kuchcinski

This paper presents a constructive algorithm for memory-aware task assignment and scheduling, which is part of the prototype system MATAS. The algorithm is well suited for image and video processing applications, which have hard memory constraints as well as constraints on cost, execution time, and resource usage. Our algorithm takes into account code and data memory constraints together with the other constraints. It can create pipelined implementations. The algorithm finds a task assignment, a schedule, and a placement of data and code in memory. Infeasible solutions caused by memory fragmentation are avoided. The experiments show that our memory-aware algorithm reduces memory utilization compared to a greedy scheduling algorithm with a time minimization objective. Moreover, the memory-aware algorithm is able to find a task assignment and schedule when the time minimization algorithm fails. Because MATAS can create pipelined implementations, design throughput is increased.
Keywords: task scheduling, task assignment, memory constraints, constraint programming

A Constraint-based Application Model and Scheduling Techniques for Power-aware Systems [p. 153]

Jinfeng Liu, Pai H. Chou, Nader Bagherzadeh, Fadi Kurdahi

New embedded systems must be power-aware, not just low-power. That is, they must track their power sources and the changing power and performance constraints imposed by the environment. Moreover, they must fully explore and integrate many novel power management techniques. Unfortunately, these techniques are often mutually incompatible because of overspecialized formulations, or they fail to consider system-wide issues. This paper proposes a new graph-based model to integrate novel power management techniques and facilitate design-space exploration of power-aware embedded systems. It captures min/max timing and min/max power constraints on computation and non-computation tasks through a new constraint classification, and enables derivation of flexible system-level schedules. We demonstrate the effectiveness of this model with a power-aware scheduler on real mission-critical applications. Experimental results show that our automated techniques can improve performance and reduce energy cost simultaneously. The application model and scheduling tool presented in this paper form the basis of the IMPACCT system-level framework, which will enable designers to aggressively explore many power-performance tradeoffs with confidence.
Keywords: constraint modeling, power-aware real-time scheduling, embedded systems software, system-level design

"s" Optimal Acyclic Fine-Grain Scheduling with Cache Effects for Embedded and Real Time Systems [p. 159]

Sid-Ahmed-Ali Touati

To sustain the increases in processor performance, embedded and real-time systems need to find the best total schedule time when compiling their applications. The optimal acyclic scheduling problem is a classical challenge that has been formulated using integer programming in many works. In this paper, we give a new formulation of the acyclic instruction scheduling problem under register and resource constraints for multiple-issue processors with cache effects. Given a directed acyclic graph G = (V,E), the complexity of our integer linear programming model is bounded by O(|V|²) variables and O(|E|+|V|²) constraints. This complexity is better than that of existing techniques, which includes a worst-case total schedule time factor.
Keywords: optimal acyclic schedule, registers constraints, resources constraints, cache effects, integer programming

"s" Scheduling-based Code Size Reduction in Processors with Indirect Addressing Mode [p. 165]

Sungtaek Lim, Jihong Kim, Kiyoung Choi

DSPs are typically equipped with indirect addressing modes with auto-increment and auto-decrement, which provide efficient address arithmetic calculations. Such an addressing mode is maximally utilized by careful placement of variables in storage, thereby reducing the number of address arithmetic instructions. Finding a proper placement of variables in storage is called the storage assignment problem, and the result depends heavily on the access sequence of the variables. This paper suggests statement scheduling as a compiler optimization step to generate a better access sequence. Experimental results show an average improvement of 3.6% over naive storage assignment.
Keywords: Code generation, indirect addressing mode, storage assignment, code size reduction
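
The cost that a better access sequence reduces can be counted directly. The Python sketch below assumes a single address register with post-increment/decrement by one, so moving between adjacent storage locations (or staying put) is free, and counts one explicit address-arithmetic instruction for every other transition. The function names and the first-use placement are hypothetical simplifications, not the paper's scheduling step.

```python
from typing import Dict, List


def naive_placement(access_seq: List[str]) -> Dict[str, int]:
    """Place variables at consecutive addresses in order of first use."""
    placement: Dict[str, int] = {}
    for var in access_seq:
        placement.setdefault(var, len(placement))
    return placement


def address_cost(access_seq: List[str], placement: Dict[str, int]) -> int:
    """Count explicit address-arithmetic instructions for one address register
    with +/-1 auto-increment/decrement (the initial register load is not counted)."""
    cost = 0
    for prev, nxt in zip(access_seq, access_seq[1:]):
        if abs(placement[prev] - placement[nxt]) > 1:  # not reachable by auto-modify
            cost += 1
    return cost


if __name__ == "__main__":
    seq1 = ["a", "b", "c", "a", "b"]
    seq2 = ["a", "c", "b", "a", "b"]  # same accesses, different order
    print(address_cost(seq1, naive_placement(seq1)))
    print(address_cost(seq2, naive_placement(seq2)))
```

Both example sequences contain the same accesses, yet under first-use placement the first needs one explicit address instruction and the second needs two. Statement scheduling exploits this dependence on access order, within whatever reordering the data dependences allow.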

"s" Task concurrency management methodology to schedule the MPEG4 IM1 player on a highly parallel processor platform [p. 170]

Chun Wong, Paul Marchal, Peng Yang, Aggeliki Prayati, Francky Catthoor, Rudy Lauwereins, Diederik Verkest, Hugo De Man

This paper addresses the concurrent task management of complex multimedia systems, such as the MPEG4 IM1 player, with emphasis on how to derive energy-cost vs. time-budget curves through task scheduling on a multiprocessor platform. Starting from the original "standard" specification, we extract the concurrency originally hidden by implementation decisions into a "grey-box" model. We then apply two high-level transformations to this model to improve the task-level concurrency. Finally, by scheduling the transformed task graph, we derive energy-cost vs. time-budget curves. These curves can be used to make globally optimized design decisions when combining subsystems into one complete system, or by a dynamic scheduler. The results on the MPEG4 IM1 player confirm the validity of our assumptions and the usefulness of our approach.
Keywords: concurrency, scheduling, MPEG-4, embedded system, cost-efficiency

Parameterized System Design and Simulation Approaches [p. 176]

Parameterized System Design Based on Genetic Algorithms [p. 177]

Giuseppe Ascia, Vincenzo Catania, Maurizio Palesi

A recent reduction in time to market has led to the development of a new approach to IP-based design, in which a highly parametric pre-designed system-on-a-chip is configured according to the application it will have to execute. The greatest problem in this area is exploring the range of possible system configurations in search of the optimal configuration for a given system. There are, in fact, a number of parameters involved (bus sizes, cache configurations, software algorithms, etc.), each of which has a great impact on design constraints such as area, power and performance. An exhaustive analysis of all possible configurations is thus computationally infeasible. In this paper we propose using genetic algorithms to determine the optimal configuration for a highly parametric system. The approach is applied to the search for the optimal configuration (in terms of area, power and mean access time) of a memory hierarchy involved in a given application.
Keywords: Parameterised Systems, Genetic algorithms, Exploration of system configurations
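
A genetic search over a discrete configuration space can be sketched as follows. This is a minimal single-objective illustration in Python, not the authors' implementation: the parameter space, the toy_cost function standing in for area, power and mean-access-time estimators, and its weighting are all hypothetical, and the operators (binary tournament selection, uniform crossover, per-gene mutation, elitism) are one common choice among many.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Tuple[int, ...]  # one index per parameter in SPACE

# Hypothetical memory-hierarchy parameters (illustrative values only).
SPACE: Dict[str, List[int]] = {
    "cache_kb": [1, 2, 4, 8, 16],
    "line_bytes": [8, 16, 32, 64],
    "assoc": [1, 2, 4, 8],
}
KEYS = list(SPACE)


def toy_cost(cfg: Config) -> float:
    """Hypothetical scalar stand-in for weighted area/power/access-time estimates."""
    size, line, assoc = (SPACE[k][i] for k, i in zip(KEYS, cfg))
    area = size * (1 + 0.05 * assoc)
    access_time = 64.0 / (size * assoc ** 0.5) + line / 16.0
    return area + 10.0 * access_time


def tournament(pop: List[Config], cost: Callable[[Config], float],
               rng: random.Random) -> Config:
    """Binary tournament selection: the cheaper of two random individuals wins."""
    a, b = rng.choice(pop), rng.choice(pop)
    return a if cost(a) <= cost(b) else b


def genetic_search(cost: Callable[[Config], float], pop_size: int = 20,
                   generations: int = 30, p_mut: float = 0.1, seed: int = 0) -> Config:
    rng = random.Random(seed)  # fixed seed makes runs reproducible
    pop = [tuple(rng.randrange(len(SPACE[k])) for k in KEYS) for _ in range(pop_size)]
    best = min(pop, key=cost)
    for _ in range(generations):
        children = [best]  # elitism: the best configuration so far survives
        while len(children) < pop_size:
            p1, p2 = tournament(pop, cost, rng), tournament(pop, cost, rng)
            # Uniform crossover: each gene comes from either parent.
            child = [rng.choice(genes) for genes in zip(p1, p2)]
            # Mutation: occasionally replace a gene with a random legal value.
            for i, k in enumerate(KEYS):
                if rng.random() < p_mut:
                    child[i] = rng.randrange(len(SPACE[k]))
            children.append(tuple(child))
        pop = children
        best = min(pop, key=cost)
    return best


if __name__ == "__main__":
    best = genetic_search(toy_cost)
    print({k: SPACE[k][i] for k, i in zip(KEYS, best)}, toy_cost(best))
```

Decoding the result with {k: SPACE[k][i] for k, i in zip(KEYS, best)} yields the chosen cache size, line size and associativity. The toy space has only 80 points and could be enumerated; a genetic search pays off when the product of parameter ranges grows too large for exhaustive analysis, as the abstract argues.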

"s" Minimizing System Modification in an Incremental Design Approach [p. 183]

Paul Pop, Petru Eles, Traian Pop, Zebo Peng

In this paper we present an approach to mapping and scheduling of distributed embedded systems for hard real-time applications, aiming at minimizing the system modification cost. We consider an incremental design process that starts from an already existing system running a set of applications. Our goal is to implement new functionality so that the already running applications are disturbed as little as possible and there is a good chance that new functionality can later be added easily to the resulting system. The mapping and scheduling problems are considered in the context of a realistic communication model based on a TDMA protocol.
Keywords: design space exploration, design reuse, distributed real-time systems, process mapping and scheduling, methodology.

"s" High-level architectural co-simulation using Esterel and C [p. 189]

Andre Chatelain, Yves Mathys, Giovanni Placido, Alberto La Rosa, Luciano Lavagno

This paper introduces an architectural simulation environment, aimed at defining the best SOC architecture for complex system-level applications. The application is modeled using an abstract Timing Modeling Language, that describes the requests (e.g., memory accesses, I/Os, etc.) that the application makes to the architecture. The abstract architecture is modeled at the cycle-accurate level using a mixture of Esterel (a synchronous language) and C. We discuss the results of the application of this tool to a GSM/GPRS application, including a dramatic speed-up of the architectural exploration phase.

"s" A Generic Wrapper Architecture for Multi-Processor SoC Cosimulation and Design [p. 195]

Sungjoo Yoo, Gabriela Nicolescu, Damien Lyonnard, Amer Baghdadi, Ahmed A. Jerraya

In communication refinement with multiple communication protocols and abstraction levels, the system specification is composed of components that are heterogeneous in terms of communication protocol and abstraction level. To adapt each heterogeneous component to the rest of the system, we present a generic wrapper architecture that can adapt different protocols, different abstraction levels, or both. In this paper, we explain in detail how the generic wrapper architecture is applied to mixed-level cosimulation. As a preliminary experiment, we applied it to the mixed-level cosimulation of an IS-95 CDMA cellular phone system.

"s" The TACO Protocol Processor Simulation Environment [p. 201]

Seppo Virtanen, Johan Lilius

Network hardware design is becoming increasingly challenging because ever greater demands are placed on network bandwidth and throughput, and on the speed with which new devices can be brought to market. With current standard techniques (general-purpose microprocessors, ASICs) these goals are difficult to reach simultaneously. One solution that has recently attracted interest is the design of programmable processors with network-optimized hardware, that is, network or protocol processors. In this paper a simulation framework for a family of TTA (transport triggered architecture) protocol processor architectures is proposed. The protocol processors consist of a number of buses with functional units that encapsulate protocol-specific operations. The TACO protocol processor simulator is a C++ framework based on SystemC. Functional units are created as C++ classes, which makes it easy to experiment with different processor configurations and evaluate their performance.
Keywords: microprocessor, protocol, simulation, codesign
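In a transport triggered architecture, a program consists of data moves between functional-unit ports, and writing a unit's trigger port starts its operation. The following Python sketch mimics that idea; the class names, port names and the operand-before-trigger sequencing rule are assumptions for illustration, and the sketch is unrelated to the actual C++/SystemC code of the TACO simulator.

```python
import operator


class FunctionalUnit:
    """Hypothetical TTA functional unit: an operand port, a trigger port and a result
    register. Writing the trigger port fires the operation."""

    def __init__(self, name, op):
        self.name = name
        self.op = op          # callable(operand, trigger_value) -> result
        self.operand = 0
        self.result = 0

    def write(self, port, value):
        if port == "operand":
            self.operand = value
        elif port == "trigger":
            self.result = self.op(self.operand, value)
        else:
            raise ValueError(f"unknown port {port!r}")


class Processor:
    """Hypothetical processor: FUs connected by `num_buses` buses, one move per bus."""

    def __init__(self, num_buses, fus):
        self.num_buses = num_buses
        self.fus = {fu.name: fu for fu in fus}

    def read(self, src):
        # A source is an immediate int or the name of an FU whose result is read.
        return src if isinstance(src, int) else self.fus[src].result

    def execute(self, moves):
        """Execute one instruction: a list of (source, fu_name, port) moves."""
        if len(moves) > self.num_buses:
            raise ValueError("more moves than buses in one instruction")
        values = [self.read(src) for src, _, _ in moves]  # all buses read first
        # Operand writes land before trigger writes, so a trigger in the same
        # instruction sees the freshly moved operand.
        ordered = sorted(zip(values, moves), key=lambda vm: vm[1][2] == "trigger")
        for value, (_, fu, port) in ordered:
            self.fus[fu].write(port, value)


cpu = Processor(num_buses=2, fus=[FunctionalUnit("add", operator.add),
                                  FunctionalUnit("xor", operator.xor)])
cpu.execute([(3, "add", "operand"), (4, "add", "trigger")])      # add.result = 3 + 4
cpu.execute([("add", "xor", "operand"), (1, "xor", "trigger")])  # xor.result = 7 ^ 1
print(cpu.fus["xor"].result)  # 6
```

Changing `num_buses` or the list of functional units is the rough Python analogue of the configuration experiments the abstract describes.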

Code Generation and Software Issues [p. 207]

Formal Synthesis and Code Generation of Embedded Real-Time Software [p. 208]

Pao-Ann Hsiung

Due to rapidly increasing system complexity, shortening time-to-market, and growing demand for hard real-time systems, formal methods are becoming indispensable in the synthesis of embedded systems, which must satisfy stringent temporal, memory, and environment constraints. There is a general lack of practical formal methods that can synthesize complex embedded real-time software (ERTS). In this work, a formal method based on Time Free-Choice Petri Nets (TFCPN) is proposed for ERTS synthesis. The synthesis method employs quasi-static data scheduling to satisfy limited embedded memory requirements and dynamic real-time scheduling to satisfy hard real-time constraints. Software code is then generated from a set of quasi-statically and dynamically scheduled TFCPNs. Finally, an application example illustrates the feasibility of the proposed TFCPN-based formal method for ERTS synthesis.
Keywords: Embedded real-time software, Petri Nets, scheduling, code generation
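As its name suggests, a TFCPN adds timing to a free-choice Petri net. As background, the sketch below shows only the untimed part: the classic free-choice structural check and token-game firing. The net, its transition names and the dict-based encoding are hypothetical, and the paper's quasi-static and real-time scheduling steps are not reproduced.

```python
def is_free_choice(pre):
    """pre: dict transition -> dict place -> arc weight (input arcs).
    Classic free-choice condition: for every arc (p, t), either t is the only
    consumer of p or p is the only input place of t."""
    for t, inputs in pre.items():
        for p in inputs:
            consumers = [u for u, ins in pre.items() if p in ins]
            if len(consumers) > 1 and len(inputs) > 1:
                return False
    return True


def enabled(marking, pre, t):
    return all(marking.get(p, 0) >= w for p, w in pre[t].items())


def fire(marking, pre, post, t):
    """Return the marking reached by firing t (untimed semantics)."""
    if not enabled(marking, pre, t):
        raise ValueError(f"transition {t!r} is not enabled")
    m = dict(marking)
    for p, w in pre[t].items():
        m[p] -= w
    for p, w in post[t].items():
        m[p] = m.get(p, 0) + w
    return m


# Hypothetical net: a source transition feeds a data-dependent choice between two branches.
PRE = {"read": {}, "fast": {"c": 1}, "slow": {"c": 1}}
POST = {"read": {"c": 1}, "fast": {"out": 1}, "slow": {"out": 1}}

m = fire({}, PRE, POST, "read")
m = fire(m, PRE, POST, "fast")
print(is_free_choice(PRE), m)  # True {'c': 0, 'out': 1}
```

The choice place `c` is exactly the kind of data-dependent branch that quasi-static scheduling must handle: the schedule has to stay valid whichever of `fast` or `slow` fires.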

Whole program compilation for embedded software: the ADSL experiment [p. 214]

Johan Cockx

The increasing complexity and decreasing time-to-market of embedded software force designers to write more modular and reusable code, for example using object-oriented techniques and languages such as C++. The resulting memory and runtime overhead cannot be removed by traditional optimizing compilers; a global, whole-program analysis is required. To evaluate the potential of whole-program optimization techniques, we have manually optimized the embedded software of a commercial ADSL modem. Using only techniques that can be automated, we achieved a memory footprint reduction of nearly 60%. We conclude that consistent and aggressive use of whole-system optimization techniques is feasible and worthwhile, and that implementing such techniques in a compiler for embedded software will allow software designers to write more modular and reusable code without incurring the associated implementation overhead.
Keywords: Whole program compilation, embedded software, interprocedural optimization, C++

Compiler-Directed Selection of Dynamic Memory Layouts [p. 219]

Mahmut Taylan Kandemir, Ismail Kadayif

Compiler technology is becoming a key component in the design of embedded systems, mostly due to the increasing role of software in the design process. Meeting system-level objectives usually requires flexible and retargetable compiler optimizations that can be ported across a wide variety of architectures. In particular, source-level compiler optimizations aimed at increasing the locality of data accesses are expected to improve the quality of the generated code. Previous compiler-based approaches to improving locality have mainly focused on determining optimal memory layouts that remain in effect for the entire execution of an application. For large embedded codes, however, such static layouts may be insufficient to obtain acceptable performance. Selecting memory layouts that change dynamically over the course of a program's execution adds another dimension to data locality optimization. This paper presents a technique that automatically determines which layouts are most beneficial over specific regions of a program while taking into account the added overhead of dynamic (runtime) layout changes. Results obtained on two benchmark codes show that such a dynamic approach brings significant benefits over a static state-of-the-art technique.
Keywords: Software Compilation, Data Locality, Memory Layout Optimizations, Array Reuse, Data Dependence.
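The trade-off between per-region benefit and layout-change overhead can be framed as a shortest path over program regions. The dynamic-programming sketch below assumes hypothetical per-region cost estimates and a hypothetical layout-conversion cost table (row-major versus column-major); it is an illustrative formulation, not the authors' technique.

```python
def choose_layouts(region_costs, switch_cost):
    """region_costs: list (one entry per program region, in execution order) of dicts
    layout -> estimated cost of the region under that layout.
    switch_cost[a][b]: estimated cost of converting the data from layout a to layout b.
    Returns (total_cost, layout_sequence) minimizing region costs plus switching costs."""
    # best[l] = (cheapest cost of the regions so far ending in layout l, layout path)
    best = {l: (c, [l]) for l, c in region_costs[0].items()}
    for costs in region_costs[1:]:
        new_best = {}
        for l, c in costs.items():
            prev_cost, prev_path = min(
                ((pc + switch_cost[p][l], path) for p, (pc, path) in best.items()),
                key=lambda item: item[0])
            new_best[l] = (prev_cost + c, prev_path + [l])
        best = new_best
    return min(best.values(), key=lambda item: item[0])


SWITCH = {"row": {"row": 0, "col": 15}, "col": {"row": 15, "col": 0}}
REGIONS = [{"row": 10, "col": 50}, {"row": 60, "col": 10}, {"row": 10, "col": 50}]
print(choose_layouts(REGIONS, SWITCH))  # (60, ['row', 'col', 'row'])
```

With cheap conversions the dynamic sequence (cost 60) beats the best static layout (row-major throughout, cost 80); raising the conversion cost to 100 makes the static layout win, which is why the overhead of runtime layout changes has to be part of the decision.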

"s" Logic Optimization and Code Generation for Embedded Control Applications [p. 225]

Yunjian Jiang, Robert Brayton

We address software optimization for embedded control systems. The Esterel language is used as the front-end specification, and the Esterel compiler v6 is used to partition the design into control circuit and data path; the resulting intermediate representation of the design is a control-data network. This paper emphasizes optimization of the control circuit portion and code generation from the logic network. The new control-data network representation has four types of nodes: control, multiplexer, predicate and data expression. The control portion is a multi-valued logic network, which we optimize with MVSIS, a multi-valued logic network optimization package. MVSIS includes algebraic methods for multi-valued algebraic division, factorization and decomposition, as well as logic simplification methods based on observability don't cares. We have developed methods to evaluate a control-data network based on both an MDD and a sum-of-products (SOP) representation of the multi-valued logic functions. The MDD-based approach uses multi-valued intermediate variables and generates code according to the internal BDD structure. The size of the SOP-based code is proportional to the number of cubes in the logic network. Preliminary results compare the two approaches and evaluate the effectiveness of the optimization.
Keywords: Esterel, Logic optimization, MDD, Code generation, Multiple-valued.
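In a multi-valued sum-of-products form, each cube restricts every variable to a set of allowed values, and the function is true if any cube matches. The sketch below, using a hypothetical dict-based cube encoding, evaluates such a function and emits a C-style expression with one term per cube, which shows why SOP-based code grows with the number of cubes; it does not reproduce MVSIS or the paper's code generator.

```python
def eval_mv_sop(cubes, assignment):
    """cubes: list of dicts var -> set of allowed values (a missing var is a don't care).
    assignment: dict var -> value. True if at least one cube contains the assignment."""
    return any(all(assignment[v] in allowed for v, allowed in cube.items())
               for cube in cubes)


def sop_to_c(cubes):
    """Emit a C expression with one parenthesized term per cube."""
    terms = []
    for cube in cubes:
        if any(not allowed for allowed in cube.values()):
            continue  # an empty literal makes the cube unsatisfiable
        literals = [" || ".join(f"{v} == {val}" for val in sorted(allowed))
                    for v, allowed in sorted(cube.items())]
        terms.append(" && ".join(f"({lit})" for lit in literals) or "1")
    return " || ".join(f"({t})" for t in terms) or "0"


# Hypothetical function of multi-valued variables x, y, z.
F = [{"x": {0, 2}, "y": {1}}, {"z": {3}}]
print(sop_to_c(F))  # ((x == 0 || x == 2) && (y == 1)) || ((z == 3))
print(eval_mv_sop(F, {"x": 2, "y": 1, "z": 0}))  # True
```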

"s" Empirical Comparison of Software-Based Error Detection and Correction Techniques for Embedded Systems [p. 230]

R.H.L. Ong, M.J. Pont

"Function Tokens" and "NOP Fills" are two methods proposed by various authors to deal with Instruction Pointer corruption in microcontrollers, especially in the presence of high electromagnetic interference levels. An empirical analysis to assess and compare these two techniques is presented in this paper. Two main conclusions are drawn: [1] NOP Fills are a powerful technique for improving the reliability of embedded applications in the presence of EMI, and [2] the use of Function Tokens can lead to a reduction in overall system reliability.
Keywords: Instruction Pointer Corruption, Electromagnetic Interference, EMI, Function Token, NOP Fill, Software-based Error Detection Techniques, Embedded systems

Low Power Design [p. 236]

Dynamic I/O Power Management for Hard Real-time Systems [p. 237]

Vishnu Swaminathan, Krishnendu Chakrabarty, S. S. Iyengar

Power consumption is an important design parameter for embedded and portable systems. Software-controlled (or dynamic) power management (DPM) has recently emerged as an attractive alternative to inflexible hardware solutions. DPM for hard real-time systems has received relatively little attention; in particular, energy-driven I/O device scheduling for real-time systems has not been considered before. We present the first online DPM algorithm for hard real-time systems, which we call the Low Energy Device Scheduler (LEDES). LEDES takes as input a predetermined task schedule and a device-usage list for each task, and generates a sequence of sleep/working states for each device. It guarantees that real-time constraints are not violated and minimizes the energy consumed by the I/O devices used by the task set. LEDES is energy-optimal under the constraint that the start times of the tasks are fixed. We present a case study showing that LEDES can reduce energy consumption by almost 50%.
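The basic trade-off behind device sleep scheduling is that a shutdown only pays off when an idle gap is long enough to absorb the transition delay and to save more energy than the transitions cost. The sketch below applies that break-even test offline to a fixed usage schedule; the `Device` parameters are hypothetical, and this is a simplified illustration rather than the LEDES algorithm itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    p_work: float  # power while in the working state (on, possibly idle)
    p_sleep: float  # power while asleep
    t_sw: float    # total time of a power-down plus power-up transition pair
    e_sw: float    # total energy of a power-down plus power-up transition pair


def plan_sleep(device, uses):
    """uses: sorted, non-overlapping (start, end) intervals in which the device is needed,
    fixed by a precomputed task schedule. Returns the idle gaps (gap_start, gap_end) in
    which the device should sleep: the gap must be long enough to complete both
    transitions before the next use, and sleeping must cost less energy than staying on."""
    plan = []
    for (_, end), (next_start, _) in zip(uses, uses[1:]):
        gap = next_start - end
        if gap < device.t_sw:
            continue  # could not wake up in time, so stay on to keep the deadline safe
        e_on = device.p_work * gap
        e_sleep = device.e_sw + device.p_sleep * (gap - device.t_sw)
        if e_sleep < e_on:
            plan.append((end, next_start))
    return plan


dev = Device(p_work=1.0, p_sleep=0.1, t_sw=2.0, e_sw=3.0)
print(plan_sleep(dev, [(0, 5), (6, 10), (20, 25), (28, 30)]))  # [(10, 20)]
```

The 1-unit gap is too short to wake up in time, and the 3-unit gap would cost 3.1 energy units asleep versus 3.0 awake, so only the 10-unit gap is used for sleeping.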

"s" Hybrid Global/Local Search Strategies for Dynamic Voltage Scaling in Embedded Multiprocessors [p. 243]

Neal K. Bambha, Shuvra S. Bhattacharyya, Jürgen Teich, Eckart Zitzler

In this paper, we explore a hybrid global/local search optimization framework for dynamic voltage scaling in embedded multiprocessor systems. The problem is to find, for a multiprocessor system whose processors can dynamically vary their core voltages, the optimum voltage levels for all tasks so as to minimize average power consumption under a given performance constraint. An effective local search approach for static voltage scaling based on the concept of a period graph has been demonstrated in [1]. To use it in an optimization problem, the period graph must be integrated into a global search algorithm. Simulated heating, a general optimization framework developed in [19], is an efficient method for precisely this purpose of integrating local search into global search algorithms. However, little is known about how to allocate computational (compile-time) resources between global search and local search in hybrid algorithms such as those coordinated by simulated heating. We therefore explore various hybrid search management strategies for power optimization within the simulated heating framework. We demonstrate that careful search management yields significant power consumption improvements over ad hoc integration of global and local search, and we explore alternative approaches to hybrid search management for dynamic voltage scaling.
Keywords: simulated heating, dynamic voltage scaling

"s" Processor Frequency Setting for Energy Minimization of Streaming Multimedia Applications [p. 249]

Andrea Acquaviva, Luca Benini, Bruno Ricco

In this paper, we describe a software-controlled approach for adaptively minimizing energy in embedded systems for real-time multimedia processing. Energy is optimized by clock speed setting: the software controller dynamically adjusts processor clock speed to the frame rate requirements of the incoming multimedia stream. The speed-setting policy is based on a system model that correlates clock speed with best-case, average-case and worst-case sustainable frame rate, accounting for data-dependency in multimedia streams. Experiments on an MP3 decoding application show that computational energy can be drastically reduced with respect to fixed-frequency operation.
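A speed-setting policy of this kind can be sketched as choosing the lowest available clock frequency whose sustainable frame rate still meets the stream's requirement. The frequency table and cycles-per-frame figure below are hypothetical; using a worst-case cycle count gives a conservative choice, while an average-case count trades some risk of missed frames for a lower speed. The 44100/1152 target is the frame rate of an MPEG-1 Layer III stream sampled at 44.1 kHz.

```python
def pick_frequency(freqs_hz, cycles_per_frame, target_fps):
    """Lowest available frequency whose sustainable frame rate
    (freq / cycles_per_frame) meets target_fps; the fastest one if none does."""
    for f in sorted(freqs_hz):
        if f / cycles_per_frame >= target_fps:
            return f
    return max(freqs_hz)


FREQS_HZ = [59_000_000, 103_000_000, 133_000_000, 206_000_000]  # hypothetical table
MP3_FPS = 44_100 / 1152  # about 38.28 frames/s for 44.1 kHz MPEG-1 Layer III
print(pick_frequency(FREQS_HZ, cycles_per_frame=2_500_000, target_fps=MP3_FPS))  # 103000000
```

At 59 MHz the assumed decoder sustains only 23.6 frames/s, so the policy steps up to 103 MHz (41.2 frames/s) rather than running at the maximum clock.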

"s" Retargetable Compilation for Low Power [p. 254]

Wen-Tsong Shiue

Most research to date on energy minimization in DSP processors has focused on hardware solutions. This paper examines the software-based factors affecting performance and energy consumption in architecture-aware compilation. We focus on supporting one architectural feature of DSPs that makes code generation difficult, namely the use of multiple data memory banks. This feature increases memory bandwidth by permitting multiple data memory accesses to occur in parallel when the referenced variables belong to different data memory banks and the registers involved conform to a strict set of conditions. We present novel instruction scheduling algorithms that aim to maximize performance and minimize energy, and thereby maximize the benefit of this architectural feature. Experimental results demonstrate that our algorithms generate high-performance, low-energy code for DSP architectures with multiple data memory banks. On our benchmark examples, our algorithm improved performance by 48.3% and energy consumption by 66.6%.
Keywords: Architecture-aware compiler design, high performance and low power design, instruction scheduling, register allocation.
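A common way to exploit dual data memory banks is to place variables that an instruction could fetch together into different banks. The sketch below treats this as two-coloring a graph of such variable pairs by breadth-first search and reports the pairs it cannot separate; the pair list is assumed to be given, register constraints are ignored, and this is a generic illustration rather than the instruction scheduling algorithms proposed in the paper.

```python
from collections import deque


def assign_banks(variables, pairs):
    """variables: list of variable names; pairs: (a, b) pairs an instruction would like
    to load in parallel. Returns (bank, conflicts): bank maps each variable to 0 or 1,
    and conflicts lists pairs left in the same bank (only possible with odd cycles)."""
    adj = {v: set() for v in variables}
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    bank = {}
    for root in variables:
        if root in bank:
            continue
        bank[root] = 0
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for w in sorted(adj[v]):  # sorted for a deterministic result
                if w not in bank:
                    bank[w] = 1 - bank[v]
                    queue.append(w)
    conflicts = [(a, b) for a, b in pairs if bank[a] == bank[b]]
    return bank, conflicts


bank, conflicts = assign_banks(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])
print(bank, conflicts)  # {'a': 0, 'b': 1, 'c': 0, 'd': 1} []
```

Each reported conflict corresponds to a pair of accesses that cannot be issued in parallel and would have to be serialized by the scheduler.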

"s" A Design Framework to Efficiently Explore Energy-Delay Tradeoffs [p. 260]

William Fornaciari, Donatella Sciuto, Cristina Silvano, Vittorio Zaccaria

Comprehensive exploration of the design space parameters at the system level is a crucial task for evaluating architectural tradeoffs that account for both energy and performance constraints. In this paper, we propose a system-level design methodology for the efficient exploration of the memory architecture from a combined energy-delay perspective. The aim is to find a sub-optimal configuration of the memory hierarchy without performing an exhaustive analysis of the parameter space. The target system architecture includes the processor, separate level-one instruction and data caches, the main memory, and the system buses. The methodology is based on sensitivity analysis of the optimization function with respect to the tuning parameters of the cache architecture (mainly cache size, block size and associativity). The effectiveness of the proposed methodology has been demonstrated through the design space exploration of a real-world example: a MicroSPARC2-based system running the Mediabench suite. Experimental results show an optimization speedup of 329 times with respect to full search, while the near-optimal system-level configuration found lies within about 10% of the optimal full-search configuration.
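A sensitivity-guided exploration can be sketched as follows: estimate how strongly the cost reacts to each cache parameter around a baseline, then tune the parameters one at a time, most sensitive first, instead of enumerating every combination. The parameter ranges and the separable toy `edp` cost are hypothetical stand-ins for a real energy-delay estimator; this illustrates the general idea, not the authors' methodology.

```python
import math
from itertools import product


def explore(space, cost, start):
    """space: dict parameter -> list of candidate values; cost: callable(config) -> float;
    start: baseline config. Tunes one parameter at a time, most sensitive first.
    Returns (config, number_of_cost_evaluations)."""
    evals = 0

    def f(cfg):
        nonlocal evals
        evals += 1
        return cost(cfg)

    base = f(start)
    sensitivity = {p: max(abs(f({**start, p: v}) - base) for v in vals)
                   for p, vals in space.items()}
    cfg = dict(start)
    for p in sorted(space, key=sensitivity.get, reverse=True):
        cfg[p] = min(space[p], key=lambda v: f({**cfg, p: v}))
    return cfg, evals


def full_search(space, cost):
    """Exhaustive baseline: evaluates every combination of parameter values."""
    names = list(space)
    return min((dict(zip(names, combo)) for combo in product(*space.values())), key=cost)


def edp(cfg):
    # Hypothetical separable energy-delay proxy with its minimum at 8 KB / 32 B / 2-way.
    return ((math.log2(cfg["cache_kb"]) - 3) ** 2 + (math.log2(cfg["block_b"]) - 5) ** 2
            + (math.log2(cfg["assoc"]) - 1) ** 2)


SPACE = {"cache_kb": [1, 2, 4, 8, 16, 32], "block_b": [8, 16, 32, 64], "assoc": [1, 2, 4, 8]}
cfg, evals = explore(SPACE, edp, {"cache_kb": 1, "block_b": 8, "assoc": 1})
print(cfg, evals)  # {'cache_kb': 8, 'block_b': 32, 'assoc': 2} 29
```

Here 29 cost evaluations replace the 96 of a full search. Because this toy cost is separable, one-at-a-time tuning finds the true optimum; with interacting parameters it can stop at a near-optimal point, which matches the sub-optimal target stated in the abstract.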