DAC 2002 ABSTRACTS

Sessions: [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52] [53] [54] [55]


Session 1: PANEL: Wall Street Evaluates EDA

Chair: Aart de Geus
Organizers: Sharon Turnoy, Deirdre Hanford
Panel Members: Moshe Gavrielov, Richard Goering, Lucio Lanza, Vishal Saluja, Jay Vleeschhouwer [p. 1]

The EDA sector is capturing unprecedented attention on Wall Street. With seven IPOs in 2001 alone and strong performance by the EDA "blue chips," the industry has gained new prominence with the capital markets. In this panel, Aart de Geus will moderate a discussion between representatives of the various constituencies who play a role in shaping Wall Street's opinion of EDA: financial analysts, portfolio managers, venture capitalists, CEOs, and the press.
Questions discussed will include: How do investors and analysts currently view EDA? What contributes to that perception? What factors drive EDA's current favor with Wall Street, and why is the sector "hot" compared to two to three years ago? How do we sustain that favor? Given the highly complex nature of our industry, how do investors decipher the strength of a particular EDA firm? Is EDA tied to the semiconductor industry's performance? What role does the press play in shaping the view?


Session 2: Web and IP Based Design

Chair: Gang Qu
Organizers: Ahmed A. Jerraya, Krzysztof Kuchcinski
2.1 IP Delivery for FPGAs Using Applets and JHDL [p. 2]
Michael J. Wirthlin, Brian McMurtrey

This paper introduces an FPGA IP evaluation and delivery system that operates within Java applets. The use of such applets allows designers to create, evaluate, test, and obtain FPGA circuits directly within a web browser. Based on the JHDL design tool, these applets allow structural viewing, circuit simulation, and netlist generation of application-specific circuits. Applets can be customized to provide varying levels of IP visibility and functionality as needed by both customer and vendor.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - Simulation, Hardware description languages
General Terms
Design
Keywords
Intellectual Property, JHDL, Applet, FPGA

2.2 Watermarking Integer Linear Programming Solutions [p. 8]
Seapahn Megerian, Milenko Drinic, Miodrag Potkonjak

Linear programming (LP) in its many forms has proven to be an indispensable tool for expressing and solving optimization problems in numerous domains. We propose the first set of generic watermarking techniques for integer-LP (ILP). Proof of authorship by watermarking is achieved by introducing additional constraints that limit the solution space, and it can serve as an effective means of intellectual property protection (IPP) and authentication. We classify and analyze the types of constraints in the ILP watermarking domain and show how ILP formulations provide more degrees of freedom for embedding signatures than other existing approaches. To demonstrate the effectiveness of the proposed ILP watermarking techniques, the generic discussion is further concretized using two examples, namely Satisfiability and Scheduling.
Categories and Subject Descriptors
K.6.5 [Management of Computing and Information Systems]: Security and Protection -- Digital Watermarking
General Terms
Algorithms, Economics, Theory, Legal Aspects.
Keywords
Digital Watermarking, Intellectual Property Protection
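
A minimal sketch of the constraint-embedding idea on a toy 0-1 ILP, solved by brute force rather than with an ILP package. The objective, base constraints, and the signature-to-parity-constraint mapping are all illustrative assumptions, not the authors' encoding:

```python
import itertools

def solve_01_ilp(n, objective, constraints):
    """Brute-force the best 0-1 assignment: maximize objective(x)
    subject to every constraint(x) being True."""
    best, best_val = None, None
    for x in itertools.product((0, 1), repeat=n):
        if all(c(x) for c in constraints):
            v = objective(x)
            if best_val is None or v > best_val:
                best, best_val = x, v
    return best, best_val

n = 6
objective = lambda x: 3*x[0] + 2*x[1] + 2*x[2] + x[3] + x[4] + x[5]
base = [lambda x: x[0] + x[1] + x[2] <= 2,       # original problem constraints
        lambda x: x[3] + x[4] + x[5] >= 1]

# Watermark: each signature bit pins the parity of a chosen variable pair,
# shrinking the solution space in a way only the author can demonstrate.
signature = [1, 0, 1]
pairs = [(0, 3), (1, 4), (2, 5)]
mark = [lambda x, p=p, b=b: (x[p[0]] ^ x[p[1]]) == b
        for p, b in zip(pairs, signature)]

print(solve_01_ilp(n, objective, base))         # unwatermarked optimum
print(solve_01_ilp(n, objective, base + mark))  # watermarked solution
```

Only someone holding the signature can show that the published solution satisfies the extra parity constraints, which a random solution of the base problem is unlikely to do.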

2.3 Model Design Using Hierarchical Web-Based Libraries [p. 14]
Fabrice Bernardi, Jean-François Santucci

Design tools can profitably be coupled with libraries of reusable modeling components that make both the description and the validation of models much easier. Furthermore, today's and tomorrow's applications will increasingly rest on three fundamental technologies: object orientation, client/server computing, and the Internet. In this article we propose an object-oriented architecture for the definition of Web-based hierarchical model libraries. The originality of our approach lies in the fact that it is based on (i) a notion of genericity of use, (ii) inheritance and abstraction links between the stored models, and (iii) Web-based procedures for storing and consulting libraries.
Categories and Subject Descriptors
J.6 [Computer Applications]: Computer-Aided Design; D.2.11 [Software]: Software Engineering - Software Architectures; D.2.13 [Software]: Software Engineering - Reusable Software
General Terms
Design, Management
Keywords
model reuse, model libraries, Web-based access, abstraction hierarchy

2.4 Behavioral Synthesis via Engineering Change [p. 18]
Milenko Drinic, Darko Kirovski

Engineering change (EC) is a technique that enables a designer to rapidly perform minor specification alterations while resynthesizing only small portions of the specification throughout several levels of design abstraction. In this paper, we introduce the first EC-based synthesis technique for coordinated design optimization across multiple steps. The technique has four phases: optimization region identification, feedback formulation, resynthesis in the first design step, and finally resynthesis in the second design step. To demonstrate the technique, we focus on behavioral synthesis and its transformation, scheduling, and register assignment steps. We developed a generic EC-based approach for design optimization during multiple consecutive synthesis steps. We then show how one can use EC to enhance the coordinated application of transformations and scheduling, and of scheduling and register assignment.
Categories and Subject Descriptors
B.5.2 [Register-Transfer-Level Implementation]: Design Aids - Optimization
General Terms
Design
Keywords
Engineering change, transformations, scheduling, register assignment


Session 3: Design Innovations for Embedded Processors

Chair: Gang Qu
Organizers: Grant E. Martin, Majid Sarrafzadeh
3.1 A Universal Technique for Fast and Flexible Instruction-Set Architecture Simulation [p. 22]
Achim Nohl, Gunnar Braun, Oliver Schliebusch, Rainer Leupers, Heinrich Meyr, Andreas Hoffmann

In the last decade, instruction-set simulators have become an essential development tool for the design of new programmable architectures. Consequently, simulator performance is a key factor in overall design efficiency. Motivated by the extremely poor performance of commonly used interpretive simulators, research on fast compiled instruction-set simulation began ten years ago. However, due to the restrictiveness of the compiled technique, it has not been able to establish itself in commercial products. This paper presents a new retargetable simulation technique which combines the performance of traditional compiled simulators with the flexibility of interpretive simulation. The technique is not limited to any class of architectures or applications and can be utilized from architecture exploration up to end-user software development. The workflow and applicability of the so-called just-in-time cache-compiled simulation (JIT-CCS) technique are demonstrated on state-of-the-art real-world architectures.
Categories and Subject Descriptors
I.6.3 [Simulation and Modeling]: Simulation Support Systems; I.6.3 [Simulation and Modeling]: Model Validation and Analysis; D.3.2 [Programming Languages]: Design Languages - LISA; C.0 [General]: Modeling of Computer Architecture
General Terms
Design, Languages, Performance
Keywords
Retargetable simulation, compiled simulation, instruction set architectures
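
A compressed illustration of the just-in-time cache-compiled idea: decode an instruction the first time its address is visited, cache the decoded operation keyed by program counter, and re-decode only when the cached tag no longer matches (e.g., after self-modifying code). The toy ISA and cache policy are illustrative assumptions, not the paper's LISA-based implementation:

```python
# Toy 3-address ISA: (opcode, dest, src1, src2) encoded as tuples in `mem`.
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b,
       "mul": lambda a, b: a * b}

def decode(raw):
    """The expensive step in a real simulator: turn a raw encoding
    into a directly executable closure."""
    op, d, s1, s2 = raw
    fn = OPS[op]
    return lambda regs: regs.__setitem__(d, fn(regs[s1], regs[s2]))

def simulate(mem, steps):
    regs = [0, 1, 2, 3]
    cache = {}                      # pc -> (raw_word, decoded closure)
    pc, hits = 0, 0
    for _ in range(steps):
        raw = mem[pc]
        entry = cache.get(pc)
        if entry is not None and entry[0] == raw:   # tag check catches
            hits += 1                               # self-modifying code
            entry[1](regs)
        else:
            decoded = decode(raw)                   # JIT-decode and cache
            cache[pc] = (raw, decoded)
            decoded(regs)
        pc = (pc + 1) % len(mem)    # straight-line loop for the sketch
    return regs, hits

print(simulate([("add", 0, 0, 1), ("mul", 2, 2, 3), ("sub", 3, 3, 1)], 300))
```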

3.2 A Fast On-Chip Profiler Memory [p. 28]
Roman Lysecky, Susan Cotterell, Frank Vahid

Profiling an application executing on a microprocessor is part of the solution to numerous software and hardware optimization and design automation problems. Most current profiling techniques suffer from runtime overhead, inaccuracy, or slowness, and the traditional non-intrusive method of using a logic analyzer does not work for today's systems-on-a-chip with embedded cores. We introduce a novel on-chip memory architecture that overcomes these limitations. The architecture, which we call ProMem, is based on a pipelined binary tree structure. It achieves single-cycle throughput, so it can keep up with today's fastest pipelined processors. It can also be laid out efficiently and scales very well, becoming more efficient the larger it gets. The memory can be used in a wide variety of common profiling situations, such as instruction profiling, value profiling, and network traffic profiling, which in turn can be used to guide numerous design automation tasks.
Keywords
Profiling, system-on-a-chip, platform tuning, adaptive architectures, low power, embedded CAD, binary tree, memory design, embedded systems.
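
A behavioral stand-in for the tree structure, assuming made-up address ranges: each observed address is mapped to a counter bucket by binary search over sorted range boundaries, which is what the hardware performs one tree level per pipeline stage:

```python
import bisect

class ProfilerModel:
    """Software stand-in for a pipelined binary-tree profiler memory:
    each monitored address range owns one counter; the lookup is a binary
    search, which hardware performs one tree level per pipeline stage."""
    def __init__(self, boundaries):
        self.bounds = sorted(boundaries)          # range split points
        self.counts = [0] * (len(self.bounds) + 1)
    def observe(self, addr):
        self.counts[bisect.bisect_right(self.bounds, addr)] += 1

p = ProfilerModel([0x1000, 0x2000, 0x8000])       # four buckets
for a in (0x0040, 0x1800, 0x1810, 0x9000, 0x1F00):
    p.observe(a)
print(p.counts)   # -> [1, 3, 0, 1]
```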

3.3 Design of a One-cycle Decompression Hardware for Performance Increase in Embedded Systems [p. 34]
Haris Lekatsas, Jörg Henkel, Venkata Jakkula

Code compression is known as an effective technique for reducing instruction memory size in an embedded system. However, code compression can also be very effective in increasing processor-to-memory bandwidth and hence provide increased system performance. In this paper we describe our design, and design methodology, of the first running prototype of a one-cycle code decompression unit that decompresses compressed instructions on the fly. We describe in detail the architecture that enables decompression of multiple instructions in one cycle, and we present the design methodologies and tools used. The stand-alone decompression unit does not require any modifications to the processor core. We observed performance increases of up to 63%, with 25% on average, over a wide variety of applications running on the hardware prototype under various system configurations.
Categories and Subject Descriptors
B.3 [Hardware]: Memory Structures; C.3 [Computer Systems Organization]: Special-purpose and Application-based Systems-Real-time and embedded systems
General Terms
Algorithms, Design, Performance


Session 4: Passive Model Order Reduction

Chair: Jacob White
Organizers: Jaijeet Roychowdhury, Mustafa Celik
4.1 A Factorization-Based Framework for Passivity-Preserving Model Reduction of RLC Systems [p. 40]
Q. Su, V. Balakrishnan, C.-K. Koh

We present a framework for passivity-preserving model reduction for RLC systems that includes, as a special case, the well-known PRIMA model reduction algorithm. This framework provides a new interpretation for PRIMA, and offers a qualitative explanation as to why PRIMA performs remarkably well in practice. In addition, the framework enables the derivation of new error bounds for PRIMA-like methods. We also show how the framework offers a systematic approach to computing reduced-order models that better approximate the original system than PRIMA, while still preserving passivity.
Categories and Subject Descriptors
G.1.3 [NUMERICAL ANALYSIS]: Numerical Linear Algebra - linear systems; F.2.1 [ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY]: Numerical Algorithms and Problems - computations on matrices
General Terms
Algorithms
Keywords
Model Reduction, Large Scale Systems, RLC interconnect, Passivity Preserving, Factorization
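
A numerical sketch of the projection step shared by PRIMA-like methods: build an orthonormal basis V for a Krylov subspace of the system matrices and reduce by congruence, the operation that preserves passivity when the original matrices have the right definiteness structure. The random RC-style system below is an illustrative assumption, not an example from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 60, 8                       # full and reduced orders
# RC-style descriptor system: C x' = -G x + B u,  y = B^T x
M = rng.standard_normal((n, n))
G = M @ M.T + n * np.eye(n)        # symmetric positive definite
Cm = np.diag(rng.uniform(1.0, 2.0, n))
B = rng.standard_normal((n, 1))

# Arnoldi on A = G^{-1} C starting from r = G^{-1} B (block size 1)
V = np.zeros((n, q))
r = np.linalg.solve(G, B).ravel()
V[:, 0] = r / np.linalg.norm(r)
for k in range(1, q):
    w = np.linalg.solve(G, Cm @ V[:, k - 1])
    w -= V[:, :k] @ (V[:, :k].T @ w)    # Gram-Schmidt orthogonalization
    V[:, k] = w / np.linalg.norm(w)

# Congruence projection keeps G_r, C_r definite -> reduced model is passive
G_r, C_r, B_r = V.T @ G @ V, V.T @ Cm @ V, V.T @ B

def tf(s, G, C, B):                # transfer function B^T (G + sC)^{-1} B
    return (B.T @ np.linalg.solve(G + s * C, B)).item()

for s in (0.0, 1.0, 10.0):         # full vs. reduced response
    print(s, tf(s, G, Cm, B), tf(s, G_r, C_r, B_r))
```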

4.2 Model Order Reduction for Strictly Passive and Causal Distributed Systems [p. 46]
Luca Daniel, Joel Phillips

This paper presents a class of algorithms suitable for model reduction of distributed systems. Distributed systems are not suitable for treatment by standard model-reduction algorithms such as PRIMA, PVL, and the Arnoldi schemes because they generate matrices that are dependent on frequency (or other parameters) and cannot be put in a lumped or state-space form. Our algorithms build on well-known projection-based reduction techniques, and so require only matrix-vector product operations and are thus suitable for operation in conjunction with electromagnetic analysis codes that use iterative solution methods and fast-multipole acceleration techniques. Under the condition that the starting systems satisfy system-theoretic properties required of physical systems, the reduced systems can be guaranteed to be passive. For distributed systems, we argue that causality of the underlying representation is as important a consideration as passivity has become.
Categories and Subject Descriptors: B.7.2 Simulation, B.8.2 Performance Analysis and Design Aids, G.1.1 Interpolation, G.1.2 Approximation, I.6 Simulation and Modeling.
General Terms: Algorithms, Performance, Design
Keywords: Passive reduced order modeling, Distributed systems.

4.3 Guaranteed Passive Balancing Transformations for Model Order Reduction [p. 52]
Joel Phillips, Luca Daniel, L. Miguel Silveira

The major concerns in state-of-the-art model reduction algorithms are: achieving accurate models of sufficiently small size, numerically stable and efficient generation of the models, and preservation of system properties such as passivity. Algorithms such as PRIMA generate guaranteed-passive models, for systems with special internal structure, using numerically stable and efficient Krylov-subspace iterations. Truncated Balanced Realization (TBR) algorithms, as used to date in the design automation community, can achieve smaller models with better error control, but do not necessarily preserve passivity. In this paper we show how to construct TBR-like methods that guarantee passive reduced models and in addition are applicable to state-space systems with arbitrary internal structure.
Categories and Subject Descriptors: B.7.2 Simulation, B.8.2 Performance Analysis and Design Aids, I.6 Simulation and Modeling.
General Terms: Algorithms, Performance, Design
Keywords: Passive reduced order modeling, Truncated balanced realization.
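
For contrast, classical square-root TBR (the starting point the paper extends) fits in a few SciPy lines; the paper's contribution, replacing the two Lyapunov solutions with ones that certify passivity, is not shown here. The stable symmetric test system is a made-up example:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, svd

rng = np.random.default_rng(1)
n, q = 12, 4
M = rng.standard_normal((n, n))
A = -(M @ M.T) - np.eye(n)                   # symmetric, stable
B = rng.standard_normal((n, 1))
C = rng.standard_normal((1, n))

# Controllability/observability Gramians from two Lyapunov equations
P = solve_continuous_lyapunov(A, -B @ B.T)   # A P + P A^T = -B B^T
Q = solve_continuous_lyapunov(A.T, -C.T @ C) # A^T Q + Q A = -C^T C

def psd_factor(X):                           # X = Z Z^T, robust to tiny eigs
    w, V = np.linalg.eigh(X)
    return V * np.sqrt(np.clip(w, 0.0, None))

Zp, Zq = psd_factor(P), psd_factor(Q)
U, s, Vt = svd(Zq.T @ Zp)                    # s = Hankel singular values
Sh = np.diag(s[:q] ** -0.5)
T  = Zp @ Vt[:q].T @ Sh                      # balancing projections:
Ti = Sh @ U[:, :q].T @ Zq.T                  # Ti @ T = I_q
Ar, Br, Cr = Ti @ A @ T, Ti @ B, C @ T       # truncated balanced realization

def H(w, A, B, C):                           # frequency response magnitude
    return (C @ np.linalg.solve(1j * w * np.eye(A.shape[0]) - A, B)).item()

for w in (0.1, 1.0, 10.0):
    print(w, abs(H(w, A, B, C)), abs(H(w, Ar, Br, Cr)))
```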


Session 5: New Perspectives in Physical Design

Chair: Steven Teig
Organizers: Ralph Otten, Timothy Kam
5.1 Uncertainty-Aware Circuit Optimization [p. 58]
Xiaoliang Bai, Chandu Visweswariah, Philip N. Strenski, David J. Hathaway

Almost by definition, well-tuned digital circuits have a large number of equally critical paths, which form a so-called "wall" in the slack histogram. However, by the time the design has been through manufacturing, many uncertainties cause these carefully aligned delays to spread out. Inaccuracies in parasitic predictions, clock slew, model-to-hardware correlation, static timing assumptions, and manufacturing variations all cause the performance to vary from prediction. Simple statistical principles tell us that the variation of the limiting slack is larger when the height of the wall is greater. Although the wall would be the optimal solution if the static timing predictions were perfect, in the presence of uncertainty in timing and manufacturing it may no longer be the best choice. The application of formal mathematical optimization in transistor sizing increases the height of the wall, thus exacerbating the problem. As a practical matter, schematic restructuring downstream in the design methodology is also easier to conceive when there are fewer equally critical paths. This paper describes a method that gives formal mathematical optimizers the incentive to avoid the wall of equally critical paths while giving up as little as possible in nominal performance. Surprisingly, such a formulation reduces the degeneracy of the optimization problem and can render the optimizer more effective. This "uncertainty-aware" mode has been implemented and applied to several high-performance microprocessor macros. Numerical results are included.

5.2 Congestion-Driven Codesign of Power and Signal Networks [p. 64]
Haihua Su, Jiang Hu, Sachin S. Sapatnekar, Sani R. Nassif

We present a global wire design methodology that simultaneously considers the performance needs for both signal lines and power grids under congestion considerations. An iterative procedure is employed in which the global routing is performed according to a congestion map that includes the resource utilization of the power grid, followed by a step in which the power grid is adjusted to relax the congestion in crowded regions. This adjustment is in the form of wire removal in noncritical regions, followed by a wire sizing step that overcomes the effects of wire removal. Experimental results show that the overall routability can be significantly improved while the power grid noise is maintained within the voltage droop constraint.
Categories and Subject Descriptors
B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids
General Terms
Algorithms
Keywords
wire congestion, codesign, signal routing, power grid noise

5.3 On Metrics for Comparing Routability Estimation Methods for FPGAs [p. 70]
Parivallal Kannan, Shankar Balachandran, Dinesh Bhatia

Interconnect management is a critical design issue for large FPGA based designs. One of the most important issues for planning interconnection is the ability to accurately and efficiently predict the routability of a given design on a given FPGA architecture. The recently proposed routability estimation procedure, fGREP [6], produced estimates within 3 to 4% of an actual detailed router. Other known routability estimation methods include RISA [5], Lou's [7] method and Rent's rule based methods [1] [12] [9]. Comparing these methods has been difficult because of the different reporting methods used by the authors. We propose a uniform reporting metric based on comparing the estimates produced with the results of an actual detailed router on both local and global levels. We compare all the above methods using our reporting metric on a large number of benchmark circuits and show that the enhanced fGREP method produces tight estimates that outperform most other techniques.
Categories and Subject Descriptors: B.7.2 [Integrated Circuits]: Design Aids
General Terms
Algorithms, Measurement, Experimentation
Keywords
FPGA, fGREP, routability estimation, congestion, RISA, Rent's rule


Session 6: PANEL: Tools or Users: Which is the Bigger Bottleneck?

Chair: Andrew B. Kahng
Organizer: Bob Dahlberg
Panel Members: Ron Collett, Patrick Groeneveld, Lambert van den Hoven, Lavi Lev, Nancy Nettleton, Paul Rodman [p. 76]

As chip design becomes ever more complex, fewer design teams are succeeding. Who's to blame? On one hand, tools are hard to use, buggy, not interoperable, and have missing functionality. On the other hand, there is a wide range of engineering skills within the user population, and tools can be abused within flawed methodologies. This panel will quantify and prioritize the key gaps, including interoperability, that must be addressed on both sides.


Session 7: Special Session: Life after CMOS: Imminent or Irrelevant?

Chairs: Dennis Sylvester, Kaustav Banerjee
Organizers: Dennis Sylvester, Kaustav Banerjee
7.1 Life Is CMOS: Why Chase the Life After? [p. 78]
George Sery, Shekhar Borkar, Vivek De

This paper discusses potential solutions for scaling CMOS device technology to gate lengths approaching 10 nm. Promising circuit and design techniques to control leakage power are described, and energy-efficient microarchitecture trends for general-purpose microprocessors are elucidated.
Categories and Subject Descriptors
B.7 INTEGRATED CIRCUITS
B.7.1 Types and Design Styles - Microprocessors and microcomputers, VLSI.
General Terms
Performance, Design
Keywords
Technology scaling, Leakage control, Microarchitecture

7.2 The Next Chip Challenge: Effective Methods for Viable Mixed Technology SoCs [p. 84]
H. Bernhard Pogge

The next generation of computer chips will continue the trend toward greater complexity than their predecessors. Many of them will combine different chip technologies and are termed SoCs (Systems on a Chip). They present major new challenges to the process community, the system and circuit communities, and the design and test communities; at the same time, they also offer new opportunities. For one, the desire to bring more functionality onto a single chip tends to require additional processing, which in turn results in various degrees of device compromises. The chips will also tend to become larger due to the added device content, which generally impacts the yieldability of the final chip. Such chips will also require potentially new approaches to validating the intended design performance. Chip sector reuse must also be brought into the discussion and, wherever possible, into practice. The net effect implies higher chip costs. Much of the industry's effort is therefore focused on addressing these challenges, though so far not very successfully. The alternative has been to continue placing chips onto substrate modules. Yet this solution creates practical limits on achievable wiring densities and bandwidth, due to the spacing requirements of the C4 interconnection; furthermore, every C4 joint is associated with a signal delay of about 50 ps. All of these handicaps would potentially benefit greatly from new SoC methods, starting with the fabrication methodology and extending into chip design and test.
Such a direction has been set in motion. The opportunity for a uniquely new chip fabrication method has emerged from combining a set of somewhat diverse processes. It is based on a judicious selection of process elements from the traditional chip area, combined with those of a more recent chip packaging process methodology. This approach simultaneously overcomes all of the key process limitations experienced with today's SoC chip designs and eliminates certain chip packaging technology handicaps. Yet it does not require new process tooling; it relies on currently existing process tooling and methodologies.
This new process direction has been found to be applicable to a number of desirable SoC device designs, and it offers opportunities for yet another expansion of the current semiconductor technology base over the next few years. However, effective SoC design and fabrication require much closer and earlier collaboration between the process, design, and test communities.
General Terms: Design
Key Words: SoCs (System on a Chip); Chip Fabrication methods; Chip/Packaging integration; Chip Subsector concepts

7.3 Few Electron Devices: Towards Hybrid CMOS-SET Integrated Circuits [p. 88]
Adrian M. Ionescu, Michel J. Declercq, Santanu Mahapatra, Kaustav Banerjee, Jacques Gautier

In this paper, CMOS evolution and its fundamental and practical limitations are briefly reviewed, and the working principles, performance, and fabrication of single-electron transistors (SETs) are addressed in detail. Some of the unique characteristics and functionality of SETs, like unrivalled integration and low power, which are complementary to sub-20 nm CMOS, are demonstrated. Characteristics of two novel SET architectures, namely C-SET and R-SET, aimed at logic applications are compared. Finally, it is shown that the combination of CMOS and SET in hybrid ICs appears attractive in terms of new functionality and performance, together with better integrability for ULSI, especially because of their complementary characteristics. It is envisioned that efforts in compatible fabrication processes, packaging, modeling, electrical characterization, co-design, and co-simulation will be needed in the near future to achieve substantial advances in both memory and logic circuit applications based on CMOS-SET hybrid circuits.
Categories and Subject Descriptors
B.7 INTEGRATED CIRCUITS
B.7.1 Types and Design Styles - Advanced Technologies
General Terms
Design, Experimentation, Measurement, Performance
Keywords
Nanoelectronics, Single-Electron Transistors, Ultimate CMOS, Hybrid CMOS-SET Circuits, Low power, Inverter, Quantizer.

7.4 Carbon Nanotube Field-Effect Transistors and Logic Circuits [p. 94]
R. Martel, V. Derycke, J. Appenzeller, S. Wind, and Ph. Avouris

In this paper, we present recent advances in the understanding of the properties of semiconducting single-wall carbon nanotubes and in the exploration of their use as field-effect transistors (FETs). Both electrons and holes can be injected into a nanotube transistor, either by controlling the metal-nanotube Schottky barriers present at the contacts or simply by doping the bulk of the nanotube. These methods give complementary nanotube FETs that can be integrated together to make inter- and intra-nanotube logic circuits. The device performance and general characteristics suggest that they can compete with silicon MOSFETs. While this is true for simple prototype devices, several issues remain to be explored before a nanotube-based technology is possible; these are also discussed.
Categories and Subject Descriptors
B.6.0 [Logic Design]: General - novel logic devices.
General Terms
Measurement, Performance, Design, Experimentation.
Keywords
Nanoelectronics, Carbon Nanotube, Semiconductor, Field-Effect Transistor, FET, Schottky Barrier, Circuits, Inverter, Logic Gate, SWNT.


Session 8: Formal Verification

Chair: Yaron Wolfsthal
Organizers: Carl Pixley, Karem Sakallah
8.1 Efficient State Representation for Symbolic Simulation [p. 99]
Valeria Bertacco, Kunle Olukotun

Symbolic simulation is attracting increasing interest for the validation of digital circuits. It allows the verification engineer to explore all, or a major portion of, the circuit's state space without having to design specific and time-consuming test stimuli. However, the complexity and unpredictable run-time behavior of symbolic simulation have limited its scope to small-to-medium circuits. In this paper, we propose a novel approach to symbolic simulation that reduces the size of the BDDs of the state vector while maintaining an exact representation of the set of states visited. The method exploits the decomposition properties of Boolean functions. By restructuring the next-state functions into their disjoint-support components, we gain better insight into the role of each input variable. Consequently, we can simplify the next-state functions without significantly sacrificing simulation accuracy. Our experimental results show that this approach can effectively reduce the memory requirements of symbolic simulation while surrendering only a small portion of the design's state space.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - Verification, Simulation; B.8 [Hardware]: Performance and Reliability
General Terms
Design, Verification, Performance, Theory
Keywords
Formal Verification, Symbolic Simulation, BDDs

8.2 Handling Special Constructs in Symbolic Simulation [p. 105]
Alfred Kölbl, James Kukula, Kurt Antreich, Robert Damiano

Symbolic simulation is a formal verification technique that combines the flexibility of conventional simulation with powerful symbolic methods. Some constructs that are easy to handle in conventional simulation, however, need special consideration in symbolic simulation. This paper discusses several such constructs: the symbolic representation of arrays, an efficient symbolic method for storing arrayed instances, and the handling of symbolic data-dependent delays. We present results which demonstrate the effectiveness of our symbolic array model in the simulation of highly regular structures like FPGAs, memories, and cellular automata.
Categories and Subject Descriptors
B.5.2 [Hardware]: Register-Transfer-Level Implementation - Design Aids, Verification; B.6.3 [Hardware]: Logic Design - Design Aids, Verification; B.7.2 [Hardware]: Integrated Circuits - Design Aids, Verification
General Terms
Verification
Keywords
Symbolic Simulation, Formal Verification

8.3 A Hybrid Verification Approach: Getting Deep into the Design [p. 111]
Scott Hazelhurst, Gila Kamhi, Osnat Weissberg, Limor Fix

One method of handling the computational complexity of the verification process is to combine the strengths of different approaches. We propose a hybrid verification technology combining symbolic trajectory evaluation with either symbolic model checking or SAT-based model checking. This significantly reduces the cost (both human and computing) of verifying circuits with complex initialisation, and it simplifies proof development by enhancing verification productivity. The approach has been tested on current Intel designs.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - verification; F.3.1 [Specifying and Verifying and Reasoning about Programs]: mechanical verification
General Terms
Verification, Theory
Keywords
symbolic model checking, symbolic trajectory evaluation, hybrid verification

8.4 Can BDDs Compete with SAT Solvers on Bounded Model Checking? [p. 117]
Gianpiero Cabodi, Paolo Camurati, Stefano Quer

Bounded Model Checking (BMC) based on propositional satisfiability (SAT) methods has recently proven its efficacy for bug hunting. BDD-based tools are able to verify broader sets of properties (e.g., CTL formulas), but recent experimental comparisons between SAT and BDDs in formal verification have led to the conclusion that SAT approaches are more robust and scalable than BDD techniques. In this work we extend BDD-based verification to larger circuit and problem sizes, so that it can indeed compete with SAT-based tools. The approach we propose solves Bounded Model Checking problems using BDDs. In order to cope with larger models it exploits approximate traversals, yet it is exact, i.e., it produces neither false negatives nor false positives. It reaps relevant performance enhancements from mixed forward and backward, approximate and exact traversals, guided search, conjunctive decompositions, and generalized-cofactor-based BDD simplifications. We experimentally compare our tool with BMC in NuSMV (using mchaff as the SAT engine), and we show that BDDs are able to accomplish large verification tasks and can better cope with increasing sequential depths.
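
The traversal style at issue can be conveyed with explicit Python sets standing in for BDDs: a BDD-based BMC run is a breadth-first image computation that either hits a bad state within the bound or reaches a fixed point. The counter FSM, property, and bound are made-up examples; a real tool manipulates characteristic functions as BDDs:

```python
def image(states, step):
    """One forward step: the set-based analogue of a BDD image computation."""
    return {step(s) for s in states}

def bmc(init, step, bad, k):
    """Check whether a bad state is reachable within k steps."""
    frontier, reached = set(init), set(init)
    for depth in range(k + 1):
        hit = frontier & bad
        if hit:
            return depth, sorted(hit)
        frontier = image(frontier, step) - reached   # frontier-only traversal
        if not frontier:
            return None          # fixed point: property holds at all depths
        reached |= frontier
    return None                  # property holds up to bound k

# 6-bit counter that skips value 40: which targets are reachable?
step = lambda s: (s + 2) % 64 if s == 39 else (s + 1) % 64
print(bmc(init={0}, step=step, bad={41}, k=100))   # -> (40, [41])
print(bmc(init={0}, step=step, bad={40}, k=100))   # -> None: 40 unreachable
```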


Session 9: High Level Specification and Design

Chair: Andreas Kanstein
Organizers: Limor Fix, Shin-ichi Minato
9.1 RTL C-Based Methodology for Designing and Verifying a Multi-Threaded Processor [p. 123]
Luc Séméria, Andrew Seawright, Renu Mehra, Daniel Ng, Arjuna Ekanayake, Barry Pangrle

An RTL C-based design and verification methodology is presented which enabled the successful high-speed validation of a 7-million-gate simultaneous multi-threaded (SMT) network processor. The methodology is centered on a statically scheduled C-based coding style, C-to-HDL translation, and a novel RTL-C to RTL-Verilog equivalence checking flow. It leverages improved simulation performance combined with static techniques to reduce the amount of RTL-Verilog and gate-level verification required during development.
Categories - B.5.2 [Register-Transfer-Level Implementation] Design Aids: Automatic synthesis, Hardware description languages, Optimization, Simulation, Verification.
General Terms - Design, Verification, Performance, Languages.
Keywords - C/C++, RTL, design, verification, formal equivalence checking.

9.2 High-Level Specification and Automatic Generation of IP Interface Monitors [p. 129]
Marcio T. Oliveira, Alan J. Hu

A central problem in functional verification is to check that a circuit block is producing correct outputs while enforcing that the environment is providing legal inputs. To attack this problem, several researchers have proposed monitor-based methodologies, which offer many benefits. This paper presents a novel, high-level specification style for these monitors, along with a linear-size, linear-time translation algorithm into monitor circuits. The specification style naturally fits the complex, but well-specified interfaces used between IP blocks in systems-on-chip. To demonstrate the advantage of our specification style, we have specified monitors for various versions of the Sonics OCP protocol as well as the AMBA AHB protocol, and have developed a prototype tool that automatically translates specifications into Verilog or VHDL monitor circuits.
Categories and Subject Descriptors B.5.2 [Register-Transfer Level Implementation]: Design Aids; B.6.3 [Logic Design]: Design Aids; C.0 [Computer Systems Organization]: General - Systems specification methodology; J.6 [Computer-Aided Engineering]: Computer-aided design (CAD)
General Terms
Documentation, Languages, Verification
Keywords
Formal Verification, Regular Expressions, Pipelining, Alternation
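
A toy Python stand-in for the generation step: a declarative transition table (the "specification") is mechanically turned into a checker that walks an interface trace and flags protocol violations. The request/grant protocol and signal names are invented; the paper's tool emits Verilog/VHDL monitors from a far richer specification style:

```python
# Specification: (state, observed symbol) -> next state ("spec as data").
# Protocol: a request must stay asserted until granted; grant only follows
# a pending request.
SPEC = {
    ("IDLE", "idle"): "IDLE",
    ("IDLE", "req"):  "WAIT",
    ("WAIT", "req"):  "WAIT",     # request held while waiting
    ("WAIT", "gnt"):  "IDLE",     # grant completes the transaction
}

def make_monitor(spec, start):
    """Mechanically translate the spec table into a trace checker."""
    def monitor(trace):
        state = start
        for cycle, sym in enumerate(trace):
            nxt = spec.get((state, sym))
            if nxt is None:       # no legal transition -> protocol violation
                return f"violation at cycle {cycle}: {sym} in state {state}"
            state = nxt
        return "ok"
    return monitor

check = make_monitor(SPEC, "IDLE")
print(check(["idle", "req", "req", "gnt", "idle"]))  # ok
print(check(["idle", "gnt"]))        # violation: grant without request
print(check(["req", "idle"]))        # violation: request dropped early
```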

9.3 Achieving Maximum Performance: A Method for the Verification of Interlocked Pipeline Control Logic [p. 135]
Kerstin Eder, Geoff Barrett

Getting the interlock logic which controls pipeline flow correct is an important prerequisite for maximising pipeline performance. Unnecessary pipeline stalls can only be eliminated when they can be distinguished from those stalls which are necessary to preserve functional correctness. We propose a method for deriving a maximum pipeline performance specification from a complete functional specification of the pipeline control logic. The performance specification can be used to generate simulation testbench assertions. On the other hand, the specification can serve as a basis for formal property checking. The most promising aspect of our work is, however, the potential to synthesise the actual control logic from its formal description.
Categories and Subject Descriptors
B.5.2 [Register-Transfer-Level Implementation]: Design Aids - Verification; B.5.1 [Register-Transfer-Level Implementation]: Design - Control Design, Pipeline
General Terms
Performance, Verification
Keywords
Pipeline Stall, Interlock Logic, Verification

9.4 Formal Verification of Module Interfaces against Real Time Specifications [p. 141]
Arindam Chakrabarti, Pallab Dasgupta, P. P. Chakrabarti, Ansuman Banerjee

One of the main concerns of the designer of a circuit module is to guarantee that the interface of the module conforms to specific protocols (such as PCI Bus or Ethernet) by which it interacts with its environment. The computational complexity of verifying such open systems under all possible environments has been shown to be very hard (EXPTIME complete [10]). On the other hand, designers are typically required to guarantee correct behavior only for specific valid behaviors of the environment (such as a valid PCI Bus environment). Designers attempt to model these behaviors through an appropriate test bench for the module. In this paper we present a module verifier tool based on a proposed real time temporal logic called Open-RTCTL, which allows combined specification of the correctness properties and the input environments. The tool accepts the design in a subset of Verilog. By making the designer specify the environment constraints, we are able to verify a module in isolation, and thereby avoid the state explosion problem due to composition of modules. We present experimental results on modules from the Texas-97 Benchmark circuits [14] to demonstrate the space/time efficiency of the tool.
Categories and Subject Descriptors
B.7.2. [Hardware]: Integrated Circuits - Verification
General Terms
Verification
Keywords
Formal Verification, Temporal Logic


Session 10: Timing Abstraction

Chair: Mark Hahn
Organizers: Chandu Visweswariah, Narendra V. Shenoy
10.1 Automated Timing Model Generation [p. 146]
Ajay J. Daga, Loa Mize, Subramanyam Sripada, Chris Wolff, Qiuyang Wu

The automated generation of timing models from gate-level netlists facilitates IP reuse and dramatically improves chip-level STA runtime in a hierarchical design flow. In this paper we discuss two different approaches to model generation, the design flows they lend themselves to, and results from applying these model generation solutions to large customer designs.
Categories and Subject Descriptors
J.6 [Computer Applications]: Computer-Aided Design (CAD)
General Terms
Design, Performance, Algorithm, Verification
Keywords
Static Timing Analysis, Model Generation, EDA

10.2 Timing Model Extraction of Hierarchical Blocks by Graph Reduction [p. 152]
Cho W. Moon, Harish Kriplani, Krishna P. Belkhale

A timing model extractor builds a timing model of a digital circuit for use with a static timing analyzer. This paper proposes a novel method of generating a gray-box timing model from a gate-level netlist by reducing a timing graph. Previous methods of generating timing models sacrificed accuracy and/or did not scale well with design size. The proposed method is simple, yet it provides model accuracy including arbitrary levels of latch time borrowing, correct support for self-loop timing checks, and the capability to support timing constraints that span multiple blocks. Moreover, the CPU and memory resources required to generate the model scale well with the size of the circuit. We were able to extract a model for a 456K-gate block using under 2 minutes of CPU time and 464 MB of memory on a Sun Fire 880 machine. The generated model can improve timing verification capacity by more than two orders of magnitude.
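
For a purely combinational block, the essence of graph reduction is longest-path compression: internal nodes are eliminated until only port-to-port arcs with worst-case delays remain. A minimal sketch under that simplification (the paper's method additionally handles latches, time borrowing, and timing checks, none of which appear here):

```python
from collections import defaultdict

def reduce_timing_graph(arcs, ports):
    """Collapse a combinational timing DAG (edge list with delays) into
    worst-case port-to-port arcs: the gray-box model keeps only ports."""
    longest = defaultdict(lambda: float("-inf"))
    for u, v, d in arcs:
        longest[(u, v)] = max(longest[(u, v)], d)
    nodes = {n for u, v, _ in arcs for n in (u, v)}
    for k in nodes - set(ports):              # eliminate internal node k
        ins = [(u, d) for (u, v), d in list(longest.items()) if v == k]
        outs = [(v, d) for (u, v), d in list(longest.items()) if u == k]
        for u, d1 in ins:                     # bridge every in/out pair
            for v, d2 in outs:
                longest[(u, v)] = max(longest[(u, v)], d1 + d2)
        for key in [key for key in longest if k in key]:
            del longest[key]
    return dict(longest)

arcs = [("A", "n1", 2.0), ("B", "n1", 3.0), ("n1", "n2", 1.5),
        ("n1", "Y", 1.0), ("n2", "Y", 0.5)]
print(reduce_timing_graph(arcs, ports={"A", "B", "Y"}))
# -> {('A', 'Y'): 4.0, ('B', 'Y'): 5.0}
```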

10.3 Efficient Stimulus Independent Timing Abstraction Model Based on a New Concept of Circuit Block Transparency [p. 158]
Martin Foltin, Brian Foutz, Sean Tyler

We have developed a new timing abstraction model for digital circuit blocks that is stimulus independent, port based, supports designs with level triggered latches, and can be input into commercial STA (Static Timing Analysis) tools. The model is based on an extension of the concept of latch transparency to circuit block transparency introduced in this paper. It was implemented, tested and is being used in conjunction with transistor level STA for microprocessor designs with tens of millions of transistors. The STA simulation times are significantly shorter than with gray box timing models, which can decrease the overall chip timing verification time. The model can also be used in the intellectual property encapsulation domain.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids - simulation, verification
General Terms
Performance, Design, Verification.
Keywords
Timing analysis, timing model, VLSI design, circuit optimization

10.4 An Implication-based Method to Detect Multi-Cycle Paths in Large Sequential Circuits [p. 164]
Hiroyuki Higuchi

This paper proposes a fast multi-cycle path analysis method for large sequential circuits. It determines whether or not all the paths between every flip-flop pair are multi-cycle paths. The proposed method is based on ATPG techniques, especially implication techniques, to directly exploit circuit structure and the multi-cycle path condition. The method also checks whether or not a multi-cycle path may be invalidated by static hazards in combinational logic parts. Experimental results show that our method is much faster than conventional ones.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids
General Terms
Algorithms, Design, Verification
Keywords
multi-cycle path, sequential circuits, implication, ATPG


Session 11: Special Session: E-Textiles

Chair & Organizer: Majid Sarrafzadeh
11.1 WITHDRAWN

11.2 The Wearable Motherboard: A Framework for Personalized Mobile Information Processing (PMIP) [p. 170]
Sungmee Park, Kenneth Mackenzie, Sundaresan Jayaraman

Textiles and computing share a synergistic relationship, which is being harnessed to create a new paradigm in personalized mobile information processing (PMIP). In this paper, we provide an overview of this "interconnection" between the two fields and present the vision for "E-Textiles," which represents the convergence of the two fields. We discuss the role of the Georgia Tech Wearable Motherboard in pioneering this paradigm of "fabric is the computer" and serving as a framework for PMIP. Finally, recent research in this area resulting in the realization of a "computational fabric network" is discussed.

11.3 Challenges and Opportunities in Electronic Textiles Modeling and Optimization [p. 175]
Diana Marculescu, Radu Marculescu, Pradeep K. Khosla

This paper addresses an emerging new field of research that combines the strengths and capabilities of electronics and textiles in one: electronic textiles, or e-textiles. E-textiles, also called Smart Fabrics, have not only "wearable" capabilities like any other garment, but also local monitoring and computation, as well as wireless communication capabilities. Sensors and simple computational elements are embedded in e-textiles, as well as built into yarns, with the goal of gathering sensitive information, monitoring vital statistics and sending them remotely (possibly over a wireless channel) for further processing. Possible applications include medical (infant or patient) monitoring, personal information processing systems, or remote monitoring of deployed personnel in military or space applications. We illustrate the challenges imposed by the dual textile/electronics technology on their modeling and optimization methodology.
Categories and Subject Descriptors: I.6 [Simulation and Modeling]: Modeling methodologies; B.8.2 [Performance and reliability]: performance analysis and design aids.
General Terms: Design, Performance


Session 12: PANEL: Analog Intellectual Property: Now? Or Never?

Chair: Stephen Ohr
Organizers: Linda Marchant, Philippe Magarshack
Panelists: Masao Hotta, Mike Brunoli, Felicia James, Rudy Koch, Roy McGuffin, Andrew Moore [p. 181]

There is considerable controversy as to whether or not the trade in analog intellectual property (IP) will ever represent a viable business opportunity. One school of thought suggests that analog design will always be too specialized to constitute a major market; another says this will be big business if certain nagging technical problems are solved. While the demand for analog interface components is high, the ability of IP creators to render it in a tradable format is limited. And the ability of digital design teams to successfully utilize analog IP - without a considerable amount of handholding - is similarly limited. EDA tools here are looked upon as both culpable and as offering the best hope for the future. This panel of experts - representing analog designers, analog EDA tool providers, silicon foundries, and analog IP vendors - will bring their own points of view on some of the business and technology issues that need to be resolved to provide the context for analog IP development and trade. Among the open questions:
- Is IP created and validated within the design environment for reuse a more productive approach than imported IP? How many 'traditional' analog designers would admit they are still using kit parts and breadboards - maybe even SPICE and manual IC layout techniques - in their everyday jobs today?
- What CAD tools are needed to help analog designers? Are newly emerging EDA technologies, designed to enhance analog design productivity, maturing rapidly enough to be accepted by designers?
- What is the future of analog design at the very low voltage swings coming with sub-100nm CMOS?
- Though silicon foundries need to process a wide set of external IP offerings in order to allow their users to build complete systems, are foundries seeing enough activity in analog IP designs to justify specialized fab runs or the kind of process tuning that would allow analog and digital IP to coexist on the same chip?
- How likely - and how soon - can we get to analog IP development and trade?


Session 13: Low-Power System Design

Chair: Giovanni De Micheli
Organizers: Renu Mehra, Enrico Macii
13.1 Task Scheduling and Voltage Selection for Energy Minimization [p. 183]
Yumin Zhang, Xiaobo (Sharon) Hu, Danny Z. Chen

In this paper, we present a two-phase framework that integrates task assignment, ordering, and voltage selection (VS) to minimize the energy consumption of real-time dependent tasks executing on a given number of variable-voltage processors. Task assignment and ordering in the first phase strive to maximize the opportunities that can be exploited for lowering voltage levels during the second phase, i.e., voltage selection. In the second phase, we formulate the VS problem as an Integer Programming (IP) problem and solve it efficiently. Experimental results demonstrate that our framework is very effective in executing tasks at lower voltage levels under different system configurations.
Categories and Subject Descriptors
I.2.8 [Problem Solving, Control Methods, and Search]: Scheduling
General Terms
Algorithms, Design
Keywords
voltage selection, task scheduling
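
A toy version of the second phase: choose one voltage per task so that a chain of dependent tasks meets its deadline at minimum energy, here by brute force instead of an IP solver. The workloads, voltage set, and simplified delay/energy scaling are all assumptions:

```python
import itertools

VOLTS = (1.0, 1.3, 1.6)                 # available supply voltages
tasks = [(10.0,), (6.0,), (8.0,)]       # workload units at the top voltage
DEADLINE = 30.0

def delay(w, v):   # delay grows as the supply drops (simplified scaling)
    return w * (1.6 / v)

def energy(w, v):  # switching energy scales roughly with V^2
    return w * (v / 1.6) ** 2

best = None
for choice in itertools.product(VOLTS, repeat=len(tasks)):
    t = sum(delay(w, v) for (w,), v in zip(tasks, choice))
    if t <= DEADLINE:                   # feasibility: deadline met
        e = sum(energy(w, v) for (w,), v in zip(tasks, choice))
        if best is None or e < best[0]:
            best = (e, choice, t)
print(best)   # (min energy, chosen voltages, schedule length)
```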

13.2 Battery-Conscious Task Sequencing for Portable Devices Including Voltage/Clock Scaling [p. 189]
Daler Rakhmatov, Sarma Vrudhula, Chaitali Chakrabarti

Operation of battery-powered portable systems can no longer be sustained once the battery becomes discharged. Maximizing battery lifetime is a difficult task due to the nonlinearity of battery behavior, which depends on the characteristics of the system load profile. We address the problem of task sequencing, without and with voltage/clock scaling, that shapes the profile so that battery lifetime is maximized. We developed an accurate analytical battery model and validated it with measurements taken on a real lithium-ion battery used in a pocket computer. We use the model as the basis for a unique battery-conscious cost function and utilize its properties to develop several novel algorithms, including insertion of recovery periods and voltage/clock scaling for delay slack distribution.
Categories and Subject Descriptors
J.6.2 [Computer-Aided Engineering]: Computer-Aided Design
General Terms
Algorithms, Performance, Design
Keywords
Battery, modeling, low-power design, scheduling, voltage scaling
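
The qualitative effect such a cost function captures can be demonstrated with a deliberately simplified rate-capacity model (a single exponential recovery term with made-up coefficients, not the authors' validated lithium-ion model): the same work sequenced with the heavy load early, or with a recovery gap inserted, costs the battery less apparent charge:

```python
import math

def apparent_charge(loads, dt=1.0, beta=0.4):
    """Simplified battery cost: each current step costs its delivered charge
    plus an 'unavailable charge' penalty that decays (recovers) over the
    remaining time. A stand-in for diffusion-based analytical models."""
    T = len(loads) * dt
    cost = 0.0
    for k, i in enumerate(loads):
        t_end = (k + 1) * dt
        cost += i * dt                                   # delivered charge
        cost += i * dt * math.exp(-beta * (T - t_end))   # unrecovered penalty
    return cost

heavy_first = [5, 5, 1, 1, 1, 1]
heavy_last  = [1, 1, 1, 1, 5, 5]
with_rest   = [5, 0, 5, 1, 1, 1, 1]   # recovery period inserted via slack
print(apparent_charge(heavy_first))   # lower: battery recovers afterwards
print(apparent_charge(heavy_last))    # highest: no time left to recover
print(apparent_charge(with_rest))     # lowest of the three in this model
```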

13.3 An Energy Saving Strategy Based on Adaptive Loop Parallelization [p. 195]
I. Kadayif, M. Kandemir, M. Karakoy

In this paper, we evaluate an adaptive loop parallelization strategy (i.e., a strategy that allows each loop nest to execute using a different number of processors if doing so is beneficial) and measure the potential energy savings when processors left unused during the execution of a nested loop in a multi-processor-on-a-chip (MPoC) are shut down (i.e., placed into a power-down or sleep state). Our results show that shutting down unused processors can lead to as much as 67% energy savings with up to 17% performance loss in a set of array-intensive applications. We also discuss and evaluate a processor pre-activation strategy based on compile-time analysis of nested loops. Based on our experiments, we conclude that an adaptive loop parallelization strategy combined with idle-processor shutdown and pre-activation can be very effective in reducing energy consumption without increasing execution time.
Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors - Compilers, Optimization
General Terms
Design, Experimentation, Performance
Keywords
Adaptive Parallelization, Multiprocessing, Energy Consumption


Session 14: Fabric-Driven Logic Synthesis

Chair: Maciej Ciesielski
Organizers: Malgorzata Marek-Sadowska, Steven Nowick
14.1 River PLAs: A Regular Circuit Structure [p. 201]
Fan Mo, Robert K. Brayton

A regular circuit structure called a River PLA and its reconfigurable version, the Glacier PLA, are presented. River PLAs provide greater regularity than circuits implemented with standard cells. Conventional optimization stages such as technology mapping, placement, and routing are eliminated. These two features make the River PLA a highly predictable structure. Glacier PLAs can be an alternative to FPGAs, but with a simpler and more efficient design methodology.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - Automatic synthesis.
General Terms
Algorithms.
Keywords
Programmable Logic Array, River routing.

14.2 WITHDRAWN

14.3 Layout-Aware Synthesis of Arithmetic Circuits [p. 207]
Junhyung Um, Taewhan Kim

In deep-submicron (DSM) technology, wires are as important as, or more important than, logic components, since wire-related problems such as crosstalk and noise are critical in system-on-chip (SoC) design. Recently, a method [12] was proposed for generating a timing-optimal partial product reduction tree (PPRT) using bit-level adders to implement arithmetic circuits, and it outperforms the current best designs. However, in conventional approaches including [12], interconnects are not primary optimization targets in the synthesis of arithmetic circuits, mainly because of high integration complexity or unpredictable wire effects; the result is unsatisfactory layouts with long, tangled wire connections. To overcome this limitation, we propose a new module generation/synthesis algorithm for arithmetic circuits utilizing carry-save-adder (CSA) modules, which not only optimizes circuit timing but also generates a much more regular interconnect topology in the final circuits. Specifically, we propose a two-step algorithm: (Phase 1: CSA module generation) an optimal-timing CSA module generation algorithm for an arithmetic expression under a general CSA timing model; (Phase 2: Bit-level interconnect refinement) an optimal refinement of the interconnects between the CSA modules that retains the global CSA-tree structure produced by Phase 1. We show that the timing of the circuits produced by our approach equals or comes very close to that of [12] in most test cases (even without including interconnect delay), while the interconnects in the layout are significantly shorter and more regular.
Categories and Subject Descriptors
B.2.4 [Arithmetic and Logic Structures]: High-Speed Arithmetic - Algorithms, Cost/Performance
General Terms: Algorithms, Design and Performance
Keywords: Carry-save-adder, layout, high performance


Session 15: Memory Management and Address Optimization in Embedded Systems

Chair: Nikil Dutt
Organizers: Diederik Verkest, Luca Benini
15.1 Automatic Data Migration for Reducing Energy Consumption in Multi-Bank Memory Systems [p. 213]
V. De La Luz, M. Kandemir, I. Kolcu

An architectural solution to reducing memory energy consumption is to adopt a multi-bank memory system instead of a monolithic (single-bank) memory system. Some recent multi-bank memory architectures help reduce memory energy by allowing an unused bank to be placed into a low-power operating mode. This paper describes an automatic data migration strategy which dynamically places the arrays with temporal affinity into the same set of banks. This strategy increases the number of banks which can be put into low-power modes and allows the use of more aggressive energy saving modes. Experiments using several array-dominated applications show the usefulness of data migration and indicate that large energy savings can be achieved with low overhead.
Categories and Subject Descriptors
B.3 [Hardware]: Memory Structures
General Terms
Design, Experimentation, Performance
Keywords
Energy Consumption, Multi-Bank Memories, Data Migration

15.2 Exploiting Shared Scratch Pad Memory Space in Embedded Multiprocessor Systems [p. 219]
Mahmut Kandemir, J. Ramanujam, A. Choudhary

In this paper, we present a compiler strategy to optimize data accesses in regular array-intensive applications running on embedded multiprocessor environments. Specifically, we propose an optimization algorithm that targets the reduction of extra off-chip memory accesses caused by inter-processor communication. This is achieved by increasing the application-wide reuse of data that resides in the scratch-pad memories of processors. Our experimental results obtained on four array-intensive image processing applications indicate that exploiting inter-processor data sharing can reduce the energy-delay product by as much as 33.8% (and 24.3% on average) on a four-processor embedded system. The results also show that the proposed strategy is robust in the sense that it gives consistently good results over a wide range of several architectural parameters.
Categories and Subject Descriptors B.3 [Hardware] Memory Structures; D.3.4 [Software] Programming Languages: Processors [Compilers]
General Terms: Algorithms, Management, Performance.
Keywords Embedded multiprocessors, energy consumption, scratch pad memories, access patterns, compiler optimizations, data tiles.

15.3 Address Assignment Combined with Scheduling in DSP Code Generation [p. 225]
Yoonseo Choi, Taewhan Kim

One of the important issues in embedded system design is to optimize the program code that will be stored in ROM for the target microprocessor. In this paper, we propose an integrated approach to the DSP address code generation problem that minimizes the number of addressing instructions. Unlike previous works, in which code scheduling and offset assignment are performed sequentially without any interaction between them, our work tightly couples the offset assignment problem with code scheduling to exploit scheduling more effectively in minimizing addressing instructions. We accomplish this by developing a fast but accurate two-phase procedure which, for a sequence of code schedules, finds a sequence of memory layouts with minimum addressing instructions. Experimental results with benchmark DSP programs show improvements of 13%-33% in address code size over Solve-SOA/GOA [7].
Categories and Subject Descriptors
C.3 [Special-purpose and application-based systems]: [Signal processing systems]
General Terms
Algorithms, Performance
Keywords
Offset assignment, Scheduling, Code Generation
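
The cost being minimized is concrete: with auto-increment/decrement addressing, an access costs an extra address-register instruction exactly when the next variable is not adjacent to the current one in the memory layout. A sketch of this cost function plus a naive layout search (the scheduling/layout coupling the paper optimizes is not shown; the access sequence is a made-up example):

```python
import itertools

def soa_cost(layout, access_seq):
    """Count explicit address-register loads: an access is free when
    consecutive variables are neighbors in memory, since the address
    register reaches them by auto-increment/decrement."""
    pos = {v: i for i, v in enumerate(layout)}
    return sum(1 for a, b in zip(access_seq, access_seq[1:])
               if abs(pos[a] - pos[b]) > 1)

seq = list("abdacdbcad")                 # variable access sequence from code
best = min(itertools.permutations("abcd"),
           key=lambda lay: soa_cost(lay, seq))
print(best, soa_cost(best, seq))                      # optimized layout
print(tuple("abcd"), soa_cost(tuple("abcd"), seq))    # naive layout, contrast
```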


Session 16: Special Session: Optics: Lighting the Way to EDA Riches?

Chair: Jaijeet Roychowdhury
Organizers: Jaijeet Roychowdhury, Joel R. Phillips
16.1 Multifunctional Photonic Integration for the Agile Optical Internet [p. 231]
Edward H. Sargent

An agile, transparent optical network is emerging. This paper enumerates the functions that will be needed at nodes in order to add transparency and agility to the network while robustly assuring optical fibre channel performance. One key requirement will be scalable integration of multiple functions on a platform. The paper presents recent results along one strategic axis for integration: the use of functionalized self-organized photonic crystals, and heterostructures thereof, to control the flow and features of light.
Categories and Subject Descriptors
C.2.1 [Network Architecture and Design]: Enabling technologies.
General Terms
Experimentation, Theory.
Keywords
Agile optical networks, reconfigurability, optical performance monitoring, optoelectronic integration, photonic crystals, optical nonlinearity, electro-optics, optical polymers, semiconductor nanocrystals.

16.2 Computer Aided Design of Long-Haul Optical Transmission Systems [p. 235]
James G. Maloney, Brian E. Brewington, Curtis R. Menyuk

We present a general overview of the role of computer models in the design and optimization of commercial optical transmission systems. Specifically, we discuss (1) the role of modeling in a commercial setting, (2) achieving the proper balance between accuracy and computation speed, (3) model verification against experiment, and (4) case studies demonstrating the benefits of modeling. Ideally, experiments are preferable to models when describing system performance, particularly to support claims of a system's functionality to a customer. However, modeling is often the only choice for many of the problems that a commercial networking company must solve. Because there are design parameter spaces that are either too expensive or too time-consuming to verify experimentally, the main role of modeling in industry is to study what experiments cannot. For example, when an analytical solution of a statistical problem is infeasible, a common modeling solution is to perform Monte Carlo trials to study the statistical behavior. Another typical modeling task involves looking at variations of hardware that would be prohibitively expensive to acquire and test. Implementing modeling in industry involves a balance between three needs: cost-efficiency, time-efficiency, and accuracy. We will discuss the approaches we have taken at PhotonEx to meet these needs: leveraging academic research, developing "reduced models," and utilizing computational clusters. Specifically, we will use case studies to illustrate the application of these approaches to modeling long-haul optical transmission systems.
Categories and Subject Descriptors
J.6 [Computer Applications]: Computer-Aided Engineering - computer-aided design (CAD) and computer-aided manufacturing (CAM).
General Terms
Algorithms, Measurement, Performance, Design, Experimentation, Theory, Verification
Keywords
Optical Communication, Long-Haul (LH) Transmission, Ultra-Long Haul (ULH) Transmission, Optical Modeling

16.3 A Fast Optical Propagation Technique for Modeling Micro-Optical Systems [p. 236]
Timothy P. Kurzweg, Steven P. Levitan, Jose A. Martinez, Mark Kahrs, Donald M. Chiarulli

As designers become more aggressive in introducing optical components to micro-systems, rigorous optical models are required for system-level simulation tools. Common optical modeling techniques and approximations are not valid for most optical micro-systems, and those techniques that provide accurate simulation are computationally slow. In this paper, we introduce an angular frequency optical propagation technique that greatly reduces computation time while achieving the accuracy of a full scalar formulation. We present simulations of a diffractive optical MEM Grating Light Valve to show the advantages of this optical propagation method and the integration of the technique into a system-level multi-domain CAD tool.
Categories and Subject Descriptors
I.6.5 [Simulation and Modeling]: Model Development - modeling methodologies
General Terms
Algorithms, Design
Keywords
Optical Propagation, Angular Spectrum, CAD, Optical Micro-systems, Optical MEMS
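
The core of the scalar angular-spectrum idea fits in a few lines (a standard textbook formulation, assumed here rather than taken from the paper): decompose the field into plane waves with an FFT, multiply each component by its propagation phase, and transform back.

    # Sketch: scalar angular-spectrum propagation over a distance z.
    import numpy as np

    def angular_spectrum_propagate(field, dx, wavelength, z):
        n = field.shape[0]
        k = 2 * np.pi / wavelength
        fx = np.fft.fftfreq(n, d=dx)              # spatial frequencies (1/m)
        FX, FY = np.meshgrid(fx, fx)
        kz_sq = k**2 - (2 * np.pi * FX)**2 - (2 * np.pi * FY)**2
        kz = np.sqrt(np.maximum(kz_sq, 0.0))      # propagating components only
        H = np.exp(1j * kz * z) * (kz_sq > 0)     # evanescent waves dropped
        return np.fft.ifft2(np.fft.fft2(field) * H)

    # 16 um square aperture on a 256x256 grid with 1 um pixels
    aperture = np.zeros((256, 256), dtype=complex)
    aperture[120:136, 120:136] = 1.0
    out = angular_spectrum_propagate(aperture, dx=1e-6,
                                     wavelength=0.633e-6, z=100e-6)

Because propagation reduces to two FFTs and a pointwise multiply, each step costs O(n^2 log n), which is one reason formulations in this family run far faster than rigorous full-wave methods.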


Session 17: PANEL: Nanometer Design: What Hurts Next...?

Chair: Lawrence T. Pileggi
Organizers: Rob A. Rutenbar, Andrew B. Kahng
Panelists: Bob Brodersen, Anthony Hill, John Kibarian, Desmond A. Kirkpatrick, Mitsumasa Koyanagi, Mark Lavin [p. 242]

Every year, the design and EDA communities are besieged by dire warnings about the impending doom of "design as we know it." Every year, another unpleasant physical effect from the evil depths of deep submicron physics surfaces, compromising our designs in new and vile ways. Every year, the same story: more nanometer woes. Rather than endorse a new winner in this year's race for the "next worst thing" from the nanometer arena, this panel gathers a set of world-class technology experts to debate what effects are hiding just around the next corner, waiting to pounce on the unwary tool or chip designer. Which among these is really the most important, when will it happen, and why?


Session 18: Novel DFT, BIST and Diagnosis Techniques

Chair: Rathish Jayabharathi
Organizers: Kwang-Ting (Tim) Cheng, T. M. Mak
18.1 Low-Cost Sequential ATPG with Clock-Control DFT [p. 243]
Miron Abramovici, Xiaoming Yu, Elizabeth M. Rudnick

We present a new clock-control DFT technique for sequential circuits, based on clock partitioning and selective clock freezing, and we use it to break the global feedback loops and to generate clock waves to test the resulting sequential circuit with self-loops. Clock waves allow us to significantly reduce the complexity of sequential ATPG. Unlike scan, our non-intrusive DFT technique does not introduce any delay penalty; the generated tests may be applied at speed, have shorter application time, and dissipate less power.
Categories and Subject Descriptors: B.8.1 [Performance and Reliability]: Reliability, Testing, and Fault-Tolerance
General Terms: Algorithms, Design, Reliability

18.2 Effective Diagnostics through Interval Unloads in a BIST Environment [p. 249]
Peter Wohl, John A. Waicukauski, Sanjay Patel, Greg Maston

Logic built-in self-test (BIST) is increasingly being adopted to improve test quality and reduce test costs for rapidly growing designs. Compared to deterministic automated test pattern generation (ATPG), BIST presents inherent fault diagnostic challenges. Previous diagnostic techniques have been limited in their diagnosis resolution and/or require significant hardware overhead. This paper proposes an interval-based scan-unload method that ensures diagnosis resolution down to gate-level faults with minimal hardware overhead. Tester fail-data collection is based on a novel construct incorporated into the design extensions of the standard test interface language (STIL). The implementation of the proposed method is presented and analyzed.
Categories and Subject Descriptors: B.8.1 [Performance and Reliability]: Reliability, Testing and Fault-Tolerance.
General Terms: Algorithms, Design.
Keywords: built-in self-test (BIST), fault diagnosis.

18.3 On Output Response Compression in the Presence of Unknown Output Values [p. 255]
Irith Pomeranz, Sandip Kundu, Sudhakar M. Reddy

A circuit may produce unknown output values during simulation of an input sequence due to an unknown initial state or due to the existence of tri-state elements. For circuits tested using BIST, unknown output values make it impossible to determine a single unique signature for the fault-free circuit. To accommodate unknown output values in a BIST scheme, we describe a procedure for synthesizing a minimal logic block that replaces unknown output values by a known constant. The proposed procedure ensures that the BIST scheme will be able to detect all the faults detectable by the input sequence applied to the circuit while allowing a single unique signature to be obtained.
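
A minimal bit-level sketch of the masking idea (hypothetical model, not the paper's synthesis procedure): force every output position known to be X to a constant before the response enters the signature register.

    # Sketch: replace unknown (X) output values by constant 0
    # before signature compaction. A response is (value_bits, x_bits),
    # where a set bit in x_bits marks an unknown position.
    def mask_unknowns(value_bits, x_bits):
        return value_bits & ~x_bits          # X positions forced to 0

    def misr_update(signature, response, width=16, taps=0b1011):
        # toy Galois-style MISR: shift, feed back taps, fold in response
        feedback = taps if (signature >> (width - 1)) & 1 else 0
        return (((signature << 1) & ((1 << width) - 1)) ^ feedback) ^ response

    sig = 0
    for value, xbits in [(0b1010, 0b0010), (0b0111, 0b0000)]:
        sig = misr_update(sig, mask_unknowns(value, xbits))
    print(f"signature: {sig:04x}")           # deterministic despite the X

The paper's contribution is synthesizing a minimal logic block that performs this replacement in hardware without sacrificing fault coverage.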

18.4 Software-Based Diagnosis for Processors [p. 259]
Li Chen, Sujit Dey

Software-based self-test (SBST) is emerging as a promising technology for enabling at-speed test of high-speed microprocessors using low-cost testers. We explore the fault diagnosis capability of SBST, in which functional information can be used to guide and facilitate the generation of diagnostic tests. By using a large number of carefully constructed diagnostic test programs, the fault universe can be divided into fine-grained partitions, each corresponding to a unique pass/fail pattern. We evaluate the quality of diagnosis by constructing diagnostic-tree-based fault dictionaries. We demonstrate the feasibility of the proposed method by applying it to a processor example. Experimental results show its potential as an effective method for diagnosing larger processors.
Categories and Subject Descriptors
B.8.1 [Performance and Reliability]: Reliability, Testing, and Fault-Tolerance.
General Terms
Algorithms, Measurement, Reliability, Experimentation.
Keywords
Microprocessor, self-test, instruction, diagnostics.
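
The pass/fail partitioning is easy to picture in code (illustrative data structures, not the authors' implementation): faults that produce the same tuple of program outcomes fall into the same diagnostic class, and finer partitions mean better resolution.

    # Sketch: partition a fault universe by pass/fail signatures.
    # results[f] is the per-program fail flag list for fault f.
    from collections import defaultdict

    def partition_faults(results):
        classes = defaultdict(list)
        for fault, outcomes in results.items():
            classes[tuple(outcomes)].append(fault)
        return classes

    results = {
        "f1": [True, False, True],
        "f2": [True, False, True],   # indistinguishable from f1
        "f3": [False, True, True],
    }
    for signature, faults in partition_faults(results).items():
        print(signature, "->", faults)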


Session 19: Case Studies in Embedded System Design

Chair: Wayne Wolf
Organizers: Anand Raghunathan, Xiabo (Sharon) Hu
19.1 Design of a High-Throughput Low-Power IS95 Viterbi Decoder [p. 263]
Xun Liu, Marios C. Papaefthymiou

The design of high-throughput large-state Viterbi decoders relies on the use of multiple arithmetic units. The global communication channels among these parallel processors often consist of long interconnect wires, resulting in large area and high power consumption. In this paper, we propose a data-transfer oriented design methodology to implement a low-power 256-state rate-1/3 IS95 Viterbi decoder. Our architectural level scheme uses operation partitioning, packing, and scheduling to analyze and optimize interconnect effects in early design stages. In comparison with other published Viterbi decoders, our approach reduces the global data transfers by up to 75% and decreases the number of global buses by up to 48%, while enabling the use of deeply pipelined datapaths with no data forwarding. In the RTL implementation of the individual processors, we apply precomputation in conjunction with saturation arithmetic to further reduce power dissipation with provably no coding performance degradation. Designed using a 0.25 µm standard cell library, our decoder achieves a throughput of 20 Mbps in simulation and dissipates only 450 mW.
Categories and Subject Descriptors
B.7.1 [Integrated Circuits]: Types and Design Styles - Algorithms implemented in hardware
General Terms
Design, Performance
Keywords
Communications, Pipelining, Bus reduction

19.2 A Detailed Cost Model for Concurrent Use With Hardware/Software Co-Design [p. 269]
Daniel Ragan, Peter Sandborn, Paul Stoaks

Hardware/software co-design methodologies generally focus on the prediction of system performance or co-verification of system functionality. This study extends this conventional focus through the development of a methodology and software tool that evaluates system (hardware and software) development, fabrication, and testing costs (dollar costs) concurrently with hardware/software partitioning in a co-design environment. Based on the determination of key metrics such as gate count and lines of software, a new tool called Ghost evaluates software and hardware development, fabrication, packaging, and testing costs. Ghost enables optimization of hardware/software partitioning as a function of specific combinations of hardware foundries and software development environments.
Categories and Subject Descriptors
E3 [HW/SW co-design]: specification, modeling, co-simulation and performance analysis, system-level scheduling and partitioning.
General Terms
Design, Economics.
Keywords
Cost Modeling, Cost-Performance Trade-off.

19.3 Efficient Code Synthesis from Extended Dataflow Graphs for Multimedia Applications [p. 275]
Hyunok Oh, Soonhoi Ha

This paper presents efficient automatic code synthesis techniques from dataflow graphs for multimedia applications. Since multimedia applications require large buffers containing composite-type data, we aim to reduce the buffer sizes with a fractional rate dataflow (FRDF) extension and a buffer sharing technique. In an H.263 encoder experiment, the FRDF extension and buffer sharing technique enable us to reduce the buffer size by 67%. The final buffer size is no larger than that of a manually written reference code.
Keywords
memory optimization, software synthesis, multimedia, dataflow


Session 20: Theoretical Foundations of Embedded System Design

Chair: Rajesh Gupta
Organizers: Annette Reutter, Donatella Sciuto
20.1 Transformation Based Communication and Clock Domain Refinement for System Design [p. 281]
Ingo Sander, Axel Jantsch

The ForSyDe methodology has been developed for system level design. In this paper we present formal transformation methods for the refinement of an abstract and formal system model into an implementation model. The methodology defines two classes of design transformations: (1) semantic-preserving transformations and (2) design decisions. In particular we present and illustrate communication and clock domain refinement by way of a digital equalizer system.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design-Aids; J.6 [Computer-Aided Engineering]: Computer-Aided Design (CAD)
General Terms
Design, Theory
Keywords
System Design, System Modeling, Design Refinement

20.2 Model Composition for Scheduling Analysis in Platform Design [p. 287]
Kai Richter, Dirk Ziegenbein, Marek Jersak, Rolf Ernst

Categories and Subject Descriptors
C.0 [General]: System Architecture; C.3 [Computer Systems Organization]: Special-Purpose and Application-Based Systems - realtime and embedded systems; C.4 [Computer Systems Organization]: Performance of Systems
General Terms
Algorithms, Performance, Verification
Keywords
Platform-Based Design, Performance Analysis, Scheduling, Formal Analysis

20.3 Timed Compiled-Code Simulation of Embedded Software for Performance Analysis of SOC Design [p. 293]
Jong-Yeol Lee, In-Cheol Park

In this paper, a new timing generation method is proposed for the performance analysis of embedded software. The time stamp generation of I/O accesses is crucial to performance estimation and architecture exploration in the timed functional simulation, which simulates the whole design at a functional level with timing. A portable compiler is modified to generate time-deltas, which are the estimated cycle counts between two adjacent I/O accesses, by counting the cycles of the intermediate representation (IR) operations and using a machine description that contains information on a target processor. Since the proposed method is based on the machine-independent IR of a compiler, the method can be applied to various processors by changing the machine description. The experimental results show that the proposed method is effective in that the average estimation error is about 2% and the maximum speed-up over the corresponding instruction-set simulators is about 300 times. The proposed method is also verified in a timed functional simulation environment.
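
In spirit, the proposed timing generation reduces to accumulating per-operation cycle estimates between I/O accesses; the sketch below (hypothetical IR and machine description) shows the bookkeeping the modified compiler performs.

    # Sketch: emit time-deltas (estimated cycle counts) at I/O accesses.
    # The machine description maps IR opcodes to target-processor cycles;
    # retargeting means swapping this table, not changing the method.
    machine_desc = {"add": 1, "mul": 3, "load": 2, "store": 2, "io": 1}

    def time_deltas(ir_trace):
        deltas, cycles = [], 0
        for op in ir_trace:
            cycles += machine_desc[op]
            if op == "io":
                deltas.append(cycles)    # cycles since the previous I/O access
                cycles = 0
        return deltas

    trace = ["load", "add", "mul", "io", "add", "store", "io"]
    print(time_deltas(trace))            # [7, 4]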


Session 21: Equivalence Verification

Chair: Ziyad Hanna
Organizer: Shin-ichi Minato
21.1 Automated Equivalence Checking of Switch Level Circuits [p. 299]
Simon Jolly, Atanas Parashkevov, Tim McDougall

A chip that is required to meet strict operating criteria in terms of speed, power, or area is commonly custom designed at the switch level. Traditional techniques for verifying these designs, based on simulation, are expensive in terms of resources and cannot completely guarantee correct operation. Formal verification methods, on the other hand, provide a complete proof of correctness and require less effort to set up. This paper presents Motorola's Switch Level Verification (SLV) tool, which employs detailed switch level analysis to model the behavior of MOS transistors and obtain an equivalent RTL model. This tool has been used for equivalence checking at the switch level for several years within Motorola for the PowerPC, M*Core and DSP custom blocks. We focus on the novel techniques employed in SLV, particularly in the areas of pre-charged and sequential logic analysis, and provide details on the automated and integrated equivalence checking flow in which the tool is used.
Categories and Subject Descriptors
J.6 [Computer-Aided Engineering]: Computer-Aided Design.
General Terms
Algorithms, Design, Verification.
Keywords
Custom design, switch level analysis, equivalence checking, formal verification, MOS circuits, VLSI design.

21.2 A Practical and Efficient Method for Compare-Point Matching [p. 305]
Demos Anastasakis, Robert Damiano, Hi-Keung Tony Ma, Ted Stanion

An important step in using combinational equivalence checkers to verify sequential designs is identifying and matching corresponding compare-points in the two sequential designs to be verified. Both non-function-based and function-based matching methods are usually employed in commercial verification tools. In this paper, we describe a heuristic algorithm using ATPG for matching compare-points based on the functionality of the combinational blocks in the sequential designs. Results on industrial-sized circuits show our methods are both practical and efficient.
Categories and Subject Descriptors
J.6 [Computer-aided engineering]: Verification, compare-point matching.
General Terms
Algorithms, Experimentation, Verification.
Keywords
Combinational verification, equivalence checking, latch mapping.

21.3 Self-referential Verification of Gate-level Implementations of Arithmetic Circuits [p. 311]
Ying-Tsai Chang, Kwang-Ting (Tim) Cheng

Verification of gate-level implementations of arithmetic circuits is challenging for a number of reasons: the existence of some hard-to-verify arithmetic operators (e.g. multiplication), the use of different operand orderings, the incorporation of merged arithmetic with cross-operator implementations, and the employment of circuit transformations based on arithmetic relations. It is hence a peculiar problem that does not fit well into the existing RTL-to-gate equivalence checking methodology. In this paper, we propose a self-referential functional verification approach which uses the gate-level implementation of the arithmetic circuit under verification to verify itself. Specifically, the verification task is decomposed into a sequence of equivalence checking subproblems, each of which compares circuit pairs derived from the implementation under verification based on the proposed self-referential functional equations. A decomposition-based heuristic using structural information is employed to guide the verification process for better efficiency. Experimental results on a number of implementations of multiply-add units and inner product units with different architectures demonstrate the versatility of this approach.
Categories and Subject Descriptors
B.5.2 [Register-Transfer-Level Implementation]: Design Aids - verification
General Terms
Algorithm, Verification
Keywords
Arithmetic circuit verification


Session 22: PANEL: Whither (or Wither?) ASIC Handoff?

Chair: Michael Santarini
Organizers: Sudhakar Jilla, Mark Miller
Panelists: Tommy Eng, Sandeep Khanna, Kamalesh Ruparel, Tom Russell, Kazu Yamada [p. 317]

The traditional ASIC netlist handoff is changing - but to what? Is RTL handoff finally a reality? Or, will a placement-based handoff model emerge? Are differences among underlying tool technologies and methodologies only cosmetic? Or, are there fundamental business and IP distinctions? These and other questions will be discussed as the panel examines the future of the designer - ASIC vendor - EDA vendor relationship.


Session 23: Embedded Software Automation: From Specification to Binary

Chair: Joerg Henkel
Organizers: Marco Di Natale, Xiaobo (Sharon) Hu
23.1 Software Synthesis from Synchronous Specifications Using Logic Simulation Techniques [p. 319]
Yunjian Jiang, Robert K. Brayton

This paper addresses the problem of automatic generation of implementation software from high-level functional specifications in the context of embedded system-on-chip designs. Software design complexity for embedded systems has increased so much that a high-level functional programming paradigm needs to be adopted for formal verifiability, maintainability, and short time-to-market. We propose a framework for efficiently generating implementation software from a synchronous state machine specification for embedded control systems. The framework is generic enough to allow hardware/software partitioning for a given architecture platform. It is demonstrated that logic optimization and simulation techniques can be combined to produce fast execution code for such embedded systems. Specifically, we propose a framework for software synthesis from multi-valued logic, including fast evaluation of logic functions, and scheduling techniques for node execution. Experiments are performed to show the initial results of our algorithms in this framework.

23.2 Complex Library Mapping for Embedded Software Using Symbolic Algebra [p. 325]
Armita Peymandoust, Tajana Simunic, Giovanni De Micheli

Embedded software designers often use libraries that have been pre-optimized for a given processor to achieve higher code quality. However, using such libraries in legacy code optimization is nontrivial and typically requires manual intervention. This paper presents a methodology that maps algorithmic constructs of the software specification to a library of complex software elements. This library-mapping step is automated by using symbolic algebra techniques. We illustrate the advantages of our methodology by optimizing an algorithmic-level description of an MPEG Layer III (MP3) audio decoder for the Badge4 [2] portable embedded system. During the optimization process we use commercially available libraries with complex elements ranging from simple mathematical functions such as exp to the IDCT routine. We implemented and measured the performance and energy consumption of the MP3 decoder software on Badge4 running the embedded Linux operating system. The optimized MP3 audio decoder runs 300 times faster than the original code obtained from the standards body while consuming 400 times less energy. Since our optimized MP3 decoder runs 3.5 times faster than real-time, additional energy can be saved by using processor frequency and voltage scaling.
Categories and Subject Descriptors
C.3 [Special-Purpose and Application-Based Systems]: Microprocessor/microcomputer applications, Real-time and embedded systems, Signal processing systems.
General Terms
Algorithms, Performance, Design, Experimentation, Theory.
Keywords
Embedded software optimization, Automated library mapping, Symbolic algebra, Polynomial representation, Computation intensive software.

23.3 Retargetable Binary Utilities [p. 331]
Maghsoud Abbaspour, Jianwen Zhu

Since software is playing an increasingly important role in systems-on-chip, retargetable compilation has been an active research area in the last few years. However, the retargeting of equally important downstream system tools, such as assemblers, linkers and debuggers, has either been ignored or falls short of meeting the requirements of modern programming languages and operating systems. In this paper, we present techniques that can automatically retarget the GNU binutils tool kit, which contains a large array of production-quality downstream tools. Besides having all the advantages enjoyed by open-source software through alignment with a de facto standard, our techniques are systematic, as a result of using a formal model of the instruction set architecture (ISA) and application binary interface (ABI), and simple, as a result of leveraging free software to the largest extent.
Categories and Subject Descriptors
D.3.4 [Processors]: Retargetable compilers
General Terms
Design, Languages


Session 24: Applications of Reconfigurable Computing

Chair: Ivo Bolsens
Organizers: Grant E. Martin, Kurt Keutzer
24.1 Exploiting Operation Level Parallelism through Dynamically Reconfigurable Datapaths [p. 337]
Zhining Huang, Sharad Malik

Increasing non-recurring engineering (NRE) and mask costs are making it harder to turn to hardwired Application Specific Integrated Circuit (ASIC) solutions for high performance applications [12]. The volume required to amortize these high costs has been increasing, making it increasingly expensive to afford ASIC solutions for medium volume products. This has led to designers seeking programmable solutions of varying sorts using these so-called programmable platforms. These programmable platforms span a large range from bit-level programmable Field Programmable Gate Arrays (FPGAs), to word-level programmable application-specific, and in some cases even general-purpose processors. The programmability comes with a power and performance overhead. Attempts to reduce this overhead typically involve making some core hardwired ASIC like logic blocks accessible to the programmable elements. This paper presents one such hybrid solution in this space - a relatively simple processor with a dynamically reconfigurable datapath acting as an accelerating co-processor. This datapath consists of hardwired function units and reconfigurable interconnect. We present a methodology for the design of these solutions and illustrate it with two complete case studies: an MPEG 2 coder, and a GSM coder, to show how significant speedups can be obtained using relatively little hardware. The co-processor can be viewed as a VLIW processor with a single instruction per kernel loop. We compare the efficiency of exploiting the operation level parallelism using classic VLIW processors and this proposed class of dynamically configurable co-processors. This work is part of the MESCAL project, which is geared towards developing design environments for the development of application specific platforms.
Categories and Subject Descriptors
J.6 [COMPUTER-AIDED ENGINEERING]: Computer-aided design (CAD)
General Terms
Design, Performance

24.2 Dynamic Hardware Plugins in an FPGA with Partial Run-time Reconfiguration [p. 343]
Edson L. Horta, John W. Lockwood, David E. Taylor, David Parlour

Tools and a design methodology have been developed to support partial run-time reconfiguration of FPGA logic on the Field Programmable Port Extender. High-speed Internet packet processing circuits on this platform are implemented as Dynamic Hardware Plugin (DHP) modules that fit within a specific region of an FPGA device. The PARBIT tool has been developed to transform and restructure bitfiles created by standard computer aided design tools into partial bitstreams that program DHPs. The methodology allows the platform to hot-swap application-specific DHP modules without disturbing the operation of the rest of the system.
Keywords
FPGA, partial RTR, reconfiguration, hardware, modularity, network, routing, packet, Internet, IP, platform computing
Categories and Subject Descriptors
B.7.2 [Hardware]: Circuits - Design Aids; B.7.1 [Hardware]: Circuits - VLSI; B.4.3 [Hardware]: Input/Output and Data Communications - Interconnections (Subsystems); C.2.1 [Computer Systems Organization]: Computer-Communication Networks - Network Architecture and Design
General Terms
Design, Experimentation

24.3 A Reconfigurable FPGA-Based Readback Signal Generator For Hard-Drive Read Channel Simulator [p. 349]
Jinghuan Chen, Jaekyun Moon, Kia Bazargan

A hard disk readback signal generator designed to provide noise-corrupted signals to a channel simulator has been implemented on a Xilinx Virtex-E FPGA device. The generator simulates pulses sensed by read heads in hard drives. All major distortion and noise processes, such as intersymbol interference, transition noise, electronics noise, head and media nonlinearity, intertrack interference, and write timing error, can be generated according to the statistics and parameters defined by the user. Reconfigurable implementation enables an update of the signal characteristics at runtime. The user also has the flexibility to choose from a set of bitstreams to simulate particular combinations of noise and distortion. Such customized restructuring helps reduce the area consumption and hence effectively increase the capacity of the FPGA device. The time to generate the readback signals has been reduced by four orders of magnitude compared to its software counterpart.


Session 25: New Test Methods Targeting Non-Classical Faults

Chair: Rob Aitken
Organizers: Miron Abramovici, T. M. Mak
25.1 Embedded Software-Based Self-Testing for SoC Design [p. 355]
A. Krstic, W.-C. Lai, L. Chen, K.-T. Cheng, S. Dey

At-speed testing of high-speed circuits is becoming increasingly difficult with external testers due to the growing gap between design and tester performance, the growing cost of high-performance testers, and increasing yield loss caused by inherent tester inaccuracy. Therefore, empowering the chip to test itself seems like a natural solution. Hardware-based self-testing techniques have limitations due to performance and area overhead and problems caused by the application of non-functional patterns. Embedded software-based self-testing has recently become the focus of intense research. In this methodology, the programmable cores are used for on-chip test generation, measurement, response analysis and even diagnosis. After the programmable core on a System-on-Chip (SoC) has been self-tested, it can be reused for testing on-chip buses, interfaces and other non-programmable cores. The advantages of this methodology include at-speed testing, low design-for-testability overhead and application of functional patterns in the functional environment. In this paper, we give a survey and outline the roadmap and challenges of this emerging embedded software-based self-testing paradigm.
Categories and Subject Descriptors
B.8.1 [Integrated Circuits]: Performance and Reliability - reliability, testing and fault-tolerance.
General Terms
Algorithms, Performance, Reliability.
Keywords
VLSI test, SoC test, functional test, microprocessor test.

25.2 A Novel Wavelet Transform Based Transient Current Analysis for Fault Detection and Localization [p. 361]
Swarup Bhunia, Kaushik Roy, Jaume Segura

Transient current (IDD) based testing has often been cited and investigated as an alternative and/or supplement to quiescent current (IDDQ) testing. While the potential of IDD testing for fault detection has been established, there is no known efficient method for fault diagnosis using IDD analysis. In this paper, we present a novel integrated method for fault detection and localization using wavelet transform based IDD waveform analysis. The time-frequency resolution property of the wavelet transform helps us detect as well as localize faults in digital CMOS circuits. Experiments performed on measured data from a fabricated 8-bit shift register and simulation data from more complex circuits show promising results for both detection and localization. The wavelet-based detection method shows better sensitivity than spectral and time-domain methods. The effectiveness of the localization method in the presence of process variation, measurement noise, and complex power supply networks is addressed.
Categories and Subject Descriptors
B.8.2 [Hardware]: Performance and Reliability - Reliability, Testing, and Fault-Tolerance
General Terms
Algorithms, Reliability, Experimentation
Keywords
Transient current (IDD), wavelet transform, fault localization
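
For intuition about the time-frequency localization claim, a discrete wavelet decomposition of a supply-current waveform (using the PyWavelets package; the waveform and glitch are synthetic, not the authors' measurements) pins a sharp anomaly down to a small sample range:

    # Sketch: locate a current glitch via fine-scale wavelet details.
    import numpy as np
    import pywt   # PyWavelets

    t = np.linspace(0, 1, 1024)
    idd = np.exp(-((t - 0.3) / 0.05) ** 2)      # smooth switching burst
    idd[700:710] += 0.4                          # fault-induced glitch

    coeffs = pywt.wavedec(idd, "db4", level=4)   # [cA4, cD4, ..., cD1]
    d1 = coeffs[-1]                              # finest-scale details
    peak = int(np.argmax(np.abs(d1)))
    print("largest fine-scale detail near sample", 2 * peak)  # ~700

A purely spectral method would flag the anomalous frequency content but lose the time location; the detail coefficients keep both.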

25.3 Signal Integrity Fault Analysis Using Reduced-Order Modeling [p. 367]
Amir Attarha, Mehrdad Nourani

This paper aims at the analysis of signal integrity for the purpose of testing high-speed interconnects. This requires taking into account the effect of inputs as well as the parasitic RLC elements of the interconnect. To improve the analysis/simulation time in integrity fault testing, we use reduced-order modeling, which essentially performs the analysis in the frequency domain. To demonstrate the generality and usefulness of our method, we also discuss its application to test pattern generation targeting signal integrity loss.

25.4 Enhancing Test Efficiency for Delay Fault Testing Using Multiple-Clock Schemes [p. 371]
Jing-Jia Liou, Li-C. Wang, Kwang-Ting Cheng, Jennifer Dworak, M. Ray Mercer, Rohit Kapur, Thomas W. Williams

In conventional delay testing, the test clock is a single pre-defined parameter that is often set to be the same as the system clock. This paper discusses the potential of enhancing test efficiency by using multiple clock frequencies. The intuition behind our work is that for a given set of AC delay patterns, a carefully selected, tighter clock would be more effective in screening out the potentially defective chips. Then, by using a smarter test clock scheme and combining it with a second set of AC delay patterns, the overall quality of AC delay testing can be enhanced while the cost of including the second pattern set is minimized. We demonstrate these concepts through analysis and experiments using a statistical timing analysis framework with defect-injected simulation.
Categories and Subject Descriptors
B.8.1 [Hardware]: Reliability, Testing, and Fault-Tolerance
General Terms
Experimentation, Measurement, Reliability
Keywords
Delay Testing, Statistical Timing Analysis, Transition Fault Model


Session 26: Special Session: How Do You Design a 10M Gate ASIC?

Chair: Ahmed A. Jerraya
Organizers: Ahmed A. Jerraya, Kurt Keutzer
26.1 Going Mobile: The Next Horizon for Multi-million Gate Designs in the Semi-Conductor Industry [p. 375]
Christian Berthet

The complexity of a System-on-Chip design lies not only in the millions of transistors packed into a square millimeter. The major challenge for the technical success of a SoC is to make sure that millions of lines of software fit with millions of gates. In this paper, the problems of multi-million-gate design are illustrated from the viewpoint of the practical development of a complex digital system done at STMicroelectronics for a GSM/GPRS cellular application.
Categories and Subject Descriptors
C.3 [Computer Systems Organization]: Special-purpose and application-based systems - real-time and embedded systems.
General Terms
Design.
Keywords
SoC Design, HW/SW co-design.


Session 27: Power Distribution Issues

Chair: Sachin Sapatnekar
Organizers: Abhijit Dharchoudhury, Tadahiro Kuroda
27.1 HiPRIME: Hierarchical and Passivity Preserved Interconnect Macromodeling Engine for RLKC Power Delivery [p. 379]
Yahong Cao, Yu-Min Lee, Tsung-Hao Chen, Charlie Chung-Ping Chen

This paper proposes a general hierarchical analysis methodology, HiPRIME, to efficiently analyze RLKC power delivery systems. After partitioning the circuits into blocks, we develop and apply the IEKS (Improved Extended Krylov Subspace) method to build multi-port Norton equivalent circuits which transform all the internal sources to Norton current sources at the ports. Since there are no active elements inside the Norton circuits, passive or realizable model order reduction techniques such as PRIMA can be applied. To further reduce the top-level hierarchy runtime, we develop a second-level model reduction algorithm and prove its passivity. Experimental results show 400-700X runtime improvement with less than 0.2% error.

27.2 High-Level Current Macro-Model For Power-Grid Analysis [p. 385]
Srinivas Bodapati, Farid N. Najm

We present a frequency domain current macro-modeling technique for capturing the dependence of the block current waveform on its input vectors. The macro-model is based on estimating the Discrete Cosine Transform (DCT) of the current waveform as a function of input vector pair and then taking the inverse transform to estimate the time domain current waveform. The input vector pairs are partitioned according to Hamming distance and a current macro-model is built for each Hamming distance using regression. Regression is done on a set of current waveforms generated for each circuit, using HSPICE. The average relative error in peak current estimation using the current macro-model is less than 20%.
Categories and Subject Descriptors
B.7 [Hardware]: Integrated Circuits - CAD; B.7.2 [Integrated Circuits]: Design Aids - Modeling
General Terms
Algorithms
Keywords
Power grid, Current macro-model, DCT
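
A stripped-down rendering of this flow (illustrative only; the paper builds a regression model per Hamming distance, reduced here to a per-distance average) is compact with scipy's DCT:

    # Sketch: DCT-domain current macro-model keyed by Hamming distance.
    import numpy as np
    from scipy.fft import dct, idct

    def hamming(v1, v2):
        return bin(v1 ^ v2).count("1")

    def build_model(training):                # {(v1, v2): current waveform}
        buckets = {}
        for (v1, v2), wave in training.items():
            h = hamming(v1, v2)
            buckets.setdefault(h, []).append(dct(wave, norm="ortho"))
        return {h: np.mean(c, axis=0) for h, c in buckets.items()}

    def predict(model, v1, v2):               # inverse DCT of modeled spectrum
        return idct(model[hamming(v1, v2)], norm="ortho")

    rng = np.random.default_rng(0)
    train = {(0b1100, 0b1010): rng.random(64),
             (0b1111, 0b0000): rng.random(64)}
    model = build_model(train)
    print(predict(model, 0b0101, 0b0011)[:4])  # same Hamming distance as (1100, 1010)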

27.3 Macro-Modeling Concepts For The Chip Electrical Interface [p. 391]
Brian W. Amick, Claude R. Gauthier, Dean Liu

The power delivery network is made up of passive elements in the distribution network as well as the active transistor loads. A chip typically has three types of power supplies that require attention: core, I/O, and analog. Core circuits consist of digital circuits and have the largest current demand. In addition to all of the system issues and models for the core, modeling the I/O subsystem has the additional requirement of modeling return paths and discontinuities. The analog circuits present yet another challenge to the macromodeling of the supply network because they place a tight demand on supply variations. This paper presents a design methodology for generating macro-models of the entire chip electrical interface. This methodology can be used by chip, package, and system designers and is being used to design high-reliability servers.
Categories and Subject Descriptors
C.5.3 [Computer System Implementation]: VLSI Systems.
General Terms
Performance, Design, Reliability.
Keywords
VLSI Power Distribution, Inductance, High Speed Microprocessor Design, Analog and I/O Power Delivery.

27.4 Modeling and Analysis of Regular Symmetrically Structured Power/Ground Distribution Networks [p. 395]
Hui Zheng, Lawrence T. Pileggi

In this paper we propose a novel and efficient methodology for modeling and analysis of regular symmetrically-structured power/ground distribution networks. The modeling of inductive effects is simplified by a folding technique which exploits the symmetry in the power/ground distribution. Furthermore, the employment of susceptance [10,11] (the inverse of inductance) models enables further simplification of the analysis, and is also shown to preserve the symmetric positive definiteness of the circuit equations. Experimental results demonstrate that our approach can provide up to 8x memory savings and up to 10x speedup over the already efficient simulation based on the original sparse susceptance matrix without loss of accuracy. Importantly, this work demonstrates that by employing limited regularity, one can create excellent power/ground distribution designs that are dramatically simpler to analyze, and therefore amenable to more powerful global design optimization.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits] Design Aids - verification
General Terms
Design, Verification
Keywords
Power/Ground Distribution, Susceptance, Folding Technique, Design Regularity

27.5 Clock Tree Optimization in Synchronous CMOS Digital Circuits for Substrate Noise Reduction Using Folding of Supply Current Transients [p. 399]
Mustafa Badaroglu, Kris Tiri, Stéphane Donnay, Piet Wambacq, Ingrid Verbauwhede, Georges Gielen, Hugo De Man

In a synchronous clock distribution network with zero latencies, digital circuits switch simultaneously on the clock edge and therefore generate substrate noise due to the sharp peaks in the supply current. We present a novel methodology that optimizes the clock tree for less substrate noise generation by using statistical single-cycle supply current profiles computed for every clock region, taking the timing constraints into account. Our methodology is novel in that it uses an error-driven compressed data set during the optimization over a number of clock regions specified for a significant reduction in substrate noise. It also produces a quality analysis of the computed latencies as a function of the clock skew. The experimental results show a more than 2x reduction in substrate noise generation for circuits having four clock regions whose latencies are optimized.
Categories and Subject Descriptors
B.5.1 [Register-Transfer-Level Implementation]: Design - datapath design. B.6.1 [Logic Design]: Design Styles - sequential circuits. B.6.3 [Logic Design]: Design Aids - optimization, simulation. B.7.1 [Integrated Circuits]: Types and Design Styles- VLSI. B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids.
General Terms
Algorithms, Design, Performance, Reliability.
Keywords
Substrate noise, di/dt noise, low-noise digital design, clock distribution networks, supply current shaping and optimization.


Session 28: Advances in Synthesis

Chair: Marek Perkowski
Organizers: Soha M. Hassoun, Yusuke Matsunaga
28.1 Resynthesis and Peephole Transformations for the Optimization of Large-Scale Asynchronous Systems [p. 405]
Tiberiu Chelcea, Steven M. Nowick

Several approaches have been proposed for the syntax-directed compilation of asynchronous circuits from high-level specification languages, such as Balsa and Tangram. Both compilers have been successfully used in large real-world applications; however, in practice, these methods suffer from significant performance overheads due to their reliance on straightforward syntax-directed translation. This paper introduces a powerful new set of transformations, and an extended channel-based language to support them, which can be used as an optimizing back-end for Balsa. The transforms described in this paper fall into two categories: resynthesis and peephole. The proposed optimization techniques have been fully integrated into a comprehensive asynchronous CAD package, Balsa. Experimental results on several substantial design examples indicate significant performance improvements.

28.2 Design of Asynchronous Circuits by Synchronous CAD Tools [p. 411]
Alex Kondratyev, Kelvin Lwin

The roadblock to wide acceptance of asynchronous methodology is poor CAD support. Current asynchronous design tools require a significant re-education of designers, and their features are far behind those of commercial synchronous tools. This paper considers a particular subclass of asynchronous circuits (Null Convention Logic or NCL) and suggests a design flow that is based entirely on commercial CAD tools. This new design flow shows a significant area improvement over known flows based on NCL.

28.3 Implementing Asynchronous Circuits using a Conventional EDA Tool-Flow [p. 415]
Christos P. Sotiriou

This paper presents an approach by which asynchronous circuits can be realised with a conventional EDA tool flow and conventional standard cell libraries. Based on a gate-level asynchronous circuit implementation technique, direct-mapping, and by identifying the delay constraints and exploiting certain EDA tool features, this paper demonstrates that a conventional EDA tool flow can be used to describe, place, route and timing-verify asynchronous circuits.
Categories and Subject Descriptors
B7.1. [Integrated Circuits]: Types and Design Styles
General Terms
Design, Experimentation, Standardization
Keywords
Asynchronous, EDA, Tool-Flow

28.4 Transformation Rules for Designing CNOT-based Quantum Circuits [p. 419]
Kazuo Iwama, Yahiko Kambayashi, Shigeru Yamashita

This paper gives a simple but nontrivial set of local transformation rules for Control-NOT(CNOT)-based combinatorial circuits. It is shown that this rule set is complete, namely, for any two equivalent circuits, S1 and S2, there is a sequence of transformations, each of them in the rule set, which changes S1 to S2. Our motivation is to use this rule set for developing a design theory for quantum circuits whose Boolean logic parts should be implemented by CNOT-based circuits. As a preliminary example, we give a design procedure based on our transformation rules which reduces the cost of CNOT-based circuits.
Categories and Subject Descriptors
B.6.m [LOGIC DESIGN]: Miscellaneous
General Terms
Design, Theory
Keywords
Quantum Circuit, CNOT Gate, Local Transformation Rules
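
One of the simplest local rules in any such set exploits the fact that a CNOT is self-inverse, so two identical adjacent gates cancel; the sketch below applies this single toy rule to a gate list (the paper's complete rule set is richer).

    # Sketch: cancel adjacent identical CNOT gates.
    # A circuit is a list of (control, target) wire pairs.
    def cancel_adjacent_cnots(circuit):
        out = []
        for gate in circuit:
            if out and out[-1] == gate:   # CNOT * CNOT = identity
                out.pop()                 # cancellation may expose a new pair
            else:
                out.append(gate)
        return out

    circuit = [(0, 1), (2, 3), (2, 3), (0, 1)]
    print(cancel_adjacent_cnots(circuit))  # [] - the circuit is the identity

Completeness of the rule set means that chains of such rewrites suffice to connect any two equivalent circuits, which is exactly what makes rule-based cost reduction sound.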

28.5 Fast Three-Level Logic Minimization Based on Autosymmetry [p. 425]
Anna Bernasconi, Valentina Ciriani, Fabrizio Luccio, Linda Pagli

Sum of Pseudoproducts (SPP) is a three-level logic synthesis technique developed in recent years. In this framework we exploit the "regularity" of Boolean functions to decrease minimization time. Our main results are: 1) the regularity of a Boolean function f of n variables is expressed by its autosymmetry degree k (where 0 <= k <= n), with k = 0 meaning no regularity (that is, we are not able to provide any advantage over standard synthesis); 2) for k >= 1 the function is autosymmetric, and a new function fk is identified in polynomial time: fk is "equivalent" to, but smaller than, f, and depends on n - k variables only; 3) given a minimal SPP form for fk, a minimal SPP form for f is built in linear time; 4) experimental results show that 61% of the functions in the classical ESPRESSO benchmark suite are autosymmetric, and the SPP minimization time for them is critically reduced; we can also solve cases that are otherwise practically intractable. We finally discuss the role and meaning of autosymmetry.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - Automatic Synthesis, Optimization.
General Terms
Algorithms, Design, Theory.
Keywords
Three-Level Logic, Synthesis, Autosymmetry.
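
The autosymmetry degree has a direct (if exponential-time) operational reading, sketched below for a truth-table representation: k is the dimension of the set of input translations that leave f unchanged. This brute-force check is only for illustration; the paper identifies the reduced function in polynomial time.

    # Sketch: autosymmetry degree of f, given as a truth table of 2**n bits.
    # L_f = {a : f(x ^ a) == f(x) for all x} forms a vector space; k = dim L_f.
    def autosymmetry_degree(tt, n):
        size = 1 << n
        invariant = [a for a in range(size)
                     if all(tt[x] == tt[x ^ a] for x in range(size))]
        return len(invariant).bit_length() - 1   # |L_f| is a power of two

    # 3-input parity is invariant under every even-weight translation: k = 2
    parity = [bin(x).count("1") & 1 for x in range(8)]
    print(autosymmetry_degree(parity, 3))        # 2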


Session 29: Analog Synthesis & Design Methodology

Chair: C.-J. Richard Shi
Organizers: Joel R. Phillips, Kartikeya Mayaram
29.1 An Efficient Optimization-based Technique to Generate Posynomial Performance Models for Analog Integrated Circuits [p. 431]
Walter Daems, George Gielen, Willy Sansen

This paper presents a new direct-fitting method to generate posynomial response surface models with arbitrary constant exponents for linear and nonlinear performance parameters of analog integrated circuits. Posynomial models enable the use of efficient geometric programming techniques for circuit sizing and optimization. The automatic generation avoids the time-consuming nature and inaccuracies of handcrafted analytic model generation. The technique is based on the fitting of posynomial model templates to numerical data from SPICE simulations. Attention is paid to estimating the relative "goodness of fit" of the generated models. Experimental results illustrate the significantly better accuracy of the new approach.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids; B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids; I.6.5 [Simulation and Modeling]: Model Development
General Terms
Performance, Design, Algorithms
Keywords
Performance Modeling for Analog Circuits, Posynomial Response Surface Modeling, Geometric Programming
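
Once the exponent template is fixed, fitting the posynomial coefficients is a non-negative least squares problem; the sketch below (template and data are made up) shows the mechanics.

    # Sketch: fit c_i >= 0 in f(x) = sum_i c_i * prod_j x_j**A[i, j]
    # for a fixed exponent template A via non-negative least squares.
    import numpy as np
    from scipy.optimize import nnls

    def fit_posynomial(X, y, A):
        # X: (samples, vars); A: (terms, vars) exponent template
        M = np.prod(X[:, None, :] ** A[None, :, :], axis=2)  # monomial matrix
        coeffs, residual = nnls(M, y)
        return coeffs, residual

    A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # x, y, 1/(x*y)
    rng = np.random.default_rng(2)
    X = rng.uniform(0.5, 2.0, size=(50, 2))
    y = 2 * X[:, 0] + 0.5 * X[:, 1] + 3 / (X[:, 0] * X[:, 1])
    c, _ = fit_posynomial(X, y, A)
    print(c)   # ~ [2.0, 0.5, 3.0]

The non-negativity of the coefficients is what keeps the fitted model a true posynomial, and hence usable in a geometric program.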

29.2 Remembrance of Circuits Past: Macromodeling by Data Mining in Large Analog Design Spaces [p. 437]
Hongzhou Liu, Amit Singhee, Rob A. Rutenbar, L. Richard Carley

The introduction of simulation-based analog synthesis tools creates a new challenge for analog modeling. These tools routinely visit 10^3 to 10^5 fully simulated circuit solution candidates. What might we do with all this circuit data? We show how to adapt recent ideas from large-scale data mining to build models that capture significant regions of this visited performance space, parameterized by variables manipulated by synthesis, trained by the data points visited during synthesis. Experimental results show that we can automatically build useful nonlinear regression models for large analog design spaces.
CATEGORIES AND SUBJECT DESCRIPTORS
B.7.2 [Integrated Circuits]: Design aids - verification
GENERAL TERMS
Algorithms

29.3 Optimal Design of Delta-Sigma ADCs by Design Space Exploration [p. 443]
Ovidiu Bajdechi, Georges Gielen, Johan H. Huijsing

An algorithm for architecture-level exploration of the delta-sigma ADC design space is presented. The algorithm finds an optimal solution by exhaustively exploring both single-loop and cascaded architectures, with single-bit or multi-bit quantizers, over a range of oversampling ratios. A fast filter-level step evaluates the performance of all loop-filter topologies and passes the accepted solutions to the architecture-level optimization step, which maps the filters onto feasible architectures and evaluates their performance. The power consumption of each accepted architecture is estimated, and the ten best solutions in terms of the ratio of peak SNDR to power consumption are further optimized for yield. Experimental results for two different design targets are presented. They show that previously published solutions are among the best architectures for a given target, but that better solutions can be designed.
Categories and Subject Descriptors
J.6 [Computer Applications]: Computer-Aided Engineering
General Terms
Design
Keywords
ADC, CAD, delta-sigma
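
For context, the kind of fast architecture-level estimate such an exploration leans on can be as simple as the textbook peak-SQNR approximation for an ideal L-th order modulator with a B-bit quantizer (a standard formula, not the paper's performance model):

    # Sketch: ideal peak SQNR (dB) of an L-th order delta-sigma
    # modulator with a B-bit quantizer at oversampling ratio OSR.
    import math

    def peak_sqnr_db(L, B, osr):
        return (6.02 * B + 1.76
                - 10 * math.log10(math.pi ** (2 * L) / (2 * L + 1))
                + (2 * L + 1) * 10 * math.log10(osr))

    for L, B, osr in [(2, 1, 128), (3, 1, 64), (2, 4, 32)]:
        print(f"L={L} B={B} OSR={osr}: {peak_sqnr_db(L, B, osr):.1f} dB")

Exhaustive exploration is feasible precisely because each candidate can first be screened with closed-form estimates like this before any expensive behavioral simulation.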

29.4 Systematic Design of a 200 MS/s 8-bit Interpolating/Averaging A/D Converter [p. 449]
J. Vandenbussche, K. Uyttenhove, E. Lauwers, M. Steyaert, G. Gielen

The systematic design of a high-speed, high-accuracy Nyquist-rate A/D converter is proposed. The presented design methodology covers the complete flow and is supported by software tools. A generic behavioral model is used to explore the A/D converter's specifications during high-level design and exploration. The inputs to the flow are the specifications of the A/D converter and the technology process. The result is a generated layout and the corresponding extracted behavioral model. The approach has been applied to a real-life test case, where a Nyquist-rate 8-bit 200 MS/s 4-2 interpolating/averaging A/D converter was developed for a WLAN application.
Categories and Subject Descriptors
B.7.m Integrated Circuits: miscellaneous
General Terms
Design
Keywords
A/D converters, Interpolating, Flash, Simulated Annealing.


Session 30: Low-Power Physical Design

Chair: Massoud Pedram
Organizers: Chaitali Chakrabarti, Sarma Vrudhula
30.1 Petri Net Modeling of Gate and Interconnect Delays for Power Estimation [p. 455]
Ashok K. Murugavel, N. Ranganathan

In this paper, a new type of Petri net, called the Hierarchical Colored Hardware Petri Net (HCHPN), is proposed to model real-delay switching activity for power estimation. The logic circuit is converted into an HCHPN and simulated as a Petri net to obtain the switching activity estimate and thus the power values. The method is accurate and is significantly faster than other simulative methods. The HCHPN yields an average error of 4.9% with respect to Hspice for the ISCAS '85 benchmark circuits. The per-pattern simulation time is about 46 times less than that of PowerMill.

30.2 Power Estimation in Global Interconnects and its Reduction Using a Novel Repeater Optimization Methodology [p. 461]
Pawan Kapur, Gaurav Chandra, Krishna C. Saraswat

The purpose of this work is twofold: first, to quantify and establish future trends for the dynamic power dissipation in the global wires of high-performance integrated circuits; second, to develop a novel and efficient delay-power tradeoff formulation for minimizing the power due to repeaters, which can otherwise constitute 50% of total global wire power dissipation. Using the closed-form solutions from this formulation, power savings of 50% on repeaters are shown with minimal delay penalties of about 5% at the 50 nm technology node. These closed-form, analytical solutions provide a fast and powerful tool for designers to minimize power.
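
For reference, the delay-optimal baseline such a delay-power formulation backs away from is the classic repeater-insertion result attributed to Bakoglu (textbook expressions with illustrative values, not the closed forms derived in this paper):

    # Sketch: delay-optimal repeater count and size (Bakoglu).
    # Rw, Cw: total wire resistance and capacitance; R0, C0: output
    # resistance and input capacitance of a minimum-size inverter.
    import math

    def optimal_repeaters(Rw, Cw, R0, C0):
        k = math.sqrt(0.4 * Rw * Cw / (0.7 * R0 * C0))   # number of stages
        h = math.sqrt(R0 * Cw / (Rw * C0))               # size multiplier
        return k, h

    k, h = optimal_repeaters(Rw=2e3, Cw=2e-12, R0=10e3, C0=2e-15)
    print(f"{k:.1f} repeaters, each {h:.0f}x minimum size")

A power-aware scheme uses fewer and smaller repeaters than this optimum, trading the small resulting delay penalty for the large power saving the abstract quotes.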

30.3 Low-Swing Clock Domino Logic Incorporating Dual Supply and Dual Threshold Voltage [p. 467]
Seong-Ook Jung, Ki-Wook Kim, Sung-Mo Kang

High-speed domino logic is now prevalent in the performance-critical blocks of a chip. A Low Voltage Swing Clock (LVSC) domino logic family is developed for substantial dynamic power saving. To boost the transition speed in the proposed circuitry, a well-established dual threshold voltage technique is exploited. A dual supply voltage technique in the LVSC domino logic is geared to reduce power consumption in the clock tree and logic gates effectively. A Delay Constrained Power Optimization (DCPO) algorithm allocates low supply voltage to logic gates such that the dynamic power consumed by logic gates is minimized. Delay time variations due to gate-to-source voltage change and input signal arrival time differences are considered for accurate timing analysis in DCPO.
Categories and Subject Descriptors
B.6 [Hardware]: Logic Design; B.7 [Hardware]: Integrated Circuits
General Terms
Design
Keywords
domino logic, low swing clock, dual supply voltage, dual threshold voltage, low power

30.4 DRG-Cache: A Data Retention Gated-Ground Cache for Low Power [p. 473]
Amit Agarwal, Hai Li, Kaushik Roy

In this paper we propose a novel integrated circuit and architectural level technique to reduce leakage power consumption in high performance cache memories using a single Vt (transistor threshold voltage) process. We utilize the concept of Gated-Ground [5] (an NMOS transistor inserted between the ground line and the SRAM cell) to achieve a reduction in leakage energy without significantly affecting performance. Experimental results on gated-ground caches show that data is retained (DRG-Cache) even if the memory is put in the stand-by mode of operation. Data is restored when the gated-ground transistor is turned on. Turning off the gated-ground transistor in turn gives a large reduction in leakage power. This technique requires no extra circuitry; the row decoder itself can be used to control the gated-ground transistor. The technique is applicable to data and instruction caches as well as different levels of the cache hierarchy, such as the L1, L2, or L3 caches. We fabricated a test chip in TSMC 0.25µ technology to show the data retention capability and the cell stability of the DRG-Cache. Our simulation results on 100nm and 70nm processes (Berkeley Predictive Technology Model) show 16.5% and 27% reductions in consumed energy in the L1 cache and 50% and 47% reductions in the L2 cache, with less than 5% impact on execution time and within a 4% increase in area overhead.
Categories and Subject Descriptors
B.3.2 [Memory Structure]: Design Styles --- Cache memories; B.3.1 [Memory Structure]: Semiconductor Memories --- Static memory (SRAM); B.7.1 [Integrated Circuits]: Types and Design Styles --- Memory technology.
General Terms: Design, Performance and Experimentation.
Keywords: Gated-ground, SRAM, low leakage cache.


Session 31: PANEL: Unified Tools for SoC Embedded Systems: Mission Critical, Mission Impossible or Mission Irrelevant?

Chair: Gary Smith
Organizers: Daya Nadamuni, Sharad Malik
Panelists: Rick Chapman, John Fogelin, Kurt Keutzer, Grant Martin, Brian Bailey [p. 479]

As designers struggle with developing application solutions consisting of complex systems-on-a-chip with a significant software component, they must deal with a diversity of tools with very different philosophies and assumptions, to help manage this task. On one hand are tools which assume a clean separation between the hardware and software parts of the design with an abstraction of the hardware available for software development. On the other hand are tools that try to handle the hardware and software parts of the design concurrently. What drives these different philosophies? Which of these is critical for emerging system designs? Which of these is viable going forward? Our panel of experts consisting of designers, embedded software tool providers, system design tool providers and an academic will answer these challenging questions.


Session 32: Multi-Voltage, Multi-Threshold Design

Chair: Rajendran Panda
Organizers: Renu Mehra, Sarma Vrudhula
32.1 Dynamic and Leakage Power Reduction in MTCMOS Circuits Using an Automated Efficient Gate Clustering Technique [p. 480]
Mohab Anis, Shawki Areibi, Mohamed Mahmoud, Mohamed Elmasry

Reducing power dissipation is one of the principal subjects in VLSI design today. Scaling causes subthreshold leakage currents to become a large component of total power dissipation. This paper presents two techniques for efficient gate clustering in MTCMOS circuits by modeling the problem via Bin-Packing (BP) and Set-Partitioning (SP) techniques. An automated solution is presented, and both techniques are applied to six benchmarks to verify functionality. Both methodologies offer significant reductions in dynamic and leakage power over previous techniques during the active and standby modes, respectively. Furthermore, the SP technique takes the circuit's routing complexity into consideration, which is critical for Deep Sub-Micron (DSM) implementations. Sufficient performance is achieved while significantly reducing the overall sleep transistor area. Results obtained indicate that our proposed techniques can achieve on average 90% savings in leakage power and 15% savings in dynamic power.
Categories and Subject Descriptors: B.7.1 [Integrated Circuits]: Types and Design Styles
General Terms: Design
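
The bin-packing view admits a compact sketch (first-fit decreasing on illustrative peak currents; the paper's formulation additionally accounts for which gates switch simultaneously):

    # Sketch: cluster gates under sleep transistors by peak current,
    # first-fit decreasing; each sleep transistor sustains i_max.
    def cluster_gates(gate_currents, i_max):
        bins = []                              # each bin: [capacity_left, gates]
        for gate, current in sorted(gate_currents.items(),
                                    key=lambda kv: -kv[1]):
            for b in bins:
                if b[0] >= current:            # fits under this sleep transistor
                    b[0] -= current
                    b[1].append(gate)
                    break
            else:
                bins.append([i_max - current, [gate]])
        return [b[1] for b in bins]

    currents = {"g1": 0.6, "g2": 0.5, "g3": 0.4, "g4": 0.3, "g5": 0.2}
    print(cluster_gates(currents, i_max=1.0))  # [['g1', 'g3'], ['g2', 'g4', 'g5']]

Fewer bins translate directly into fewer or smaller sleep transistors, which is where the area and leakage savings come from.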

32.2 Total Power Optimization By Simultaneous Dual-Vt Allocation and Device Sizing in High Performance Microprocessors [p. 486]
Tanay Karnik, Yibin Ye, James Tschanz, Liqiong Wei, Steven Burns, Venkatesh Govindarajulu, Vivek De, Shekhar Borkar

We describe various design automation solutions for design migration to a dual-Vt process technology. We include the results of a Lagrangian Relaxation based tool, iSTATS, and a heuristic iterative optimization flow. Joint dual-Vt allocation and sizing reduces total power by more than 10% compared with Vt allocation alone, and by more than 25% compared with pure sizing methods. The heuristic flow requires 5x more computation runtime than iSTATS due to its iterative nature.
Categories and Subject Descriptors
B.7 INTEGRATED CIRCUITS
B.7.1 Types and Design Styles - Microprocessors and microcomputers, VLSI.
General Terms
Algorithms, Performance, Design, Experimentation, Verification.
Keywords
Dual-Vt design, multiple threshold, sizing, optimization.

32.3 An Optimal Voltage Synthesis Technique for a Power-Efficient Satellite Application [p. 492]
Dong-In Kang, Jinwoo Suh, Stephen P. Crago

This paper presents an optimal voltage synthesis technique for a satellite application to maximize system performance subject to an energy budget. The period of a satellite's orbit is partitioned into several independent regions with different characteristics, such as type of computation, importance, performance requirements, and energy consumption. Given a periodic energy recharge model, optimal voltages for the regions are synthesized such that the overall performance is maximized within the energy budget over the period.
Categories and Subject Descriptors
C.4 [PERFORMANCE OF SYSTEMS] Design studies, Modeling techniques, Performance attributes.
General Terms
Algorithms, Management, Performance, Design.
Keywords
Power-aware design, power-efficient design, satellite application, queueing.


Session 33: Advanced Simulation Techniques

Chair: L. Miguel Silveira
Organizers: Georges G. Gielen, Kartikeya Mayaram
33.1 Fast and Accurate Behavioral Simulation of Fractional-N Frequency Synthesizers and other PLL/DLL Circuits [p. 498]
Michael H. Perrott

Techniques for fast and accurate simulation of fractional-N synthesizers at a detailed behavioral level are presented. The techniques allow a uniform time step to be used for the simulator, and can be applied to a variety of phase locked loop (PLL) and delay locked loop (DLL) circuits beyond fractional-N synthesizers, as well as to a variety of simulation frameworks such as Verilog and Matlab. Simulated results from a custom C++ simulator are shown to compare well to measured results from a prototype fractional-N synthesizer using a Delta-Sigma modulator to dither its divide value.
Categories and Subject Descriptors
I.6.5 [Simulation and Modeling]: Model Development
General Terms
Algorithms
Keywords
fractional-N, frequency, synthesizer, sigma, delta, PLL, DLL
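
The flavor of uniform-time-step behavioral simulation can be conveyed by a first-order phase-domain toy model (far simpler than the paper's techniques, which also handle the divider and Delta-Sigma dithering; all gains here are illustrative):

    # Sketch: uniform-time-step phase-domain model of a type-I PLL.
    import math

    def simulate_pll(f_ref=1e6, f0=0.9e6, kd=0.5, kv=4e5,
                     dt=1e-8, steps=200_000):
        phi_ref = phi_vco = 0.0
        for _ in range(steps):
            err = math.sin(phi_ref - phi_vco)        # phase detector
            vctrl = kd * err                         # proportional control
            phi_ref += 2 * math.pi * f_ref * dt
            phi_vco += 2 * math.pi * (f0 + kv * vctrl) * dt
        return (phi_ref - phi_vco) % (2 * math.pi)

    print(f"steady-state phase error: {simulate_pll():.3f} rad")  # ~asin(0.5)

Keeping the time step uniform, as the paper advocates, is what lets such models run fast inside frameworks like Verilog or Matlab without event-driven bookkeeping.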

33.2 Time-domain Steady-state Simulation of Frequency-Dependent Components Using Multi-interval Chebyshev Method [p. 504]
Baolin Yang, Joel Phillips

Simulation of RF circuits often demands analysis of distributed component models that are described via frequency-dependent multiport Y, Z, or S parameters. Frequency-domain methods such as harmonic balance are able to handle these components without difficulty, whereas time-domain simulation methods find them harder to treat. In this paper, we propose a hybrid frequency-time approach to treat these components in steady-state time-domain simulations. Efficiency is achieved through the use of the multi-interval Chebyshev (MIC) simulation method and a low-order rational-fitting model for preconditioning matrix-implicit Krylov-subspace solvers.
Categories and Subject Descriptors
I.6 [Computing Methodologies]: Simulation and Modeling
General Terms
Algorithms
Keywords
RF circuit simulation, frequency dependent, S parameter

33.3 A Time-domain RF Steady-State Method for Closely Spaced Tones [p. 510]
Jaijeet Roychowdhury

Verifying circuits with two or more closely-spaced driving frequencies is important in RF and wireless communications, e.g., in the design of down-conversion mixers. Existing steady-state calculation methods, like harmonic balance, rely on Fourier series expansions to find the difference-frequency components typically of interest. Time-domain methods are, however, better suited for circuits with strong nonlinearities such as switching. Towards this end, we present a purely time-domain method for direct computation of difference tones in closely-spaced multi-tone problems. Our approach is based on multiple artificial time scales for decoupling the tones driving the circuit. Our method relies on a novel multi-time reformulation that expresses circuit equations directly in terms of time-scales corresponding to difference tones. We apply the new technique to an RF-CMOS mixer to predict baseband bit-streams and down-conversion gain and distortion, in two orders of magnitude less CPU time than traditional time-stepping simulation.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids - Analog Verification
General Terms
Algorithms, Design, Verification
Keywords
MPDE, difference-frequency time scales, multi-time PDEs, harmonic balance, shooting, envelope, analog, mixed-signal, analog/RF simulation, RF switching mixers, artificial time scales, homotopy, continuation methods
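
For background, the multi-time reformulation builds on the well-known MPDE idea; for a circuit DAE (notation ours)

\[ \frac{d}{dt}\, q(x(t)) + f(x(t)) = b(t), \]

a two-time version reads

\[ \frac{\partial q(\hat{x})}{\partial t_1} + \frac{\partial q(\hat{x})}{\partial t_2} + f(\hat{x}) = \hat{b}(t_1, t_2), \qquad x(t) = \hat{x}(t, t), \]

where the paper's novelty lies in choosing the artificial time scales so that one axis tracks the slow difference tone directly.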

33.4 An Algorithm for Frequency-Domain Noise Analysis in Nonlinear Systems [p. 514]
Giorgio Casinovi

Many mixed-technology and mixed-signal designs require the assessment of the system's performance in the presence of signals that are best characterized as having continuous frequency spectra: for instance, the simulation of a circuit's behavior in the presence of electrical noise. This paper describes an algorithm for frequency-domain simulation of nonlinear systems capable of handling signals with continuous spectra. The algorithm is based on an orthogonal series expansion of the signals in the frequency domain. Thus it is, in a sense, the dual approach to frequency-domain simulation with respect to harmonic balance, which relies on a time-domain series expansion of the signals. Signal spectra are obtained from the solution of a system of nonlinear algebraic equations whose dimension increases with the desired spectral accuracy. Numerical results obtained on an optical amplifier are presented.
Categories and Subject Descriptors
J.6 [Computer-Aided Engineering]: Computer-Aided Design; I.6.8 [Simulation and Modeling]: Types of Simulation - continuous
General Terms
Algorithms, Verification
Keywords
Circuit simulation, frequency-domain analysis


Session 34: Design Methodologies Meet Network Applications

Chair: Anand Raghunathan
Organizers: Anand Raghunathan, Marco Di Natale
34.1 System-level Performance Optimization of the Data Queueing Memory Management in High-Speed Network Processors [p. 518]
Ch. Ykman-Couvreur, J. Lambrecht, D. Verkest, F. Catthoor, A. Nikologiannis, G. Konstantoulakis

In high-speed network processors, data queueing has to allow real-time memory (de)allocation, buffering, retrieving, and forwarding of incoming data packets. Its implementation must be highly optimized to combine high speed, low power, large data storage, and high memory bandwidth. In this paper, such data queueing is used as a case study to demonstrate the effectiveness of a new system-level exploration method for optimizing memory performance in dynamic memory management. Assuming that a multi-bank memory architecture is used for data storage, the method trades off bank conflicts against memory accesses during real-time memory (de)allocation. It has been applied to the data queueing module of the PRO3 system [8]. Compared with the conventional memory management technique for embedded systems, our exploration method can save up to 90% of the bank conflicts, which in turn improves the worst-case memory performance of data-queueing operations by 50%.
Categories and Subject Descriptors
B.3.3 [Memory Structures]: Performance Analysis and Design Aids
General Terms
Memory performance
Keywords
System-level exploration, memory management

34.2 Analysis of Power Consumption on Switch Fabrics in Network Routers [p. 524]
Terry Tao Ye, Luca Benini, Giovanni De Micheli

In this paper, we introduce a framework to estimate the power consumption on switch fabrics in network routers. We propose different modeling methodologies for node switches, internal buffers and interconnect wires inside switch fabric architectures. A simulation platform is also implemented to trace the dynamic power consumption with bit-level accuracy. Using this framework, four switch fabric architectures are analyzed under different traffic throughput and different numbers of ingress/egress ports. This framework and analysis can be applied to the architectural exploration for low power high performance network router designs.
Categories and Subject Descriptors
C.4 [Performance of Systems]: Design studies, Modeling techniques
General Terms
Design, Experimentation, Performance
Keywords
Networks on Chip, Interconnect Networks, Systems on Chip, Power Consumption

34.3 Memory Optimization in Single Chip Network Switch Fabrics [p. 530]
David Whelihan, Herman Schmit

Moving high-bandwidth (10 Gb/s+) network switches from the large-scale, rack-mount design space to the single-chip design space requires a re-evaluation of the overall design requirements. In this paper, we explore the design space for these single-chip devices by evaluating the ITRS. We find that, unlike ten years ago when interconnect was scarce, the limiting factor in today's designs is on-chip memory. We then discuss an architectural technique for maximizing the effectiveness of queue memory in a single-chip switch. Next, we show simulation results indicating that an improvement of more than two orders of magnitude in dropped-packet probability can be achieved by redistributing memory and allowing sharing between the switch's ports. Finally, we evaluate the cost of the optimized architecture in terms of other on-chip resources.
Categories and Subject Descriptors
B.3.2 [Hardware]: Design Styles
General Terms
Design, Performance
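
A toy Monte Carlo illustrating why shared queue memory drops far fewer packets under bursty traffic than statically partitioned memory (traffic model, sizes, and service rates are our assumptions):

```python
import random
def drop_rate(shared, ports=4, cells=8, steps=50000):
    random.seed(1)
    burst = [False] * ports
    q = [0] * ports; dropped = arrivals = 0
    for _ in range(steps):
        for p in range(ports):
            # two-state (bursty) arrival process per ingress port
            if random.random() < 0.02:
                burst[p] = not burst[p]
            if random.random() < (0.95 if burst[p] else 0.2):
                arrivals += 1
                room = ports * cells - sum(q) if shared else cells - q[p]
                if room > 0:
                    q[p] += 1
                else:
                    dropped += 1
            if q[p] and random.random() < 0.6:   # egress service
                q[p] -= 1
    return dropped / arrivals
print("partitioned:", drop_rate(False))
print("shared:     ", drop_rate(True))
```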


Session 35: Advances in Analog Modeling

Chair: Alan Mantooth
Organizers: Joel R. Phillips, Kartikeya Mayaram
35.1 Behavioral Modeling of (Coupled) Harmonic Oscillators [p. 536]
Piet Vanassche, Georges Gielen, Willy Sansen

A new approach is presented for the construction of behavioral models for harmonic oscillators and sets of coupled harmonic oscillators. The models can be used for system-level simulations and trade-off analysis. Besides the steady-state behavior, the model also takes the transient behavior of the oscillation amplitudes and phase differences into account. This behavior is modeled using a set of linear differential equations. Their extraction from a netlist description is based upon the ideas of perturbation analysis and stochastic averaging. The modeling equations are valid in the neighbourhood of the oscillator's steady-state operating point and can be evaluated at very low computational cost. Another major advantage of the approach is that it explicitly separates the fast-varying steady-state behavior and the slow-varying transient behavior. This allows for straightforward application of multi-rate simulation techniques, greatly boosting simulation speeds. The technique is illustrated for a quadrature oscillator.
Categories and Subject Descriptors
G.1.7 [Mathematics of Computing]: Ordinary Differential Equations; I.6.5 [Computing Methodologies]: Model Development - Modeling methodologies
General Terms
Algorithms, Theory
Keywords
Averaging, Behavioral modeling, Coupled harmonic oscillators, Perturbation theory
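
For orientation, one classical form of the averaged equations such extractions produce (Krylov-Bogoliubov averaging; notation ours, not the paper's): for a weakly perturbed oscillator \( \ddot{x} + \omega_0^2 x = \varepsilon\, g(x, \dot{x}) \), writing \( x = A(t)\cos(\omega_0 t + \theta(t)) \) and averaging over one period gives the slow equations

\[
\dot{A} = -\frac{\varepsilon}{2\pi\omega_0}\int_0^{2\pi} g\!\left(A\cos\psi,\, -A\omega_0\sin\psi\right)\sin\psi \, d\psi,
\]
\[
\dot{\theta} = -\frac{\varepsilon}{2\pi\omega_0 A}\int_0^{2\pi} g\!\left(A\cos\psi,\, -A\omega_0\sin\psi\right)\cos\psi \, d\psi,
\]

which are cheap to evaluate and linearizable near the steady-state operating point.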

35.2 Model Checking Algorithms for Analog Verification [p. 542]
Walter Hartong, Lars Hedrich, Erich Barke

In this contribution we present the first method for model checking of nonlinear analog systems. Based on digital CTL model checking algorithms and results in hybrid model checking, we have developed a concept for adapting these ideas to analog systems. Using an automatic state-space subdivision method, the continuous state space is transferred into a discrete model. In doing this, the most challenging task is to retain the essential nonlinear behavior of the analog system. To describe analog specification properties, an extension to the CTL language is needed. Two small examples show the properties and advantages of this new method and the capability of the implemented prototype tool.
Categories and Subject Descriptors
I.6.m [Computing Methodologies]: Simulation and Modeling
General Terms
Algorithms, Language, Theory, Verification
Keywords
Model Checking, Formal Methods, Nonlinear Analog Systems
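
A minimal sketch of the state-space-subdivision idea (dynamics, grid size, and the one-box-per-step transition rule are our simplifications; the paper's subdivision is automatic and property-driven):

```python
import collections
def f(x, y):                      # toy nonlinear dynamics (hypothetical)
    return (y, -x - 0.5 * y)      # damped oscillator
N = 20; lo, hi = -2.0, 2.0; h = (hi - lo) / N
def succ(i, j):
    # sample the vector field at the box center; step to the neighbor box
    x = lo + (i + 0.5) * h; y = lo + (j + 0.5) * h
    dx, dy = f(x, y)
    ni = min(N - 1, max(0, i + (dx > 0) - (dx < 0)))
    nj = min(N - 1, max(0, j + (dy > 0) - (dy < 0)))
    return {(ni, nj)}
# EF(target): boxes from which the target box set is reachable
target = {(N // 2, N // 2)}
pred = collections.defaultdict(set)
for i in range(N):
    for j in range(N):
        for s in succ(i, j):
            pred[s].add((i, j))
reach = set(target); frontier = collections.deque(target)
while frontier:                   # backward reachability (fixpoint)
    s = frontier.popleft()
    for p in pred[s]:
        if p not in reach:
            reach.add(p); frontier.append(p)
print(len(reach), "of", N * N, "boxes satisfy EF(target)")
```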

35.3 Regularization of Hierarchical VHDL-AMS Models using Bipartite Graphs [p. 548]
Jochen Mades, Manfred Glesner

The powerful capability of VHDL-AMS to describe complex continuous systems in the form of differential algebraic equations (DAEs) often leads to problems during numerical simulation. This paper presents a discrete algorithm that analyzes unsolvable DAE systems and corrects the underlying hierarchical VHDL-AMS description automatically, in interaction with the designer, avoiding time-consuming manual error correction.
Categories and Subject Descriptors
G.2.2 [Discrete Mathematics]: Graph algorithms
General Terms
Algorithms, Design
Keywords
DAEs, VHDL-AMS, Regularization, Structural Solvability, Bipartite Graphs
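
The structural-solvability check at the heart of such methods can be sketched as a bipartite matching between equations and unknowns (toy system ours; networkx assumed available):

```python
import networkx as nx
# equations e1..e3, unknowns x1..x3; an edge means the variable appears
# in the equation (the structural, not numerical, view of the DAE)
G = nx.Graph()
eqs, unknowns = ["e1", "e2", "e3"], ["x1", "x2", "x3"]
G.add_nodes_from(eqs, bipartite=0)
G.add_nodes_from(unknowns, bipartite=1)
G.add_edges_from([("e1", "x1"), ("e1", "x2"), ("e2", "x2"), ("e3", "x2")])
m = nx.bipartite.maximum_matching(G, top_nodes=eqs)
matched = [e for e in eqs if e in m]
if len(matched) < len(eqs):
    # no perfect matching: the system is structurally singular, and the
    # unmatched equations point at the part needing regularization
    print("structurally singular; unmatched:", set(eqs) - set(matched))
```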

35.4 Improving the Generality of the Fictitious Magnetic Charge Approach to Computing Inductances in the Presence of Permeable Materials [p. 552]
Yehia Massoud, Jacob White

In this paper we present an improvement to the fictitious magnetic charge approach to computing inductances in the presence of permeable materials. The improvement replaces integration over a "non-piercing" surface with a line integral and an efficient quadrature scheme. Eliminating the need to generate non-piercing surfaces substantially simplifies handling problems with general geometries of permeable materials. Computational results are presented to demonstrate the accuracy and versatility of the new method.


Session 36: Advances in Timing and Simulation

Chair: David J. Hathaway
Organizers: Louis Scheffer, Narendra V. Shenoy
36.1 A General Probabilistic Framework for Worst Case Timing Analysis [p. 556]
Michael Orshansky, Kurt Keutzer

The traditional approach to worst-case static-timing analysis is becoming unacceptably conservative due to an ever-increasing number of circuit and process effects. We propose a fundamentally different framework that aims to significantly improve the accuracy of timing predictions through fully probabilistic analysis of gate and path delays. We describe a bottom-up approach for the construction of joint probability density function of path delays, and present novel analytical and algorithmic methods for finding the full distribution of the maximum of a random path delay space with arbitrary path correlations.
Categories and Subject Descriptors
J.6.1 [Computer-Aided Engineering]: Computer-aided design.
General Terms
Algorithms
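
A Monte Carlo sketch of the quantity being characterized, the distribution of the maximum over correlated path delays (toy numbers ours; the paper computes this analytically):

```python
import numpy as np
rng = np.random.default_rng(0)
mu = np.array([1.00, 0.98, 0.95])      # nominal path delays (ns)
corr = np.array([[1.0, 0.8, 0.5],      # arbitrary path correlations
                 [0.8, 1.0, 0.6],
                 [0.5, 0.6, 1.0]])
sigma = 0.05 * mu
cov = corr * np.outer(sigma, sigma)
d = rng.multivariate_normal(mu, cov, size=100_000)
circuit_delay = d.max(axis=1)          # circuit delay = max over paths
print("mean max delay:", circuit_delay.mean())
print("99.9th percentile:", np.percentile(circuit_delay, 99.9))
```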

36.2 False Timing Path Identification Using ATPG Techniques and Delay-Based Information [p. 562]
Jing Zeng, Magdy Abadir, Jacob Abraham

A well-known problem in the timing verification of VLSI circuits using static timing analysis tools is the generation of false timing paths. This leads to a pessimistic estimation of the processor speed and wasted engineering effort spent optimizing unsensitizable paths. Earlier results have shown how ATPG techniques can be used to identify false paths efficiently [6],[9], as well as how to bridge the gap between the physical design on which the static timing analysis is based and the test view on which the ATPG technique is applied to identify false paths [9]. In this paper, we demonstrate efficient techniques that identify more false timing paths by utilizing information from a list of timing paths ordered by delay. More than 10% additional false timing paths, out of the total timing paths analyzed, are identified compared to earlier results on the MPC7455, a Motorola processor implementing the PowerPC™ instruction set architecture.
Categories and Subject Descriptors
J.6 [Computer-Aided Engineering]: --Computer-aided design; B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids; F.2.3 [Analysis of Algorithms and Problem Complexity]: Tradeoffs between Complexity Measures; B.6.3 [Logic Design]: Design Aids - Optimization
General Terms
Algorithms, Performance
Keywords
static timing analysis, false timing paths, ATPG, timing slack

36.3 False-Path-Aware Statistical Timing Analysis and Efficient Path Selection for Delay Testing and Timing Validation [p. 566]
Jing-Jia Liou, Angela Krstic, Li-C. Wang, Kwang-Ting Cheng

We propose a false-path-aware statistical timing analysis framework. In our framework, cell as well as interconnect delays are assumed to be correlated random variables. Our tool can characterize statistical circuit delay distribution for the entire circuit and produce a set of true critical paths.
Categories and Subject Descriptors
B.8.2 [Hardware]: Performance Analysis and Design Aids
General Terms
Algorithm, Performance, Reliability
Keywords
Critical path selection, false path, statistical timing analysis

36.4 A Fast, Inexpensive and Scalable Hardware Acceleration Technique for Functional Simulation [p. 570]
Srihari Cadambi, Chandra S. Mulpuri, Pranav N. Ashar

We introduce a novel approach to accelerating functional simulation. The key attributes of our approach are high performance, low cost, scalability, and low turn-around time (TAT). We achieve speedups between 25 and 2000x over zero-delay event-driven simulation and between 75 and 1000x over cycle-based simulation on benchmark and industrial circuits, while maintaining the cost, scalability, and TAT advantages of simulation. Owing to these attributes, we believe that such an approach has the potential for very wide deployment as a replacement or enhancement for existing simulators. Our technology relies on a VLIW-like virtual simulation processor (SimPLE) mapped to a single FPGA on an off-the-shelf PCI board. Primarily responsible for the speed are (i) parallelism in the processor architecture, (ii) a high pin count on the FPGA enabling large instruction bandwidth, and (iii) a high-speed (124 MHz on Xilinx Virtex-II) single-FPGA implementation of the processor with regularity-driven efficient place and route. Companion to the processor is the very fast SimPLE compiler, which achieves compilation rates of 4 million gates/hour. To simulate the netlist, the compiled instructions are streamed through the FPGA along with the simulation vectors. This architecture plugs naturally into any existing HDL simulation environment. We have a working prototype based on a commercially available PCI-based FPGA board.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - Simulation, Verification
General Terms
Algorithms, Performance, Experimentation, Verification
Keywords
Functional simulation, hardware acceleration, VLIW, FPGA


Session 37: PANEL: Formal Verification Methods: Getting around the Brick Wall

Chair: David Dill
Organizers: Nate James, Shishpal Rawat
Panelists: Gérard Berry, Limor Fix, Harry Foster, Rajeev Ranjan, Gunnar Stålmarck, Curt Widdoes [p. 576]

Do formal verification tools and methodologies require a drastic overhaul to move beyond equivalence checking? Equivalence checking catches errors in synthesis and in local hand-modifications to designs. However, powerful formal verification technologies are emerging to combat "behavioral" errors, which represent today's biggest verification problems. Nonetheless, formal verification experts are split on how formal tools should adapt to this challenge. Some of our panelists feel that designers can sufficiently benefit from new formal verification technologies by making incremental changes to current methodologies. Others, however, argue that major changes are required to reap meaningful benefits from these new technologies. Just how much change is enough? What is the capacity of our current tools, and what is limiting the full deployment of FV technology? Our panel of experts, consisting of users, tool providers, and core engine builders, will answer these challenging questions. The panel will debate these issues while discussing real-life examples from the user base. They will provide a perspective on how the progression of technology will bring the real promise of formal verification to the user base.


Session 38: Routing and Buffering

Chair: Noel Menezes
Organizers: Charles J. Alpert, Steven Teig
38.1 S-Tree: A Technique for Buffered Routing Tree Synthesis [p. 578]
Milos Hrkic, John Lillis

We present the S-Tree algorithm for the synthesis of buffered interconnects. The approach incorporates a unique combination of real-world issues (handling of routing and buffer blockages, cost minimization, critical sink isolation, sink polarities), robustness, and scalability. The algorithm achieves slack comparable to that of buffered P-Tree [7] while using fewer resources (wire and buffers) in an order of magnitude less CPU time.
Categories and Subject Descriptors
B.7.2 [Hardware]: Integrated Circuits - Design Aids
General Terms
Algorithms
Keywords
timing algorithms, Steiner trees, routing, buffer insertion.
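
For background, buffered-tree synthesis algorithms in this family build on the classic bottom-up (capacitance, required-time) dynamic program; a minimal single-path sketch with invented constants:

```python
# toy van Ginneken-style dynamic program along a ten-segment net
# (illustrative constants; the real S-Tree handles trees, blockages, polarity)
R_W, C_W = 0.1, 0.2                # wire resistance/capacitance per segment
R_B, C_B, D_B = 0.5, 0.05, 0.3     # buffer output R, input C, intrinsic delay
def prune(cands):
    """Keep only non-dominated (load cap, required time) pairs."""
    out, best_q = [], float("-inf")
    for c, q in sorted(cands, key=lambda t: (t[0], -t[1])):
        if q > best_q:
            out.append((c, q)); best_q = q
    return out
cands = [(0.1, 5.0)]               # at the sink: load cap, required time
for _ in range(10):                # walk from the sink toward the source
    cands = [(c + C_W, q - R_W * (C_W / 2 + c)) for c, q in cands]  # wire
    cands += [(C_B, q - D_B - R_B * c) for c, q in cands]  # optional buffer
    cands = prune(cands)
print("best required time at the source:", max(q for _, q in cands))
```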

38.2 An Algorithm for Integrated Pin Assignment and Buffer Planning [p. 584]
Hua Xiang, Xiaoping Tang, D. F. Wong

The buffer block methodology has become increasingly popular as more and more buffers are needed in deep-submicron design, and it leads to many challenging problems in physical design. In this paper, we present a polynomial-time exact algorithm for integrated pin assignment and buffer planning for all two-pin nets from one macro block (the source block) to all other blocks of a given buffer block plan, while minimizing the total cost αW + βR for any positive α and β, where W is the total wire length and R is the number of buffers. By applying this algorithm iteratively (picking one block as the source block each time), we obtain a polynomial-time algorithm for pin assignment and buffer planning for nets among multiple macro blocks. Experimental results demonstrate its efficiency and effectiveness.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids - Placement and routing; J.6 [Computer Applications]: Computer-Aided Engineering - Computer-aided design
General Terms
Algorithms, Design, Performance, Theory
Keywords
buffer insertion, pin assignment, min-cost maximum flow
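
The min-cost maximum flow keyword suggests the computational core; a toy assignment of source-block pins to buffer-block entries (graph, costs, and networkx usage are our illustration):

```python
import networkx as nx
# toy: assign 2 source-block pins to buffer-block entries, minimizing
# total wirelength; capacities make the assignment one-to-one
G = nx.DiGraph()
G.add_edge("s", "pinA", capacity=1, weight=0)
G.add_edge("s", "pinB", capacity=1, weight=0)
costs = {"pinA": {"bb1": 3, "bb2": 5}, "pinB": {"bb1": 4, "bb2": 2}}
for pin, opts in costs.items():
    for bb, w in opts.items():
        G.add_edge(pin, bb, capacity=1, weight=w)
G.add_edge("bb1", "t", capacity=1, weight=0)
G.add_edge("bb2", "t", capacity=1, weight=0)
flow = nx.max_flow_min_cost(G, "s", "t")
print("min total cost:", nx.cost_of_flow(G, flow))  # -> 5 (A->bb1, B->bb2)
```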

38.3 An Efficient Routing Database [p. 590]
Narendra V. Shenoy, William Nicholls

Routing is an important problem in the process of design creation. In this paper, we focus on the problem of designing a database for the non-partitioned routing problem. New technology libraries describe constraints that are hard to manage in grid-based approaches to the routing database. While general region-query-based data structures have been proposed, they typically suffer from speed problems when applied to large blocks. We introduce an interval-based approach. It provides more flexibility than grid-based techniques, exploits the notion of a preferred direction for metal layers to manage memory efficiently, and supports efficient region queries. Finally, we present a comparison study on this database for real industrial designs.
Categories and Subject Descriptors
J.6 [Computer-Aided Engineering]: Computer-aided design
General Terms
Algorithms, Design
Keywords
Physical design, routing, database
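
A minimal sketch of the interval idea on a single preferred-direction track (data structure and API are ours; the paper's database adds layers, vias, and technology constraints):

```python
import bisect
class Track:
    """Wires on one routing track, kept as sorted non-overlapping intervals."""
    def __init__(self):
        self.starts, self.ends = [], []
    def add(self, lo, hi):
        i = bisect.bisect_left(self.starts, lo)
        self.starts.insert(i, lo); self.ends.insert(i, hi)
    def query(self, lo, hi):
        """Return all intervals overlapping [lo, hi] (a 1-D region query)."""
        i = bisect.bisect_left(self.starts, lo)
        if i > 0 and self.ends[i - 1] > lo:
            i -= 1                      # the interval just left may overlap
        out = []
        while i < len(self.starts) and self.starts[i] < hi:
            out.append((self.starts[i], self.ends[i])); i += 1
        return out
t = Track(); t.add(0, 5); t.add(10, 14); t.add(20, 25)
print(t.query(4, 12))                   # -> [(0, 5), (10, 14)]
```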


Session 39: System on Chip Design

Chair: Rolf Ernst
Organizers: Krzysztof Kuchcinski, Miodrag Potkonjak
39.1 Automatic Generation of Embedded Memory Wrapper for Multiprocessor SoC [p. 596]
Ferid Gharsalli, Samy Meftali, Frédéric Rousseau, Ahmed A. Jerraya

Embedded memory plays a critical role in improving the performance of systems-on-chip (SoC). In this paper, we present a new methodology for embedded memory design for application-specific multiprocessor systems-on-chip. This approach facilitates the integration of standard memory components. The concept of a memory wrapper allows automatic adaptation of physical memory interfaces to a communication network that may have a different number of access ports. We also give a generic architecture for producing this memory wrapper. The approach has been applied successfully to a low-level image processing application.
Categories and Subject Descriptors
M1.9 and T4.5: High level and architectural synthesis
General Terms
Design, Experimentation.
Keywords
Embedded Memory, System-on-Chip, Memory access, Memory Wrapper Generation.

39.2 A Novel Synthesis Technique for Communication Controller Hardware from Declarative Data Communication Protocol Specifications [p. 602]
Robert Siegmund, Dietmar Müller

An innovative methodology for the efficient design of communication controller hardware for popular protocols such as ATM, USB, or CAN is proposed. In our approach, controller hardware in the form of RTL models is synthesized from a formal specification of a communication protocol. In contrast to previously published work on hardware synthesis from protocol specifications, our approach synthesizes a complete communication architecture, consisting of the interacting transaction producer and consumer controllers as well as the interconnect between them, from one single protocol specification in the same synthesis tool run. This ensures conformity of all producer and consumer controllers to the protocol specification while greatly reducing the modeling effort for the controller specifications. The formalism used for protocol specification and a corresponding hardware synthesis algorithm are presented. The methodology has been applied to the design of various communication controllers, including IEC 14443 wireless SmartCard, ATM, and CAN. The novelty and efficiency of our methodology are demonstrated through comparison to state-of-the-art protocol synthesis tools such as [10].
Categories and Subject Descriptors
B.4.4 [Hardware]: Input/Output and Data Communications Devices - Performance Analysis and Design Aids; B.5.2 [Hardware]: Register-Transfer-Level-Implementation - Design Aids
General Terms
Algorithms, Design, Languages
Keywords
Protocol Specification, Controller Hardware Synthesis, Interface-based Design

39.3 An Integrated Algorithm for Memory Allocation and Assignment in High-level Synthesis [p. 608]
Jaewon Seo, Taewhan Kim, Preeti R. Panda

With increasing design complexity and performance requirements, data arrays in a behavioral specification are usually mapped to fast on-chip memories in behavioral synthesis. This paper describes a new algorithm that overcomes two limitations of previous work on the problem of memory allocation and array mapping to memories. Specifically, its key features are (1) a tight link to the scheduling effect, which was totally or partially ignored by existing memory synthesis systems, and (2) support for non-uniform access speeds among the ports of memories, which greatly diversifies the possible (practical) memory configurations. Experimental data on a set of benchmark filter designs are provided to show the effectiveness of the proposed exploration strategy in finding globally best memory configurations.
Categories and Subject Descriptors
B.5.2 [Register-Transfer-Level Implementation]: Design Aids
General Terms
Algorithms, Performance, Design
Keywords
memory allocation, memory assignment, memory design, scheduling effect

39.4 High-Level Synthesis of Multiple-Precision Circuits Independent of Data-Objects Length [p. 612]
M. C. Molina, J. M. Mendías, R. Hermida

This paper presents a heuristic method to perform the high-level synthesis of multiple-precision specifications. The scheduling is based on balancing the number of bits calculated per cycle, and the allocation on bit-level reuse of the hardware resources. The implementations obtained are multiple-precision datapaths independent of the number and widths of the specification operations. As a result, impressive area savings are achieved in comparison with implementations produced by conventional algorithms.
Categories and Subject Descriptors
B.5.2 [Register-transfer-level Implementation]: Design Aids - Automatic synthesis, optimization.
General Terms
Algorithms, Design.
Keywords
High-level synthesis, multiple-precision, scheduling, allocation.


Session 40: Timing Analysis and Memory Optimization for Embedded Systems

Chair: Giuseppe Lipari
Organizers: Marco Di Natale, Xiaobo (Sharon) Hu
40.1 Schedulability of Event-Driven Code Blocks in Real-Time Embedded Systems [p. 616]
Samarjit Chakraborty, Thomas Erlebach, Simon Künzli, Lothar Thiele

Many real-time embedded systems involve a collection of independently executing event-driven code blocks with hard real-time constraints. Tasks in many such systems, like network processors, are either not preemptable or have restrictions on the number of preemptions allowed. All previous work on the schedulability analysis of such systems either has exponential complexity or allows an unbounded number of preemptions, and is usually based on heuristics. In this paper we present the exact necessary and sufficient conditions under EDF for the schedulability of such a collection of code blocks in a non-preemptive environment, and give efficient algorithms for testing them. We validate our analytical results with experiments and show that the schedulability analysis problem in such systems can be exactly and efficiently solved in practice.
Categories and Subject Descriptors
C.3 [Computer Systems Organization]: Special-purpose and application-based systems - Real-time and embedded systems
General Terms
Algorithms, Performance, Verification
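
For orientation, a minimal sketch of the classical processor-demand schedulability test for preemptive EDF with implicit deadlines (task set ours; the paper derives the exact conditions for the non-preemptive case):

```python
from math import gcd
from functools import reduce
def lcm(a, b):
    return a * b // gcd(a, b)
tasks = [(1, 4), (2, 6), (3, 12)]       # (C_i, T_i = D_i), hypothetical set
H = reduce(lcm, (T for _, T in tasks))  # hyperperiod bounds the check points
def dbf(t):
    # demand bound: total execution that must finish by time t under EDF
    return sum(((t - T) // T + 1) * C for C, T in tasks if t >= T)
ok = all(dbf(t) <= t for t in range(1, H + 1))
print("EDF-schedulable (processor demand test):", ok)
```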

40.2 Associative Caches in Formal Software Timing Analysis [p. 622]
Fabian Wolf, Jan Staschulat, Rolf Ernst

Precise cache analysis is crucial to formally determining program running time. As cache simulation is unsafe with respect to the conservative running-time bounds required for real-time systems, current cache analysis techniques combine basic-block-level cache modeling with explicit or implicit program path analysis. We present an approach that extends instruction and data cache modeling from the granularity of basic blocks to program segments, thereby increasing overall running-time analysis precision. Data flow analysis and local simulation of program segments are combined to safely predict cache line contents for associative caches in software running-time analysis. The experiments show significant improvements in analysis precision over previous approaches on a typical embedded processor.
Categories and Subject Descriptors
C.3 [Special-Purpose and Application-Based Systems]: Real-time and embedded systems; D.2.4 [Software Engineering]: Software/Program Verification - Formal methods; C.4 [Performance of Systems]: Performance Attributes
General Terms
Algorithms, Performance, Verification
Keywords
Cache Analysis, Embedded Software, Real-Time, Timing Analysis

40.3 Compiler-Directed Scratch Pad Memory Hierarchy Design and Management [p. 628]
M. Kandemir, A. Choudhary

One of the primary challenges in embedded system design is designing the memory hierarchy and restructuring the application to take advantage of it. This task is particularly important for embedded image and video processing applications that make heavy use of large multidimensional arrays of signals and nested loops. In this paper, we show that a simple reuse vector/matrix abstraction can provide the compiler with useful information in a concise form. Using this information, the compiler can either adapt the application to an existing memory hierarchy or derive a suitable memory hierarchy itself. Our initial results indicate that the compiler is very successful both in optimizing code for a given memory hierarchy and in designing a hierarchy with a reasonable performance/size ratio.
Categories and Subject Descriptors
B.3 [Hardware]: Memory Structures; D.3.4 [Programming Languages]: Processors - Compilers, Optimization
General Terms
Design, Experimentation, Performance
Keywords
Data Reuse, Scratch Pad Memory, Memory Hierarchy
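
To make the abstraction concrete (example ours): for the access pair A[i][j] and A[i][j-1] in a two-deep loop nest, the reuse vector is

\[
\vec{r} = \begin{pmatrix} 0 \\ 1 \end{pmatrix},
\]

meaning the same array element is touched again one iteration later in the innermost loop; the compiler can exploit this temporal reuse by keeping the element in a scratch-pad buffer instead of re-fetching it from off-chip memory.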


Session 41: Processors and Accelerators for Embedded Applications

Chair: Chris Rowen
Organizers: Kurt Keutzer, Majid Sarrafzadeh
41.1 Unlocking the Design Secrets of a 2.29 Gb/s Rijndael Processor [p. 634]
Patrick R. Schaumont, Henry Kuo, Ingrid M. Verbauwhede

This contribution describes the design and performance testing of an Advanced Encryption Standard (AES) compliant encryption chip that delivers 2.29 Gb/s of encryption throughput at 56 mW of power consumption. We discuss how the high-level reference specification in C is translated into a parallel architecture. Design decisions are motivated from a system-level viewpoint. The prototyping setup is discussed.
Categories and Subject Descriptors
C.3 [Special-Purpose and Application-based Systems]
General Terms
Measurement, Performance, Design, Security.
Keywords
Rijndael, Encryption, Domain-Specific, Low-Power.

41.2 The iCORE™ 520 MHz Synthesizable CPU Core [p. 640]
Nick Richardson, Lun Bin Huang, Razak Hossain, Tommy Zounes, Naresh Soni, Julian Lewis

This paper describes a new implementation of the ST20-C2 CPU architecture. The design involves an eight-stage pipeline with hardware support to execute up to three instructions in a cycle. Branch prediction is based on a 2-bit predictor scheme with a 1024-entry Branch History Table, a 64-entry Branch Target Buffer, and a 4-entry Return Stack. The implementation of all blocks in the processor was based on synthesized logic generation and automatic place and route. The full design of the CPU, from microarchitectural investigations to layout, required approximately eight person-years. The CPU core, without the caches, has an area of approximately 1.5 mm² in a 6-metal 0.18 µm CMOS process. The design operates up to 520 MHz at 1.8 V, among the highest reported speeds for a synthesized CPU core [1].
Categories and Subject Descriptors
M.2.4 [Design Methodologies]: High performance ASIC design: Optimizing microarchitecture to layout.
General Terms
Performance, Design.
Keywords
Synthesis, ASIC, CPU, embedded, st20, pipeline, cache, branch-prediction, high-frequency, microarchitecture
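
As a rough illustration of the 2-bit predictor scheme mentioned above (the table size comes from the abstract; indexing, initial state, and the workload are our toy assumptions):

```python
# toy 2-bit saturating-counter branch predictor with a direct-mapped BHT
BHT_SIZE = 1024
bht = [1] * BHT_SIZE           # 0,1 = predict not-taken; 2,3 = predict taken
def predict(pc):
    return bht[pc % BHT_SIZE] >= 2
def update(pc, taken):
    i = pc % BHT_SIZE
    bht[i] = min(3, bht[i] + 1) if taken else max(0, bht[i] - 1)
# a loop branch at pc=0x40, taken 9 times and then not taken
hits = 0
for k in range(10):
    actual = k < 9
    hits += predict(0x40) == actual
    update(0x40, actual)
print("prediction accuracy:", hits / 10)    # -> 0.8 for this toy trace
```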

41.3 A Flexible Accelerator for Layer 7 Networking Application [p. 646]
Gokhan Memik, William H. Mangione-Smith

In this paper, we present a flexible accelerator designed for networking applications. The accelerator can be utilized efficiently by a variety of Network Processor designs. Most Network Processors employ hardware accelerators for implementing key tasks. New applications require new tasks, such as pattern matching, to be performed on packets in real time. Using our proposed accelerator, we have implemented several such tasks and measured their performance. Specifically, the accelerator achieves a 25-fold improvement in the performance of pattern matching and a 10-fold improvement for tree lookup over optimized software solutions. Since the accelerator is used for different tasks, the hardware requirements are small compared to an accelerator group that implements the same set of tasks. We also present accurate analytic models to estimate the execution time of these networking tasks.
Categories and Subject Descriptors
B.2.4 [Arithmetic and logic structures]: High-Speed Arithmetic, Cost/Performance.
General Terms
Algorithms, Measurement, Performance, Design.
Keywords
Application-Specific Processor, Networking Applications, Network Processor, Accelerator, Table Lookup, Pattern Matching.


Session 42: PANEL: What's the Next EDA Driver?

Chair: Jan Rabaey
Organizers: Joachim Kunkel, Dennis Brophy
Panelists: Raul Camposano, Davoud Samani, Larry Lerner, Rick Hetherington [p. 652]

The PC industry was the major consumer of silicon in the '80s and '90s. It defined the requirements for EDA. In a world dominated by PCs, clock frequency was the ultimate measure of performance. Times have changed. Today the Internet, wireless communications, and consumer applications such as high-end gaming stations have replaced the PC as the primary driver. How we measure "cutting edge" has also changed. Raw computing speed measured in terms of clock speed has given way to alternate forms of measuring performance: MBits/s, MMACS, and Polygons/sec now capture the differing computation requirements of these domains. Depending on the application domain, weight and power are just as important as raw computational power, if not more so. How do these changes drive technology development in EDA? Which of these is the dominant EDA driver? How much of this drive is a function of EDA following the money? How much of this drive is a function of unique technical challenges posed by these domains? This diverse set of panelists, with representation from various application segments and the EDA industry, will attempt to answer these questions.


Session 43: Cross-Talk Noise Analysis and Management

Chair: Cheng-Kok Koh
Organizers: Kaushik Roy, Noel Menezes
43.1 Estimation of the Likelihood of Capacitive Coupling Noise [p. 653]
Sarma B. K. Vrudhula, David Blaauw, Supamas Sirichotiyakul

The corruption of signals due to capacitive and inductive coupling of interconnects has become a significant problem in the design of deep submicron (DSM) circuits. Noise simulators, based on worst-case assumptions, are overly pessimistic. As a result, when they are used on industrial ICs with hundreds of thousands of nets, thousands of nets are reported as having potential noise violations. There is a need to prioritize the problem nets based on the likelihood of the noise, and possibly even to eliminate them from further consideration if that likelihood is negligible. In this paper, a probabilistic approach is described that provides a quantitative means of prioritizing nets based on the likelihood of the reported noise violation. We derive upper bounds on the probability that the total noise injected on a given victim net by a specific set of aggressors exceeds a threshold. This bound is then used to determine a lower bound on the expected number of clock cycles (ENC) before the first violation occurs on a given net. Nets can be prioritized based on the ENC. We demonstrate the utility of this approach through experiments carried out on a large industrial processor design using a state-of-the-art industrial noise analysis tool. A significant and interesting result of this work is that a substantial portion (25%) of the nets were found to have an ENC of more than five years. If five years is deemed a sufficiently long time, then these nets could be eliminated from further consideration.
Categories and Subject Descriptors
B.7.2 [Hardware]: Integrated Circuits - Design Aids, Reliability and Testing
General Terms
Performance, Reliability, Verification
Keywords
Noise, Signal Integrity, Deep Submicron
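
To make the prioritization concrete (numbers ours, purely illustrative): if the derived bound gives a per-cycle violation probability \( P(\text{noise} > V_{\mathrm{th}}) \le p \), then

\[
\mathrm{ENC} \;\ge\; \frac{1}{p} \ \text{cycles},
\qquad
t_{\text{first violation}} \;\gtrsim\; \frac{1}{p\, f_{\mathrm{clk}}}.
\]

For example, p = 10^{-17} on a 1 GHz design gives roughly 10^8 seconds, i.e., on the order of three years before the first expected violation.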

43.2 Crosstalk Noise Estimation for Noise Management [p. 659]
Paul B. Morton, Wayne Dai

One of the main challenges in developing an effective crosstalk noise management strategy is to develop a crosstalk noise estimate that is both accurate and leads to a tractable optimization problem, one that can be used to optimally redistribute uncommitted routing resources to resolve crosstalk noise violations. Devgan's estimate [4] comes very close to meeting these objectives; however, it is extremely pessimistic for nets with long couplings or aggressor nets with short rise times. The increased complexity of more sophisticated estimates, such as that proposed by Vittal et al. [13], leads to a crosstalk management problem that is very hard to solve. In this paper we develop two estimates, similar in form to Devgan's estimate, that are based on local approximations of Vittal's estimate. Our estimates are substantially more accurate than Devgan's estimate while still allowing us to formulate the crosstalk management problem in a form that can be solved efficiently.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids - Layout, Placement and routing
General Terms
design, performance
Keywords
crosstalk, noise, estimation, local approximation, optimal spacing, noise management

43.3 Variable Frequency Crosstalk Noise Analysis: A Methodology to Guarantee Functionality from DC to FMAX [p. 665]
Byron Krauter, David Widiger

One means of reducing pessimism in crosstalk analysis is to consider timing orthogonality. While earlier works have addressed the temporal alignment of timing windows [1, 2, 3, 4], these treatments have overlooked one key point: crosstalk noise failures are frequency-dependent. A chip that functions at one frequency can fail due to crosstalk noise at faster and slower frequencies. Moreover, because system developers and manufacturers need chips that operate over a wide range of frequencies, noise analysis tools should guarantee a wide range of operation. In this paper, we explain this phenomenon and propose a simple procedure to include timing information and guarantee functionality from DC to fmax.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids - simulation, verification
General Terms
Algorithms, Design, Verification
Keywords
crosstalk, noise analysis, timing windows, timing orthogonality, GCD frequency, LCM window, frequency-dependent noise
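
A toy illustration of why window alignment, and hence noise, is frequency-dependent (periods, windows, and the latch model are our assumptions; the keywords' GCD frequency and LCM window appear directly):

```python
from math import gcd
# victim latches at period T_v, aggressor switches at period T_a (ps, toy)
T_v, T_a = 750, 1000
T_lcm = T_v * T_a // gcd(T_v, T_a)   # alignments repeat every LCM window
window = (100, 250)                  # noisy window inside the aggressor period
violations = []
for k in range(T_lcm // T_a):        # each aggressor cycle in one LCM window
    lo, hi = k * T_a + window[0], k * T_a + window[1]
    # victim latch edges that fall inside the aggressor's noisy window
    for e in range(lo // T_v, hi // T_v + 1):
        if lo <= e * T_v <= hi:
            violations.append(e * T_v)
print("noisy latch edges per LCM window:", violations)
# changing T_v shifts which edges align: the failure is frequency-dependent
```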

43.4 Towards Global Routing With RLC Crosstalk Constraints [p. 669]
James D. Z. Ma, Lei He

Conventional global routing minimizes total wire length and congestion. Experiments using large industrial benchmark circuits show that up to 24% of the nets in such routing solutions may have RLC crosstalk violations at a 3 GHz clock. We develop an extremely efficient length-scaled Keff (LSK) model that has high fidelity for long-range RLC crosstalk. We formulate an extended global routing problem (denoted GSINO) that considers simultaneous shield insertion and net ordering with RLC crosstalk constraints, and then propose an effective three-phase GSINO algorithm. The GSINO algorithm completely eliminates the RLC crosstalk violations and has small area and wire-length overhead compared to conventional routing.
Categories and Subject Descriptors
B.7.2 [Design Aids] - Placement and Routing.
General Terms
Algorithms, Design, Theory


Session 44: Test Cost Reduction for SOCS

Chair: Yervant Zorian
Organizers: Seiji Kajihara, Kwang-Ting (Tim) Cheng
44.1 Reduction of SOC Test Data Volume, Scan Power and Testing Time Using Alternating Run-length Codes [p. 673]
Anshuman Chandra, Krishnendu Chakrabarty

We present a test resource partitioning (TRP) technique that simultaneously reduces test data volume, test application time, and scan power. The proposed approach is based on the use of alternating run-length codes for test data compression. Experimental results for the larger ISCAS-89 benchmarks and an IBM production circuit show that reduced test data volume, reduced test application time, and low-power scan testing can indeed be achieved in all cases.
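
A toy sketch of the underlying run-length idea (the actual alternating run-length code maps run lengths to variable-length codewords; our version just extracts the alternating run lengths):

```python
def alternating_runs(bits):
    """Decompose a scan vector into runs of alternating 0s and 1s.

    Simplified illustration: a real alternating run-length code would
    then encode each length with a variable-length codeword.
    """
    runs, i = [], 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs.append(j - i)
        i = j
    return bits[0], runs      # first symbol plus alternating run lengths

print(alternating_runs("0000001111100000"))   # -> ('0', [6, 5, 5])
```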

44.2 Embedded Test Control Schemes for Compression in SOCs [p. 679]
Douglas Kay, Sung Chung, Samiha Mourad

This paper presents novel control schemes for testing embedded cores in a system on a chip (SOC). The approach converts a traditional BIST scheme into an externally controllable scheme to achieve high test quality within optimal test execution time, without inserting test points. The scheme promotes design and test reuse without revealing IP information.
Categories and Subject Descriptors
Testing; Test generation and debugging
General Terms
Algorithms, Performance, Design, Experimentation
Keywords
BIST, Data Compression, Test Resource Allocation

44.3 Wrapper/TAM Co-Optimization, Constraint-Driven Test Scheduling, and Tester Data Volume Reduction for SOCs [p. 685]
Vikram Iyengar, Krishnendu Chakrabarty, Erik Jan Marinissen

This paper describes an integrated framework for plug-and-play SOC test automation, built around a new approach to wrapper/TAM co-optimization that uses rectangle packing. We first tailor TAM widths to each core's test data needs. We then use rectangle packing to develop an integrated scheduling algorithm that incorporates precedence and power constraints in the test schedule, while allowing the SOC integrator to designate a group of tests as preemptable. Finally, we study the relationship between TAM width and tester data volume to identify an effective TAM width for the SOC. We present experimental results for non-preemptive, preemptive, and power-constrained test scheduling, as well as for effective TAM width identification, for an academic benchmark SOC and three industrial SOCs.
Categories and Subject Descriptors
B.7.3 [Integrated circuits]: Reliability and testing
General Terms
Algorithms, Design, Experimentation, Reliability, Verification


Session 45: Scheduling Techniques for Embedded Systems

Chair: Rolf Ernst
Organizers: Diederik Verkest, Donatella Sciuto
45.1 Communication Architecture Based Power Management for Battery Efficient System Design [p. 691]
Kanishka Lahiri, Anand Raghunathan, Sujit Dey

Communication-based power management (CBPM) is a new battery-driven, system-level power management methodology in which the system-level communication architecture regulates the execution of the various system components, with the aim of improving battery efficiency and, hence, battery life. Unlike conventional power management policies (which attempt to shut down idle components efficiently), CBPM may delay the execution of selected system components even when they are active, in order to adapt the system-level current discharge profile to the battery's characteristics. In this paper, we present a methodology for the design of CBPM-based systems, which consists of system-level performance and power profiling, battery discharge analysis, instrumentation of system components to facilitate CBPM, definition of CBPM policies, and generation of the CBPM-based system architecture. We present extensive evaluations of CBPM and demonstrate its application to the design of an IEEE 802.11 Wireless LAN MAC processor system. Our results indicate that CBPM-based systems are significantly more battery efficient than those based on conventional power management techniques. Further, we demonstrate that CBPM enables design-time as well as run-time tradeoffs between system performance and battery life.
Categories and Subject Descriptors
C.5.4 [Computer System Implementation]: VLSI Systems
General Terms
Algorithms, Design, Experimentation
Keywords
Embedded Systems, Communication Architectures, Battery Efficiency, Power Management, Low Power Design

45.2 Scheduler-Based DRAM Energy Management [p. 697]
V. Delaluz, A. Sivasubramaniam, M. Kandemir, N. Vijaykrishnan, M. J. Irwin

Previous work on DRAM power-mode management focused on hardware-based techniques and compiler-directed schemes to explicitly transition unused memory modules to low-power operating modes. While hardware-based techniques require extra logic to keep track of memory references and make decisions about future mode transitions, compiler-directed schemes can only work on a single application at a time and demand sophisticated program analysis support. In this work, we present an operating system (OS) based solution where the OS scheduler directs the power mode transitions by keeping track of module accesses for each process in the system. This global view combined with the flexibility of a software approach brings large energy savings at no extra hardware cost. Our implementation using a full-fledged OS shows that the proposed technique is also very robust when different system and workload parameters are modified, and provides the first set of experimental results for memory energy optimization with a multiprogrammed workload on a real platform. The proposed technique is applicable to both embedded systems and high-end computing platforms.
Categories and Subject Descriptors
D.4.0 [Operating Systems]: General; D.4.1 [Operating Systems]: Process Management - Scheduling; B.3.1 [Memory Structures]: Dynamic Memory (DRAM)
General Terms
Management, Design, Experimentation
Keywords
Energy Management, DRAM, Operating Systems, Scheduler, Energy Estimation.
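
A minimal sketch of scheduler-directed mode control (the process-to-module map and the transition policy are our toy assumptions):

```python
# toy OS-scheduler-directed DRAM power-mode control (hypothetical policy):
# the scheduler knows which memory modules each process touches and
# transitions the rest to a low-power mode at every context switch
MODULES = 8
active_modules = {"procA": {0, 1}, "procB": {2, 3, 4}}
mode = ["low_power"] * MODULES

def context_switch(next_proc):
    need = active_modules[next_proc]
    for m in range(MODULES):
        mode[m] = "active" if m in need else "low_power"

context_switch("procA")
print(mode)   # modules 0 and 1 active; the rest stay in a low-power mode
```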

45.3 An Integer Linear Programming Based Approach for Parallelizing Application in On-Chip Multiprocessors [p. 703]
I. Kadayif, M. Kandemir, U. Sezer

With energy consumption becoming one of the first-class optimization parameters in computer system design, compilation techniques that consider performance and energy simultaneously are expected to play a central role. In particular, compiling a given application code under performance and energy constraints is becoming an important problem. In this paper, we focus on an on-chip multiprocessor architecture and present a parallelization strategy based on integer linear programming. Given an array-intensive application, our optimization strategy determines the number of processors to be used in executing each nest based on the objective function and additional compilation constraints provided by the user. Our initial experience with this strategy shows that it is very successful in optimizing array-intensive applications on on-chip multiprocessors under energy and performance constraints.
Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors - Compilers, Optimization
General Terms
Design, Experimentation, Performance
Keywords
Embedded Systems, Loop-Level Parallelism, Constraint-Based Compilation
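
One plausible shape for such an ILP (our notation, not necessarily the paper's): with binary variables x_{n,p} selecting p processors for nest n,

\[
\min \sum_{n}\sum_{p} E(n,p)\, x_{n,p}
\quad \text{s.t.}\quad
\sum_{p} x_{n,p} = 1 \;\;\forall n,
\qquad
\sum_{n}\sum_{p} T(n,p)\, x_{n,p} \le T_{\max},
\qquad
x_{n,p} \in \{0,1\},
\]

where E(n,p) and T(n,p) are the profiled energy and execution time of nest n on p processors, and T_max is the user-supplied performance constraint.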


Session 46: Special Session: Designing SoCs for Yield Improvement

Chair: Srivaths Ravi
Organizers: Anand Raghunathan, Alfred E. Dunlop
46.1 Embedding Infrastructure IP for SOC Yield Improvement [p. 709]
Yervant Zorian

In addition to functional IP cores, today's SOCs necessitate embedding a special family of IP blocks called infrastructure IP. These blocks are meant to ensure the manufacturability of the SOC and to achieve adequate levels of yield and reliability. Infrastructure IP leverages manufacturing knowledge and feeds the information back into the design phase. This paper analyzes the key trends and challenges in manufacturing susceptibility and field reliability that necessitate the use of such infrastructure IP. It also describes several examples of such embedded IP for detection, analysis, and correction.
Categories and Subject Descriptors
B.8.1 [Performance and Reliability]: Reliability, testing and fault tolerance.
General Terms
Measurement, Performance, Design, Reliability, and Verification.
Keywords
Semiconductor IP, Embedded Test and Repair, Yield Optimization, Test Resource Partitioning.

46.2 Using Embedded FPGAs for SoC Yield Improvement [p. 713]
Miron Abramovici, Charles Stroud, Marty Emmert

In this paper we show that an embedded FPGA core is an ideal host to implement infrastructure IP for yield improvement in a bus-based SoC. We present methods for testing, diagnosing, and repairing embedded FPGAs, for which complete testability is achieved without any area overhead or performance degradation. We show how an FPGA core can provide embedded testers for other cores in the SoC, so that cores designed to be tested with external vectors can be tested with BIST, and the entire SoC can be tested with a low-cost tester.
Categories and Subject Descriptors
B.8.1 [Performance and Reliability]: Reliability, Testing, and Fault-Tolerance
General Terms
Algorithms, Design, Reliability


Session 47: Advances in SAT

Chair: Joao Marques-Silva
Organizers: Malgorzata Merek-Sadowska, Soha M. Hassoun
47.1 A Proof Engine Approach to Solving Combinational Design Automation Problems [p. 725]
Gunnar Andersson, Per Bjesse, Byron Cook, Ziyad Hanna

There are many approaches available for solving combinational design automation problems encoded as tautology or satisfiability checks. Unfortunately, no single analysis gives adequate performance for all problems of interest, and it is therefore critical to be able to combine approaches. In this paper, we present a proof engine framework where individual analyses are viewed as strategies: functions between different proof states. By defining our proof engine in such a way that we can compose strategies to form new, more powerful strategies, we achieve synergistic effects between the individual methods. The resulting framework has enabled us to develop a small set of powerful composite default strategies. We describe several strategies and their interplay; one of the strategies, variable instantiation, is new. The strength of our approach is demonstrated with experimental results showing that our default strategies can achieve up to several orders of magnitude of speed-up compared to BDD-based techniques and search-based satisfiability solvers such as ZCHAFF.
Categories and Subject Descriptors
B.6.3 [Hardware]: Logic Design - Design Aids
General Terms
Algorithms, Design, Verification
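
A minimal sketch of the strategies-as-composable-functions idea (proof states as dictionaries and stand-in engines; all names are ours):

```python
# strategies are functions ProofState -> ProofState; composition yields
# new strategies, so engines can be mixed without changing their code
def compose(*strategies):
    def run(state):
        for s in strategies:
            state = s(state)
            if state.get("solved"):     # stop once some strategy finishes
                break
        return state
    return run

def bdd_sweep(state):                   # hypothetical stand-in engines
    return dict(state, simplified=True)

def sat_search(state):
    return dict(state, solved=state.get("simplified", False))

default_strategy = compose(bdd_sweep, sat_search)
print(default_strategy({"formula": "a & ~a -> b"}))
```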

47.2 Solving Difficult SAT Instances in the Presence of Symmetry [p. 731]
Fadi A. Aloul, Arathi Ramani, Igor L. Markov, Karem A. Sakallah

Research in algorithms for Boolean satisfiability and their implementations [23, 6] has recently outpaced benchmarking efforts. Most of the classic DIMACS benchmarks [10] can be solved in seconds on commodity PCs. More recent benchmarks take longer to solve because of their large size, but are still solved in minutes [25]. Yet, small and difficult SAT instances must exist because Boolean satisfiability is NP-complete. We propose an improved construction of symmetry-breaking clauses [9] and apply it to achieve significant speed-ups over current state-of-the-art in Boolean satisfiability. Our techniques are formulated as preprocessing and can be applied to any SAT solver without changing its source code. We also show that considerations of symmetry may lead to more efficient reductions to SAT in the routing domain. Our work articulates SAT instances that are unusually difficult for their size, including satisfiable instances derived from routing problems. Using an efficient implementation to solve the graph automorphism problem [18, 20, 22], we show that in structured SAT instances difficulty may be associated with large numbers of symmetries.
Categories and Subject Descriptors
I.1.2 [Algorithms]: Algebraic algorithms.
General Terms
Algorithms, Experimentation, Verification.
Keywords
SAT, CNF, faster, search, symmetry, difficult, instances, speed-up.

47.3 Satometer: How Much Have We Searched? [p. 737]
Fadi A. Aloul, Brian D. Sierawski, Karem A. Sakallah

We introduce Satometer, a tool that can be used to estimate the percentage of the search space actually explored by a backtrack SAT solver. Satometer calculates a normalized minterm count for those portions of the search space identified by conflicts. The computation is carried out using a zero-suppressed BDD data structure and can have adjustable accuracy. The data provided by Satometer can help diagnose the performance of SAT solvers and can shed light on the nature of a SAT instance.
Categories and Subject Descriptors
I.1 [Symbolic and Algebraic Manipulation]: Expressions and Their Representation, Algorithms.
General Terms
Algorithms, Measurement, Experimentation, Verification.
Keywords
BDDs, ZBDDs, SAT, CNF, backtrack search, conflict diagnosis, search space coverage, search progress.
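
The flavor of the computation, ignoring overlap between conflict subcubes (the paper's ZBDD-based count is exact; numbers are ours):

```python
# each conflict prunes the subcube fixed by its partial assignment; with
# n variables, a conflict at decision depth d covers 2**(n - d) minterms.
# Summing naively over-counts overlapping subcubes, which is precisely
# what the zero-suppressed BDD representation avoids.
n = 20
conflict_depths = [5, 7, 7, 9, 12]      # hypothetical recorded conflicts
covered = sum(2 ** (n - d) for d in conflict_depths)
print("fraction of search space explored <=", covered / 2 ** n)
```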

47.4 SAT with Partial Clauses and Back-Leaps [p. 743]
Slawomir Pilarski, Gracia Hu

This paper presents four new powerful SAT optimization techniques: partial clauses, back-leaps, immediate implications, and local decisions. These optimization techniques can be combined with SAT mechanisms used in Chaff, SATO, and GRASP to develop a new solver that significantly outperforms its predecessors on a large set of benchmarks. Performance improvements for standard benchmark groups vary from 1.5x to 60x. Performance improvements measured using VLIW microprocessor benchmarks amount to 3.31x.

47.5 Combining Strengths of Circuit-based and CNF-based Algorithms for a High-Performance SAT Solver [p. 747]
Malay K. Ganai, Lintao Zhang, Pranav Ashar, Aarti Gupta, Sharad Malik

We propose Satisfiability Checking (SAT) techniques that lead to a consistent performance improvement of up to 3x over state-of-the-art SAT solvers like Chaff on important problem domains in VLSI CAD. We observe that in circuit oriented applications like ATPG and verification, different software engineering techniques are required for the portions of the formula corresponding to learnt clauses compared to the original formula. We demonstrate that by employing the same innovations as in advanced CNF-based SAT solvers, but in a hybrid approach where these two portions of the formula are represented differently and processed separately, it is possible to obtain the consistently highest performing SAT solver for circuit oriented problem domains. We also present controlled experiments to highlight where these gains come from. Once it is established that the hybrid approach is faster, it becomes possible to apply low overhead circuit-based heuristics that would be unavailable in the CNF domain for greater speedup.
Categories and Subject Descriptors
J.6 [Computer Applications]: Computer-aided Engineering - Computer-aided design.
General Terms
Algorithms, Performance, Experimentation, Verification.
Keywords
Boolean Satisfiability (SAT), Boolean Constraint Propagation (BCP), Conjunctive Normal Form (CNF), Bounded Model Checking (BMC).


Session 48: Inductance and Substrate Analysis

Chair: Noel Menezes
Organizers: Jaijeet Roychowdhury, Mustafa Celik
48.1 A Solenoidal Basis Method For Efficient Inductance Extraction [p. 751]
Hemant Mahawar, Vivek Sarin, Weiping Shi

The ability to compute the parasitic inductance of the interconnect is critical to the timing verification of modern VLSI circuits. A challenging aspect of inductance extraction is the solution of large, dense, complex linear systems of equations via iterative methods. Accelerating the convergence of the iterative method through preconditioning is made difficult by the unavailability of the system matrix. This paper presents a novel algorithm that solves these linear systems by restricting current to a discrete solenoidal subspace in which Kirchhoff's law is obeyed, and solving a reduced system via an iterative method such as GMRES. A preconditioner based on the Green's function is used to achieve near-optimal convergence rates in several cases. Experiments on a number of benchmark problems illustrate the advantages of the proposed method over FastHenry.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids - placement and routing, simulation, verification
General Terms
Algorithms
Keywords
Inductance extraction, interconnect, solenoidal basis, iterative methods, preconditioning
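
Generically, the computational pattern is a preconditioned Krylov solve (the dense stand-in system and Jacobi preconditioner are our toy; the paper's reduced systems are matrix-free, which is exactly what makes preconditioning hard):

```python
import numpy as np
from scipy.sparse.linalg import gmres, LinearOperator

rng = np.random.default_rng(0)
n = 200
A = np.eye(n) + 0.01 * rng.standard_normal((n, n))  # well-conditioned toy
b = rng.standard_normal(n)
# Jacobi preconditioner: divide by the diagonal (a Green's-function
# preconditioner would likewise act matrix-free, like the system itself)
d = np.diag(A).copy()
M = LinearOperator((n, n), matvec=lambda v: v / d)
x, info = gmres(A, b, M=M)
print("converged:", info == 0, "residual:", np.linalg.norm(A @ x - b))
```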

48.2 On the Efficacy of Simplified 2D On-Chip Inductance Models [p. 757]
Tao Lin, Michael W. Beattie, Lawrence T. Pileggi

Full three-dimensional (3D) inductance models of on-chip interconnect contain an extremely large number of forward coupling terms. It is therefore desirable to use a two-dimensional (2D) approximation in which forward couplings are not included. Unlike capacitive coupling, however, truncating mutual inductance terms can result in loss of accuracy and even instability. This paper investigates whether ignoring forward couplings is an acceptable choice for all good IC designs or if full 3D models are necessary in certain on-chip interconnect configurations. We show that the significance of the forward coupling inductance depends on various aspects of the design.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids - simulation, verification
General Terms
Theory, algorithms
Keywords
On-chip inductance, PEEC, sparsified model

48.3 A Physical Model for the Transient Response of Capacitively Loaded Distributed rlc Interconnects [p. 763]
Raguraman Venkatesan, Jeffrey A. Davis, James D. Meindl

Rapid approximation of the transient response of high-speed global interconnects is needed to estimate the time delay, crosstalk, and overshoot in a GSI multilevel wiring network. This paper outlines a rigorous and physical solution for transients in a capacitively-loaded distributed rlc interconnect using a convergent series of modified Bessel functions. Compact models for time delay and repeater insertion are derived. The single-line model is extended to evaluate crosstalk in two-coupled lines. These solutions are validated by HSPICE simulation, and have potential applications to rapid rlc timing analysis, global wire sizing and repeater insertion, signal integrity estimation, and reliability modeling (e.g. voltage overshoot and high current density concerns).
Categories and Subject Descriptors
B.4.3 [Input/Output and Data Communications]: Interconnections (Subsystems) - physical structures (e.g. backplanes, cables, chip carriers).
General Terms
Performance, Design, and Verification.
Keywords
Interconnects, distributed rlc lines, transient response, time delay, crosstalk, overshoot, repeaters.
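
The transient solution above is built from a convergent series of modified Bessel functions. The paper's series itself is not given in the abstract; as a stand-in, the sketch below shows the numerical pattern, adaptive truncation of such a series, using the classical identity exp(x) = I0(x) + 2*sum_{k>=1} Ik(x) as a check.

    # Adaptive truncation of a modified-Bessel-function series, illustrating the
    # numerical machinery behind 48.3 (not the paper's actual rlc-line series).
    # Sanity check: exp(x) = I_0(x) + 2 * sum_{k>=1} I_k(x).
    import math
    from scipy.special import iv  # modified Bessel function of the first kind

    def bessel_series(x, tol=1e-12, kmax=200):
        total = iv(0, x)
        for k in range(1, kmax):
            term = 2.0 * iv(k, x)
            total += term
            if abs(term) < tol:   # terms decay rapidly, so few are needed
                break
        return total, k

    val, terms = bessel_series(2.5)
    print(val, "vs", math.exp(2.5), "using", terms, "terms")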

48.4 HSpeedEx: A High-Speed Extractor for Substrate Noise Analysis in Complex Mixed-Signal SOC [p. 767]
Adil Koukab, Catherine Dehollain, Michel Declercq

The unprecedented impact of noise coupling on Mixed-Signal System-On-a-Chip (MS-SOC) functionality brings a new set of challenges for Electronic Design Automation (EDA) tool developers. In this paper, we propose a new approach that combines a thorough physical understanding of noise coupling effects with an improved Boundary-Element-Method (BEM) to accelerate substrate model extraction and to avoid dense matrix storage. The low computational effort required, as well as the speed and accuracy reached, makes this method a highly promising alternative for verifying complex MS-SOCs.

48.5 Combined BEM/FEM Substrate Resistance Modeling [p. 771]
E. Schrik, N.P. van der Meijs

For present-day microelectronic designs, it is becoming ever more important to accurately model substrate coupling effects. Basically, either a Finite Element Method (FEM) or a Boundary Element Method (BEM) can be used. The FEM is the more versatile and flexible, whereas the BEM is faster but requires a stratified, layout-independent doping profile for the substrate. Thus, the BEM is unable to properly model the specific, layout-dependent doping patterns that are usually present in the top layers of the substrate, such as channel stop layers. This paper describes a way to incorporate these doping patterns into our substrate model by combining a BEM for the stratified doping profiles with a 2D FEM for the top-level, layout-dependent doping patterns, thereby achieving improved flexibility compared to the BEM and improved speed compared to the FEM. The method has been implemented in the SPACE layout-to-circuit extractor and has been successfully verified against two other tools.
Categories and Subject Descriptors
I.6.5 [Simulation and Modeling]: Model Development - Modeling Methodologies; B.7.2 [Integrated Circuits]: Design Aids - Verification; I.6.4 [Simulation and Modeling]: Model Validation and Analysis
General Terms
Verification, Design
Keywords
substrate noise, modeling, finite element method, boundary element method


Session 49: Development of Processors and Communication Networks for Embedded Systems

Chair: Jan Rabaey
Organizers: Grant E. Martin, Majid Sarrafzadeh
49.1 System Design Methodologies for a Wireless Security Processing Platform [p. 777]
Srivaths Ravi, Anand Raghunathan, Nachiketh Potlapally, Murugan Sankaradass

Security protocols are critical to enabling the growth of a wide range of wireless data services and applications. However, they impose a high computational burden that is mismatched with the modest processing capabilities and battery resources available on wireless clients. Bridging the security processing gap, while retaining sufficient programmability in order to support a wide range of current and future security protocol standards, requires the use of novel system architectures and design methodologies. We present the system-level design methodology used to design a programmable security processor platform for next-generation wireless handsets. The platform architecture is based on (i) a configurable and extensible processor that is customized for efficient domain-specific processing, and (ii) layered software libraries implementing cryptographic algorithms that are optimized to the hardware platform. Our system-level design methodology enables the efficient co-design of optimal cryptographic algorithms and an optimized system architecture. It includes novel techniques for algorithmic exploration and tuning, performance characterization and macro-modeling of software libraries, and architecture refinement based on selection of instruction extensions to accelerate performance-critical, computation-intensive operations. We have designed a programmable security processor platform to support both public-key and private-key operations using the proposed methodology, and have evaluated its performance through extensive system simulations as well as hardware prototyping. Our experiments demonstrate large performance improvements (e.g., 31.0X for DES, 33.9X for 3DES, 17.4X for AES, and up to 66.4X for RSA) compared to well-optimized software implementations on a state-of-the-art embedded processor.
Categories and Subject Descriptors
C.0 [Computer Systems Organization]: General- System architectures; C.1.0 [Computer Systems Organization]: Processor architectures- General; C.2.0 [Computer Systems Organization]: Computer-Communication Networks- General, Security and protection; C.5.3 [Computer Systems Organization]: Computer System Implementation- Microcomputers, Portable devices; E.3 [Data]: Data encryption- DES, Public key cryptosystems
General Terms
Security, Performance, Design, Algorithms
Keywords
Security, Security processing, Encryption, Decryption, Wireless, Handset, Embedded system, Performance, DES, 3DES, AES, RSA, SSL, IPSec, Design methodology, Platform, System architecture

49.2 Constraint-Driven Communication Synthesis [p. 783]
Alessandro Pinto, Luca P. Carloni, Alberto L. Sangiovanni-Vincentelli

Constraint-driven Communication Synthesis enables the automatic design of the communication architecture of a complex system from a library of pre-defined Intellectual Property (IP) components. The key communication parameters that govern all the point-to-point interactions among system modules are captured as a set of arc constraints in the communication constraint graph. Similarly, the communication features offered by each of the components available in the IP communication library are captured as a set of feature resources together with their cost figures. Then, every communication architecture that can be built using the available components while satisfying all constraints is implicitly considered (as an implementation graph matching the constraint graph) to derive the optimum design solution with respect to the desired cost figure. The corresponding constrained optimization problem is efficiently solved by a novel algorithm that is presented here together with its rigorous theoretical foundations.
Categories and Subject Descriptors
J.6.1 [Computer Applications]: Computer-Aided Engineering - CAD.
General Terms
Algorithms.
Keywords
Communication Synthesis, Systems-on-Chip, Network Design.
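
A toy rendering of the data model in this abstract: arcs of the communication constraint graph carry requirements, library components offer features at a cost, and a minimum-cost feasible implementation is sought. The per-arc greedy choice below is a deliberate simplification (the paper optimizes globally, e.g. over shared channels), and all names and numbers are invented.

    # Toy constraint-graph selection in the spirit of 49.2: pick, per arc, the
    # cheapest library component whose features satisfy the arc constraints.
    from dataclasses import dataclass

    @dataclass
    class ArcConstraint:
        src: str
        dst: str
        min_bandwidth: float   # e.g. Mb/s
        max_latency: float     # e.g. cycles

    @dataclass
    class LinkComponent:
        name: str
        bandwidth: float
        latency: float
        cost: float

    library = [
        LinkComponent("serial_bus", 10.0, 8.0, 1.0),
        LinkComponent("shared_bus", 100.0, 4.0, 3.0),
        LinkComponent("p2p_link", 400.0, 1.0, 7.0),
    ]

    constraints = [
        ArcConstraint("cpu", "mem", 200.0, 2.0),
        ArcConstraint("cpu", "uart", 1.0, 10.0),
    ]

    total = 0.0
    for c in constraints:
        feasible = [l for l in library
                    if l.bandwidth >= c.min_bandwidth and l.latency <= c.max_latency]
        best = min(feasible, key=lambda l: l.cost)  # raises if no component fits
        total += best.cost
        print(f"{c.src}->{c.dst}: {best.name}")
    print("total cost:", total)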

49.3 Component-Based Design Approach for Multicore SoCs [p. 789]
W. Cesário, A. Baghdadi, L. Gauthier, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, A. A. Jerraya, M. Diaz-Nava

This paper presents a high-level component-based methodology and design environment for application-specific multicore SoC architectures. Component-based design provides primitives to build complex architectures from basic components; this bottom-up approach allows design architects to explore efficient custom solutions with the best performance. The system specifications are represented as a virtual architecture described in a SystemC-like model and annotated with a set of configuration parameters. Our component-based design environment provides automatic wrapper-generation tools able to synthesize the hardware interfaces, device drivers, and operating systems that implement a high-level interconnect API. This approach, applied to a VDSL system, shows a drastic reduction in design time without any significant efficiency loss in the final circuit.
Categories and Subject Descriptors
C.3 [SPECIAL-PURPOSE AND APPLICATION-BASED SYSTEMS]: Microprocessor/microcomputer applications, Real-time and embedded systems.
General Terms
Design, Experimentation, Standardization.
Keywords
Multicore System-on-Chip, Component-based design, HW/SW interfaces abstraction.

49.4 Traffic Analysis for On-chip Networks Design of Multimedia Applications [p. 795]
Girish Varatkar, Radu Marculescu

The objective of this paper is to introduce self-similarity as a fundamental property exhibited by the bursty traffic between on-chip modules in typical MPEG-2 video applications. Statistical tests performed on relevant traces extracted from common video clips establish unequivocally the existence of self-similarity in video traffic. Using a generic communication architecture, we also discuss the implications of our findings for on-chip buffer space allocation and present quantitative evaluations for typical video streams. We believe that our findings open up new directions of research with deep implications for some fundamental issues in on-chip network design for multimedia applications.
Categories and Subject Descriptors
C.4 [Performance of Systems]: Modeling techniques; B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids
General Terms
Performance, Design
Keywords
system-level design, on-chip networks, communication analysis, self-similarity, long-range dependence.
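
One standard statistical test of the kind this abstract relies on is rescaled-range (R/S) analysis, which estimates the Hurst parameter H of a trace; H well above 0.5 indicates self-similar, long-range-dependent traffic. The sketch below is a generic estimator run on a synthetic trace, not the authors' test suite.

    # Rescaled-range (R/S) estimate of the Hurst parameter, one standard
    # self-similarity test of the kind used in 49.4. The Poisson trace below
    # is a memoryless stand-in, so H should come out near 0.5.
    import numpy as np

    def hurst_rs(x, window_sizes=(16, 32, 64, 128, 256)):
        x = np.asarray(x, dtype=float)
        logs_n, logs_rs = [], []
        for n in window_sizes:
            rs_vals = []
            for start in range(0, len(x) - n + 1, n):
                w = x[start:start + n]
                dev = np.cumsum(w - w.mean())
                r = dev.max() - dev.min()      # range of cumulative deviations
                s = w.std()
                if s > 0:
                    rs_vals.append(r / s)
            logs_n.append(np.log(n))
            logs_rs.append(np.log(np.mean(rs_vals)))
        h, _ = np.polyfit(logs_n, logs_rs, 1)  # slope of log(R/S) vs log(n)
        return h

    trace = np.random.default_rng(1).poisson(5.0, 4096)
    print("estimated H:", hurst_rs(trace))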


Session 50: Moving Towards More Effective Validation

Chair: Magdy Abadir
Organizers: Carl Pixley, Masahiro Fujita
50.1 Deriving a Simulation Input Generator and a Coverage Metric From a Formal Specification [p. 801]
Kanna Shimizu, David L. Dill

This paper presents novel uses of functional interface specifications for verifying RTL designs. We demonstrate how a simulation environment, a correctness checker, and a functional coverage metric are all created automatically from a single specification. Additionally, the process exploits the structure of a specification written with simple style rules. The methodology was used to verify a large scale I/O design from the Stanford FLASH project.
Categories and Subject Descriptors
B.5.2 [RTL Implementation]: Design Aids
General Terms
Documentation, Performance, Design, Verification
Keywords
testbench, input generation, BDD minimization, coverage

50.2 Hole Analysis for Functional Coverage Data [p. 807]
Oded Lachish, Eitan Marcus, Shmuel Ur, Avi Ziv

One of the main goals of coverage tools is to provide the user with an informative presentation of coverage information. Specifically, information on large, cohesive sets of uncovered tasks with common properties is very useful. This paper describes methods for discovering and reporting large uncovered spaces (holes) for cross-product functional coverage models. Hole analysis is a presentation method for coverage data that is both succinct and informative. Using case studies, we show how hole analysis was used to detect large uncovered spaces and improve the quality of verification.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - Verification; D.2.4 [Software Engineering]: Software/Program Verification; D.2.5 [Software Engineering]: Testing and Debugging - Testing tools
General Terms
Verification, Measurement, Algorithms, Experimentation
Keywords
Functional verification, Coverage analysis
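
A minimal sketch of the idea: in a cross-product coverage model, a single uncovered attribute value subsumes every task containing it, so reporting it as one hole is both succinct and informative. The model, attribute names, and covered set below are invented for illustration.

    # Toy hole analysis over a cross-product coverage model (names and domains
    # invented): an attribute value that appears in no covered task subsumes
    # every task containing it, so it is reported as a single aggregated hole.
    import math

    model = {"cmd": ["read", "write", "flush"],
             "size": ["1", "4", "8"],
             "resp": ["ok", "retry"]}

    covered = {("read", "1", "ok"), ("read", "4", "ok"),
               ("write", "1", "ok"), ("write", "1", "retry")}

    attrs = list(model)
    for i, a in enumerate(attrs):
        for v in model[a]:
            if not any(t[i] == v for t in covered):
                size = math.prod(len(model[b]) for b in attrs if b != a)
                print(f"hole: {a}={v} subsumes {size} uncovered tasks")
    # Larger holes (pairs of values, etc.) can be found the same way over
    # itertools.combinations of attribute indices.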

50.3 Effective Safety Property Checking Using Simulation-Based Sequential ATPG [p. 813]
Shou Sheng, Koichiro Takayama, Michael S. Hsiao

In this paper, we present a successful application of simulation-based sequential Automatic Test Pattern Generation (ATPG) to safety property verification, targeting the safety properties of large, industrial-strength hardware designs for which current formal methods fail. Several techniques are developed to increase the effectiveness and efficiency of the test generator's state exploration and justification for verification, including (1) incorporation of a small combinational ATPG engine, (2) reset signal masking, (3) threshold-value simulation, and (4) weighted Hamming distance. Experimental results on both ISCAS89 benchmark circuits and real industry circuits show that this simulation-based verifier achieves results better than or comparable to the current state-of-the-art formal verification tools BINGO and CHAFF.
Categories and Subject Descriptors
B.7.2 [INTEGRATED CIRCUITS]: Design Aids, Verification, Simulation; B.8.1 [PERFORMANCE AND RELIABILITY]: Testing
General Terms
Algorithms, Verification
Keywords
Verification, Simulation-Based, Sequential ATPG
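
Of the four techniques listed, the weighted Hamming distance is the easiest to sketch: candidate next states are ranked by a per-flip-flop-weighted distance to the target state. The weights below are placeholders; the paper derives its guidance from the circuit itself, which is not reproduced.

    # Weighted-Hamming-distance guidance as named in 50.3: rank candidate next
    # states by their weighted distance to the target (monitor-activating)
    # state. Weights here are invented stand-ins.
    def weighted_hamming(state, target, weights):
        return sum(w for s, t, w in zip(state, target, weights) if s != t)

    target  = (1, 0, 1, 1)
    weights = (4.0, 1.0, 2.0, 1.0)   # e.g. harder-to-set flip-flops weigh more

    candidates = [(1, 0, 0, 1), (0, 0, 1, 1), (1, 1, 1, 0)]
    best = min(candidates, key=lambda s: weighted_hamming(s, target, weights))
    print(best, weighted_hamming(best, target, weights))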

50.4 A Comparison of Three Verification Techniques: Directed Testing, Pseudo-Random Testing and Property Checking [p. 819]
Mike G. Bartley, Darren Galpin, Tim Blackmore

This paper describes the verification of two versions of a bridge between two on-chip buses. The verification was performed just as the Infineon Technologies Design Centre in Bristol was introducing pseudo-random testing (using Specman) and property checking (using GateProp) into their verification flows and thus provides a good opportunity to compare these two techniques with the existing strategy of directed testing using VHDL bus functional models.
Categories and Subject Descriptors
B.5.2 [Register-Transfer-Level Implementation]: Design Aids - Verification.
General Terms
Verification.


Session 51: Special Session: Energy Efficient Mobile Computing

Chair: Enrico Macii
Organizer: Mani Srivastava
51.1 Energy-Efficient Communication Protocols [p. 824]
Carla F. Chiasserini, Pavan Nuggehalli, Vikram Srinivasan, Ramesh R. Rao

Wireless networking has become enormously popular, and significant advances have been made in wireless technology. However, the energy efficiency of radio communication systems is still a critical issue due to the limited battery capacity of portable devices. In this paper, we deal with the charge recovery effect that takes place in electrochemical cells and show how we can take advantage of this mechanism to increase the energy delivered by a battery. Then, we present energy-aware traffic shaping techniques, as well as scheduling and routing algorithms, which exploit the battery recovery effect.
Categories and Subject Descriptors
C.2.2 [Computer-Communication Networks]: Network Protocols
General Terms
Algorithms
Keywords
Wireless networks, energy efficiency, battery charge recovery
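
A toy discrete-time battery with charge recovery makes the mechanism concrete: each transmitted packet drains one charge unit, and each idle slot may recover one unit with a probability that shrinks as the cell empties. The recovery rule below is an invented stand-in for the paper's electrochemical model, but it reproduces the qualitative effect that shaped traffic outlasts bursty traffic.

    # Toy stochastic battery with charge recovery, in the spirit of 51.1 (the
    # paper's analytical cell model is not reproduced). Interleaving idle slots
    # (traffic shaping) lets the same cell deliver more packets.
    import random

    def packets_delivered(schedule, capacity=100, rng=random.Random(7)):
        charge, sent = capacity, 0
        for slot in schedule:
            if slot == "tx":
                if charge == 0:
                    break
                charge -= 1
                sent += 1
            elif charge < capacity and rng.random() < charge / capacity:
                charge += 1   # recovery is likelier when the cell is fuller
        return sent

    bursty = ["tx"] * 150 + ["idle"] * 150
    shaped = ["tx", "idle"] * 150
    print(packets_delivered(bursty), packets_delivered(shaped))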

51.2 Reliable and Energy-Efficient Digital Signal Processing [p. 830]
Naresh Shanbhag

This paper provides an overview of algorithmic noise-tolerance (ANT) for designing reliable and energy-efficient digital signal processing systems. Techniques such as prediction-based, error-cancellation-based, and reduced-precision-redundancy-based ANT are discussed. Average energy savings range from 67% to 71% over conventional systems. Fluid IP core generators are proposed as a means of encapsulating the benefits of an ANT-based low-power design methodology. CAD issues arising in such a methodology are also discussed.
Categories and Subject Descriptors
B.7 [Hardware]: Integrated Circuits
General Terms
Design
Keywords
reliability, energy-efficiency, low-power, deep submicron, noise, noise-tolerance, broadband, communications

51.3 CMOS: A Paradigm for Low Power Wireless? [p. 836]
Michiel Steyaert, Peter Vancorenland

An overview and comparison of different topologies for wireless architectures is presented, with the main focus on power consumption and on possibilities for integration and reduction of external components. Architectures with a reduced number of building blocks (both internal and external) are presented, whose main benefits are low cost, both in the CMOS technology and in the power consumption.
Categories and Subject Descriptors
B.7 [Hardware]: Integrated Circuits
General Terms
Design
Keywords
CMOS, wireless receivers, low-power


Session 52: Floorplanning and Placement

Chair: Ralph Otten
Organizers: Ralph Otten, Steven Teig
52.1 TCG-S: Orthogonal Coupling of P*-admissible Representations for General Floorplans [p. 842]
Jai-Ming Lin, Yao-Wen Chang

We extend in this paper the concept of the P-admissible floorplan representation to that of the P*-admissible one. A P*-admissible representation can model the most general floorplans. Each of the currently existing P*-admissible representations, SP, BSG, and TCG, has its strengths as well as weaknesses. We show the equivalence of the two most promising P*-admissible representations, TCG and SP, and integrate TCG with a packing sequence (part of SP) into a new representation, called TCG-S. TCG-S combines the advantages of SP and TCG and at the same time eliminates their disadvantages. With the property of SP, faster packing and perturbation schemes are possible. Inheriting nice properties from TCG, the geometric relations among modules are transparent to TCG-S (implying faster convergence to a desired solution), placement with position constraints becomes much easier, and incremental update for cost evaluation can be realized. These nice properties make TCG-S a superior representation that exhibits an elegant solution structure to facilitate the search for a desired floorplan/placement. Extensive experiments show that TCG-S results in the best area utilization, wirelength optimization, convergence speed, and stability among existing works and is very flexible in handling placement with special constraints.
Categories and Subject Descriptors
J.6 [COMPUTER-AIDED ENGINEERING]: Computer-aided design (CAD)
General Terms
Algorithms

52.2 Floorplanning with Alignment and Performance Constraints [p. 848]
Xiaoping Tang, D. F. Wong

In this paper, we present a floorplanning algorithm based on the sequence pair representation. Our floorplanner has the following important features: 1) It is explicitly designed for fixed-frame floorplanning, which differs from the traditional, well-researched min-area floorplanning; moreover, we show that it can be adapted to minimize total area. 2) It addresses the problem of handling the alignment constraints that arise in bus structures. 3) It deals with performance constraints such as bounded net delay, while many existing floorplanners just minimize total wire length. 4) More importantly, even with all these constraints the algorithm is very fast: it evaluates the feasibility of a sequence pair and translates it to a floorplan in O(n log log n) time typically, where n is the number of blocks and the number of constrained blocks is O(n), which is significantly faster than the O(n³) method operating on constraint graphs. Our algorithm is based on computing the longest common subsequence of a pair of weighted sequences. Experimental results on the MCNC benchmarks for block placement show the promise of the method.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids - Placement and routing; J.6 [Computer Applications]: Computer-Aided Engineering - Computer-aided design
General Terms
Algorithms, Design, Performance, Theory
Keywords
longest common subsequence, floorplanning, sequence pair
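
The core evaluation step, packing a sequence pair via a weighted longest common subsequence, can be sketched with the plain O(n²) dynamic program below; the paper's contribution is doing the same computation in O(n log log n), which is not reproduced here. Block names and widths are illustrative.

    # Sequence-pair evaluation by weighted longest common subsequence: the
    # plain O(n^2) dynamic program (the paper's faster variant is omitted).
    def pack_x(seq1, seq2, width):
        """x-coordinate of each block: it sits to the right of every block
        that precedes it in BOTH sequences (the weighted-LCS condition)."""
        pos2 = {b: i for i, b in enumerate(seq2)}
        x = {}
        for i, b in enumerate(seq1):
            x[b] = max((x[a] + width[a] for a in seq1[:i]
                        if pos2[a] < pos2[b]), default=0)
        return x

    width = {"A": 4, "B": 2, "C": 3}
    seq1, seq2 = ["A", "B", "C"], ["B", "A", "C"]
    x = pack_x(seq1, seq2, width)
    print(x, "chip width:", max(x[b] + width[b] for b in width))
    # y-coordinates follow from the same routine on (reversed seq1, seq2)
    # with block heights in place of widths.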

52.3 Algorithms for Simultaneous Satisfaction of Multiple Constraints and Objective Optimization in a Placement Flow with Application to Congestion Control [p. 854]
Ke Zhong, Shantanu Dutt

This paper addresses the problem of tackling multiple constraints simultaneously during a partitioning-driven placement (PDP) process, where a large solution space is available for constraint-satisfying optimization compared to post-placement methods. A general methodology of multi-constraint satisfaction that balances violation correction and primary optimization is presented. A number of techniques are introduced to ensure its convergence and enhance its solution-search capability with intermediate relaxation. Application of our approach to congestion control, modeled as pin density and external net distribution balance constraints, shows that it effectively reduces overall congestion by 14.3% and improves chip area by 8.9%, with reasonable running time and only a 1.6% increase in wire length. As far as we know, this is the first time an approach to congestion reduction during placement optimization has produced good congestion improvement with a very small wire length increase.
Categories and Subject Descriptors:
J.6 [Computer-Aided Engineering]:
General Terms
Algorithms, Experimentation
Keywords
Multi-constraint satisfaction, partitioning-driven placement, intermediate relaxation, congestion reduction, connector generation, minimization of objective deterioration.


Session 53: Circuit Effects in Static Timing

Chair: Jamil Kawa
Organizers: Chandu Visweswariah, Louis Scheffer
53.1 Coping with Buffer Delay Change Due to Power and Ground Noise [p. 860]
Lauren Hui Chen, Malgorzata Marek-Sadowska, Forrest Brewer

Variations in power and ground levels affect VLSI circuit performance. Trends in device technology and in packaging have necessitated a revision of conventional delay models. In particular, simple scalable models are needed to predict delays in the presence of uncorrelated power and ground noise. In this paper, we analyze the effect of such noise on signal propagation through a buffer and present simple, closed-form formulas to estimate the corresponding change of delay. The model captures both positive (slowdown) and negative (speedup) delay changes. It is consistent with short-channel MOSFET behavior, including carrier velocity saturation effects. An application shows that repeater chains using buffers instead of inherently faster inverters tend to have superior supply-level-induced jitter characteristics.
Categories and Subject Descriptors:
J.6 [Computer-Aided Engineering]: Computer-aided design (CAD).
General Terms:
Algorithms.
Keywords:
Power and ground noise, differential mode noise, common mode noise, incremental delay change.

53.2 Osculating Thevenin Model for Predicting Delay and Slew of Capacitively Characterized Cells [p. 866]
Bernard N. Sheehan

To extrapolate from one point to another using a line, one had better get the slope right. In this paper we apply a similar concept to the important problem in Static Timing Analysis (STA) of predicting cell timing for RC loads using capacitive characterization data. Instead of a line we have a Thevenin circuit, and instead of matching slopes we match load sensitivities. We present a table driven, highly accurate cell delay and slew prediction procedure that can improve STA when interconnect effects dominate. It is significantly more accurate (especially slew) than previously published Ceff methods.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - simulation, verification.
General Terms
Algorithms, Verification.
Keywords
Static Timing Analysis, Effective Capacitance

53.3 Timed Pattern Generation for Noise-on-Delay Calculation [p. 870]
Seung Hoon Choi, Florentin Dartu, Kaushik Roy

Computing noise-on-delay effects is required for all circuits, from simple ASIC designs to microprocessors. Transistor-level simulation engines make accurate delay calculation affordable only if the number of simulations per stage is very small. We propose a solution that predicts the alignment of aggressor signals with respect to the victim signal that induces the worst-case noise effect on delay. The aggressor alignment can then be used to set up a detailed simulation. The worst-case delay in the presence of noise is predicted within 5% error for more than 99% of the cases tested, using an average of 1.27 simulations per stage transition.
Categories and Subject Descriptors
B.8.2 [Hardware]: Performance Analysis and Design Aids
General Terms
Performance, Verification

53.4 VeriCDF: A New Verification Methodology for Charged Device Failures [p. 874]
Jaesik Lee, Ki-Wook Kim, Sung-Mo Kang

A novel tool for full-chip verification of CDM-ESD protection is reported. Until recently, ESD protection has been simulated at the device level, leading to well-known limitations in capturing global features such as power protection circuits and package parasitics. In practice, fatal failures occur due to unexpected discharge paths in multi-power-supply chips, which can only be verified by chip-level simulation. Combined with a new macromodeling concept, the hierarchical approach provides an effective analysis methodology for mixed-signal chips, covering chip-level discharge paths and gate oxide reliability. Simulation results on a CMOS ASIC chip fabricated in a 0.25-µm technology are in accordance with the measurement data, and scanning electron microscopy located a gate oxide fault where our analysis predicted.
Categories and Subject Descriptors
B.8 [Hardware]: Performance and Reliability; C.4 [Computer Systems Organization]: Performance of Systems
General Terms
Verification
Keywords
Reliability, Modeling, Simulation


Session 54: Design Space Exploration for Embedded Systems

Chair: Henk Corporaal
Organizers: Jo Dale Carothers, Luca Benini
54.1 A Framework for Evaluating Design Tradeoffs in Packet Processing Architectures [p. 880]
Lothar Thiele, Samarjit Chakraborty, Matthias Gries, Simon Künzli

We present an analytical method to evaluate embedded network packet processor architectures, and to explore their design space. Our approach is in contrast to those based on simulation, which tend to be infeasible when the design space is very large. We illustrate the feasibility of our method using a detailed case study.
Categories and Subject Descriptors
C.0 [Computer Systems Organization]: General - Modeling of computer architecture, System architectures; B.4.1 [Hardware]: Input/output and data communications - Data communication devices
General Terms
Performance, Design

54.2 Energy Estimation and Optimization of Embedded VLIW Processors based on Instruction Clustering [p. 886]
A. Bona, M. Sami, D. Sciuto, C. Silvano, V. Zaccaria, R. Zafalon

The aim of this paper is to propose a methodology for the definition of an instruction-level energy estimation framework for VLIW (Very Long Instruction Word) processors. The power modeling methodology is the key to defining an effective energy-aware software optimization strategy for state-of-the-art ILP (Instruction Level Parallelism) processors. The methodology is based on an energy model for VLIW processors that exploits instruction clustering to achieve efficient and fine-grained energy estimation. The approach aims at reducing the complexity of the characterization problem for VLIW processors from exponential, with respect to the number of parallel operations in the same very long instruction, to quadratic, with respect to the number of instruction clusters. Furthermore, the paper proposes a spatial scheduling algorithm based on a low-power reordering of the parallel operations within the same long instruction. Experiments were carried out on the Lx processor, a 4-issue VLIW core jointly designed by HP Labs and STMicroelectronics. The results show an average error of 1.9% between the cluster-based estimation model and the reference design, with a standard deviation of 5.8%. For the Lx architecture, the spatial instruction scheduling algorithm provides an average energy saving of 12%.
Categories and Subject Descriptors
B.7.0 [Design Aids]: General; B.6.0 [Logic Design]: General
General Terms
Experimentation
Keywords
Power estimation, VLIW architectures
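
The quadratic scaling claimed above can be made concrete with a toy model: a very long instruction's energy is a base cost, plus a cost per occupied cluster, plus pairwise inter-cluster terms, so the characterization table grows with the square of the number of clusters rather than with the number of possible operation bundles. All coefficients below are invented for the example.

    # Illustrative quadratic cluster-based energy model in the spirit of 54.2.
    from itertools import combinations

    cluster_of = {"add": "alu", "sub": "alu", "mul": "mul", "ld": "mem"}
    e_cluster  = {"alu": 2.0, "mul": 9.0, "mem": 6.0}            # pJ, invented
    e_pair     = {("alu", "mul"): 1.5, ("alu", "mem"): 0.8,
                  ("mem", "mul"): 2.2, ("alu", "alu"): 0.3}

    def bundle_energy(ops, base=5.0):
        clusters = [cluster_of[o] for o in ops]
        e = base + sum(e_cluster[c] for c in clusters)
        e += sum(e_pair[tuple(sorted(p))] for p in combinations(clusters, 2))
        return e

    print(bundle_energy(["add", "mul", "ld"]))   # one 3-op long instruction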

54.3 Energy Exploration and Reduction of SDRAM Memory Systems [p. 892]
Yongsoo Joo, Yongseok Choi, Hojun Shim, Hyung Gyu Lee, Kwanho Kim, Naehyuck Chang

In this paper, we introduce a precise energy characterization of SDRAM main memory systems and explore the amount of energy associated with design parameters, leading to energy reduction techniques that we are able to recommend for practical use. We build an in-house energy simulator for SDRAM main memory systems based on cycle-accurate energy measurement and state-machine-based characterization, which independently characterizes dynamic and static energy. We explore the energy behavior of the memory systems by changing design parameters such as processor clock, memory clock, and cache configuration. Finally, we propose new energy reduction techniques for the address bus and practical mode control schemes for the SDRAM devices. We save 10.8 mJ (40.2%) and 12 mJ (14.5%) of the total energy for 24M instructions of an MP3 decoder and a JPEG compressor, respectively, using a typical 32-bit, 64 MB SDRAM memory system.
Categories and Subject Descriptors
B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids; C.4 [Performance of Systems]
General Terms
Design, Experimentation, Measurement, Performance
Keywords
low power, memory system, SDRAM
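
The abstract does not detail its address-bus techniques, so the sketch below shows a classic technique in the same spirit, bus-invert coding (a standard scheme, not necessarily the one proposed here): transmit the word inverted, flagged on one extra line, whenever that toggles fewer wires than transmitting it directly.

    # Classic bus-invert coding as a representative address-bus energy reducer.
    def raw_transitions(words):
        prev, t = 0, 0
        for d in words:
            t += bin(prev ^ d).count("1")
            prev = d
        return t

    def bus_invert_transitions(words, width=8):
        mask = (1 << width) - 1
        prev_bus, prev_inv, t = 0, 0, 0
        for d in words:
            plain = bin(prev_bus ^ d).count("1") + (prev_inv ^ 0)
            inverted = bin(prev_bus ^ (d ^ mask)).count("1") + (prev_inv ^ 1)
            if inverted < plain:              # fewer wire toggles if inverted
                prev_bus, prev_inv, t = d ^ mask, 1, t + inverted
            else:
                prev_bus, prev_inv, t = d, 0, t + plain
        return t

    words = [0x00, 0xFF, 0x0F, 0xF0]
    print(raw_transitions(words), "->", bus_invert_transitions(words))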


Session 55: Behavioral Synthesis

Chair: Petru Eles
Organizers: Ahmed A. Jerraya, Krzysztof Kuchcinski
55.1 Coordinated Transformations for High-Level Synthesis of High Performance Microprocessor Blocks [p. 898]
Sumit Gupta, Timothy Kam, Michael Kishinevsky, Shai Rotem, Nick Savoiu, Nikil Dutt, Rajesh Gupta, Alex Nicolau

High performance microprocessor designs are partially characterized by functional blocks consisting of a large number of operations that are packed into very few cycles (often single-cycle) with few or no resource constraints but tight bounds on the cycle time. Extreme parallelization and conditional and speculative execution of operations are essential to meet the processor performance goals. However, this is a tedious task for which classical high-level synthesis (HLS) formulations are inadequate and thus rarely used. In this paper, we present a new methodology for applying HLS to such microprocessor functional blocks that can potentially speed up design space exploration for microprocessor designs. Our methodology consists of a coordinated set of source-level and fine-grain parallelizing compiler transformations that target these behavioral descriptions, specifically the loop constructs in them, and enable efficient chaining of operations and high-level synthesis of the functional blocks. As a case study in understanding the complexity and challenges in the use of HLS, we walk the reader through the detailed design of an instruction length decoder drawn from the Pentium®-family of processors. The chief contribution of this paper is the formulation of a domain-specific methodology for applying high-level synthesis techniques to a domain that rarely, if ever, finds use for them.
Categories and Subject Descriptors
B.5.1 [Register-Transfer-Level Implementation]: Design Aids; C.5.3 [Computer System Implementation]: Microcomputers - Microprocessors
General Terms
Design
Keywords
High-level synthesis, microprocessor design

55.2 Forward-Looking Objective Functions: Concept and Applications in High Level Synthesis [p. 904]
Jennifer L. Wong, Seapahn Megerian, Miodrag Potkonjak

The effectiveness of traditional CAD optimization algorithms is proportional to the accuracy of the targeted objective functions. However, behavioral synthesis tools are not used in isolation; they form a strongly connected design flow where each tool optimizes its own objective function without considering the consequences on the optimization goals of the subsequently applied tools. Therefore, efforts to optimize one aspect of a design often have unforeseen negative impacts on other phases of the design process. Our objective is to establish a systematic way of developing and validating new types of objective functions that consider the effects on subsequently applied synthesis steps. We demonstrate the generic forward-looking objective function (FLOF) strategy on three main steps in behavioral synthesis: (i) Transformation, (ii) Scheduling, and (iii) Register Assignment. We show how the FLOF can be used in the first two phases to reduce the total number of registers required in the third phase.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - Optimization.
General Terms
Algorithms, Design.
Keywords
Objective Functions, Behavioral Synthesis, Transformations, Scheduling, Register Assignment

55.3 ILP-Based Engineering Change [p. 910]
Farinaz Koushanfar, Jennifer L. Wong, Jessica Feng, Miodrag Potkonjak

We have developed a generic integer linear programming (ILP)-based engineering change (EC) methodology. The EC methodology has three components: enabling, fast, and preserving. Enabling EC provides a user with the means to specify the amount of flexibility and how this flexibility should be distributed throughout the solution, so that one can guarantee that a specific set of EC demands can be satisfied while preserving the quality of the initially obtained solution. Fast EC conducts changes in a fraction of the time needed to solve the problem, while preserving or in some cases improving the quality of the initial solution. Preserving EC maintains either user-specified components of the solution or as much as possible of the initial solution, while still guaranteeing an optimal solution to the altered problem instance. We applied the generic methodology to the Boolean Satisfiability (SAT) problem. The effectiveness of all proposed approaches and algorithms is demonstrated on standard benchmarks.
Categories and Subject Descriptors
B.6 [Logic Design]: General; B.6.3 [Design Aids]: Software Engineering; B.7 [Integrated Circuits]: VLSI (very large scale integration)
General Terms
Automatic synthesis, optimization
Keywords
Engineering change, satisfiability (SAT), integer linear programming, synthesis
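
A toy rendering of the "preserving EC" idea using PuLP (an off-the-shelf ILP front end assumed here for illustration, not the authors' tooling): solve a small 0-1 problem, then, when a requirement changes, re-solve with extra constraints that pin chosen variables to their old values so most of the original solution survives.

    # Toy "preserving EC" in the spirit of 55.3, via the PuLP ILP front end.
    import pulp

    x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(4)]
    profit = [4, 3, 5, 2]

    def solve(extra=()):
        prob = pulp.LpProblem("ec_demo", pulp.LpMaximize)
        prob += pulp.lpDot(profit, x)
        prob += pulp.lpSum(x) <= 2          # original resource constraint
        for c in extra:
            prob += c
        prob.solve(pulp.PULP_CBC_CMD(msg=0))
        return [int(pulp.value(v)) for v in x]

    initial = solve()
    # Engineering change: x2 is no longer allowed; preserve x0's old value.
    changed = solve(extra=(x[2] == 0, x[0] == initial[0]))
    print(initial, changed)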