SIGDA, DAC 2004, Abstracts

DAC 2004 ABSTRACTS

Sessions: [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52] [53] [54] [55]

SESSION 1: CEO Panel: EDA - This is Serious Business [p. 1]

Chair: A. Richard Newton (University of California at Berkeley)
Organizers: Robert Dahlberg, Kurt Keutzer
Panelists: R. Bingham (Cadence Design Systems, Inc.), A. De Geus (Synopsys, Inc.), W. C. Rhines (Mentor Graphics Corp.)

Over the last 20 years, electronic design has grown to annual revenues of $3B per year and dominates the Technical and Systems Sector of the Software Industry. Cadence and Synopsys are among the top 20 independent software vendors when ranked by market capitalization. In short, the EDA industry has come of age, but the future of the industry is far from certain. The panel participants are the three CEOs from the three largest EDA companies: Ray Bingham, Cadence Design Systems, Inc.; Aart de Geus, Synopsys, Inc.; and Wally Rhines, Mentor Graphics Corp.. The panel will open with each of the CEOs sharing their outlook for the industry as a whole and their visions to foster growth of their respective companies. After the opening statements by the CEOs, Professor Richard Newton, University of California at Berkeley, will moderate a series of questions posed by a panel of six questioners�

SPECIAL SESSION 2: Hot Leakage

Chair: Massoud Pedram (University of Southern California)
Organizer: Lei He

2.1 Design Optimizations for Microprocessors at Low Temperature [p. 2]

A. Vassighi, A. Keshavarzi, S. Narendra, G. Schrom, Y. Ye (Circuits Research), S. Lee (Intel Labs., OR), G. Chrysler (Intel Labs., AZ), M. Sachdev (University of Waterloo), V. De (Circuits Research)

We investigate trade-offs in microprocessor frequency and system power achievable for low temperature operation in scaled high leakage technologies by combining refrigeration with supply voltage selection, body bias, transistor sizing and shorter channel length. Reducing channel length provides better frequency and power improvement than forward body bias. When, the leakage power is more than 30% of chip power, combining refrigeration with enhancing technology by shorter channel length provides the best trade-off for power and frequency.
KEYWORDS: Low temperature, Electrothermal modeling, microprocessor, power, frequency, CMOS, cooling, refrigeration

2.2 Leakage in Nano-Scale Technologies: Mechanisms, Impact and Design Considerations [p. 6]

A. Agarwal, C. H. Kim, S. Mukhopadhyay, K. Roy (Purdue University)

The high leakage current in nano-meter regimes is becoming a significant portion of power dissipation in CMOS circuits as threshold voltage, channel length, and gate oxide thickness are scaled. Consequently, the identification of different leakage components is very important for estimation and reduction of leakage. Moreover, the increasing statistical variation in the process parameters has led to significant variation in the transistor leakage current across and within different dies. Designing with the worst case leakage may cause excessive guard-banding, resulting in a lower performance. This paper explores various intrinsic leakage mechanisms including weak inversion, gateoxide tunneling and junction leakage etc. Various circuit level techniques to reduce leakage energy and their design trade-off are discussed. We also explore process variation compensating techniques to reduce delay and leakage spread, while meeting power constraint and yield.
Keywords: Leakage current, Circuit design, Process variation.

2.3 System Level Leakage Reduction Considering the Interdependence of Temperature and Leakage [p. 12]

L. He, W. Liao (University of California at Los Angeles), M. R. Stan (University of Virginia)

The high leakage devices in nanometer technologies as well as the low activity rates in system-on-a-chip (SOC) contribute to the growing significance of leakage power at the system level. We first present system-level leakage-power modeling and characteristics and discuss ways to reduce leakage for caches. Considering the interdependence between leakage power and temperature, we then discuss thermal runaway and dynamic power and thermal management (DPTM) to reduce power and prevent thermal violations. We show that a thermal-independent leakage model may hide actual failures of DPTM. Finally, we present voltage scaling considering DPTM for different packaging options. We show that the optimal Vdd for the best throughput may be smaller than the largest Vdd allowed by the given packaging platform, and that advanced cooling techniques can improve throughput significantly.
Keywords: Microarchitecture, Leakage power, Temperature

SESSION 3: Clock Routing and Buffering

Chair: John Lillis (University of Illinois)
Organizers: Dirk Stroobandt, Raymond Nijssen

3.1 Reducing Clock Skew Variability via Cross Links [p. 18]

A. Rajaram, J. Hu, R. Mahapatra (Texas A&M University),

Increasingly significant variational effects present a great challenge for delivering desired clock skew reliably. Nontree clock network has been recognized as a promising approach to overcome the variation problem. Existing non-tree clock routing methods are restricted to a few simple or regular structures, and often consume excessive amount of wirelength. In this paper, we suggest to construct a low cost nontree clock network by inserting cross links in a given clock tree. The effects of the link insertion on clock skew variability are analyzed. Based on the analysis, we propose two link insertion schemes that can quickly convert a clock tree to a non-tree with significantly lower skew variability and very limited wirelength increase. In these schemes, the complicated non-tree delay computation is circumvented. Further, they can be applied to the recently popular non-zero skew routing easily. Experimental results on benchmark circuits show that this approach can achieve significant skew variability reduction with less than 2% increase of wirelength.
Keywords: VLSI, physical design, variation, clock network synthesis

3.2 Fast and Flexible Buffer Trees that Navigate the Physical Layout Environment [p. 24]

C. J. Alpert (IBM Corporation), M. Hrkic (University of Illinois at Chicago), J. Hu (Texas A&M University), S. T. Quay (IBM Corporation)

Buffer insertion is an increasingly critical optimization for achieving timing closure, and the number of buffers required increases significantly with technology migration. It is imperative for an automated buffer insertion algorithm to be able to efficiently optimize tens of thousands of nets. One must also be able to effectively navigate the existing layout, including handling large blockages, blockages with holes specifically for buffers, specially allocated buffer blocks, placement porosity, and routing congestion. The algorithm must also be flexible enough to know when to use and when not to use expensive layout resources. Although several previous works have addressed buffer insertion in the presence of blockages, this is the first to present a complete solution that can manage the physical layout environment.
Keywords: Buffer insertion, physical synthesis, global routing.

3.3 Practical Repeater Insertion For Low Power: What Repeater Library Do We Need? [p. 30]

X. Liu, Y. Peng (North Carolina State University), M. C. Papaefthymiou (University of Michigan)

In this paper, we investigate the problem of repeater insertion for low power under a given timing budget. We propose a novel repeater insertion algorithm to compute the optimal repeater number and width in the discrete solution space, as defined by a given repeater library. Using our algorithm, we show that rounding the solution under the continuity assumption to the closest discrete solution candidate may result in suboptimal designs, or it may even fail to find an existing solution. Given a certain tolerance to the degradation of repeater power dissipation, we address two practical and highly important questions: (1) How coarse could the repeater size granularity be? (2) What range should the repeater size be in? Experimental results demonstrate the high effectiveness of the proposed scheme and provide valuable insights into repeater library design. Our approach achieves up to 23% power reduction in comparison to rounding-based approaches. With a 4% power degradation tolerance, repeater size granularity as coarse as 8 λ can be used, reducing the library size by more than 87%. For interconnects with various wire lengths and timing targets, our investigation reveals that the range of optimal repeater sizes for low-power is limited, indicating that a low-cost small-size repeater library, if well designed, is adequate to provide high quality repeater insertion solutions.
Keywords: Interconnect, Repeater Insertion, Low power

SESSION 4: Tools and Strategies for Dynamic Verification

Chair: Jacob Abraham (University of Texas)
Organizers: Adnan Aziz, Yaron Kashi

4.1 Industrial Experience with Test Generation Languages for Processor Verification [p. 36]

M. Behm, J. Ludden (IBM Development Center Austin), Y. Lichtenstein, M. Rimon, M. Vinov (IBM Research Laboratory, Haifa)

We report on our experience with a new test generation language for processor verification. The verification of two superscalar multiprocessors is described and we show the ease of expressing complex verification tasks. The cost and benefit are demonstrated: training takes up to six months; the simulation time required for a desired level of coverage has decreased by a factor of twenty; the number of escape bugs has been reduced.
Keywords: Functional verification, Processor verification, Test generation

4.2 Defining Coverage Views to Improve Functional Coverage Analysis [p. 41]

S. Asaf, E. Marcus, A. Ziv (IBM Research Laboratory, Haifa)

Coverage analysis is used to monitor the quality of the verification process. Reports provided by coverage tools help users identify areas in the design that have not been adequately tested. Because of their sheer size, the analysis of large coverage models can be an intimidating and time-consuming task. Practically, it can only be done by focusing on specific parts of the model. This paper presents a method for defining views onto the coverage data of cross-product functional coverage models. The proposed method allows users to focus on certain aspects of the coverage data to extract relevant, useful information, thereby improving the quality of the coverage analysis. A number of examples are provided that show how the proposed method improved the verification of actual designs.
Keywords: Functional verification, Coverage analysis

4.3 Systematic Functional Coverage Metric Synthesis from Hierarchical Temporal Event Relation Graph [p. 45]

Y.-S. Kwon, Y.-I. Kim, C.-M. Kyung (KAIST)

Functional coverage is a technique for checking the completeness of test vectors in HDL simulation. Temporal events are used to monitor the sequence of events in the specification. In this paper, automatic generation of temporal events for functional coverage is proposed. The HiTER is the graph where nodes represent basic temporal properties or subgraph and edges represent time-shift value between two nodes. Hierarchical temporal events are generated by traversing HiTER such that invalid, or irrelevant properties are eliminated. Concurrent edge groups make it possible to generate more comprehensive temporal properties and hierarchical structure makes it easy to describe large design by combining multiple subgraphs. Automatically generated temporal events describe almost all the possible temporal properties of the design under verification.
Keywords: Functional coverage, Temporal Event

4.4 Probabilistic Regression Suites for Functional Verification [p. 49]

S. Fine, S. Ur, A. Ziv (IBM Research Laboratory, Haifa)

Random test generators are often used to create regression suites on-the-fly. Regression suites are commonly generated by choosing several specifications and generating a number of tests from each one, without reasoning which specification should be used and how many tests should be generated from each specification. This paper describes a technique for building high quality random regression suites. The proposed technique uses information about the probability of each test specification covering each coverage task. This probability is used, in turn, to determine which test specifications should be included in the regression suite and how many tests should be generated from each specification. Experimental results show that this practical technique can be used to improve the quality, and reduce the cost, of regression suites. Moreover, it enables better informed decisions regarding the size and distribution of the regression suites, and the risk involved.
Keywords: Functional Verification, Coverage Analysis, Regression Suite

SESSION 5: Timing - Driven System Synthesis

Chair: Prashant Sawkar (Intel Corporation)
Organizers: Gila Kamhi, Krzysztof Kuchcinski

5.1 Modular Scheduling of Guarded Atomic Actions [p. 55]

D. L. Rosenband, Arvind (Massachusetts Institute of Technology)

A modular synthesis flow is essential for a scalable and hierarchical design methodology. This paper considers a particular modular flow where each module has interface methods and the internal behavior of the module is described in terms of a set of guarded atomic actions on the state elements of the module. A module can also read and update the state of other modules but only by invoking the interface methods of those modules. This paper extends the past work on hardware synthesis of a set of guarded atomic actions by Hoe and Arvind to modules of such actions. It presents an algorithm that, given the scheduling constraints on the interface methods of the called modules, derives the "glue logic" and the scheduling constraints for the interface methods of the calling module such that the atomicity of the guarded actions is preserved across module boundaries. Such modules provide reusable IP which facilitates "correctness by construction" design methodology. It also reduces compile-times dramatically in comparison to the compilation that flattens all the modules first.

5.2 Automatic Correct Scheduling of Control Flow Intensive Behavioral Descriptions in Formal Synthesis [p. 61]

K. Kapp, V. Sabelfeld (University of Karlsruhe)

Formal synthesis ensures the correctness of hardware synthesis by automatically deriving the circuit implementation by behavior preserving transformations within a theorem prover. In this paper, we present a new approach to formal synthesis that is able to handle control flow intensive descriptions. In particular, we consider here the scheduling phase, which is a central task in high-level synthesis. We describe a methodology employing a new control flow oriented representation which allows the fully automatic scheduling of control flow intensive descriptions in formal synthesis. To obtain scheduled circuits of high quality, the scheduling is guided by conventional scheduling algorithms.
Keywords: Formal Synthesis, Transformational Design, Scheduling

5.3 A Timing-Driven Module-Based Chip Design Flow [p. 67]

F. Mo, R. K. Brayton (University of California at Berkeley)

A Module-Based design flow for digital ICs with hard and soft modules is presented. Versions of the soft modules are implemented with different area/delay characteristics. The versions represent flexibility that can be used in the physical design to meet timing requirements. The flow aims at minimizing the clock cycle of the chip while providing quicker turn-around time. Unreliable wiring estimation is eliminated and costly iterations are reduced resulting in substantial reductions in run time as well as a significant decrease in the clock periods.
Keywords: Design flow, Physical synthesis, Timing constraints

5.4 Timing Closure through a Globally Synchronous, Timing Partitioned Design Methodology [p. 71]

A. Edman, C. Svensson (Link&ooul;ping University)

A method to mitigate timing problems due to global wire delays is proposed. The method follows closely a fully synchronous design flow and utilizes only true digital library elements. The design is partitioned into isochronous blocks at system level, where a few clock cycles latency is inserted between the isochronous blocks. This latency is then utilized to automatically mitigate unknown global wire delays, unknown global clock skews and other timing uncertainties occurring in backend design. The new method is expected to considerably reduce the timing closure effort in large high frequency digital designs in deep submicron technologies.
Keywords: Timing closure, wire delays, clock skew

SPECIAL SESSION 6: Reliable System-on-a-Chip Design in the Nanometer Era

Chair: Leon Stok (IBM Corporation)
Organizer: Naresh Shanbhag

6.1 Design and Reliability Challenges in Nanometer Technologies [p. 75]

S. Borkar, T. Karnik, V. De (Intel Laboratories)

CMOS technology scaling is causing the channel lengths to be sub-wavelength of light. Parameter variation, caused by subwavelength lithography, will pose a major challenge for design and reliability of future high performance microprocessors in nanometer technologies. In this paper, we present the impact of these variations on processor functionality, predictability and reliability. We propose design and CAD solutions for variation tolerance. We conclude this paper with soft error rate scaling trends and soft error tolerant circuits for reliability enhancement.
Keywords: Low-power, variation tolerance, leakage tolerance, reliability, soft errors.

6.2 A Communication-Theoretic Design Paradigm for Reliable SOCs [p. 76]

N. R. Shanbhag (University of Illinois at Urbana-Champaign)

Presented is a design paradigm, pioneered at the University of Illinois in 1997, for reliable and energy -efficient system-on-a-chip (SOC) in nanometer process technologies. These technologies are characterized by non-idealities such as coupling, leakage, soft errors, and process variations, which contribute to a reliability problem. Increasing complexity of systems-on-a-chip (SOC) leads to a related power problem. The proposed paradigm provides solutions to both problems by viewing SOCs as communication networks, and employs ideas from error-control coding, communications, and information theory in order to achieve the dual goals of reliability and energy-efficiency.
Keywords: Low-power, system-on-a-chip, reliability, communications, coding.

6.3 Reliable Communication in Systems on Chips [p. 77]

G. De Micheli (Stanford University)

KeywordsSoCs, VLSI, Networking

6.4 Designing Robust Microarchitectures [p. 78]

T. M. Austin (University of Michigan)

A fault-tolerant approach to microprocessor design, developed at the University of Michigan, is presented. Our approach is based on the use of in-situ checker components that validate the functional and electrical characteristics of complex microprocessor designs. Two design techniques are highlighted: a low-cost double-sampling latch design capable of eliminating power-hungry voltage margins, and a formally verifiable checker co-processor that validates all computation produced by a complex microprocessor core. By adopting a "better than worst-case" approach to system design, it is possible to address reliability and uncertainty concerns that arise during design, manufacturing and system operation.
Keywords: Low-power, system-on-a-chip, reliable microarchitecture design.

6.5 Hierarchical Application Aware Error Detection and Recovery [p. 79]

R. K. Iyer (University of Illinois at Urbana-Champaign)

Proposed is a four-tired approach to develop and integrate detection and recovery support at different levels of the system hierarchy. The proposed mechanisms exploit support provided by (i) embedded hardware, (ii) operating system, (iii) compiler, and (iv) application.
Keywords: Hierarchical error detection and recovery, embedded hardware, software implemented fault tolerance.

SESSION 7: Panel - When IC Yield Missed the Target, Who is at Fault? [p. 80]

Chair: Lucio Lanza (Lanza techVentures)
Organizer: Andrzej Strojwas
Panelists: M. Campbell (Qualcomm, Inc.), V. C. Gerousis (Infineon Tech.), J. Hogan (Telos Venture Partners), J. Kibarian (PDF Solutions, Inc.), M. Levitt (Cadence Design Systems, Inc.), W. Ng (Chartered Semiconductor Manufacturing, Inc.), D. Pramanik (Synopsys, Inc.), M. Templeton (Artisan Components, Inc.)

Silicon yield once was dominated by contaminants and particulates, making yield a process issue. But with today's electronics supply chain, multiple suspects may be indicted on manufacturability issues. Who is responsible for preventative actions in manufacturability and yield? The panelists, representing a foundry, a fabless company, an IP provider, two EDA vendors, and an IC design team, will discuss the problems and the solutions for achieving manufacturability and yield goals.

SESSION 8: Power Modeling and Optimization for Embedded Systems

Chair: Massoud Pedram (University of Southern California)
Organizers: Luca Benini, Trevor Mudge

8.1 Memory Access Scheduling and Binding Considering Energy Minimization in Multi-Bank Memory Systems [p. 81]

C.-G. Lyuh (Electronics and Telecommunications Research Institute), T. Kim (Seoul National University)

Memory-related activity is one of the major sources of energy consumption in embedded systems. Many types of memories used in embedded systems allow multiple operating modes (e.g., active, standby, nap, power-down) to facilitate energy saving. Furthermore, it has been known that the potential energy saving increases when the embedded systems use multiple memory banks in which their operating modes are controlled independently. In this paper, we propose (a compiler-directed) integrated approach to the problem of maximally utilizing the operating modes of multiple memory banks by solving the three important tasks simultaneously: (1) assignment of variables to memory banks, (2) scheduling of memory access operations, and (3) determination of operating modes of banks. Specifically, for an instance of tasks 1 and 2, we formulate task 3 as a shortest path(SP) problem in a network and solved it optimally. We then develop an SP-based heuristic that solves tasks 2 and 3 efficiently in an integrated fashion. We then extend the proposed approach to address the limited register constraint in processor. From experiments with a set of benchmark programs, we confirm that the proposed approach is able to reduce the energy consumption by 15.76% over that by the conventional greedy approach.
Keywords: low energy design, scheduling, binding

8.2 Profile-based Optimal Intra-task Voltage Scheduling for Hard Real-Time Applications [p. 87]

J. Seo (KAIST), T. Kim (Seoul National University), K.-S. Chung (Hanyang University)

This paper presents a set of comprehensive techniques for the intratask voltage scheduling problem to reduce energy consumption in hard real-time tasks of embedded systems. Based on the execution profile of the task, a voltage scheduling technique that optimally determines the operating voltages to individual basic blocks in the task is proposed. The obtained voltage schedule guarantees minimum average energy consumption. The proposed technique is then extended to solve practical issues regarding transition overheads, which are totally or partially ignored in the existing approaches. Finally, a technique involving a novel extension of our optimal scheduler is proposed to solve the scheduling problem in a discretely variable voltage environment. In summary, it is confirmed from experiments that the proposed optimal scheduling technique reduces energy consumption by 20.2% over that of one of the state-of-the-art schedulers [11] and, further, the extended technique in a discrete voltage environment reduces energy consumption by 45.3% on average.
Keywords: low energy design, DVS, intra-task voltage scheduling

8.3 Requirement-Based Design Methods for Adaptive Communications Links [p. 93]

J. A. Carballo, K. Nowka, S.-M. Yoo, I. Vo (IBM Austin Research Laboratory), C. Cranford, R. Norman (IBM Systems and Technology)

High-speed communications link cores must consume low-power, feature low bit-error-rates (BER), and address many applications. We present a methodology to design adaptive link architectures, whereby the link.s internal logic complexity, frequency, and supply are simultaneously adapted to application requirements. The requirement space is mapped to the design space using requirements measurement circuits and configurable logic blocks. CMOS results indicate that power savings of 60% versus the worst case are possible, while the area overhead is kept under 5%.
Keywords: Energy Efficient Design, Communication Architectures.

8.4 Automated Energy/Performance Macromodeling of Embedded Software [p. 99]

A. Muttreja (Princeton University), A. Raghunathan, S. Ravi (NEC Laboratories), N. K. Jha (Princeton University)

Efficient energy and performance estimation of embedded software is a critical part of any system-level design flow. Macromodeling based estimation is an attempt to speed up estimation by exploiting reuse that is inherent in the design process. Macromodeling involves precharacterizing reusable software components to construct high-level models, which express the execution time or energy consumption of a sub-program as a function of suitable parameters. During simulation, macromodels can be used instead of detailed hardware models, resulting in orders of magnitude simulation speedup. However, in order to realize this potential, significant challenges need to be overcome in both the generation and use of macromodels� including how to identify the parameters to be used in the macromodel, how to define the template function to which the macromodel is fitted, etc. This paper presents an automatic methodology to perform characterization-based high-level software macromodeling, which addresses the aforementioned issues. Given a sub-program to be macromodeled for execution time and/or energy consumption, the proposed methodology automates the steps of parameter identification, data collection through detailed simulation, macromodel template selection, and fitting. We propose a novel technique to identify potential macromodel parameters and perform data collection, which draws from the concept of data structure serialization used in distributed programming. We utilize symbolic regression techniques to concurrently filter out irrelevant macromodel parameters, construct a macromodel function, and derive the optimal coefficient values to minimize fitting error. Experiments with several realistic benchmarks suggest that the proposed methodology improves estimation accuracy and enables wide applicability of macromodeling to complex embedded software, while realizing its potential for estimation speedup. We describe a case study of how macromodeling can be used to rapidly explore algorithm-level energy tradeoffs, for the zlib data compression library.
Keywords: Data Serialization, Embedded Software, Genetic Programming, Macromodeling, Symbolic Regression

8.5 Coding for System-on-Chip Networks: A Unified Framework [p. 103]

S. R. Sridhara, N. R. Shanbhag (University of Illinois at Urbana-Champaign)

In this paper, we present a coding framework derived from a communication-theoretic view of a DSM bus to jointly address power, delay, and reliability. In this framework, the data is first passed through a nonlinear source coder that reduces self and coupling transition activity and imposes a constraint on the peak coupling transitions on the bus. Next, a linear error control coder adds redundancy to enable error detection and correction. The framework is employed to efficiently combine existing codes and to derive novel codes that span a wide range of trade-offs between bus delay, codec latency, power, area, and reliability. Simulation results, for a 1-cm 32- bit bus in a 0.18-&um;m CMOS technology, show that 31% reduction in energy and 62% reduction in energy-delay product are achievable.
Keywords: Bus coding, crosstalk avoidance, low-power, low-swing, error-correcting codes

SESSION 9: Performance Evaluation and Run Time Support

Chair: Joerg Henkel (University of Karlsruhe)
Organizers: Radu Marculescu, Rainer Leupers

9.1 Abstraction of Assembler Programs for Symbolic Worst Case Execution Time Analysis [p. 107]

T. Schuele, K. Schneider (University of Kaiserslautern)

Various techniques have been proposed to determine the worst case execution time of real-time systems. For most of these approaches, it is not necessary to capture the complete semantics of the system. Instead, it suffices to analyze an abstract model provided that it reflects the system's execution time correctly. To this end, we present an abstraction technique based on program slicing that can be used to simplify software systems at the level of assembler programs. The key idea is to determine a minimal set of instructions such that the control flow of the program is maintained. This abstraction is essential for reducing the runtime of the analysis algorithms, in particular, when symbolic methods are used to perform a complete state space exploration.
Keywords: Real�time systems, worst case execution time, abstraction, program slicing, assembler programs, symbolic simulation

9.2 Extending the Transaction Level Modeling Approach for Fast Communication Architecture Exploration [p. 113]

S. Pasricha, N. Dutt (University of California at Irvine), M. Ben-Romdhane (Conexant Systems Inc.)

System-on-Chip (SoC) designs are increasingly becoming more complex. Efficient on-chip communication architectures are critical for achieving desired performance in these systems. System designers typically use Bus Cycle Accurate (BCA) models written in high level languages such as C/C++ to explore the communication design space. These models capture all of the bus signals and strictly maintain cycle accuracy, which is useful for reliable performance exploration but results in slow simulation speeds for complex designs, even when they are modeled using high level languages. Recently there have been several efforts to use the Transaction Level Modeling (TLM) paradigm for improving simulation performance in BCA models. However these BCA models capture a lot of details that can be eliminated when exploring communication architectures. In this paper we extend the TLM approach and propose a new and faster transaction-based modeling abstraction level (CCATB) to explore the communication design space. Our abstraction level bridges the gap between the TLM and BCA levels, and yields an average performance speedup of 55% over BCA models. We demonstrate how fast and accurate exploration of tradeoffs is possible for high-performance shared bus architectures such as AMBA 2.0 and AMBA 3.0 (AXI) in industrial strength designs at the proposed abstraction level.
Keywords: Communication Architecture Exploration, Transaction Level Modeling, Bus Cycle Accurate Modeling, Shared Bus Architectures, AMBA

9.3 Specific Scheduling Support to Minimize the Reconfiguration Overhead of Dynamically Reconfigurable Hardware [p. 119]

J. Resano, D. Mozos (Universidad Complutense de Madrid), D. Verkest (IMEC vzw, Vrije Universiteit Brussel, Katholieke Universiteit Leuven), F. Catthoor (IMEC vzw, Vrije Universiteit Brussel), S. Vernalde (IMEC vzw)

Dynamically Reconfigurable Hardware (DRHW) platforms present both flexibility and high performance. Hence, they can tackle the demanding requirements of current dynamic multimedia applications, especially for embedded systems where it is not affordable to include specific HW support for all the applications. However, DRHW reconfiguration latency represents a major drawback that can make the use of DRHW resources inefficient for highly dynamic applications. To alleviate this problem, we have developed a set of techniques that provide specific support for DRHW devices and we have integrated them into an existing multiprocessor scheduling environment. In our experiments, with actual multimedia applications, we have reduced the original overhead due to the reconfiguration latency by at least 93%.
Keywords: Dynamic reconfigurable hardware, run-time scheduling.

9.4 LODS: Locality-Oriented Dynamic Scheduling for On-Chip Multiprocessors [p. 125]

M. Kandemir (The Pennsylvania State University)

Current multiprocessor SoC applications like network protocol codes, multimedia processing and base-band telecom circuits have tight time-to-market and performance constraints, which require an efficient design cycle. Consequently, automated techniques such as those oriented towards exploiting data locality are critical. In this paper, we demonstrate that existing loop scheduling techniques provide performance improvements even on on-chip multiprocessors, but they fall short of generating the best results since they do not take data locality into account as an explicit optimization parameter. Based on this observation, we propose a data locality-oriented loop-scheduling algorithm. The idea is to assign loop iterations to processors in such a fashion that each processor makes maximum reuse of the data it accesses.
Keywords: Data Locality, Parallelization, Loop Scheduling, Embedded System.

9.5 An Area Estimation Methodology for FPGA Based Designs at SystemC-Level [p. 129]

C. Brandolese, W. Fornaciari, F. Salice (Politecnico di Milano - DEI)

This paper presents a parametric area estimation methodology at SystemC level for FPGA-based designs. The approach is conceived to reduce the effort to adapt the area estimators to the evolutions of the EDA design environments. It consists in identifying the subset of measures that can be derived form the system level description and that are also relevant at VHDL-RT level. Estimators' parameters are then automatically derived from a set of benchmarks.
Keywords: FPGAs, SystemC, Area metrics.

SESSION 10: Advances in Analog Circuit and Layout Synthesis

Chair: Koen Lampaert (Mindspeed Technologies Inc.)
Organizers: Georges G. Gielen, Koen Lampaert

10.1 Automated Design of Operational Transconductance Amplifiers using Reversed Geometric Programming [p. 133]

J. P. Vanderhaegen, R. W. Brodersen (University of California at Berkeley)

We present a method for designing operational amplifiers using reversed geometric programming, which is an extension of geometric programming that allows both convex and non-convex constraints. Adding a limited set of non-convex constraints can improve the accuracy of convex equation-based optimization, without compromising global optimality. These constraints allow increased accuracy for critical modeling equations, such as the relationship between gm and IDS . To demonstrate the design methodology, a folded-cascode amplifier is designed in a 0.18 μm technology for varying speed requirements and is compared with simulations and designs obtained from geometric programming.
Keywords: CMOS integrated circuits, operational transconductance amplifiers, reversed geometric programming

10.2 Correct-by-Construction Layout-Centric Retargeting of Large Analog Designs [p. 139]

S. Bhattacharya, N. Jangkrajarng, R. Hartono, C.-J. R. Shi (University of Washington)

Aggressive design cycles in the semiconductor industry demand a design-reuse principle for analog circuits. The strong impact of layout intricacies on analog circuit performance necessitates design reuse with special focus on layout aspects. This paper presents a computer-aided design tool and the methodology for a layout-centric reuse of large analog intellectual property blocks. From an existing layout representation, an analog circuit is retargeted to different processes and performances; the corresponding correct-by-construction layouts are generated automatically and have performances comparable to manually crafted layouts. The tool and the methodology are validated on large analog intellectual-property blocks. While manual re-design and re-layout is known to take weeks to months, our reuse tool-suite achieves comparable performance in hours.
Keywords: Analog Integrated Circuit Design, Analog Layout Automation, Layout Symmetry, Analog Synthesis and Optimization.

10.3 Fast and Accurate Parasitic Capacitance Models for Layout-Aware Synthesis of Analog Circuits [p. 145]

A. Agarwal, H. Sampath, V. Yelamanchili, R. Vemuri (University of Cincinnati)

Considering layout effects early in the analog design process is becoming increasingly important. We propose techniques for estimating parasitic capacitances based on look-up tables and multivariate linear interpolation. These models enable fast and accurate estimation of parasitic capacitances and are very suitable for use in a synthesis flow. A layout aware methodology for synthesis of analog CMOS circuits using these parasitic models is presented. Results indicate that the proposed synthesis system is fast as compared to a layout-inclusive synthesis approach.
Keywords: Analog Synthesis, Layout Aware, Parasitic Estimation

10.4 ORACLE: Optimization with Recourse of Analog Circuits Including Layout Extraction [p. 151]

Y. Xu, L. T. Pileggi (Carnegie Mellon University), S. P. Boyd (Stanford University)

Long design cycles due to the inability to predict silicon realities is a well-known problem that plagues analog/RF integrated circuit product development. As this problem worsens for technologies below 100nm, the high cost of design and multiple manufacturing spins causes fewer products to have the volume required to support full custom implementation. Design reuse and analog synthesis make analog/RF design more affordable; however, the increasing process variability and lack of modeling accuracy remains extremely challenging for nanoscale analog/RF design. We propose an analog/RF circuit design methodology ORACLE, which is a combination of reuse and shared-use by formulating the synthesis problem as an optimization with recourse problem. Using a two-stage geometric programming with recourse approach, ORACLE solves for both the globally optimal shared and application-specific variables. Concurrently, we demonstrate ORACLE for novel metal-mask configurable designs, where a range of applications share common underlying structure and application-specific customization is performed using the metal-mask layers. We also include the silicon validation of the metal-mask configurable designs.
Keywords: Optimization with recourse

10.5 A Synthesis Flow Toward Fast Parasitic Closure for Radio-Frequency Integrated Circuits [p. 155]

G. Zhang (Carnegie Mellon University), A. Dengi, R. A. Rohrer (Neolinear Inc.), R. A. Rutenbar, L. R. Carley (Carnegie Mellon University)

An electrical and physical synthesis flow for high-speed analog and radio-frequency circuits is presented in this paper. Novel techniques aiming at fast parasitic closure are employed throughout the flow. Parasitic corners generated based on the earlier placement statistics are included for circuit resizing to enable parasitic robust designs. A performance-driven placement with simultaneous fast incremental global routing is proposed to achieve accurate parasitic estimation. Device tuning is utilized during layout to compensate for layout induced performance degradations. This methodology allows sophisticated macromodels of performances versus device variables and parasitics to be used during layout synthesis to make it truly performance-driven. Experimental results of a 4GHz LNA and a mixer demonstrate fast parasitic closure with this methodology.
Keywords: Synthesis, Sizing, Layout, Parasitic, Modeling, Radio Frequency

SESSION 11: Power Grid Design and Analysis Techniques

Chair: Eli Chiprout (Intel Corporation)
Organizers: Abhijit Dharchoudhury, Lei He

11.1 Buffer Sizing for Clock Power Minimization Subject to General Skew Constraints [p. 159]

K. Wang, M. Marek-Sadowska (University of California at Santa Barbara)

In this paper, we investigate the problem of buffer sizing for clock power minimization subject to general skew constraints. A novel approach based on sequential linear programming is presented. By taking the first-order Taylor's expansion of clock path delay with respect to buffer widths, the original nonlinear problem is transformed to a sequence of linear programs, which incorporate clock skew scheduling and buffer sizing to minimize clock power dissipation. For each linear program, the sensitivities of clock path delay with respect to buffer widths are efficiently updated by applying time-domain analysis to the clock network in a divide-and-conquer fashion. Our approach can take process variations and power supply noise into account. We demonstrate experimentally that the proposed technique is not only capable of effectively reducing clock power consumption, but also able to provide more accurate delay and skew results compared to the traditional approach.
Keywords: Clock skew scheduling, sizing, sequential linear programming

11.2 Optimal Placement of Power Supply Pads and Pins [p. 165]

M. Zhao, Y. Fu, V. Zolotov, S. Sundareswaran, R. Panda (Motorola, Inc.)

Power delivery networks of VLSI chips require adequate input supply connections to ensure reliable performance. This paper addresses the problem of finding an optimum set of pads, pins, and on-chip voltage regulators, and their placement in a given power supply network, subject to constraints on the voltage drops in the network and maximum currents through the pads, pins and regulators. The problem is modeled as a mixed integer linear program using macromodeling techniques and several heuristic techniques are proposed to make the problem tractable. The effectiveness of the proposed techniques is demonstrated on several real chips and memories used in low-power and high-performance applications.
Keywords: Pad optimization, pad placement

11.3 A Stochastic Approach to Power Grid Analysis [p. 171]

S. Pant, D. Blaauw (University of Michigan), V. Zolotov, S. Sundareswaran, R. Panda (Motorola, Inc.)

Power supply integrity analysis is critical in modern high performance designs. In this paper, we propose a stochastic approach to obtain statistical information about the collective IR and LdI/dt drop in a power supply network. The currents drawn from the power grid by the blocks in a design are modelled as stochastic processes and their statistical information is extracted, including correlation information between blocks in both space and time. We then propose a method to propagate the statistical parameters of the block currents through the linear model of the power grid to obtain the mean and standard deviation of the voltage drops at any node in the grid. We show that the run time is linear with the length of the current waveforms allowing for extensive vectors, up to millions of cycles, to be analyzed. We implemented the approach on a number of grids, including a grid from an industrial microprocessor and demonstrate its accuracy and efficiency. The proposed statistical analysis can be use to determine which portions of the grid are most likely to fail as well as to provide information for other analyses, such as statistical timing analysis.
Keywords: IR drop, Ldi/dt, power supply networks.

11.4 Efficient Power/Ground Network Analysis for Power Integrity-Driven Design Methodology [p. 177]

S.-W. Wu (Elan Microelectronics Corporation), Y.-W. Chang (National Taiwan University)

As technology advances, the metal width is decreasing with the length increasing, making the resistance along the power line increase substantially. Together with the nonlinear scaling of the threshold voltage that makes the ratio of the threshold voltage to the supply voltage rise, the voltage (IR) drop become a serious problem in modern VLSI design. Traditional power/ground (P/G) network analysis methods are typically very computationally expensive and thus not feasible to be integrated into floorplanning. To make the integration of the P/G analysis with floorplanning feasible, we need a very efficient, yet sufficiently accurate analysis method. In this paper, we present the methods for the fast analysis of the P/G networks at the floorplanning stage and integrate our analyzer into a commercial tool to develop a power integrity (IR drop) driven design methodology. Experimental results based on three real-world circuit designs show that our P/G network analyzer is accurate enough and very efficient. In particular, with our floorplan-based P/G network analyzer, the power integrity-driven design flow successfully fixed the IRdrop errors earlier at the floorplanning stage and thus enabled the single-pass design methodology.
Keywords: Floorplanning, Power/Ground Network

11.5 Reliability-Driven Layout Decompaction for Electromigration Failure Avoidance in Complex Mixed-Signal IC Designs [p. 181]

G. Jerke (Robert Bosch GmbH), J. Lienig (Dresden University of Technology), J. Scheible (Robert Bosch GmbH)

The negative effect of electromigration on signal and power line lifetime and functional reliability is an increasingly important problem for the physical design of integrated circuits. We present a new approach that addresses this electromigration issue by considering current density and inhomogeneous current-flow within arbitrarily shaped metallization patterns during physical design. Our proposed methodology is based on a post-route modification of critical layout structures that utilizes current-density data from a previously performed current-density verification. It is especially tailored to overcome the lack of current-flow consideration within existing routing tools. We also present experimental results obtained after successfully integrating our methodology into a commercial IC design flow.
Keywords: Physical design, electromigration, interconnect reliability, layout decomposition, compaction, decompaction

SESSION 12: Panel - What Happened to ASIC? Go (Recon)figure? [p. 185]

Chair: Kurt Keutzer (University of California at Berkeley)
Organizers: Nitin Deo, Behrozz Zahiri
Panelists: I. Bolsens (Xilinx, Inc.), J. Cong (Magma Design Automation, Inc.), B. Gupta (STMicroelectronics), P. Lopresti (NEC Electronics America), C. Reynolds (IBM Corporation), C. Rowen (Tensilica, Inc.), R. Simar (Texas Instruments, Inc.)

: Increasing design cost and risk is causing more and more designers to build configurable platforms that will amortize design and manufacturing costs across many generations. Several reconfiguration schemes exist and none of them seems to be a clear winner. Most notably there are three forms of technology: structured ASIC, configurable software programmable, reconfigurable FPGA-like platform. New solutions are emerging and may induce drastic changes in the current industry organization. The panel will examine factors in the success of the various options, and look to what the future might bring.

SESSION 13: Methods for A Priori Feasible Layout Generation

Chair: K. S. Leung (Magma Design Automation, Inc.)
Organizers: Charles J. Alpert, Dirk Stroobandt, Raymond Nijssen

13.1 Optical Proximity Correction (OPC)-Friendly Maze Routing [p. 186]

L.-D. Huang (Texas Instruments), M. D. F. Wong (University of Illinois at Urbana-Champaign)

As the technology migrates into the deep submicron manufacturing (DSM) era, the critical dimension of the circuits is getting smaller than the lithographic wavelength. The unavoidable light diffraction phenomena in the sub-wavelength technologies have become one of the major factors in the yield rate. Optical proximity correction (OPC) is one of the methods adopted to compensate for the light diffraction effect as a post layout process. However, the process is time-consuming and the results are still limited by the original layout quality. In this paper, we propose a maze routing method that considers the optical effect in the routing algorithm. By utilizing the symmetrical property of the optical system, the light diffraction is efficiently calculated and stored in tables. The costs that guide the router to minimize the optical interferences are obtained from these look-up tables. The problem is first formulated as a constrained maze routing problem, then it is shown to be a multiple constrained shortest path problem. Based on the Lagrangian relaxation method, an effective algorithm is designed to solve the problem.
Keywords: micro-lithography, VLSI, maze routing, optical system, manufacturing, OPC

13.2 Design Automation for Mask Programmable Fabrics [p. 192]

N. V. Shenoy, J. Kawa, R. Camposano (Synopsys, Inc.)

Programmable circuit design has played an important role in improving design productivity over the last few decades. By imposing structure on the design, efficient automation of synthesis, placement and routing is possible. We focus on a class of programmable circuits known as mask programmable circuits. In this paper, we describe key issues in design and tool methodology that need to be addressed in creating a programmable fabric. We construct an efficient design flow that can explore different logic and routing architectures. The main advantage of our work is that we tailor tools designed for standard cell design, that are readily available in the market, to work on a programmable fabric. Our flow requires some additional software capability. A special router that understands programmable routing constructs to complete connections is described. In addition, a tool that packs logic efficiently after synthesis is also presented.
Keywords: Integrated circuits, Mask programmable fabrics

13.3 On Designing Via-Configurable Cell Blocks for Regular Fabrics [p. 198]

Y. Ran, M. Marek-Sadowska (University of California at Santa Barbara)

In this paper we describe the design process of a via-configurable block for regular fabrics. The block consists of via-configurable functional cells, via-decomposable flip-flops, and via-configured sizable repeaters. The fabric has fixed layers up to M2. An M1-M2 via mask is used to define the block's functionality. The upper-level metals are customized. Compared to other structures based on LUTs or PLAs, and fixed flip-flops, our block has much smaller area, higher performance and lower power consumption.
Keywords: Regular fabric, via configurable, layout

13.4 Routing Architecture Exploration for Regular Fabrics [p. 204]

V. Kheterpal, A. J. Strojwas, L. Pileggi (Carnegie Mellon University)

In an effort to control the parameter variations and systematic yield problems that threaten the affordability of application-specific ICs, new forms of design regularity and structure have been proposed. For example, there has been speculation [6] that regular logic fabrics [1] based on regular geometry patterns [2] can offer tighter control of variations and greater control of systematic manufacturing failures. In this paper we describe a routing framework that accommodates arbitrary descriptions of regular and structured routing architectures. We further propose new regular routing architectures and explore the various performance vs. manufacturability trade-offs. Results demonstrate that a more regular, restricted routing architecture can provide a substantial advantage in terms of manufacturability and predictability while incurring a moderate performance penalty.
Keywords: BEOL, Regularity

13.5 Accurate Pre-layout Estimation of Standard Cell Characteristics [p. 208]

H. Yoshida, K. De, V. Boppana (Zenasis Technologies, Inc.)

With the advent of deep-submicron technologies, it has become essential to model the impact of physical/layout effects up front in all design flows [1]. The effect of layout parasitics is considerable even at the intra-cell level in standard cells. Hence, it has become critically important for any transistor-level optimization to consider the effect of these layout parasitics as an integral part of the optimization process. However, since it is not computationally feasible for the actual layout to be a part of any such optimization procedures, we propose a technique that estimates cell layout characteristics without actually performing the layout and subsequent extraction steps. We demonstrate in this work that it is indeed feasible to estimate the layout effects to get timing characteristics that are on average within about 1.5% of post-layout timing and that the technique is thousands of times faster than the actual creation of layout.
Keywords: Standard cell, cell characterization, transistor-level optimization

SESSION 14: Abstraction Techniques for Functional Verification

Chair: Vigyan Singhal (Jasper Design Automation, Inc.)
Organizers: Karem A. Sakallah, Rajeev Ranjan

14.1 An Efficient Finite-Domain Constraint Solver for Circuits [p. 212]

G. Parthasarathy, M. K. Iyer, K.-T. Cheng, L.-C. Wang (University of California at Santa Barbara)

This paper presents a novel hybrid finite-domain constraint solving engine for RTL circuits. We describe how DPLL search is modified for search in combined integer and Boolean domains by using efficient finite-domain constraint propagation. This enables efficient combination of Boolean SAT and linear integer arithmetic solving techniques. We automatically use control and data-path abstraction in RTL descriptions. We use conflict-based learning using the variables on the boundary of control and data-path for additional performance benefits. Finally, we analyze the hybrid constraint solver experimentally using some example circuits.
Keywords: Design Verification, Decision Procedures, Boolean Satisfiability, Integer Linear Programming, Bit-vector arithmetic

14.2 Automatic Abstraction and Verification of Verilog Models [p. 218]

Z. S. Andraus, K. A. Sakallah (University of Michigan)

Abstraction plays a critical role in verifying complex systems. A number of languages have been proposed to model hardware systems by, primarily, abstracting away their wide datapaths while keeping the low-level details of their control logic. This leads to a significant reduction in the size of the state space and makes it possible to verify intricate control interactions formally. These languages, however, require that the abstraction be done manually, a tedious and error-prone process. In this paper we describe Vapor, a tool that automatically abstracts behavioral RTL Verilog to the CLU language used by the UCLID system. Vapor performs a sound abstraction with emphasis on minimizing false errors. Our method is fast, systematic, and complements UCLID by serving as a back-end for dealing with UCLID counterexamples. Preliminary results show the feasibility of automatic abstraction and its utility in formal verification.
Keywords: Register Transfer Level (RTL), Verilog, Abstraction, Logic of Counter Arithmetic with Lambda Expressions and Uninterpreted Functions (CLU), UCLID.

14.3 Abstraction Refinement by Controllability and Cooperativeness Analysis [p. 224]

F. Y. C. Mang, P.-H. Ho (Synopsys, Inc.)

We present a new abstraction refinement algorithm to better refine the abstract model for formal property verification. In previous work, refinements are selected either based on a set of counter examples of the current abstract model, as in [5][6][7][8][9][19][20], or independent of any counter examples, as in [17]. We (1) introduce a new "controllability" analysis that is independent of any particular counter examples, (2) apply a new "cooperativeness" analysis that extracts information from a particular set of counter examples and (3) combine both to better refine the abstract model. We implemented the algorithm and applied it to verify several real world designs and properties. We compared the algorithm against the abstraction refinement algorithms in [19] and [20] and the interpolation-based reachability analysis in [14]. The experimental results indicate that the new algorithm outperforms the other three algorithms in terms of runtime, abstraction efficiency (as defined in [19]) and the number of proven properties.
Keywords: Formal Verification, Abstraction Refinement, Controllability, Cooperativeness.

14.4 Verifying A Gigabit Ethernet Switch Using SMV [p. 230]

Y. Lu, M. Jorda (Broadcom Corporation)

We use model checking techniques to verify a switching block in a new Gigabit Ethernet switch � BCM5690. Due to its dynamic nature, this block has been traditionally difficult to verify. Formal techniques are far more efficient than simulation for this particular design. Among 26 design errors discovered, 22 are found using formal methods. We then improve our model checking capability to analyze switch latency. We also use induction to avoid state explosion in the model checker.

14.5 A General Decomposition Strategy for Verifying Register Renaming [p. 234]

H. I. Shehata, M. D. Aagaard (University of Waterloo)

This paper describes a strategy for verifying data-hazard correctness of out-of-order processors that implement register-renaming. We define a set of predicates to characterize register-renaming techniques and provide a set of model-checking obligations that are sufficient to guarantee that a register-renaming technique satisfies data-hazard correctness. We demonstrate how two register renaming techniques (retirement-register-file and dual-RAT) instantiate our predicates, and present model checking results for the Dual-RAT technique, which is based on the Intel Pentium R 4 processor.
Keywords: Register renaming, Formal design verification, Pipelined circuits.

SESSION 15: Memory and Network Optimization in Embedded Designs

Chair: Faraydon Karim (STMicroelectronics)
Organizers: Adam Donlin, Grant E. Martin

15.1 An Integrated Hardware/Software Approach For Run-Time Scratchpad Management [p. 238]

P. Francesco (University of Bologna), P. Marchal (IMED vzw), D. Atienza (DACYA/UCM), L. Benini (University of Bologna), F. Catthoor (IMEC vzw), J. M. Mendias (DACYA/UCM)

An ever increasing number of dynamic interactive applications are implemented on portable consumer electronics. Designers depend largely on operating systems to map these applications on the architecture. However, today's embedded operating systems abstract away the precise architectural details of the platform. As a consequence, they cannot exploit the energy efficiency of scratchpad memories. We present in this paper a novel integrated hardware/software solution to support scratchpad memories at a high abstraction level. We exploit hardware support to alleviate the transfer cost from/to the scratchpad memory and at the same time provide a high-level programming interface for run-time scratchpad management. We demonstrate the effectiveness of our approach with a case-study.
Keywords: Scratchpad, DMA, Dynamic Allocation, AMBA AHB.

15.2 Multi-Profile Based Code Compression [p. 244]

E. Wanderley Netto, R. Azevedo, P. Centoducatte (UNICAMP), G. Araujo (Seoul National University, TIMA Laboratory)

Code compression has been shown to be an effective technique to reduce code size in memory constrained embedded systems. It has also been used as a way to increase cache hit ratio, thus reducing power consumption and improving performance. This paper proposes an approach to mix static/dynamic instruction profiling in dictionary construction, so as to best exploit trade-offs in compression ratio/performance. Compressed instructions are stored as variable-size indices into fixed-size codewords, eliminating compressed code misalignments. Experimental results, using the Leon (SPARCv8) processor and a program mix from MiBench and Mediabench, show that our approach halves the number of cache accesses and power consumption while produces compression ratios as low as 56%.
Keywords: Code compression, compression, code density.

15.3 An Efficient Scalable and Flexible Data Transfer Architecture for Multiprocessor SoC with Massive Distributed Memory [p. 250]

S.-I. Han (Seoul National University), A. Baghdadi (ENST Bretagne), M. Bonaciu (TIMA Laboratory), S.-I. Chae (Seoul National University), A. A. Jerraya (TIMA Laboratory)

Massive data transfer encountered in emerging multimedia embedded applications requires architecture allowing both highly distributed memory structure and multiprocessor computation to be handled. The key issue that needs to be solved is then how to manage data transfers between large numbers of distributed memories. To overcome this issue, our paper proposes a scalable Distributed Memory Server (DMS) for multiprocessor SoC (MPSoC). The proposed DMS is composed of: (1) high-performance and flexible memory service access points (MSAPs), which execute data transfers without intervention of the processing elements, (2) data network, and (3) control network. It can handle direct massive data transfer between the distributed memories of an MPSoC. The scalability and flexibility of the proposed DMS are illustrated through the implementation of an MPEG4 video encoder for QCIF and CIF formats. The experiments show clearly how DMS can be adapted to accommodate different SoC configurations requiring various data transfer bandwidths. Synthesis results show that bandwidth can scale up to 28.8 GB/sec.
Keywords: Multiprocessor SoC, Message passing, Data transfer architecture, Memory Server, Network on chip, Network Interface.

15.4 Operating-System Controlled Network on Chip [p. 256]

V. Nollet, T. Marescaux, D. Verkest, J.-Y. Mignolet, S. Vernalde (IMEC vzw)

Managing a Network-on-Chip (NoC) in an efficient way is a challenging task. To succeed, the operating system (OS) needs to be tuned to the capabilities and the needs of the NoC. Only by creating a tight interaction can we combine the necessary flexibility with the required efficiency. This paper illustrates such an interaction by detailing the management of communication resources in a system containing a packet-switched NoC and a closely integrated OS. Our NoC system is emulated by linking an FPGA to a PDA. We show that, with the right NoC support, the OS is able to optimize communication resource usage. Additionally, the OS is able to diminish or remove the interference between independent applications sharing a common NoC communication resource.
Keywords: Network on Chip, Operating System, MP-SoC

15.5 DyAD - Smart Routing for Networks-on-Chip [p. 260]

J. Hu, R. Marculescu (Carnegie Mellon University)

In this paper, we present and evaluate a novel routing scheme called DyAD which combines the advantages of both deterministic and adaptive routing schemes. More precisely, we envision a new routing technique which judiciously switches between deterministic and adaptive routing based on the network's congestion conditions. The simulation results show the effectiveness of DyAD by comparing it with purely deterministic and adaptive routing schemes under different traffic patterns. Moreover, a prototype router based on the DyAD idea has been designed and evaluated Compared to purely adaptive routers, the overhead of implementing DyAD is negligible (less than 7%), while the performance is consistently better.
Keywords: Networks-on-Chip, Systems-on-Chip, router design

SESSION: Business Day SESSION 100 Competitive Strategies for the Electronics Industry [p. 264]

Chair: Alan Naumann (CoWare, Inc.)
Organizer: Ellen M. Sentovich (Cadence Design Systems, Inc.)
Speakers: J. Ahuja (Cadence Design Systems, Inc.), P. Lippe (Silicon Image), Bernie Rosenthal (Tensilica, Inc.)

Competing in today's economy requires close examination of current business practices and careful strategies for moving into the future. This session will examine key areas for establishing competitive edge: globalization, patents and intellectual property management, and strategic marketing. The talks will cover a broad spectrum of approaches applicable to design, EDA, and IP-based companies. Conclusions will be drawn about which methods will be most successful for moving into the next era of electronics. The final portion of the session is devoted to discussion, debate, and Q&A.

SESSION: Business Day SESSION 150 Business Models in IP, Software Licensing, and Services [p. 264]

Chair: Lucio Lanza (Lanza techVentures, Inc.)
Organizer: Ellen M. Sentovich (Cadence Design Systems, Inc.)
Speakers: R. Camposano (Synopsys, Inc.), J. Douglas (ReShape, Inc.), A. Khan (Cadence Design Systems, Inc.)

A variety of business models in design and design automation are being employed today that have a dramatic effect on the success of individual companies and of multiple industries within the domain of electronics. The three talks in this session will study models for managing IP, software licensing, design and EDA services. A variety of approaches, constraints, and case studies will be presented in each talk, with conclusions about how to make these models most successful for all parties involved. The final portion of the session is devoted to discussion, debate, and Q&A.

SPECIAL SESSION 16: The Future of Timing Closure

Chair: Charles J. Alpert (IBM Corporation) Organizer: Charles J. Alpert

16.1 Timing Closure for Low-FO4 Microprocessor Design [p. 265]

D. S. Kung (IBM T.J. Watson Research Center)

In this paper, we discuss timing closure for high performance microprocessor designs. Aggressive cycle time and deep submicron technology scaling introduce a myriad of problems that are not present in the ASIC domain. The impact of these problems on floorplanning, placement, clocking and logic synthesis is described. We present ideas and potential solutions for tackling these problems.
Keywords: High Performance, FO4, Synthesis, Placement.

16.2 Forest vs. Trees: Where's the Slack? [p. 267]

P. Rodman (ReShape, Inc.)

16.3 Efficient Timing Closure Without Timing Driven Placement and Routing [p. 268]

M. Vujkovic, D. Wadkins (University of Washington), B. Swartz (InternetCAD.com, Inc.), C. Sechen (University of Washington)

We have developed a design flow from Verilog/VHDL to layout that mitigates the timing closure problem, while requiring no timing driven placement or routing tools. Timing issues are confined to the cell sizer, allowing the placement algorithm to focus solely on net lengths, resulting in superior layout densities and much lower power. The primary enablers to this new technology are: 1) gridded transistor sizing, 2) variable die routing that allows each net to be routed in the shortest possible length, 3) simultaneous cell placement, routing, gate sizing, and clock tree insertion, and 4) an effective incremental (ECO) placement that preserves net lengths from one iteration to the next. The variable die router is the key enabler. Superior placement and routing densities result from this new variable die approach. In addition, each net is routed at or near minimum length, and thus minor placement changes or cell size changes do not materially impact net lengths.
Keywords: Digital design flow, gate sizing, placement and routing, timing closure.

SESSION 17: Panel - What Works and What Doesn't [p. 274]

Chair: Robert Damiano (Synopsys, Inc.)
Organizer: Francine Bacchini (Thinkbold Corporate Communications)
Panelists: B. Bentley (Intel Corp.), K. Baty (WSFDB Consulting), K. Normoyle (Azul Systems, Inc.), M. Ishii (Sony Corp.), E. Yogev (Cisco Systems, Inc.)

Today's leading chip and system companies are faced with ever increasing design verification challenges; industry studies reveal that as much as 50% of the total schedule is being spent in verification. Large companies, with almost infinite resources, have shown that throwing CPU cycles and people at the simulation problem still doesn't guarantee a level of coverage desired by the design team. So, what is the answer? Are assertions, faster simulators, and testbench languages the "holy grail", or are those just micro-optimizations of a methodology that is fundamentally flawed? Is there hope for a verification methodology that completely covers a design with a predictable verification schedule? The panelists will describe their experiences of what works and what doesn't, providing insights into their methodologies and philosophies.

SESSION 18: Design Space Exploration and Scheduling for Embedded Software

Chair: Manmut Kandemir (Pennsylvania State University)
Organizers: Heinrich Meyr, Lothar Thiele, Mahmut Kandemir

18.1 Leakage Aware Dynamic Voltage Scaling for Real-Time Embedded Systems [p. 275]

R. Jejurikar (University of California at Irvine), C. Pereira, R. Gupta (University of California at San Diego)

A five-fold increase in leakage current is predicted with each technology generation. While DynamicVoltage Scaling (DVS) is known to reduce dynamic power consumption, it also causes increased leakage energy drain by lengthening the interval over which a computation is carried out. Therefore, for minimization of the total energy, one needs to determine an operating point, called the critical speed. We compute processor slowdown factors based on the critical speed for energy minimization. Procrastination scheduling attempts to maximize the duration of idle intervals by keeping the processor in a sleep/shutdown state even if there are pending tasks, within the constraints imposed by performance requirements. Our simulation experiments show that the critical speed slowdown results in up to 5% energy gains over a leakage oblivious dynamic voltage scaling. Procrastination scheduling scheme extends the sleep intervals to up to 5 times, resulting in up to an additional 18% energy gains, while meeting all timing requirements.
Keywords: leakage power, critical speed, low power scheduling, real-time systems, EDF scheduling, procrastication.

18.2 Retargetable Profiling for Rapid, Early System-Level Design Space Exploration [p. 281]

L. Cai, A. Gerstlauer, D. Gajski (University of California at Irvine)

Fast and accurate estimation is critical for exploration of any design space in general. As we move to higher levels of abstraction, estimation of complete system designs at each level of abstraction is needed. Estimation should provide a variety of useful metrics relevant to design tasks in different domains and at each stage in the design process. In this paper, we present such a system-level estimation approach based on a novel combination of dynamic profiling and static retargeting. Co-estimation of complete system implementations is fast while accurately reflecting even dynamic effects. Furthermore, retargetable profiling is supported at multiple levels of abstraction, providing multiple design quality metrics at each level. Experimental results show the applicability of the approach for efficient design space exploration.
Keywords: Profiling, Retargetable, System Level Design, Exploration.

18.3 High Level Cache Simulation for Heterogeneous Multiprocessors [p. 287]

J. J. Pieper (Carnegie Mellon University), A. Mellan (STMicroelectronics), J. M. Paul, D. E. Thomas (Carnegie Mellon University), F. Karim (STMicroelectronics)

As multiprocessor systems-on-chip become a reality, performance modeling becomes a challenge. To quickly evaluate many architectures, some type of high-level simulation is required, including high-level cache simulation. We propose to perform this cache simulation by defining a metric to represent memory behavior independently of cache structure and back annotate this into the original application. While the annotation phase is complex, requiring time comparable to normal address trace based simulation, it need only be performed once per application set and thus enables simulation to be sped up by a factor of 20 to 50 over trace based simulation. This is important for embedded systems, as software is often evaluated against many input sets and many architectures. Our results show the technique is accurate to within 20% of miss rate for uniprocessors and was able to reduce the die area of a multiprocessor chip by a projected 14% over a naive design by accurately sizing caches for each processor.

SESSION 19: Advances in Accelerated Simulation

Chair: Wolfgang Roesner (IBM Corporation)
Organizers: Avi Ziv, Yaron Kashi

19.1 Communication-Efficient Hardware Acceleration for Fast Functional Simulation [p. 293]

Y.-I. Kim, W. Yang, Y.-S. Kwon, C.-M. Kyung (Korea Advanced Institute of Science and Technology)

This paper presents new technology that accelerates system verification. Traditional methods for verifying functional designs are based on logic simulation, which becomes more time-consuming as design complexity increases. To accelerate functional simulation, hardware acceleration is used to offload calculation-intensive tasks from the software simulator. Hardware accelerated simulation dramatically reduces the simulation time. However, the communication overhead between the software simulator and hardware accelerator is becoming a new critical bottleneck. We reduce the communication overhead by exploiting burst data transfer and parallelism, which are obtained by splitting testbench and moving a part of testbench into hardware accelerator. Our experiments demonstrated that the proposed method reduces the communication overhead by a factor of about 40 compared to conventional hardware accelerated simulation while maintaining the cycle accuracy and compatibility with the original testbench.
Keywords: Functional verification, simulation acceleration, communication overhead

19.2 A Fast Hardware/Software Co-Verification Method for System-On-a-Chip by Using a C/C++ Simulator and FPGA Emulator with Shared Register Communication [p. 299]

Y. Nakamura, K. Hosokawa, I. Kuroda, K. Yoshikawa (NEC Corporation), T. Yoshimura (Waseda University)

This paper describes a new hardware/software co-verification method for System-On�a-Chip, based on the integration of a C/C++ simulator and an inexpensive FPGA emulator. Communication between the simulator and emulator occurs via a flexible interface based on shared communication registers. This method enables easy debugging, rich portability, and high verification speed, at a low cost. We describe the application of this environment to the verification of three different complex commercial SoCs, supporting concurrent hardware and embedded software development. In these projects, our verification methodology was used to perform complete system verification at 0.2-1.1 MHz, while supporting full graphical interface functions such as "waveform" or "signal dump" viewers, and debugging functions such as "step" or "break".
Keywords: Co-Verification, FPGA Emulation, C/C++ Simulator.

19.3 Circuit-Aware Architectural Simulation [p. 305]

S. Lee, S. Das, V. Bertacco, T. Austin, D. Blaauw, T. Mudge (The University of Michigan)

Architectural simulation has achieved a prominent role in the system design cycle by providing designers the ability to quickly examine a wide variety of design choices. How-ever, the recent trend in system design toward architectures that react to circuit-level phenomena has outstripped the capabilities of traditional cycle-based architectural simulators. In this paper, we present an architectural simulator design that incorporates a circuit modeling capability, permitting architectural-level simulations that react to circuit characteristics (such as latency, energy, or current draw) on a cycle-by-cycle basis. While these additional capabilities slow simulation speed, we show that the careful application of circuit simulation optimizations and simulation sampling techniques permit high levels of detail with sufficient speed to examine entire workloads.
Keywords: Architectural simulation, High-performance simulation, Circuit simulation

SESSION 20: Design for Manufacturing

Chair: Ruiqi Tian (Motorola, Inc.)
Organizers: Charlie Chung-Ping Chen, Martin D. F. Wong

20.1 Toward a Methodology for Manufacturability-Driven Design Rule Exploration [p. 311]

L. Capodieci (Advanced Micro Devices), P. Gupta, A. B. Kahng (University of California at San Diego), D. Sylvester, J. Yang (University of Michigan at Ann Arbor)

Resolution enhancement techniques (RET) such as optical proximity correction (OPC) and phase-shift mask (PSM) technology are deployed in modern processes to increase the fidelity of printed features, especially critical dimensions (CD) in polysilicon. Even given these exotic technologies, there has been momentum towards less flexibility in layout, in order to ensure printability. However, there has not been a systematic study of the performance and manufacturability impact of such a move towards restrictive design rules. In this paper we present a design flow that evaluates the application of various restricted design rule (RDR) sets in deep submicron ASIC designs in terms of circuit performance and parametric yield. Using such a framework, process and design engineers can identify potential solutions to maximize manufacturability by selectively applying RDRs while maintaining chip performance. In this work we focus attention on the device layer which is the most difficult design layer to manufacture. We quantify the performance, manufacturability and mask cost impact of several common design rules. For instance, we find that small increases in the minimum allowable poly line end extension beyond active provide high levels of immunity to lithographic defocus conditions. Also, modification of the minimum field poly to diffusion spacing can provide good manufacturability, while a single pitch single orientation design rule can reduce gate 3s uncertainty. Both of these improve in data volume as well, with little to no performance penalties. Reductions in data volume and worst-case edge placement error are on the order of 20- 30% and 30-50% respectively compared to a standard baseline design rule set.
Keywords: Process variation, Lithography, VLSI Manufacturability, OPC, RET, Yield

20.2 Phase Correct Routing for Alternating Phase Shift Masks [p. 317]

K. McCullen (IBM Microelectronics Division)

In this paper, we investigate routing restrictions that enable the generation of "correct by construction" layouts for Dark Field AltPSM. Existing routers produce designs in which coloring errors are numerous, and up to 15% of the pin locations in macros may cause unavoidable coloring errors. We identify routing restrictions which produce 100% phase-correct results, and quantify the impact on wire-lengths and via counts.
Keywords: Resolution Enhancement Techniques (RET), Lithography, Routing, Layout.

20.3 Toward a Systematic-Variation Aware Timing Methodology [p. 321]

P. Gupta (University of California at San Diego), F.-L. Heng (IBM T.J. Watson Research Center)

Variability of circuit performance is becoming a very important issue for ultra-deep sub-micron technology. Gate length variation has the most direct impact on circuit performance. Since many factors contribute to the variability of gate length, recent studies have modeled the variability using Gaussian distributions. In reality, the through-pitch and through-focus variations of gate length are systematic. In this paper, we propose a timing methodology which takes these systematic variations into account and we show that such an approach can reduce the timing uncertainty by up to 40%.
Keywords: Layout, Lithography, OPC, ACLV, Manufacturability

20.4 Selective Gate-Length Biasing for Cost-Effective Runtime Leakage Control [p. 327]

P. Gupta, A. B. Kahng, P. Sharma (University of California at San Diego), D. Sylvester (University of Michigan)

With process scaling, leakage power reduction has become one of the most important design concerns. Multi-threshold techniques have been used to reduce runtime leakage power without sacrificing performance. In this paper, we propose small biases of transistor gate-length to further minimize power in a manufacturable manner. Unlike multi-Vth techniques, gate-length biasing requires no additional masks and may be performed at any stage in the design process. Our results show that gate-length biasing effectively reduces leakage power by up to 25% with less than 4% delay penalty. We show the feasibility of the technique in terms of manufacturability and pin-compatibility for post-layout power optimization. We also show up to 54% reduction in leakage uncertainty due to inter-die process variation in circuits when biased gate-lengths, versus only unbiased one, are used. Circuits selectively biased show much less sensitivity to both intra and inter die variations.
Keywords: Layout, lithography, OPC, leakage, power, manufacturability

SESSION 21: Statistical Timing Analysis

Chair: Anirudh Devgan (IBM Corporation)
Organizers: Charlie Chung-Ping Chen, Sani R. Nassif

21.1 First-Order Incremental Block-Based Statistical Timing Analysis [p. 331]

C. Visweswariah (IBM T.J. Watson Research Center), K. Ravindran (University of California at Berkeley), K. Kalafala (IBM Microelectronics), S. G. Walker (IBM T.J. Watson Research Center), S. Narayan (IBM Microelectronics)

Variability in digital integrated circuits makes timing verification an extremely challenging task. In this paper, a canonical first order delay model is proposed that takes into account both correlated and independent randomness. A novel linear-time block-based statistical timing algorithm is employed to propagate timing quantities like arrival times and required arrival times through the timing graph in this canonical form. At the end of the statistical timing, the sensitivities of all timing quantities to each of the sources of variation are available. Excessive sensitivities can then be targeted by manual or automatic optimization methods to improve the robustness of the design. This paper also reports the first incremental statistical timer in the literature which is suitable for use in the inner loop of physical synthesis or other optimization programs. The third novel contribution of this paper is the computation of local and global criticality probabilities. For a very small cost in CPU time, the probability of each edge or node of the timing graph being critical is computed. Numerical results are presented on industrial ASIC chips with over two million logic gates.
Keywords: Statistical timing, incremental, variability

21.2 Fast Statistical Timing Analysis Handling Arbitrary Delay Correlations [p. 337]

M. Orshansky, A. Bandyopadhyay (The University of Texas at Austin),

An efficient statistical timing analysis algorithm that can handle arbitrary (spatial and structural) causes of delay correlation is described. The algorithm derives the entire cumulative distribution function of the circuit delay using a new mathematical formulation. Spatial as well as structural correlations between gate and wire delays can be taken into account. The algorithm can handle node delays described by non-Gaussian distributions. Because the analytical computation of an exact cumulative distribution function for a probabilistic graph with arbitrary distributions is infeasible, we find tight upper and lower bounds on the true cumulative distribution. An efficient algorithm to compute the bounds is based on a PERT-like single traversal of the sub-graph containing the set of N deterministically longest paths. The efficiency and accuracy of the algorithm is demonstrated on a set of ISCAS'85 benchmarks. Across all the benchmarks, the average rms error between the exact distribution and lower bound is 0.7%, and the average maximum error at 95th percentile is 0.6%. The computation of bounds for the largest benchmark takes 39 seconds.

21.3 STAC: Statistical Timing Analysis with Correlation [p. 343]

J. Le, X. Li, L. T. Pileggi (Carnegie Mellon University),

Current technology trends have led to the growing impact of both inter-die and intra-die process variations on circuit performance. While it is imperative to model parameter variations for sub- 100nm technologies to produce an upper bound prediction on timing, it is equally important to consider the correlation of these variations for the bound to be useful. In this paper we present an efficient block-based statistical static timing analysis algorithm that can account for correlations from process parameters and re-converging paths. The algorithm can also accommodate dominant interconnect coupling effects to provide an accurate compilation of statistical timing information. The generality and efficiency for the proposed algorithm is obtained from a novel simplification technique that is derived from the statistical independence theories and principal component analysis (PCA) methods. The technique significantly reduces the cost for mean, variance and covariance computation of a set of correlated random variables.
Keywords: Statistical timing, process variation

SESSION 22: Panel - System Level Design: Six Success Stories in Search of an Industry [p. 349]

Chair: Grant Martin (Tensilica, Inc.)
Organizer: Francine Bacchini (Thinkbold Corporate Communications)
Panelists: P. Paulin (STMicroelectronics), R. A. Bergamaschi (IBM), R. Pawate (Texas Instruments, India), A. Bernstein (Intel Corp., Israel), R. Chandra (Qualcomm), M. Ben-Romdhane (Conexant)

System-level design is being touted as the Holy Grail that the electronics industry has long sought, but most offers have been disappointing because they seldom deliver results. Many designers are fed up with the "Blah, Blah" on system-level design as they are waiting for design facts. Why? It seems that major breakthroughs are happening thanks to the adoption of standard directions for modeling design at higher than RTL level. These models are called TLM and new languages are being adopted (SystemC, SystemVerilog). The emergence of new standards may reshape completely the way design industry is organized. This panel will bring six speakers relating their success stories about design starting at the system-level. The format is an educational Panel aimed at informing DAC attendees of the challenges (difficulties and pitfalls) and opportunities (sizable benefits and lessons learned from these experiences).

SESSION 23: New Ideas in Placement

Chair: Patrick H. Madden (University of Kitakyushu)
Organizers: Carl Sechen, Igor L. Markov

23.1 Large-Scale Placement by Grid-Warping [p. 351]

Z. Xiu, J. D. Ma (Carnegie Mellon University), S. M. Fowler (Intel Corp.), R. A. Rutenbar (Carnegie Mellon University)

Grid-warping is a new placement algorithm based on a strikingly simple idea: rather than move the gates to optimize their location, we elastically deform a model of the 2-D chip surface on which the gates have been roughly placed, "stretching" it until the gates arrange themselves to our liking. Put simply: we move the grid, not the gates. Deforming the elastic grid is a surprisingly simple, low dimensional nonlinear optimization, and augments a traditional quadratic formulation. A preliminary implementation, WARP1, is already competitive with most recently published placers, e.g., placements that average 4% better wirelength, 40% faster than GORDIAN-L-DOMINO.
Keywords: Algorithms, Placement

23.2 Placement Feedback: A Concept and Method for Better Min-Cut Placements [p. 357]

A. B. Kahng, S. Reda (University of California at San Diego)

The advent of strong multi-level partitioners has made top-down min-cut placers a favored choice for modern placer implementations. We examine terminal propagation, an important step in min-cut placers, because it is responsible for translating partitioning results into global placement wire-length assumptions. In this work, we identify a previously overlooked problem - ambiguous terminal propagation - and propose a solution based on the concept of feedback from automatic control systems. Implementing our approach in Capo (version 8.7 [5, 10]) and applying it to standard benchmark circuits yields up to 14% wirelength reductions for the IBM benchmarks and 10% reductions for PEKO instances. Experiments also show consistent improvements for routed wirelength, yielding up to 9% wirelength reductions with practical increase in placement runtime. In addition, our method significantly improves routability without building congestion maps, and reduces the number of vias.
Keywords: min-cut placement, terminal propagation, feedback

23.3 Quantum-Dot Cellular Automata (QCA) Circuit Partitioning: Problem Modeling and Solutions [p. 363]

D. A. Antonelli, D. Z. Chen, T. J. Dysart, X. S. Hu (University of Notre Dame), A. B. Kahng (University of California at San Diego), P. M. Kogge, R. C. Murphy, M. T. Niemier (University of Notre Dame)

This paper presents the Quantum-Dot Cellular Automata (QCA) physical design problem, in the context of the VLSI physical design problem. The problem is divided into three subproblems: partitioning, placement, and routing of QCA circuits. This paper presents an ILP formulation and heuristic solution to the partitioning problem, and compares the two sets of results. Additionally, we compare the results of a human-generated circuit to the ILP and Heuristic formulations. The results demonstrate that the heuristic is a practical method of reducing partitioning run time while providing a result that is close to the optimal for a given circuit.
Keywords: Circuit Partitioning, Computer Aided Design, Quantum-Dot Cellular Automata (QCA)

SESSION 24: Model Order Reduction and Variational Techniques for Parasitic Analysis

Chair: Luca Daniel (Massachusetts Institute of Technology)
Organizers: Byron L. Krauter, Vikram Jandhyala

24.1 Passivity-Preserving Model Reduction Via a Computationally Efficient Project-And-Balance Scheme [p. 369]

N. Wong (The University of Hong Kong), V. Balakrishnan, C.-K. Koh (Purdue University)

This paper presents an efficient two-stage project-and-balance scheme for passivity-preserving model order reduction. Orthogonal dominant eigenspace projection is implemented by integrating the Smith method and Krylov subspace iteration. It is followed by stochastic balanced truncation wherein a novel method, based on the complete separation of stable and unstable invariant subspaces of a Hamiltonian matrix, is used for solving two dual algebraic Riccati equations at the cost of essentially one. A fast-converging quadruple-shift bulge-chasing SR algorithm is also introduced for this purpose. Numerical examples confirm the quality of the reduced-order models over those from conventional schemes. Keywords: Model reduction, Dominant Eigenspace, Projection, Stochastic Balanced Truncation, Riccati Equation, SR Algorithm

24.2 A Linear Fractional Transform (LFT) Based Model for Interconnect Parametric Uncertainty [p. 375]

J. M. Wang, O. A. Hafiz (University of Arizona at Tucson), J. Li (ETOP Design Technology)

As we scale toward nanometer technologies, the increase in interconnect parameter variations will bring significant performance variability. New design methodologies will emerge to facilitate construction of reliable systems from unreliable nanometer scale components. Such methodologies require new performance models which accurately capture the manufacturing realities. In this paper, we present a Linear Fractional Transform (LFT) based model for interconnect Parametric Uncertainty. This new model formulates the interconnect parameter uncertainty as a repeated scalar uncertainty structure. With the help of generalized Balanced Truncation Realization (BTR) based on Linear Matrix Inequalities (LMI's), the new model reduces the order of the original interconnect network while preserves the stability. This paper also shows that the LFT based model even guarantees passivity if the BTR reduction is based on solutions to a pair of Linear Matrix Inequalities (LMI's) which generalizes Lur'e equations .
Keywords: Parametric Uncertainty, Linear Fractional Transform

24.3 Variational Delay Metrics for Interconnect Timing Analysis [p. 381]

K. Agarwal, D. Sylvester, D. Blaauw (University of Michigan), F. Liu, S. Nassif (IBM Research), S. Vrudhula (University of Arizona)

In this paper we develop an approach to model interconnect delay under process variability for timing analysis and physical design optimization. The technique allows for closed-form computation of interconnect delay probability density functions (PDFs) given variations in relevant process parameters such as linewidth, metal thickness, and dielectric thickness. We express the resistance and capacitance of a line as a linear function of random variables and then use these to compute circuit moments. Finally, these variability-aware moments are used in known closed-form delay metrics to compute interconnect delay PDFs. We compare the approach to SPICE based Monte Carlo simulations and report an error in mean and standard deviation of delay of 1% and 4% on average, respectively.

24.4 Exploiting Input Infromation in a Model Reduction Algorithm for Massively Coupled Parasitic Networks [p. 385]

L. M. Silveira (Technical University of Lisbon), J. R. Phillips (Cadence Design Systems)

In this paper we present a model reduction algorithm that circumvents some of the issues encountered for parasitic networks with large numbers of input/output "ports". Our approach is based on the premise that for such networks, there are typically strong dependencies between the input waveforms at different network "ports". We present an approximate truncated balanced realizations procedure that, by exploiting such correlation information, produces much more compact models compared to standard algorithms such as PRIMA.
Keywords: Model order reduction, interconnect, parasitic.

SESSION 25: Compilation Techniques for Embedded Applications

Chair: Heinrich Meyr (RWTH Aachen/CoWare, Incorporated)
Organizer: Mahmut Kandemir

25.1 Automatic Translation of Software Binaries onto FPGAs [p. 389]

G. Mittal, D. C. Zaretsky, X. Tang, P. Banerjee (Northwestern University)

The introduction of advanced FPGA architectures, with built-in DSP support, has given DSP designers a new hardware alternative. By exploiting its inherent parallelism, it is expected that FPGAs can outperform DSP processors. This paper describes the process and considerations for automatically translating binaries targeted for general DSP processors into Register Transfer Level (RTL) VHDL or Verilog code to be mapped onto commercial FPGAs. The Texas Instruments C6000 DSP processor architecture is chosen as the DSP processor platform, and the Xilinx Virtex II as a target FPGA. Various optimizations are discussed, including data dependency analysis, procedure extraction, induction variable analysis, memory optimizations, and scheduling. Experimental results on resource usage and performance are shown for ten software binary benchmarks. Results show performance gains of 3-20X in the FPGA designs over that of the DSP processors in terms of reductions of execution cycles.
Keywords: Reconfigurable computing, compiler, hardware-software co-design, binary translation, decompilation.

25.2 Area-Efficient Instruction Set Synthesis for Reconfigurable System-on-Chip Designs [p. 395]

P. Brisk, A. Kaplan, M. Sarrafzadeh (University of California at Los Angeles)

Silicon compilers are often used in conjunction with Field Programmable Gate Arrays (FPGAs) to deliver flexibility, fast prototyping, and accelerated time-to-market. Many of these compilers produce hardware that is larger than necessary, as they do not allow instructions to share hardware resources. This study presents an efficient heuristic which transforms a set of custom instructions into a single hardware datapath on which they can execute. Our approach is based on the classic problems of finding the longest common subsequence and substring of two (or more) sequences. This heuristic produces circuits which are as much as 85.33% smaller than those synthesized by integer linear programming (ILP) approaches which do not explore resource sharing. On average, we obtained 55.41% area reduction for pipelined datapaths, and 66.92% area reduction for VLIW datapaths. Our solution is simple and effective, and can easily be integrated into an existing silicon compiler.
Keywords: Compiler, Field-Programmable Gate Array (FPGA), Integer Linear Programming (ILP), Resource Sharing

25.3 Data Compression for Improving SPM Behavior [p. 401]

O. Ozturk, M. Kandemir (Pennsylvania State University), I. Demirkiran (Syracuse University), G. Chen, M. J. Irwin (Pennsylvania State University)

Scratch-pad memories (SPMs) enable fast access to time-critical data. While prior research studied both static and dynamic SPM management strategies, not being able to keep all hot data (i.e., data with high reuse) in the SPM remains the biggest problem. This paper proposes data compression to increase the number of data blocks that can be kept in the SPM. Our experiments with several embedded applications show that our compression-based SPM management heuristic is very effective and outperforms prior static and dynamic SPM management approaches. We also present an ILP formulation of the problem, and show that the proposed heuristic generates competitive results with those obtained through ILP, while spending much less time in compilation.
Keywords scratch-pad memory, compilers, data compression

SPECIAL SESSION 26: Platform-Based System Design

Chair: Sujit Dey (University of California at San Diego)
Organizer: Sujit Dey

26.1 Platform Based Design: Does it Answer the Entire SoC Challenge? [p. 407]

G. Smith (Gartner Dataquest)

During the mid-1990s, design fads emerged as the design challenge was becoming significant and the EDA vendors were entered the methodology business. Trying to market methodologies helped produce today�s unfortunate design fad phenomenon.

26.2 Nomadic Platform Approach for Wireless Mobile Multimedia [p. 408]

M. Hopkins (STMicroelectronics, Inc.)

26.3 Benefits and Challenges for Platform-Based Design [p. 409]

A. Sangiovanni-Vincentelli, L. Carloni (University of California at Berkeley), F. De Bernardinis (University of California at Berkeley and Università di Pisa), M. Sgroi (DoCoMo Euro-Labs)

Platforms have become an important concept in the design of electronic systems. We present here the motivations behind the interest shown and the challenges that we have to face to make the Platform-based Design method a standard. As a generic term, platforms have meant different things to different people. The main challenges are to distill the essence of the method, to formalize it and to provide a framework to support its use in areas that go beyond the original domain of application.

26.4 Trends in the Use of Re-Configurable Platforms [p. 415]

M. Baron (Microprocessor Report)

SESSION 27: Innovations in Logic Synthesis

Chair: Rajeev Murgai (Fujitsu Laboratories)
Organizers: Marek Perkowski, Soha Hassoun

27.1 A Recursive Paradigm to Solve Boolean Relations [p. 416]

D. Bañeres, J. Cortadella (Universidad Politécnica de Catalunya), M. Kishinevsky (Intel Corp.)

A recursive algorithm for solving Boolean relations is presented. It provides several features: wide exploration of solutions, parametrizable cost function and efficiency. The experimental results show the applicability of the method and tangible improvements with regard to previous heuristic approaches.
Keywords: Boolean relations, decomposition, logic design.

27.2 A Robust Algorithm for Approximate Compatible Observability Don't Care (CODC) Computation [p. 422]

N. Saluja, S. P. Khatri (University of Colorado)

Compatible Observability Don't Cares (CODCs) are a powerful means to express the flexibility present at a node in a multi-level logic network. Despite their elegance, the applicability of CODCs has been hampered by their computational complexity. The CODC computation for a network involves several image computations, which require the construction of global BDDs of the circuit nodes. The size of BDDs of circuit nodes is unpredictable, and as a result, the CODC computation is not robust. In practice, CODCs cannot be computed for large circuits due to this limitation. In this paper, we present an algorithm to compute approximate CODCs (ACODCs). This algorithm allows us to compute compatible don't cares for significantly larger designs. Our ACODC algorithm is scalable in the sense that the user may trade off time and memory against the accuracy of the ACODCs computed. The ACODC is computed by considering a subnetwork rooted at the node of interest, up to a certain topological depth, and performing its don't care computation. We prove that the ACODC is an approximation of its CODC. We have proved the soundness of the approach, and performed extensive experiments to explore the trade-off between memory utilization, speed and accuracy. We show that even for small topological depths, the ACODC computation gives very good results. Our experiments demonstrate that our algorithm can compute ACODCs for circuits whose CODC computation has not been demonstrated to date. Also, for a set of benchmark circuits whose CODC computation yields an average 28% reduction in literals after optimization, our ACODC computation yields an average 22% literal reduction. Our algorithm has runtimes which are about 25X and memory utilization which is 33 X better that of the CODC computation of SIS.
Keywords: Compatible Observability Don't Cares (CODC), Multi-level Logic Optimization, Logic Synthesis

27.3 A Method to Decompose Multiple-Output Logic Functions [p. 428]

T. Sasao, M. Matsuura (Kyushu Institute of Technology)

This paper shows a method to decompose a given multiple-output circuit into two circuits with intermediate outputs. We use a BDD for characteristic function (BDD for CF) to represent a multiple-output function. Many benchmark functions were realized by LUT cascades with intermediate outputs. Especially, adders and a binary to BCD converter were successfully designed. Comparison with FPGAs is also presented. Categories and Subject Descriptors B.6.3 [Logic Design]: Design Aids General Terms Algorithms, Performance, Experimentation, Theory
Keywords Cascade, BDD, Characteristic function, FPGA

27.4 Symmetry Detection for Incompletely Specified Functions [p. 434]

K.-H. Wang, J.-H. Chen (Fu Jen Catholic University)

In this paper, we formulate symmetry detection for incompletely specified functions as an equation without using cofactor computation and equivalence checking. Based on this equation, a symmetry detection algorithm is proposed. This algorithm can simultaneously find nonequivalence and equivalence symmetries. Experimental results on a set of benchmarks show that our algorithm is indeed very effective in solving symmetry detection problem for incompletely specified functions.
Keywords: Equivalence Symmetry, Non-Equivalence Symmetry

27.5 Implicit Enumeration of Structural Changes in Circuit Optimization [p. 438]

V. N. Kravets, P. Kudva (IBM T.J. Watson Research Center)

We describe an implicit technique for enumerating structural choices in circuit optimization. The restructuring technique relies on the symbolic statements of functional decomposition which explores behavioral equivalence of circuit signals through rewiring and resubstitution. Using rigid, yet practical, formulation a rich variety of restructuring candidates is computed symbolically and applied incrementally to produce circuit changes with predictable structural effects. The restructuring technique is used to obtain much improved delays of the already optimized circuits along with their area savings. It is also applied to analyze benefits of optimizing circuit topology at the early steps of synthesis targeting its routability.
Keywords: Decomposition, Technology Mapping, Physical Synthesis, Re-synthesis, Optimization

SESSION 28: Yield Estimation and Optimization

Chair: John Cohn (IBM Corporation)
Organizers: Michael Orshansky, Sudhakar Bobba

28.1 Parametric Yield Estimation Considering Leakage Variability [p. 442]

R. R. Rao (University of Michigan), A. Devgan (IBM Corporation), D. Blaauw, D. Sylvester (University of Michigan)

Leakage current has become a stringent constraint in modern processor designs in addition to traditional constraints on frequency. Since leakage current exhibits a strong inverse correlation with circuit delay, effective parametric yield prediction must consider the dependence of leakage current on frequency. In this paper, we present a new chip-level statistical method to estimate the total leakage current in the presence of within-die and die-to-die variability. We develop a closed-form expression for total chip leakage that models the dependence of the leakage current distribution on a number of process parameters. The model is based on the concept of scaling factors to capture the effects of within-die variability. Using this model, we then present an integrated approach to accurately estimate the yield loss when both frequency and power limits are imposed on a design. Our method demonstrates the importance of considering both these limiters in calculating the yield of a lot.
Keywords: Leakage, variability, parametric yield

28.2 A Methodology to Improve Timing Yield in the Presence of Process Variations [p. 448]

S. Raj, S. B. K. Vrudhula, J. Wang (University of Arizona)

The ability to control the variations in IC fabrication process is rapidly diminishing as feature sizes continue towards the sub-100 nm regime. As a result, there is an increasing uncertainty in the performance of CMOS circuits. Accounting for the worst case values of all parameters will result in an unacceptably low timing yield. Design for Variability, which involves designing to achieve a given level of confidence in the performance of ICs, is fast becoming an indispensable part of IC design methodology. This paper1 describes a method to identify certain paths in the circuit that are responsible for the spread of timing performance. The method is based on defining a disutility function of the gate and path delays, which includes both the means and variances of the delay random variables. Based on the moments of this disutility function, an algorithm is presented which selects a subset of paths (called undominated paths) as being most responsible for the variation in timing performance. Next, a statistical gate sizing algorithm is presented, which is aimed at minimizing the delay variability of the nodes in the selected paths subject to constraints on the critical path delay and the area penalty. Monte-Carlo simulations with ISCAS'85 benchmark circuits shows that our statistical optimization approach results in significant improvements in timing yield over traditional deterministic sizing methods.
Keywords: Timing Analysis, Timing Yield, Gate Sizing.

28.3 Novel Sizing Algorithm for Yield Improvement under Process Variation in Nanometer Technology [p. 454]

S. H. Choi (Intel Corporation), B. C. Paul, K. Roy (Purdue University)

Due to process parameter variations, a large variability in circuit delay occurs in scaled technologies affecting the yield. In this paper, we propose a sizing algorithm to ensure the speed of a circuit under process variation with a certain degree of confidence while maintaining the area and power budget within a limit. This algorithm estimates the variation in circuit delay using statistical timing analysis considering both inter- and intra-die process variation and resizes the circuit to achieve a desired yield. Experimental results on several benchmark circuits show that one can achieve up to 19% savings in area (power) using our algorithm compared to the worst-case design.

28.4 Statistical Timing Analysis Based on a Timing Yield Model [p. 460]

F. N. Najm (University of Toronto), N. Menezes (Intel Corporation)

Starting from a model of the within-die systematic variations using principal components analysis, a model is proposed for estimation of the parametric yield, and is then applied to estimation of the timing yield. Key features of these models are that they are easy to compute, they include a powerful model of within-die correlation, and they are �full-chip� models in the sense that they can be applied with ease to circuits with millions of components. As such, these models provide a way to do statistical timing analysis without the need for detailed statistical analysis of every path in the design.
Keywords: Statistical timing analysis, timing yield, principal components

SESSION 29: High-Level Techniques for Signal Processing

Chair: Stephen A. Edwards (Columbia University)
Organizers: Reinaldo A. Bergamaschi, Stephen A. Edwards

29.1 System Design for DSP Applications in Transaction Level Modeling Paradigm [p. 466]

A. K. Deb, A. Jantsch, J. Öberg (Royal Institute of Technology)

In this paper, we systematically define three transaction level models (TLMs), which reside at different levels of abstraction between the functional and the implementation model of a DSP system. We also show a unique language support to build the TLMs. Our results show that the abstract TLMs can be built and simulated much faster than the implementation model at the expense of a reasonable amount of simulation accuracy.
Keywords: Transaction level modeling, system design, DSP, grammar

29.2 An Analytical Approach for Dynamic Range Estimation [p. 472]

B. Wu, J. Zhu, F. N. Najm (University of Toronto)

It has been widely recognized that the dynamic range information of an application can be exploited to reduce the datapath bitwidth of either processors or ASICs, and therefore the overall circuit area, delay and power consumption. While recent proposals of analytical dynamic range estimation methods have shown significant advantages over the traditional profiling-based method in terms of runtime, we argue that the rather simplistic treatment of input correlation may lead to significant error. We instead introduce a new analytical method based on a mathematical tool called Karhunen-Loéve Expansion (KLE), which enables the orthogonal decomposition of random processes. We show that when applied to linear systems, this method can not only lead to much more accurate result than previously possible, thanks to its capability to capture and propagate both spatial and temporal correlation, but also richer information than the value bounds previously produced, which enables the exploration of interesting trade-off between circuit performance and signal-to-noise ratio.
Keywords: Dynamic range, bitwidth, Karhunen-Loéve Expansion (KLE), correlation in the following text), has to be obtained.

29.3 Automated Fixed-Point Data-Type Optimization Tool for Signal Processing and Communication Systems [p. 478]

C. Shi, R. W. Brodersen (University of California at Berkeley)

A tool that automates the floating-point to fixed-point conversion (FFC) process for digital signal processing systems is described. The tool automatically optimizes fixed-point data types of arithmetic operators, including overflow modes, integer word lengths, fractional word lengths, and the number systems. The approach is based on statistical modeling, hardware resource estimation and global optimization based on an initial structural system description. The basic technique exploits the fact that the fixed point realization is a weak perturbation of the floating point realization which allows the development of a system model which can be used in the optimization process.
Keywords: Fixed-point arithmetic, optimization, communication systems, digital signal processing, FPGA, ASIC

29.4 An Algorithm for Converting Floating-Point Computations to Fixed-Point in MATLAB based FPGA Design [p. 484]

S. Roy, P. Banerjee (Northwestern University)

Most practical FPGA designs of digital signal processing applications are limited to fixed-point arithmetic owing to the cost and complexity of floating-point hardware. While mapping DSP applications onto FPGAs, a DSP algorithm designer, who often develops his applications in MATLAB, must determine the dynamic range and desired precision of input, intermediate and output signals in a design implementation to ensure that the algorithm fidelity criteria are met. The first step in a flow to map MATLAB applications into hardware is the conversion of the floating-point MATLAB algorithm into a fixed-point version. This paper describes an approach to automate this conversion, for mapping to FPGAs by profiling the expected inputs to estimate errors. Our algorithm attempts to minimize the hardware resources while constraining the quantization error within a specified limit.
Keywords: Quantization, Quantizer, Fixed point, Floating Point

29.5 Synthesizing Interconnect -Efficient Low Density Parity Check Codes [p. 488]

M. Mohiyuddin, A. Prakash, A. Aziz (The University of Texas at Austin), W. Wolf (Princeton University)

Error correcting codes are widely used in communication and storage applications. Codec complexity has usually been measured with a software implementation in mind. A recent hardware implementation of a Low Density Parity Check code (LDPC) indicates that interconnect complexity dominates the VLSI cost. We describe a heuristic interconnect-aware synthesis algorithm which generates LDPC codes that use an order of magnitude less wiring with little or no loss of coding efficiency.
Keywords: Error correcting codes, low-density parity-check codes, routing congestion.

SESSION 30: Advanced Test Solutions

Chair: Paolo Prinetto (Politecnico di Torino)
Organizers: TM Mak, Yervant Zorian

30.1 On Path-Based Learning and Its Applications in Delay Test and Diagnosis [p. 492]

L.-C. Wang (University of California at Santa Barbara), T. M. Mak (Intel Corporation), K.-T. Cheng (University of California at Santa Barbara), M. S. Abadir (Motorola, Inc.)

This paper describes the implementation of a novel path-based learning methodology that can be applied for two purposes: (1) In a presilicon simulation environment, path-based learning can be used to produce a fast and approximate simulator for statistical timing simulation. (2) In postsilicon phase, path-based learning can be used as a vehicle to derive critical paths based on the pass/fail behavior observed from the test chips. Our path-based learning methodology consists of four major components: a delay test pattern set, a logic simulator, a set of selected paths as the basis for learning, and a machine learner. We explain the key concepts in this methodology and present experimental results to demonstrate its feasibility and applications.
Keywords: Delay test, Statistical timing simulation, Machine learning

30.2 Efficient On-Line Testing of FPGAs with Provable Diagnosabilities [p. 498]

V. Verma (Xilinx, Inc.), S. Dutt, V. Suthar (University of Illinois at Chicago)

We present novel and efficient methods for on-line testing in FPGAs. The testing approach uses a ROving TEster (ROTE), which has provable diagnosabilities and is also faster than prior FPGA testing methods. We present 1- and 2-diagnosable built-in self-tester (BISTer) designs that make up the ROTE, and that avoid expensive adaptive diagnosis. To the best of our knowledge, this is the first time that a BISTer design with diagnosability greater than one has been developed for FPGAs. We also develop functional testing methods that test PLBs in only two circuit functions that will be mapped to them (as opposed to testing PLBs in all their operational modes) as the ROTE moves across a functioning FPGA. Simulation results show that our 1-diagnosable BISTer and our functional testing technique leads to significantly more accurate (98% (90.5%) fault coverage at a fault/defect density of 10% (25%)) and faster test-and-diagnosis of FPGAs than achieved by previous work. In general, it is expected that ROTE will achieve high fault coverages at fault/defect densities of up to 25% using our 1-diagnosable BISTer and up to 33% using our 2-diagnosable BISTer. Our methods should thus prove useful for testing current very deep submicron FPGAs as well as future nano-CMOS and molecular nanotechnology FPGAs in which defect densities are expected to be in the 10% range.
Keywords and Phrases: built-in self-tester (BISTer), diagnosability, FPGAs, functional testing, on-line testing, roving tester (ROTE).

30.3 On Test Generation for Transition Faults with Minimized Peak Power Dissipation [p. 504]

W. Li, S. M. Reddy (University of Iowa), I. Pomeranz (Purdue University)

This paper presents a method of generating tests for transition faults using tests for stuck-at faults such that the peak power is the minimum possible using a given set of tests for stuck-at faults. The proposed method is suitable for use in testing scan designs that employ enhanced scan. The method reduces the peak power consumption in benchmark circuits by 19% on the average with essentially the same test set size and the same fault coverage compared to an earlier method.
Keywords: Power Dissipation, Test Generation, Transition Faults

30.4 A New State Assignment Technique for Testing and Low Power [p. 510]

S. Park, S. Cho (Hanyang University at Ansan), S. Yang (Pusan University), M. Ciesielski (University of Massachusetts)

In order to improve the testabilities and power consumption, a new state assignment technique based on m-block partition is introduced in this paper. The length and number of feedback cycles are reduced with minimal switching activity on the state variables. Experiment shows significant improvement in power dissipation and testabilities for benchmark circuits.
Keywords: State Encoding, Fault Coverage, Low power, Scan design.

30.5 Automatic Generation of Breakpoint Hardware for Silicon Debug [p. 514]

B. Vermeulen (Philips Research Laboratories), M. Z. Urfianto (Royal Institute of Technology), S. K. Goel (Philips Research Laboratories)

Scan-based silicon debug is a technique that can be used to help find design errors in prototype silicon more quickly. One part of this technique involves the inclusion of break-point modules during the design stage of the chip. This paper focuses on an innovative approach to automatically generate breakpoint modules by means of a breakpoint description language. This language is illustrated using an example, and experimental results are presented that show the efficiency and effectiveness of this new method for generating breakpoint hardware.

SESSION 31: Advances in Boolean Analysis Techniques

Chair: Aarti Gupta (NEC Corporation)
Organizers: Pei-Hsin Ho, Tony Ma

31.1 AMUSE: A Minimally-Unsatisfiable Subformula Extractor [p. 518]

Y. Oh, M. N. Mneimneh, Z. S. Andraus, K. A. Sakallah, I. L. Markov (University of Michigan)

This paper describes a new algorithm for extracting unsatisfiable subformulas from a given unsatisfiable CNF formula. Such unsatisfiable "cores" can be very helpful in diagnosing the causes of infeasibility in large systems. Our algorithm is unique in that it adapts the �learning process� of a modern SAT solver to identify unsatisfiable subformulas rather than search for satisfying assignments. Compared to existing approaches, this method can be viewed as a bottom-up core extraction procedure which can be very competitive when the core sizes are much smaller than the original formula size. Repeated runs of the algorithm with different branching orders yield different cores. We present experimental results on a suite of large automotive benchmarks showing the performance of the algorithm and highlighting its ability to locate not just one but several cores.
Keywords: Minimally-unsatisfiable subformula, (MUS), conjunctive normal form (CNF), Boolean satisfiability (SAT), diagnosis.

31.2 A SAT-Based Algorithm for Reparameterization in Symbolic Simulation [p. 524]

P. Chauhan, E. M. Clarke, D. Kroening (Carnegie Mellon University)

Parametric representations used for symbolic simulation of circuits usually use BDDs. After a few steps of symbolic simulation, state set representation is converted from one parametric representation to another smaller representation, in a process called reparameterization. For large circuits, the reparametrization step often results in a blowup of BDDs and is expensive due to a large number of quantifications of input variables involved. Efficient SAT solvers have been applied successfully for many verification problems. This paper presents a novel SAT-based reparameterization algorithm that is largely immune to the large number of input variables that need to be quantified. We show experimental results on large industrial circuits and compare our new algorithm to both SAT-based Bounded Model Checking and BDD based symbolic simulation. We were able to achieve on average 3x improvement in time and space over BMC and able to complete many examples that BDD based approach could not even finish.
Keywords: Symbolic Simulation, SAT checkers, Bounded Model Checking, Parametric Representation, Safety Property Checking

31.3 Exploiting Structure in Symmetry Detection for CNF [p. 530]

P. T. Darga, M. H. Liffiton, K. A. Sakallah, I. L. Markov (University of Michigan)

Instances of the Boolean satisfiability problem (SAT) arise in many areas of circuit design and verification. These instances are typically constructed from some human-designed artifact, and thus are likely to possess much inherent symmetry and sparsity. Previous work [4] has shown that exploiting symmetries results in vastly reduced SAT solver run times, often with the search for the symmetries themselves dominating the total SAT solving time. Our contribution is twofold. First, we dissect the algorithms behind the venerable nauty [9] package, particularly the partition refinement procedure responsible for the majority of search space pruning as well as the majority of run time overhead. Second, we present a new symmetry-detection tool, saucy, which outperforms nauty by several orders of magnitude on the large, structured CNF formulas generated from typical EDA problems.
Keywords: Boolean satisfiability (SAT), symmetry, abstract algebra, backtrack search, partition refinement, graph automorphism.

31.4 Refining the SAT Decision Ordering for Bounded Model Checking [p. 535]

C. Wang, H. Jin, G. D. Hachtel, F. Somenzi (University of Colorado)

Bounded Model Checking (BMC) relies on solving a sequence of highly correlated Boolean satisfiability (SAT) problems, each of which checks for the existence of counter-examples of a bounded length. The performance of SAT search depends heavily on the variable decision ordering. We propose an algorithm to exploit the correlation among different SAT problems in BMC, by predicting and successively refining a partial variable ordering. This ordering is based on the analysis of all previous unsatisfiable instances, and is combined with the SAT solver's existing decision heuristic to determine the final variable decision ordering. Experiments on real designs from industry show that our new method improves the performance of SAT-based BMC significantly.
Keywords: Bounded Model checking, SAT, decision heuristic

31.5 Efficient Equivalance Checking with Partitions and Hierarchical Cut-points [p. 539]

D. Anastasakis, L. McIlwain (Synopsys, Inc.), S. Pilarski (University of Washington)

Previous results show that both flat and hierarchical methodologies present obstacles to effectively completing combinational equivalence checking. A new approach that combines the benefits while effectively dealing with the pitfalls of both styles of equivalence checking is presented.
Keywords: Logic Design, Verification, Equivalence Checking

SESSION 32: Panel - Were the Good Old Days all That Good? EDA Then and Now [p. 543]

Chair: William H. Joyner, Jr. (IBM Corporation/SRC)
Organizers: Shishpal Rawat, William H. Joyner, Jr.
Panelists: J. Darringer (IBM Corporation), D. Gajski (University of California at Irvine), P. O. Pistilli (MP Associates), H. De Man (IMEC), C. Harris (Kluwer Academic Publishers), J. Solomon (Independent Technology Advisor)

A long, long time ago, in a laboratory far, far away, EDA researchers and developers used paper tape instead of Linux, rubylith instead of GDS II, yellow wires instead of ten levels of metal. Sitting around a potbellied stove, in their rocking chairs, practitioners of that era (and this) will offer insight into why some great ideas were immediately put into practice while others stayed on the drawing board or in the ivory tower. They will share remembrances of things past, of simpler days when foundries made steel, when options meant CMOS or bipolar, when real parts were measured instead of benchmarks touted. Their stories of what it was like, what has changed, and whether the "good old days" were then or now will be followed by questions and, possibly, answers.

SESSION 33: Power Optimization for Real-Time and Media-Rich Embedded Systems

Chair: Sujit Dey (University of California at San Diego)
Organizers: Sujit Dey

33.1 Off-chip Latency-Driven Dynamic Voltage and Frequency Scaling for an MPEG Decoding [p. 544]

K. Choi, R. Soma, M. Pedram (University of Southern California)

This paper describes a dynamic voltage and frequency scaling (DVFS) technique for MPEG decoding to reduce the energy consumption using the computational workload decomposition. This technique decomposes the workload for decoding a frame into on-chip and off-chip workloads. The execution time required for the on-chip workload is CPU frequency-dependent, whereas the off-chip workload execution time does not change, regardless of the CPU frequency, resulting in the maximum energy savings by setting the minimum frequency during off-chip workload execution time, without causing any delay penalty. This workload decomposition is performed using a performance-monitoring unit (PMU) in the XScale-processor, which provides various statistics such as cache hit/miss and CPU stall, due to data dependency at run time. The on-chip workload for an incoming frame is predicted using a frame-based history so that the processor voltage and frequency can be scaled to provide the exact amount of computing power needed to decode the frame. To guarantee a quality of service (QoS) constraint, a prediction error compensation method, called inter-frame compensation, is proposed in which the on-chip workload prediction error is diffused into subsequent frames such that run time frame rates change smoothly. The proposed DVFS algorithm has been implemented on an XScale-based Testbed. Detailed current measurements on this platform demonstrate significant CPU energy savings ranging from 50% to 80% depending on the video clip.
Keywords: Low power, MPEG decoding, voltage and frequency scaling

33.2 Energy-Aware Deterministic Fault Tolerance in Distributed Real-Time Embedded Systems [p. 550]

Y. Zhang (Duke University), R. Dick (Northwestern University), K. Chakrabarty (Duke University)

We investigate a unified approach for fault tolerance and dynamic power management in distributed real-time embedded systems. Coordinated checkpointing is used to achieve fault tolerance, and power management is carried out using dynamic voltage scaling. We present feasibility-of-scheduling tests for coordinated checkpointing schemes for a constant processor speed as well as for DVS-enabled processors that can operate at variable speeds. Simulation results based on the CORDS hardware/software co-synthesis system show that, compared to fault-oblivious methods, the proposed approach significantly reduces power consumption while guaranteeing timely task completion in the presence of faults.
Keywords: Real-time systems, fault tolerance, checkpointing, voltage scaling

33.3 Proxy-based Task Partitioning of Watermarking Algorithms for Reducing Energy Consumption in Mobile Devices [p. 556]

A. Kejariwal, S. Gupta, A. Nicolau, N. Dutt (University of California at Irvine), R. Gupta (University of California at San Diego)

Digital watermarking is a process that embeds an imperceptible signature or watermark in a digital file containing audio, image, text or video data. The watermark is later used to authenticate the data file and for tamper detection. It is particularly valuable in the use and exchange of digital media such as audio and video on emerging handheld devices. However, watermarking is computationally expensive and adds to the drain of the available energy in handheld devices. We present an approach in which we partition the watermarking embedding and extraction algorithms and migrate some tasks to a proxy server. This leads to a lower energy consumption on the handheld without compromising the security of the watermarking process. Our results show that executing watermarking partitioned between the proxy and the handheld reduces the total energy consumed by 80% over running it only on the handheld and improves performance by over two orders of magnitude.
Keywords: Handhelds, Proxy, Partitioning, Watermarking

33.4 Adaptive Data Partitioning for Ambient Multimedia [p. 562]

X. Hu, R. Marculescu (Carnegie Mellon University)

In the near future, Ambient Intelligence (AmI) will become part of everyday life. Combining feature-rich multimedia with AmI (dubbed Ambient Multimedia for short) has the potential of changing the way we perceive and interact with our environment. One major difficulty, however, in designing Ambient Multimedia Systems (AMS) comes from the strong constraints imposed on system resources by the AmI application requirements. In this paper, we propose a method for mapping multimedia applications on systems with very limited resources (i.e. memory, computing capability and battery lifetime) by combining adaptive data partitioning with dynamic power management. The potential of the approach is illustrated through a case study of an object tracking application running on a resource-constrained platform.
Keywords: Ambient Intelligence, motion detection, power management

33.5 Energy Characterization of Filesystems for Diskless Embedded Systems [p. 566]

S. Choudhuri (University of California at Irvine), R. N. Mahapatra (Texas A & M University)

The need for low power, small form-factor, secondary storage devices in embedded systems has led to the widespread use of flash memory. Energy consumption due to processor and flash for such devices is critical to embedded system design. In this paper, we have proposed a quantitative account of energy consumption in both processor and flash due to overhead of filesystem related system calls. A macro-model for such energy consumption is derived using linear regression analysis. The results describing filesystem energy consumption have been obtained from Linux Kernel running Journaling Flash Filesystem 2 (JFFS2) and Extended 3 (Ext3) filesystems on StrongARM processor with flash as secondary storage device. Armed with such a macromodel, a designer can choose and partition filesystem, estimate the application energy consumption (due to processor and flash) due to filesystem during the early stage of system design.
Keywords: diskless, flash, macromodel

SESSION 34: Latency Tolerance and Asynchronous Design

Chair: Marios Papaefthymiou (University of Michigan)
Organizers: James C. Hoe, Leon Stok, Marios Papaefthymiou

34.1 A Method for Correcting the Functionality of a Wire-Pipelined Circuit [p. 570]

V. Nookala, S. S. Sapatnekar (University of Minnesota)

As across-chip interconnect delays can exceed a clock cycle, wire pipelining becomes essential in high performance designs. Although it allows higher clock frequencies, it may change the microarchitecture altogether because of the arbitrary increase in the latencies of the paths and cycles of the circuit. This paper proposes a method to regain the functionality of a wire-pipelined circuit. In this approach, increased cycle latencies are compensated by slowing down the issue rate of the inputs. Our method finds the optimal value of the slowdown required for a circuit as it directly affects the throughput of the circuit. We also incorporate area minimization in our formulation to minimize the number of extra flip-flops added to the circuit. The formulation is tested on circuits derived from ISCAS benchmarks and the results suggest that wire pipelining increases the overall throughput in most of the cases.
Keywords: Wire pipelining, Synchronous design

34.2 A New Approach to Latency Insensitive Design [p. 576]

M. R. Casu, L. Macchiarulo (Politecnico di Torino/CERCOM)

Latency Insensitive Protocols have been proposed as a viable mean to speed up large Systems-on-Chip where the limit in clock frequency is given by long global wires connecting together functional blocks. In this paper we keep the philosophy of Latency Insensitive Design and show that a drastic simplification can be done that results in even no need to implement any kind of protocol. By using a scheduling algorithm for the functional blocks activation we greatly reduce the routing resources demand of the old protocol, the area occupied by the sequential elements used to pipeline long interconnects and the complexity of the gating structure used to activate the modules.
Keywords: Interconnections, Pipelining, System-on-Chip

34.3 Pre-layout Wire Length and Congestion Estimation [p. 582]

Q. Liu, M. Marek-Sadowska (University of California at Santa Barbara)

In this paper, we study the pre-layout wire length and congestion estimation. We find that two structural metrics, mutual contraction and net range, can be used to predict wire lengths. These metrics have different application ranges and complement each other. We also propose a new metric, the structural pin density, to capture the peak routing congestion of designs. Larger maximum pin densities usually lead to larger peak congestions in circuits with similar average congestions. We demonstrate experimentally very good correlation of our pre-layout measures with post layout interconnect lengths. We also isolate the structural netlist properties which cause the peak congestion.
Keywords: Wire Length, Congestion, Prediction

34.4 The Best of Both Worlds: The Efficient Asynchronous Implementation of Synchronous Specifications [p. 588]

A. Davare (University of California at Berkeley), K. Lwin (Cadence Design Systems), A. Kondratyev (Cadence Berkeley Labs.), A. Sangiovanni-Vincentelli (University of California at Berkeley)

The desynchronization approach combines a traditional synchronous specification style with a robust asynchronous implementation model. The main contribution of this paper is the description of two optimizations that decrease the overhead of desynchronization. First, we investigate the use of clustering to vary the granularity of desynchronization. Second, by applying temporal analysis on a formal execution model of the desynchronized design, we uncover significant amounts of timing slack. These methods are successfully applied to industrial RTL designs.
Keywords: Desynchronization, Separation Analysis, Clustering

34.5 Fast Hazard Detection in Combinational Circuits [p. 592]

C. Jeong, S. M. Nowick (Columbia University)

In designing asynchronous circuits it is critical to ensure that circuits are free of hazards in the specified set of input transitions. In this paper, two new algorithms for detecting hazards in combinational circuits are proposed. These algorithms can be used to determine if the circuit is free of combinational hazards without exploring all the gates in the circuits, thus providing more efficient hazard detection. Experimental results indicate that the best new algorithm on average visits only 20.7% of the original gates, with an average runtime speedup of 1.69 and best speedup of 2.27 (for the largest example).
Keywords: Logic Design, Asynchronous circuits, Hazards.

SESSION 35: New Technologies in System Design

Chair: Petru Eles (Linkoping University)
Organizers: Ahmed A. Jerraya, Gila Kamhi

35.1 Defect Tolerant Probabilistic Design Paradigm for Nanotechnologies [p. 596]

M. Jacome, C. He, G. de Veciana, S. Bijansky (The University of Texas at Austin)

Recent successes in the development and self-assembly of nanoelectronic devices suggest that the ability to manufacture dense nanofabrics is on the near horizon. However, the tremendous increase in device density of nanoelectronics will be accompanied by a substantial increase in hard and soft faults, posing a major challenge to current design methodologies and tools. In this paper we propose a novel probabilistic design paradigm for defective but reconfigurable nanofabrics. The new design goal is to devise an appropriate structural/behavioral decomposition which improves scalability by constraining the reconfiguration process, while meeting a desired probability of successful instantiation, i.e, yield. Our approach not only addresses the scalability problem in configuring dense nanofabrics subject to defects, but gives a rich framework in which critical trade-offs among performance, yield, and per chip cost can be explored. We present a concrete instance of the approach and show extensive experimental results supporting these claims.
Keywords: Nanotechnologies, probabilistic design, defect tolerance

35.2 Architecture-Level Synthesis for Automatic Interconnect Pipelining [p. 602]

J. Cong, Y. Fan, Z. Zhang (University of California at Los Angeles)

For multi-gigahertz synchronous designs in nanometer technologies, multiple clock cycles are needed to cross the global interconnects, thus making it necessary to have pipelined global interconnects. In this paper we present an architecture-level synthesis solution to support automatic pipelining of on-chip interconnects. Specifically, we extend the recently proposed Regular Distributed Register (RDR) micro-architecture to support interconnect pipelining. We formulate a novel global interconnect sharing problem for global wiring minimization and show that it is polynomial time solvable by transformation to a special case of the real-time scheduling problem. Experimental results show that our approach matches or exceeds the RDR-based approach in performance, with a significant wiring reduction of 15% to 21%.
Keywords: High-level synthesis, multi-cycle communication, interconnect pipelining, scheduling, register binding

35.3 Automatic Generation of Equivalent Architecture Model from Functional Specification [p. 608]

S. Abdi, D. Gajski (University of California at Irvine)

This paper presents an algorithm for automatic generation of an architecture model from a functional specification, and proves its correctness. The architecture model is generated by distributing the intended system functionality over various components in the platform architecture. We then define simple transformations that preserve the execution semantics of system level models. Finally, the model generation algorithm is proved correct using our transformations. As a result, we have an automated path from a functional model of the system to an architectural one and we need to debug and verify only the functional specification model, which is smaller and simpler than the architecture model. Our experimental results show significant savings in both the modeling and the validation effort.
Keywords: System level design, Formal verification, Model Refinement

35.4 Divide-and-Concatenate: An Architecture Level Optimization Technique for Universal Hash Functions [p. 614]

B. Yang, R. Karri (Polytechnic University), D. A. McGrew (Cisco Systems, Inc.)

We present an architecture optimization technique called divide-and-concatenate for universal hash functions. The area of a multiplier increases quadratically and its speed increases gradually with the operand size and two universal hash functions are equivalent if they have the same collision probability property. Based on these observations, the divide-and-concatenate approach divides a 2w-bit data path (with collision probability 2-2w) into two w-bit data paths (each with collision probability 2-w), applies one message word to these two w-bit data paths and concatenates their results to construct an equivalent 2w-bit data path (with collision probability 2-2w). We demonstrate this technique on Linear Congruential Hash (LCH) family. When compared to the 100% overhead associated with duplicating a straightforward 32-bit LCH data path, the divide-and-concatenate approach that uses four equivalent 8-bit data paths yields a 101% increase in throughput with only 52% hardware overhead.

35.5 Performance Analysis of Different Arbitration Algorithms of the AMBA AHB Bus [p. 618]

M. Conti, M. Caldari, G. B. Vece, S. Orcioni, C. Turchetti (Università Politecnica delle Marche)

Bus performances are extremely important in a platform-based design. System Level analysis of bus performances gives important information for the analysis and choice between different architectures driven by functional, timing and power constraints of the System-on-Chip. This paper presents the effect of different arbitration algorithms and bus usage methodologies on the bus AMBA AHB performances in terms of effective throughput and power dissipation. SystemC and VHDL models have been developed and simulations have been performed.
Keywords: AMBA AHB BUS, SystemC, Arbitration Algorithm, Performance

SPECIAL SESSION 36: BioMEMS

Chair: Andrew B. Kahng (University of California at San Diego)
Organizer: Jacob K. White

36.1 Design Tools for BioMEMS [p. 622]

T. Korsmeyer, J. Zeng, K. Greiner (Coventor, Inc.)

Microsystems used for chemical analyses and biological assays are termed BioMEMS or labs-on-a-chip. These systems often require some of the traditional electromechanical capabilities of MEMS, and in addition require the manipulation of fluids in either continuous flow or droplet form. The distinction between continuous flow and droplets defines two broad categories of BioMEMS. Different applications call for one or the other of these approaches, but in either case, software for design and simulation can make a significant contribution to design optimization and reduction in time to market. A computer aided design and analysis approach is presented in which system-level analysis is favored over detailed analysis, although it is shown that this is not always possible, nor preferred. Examples of the use of design and analysis software in BioMEMS development are presented including: electrostatic actuation, a lab-on-a-chip for separation, on-chip optics, a digital fluidic processor, electrospray ionization, and a two-stage chemical reaction.
Keywords: MEMS, BioMEMS, lab-on-a-chip, �TAS, CAD, system-level modeling, BEM, FEM

36.3 CAD Challenges in BioMEMS Design [p. 629]

J. White (Massachusetts Institute of Technology)

The ability of micromachining, or MEMS, technology to directly manipulate micron- and nanometer-sized objects makes it an idea technology for a wide range of biological and biomedical applications, and has led to a subfield in bioMEMs design. Although BioMEMS is attracting substantial attention and research, development has been hampered by the lack of computer-aided design tools. The available tools are far behind those for integrated circuit design, and therefore successful bioMEMS designers require very sophisticated processing expertise. It is the purpose of this paper to encourage research in this rapidly evolving computer- aided design field, by providing the briefest of summaries and an extensive set of pointers to literature.

SESSION 37: Panel - Will Moore's Law Rule in the Land of Analog? [p. 633]

Chair: Rob A. Rutenbar (Carnegie Mellon University)
Organizer: Rob A. Rutenbar
Panelists: T. Bonaccio (IBM Corp.), T. Meng (Stanford University), E. Perea (STMicroelectronics), R. Pitts (Texas Instruments, Inc.), C. Sodini (Massachusetts Institute of Technology), J. Wieser (National Semiconductor Corp.)

Once upon a time there was a wise and benevolent ruler whose Law multiplied his subjects' wealth and happiness--about 2X, every couple of years, but the kingdom was divided. Those in the happy hamlet of Digital got fatter (and faster), year after year. Not so the talented artisans in the town of Analog complained constantly about "voltage headroom", "variability", "noise", "matching", "kT/C limits", the rising costs of supporting their neighbors' insatiable addiction to shrinking transistors, and how the grass looked greener just over the border, in Silicon-Germania. So, what's a King to do? Will we see billion transistor chips with integrated RF made from transistors that are 25 atoms wide? Or will the peasants in the land of Analog really revolt?

SESSION 38: Floorplanning

Chair: Malgorzata Marek-Sadowska (University of California at Santa Barbara)
Organizers: Chung-Kuan Cheng, Frank M. Johannes

38.1 Profile-Guided Microarchitectural Floorplanning for Deep Submicron Processor Design [p. 634]

M. Ekpanyapong, J. R. Minz (Georgia Institute of Technology), T. Watewai (University of California at Berkeley), H.-H. S. Lee, S. K. Lim (Georgia Institute of Technology)

As process technology migrates to deep submicron with feature size less than 100nm, global wire delay is becoming a major hindrance in keeping the latency of intra-chip communication within a single cycle, thus decaying the performance scalability substantially. An effective floorplanning algorithm can no longer ignore the information of dynamic communication patterns of applications. In this paper, using the profile information acquired at the architecture/microarchitecture level, we propose a "profile-guided microarchitectural floorplanner" that considers both the impact of wire delay and the architectural behavior, namely the intermodule communication, to reduce the latency of frequent routes inside a processor and to maintain performance scalability. Based on our simulation results, the profile-guided method shows a 5% to 40% average IPC improvement when clock frequency is fixed. From the perspective of instruction throughput (in BIPS), our floorplanner is much more scalable than a conventional wire length based floorplanner.
Keywords: Microarchitectural Planning, Floorplanning, Computer Architecture.

38.2 Floorplanning Optimization with Trajectory Piecewise-Linear Model for Pipelined Interconnects [p. 640]

C. Long, L. J. Simonson, W. Liao, L. He (University of California at Los Angeles)

Interconnect pipelining has a great impact on system performance, but has not been considered by automatic floorplanning. Considering interconnect pipelining, we study the floorplanning optimization problem to minimize system CPI (cycles per instruction) and in turn maximize system performance. We develop an efficient table-based model called trajectory piece-wise linear (TPWL) model to estimate CPI with interconnect pipelining. Experiments show that the TPWL model differs from cycle-accurate simulations by less than 3.0%. We integrate this model with a simulated-annealing based floorplan optimization to obtain CPI-aware floorplanning. Compared to the conventional floorplanning to minimize area and wire length, our CPI-aware floorplanning can reduce CPI by up to 28.6% with a small area overhead of 5.69% under 100nm technology and obtain better results under 70nm technology. To the best of our knowledge, this paper is the first in-depth study on floorplanning optimization with consideration of interconnect pipelining.

38.3 A Packing Algorithm for Non-Manhattan Hexagon/Triangle Placement Design by Using an Adaptive O-tree Representation [p. 646]

J. Li, T. Yan, B. Yang, J. Yu (University of Electronic Science and Technology of China), C. Li (Cadence Design Systems)

A non-Manhattan Hexagon/Triangle Placement (HTP for short) paradigm is proposed in the present paper. Main feature of this paradigm lies in adapting to the Y- architecture which is one of the promising non-Manhattan VLSI circuit layout architectures. Aim of the HTP is to place a set of equilateral triangles with given size onto a hexagonal chip with maximal chip area usage. Based on the O-tree representation, some adaptive packing rules are adopted to develop an effective placement algorithm for solving the HTP problem in BBL mode. Two examples with benchmark data transformed from the Manhattan BBL mode placement (ami33/49) are presented to justify the feasibility and effectiveness of our algorithms. Experiment results demonstrate that the chip area usage of 94% is achieved through simulated annealing optimization.
Keywords: VLSI circuit physical design, non-Manhattan layout, Y-architecture, O-tree representation, placement, diagonal wiring

SESSION 39: Issues in Timing Analysis

Chair: David Hathaway (IBM Corp.)
Organizers: Kenneth L. Shepard, Sudhakar Bobba

39.1 Worst-Case Circuit Delay Taking into Account Power Supply Variations [p. 652]

D. Kouroussis, R. Ahmadi, F. N. Najm (University of Toronto)

Current Static Timing Analysis (STA) techniques allow one to verify the timing of a circuit at different process corners which only consider cases where all the supplies are low or high. This analysis may not give the true maximum delay of a circuit because it neglects the possible mismatch between drivers and loads. We propose a new approach for timing analysis in which we first identify the critical path(s) of a circuit using a power-supply-aware timing model. Given these critical paths, we then take into account how the power nodes of the gates on the critical path are connected to the power grid, and re-analyze for the worst-case time delay. This reanalysis is posed as an optimization problem where the complete operation of the entire circuit is abstracted in terms of current constraints. We present our technique and report on the implementation results using benchmark circuits tied to a number of test-case power grids.
Keywords: Static timing analysis, power grid, voltage fluctuations

39.2 Statistical Gate Delay Model Considering Multiple Input Switching [p. 658]

A. Agarwal (University of Michigan), F. Dartu (Intel Corporation), D. Blaauw (University of Michigan)

There is an increased dominance of intra-die process variations, creating a need for an accurate and fast statistical timing analysis. Most of the recent proposed approaches assume a Single Input Switching model. Our experiments show that SIS underestimates the mean delay of a stage by up to 20% and overestimates the standard deviation up to 26%. We also show that Multiple Input Switching has a greater impact on statistical timing, than regular static timing analysis. Hence, we propose a modeling technique for gate delay variability, considering MIS. Our model can be efficiently incorporated into most of the statistical timing analysis frameworks. On average over all test cases, our approach underestimates mean delay of a stage by 0.01% and overestimates the standard deviation by only 2%, hence increasing the robustness to process variations. Our modeling technique is independent of the deterministic MIS model, and we show that its sensitivity to variations in the MIS model is small.

39.3 Static Timing Analysis using Backward Signal Propagation [p. 664]

D. Lee (University of Michigan), V. Zolotov (Motorola, Inc.), D. Blaauw (University of Michigan)

In this paper, we address the problem of signal pruning in static timing analysis (STA). Traditionally, signals are propagated through the circuit and are pruned, such that only the signal with the latest arrival time at each node is propagated forward. This signal pruning is a key to the linear run time of STA. However, it was previously observed that a signal with the latest arrival time may not be the most critical signal, as an earlier signal with a larger transition time can result in a longer delay in the down-stream logic. Hence, arrival time based pruning can result in an optimistic delay, incorrect critical paths, and discontinuities of the delay during circuit optimization. Although algorithms were proposed to remedy this issue, they rely on propagation of multiple signals and have an exponential worst-case complexity. In this paper, we propose a new timing analysis algorithm, which uses a two pass traversal of the circuit. In the initial backward traversal, we construct delay tables which record the required time at a node as a function of the transition time at that node. This is followed by a forward traversal where signals are pruned not based on arrival times but based on slack. The proposed algorithm corrects the accuracy problems of the arrival time based pruning while at the same time maintaining the linear run time of STA. We implemented our algorithm and demonstrated its accuracy and efficiency.

SPECIAL SESSION 40: ISSCC Highlights

Chair: Grant E. Martin (Tensilica, Inc.)
Organizer: Grant E. Martin

40.1 Design and Implementation of the POWER5� Microprocessor [p. 670]

J. Clabes, J. Friedrich, M. Sweet, J. DiLullo, S. Chu, D. Plass, J. Dawson, P. Muench, L. Powell, M. Floyd, B. Sinharoy, M. Lee, M. Goulet, J. Wagoner, N. Schwartz, S. Runyon, G. Gorman (IBM Systems Group), P. Restle (IBM Research), R. Kalla, J. McGill, S. Dodson (IBM Systems Group)

POWER5 offers significantly increased performance over previous POWER designs by incorporating simultaneous multi-threading, an enhanced memory subsystem, and extensive RAS and power management support. The 276M transistor processor is implemented in 130nm silicon-on-insulator technology with 8-level of Cu metallization and operates at >1.5 GHz.
Keywords: POWER5, Microprocessor Design, Simultaneous Multi-threading (SMT), Temperature Sensor, Power Reduction, Clock Gating

40.2 A Dual-Core 64b UltraSPARC Microprocessor for Dense Server Applications [p. 673]

T. Takayanagi, J. L. Shin, B. Petrick, J. Su, A. S. Leon (Sun Microsystems, Inc.)

A processor core, previously implemented in a 0.25 μm Al process, is redesigned for a 0.13μm Cu process to create a dual-core processor with 1MB integrated L2 cache, offering an efficient performance/power ratio for compute-dense server applications. Deep submicron circuit design challenges, including negative bias temperature instability (NBTI), leakage and coupling noise, and L2 cache implementation are discussed.
Keywords: Multiprocessor, Dual-core, UltraSPARC, Dense server, Throughput computing, Deep submicron technology, cache, leakage, Negative Bias Temperature Instability, NBTI, Coupling noise, L2

40.3 Low Voltage Swing Logic Circuits for a Pentium� 4 Processor Integer Core [p. 678]

D. J. Deleganes, M. Barany, G. Geannopoulos, K. Kreitzer, A. P. Singh, S. Wijeratne (Intel Corporation)

The Pentium® 4 processor architecture uses a 2x frequency core clock[1] to implement low latency integer ops. Low Voltage Swing logic circuits implemented in 90nm technology[2] meet the frequency demands of a third generation integer-core design.
Keywords: Pentium� 4 processor, integer core, sense-amp, adder, rotator, microprocessor, Low Voltage Swing, LVS

SPECIAL SESSION 41: Multiprocessor SoC MPSoC Solutions/Nightmare

Chair: Grant E. Martin (Tensilica, Incorporation)
Organizer: Grant E. Martin

41.1 The Future Multiprocessor Systems-on-Chips [p. 681]

W. Wolf (Princeton University)

This paper surveys the state-of-the-art and pending challenges in MPSoC design. Standards in communications, multimedia, networking, and other areas encourage the development of high performance platforms that can support a range of implementations of the standard. A multiprocessor system-on-chip includes embedded processors, digital logic, and mixed-signal circuits combined into a heterogeneous multiprocessor. This mix of technologies creates a major challenge for MPSoC design teams. We will look at some existing MPSoC designs and then describe some hardware and software challenges for MPSoC designers.
Keywords: MPSOC, system-on-chip, real-time, low power, embedded software.

41.2 Heterogeneous MP-SoC - The Solution to Energy-Efficient Signal Processing [p. 686]

T. Kogel (CoWare, Inc.), H. Meyr (CoWare, Inc. and Aachen University of Technology)

To meet conflicting flexibility, performance and cost constraints of demanding signal processing applications, future designs in this domain will contain an increasing number of application specific programmable units combined with complex communication and memory infrastructures. Novel architecture trends like Application Specific Instruction-set Processors (ASIPs) as well as customized buses and Network-on-Chip based communication promise enormous potential for optimization. However, state-of-the-art tooling and design practice is not in a shape to take advantage of this advances in computer architecture and silicon technology. Currently, EDA industry develops two diverging strategies to cope with the design complexity of such application specific, heterogeneous MP-SoC platforms. First, the Ip driven approach emphasizes the composition of MP-SoC platforms from configurable off-the-shelf Intellectual Property blocks. On the other hand, the design-driven approach strives to take design efficiency to the required level by use of system level design methodologies and IP generation tools. In this paper, we discuss technical and economical aspects of both strategies. Based on the analysis of recent trends in computer architecture and system level design, we envision a hand-in-hand approach of signal processing platform architectures and design methodology to conquer the complexity crisis in emerging MP-SoC developments.
Keywords: MP-SoC, Design Space Exploration, Signal Processing, Energy Efficiency, Network-on-Chip

41.3 Flexible Architectures for Engineering Successful SOCs [p. 692]

C. Rowen, S. Leibson (Tensilica, Inc.)

This paper focuses on a particular SOC design technology and methodology, here called the advanced or processor-centric SOC design method, which reduces the risk of SOC design and increases ROI by using configurable processors to implement on-chip functions while increasing the SOC's flexibility through software programmability. The essential enabler for this design methodology is automatic processor generation�the rapid and easy creation of new microprocessor architectures, complete with efficient hardware designs and comprehensive software tools. The high speed of the generation process and the great flexibility of the generated architectures underpin a fundamental shift of the role of processors in system architecture.
Keywords: RTL, SOC, MPSOC, processor cores, RISC

SESSION 42: Panel - Is Statistical Timing Statistically Significant? [p. 698]

Chair: Andrew B. Kahng (University of California at San Diego)
Organizers: Rich Goldman, Kurt Keutzer
Panelists: C. Bittlestone (Texas Instruments, Inc.), A. Bootehsaz (Synopsys, Inc.), S. Y. Borkar (Intel Corp.), E. Chen (TSMC), L. Scheffer (Cadence Design Systems, Inc.), C. Visweswariah (IBM Corp.)

Process variations - which affect critical electrical parameters and lead to both random and systematic changes in circuit performance � have always posed significant challenges to semiconductor design. In the past, within-die process variation was relatively small, and methods such as corner-based analysis were sufficient. This allowed timing analysis tools to calculate delays, slew times, coupling and power in a straightforward way. Today, the International Technology Roadmap for Semiconductors suggests that the semiconductor industry's historical ability to control process variations is under siege, for both devices and interconnects. As statistical variation increases, will corner-casing lead to too much conservatism, and hence a requirement for new statistical timing and noise analysis tools? In other words, is the design flow inevitably moving to "delay is no longer a number; it's a distribution"? Or are the urgency and the advantages of statistical timing analysis overstated?

SESSION 43: Timing Issues in Placement

Chair: Bill Halpin (Intel Corporation)
Organizers: Carl Sechen, Phiroze Parakh

43.1 Modeling Repeaters Explicitly Within Analytical Placement [p. 699]

P. Saxena (Intel Labs), B. Halpin (Synplicity, Inc.)

Recent works have shown that scaling causes the number of repeaters to grow rapidly. We demonstrate that this growth leads to massive placement perturbations that break the convergence of today�s interleaved placement and repeater insertion flows. We then present two new force models for repeaters targeted towards analytical placement algorithms. Our experiments demonstrate the effectiveness of our repeater modeling technique in preserving placement convergence (often also accompanied by wirelength improvement) at the 45 and 32 nm technology nodes.
Keywords: Analytical placement, Buffering, Force-directed placement, Interconnect, Placement, Repeater insertion, Scaling.

43.2 Quadratic Placement Using an Improved Timing Model [p. 705]

B. Obermeier, F. M. Johannes (Technical University of Munich)

The performance of timing-driven placement methods depends strongly on the choice of the net model. In this paper a more precise net model is presented that does not increase numerical complexity. We introduce a method that replaces the clique model of a net by a tree model in the quadratic placement formulation. This improvement enables us to control the length of every tree segment separately. Furthermore, we present an analysis of the effects of every tree segment to the net delay. The result is in turn used to control the placement engine. Our presented results are based on legal placements. They show significant improvements over state-of-the art methods.
Keywords: timing driven placement, Quadratic placement, Steiner tree net model, sensitivity, optimization potential

43.3 An Approach to Placement-Coupled Logic Replication [p. 711]

M. Hrkic, J. Lillis, G. Beraudo (University of Illinois at Chicago)

We present a set of techniques for placement-coupled, timing-driven logic replication. Two components are at the core of the approach. First is an algorithm for optimal timing-driven fanin tree embedding; the algorithm is very general in that it can easily incorporate complex objective functions (e.g., placement costs) and can perform embedding on any graph-based target. Second we introduce the Replication Tree which allows us to induce large fanin trees from a given circuit which can then be optimized by the embedder. We have built an optimization engine around these two ideas and report promising results for the FPGA domain including clock period reductions of up to 36% compared with a timing-driven placement from VPR [12] and almost double the average improvement of local replication from [1]. These results are achieved with modest area and runtime overhead.
Keywords: Timing Optimization, Logic Replication, Placement, Programmable Logic

SESSION 44: Design Methodologies for ASIPs

Chair: Masaharu Imai (Osaka University)
Organizers: Joachim Gerlach, Margarida Jacome

44.1 A Novel Approach for Flexible and Consistent ADL-driven ASIP Design [p. 717]

G. Braun, A. Nohl (CoWare, Inc.), W. Sheng, J. Ceng, M. Hohenauer, H. Scharwächter, R. Leupers, H. Meyr (Institute for Integrated Systems)

Architecture description languages (ADL) have been established to aid the design of application-specific instruction-set processors (ASIP). Their main contribution is the automatic generation of a software toolkit, including C compiler, assembler, linker, and instruction-set simulator. Hence, the challenge in the design of such ADLs is to unambiguously capture the architectural information required for the toolkit generation in a single model. This is particularly difficult for C compiler and simulator, as both require information about the instructions� semantics, however, while the C compiler needs to know what an instructions does, the simulator needs to know how. Existing ADLs solve this problem by either introducing redundancy or by limiting the language�s flexibility. This paper presents a novel, mixed-level approach for ADL-based instruction-set description, which offers maximum flexibility while preventing from inconsistencies. Moreover, it enables capturing instruction- and cycle-accurate descriptions in a single model. The feasibility and design efficiency of our approach is demonstrated with a number of contemporary, real-world processor architectures.
Keywords: ADL, ASIP, Embedded Processors

44.2 Characterizing Embedded Applications for Instruction-Set Extensible Processors [p. 723]

P. Yu, T. Mitra (National University of Singapore)

Extensible processors, which allow customization for an application domain by extending the core instruction set architecture, are becoming increasingly popular for embedded systems. However, existing techniques restrict the set of possible candidates for custom instructions by imposing a variety of constraints. As a result, the true extent of performance improvement achievable by extensible processors for embedded applications remains unknown. Moreover, it is unclear how the interplay among these restrictions impacts the performance potential. Our careful examination of this issue shows that significant speedup can only be obtained by relaxing some of the constraints to a reasonable extent. In particular, to the best of our knowledge, ours is the first work that studies the impact of relaxing control flow constraint by identifying instructions across basic blocks and indicates 5-148% relative speedup for different applications.
Keywords: Customization processors, Instruction-set extensions.

44.3 Introduction of Local Memory Elements in Instruction Set Extensions [p. 729] P. Biswas (University of California at Irvine),

V. Choudhary, K. Atasu, L. Pozzi, P. Ienne (Swiss Federal Institute of Technology), N. Dutt (University of California at Irvine)

Automatic generation of Instruction Set Extensions (ISEs), to be executed on a custom processing unit or a coprocessor is an important step towards processor customization. A typical goal of a manual designer is to combine a large number of atomic instructions into an ISE satisfying microarchitectural constraints. However, memory operations pose a challenge for previous ISE approaches by limiting the size of the resulting instruction. In this paper, we introduce memory elements into custom units which result in ISEs closer to those sought after by the designers. We consider two kinds of memory elements for mapping to the specialized hardware: small hardware tables and architecturally-visible state registers. We devised a genetic algorithm to specifically exploit opportunities of introducing memory elements during ISE generation. Finally, we demonstrate the effectiveness of our approach by a detailed study of the variation in performance, area and energy in the presence of the generated ISEs, on a number of MediaBench, EEMBC and cryptographic applications. With the introduction of memory, the average speedup varied from 2.7X to 5X depending on the architectural configuration with a nominal area overhead. Moreover, we obtained an average energy reduction of 26% with respect to a 32-KB cache.
Keywords: Customizable processors, ASIPs, Instruction Set Extensions, Ad-hoc Functional Units, Coprocessors, Genetic Algorithm.

SESSION 45: FPGA-Based Systems

Chair: Katherine Compton (University of Wisconsin)
Organizers: Jens Palsberg, Scott Hauck

45.1 FPGA Power Reduction Using Configurable Dual-Vdd [p. 735]

F. Li, Y. Lin, L. He (University of California at Los Angeles)

Power optimization is of growing importance for FPGAs in nanometer technologies. Considering dual-Vdd technique, we show that configurable power supply is required to obtain a satisfactory performance and power tradeoff. We design FPGA circuits and logic fabrics using configurable dual-Vdd and develop the corresponding CAD flow to leverage such circuits and logic fabrics. We then carry out a highly quantitative study using area, delay and power models obtained from detailed circuit design and SPICE simulation in 100nm technology. Compared to single-Vdd FPGAs with optimized Vdd level for the same target clock frequency, configurable dual-Vdd FPGAs with full and partial supply programmability for logic blocks reduce logic power by 35.46% and 28.62% respectively and reduce total FPGA power by 14.29% and 9.04% respectively. To the best of our knowledge, it is the first in-depth study on FPGAs with configurable dual-Vdd for power reduction.
Keywords: FPGA, low power, power efficient, dual-Vdd, configurable

45.2 Multi-Resource Aware Partitioning Algorithms for FPGAs with Heterogeneous Resources [p. 741]

N. Selvakkumaran (University of Minnesota), A. Ranjan, S. Raje (HierDesign Inc.), G. Karypis (University of Minnesota)

As FPGA densities increase, partitioning-based FPGA placement approaches are becoming increasingly important as they can be used to provide high-quality and computationally scalable placement solutions. However, modern FPGA architectures incorporate heterogeneous resources, which place additional requirements on the partitioning algorithms because they now need to not only minimize the cut and balance the partitions, but also they must ensure that none of the resources in each partition is oversubscribed. In this paper, we present a number of multilevel multi-resource hypergraph partitioning algorithms that are guaranteed to produce solutions that balance the utilization of the different resources across the partitions. We evaluate our algorithms on twelve industrial benchmarks ranging in size from 5,236 to 140,118 vertices and show that they achieve minimal degradation in the min-cut while balancing the various resources. Comparing the quality of the solution produced by some of our algorithms against that produced by hMETIS, we show that our algorithms are capable of balancing the different resources while incurring only a 3.3%�5.7% higher cut.
Keywords: Partitioning, Placement, multi-constraint, multi-resource, FPGA, hierarchical

45.3 An SoC Design Methodology Using FPGAs and Embedded Microprocessors [p. 747]

N. Ohba, K. Takano (IBM Japan Ltd.)

In System on Chip (SoC) design, growing design complexity has forced designers to start designs at higher abstraction levels. This paper proposes an SoC design methodology that makes full use of FPGA capabilities. Design modules in different abstraction levels are all combined and run together in an FPGA prototyping system that fully emulates the target SoC. The higher abstraction level design modules run on microprocessors embedded in the FPGAs, while lower-level synthesizable RTL design modules are directly mapped onto FPGA reconfigurable cells. We made a hardware wrapper that gets the embedded microprocessors to interface with the fully synthesized modules through IBM CoreConnect buses. Using this methodology, we developed an image processor SoC with cryptographic functions, and we verified the design by running real firmware and application programs. For the designs that are too large to be fit into an FPGA, dynamic reconfiguration method is used.
Keywords: SoC, ASIC, FPGA prototyping, and mixed-level verification.

SPEICAL SESSION 46: Security as a New Dimension in Embedded System Design [p. 753]

Chair: Srivaths Ravi (NEC Corporation)
Organizer: Srivaths Ravi P. Kocher (Cryptography Research), R. Lee (Princeton University), G. McGraw (Cigital), A. Raghunathan, S. Ravi (NEC Laboratories America)

: The growing number of instances of breaches in information security in the last few years has created a compelling case for efforts towards secure electronic systems. Embedded systems, which will be ubiquitously used to capture, store, manipulate, and access data of a sensitive nature, pose several unique and interesting security challenges. Security has been the subject of intensive research in the areas of cryptography, computing, and networking. However, security is often mis-construed by embedded system designers as the addition of features, such as specific cryptographic algorithms and security protocols, to the system. In reality, it is an entirely new metric that designers should consider throughout the design process, along with other metrics such as cost, performance, and power. This paper is intended to introduce embedded system designers and design tool developers to the challenges involved in designing secure embedded systems. We attempt to provide a unified view of embedded system security by first analyzing the typical functional security requirements for embedded systems from an end-user perspective. We then identify the implied challenges for embedded system architects, as well as hardware and software designers (e.g., tamper-resistant embedded system design, processing requirements for security, impact of security on battery life for battery-powered systems, etc.). We also survey solution techniques to address these challenges, drawing from both current practice and emerging research, and identify open research problems that will require innovations in embedded system architecture and design methodologies.
Keywords: Embedded Systems, Security, Cryptography, Security Protocols, Security Processing, Design, Design Methodologies, Architectures, Tamper Resistance, Software Attacks, Viruses, Trusted Computing, Digital Rights Management, Performance, Battery Life

SESSION 47: Leakage Power Optimization

Chair: Farid N. Najm (University of Toronto)
Organizers: Enrico Macii, Naehyuck Chang

47.1 Tradeoffs between Gate Oxide Leakage and Delay for Dual Tox Circuits [p. 761]

A. K. Sultania (University of Minnesota), D. Sylvester (University of Michigan), S. S. Sapatnekar (University of Minnesota)

Gate oxide tunneling current (Igate) will become the dominant component of leakage in CMOS circuits as the physical oxide thickness (Tox) goes below 15Å. Increasing the value of Tox reduces the leakage at the expense of an increase in delay, and a practical tradeoff between delay and leakage can be achieved by assigning one of the two permissible Tox values to each transistor. In this paper, we propose an algorithm for dual Tox assignment to optimize the total leakage power under delay constraints, and generate a leakage/delay tradeoff curve. As compared to the case where all transistors are set to low Tox, our approach achieves an average leakage reduction of 83% under 100nm models.
Keywords: Leakage power, Dual Tox Circuits

47.2 Implicit Pseudo Boolean Enumeration Algorithms for Input Vector Control [p. 767]

K. Chopra, S. B. K. Vrudhula (University of Arizona)

In a CMOS combinational logic circuit, the subthreshold leakage current in the standby state depends on the state of the inputs. In this paper we present a new approach to identify the minimum leakage set of input vectors (MLS). Applying a vector in the MLS is known as Input Vector Control (IVC), and has proven to be very useful in reducing gate oxide leakage and subthreshold leakage in standby mode of operation. The approach presented here is based on Implicit Enumeration of integer-valued decision diagrams. Since the search space for minimum leakage vector increases exponentially with the number of primary inputs, the enumeration is done with respect to the minimum balanced cut of the digraph representation of the circuit. To reduce the switching power dissipated when the inputs are driven to a given state (during entry into and exit from the standby state), we extend the MLS algorithm to compute a bounded leakage set (BLS). Given a bound of standby leakage, we present an algorithm for computing minimal switching cost partial input vectors such that the leakage of the circuit is always less than the upper bound.
Keywords: CMOS, Leakage, Power, Binary Decision Diagrams, Symbolic Methods, SAT

47.3 Statistical Optimization of Leakage Power Considering Process Variations using Dual-Vth and Sizing [p. 773]

A. Srivastava, D. Sylvester, D. Blaauw (University of Michigan)

Increasing levels of process variability in sub-100nm CMOS design has become a critical concern for performance and power constraint designs. In this paper, we propose a new statistically aware Dual-Vt and sizing optimization that considers both the variability in performance and leakage of a design. While extensive work has been performed in the past on statistical analysis methods, circuit optimization is still largely performed using deterministic methods. We show in this paper that deterministic optimization quickly looses effectiveness for stringent performance and leakage constraints in designs with significant variability. We then propose a statistically aware dual-Vt and sizing algorithm where both delay constraints and sensitivity computations are performed in a statistical manner. We demonstrate that using this statistically aware optimization, leakage power can be reduced by 15-35% compared to traditional deterministic analysis. The improvements increase for strict delay constraints making statistical optimization especially important for high performance designs.
Keywords: Leakage, variability, optimization

47.4 Leakage- and Crosstalk-Aware Bus Encoding for Total Power Reduction [p. 779]

H. S. Deogun, R. R. Rao, D. Sylvester, D. Blaauw (University of Michigan)

Power consumption, particularly runtime leakage, in long on-chip buses has grown to an unacceptable portion of the total power budget due to heavy buffer insertion to combat RC delays. In this paper, we propose a new bus encoding algorithm and circuit scheme for on-chip buses that eliminates capacitive crosstalk while simultaneously reducing total power. We introduce a new buffer design approach with selective use of high threshold voltage transistors and couple this buffer design with a novel bus encoding scheme. The proposed encoding scheme significantly reduces total power by 26% and runtime leakage power by 42% while also eliminating capacitive crosstalk. In addition, the proposed encoding is specifically optimized to reduce the complexity of the encoding logic, allowing for a significant reduction in overhead which has not been considered in previous bus encoding work.
Keywords: Leakage reduction, encoding, low power

47.5 Power Minimization using Simultaneous Gate Sizing, Dual-Vdd and Dual-Vth Assignment [p. 783]

A. Srivastava, D. Sylvester, D. Blaauw (University of Michigan),

We develop an approach to minimize total power in a dual-Vdd and dual-Vth design. The algorithm runs in two distinct phases. The first phase relies on upsizing to create slack and maximize low Vdd assignments in a backward topological manner. The second phase proceeds in a forward topological fashion and both sizes and re-assigns gates to high Vdd to enable significant static power savings through high Vth assignment. The proposed algorithm is implemented and tested on a set of combinational benchmark circuits. A comparison with traditional CVS and dual-Vth/sizing algorithms demonstrate the advantage of the algorithm over a range of activity factors, including an average power reduction of 30% (50%) at high (nominal) primary input activities.
Keywords: Power dissipation, optimization, multiple voltages

SESSION 48: Interconnect Extraction

Chair: Yehea Massoud (Rice University)
Organizers: Sachin S. Sapatnekar, Jamil Kawa

48.1 Sparse Transformations and Preconditioners for Hierarchical 3-D Capacitance Extraction with Multiple Dielectrics [p. 788]

S. Yan, V. Sarin, W. Shi (Texas A&M University)

Capacitance extraction is an important problem that has been extensively studied. This paper presents a significant improvement for the fast multipole accelerated boundary element method. We first introduce an algebraic transformation to convert the n � n dense capacitance coefficient matrix into a sparse matrix with O(n) nonzero entries. We then use incomplete Cholesky factorization or incomplete LU factorization to produce an effective preconditioner for the sparse linear system. Simulation results show that our algorithm drastically reduces the number of iterations needed to solve the linear system associated with the boundary element method. For the k�k bus crossing benchmark, our algorithm uses 3-4 iterations, compared to 10-20 iterations used by the previous algorithms such as FastCap [1] and HiCap [2]. As a result, our algorithm is 2-20 times faster than those algorithms. Our algorithm is also superior to the multi-scale method [3] because our preconditioner reduces the number of iterations further and applies to multiple dielectrics.
Keywords: Capacitance, extraction, interconnect, iterative methods, preconditioning

48.2 A Fast Parasitic Extractor Based on Low-Rank Multilevel Matrix Compression for Conductor and Dielectric Modeling in Microelectronics and MEMS [p. 794]

D. Gope, S. Chakraborty, V. Jandhyala (University of Washington)

Parasitic parameter extraction is a crucial issue in Integrated Circuit design. Integral equation based solvers, which guarantee high accuracy, suffer from a time and memory bottleneck arising from the dense matrices generated. In this paper we present a hybrid FMM-QR algorithm that combines the best features of the Fast Multipole Method and the QR based matrix compression method to achieve faster setup and solve time and lower memory requirements. The method is applied to extract parasitic capacitances from the layout of arbitrarily shaped conductors and dielectrics. Examples demonstrating the accuracy and the superior time and memory performances as compared to existing solvers are also presented.
Keywords: Parasitics, Multilevel, Low-rank, conductors and dielectrics

48.3 CHIME: Coupled Hierarchical Inductance Model Evaluation [p. 800]

S. Gupta, L. T. Pileggi (Carnegie Mellon University)

Modeling inductive effects accurately and efficiently is a critical necessity for design verification of high performance integrated systems. While several techniques have been suggested to address this problem, they are mostly based on sparsification schemes for the L or L-inverse matrix. In this paper, we introduce CHIME, a methodology for non-local inductance modeling and simulation. CHIME is based on a hierarchical model of inductance that accounts for all inductive couplings at a linear cost, without requiring any window size assumptions for sparsification. The efficacy of our approach stems from representing the mutual inductive couplings at various levels of hierarchy, rather than discarding some of them. A prototype implementation demonstrates orders of magnitude speedup over a full, flat model and significant accuracy improvements over a truncated model. Importantly, this hierarchical circuit simulation capability produces a solution that is as accurate as the hierarchically extracted circuits, thereby providing a "golden standard" against which simpler truncation based models can be validated.
KEYWORDS: Inductance Modeling, Circuit Simulation, Extraction

48.4 Large-Scale Full-Wave Simulation [p. 806]

S. Kapur, D. E. Long (Integrand Software, Inc.)

We describe a new extraction tool, EMX (Electro-Magnetic eXtractor), for the analysis of RF, analog and high-speed digital circuits. EMX is a fast full-wave field solver. It incorporates two new techniques which make it significantly faster and more memory-efficient than previous solvers. First, it takes advantage of layout regularity in typical designs. Second, EMX uses a new method for computing the vector-potential component in the mixed potential integral equation. These techniques give a speed-up of more than a factor of ten, together with a corresponding reduction in memory.
Keywords: Layout extraction, multipole, integral equations

48.5 Closed-Form Expressions of Distributed RLC Interconnects for Analysis of On-Chip Inductance Effects [p. 810]

Y. Tanji (Kagawa University), H. Asai (Shizuoka University)

The closed-form expressions of distributed RLC interconnects are proposed for analysis of on-chip inductance effects in order to insert optimally the repeaters. The transfer function of a circuit with driver-interconnect-load structure is approximated by the 5th order rational functions. The step responses computed by using the proposed expressions give the good agreement with the SPICE simulations.
Keywords: Inductance Effects, RLC Distributed Interconnects

SESSION 49: New Frontiers in Logic Synthesis

Chair: Jordi Cortadella (University of Polytechnica De Catalunya)
Organizers: Leon Stok, Steven M. Nowick

49.1 Re-synthesis for Delay Variation Tolerance [p. 814]

S.-C. Chang, C.-T. Hsieh, K.-C. Wu (National Tsing Hua University)

Several factors such as process variation, noises, and delay defects can degrade the reliabilities of a circuit. Traditional methods add a pessimistic timing margin to resolve delay variation problems. In this paper, instead of sacrificing the performance, we propose a re-synthesis technique which adds redundant logics to protect the performance. Because nodes in the critical paths have zero slacks and are vulnerable to delay variation, we formulate the problem of tolerating delay variation to be the problem of increasing the slacks of nodes. Our re-synthesis technique can increase the slacks of all nodes or wires to be larger than a pre-determined value. Our experimental results show that additional area penalty is around 21% for 10% of delay variation tolerance.
Keywords: Delay variation, tolerance

49.2 Post-Layout Logic Optimization of Domino Circuits [p. 820]

A. Cao, C.-K. Koh (Purdue University)

Logic duplication, a commonly used synthesis technique to remove trapped inverters in reconvergent paths of Domino circuits, incurs high area and power penalties. In this paper, we propose a synthesis scheme to reduce the duplication cost by allowing inverters in Domino logic under certain timing constraints. In order to guarantee the robustness of such Domino circuits, we perform the reduction of logic duplication at the physical level. Experimental results show significant reduction in duplication cost, which translates into significant improvements in area, power, and/or delay.
Keywords: Domino logic, synthesis, optimization, layout

49.3 Multiple Constant Multiplication By Time-Multiplexed Mapping of Addition Chains [p. 826]

P. Tummeltshammer (University of Technology, Vienna), J. C. Hoe, M. Püschel (Carnegie Mellon University)

An important primitive in the hardware implementations of linear DSP transforms is a circuit that can multiply an input value by one of several different preset constants. We propose a novel implementation of this circuit based on combining the addition chains of the constituent constants. We present an algorithm to automatically generate such a circuit for a given set of constants. The quality of the resulting circuits is evaluated after synthesis for a commercial 0.18 �m standard cell library. We compare the area and latency efficiency of this addition chain based approach against a straightforward approach based on a constant table and a full multiplier.
Keywords: Addition chains, multiplierless, directed acyclic graph, fusion

49.4 Decomposing Specifications with Concurrent Outputs to Resolve State Coding Conflicts in Asynchronous Logic Synthesis [p. 830]

H. K. Kapoor, M. B. Josephs (London South Bank University)

Synthesis of asynchronous logic using the tool Petrify requires a state graph with a complete state coding. It is common for specifications to exhibit concurrent outputs, but Petrify is sometimes unable to resolve the state coding conflicts that arise as a result, and hence cannot synthesise a circuit. A pair of decomposition heuristics (expressed in the language of Delay- Insensitive Sequential Processes) are given that helps one to obtain a synthesisable specification. The second heuristic has been successfully applied to a set of nine benchmarks to obtain significant reductions both in area and in synthesis time, compared with synthesis performed on the original specifications.
Keywords: Asynchronous logic synthesis, delay-insensitive decomposition

49.5 A New Heuristic Algorithm for Reversible Logic Synthesis [p. 834]

P. Kerntopf (Warsaw University of Technology)

Reversible logic has applications in many fields, including quantum computing. Synthesis techniques for reversible circuits are not well developed, even for functions with a small number of inputs and outputs. This paper proposes an approach to reversible logic synthesis using a new complexity measure based on shared binary decision diagrams with complemented edges (instead of truth tables, as in the previous algorithms). The approach can be used with arbitrary libraries of reversible logic gates and arbitrary cost functions. Experiments show promising results in comparison with the known approaches.
Keywords: Reversible Logic Circuits, Synthesis.

49.6 Quantum Logic Synthesis by Symbolic Reachability Analysis [p. 838]

W. N. N. Hung (Intel Corporation), X. Song, G. Yang (Portland State University), J. Yang (Intel Corporation), M. Perkowski (Portland State University)

Reversible quantum logic plays an important role in quantum computing. In this paper, we propose an approach to optimally synthesize quantum circuits by symbolic reachability analysis where the primary inputs are purely binary. We present an exact synthesis method with optimal quantum cost and a speedup method with non-optimal quantum cost. Both our methods guarantee the synthesizeability of all reversible circuits. Unlike previous works which use permutative reversible gates, we use a lower level library which includes non-permutative quantum gates. Our approach obtains the minimum cost quantum circuits for Miller's gate, half-adder, and full-adder, which are better than previous results. In addition, we prove the minimum quantum cost (using our elementary quantum gates) for Fredkin, Peres, and Toffoli gates. Our work constitutes the first successful experience of applying satisfiability with formal methods to quantum logic synthesis.
Keywords: Reversible Logic, Quantum Computing, Formal Verification, Model Checking, Satisfiability.

SESSION 50: Numerical Techniques for Simulation

Chair: Joel R. Philips (Cadence Design Systems, Incorporated)
Organizers: Jacob, K. White, Kartikeya Mayaram

50.1 A Frequency Relaxation Approach for Analog/RF System-Level Simulation [p. 842]

X. Li, Y. Xu, P. Li, P. Gopalakrishnan, L. T. Pileggi (Carnegie Mellon University)

The increasing complexity of today's mixed-signal integrated circuits necessitates both top-down and bottom-up system-level verification. Time-domain state-space modeling and simulation approaches have been successfully applied for such purposes (e.g. Simulink); however, analog circuits are often best analyzed in the frequency domain. Circuit-level analyses, such as harmonic balance, have been successfully extended to the frequency domain [2], but these algorithms are impractical for simulating large systems with wide-band input and noise signals. In this paper we proposed a frequency-domain approach for analog/RF system-level simulation that is capable of capturing various second order effects (e.g. nonlinearity, noise, etc.) for both time-invariant and time-varying systems with wide-band inputs. The simulator directly evaluates the frequency domain response at each node via a relaxation scheme that is proven to be convergent under typical circuit conditions. Our experimental results demonstrate the accuracy and efficiency of the proposed simulator under various wide-band input and noise excitations.
Keywords: System-Level Simulation, Analog/RF Circuits

50.2 Robust, Stable Time-Domain Methods for Solving MPDEs of Fast/Slow Systems [p. 848]

T. Mei, J. Roychowdhury (University of Minnesota), T. S. Coffey, S. A. Hutchinson, D. M. Day (Sandia National Laboratories)

In this paper, we explore in detail the stability properties of timedomain numerical methods for multi-time partial differential equations (MPDEs). We demonstrate that simple techniques for numerical discretization can lead easily to instability. By investigating the underlying eigenstructure of several discretization techniques along different artificial time scales, we show that not all combinations of techniques are stable. We identify choices of discretization method and of step size along slow time scales that lead to robust, stable time-domain integration methods for the MPDE. One of our results is that applying overstable methods along one timescale can compensate for unstable discretization along others. Our novel integration schemes bring robustness to time-domain MPDE solution methods, as we demonstrate with examples.
Keywords: MPDE, stability, eigenstructure, time-domain discretization, envelope

50.3 High-Level Simulation of Substrate Noise in High-Ohmic Substrates with Interconnect and Supply Effects [p. 854]

G. Van der Plas (IMEC), M. Badaroglu (IMEC, K. U. Leuven), G. Vandersteen (IMEC, Vrije Universiteit Brussel), P. Dobrovolny (IMEC), P. Wambacq (IMEC, Vrije Universiteit Brussel), S. Donnay (IMEC), G. Gielen (ESAT, K.U. Leuven), H. De Man (IMEC, ESAT, K.U. Leuven)

Substrate noise is a major obstacle for mixed-signal integration. In this paper we propose a fast and accurate high-level methodology to simulate substrate noise generated by large digital circuits. The methodology can handle any substrate type, e.g. bulk-type or EPI-type, and takes into account the effects of interconnect and supply. For each standard cell a substrate macromodel is used in order to efficiently simulate the total system, which consists of a network of such macromodels. For a 40K gates telecom circuit fabricated in a 0.18 μm CMOS process, measurements indicate that substrate noise is simulated by using our methodology within 20% error but several orders of magnitude faster in CPU time than a full SPICE simulation.
Keywords: model order reduction, modeling, substrate noise

50.4 Hierarchical Approach to Exact Symbolic Analysis of Large Analog Circuits [p. 860]

S. X.-D. Tan, W. Guo, Z. Qi (University of California at Riverside)

This paper provides a novel approach to exact symbolic analysis of very large analog circuits. The new method is based on determinant decision diagrams (DDDs) to represent symbolic product terms. But instead of constructing DDD graphs directly from a flat circuit matrix, the new method constructs DDD graphs in a hierarchical way based on hierarchically defined circuit structures. The resulting algorithm can analyze much larger analog circuits exactly than before. Theoretically, we show that exact symbolic expressions of a circuit are cancellation-free expressions when the circuit is analyzed hierarchically. Practically we propose a novel hierarchical DDD graph construction algorithm. Our experimental results show that very large analog circuits, which can't be analyzed exactly before like μA725 and other unstructured circuits up to 100 nodes, can be analyzed by the new approach for the first time. The new approach significantly improves the exact symbolic capacity and promises huge potentials for the new applications of symbolic analysis in analog circuit design automation.
Keywords Behavioral Modeling, Symbolic Analysis, Circuit Simulation

50.5 An Essentially Non-Oscillatory (ENO) High-Order Accurate Adaptive Table Model for Device Modeling [p. 864]

B. Yang, B. McGaughy (Cadence Design Systems, Inc.)

Modern analytical device models become more and more complicated and expensive to evaluate in circuit simulation. Interpolation based table look-up device models become increasingly important for fast circuit simulation. Traditional table model trades accuracy for speed and is only used in fast-Spice simulators but not good enough for prime-time Spice simulators such as SPECTRE. We propose a novel table model technology that uses high-order essentially nonoscillatory (ENO) polynomial interpolation in multi-dimensions to guarantee smoothness in multi-dimensions and high accuracy in approximating i - v/q - v curves. An efficient transfinite blending technique for the reconstruction of multi-dimensional tables is used. Interpolation stencil is adaptively determined by automatic accuracy control. The method has been proved to be superior to traditional ones and successfully applied in Spectre and Ultrasim for simulating digital, analog, RF, and mixed-signal circuits.

SESSION 51: Energy and Thermal-Aware Design

Chair: Jihong Kim (Seoul National University)
Organizers: Chaitali Chakrabarti, Sarma B. Vrudhula

51.1 Theoretical and Practical Limits of Dynamic Voltage Scaling [p. 868]

B. Zhai, D. Blaauw, D. Sylvester (University of Michigan), K. Flautner (ARM Ltd.)

Dynamic voltage scaling (DVS) is a popular approach for energy reduction of integrated circuits. Current processors that use DVS typically have an operating voltage range from full to half of the maximum Vdd. However, it is possible to construct designs that operate over a much larger voltage range: from full Vdd to subthreshold voltages. This possibility raises the question of whether a larger voltage range improves the energy efficiency of DVS. First, from a theoretical point of view, we show that for subthreshold supply voltages leakage energy becomes dominant, making "just in time completion" energy inefficient. We derive an analytical model for the minimum energy optimal voltage and study its trends with technology scaling. Second, we use the proposed model to study the workload activity of an actual processor and analyze the energy efficiency as a function of the lower limit of voltage scaling. Based on this study, we show that extending the voltage range below 1/2 Vdd will improve the energy efficiency for most processor designs, while extending this range to subthreshold operation is beneficial only for very specific applications. Finally, we show that operation deep in the subthreshold voltage range is never energy-efficient.
Keywords dynamic voltage scaling, minimum energy point

51.2 Enabling Energy Efficiency in Via-Patterned Gate Array Devices [p. 874]

R. R. Taylor (Carnegie Mellon University), H. Schmit (Tabula, Inc.)

In an attempt to enable the cost-effective production of low-and mid-volume application-specific chips, researchers have proposed a number of so-called structured ASIC architectures. These architectures represent a departure from traditional standard-cell-based ASIC designs in favor of techniques which present more physical and structural regularity. If structured ASICs are to become a viable alternative to standard cells, they must deliver performance and energy efficiency which is competitive with standard-cell-based design techniques. This paper focuses on one family of structured ASICs known as via-patterned gate arrays, or VPGAs. In this paper, we present circuit structures and power optimization algorithms which can be applied to VPGA chips in an effort to reduce their operational power dissipation.
Keywords: Structured ASIC, VPGA, Low-Power, Voltage Scaling, Power Optimization

51.3 Compact Thermal Modeling for Temperature-Aware Design [p. 878]

W. Huang, M. R. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, S. Velusamy (University of Virginia)

Thermal design in sub-100nm technologies is one of the major challenges to the CAD community. In this paper, we first introduce the idea of temperature-aware design. We then propose a compact thermal model which can be integrated with modern CAD tools to achieve a temperature-aware design methodology. Finally, we use the compact thermal model in a case study of microprocessor design to show the importance of using temperature as a guideline for the design. Results from our thermal model show that a temperature-aware design approach can provide more accurate estimations, and therefore better decisions and faster design convergence.
Keywords: temperature-aware design, temperature-aware computing, thermal model, power-aware design, leakage, reliability.

51.4 Simultaneous Optimization of Supply and Threshold Voltages for Low-Power and High-Performance Circuits in the Leakage Dominant Era [p. 884]

A. Basu, S.-C. Lin, V. Wason (University of California at Santa Barbara), A. Mehrotra (Berkeley Design Automation, Inc.), K. Banerjee (University of California at Santa Barbara)

Electrothermal couplings between supply voltage, operating frequency, power dissipation and die temperature have been shown to significantly impact the energy-delay-product (EDP) based simultaneous optimization of supply (Vdd) and threshold (Vth) voltages. We present for the first time, the implications of an electrothermally aware EDP optimization on circuit operation in leakage dominant nanometer scale CMOS technologies. It is demonstrated that electrothermal EDP (EEDP) optimization restricts the operation of the circuit to a certain region in the Vdd-Vth plane. Also, the significance of EEDP optimization has been shown to increase with increase in leakage power and/or process variations.
Keywords: Electrothermal couplings, energy delay product, subthreshold leakage, temperature aware design

SESSION 52: Noise-Tolerant Design and Analysis Techniques

Chair: David Blaauw (University of Michigan)
Organizers: Anirudh Devgan, Dennis Sylvester

52.1 Noise Characterization of Static CMOS Gates [p. 888]

R. Kanj (University of Illinois at Urbana-Champaign), T. Lehner, B. Agrawal (IBM Corporation), E. Rosenbaum (University of Illinois at Urbana-Champaign)

We present new macromodeling techniques for capturing the response of a CMOS logic gate to noise pulses at the input. Two approaches are presented. The first one is a robust mathematical model which enables the hierarchical generation of noise abstracts for circuits composed of the precharacterized cells. The second is a circuit equivalent model which generates accurate noise waveforms for arbitrarily shaped and timed multiple-input glitches, arbitrary loads, and external noise coupling.
Keywords: Cell model, noise analysis, sensitivity, circuit-equivalent model, mathematical model, simulation

52.2 A Scalable Soft Spot Analysis Methodology for Compound Noise Effects in Nano-meter Circuits [p. 894]

C. Zhao, X. Bai, S. Dey (University of California at San Diego)

Circuits using nano-meter technologies are becoming increasingly vulnerable to signal interference from multiple noise sources as well as radiation-induced soft errors. One way to ensure reliable functioning of chips is to be able to analyze and identify the spots in the circuit which are susceptible to such effects (called �soft spots� in this paper), and to make sure such soft spots are �hardened� so as to resist multiple noise effects and soft errors. In this paper, we present a scalable soft spot analysis methodology to study the vulnerability of digital ICs exposed to nano-meter noise and transient soft errors. First, we define "softness" as an important characteristic to gauge system vulnerability. Then several key factors affecting softness are examined. Finally an efficient Automatic Soft Spot Analyzer (ASSA) is developed to obtain the softness distribution which reflects the unbalanced noise-tolerant capability of different regions in a design. The proposed methodology provides guidelines to reduction of severe nano-meter noise effects caused by aggressive design in the premanufacturing phase, and guidelines to selective insertion of online protection schemes to achieve higher robustness. The quality of the proposed soft-spot analysis technique is validated by HSPICE simulation, and its scalability is demonstrated on a commercial embedded processor.
Keywords: Nano-meter technology, compound noise effect, robustness, softness distribution.

52.3 A Novel Technique to Improve Noise Immunity of CMOS Dynamic Logic Circuits [p. 900]

L. Ding, P. Mazumder (University of Michigan)

Dynamic CMOS logic circuits are widely employed in high performance VLSI chips in pursuing very high system performance. However, dynamic circuits are inherently less resistant to noises than static CMOS gates. With the increasing stringent noise requirement due to aggressive technology scaling, the noise tolerance of dynamic circuits has to be first improved for the overall reliable operation of VLSI systems. In this paper, we present a novel noise-tolerant design technique using circuitry exhibiting a negative differential resistance effect. We have demonstrated that using the proposed method the noise tolerance of dynamic logic gates can be improved beyond the level of static CMOS logic gates while the performance advantage of dynamic circuits is still retained.
Keywords: Dynamic circuits, Domino logic style, Noise-tolerant design, Negative differential resistance, Digital integrated circuits

52.4 Statistical Timing Analysis in Sequential Circuit for On-Chip Global Interconnect Pipelining [p. 904]

L. Zhang, Y. Hu (University of Wisconsin), C. Chen (National Taiwan University)

With deep-sub-micron (DSM) technology, statistical timing analysis becomes increasingly crucial to characterize signal transmission over global interconnect wires. In this paper, a novel statistical timing analysis approach has been developed to analyze the behavior of two important pipelined architectures for multiple clock-cycle global interconnect, namely, the flip-flop inserted global wire and the latch inserted global wire. We present analytical formula that is based on parameters obtained using Monte Carlo simulation. These results enable a global interconnect designer to explore design trade-o.s between clock frequency and probability of bit-error during data transmission.
Keywords: Statistical Timing Analysis, Interconnect Pipelining

SESSION 53: New Tools and Methods for Future Embedded SoC

Chair: Giovanni De Micheli (Stanford University)
Organizers: Marcello Coppola, Pai H. Chou

53.1 Debugging HW/SW Interface for MPSoC: Video Encoder System Design Case Study [p. 908]

M.-W. Youssef, S. Yoo, A. Sasongko, Y. Paviot, A. A. Jerraya (TIMA Laboratory)

This paper reports a case study of multiprocessor SoC (MPSoC) design of a complex video encoder, namely OpenDivX. OpenDivX is a popular version of MPEG4. It requires massive computation resources and deals with complex data structures to represent video streams. In this study, the initial specification is given in sequential C code that had to be parallelized to be executed on four different processors. High level programming model, namely Message Passing Interface (MPI) was used to enable inter-task communication among parallelized C code. A four processor hardware prototyping platform was used to debug the parallelized software before final SoC hardware is ready. The targeting of abstract parallel code using MPI to the multiprocessor architecture required the design of an additional hardware-dependent software layer to refine the abstract programming model. The design was made by a team work of three types of designer: application software, hardware-dependent software and hardware platform designers. The collaboration was necessary to master the whole flow from the specification to the platform. The study showed that HW/SW interface debug was the most time-consuming step. This is identified as a potential killer for application-specific MPSoC design. To further investigate the ways to accelerate the HW/SW interface debug, we analyzed bugs found in the case study and the available debug environments. Finally, we address a debug strategy that exploits efficiently existing debug environments to reduce the time for HW/SW interface debug.
Keywords: Multiprocessor system-on-chip, Hardware-software interface, Hardware-dependant software, Debug

53.2 SUNMAP: A Tool for Automatic Topology Selection and Generation for NoCs [p. 914]

S. Murali, G. De Micheli (Stanford University)

Increasing communication demands of processor and memory cores in Systems on Chips (SoCs) necessitate the use of Networks on Chip (NoC) to interconnect the cores. An important phase in the design of NoCs is the mapping of cores onto the most suitable topology for a given application. In this paper, we present SUNMAP a tool for automatically selecting the best topology for a given application and producing a mapping of cores onto that topology. SUNMAP explores various design objectives such as minimizing average communication delay, area, power dissipation subject to bandwidth and area constraints. The tool supports different routing functions (dimension ordered, minimum-path, traffic splitting) and uses floorplanning information early in the topology selection process to provide feasible mappings. The network components of the chosen NoC are automatically generated using cycle-accurate SystemC soft macros from xpipes architecture. SUNMAP automates NoC selection and generation, bridging an important design gap in building NoCs. Several experimental case studies are presented in the paper, which show the rich design space exploration capabilities of SUNMAP.
Keywords: Systems On Chip, Networks on Chip, Topology, Mapping, SystemC.

53.3 FITS: Framework-based Instruction-set Tuning Synthesis for Embedded Application Specific Processors [p. 920]

A. Cheng (The University of Michigan), G. Tyson (Florida State University), T. Mudge (The University of Michigan)

We propose a new instruction synthesis paradigm that falls between a general-purpose embedded processor and a synthesized application specific processor (ASP). This is achieved by replacing the fixed instruction and register decoding of general purpose embedded processor with programmable decoders that can achieve ASP performance with the fabrication advantages of a mass produced single chip solution.
Keywords: 16-bit ISA, Instruction synthesis, Embedded processor, ASP, Instruction encoding, Reconfigurable processors, Configurable architecture, Code density, Low-power, Energy efficient

53.4 Mapping a Domain Specific Language to a Platform FPGA [p. 924]

C. Kulkarni, G. Brebner (Xilinx Inc.), G. Schelle (University of Colorado)

A domain specific language (DSL) enables designers to rapidly specify and implement systems for a particular domain, yielding designs that are easy to understand, reason about, re-use and maintain. However, there is usually a significant overhead in the required infrastructure to map such a DSL on to a programmable logic device. In this paper, we present a mapping of an existing DSL for the networking domain on to a platform FPGA by embedding the DSL into an existing language infrastructure. In particular, we will show that, using few basic concepts, we are able to achieve a successful mapping of the DSL on to a platform FPGA and create a reusable structure that also makes it easy to extend the DSL. Finally we will present some results of mapping the DSL on to a platform FPGA and comment on the resulting overhead.
Keywords: Domain Specific Language, Platform FPGA, Network Processing

SESSION 54: New Scan-Based Test Techniques

Chair: Bernd Koenemann (Cadence Design)
Organizers: Erik Jan Marinissen, Seiji Kajihara

54.1 On the Generation of Scan-Based Test Sets with Reachable States for Testing under Functional Operation Conditions [p. 928]

I. Pomeranz (Purdue University)

Design-for-testability (DFT ) for synchronous sequential circuits allows the generation and application of tests that rely on non-functional operation of the circuit. This can result in unnecessary yield loss due to the detection of faults that do not affect normal circuit operation. Considering single stuck-at faults in full-scan circuits, a test vector consists of a primary input vector U and a state S . We say that the test vector consisting of U and S relies on non-functional operation if S is an unreachable state, i.e., a state that cannot be reached from all the circuit states. Our goal is to obtain test sets with states S that are reachable states. Given a test set C, the solution we explore is based on a simulation-based procedure to identify reachable states that can replace unreachable states in C. No modifications are required to the test generation procedure and no sequential test generation is needed. Our results demonstrate that the proposed procedure is able to produce test sets that detect many of the circuit faults, which are detectable using scan, and practically all the sequentially irredundant faults, by using test vectors with reachable states. The procedure is applicable to any type of scan-based test set, including test sets for delay faults.
Keywords: functional tests, reachable states, scan design.

54.2 Scalable Selector Architecture for X-Tolerant Deterministic BIST [p. 934]

P. Wohl, J. A. Waicukauski, S. Patel (Synopsys Inc.)

X-tolerant deterministic BIST (XDBIST) was recently presented as a method to efficiently compress and apply scan patterns generated by automatic test pattern generation (ATPG) in a logic built-in self-test architecture. In this paper we introduce a novel selector architecture that allows arbitrary compression ratios, scales to any number of scan chains and minimizes area overhead. XDBIST test-coverage, full Xtolerance and scan-based diagnosis ability are preserved and are the same as deterministic scan-ATPG.
Keywords: test-data compression, test-generation (ATPG).

54.3 Scan-BIST Based on Transition Probabilities [p. 940]

I. Pomeranz (Purdue University)

We demonstrate that it is possible to generate a deterministic test set that detects all the detectable single stuck-at faults in a fullscan circuit such that each test contains a small number of transitions from 0 to 1 or from 1 to 0 when considering consecutive input values. Using this result we show that built-in test-pattern generation for scan circuits can be based on transition probabilities instead of probabilities of specific bits in the test set being 0 or 1. The resulting approach associates only two parameters with every set of test vectors: an initial value and a transition probability. We demonstrate that this approach is effective in detecting all the detectable single stuck-at faults in benchmark circuits.
Keywords: built-in self-test, scan design.

54.4 Combining Dictionary Coding and LFSR Reseeding for Test Data Compression [p. 944]

X. Sun, L. Kinney, B. Vinnakota (University of Minnesota)

In this paper we describe a method to combine dictionary coding and partial LFSR reseeding to improve the compression efficiency for test data compression. We also present a fast matrix calculation method which significantly reduces the computation time to find a solution for partial LFSR reseeding. Experimental results on ISCAS89 benchmark circuits show that our approach is better than either dictionary coding or LFSR reseeding, and outperforms several test data compression methods proposed recently.
Keywords: VLSI test, Built-In Self Test.

SESSION 55: CAD for Reconfigurable Computing

Chair: Jason Cong (Magma Design Automation, Inc.)
Organizers: Jens Palsberg, Scott Hauck

55.1 Virtual Memory Window for Application-Specific Reconfigurable Coprocessors [p. 948]

M. Vuletic, L. Pozzi, P. Ienne (Swiss Federal Institute of Technology Lausanne)

Reconfigurable Systems-on-Chip (SoCs) on the market consist of full-fledged processors and large Field-Programmable Gate-Arrays (FPGAs). The latter can be used to implement the system glue logic, various peripherals, and application-specific coprocessors. Using FPGAs for application-specific coprocessors has certain speedup potentials, but it is less present in practice because of the complexity of interfacing the software application with the coprocessor. Another obstacle is the lack of portability across different systems. In this work, we present a virtualisation layer consisting of an operating-system extension and a hardware component. It lowers the complexity of interfacing and increases portability potentials, while it also allows the coprocessor to access the user virtual memory through a virtual memory window. The burden of moving data between processor and coprocessor is shifted from the programmer to the operating system. Since the virtualisation layer components hide physical details of the system, user designed hardware and software become perfectly portable. A reconfigurable SoC running Linux is used to prove the viability of the concept. Two applications are ported to the system for testing the approach, with their critical functions mapped to the specific coprocessors. We show a significant speedup compared to the software versions, while limited penalty is paid for virtualisation.
Keywords: Reconfigurable Computing, OS, Coprocessors, Codesign.

55.2 Dynamic FPGA Routing for Just-in-Time FPGA Compilation [p. 954]

R. Lysecky, F. Vahid, S. X.-D. Tan (University of California at Riverside)

Just-in-time (JIT) compilation has previously been used in many applications to enable standard software binaries to execute on different underlying processor architectures. However, embedded systems increasingly incorporate Field Programmable Gate Arrays (FPGAs), for which the concept of a standard hardware binary did not previously exist, requiring designers to implement a hardware circuit for a single specific FPGA. We introduce the concept of a standard hardware binary, using a just-in-time compiler to compile the hardware binary to an FPGA. A JIT compiler for FPGAs requires the development of lean versions of technology mapping, placement, and routing algorithms, of which routing is the most computationally and memory expensive step. We present the Riverside On-Chip Router (ROCR) designed to efficiently route a hardware circuit for a simple configurable logic fabric that we have developed. Through experiments with MCNC benchmark hardware circuits, we show that ROCR works well for JIT FPGA compilation, producing good hardware circuits using an order of magnitude less memory resources and execution time compared with the well known Versatile Place and Route (VPR) tool suite. ROCR produces good hardware circuits using 13X less memory and executing 10X faster than VPR's fastest routing algorithm. Furthermore, our results show ROCR requires only 10% additional routing resources, and results in circuit speeds only 32% slower than VPR's timing-driven router, and speeds that are actually 10% faster than VPR's routability-driven router.
Keywords: Place and route, just-in-time compilation, hardware/software partitioning, FPGA, configurable logic, platforms, system-on-a-chip, dynamic optimization, codesign, warp processors.

55.3 An Efficient Algorithm for Finding Empty Space for Online FPGA Placement [p. 960]

M. Handa, R. Vemuri (University of Cincinnati)

A fast and efficient algorithm for finding empty area is necessary for online placement, task relocation and defragmentation on a partially reconfigurable FPGA. We present an algorithm that finds empty area as a list of overlapping maximal rectangles. Using an innovative representation of the FPGA, we are able to predict possible locations of the maximal empty rectangles. Worst-case time complexity of our algorithm is O(xy) where x is the number of columns, y is the number of rows and x.y is the total number of cells on the FPGA. Experiments show that, in practice, our algorithm needs to scan less than 15% of the FPGA cells to make a list of all maximal empty rectangles.
Keywords: Online Placement, Partially Reconfigurable FPGAs, Reconfigurable Computing