
DAC 2004 ABSTRACTS
Sessions:
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
[33]
[34]
[35]
[36]
[37]
[38]
[39]
[40]
[41]
[42]
[43]
[44]
[45]
[46]
[47]
[48]
[49]
[50]
[51]
[52]
[53]
[54]
[55]
Chair: A. Richard Newton (University of California at Berkeley)
Organizers: Robert Dahlberg, Kurt Keutzer
Panelists: R. Bingham (Cadence Design Systems, Inc.), A. De Geus (Synopsys, Inc.),
W. C. Rhines (Mentor Graphics Corp.)

Over the last 20 years, electronic design has grown to annual revenues of $3B per year and
dominates the Technical and Systems Sector of the Software Industry. Cadence and Synopsys
are among the top 20 independent software vendors when ranked by market capitalization. In
short, the EDA industry has come of age, but the future of the industry is far from certain.
The panel participants are the three CEOs from the three largest EDA companies: Ray Bingham,
Cadence Design Systems, Inc.; Aart de Geus, Synopsys, Inc.; and Wally Rhines, Mentor
Graphics Corp.. The panel will open with each of the CEOs sharing their outlook for the industry
as a whole and their visions to foster growth of their respective companies. After the opening
statements by the CEOs, Professor Richard Newton, University of California at Berkeley, will
moderate a series of questions posed by a panel of six questioners…
Chair: Massoud Pedram (University of Southern California)
Organizer: Lei He

2.1 Design Optimizations for Microprocessors at Low Temperature [p. 2]

A. Vassighi, A. Keshavarzi, S. Narendra, G. Schrom, Y. Ye (Circuits Research), S. Lee (Intel Labs., OR),
G. Chrysler (Intel Labs., AZ), M. Sachdev (University of Waterloo), V. De (Circuits Research)
We investigate tradeoffs in microprocessor frequency and system power achievable for low
temperature operation in scaled high leakage technologies by combining refrigeration with
supply voltage selection, body bias, transistor sizing and shorter channel length. Reducing
channel length provides better frequency and power improvement than forward body bias. When,
the leakage power is more than 30% of chip power, combining refrigeration with enhancing
technology by shorter channel length provides the best tradeoff for power and frequency.
KEYWORDS: Low temperature, Electrothermal modeling, microprocessor, power, frequency,
CMOS, cooling, refrigeration

2.2 Leakage in NanoScale Technologies: Mechanisms, Impact
and Design Considerations [p. 6]

A. Agarwal, C. H. Kim, S. Mukhopadhyay, K. Roy (Purdue University)
The high leakage current in nanometer regimes is becoming a significant portion of power
dissipation in CMOS circuits as threshold voltage, channel length, and gate oxide thickness are
scaled. Consequently, the identification of different leakage components is very important for
estimation and reduction of leakage. Moreover, the increasing statistical variation in the process
parameters has led to significant variation in the transistor leakage current across and within
different dies. Designing with the worst case leakage may cause excessive guardbanding,
resulting in a lower performance. This paper explores various intrinsic leakage mechanisms
including weak inversion, gateoxide tunneling and junction leakage etc. Various circuit level
techniques to reduce leakage energy and their design tradeoff are discussed. We also explore
process variation compensating techniques to reduce delay and leakage spread, while meeting
power constraint and yield.
Keywords: Leakage current, Circuit design, Process variation.

2.3 System Level Leakage Reduction Considering the Interdependence
of Temperature and Leakage [p. 12]

L. He, W. Liao (University of California at Los Angeles),
M. R. Stan (University of Virginia)
The high leakage devices in nanometer technologies as well as the low activity rates in systemonachip (SOC) contribute to the growing significance of leakage power at the system level. We
first present systemlevel leakagepower modeling and characteristics and discuss ways to reduce
leakage for caches. Considering the interdependence between leakage power and temperature,
we then discuss thermal runaway and dynamic power and thermal management (DPTM) to
reduce power and prevent thermal violations. We show that a thermalindependent leakage
model may hide actual failures of DPTM. Finally, we present voltage scaling considering DPTM
for different packaging options. We show that the optimal Vdd for the best throughput may be
smaller than the largest Vdd allowed by the given packaging platform, and that advanced cooling
techniques can improve throughput significantly.
Keywords: Microarchitecture, Leakage power, Temperature
Chair: John Lillis (University of Illinois)
Organizers: Dirk Stroobandt, Raymond Nijssen

3.1 Reducing Clock Skew Variability via Cross Links [p. 18]

A. Rajaram, J. Hu, R. Mahapatra (Texas A&M University),
Increasingly significant variational effects present a great challenge for delivering desired clock
skew reliably. Nontree clock network has been recognized as a promising approach to overcome
the variation problem. Existing nontree clock routing methods are restricted to a few simple or
regular structures, and often consume excessive amount of wirelength. In this paper, we suggest
to construct a low cost nontree clock network by inserting cross links in a given clock tree. The
effects of the link insertion on clock skew variability are analyzed. Based on the analysis, we
propose two link insertion schemes that can quickly convert a clock tree to a nontree with
significantly lower skew variability and very limited wirelength increase. In these schemes, the
complicated nontree delay computation is circumvented. Further, they can be applied to the
recently popular nonzero skew routing easily. Experimental results on benchmark circuits show
that this approach can achieve significant skew variability reduction with less than 2% increase
of wirelength.
Keywords: VLSI, physical design, variation, clock network synthesis

3.2 Fast and Flexible Buffer Trees that Navigate the Physical Layout Environment [p. 24]

C. J. Alpert (IBM Corporation), M. Hrkic (University of Illinois at Chicago),
J. Hu (Texas A&M University), S. T. Quay (IBM Corporation)
Buffer insertion is an increasingly critical optimization for achieving timing closure, and the
number of buffers required increases significantly with technology migration. It is imperative for
an automated buffer insertion algorithm to be able to efficiently optimize tens of thousands of
nets. One must also be able to effectively navigate the existing layout, including handling large
blockages, blockages with holes specifically for buffers, specially allocated buffer blocks,
placement porosity, and routing congestion. The algorithm must also be flexible enough to know
when to use and when not to use expensive layout resources. Although several previous works
have addressed buffer insertion in the presence of blockages, this is the first to present a
complete solution that can manage the physical layout environment.
Keywords: Buffer insertion, physical synthesis, global routing.

3.3 Practical Repeater Insertion For Low Power: What Repeater
Library Do We Need? [p. 30]

X. Liu, Y. Peng (North Carolina State University), M. C. Papaefthymiou (University of Michigan)
In this paper, we investigate the problem of repeater insertion for low power under a given
timing budget. We propose a novel repeater insertion algorithm to compute the optimal repeater
number and width in the discrete solution space, as defined by a given repeater library. Using our
algorithm, we show that rounding the solution under the continuity assumption to the closest
discrete solution candidate may result in suboptimal designs, or it may even fail to find an
existing solution. Given a certain tolerance to the degradation of repeater power dissipation, we
address two practical and highly important questions: (1) How coarse could the repeater size
granularity be? (2) What range should the repeater size be in?
Experimental results demonstrate the high effectiveness of the proposed scheme and provide
valuable insights into repeater library design. Our approach achieves up to 23% power reduction
in comparison to roundingbased approaches. With a 4% power degradation tolerance, repeater
size granularity as coarse as 8 λ can be used, reducing the library size by more than 87%. For
interconnects with various wire lengths and timing targets, our investigation reveals that the
range of optimal repeater sizes for lowpower is limited, indicating that a lowcost smallsize
repeater library, if well designed, is adequate to provide high quality repeater insertion solutions.
Keywords: Interconnect, Repeater Insertion, Low power
Chair: Jacob Abraham (University of Texas)
Organizers: Adnan Aziz, Yaron Kashi

4.1 Industrial Experience with Test Generation Languages
for Processor Verification [p. 36]

M. Behm, J. Ludden (IBM Development Center Austin),
Y. Lichtenstein, M. Rimon, M. Vinov (IBM Research Laboratory, Haifa)
We report on our experience with a new test generation language for processor verification. The
verification of two superscalar multiprocessors is described and we show the ease of expressing
complex verification tasks. The cost and benefit are demonstrated: training takes up to six
months; the simulation time required for a desired level of coverage has decreased by a factor of
twenty; the number of escape bugs has been reduced.
Keywords: Functional verification, Processor verification, Test generation

4.2 Defining Coverage Views to Improve Functional Coverage Analysis [p. 41]

S. Asaf, E. Marcus, A. Ziv (IBM Research Laboratory, Haifa)
Coverage analysis is used to monitor the quality of the verification process. Reports provided by
coverage tools help users identify areas in the design that have not been adequately tested.
Because of their sheer size, the analysis of large coverage models can be an intimidating and
timeconsuming task. Practically, it can only be done by focusing on specific parts of the model.
This paper presents a method for defining views onto the coverage data of crossproduct
functional coverage models. The proposed method allows users to focus on certain aspects of the
coverage data to extract relevant, useful information, thereby improving the quality of the
coverage analysis. A number of examples are provided that show how the proposed method
improved the verification of actual designs.
Keywords: Functional verification, Coverage analysis

4.3 Systematic Functional Coverage Metric Synthesis
from Hierarchical Temporal Event Relation Graph [p. 45]

Y.S. Kwon, Y.I. Kim, C.M. Kyung (KAIST)
Functional coverage is a technique for checking the completeness of test vectors in HDL
simulation. Temporal events are used to monitor the sequence of events in the specification. In
this paper, automatic generation of temporal events for functional coverage is proposed. The
HiTER is the graph where nodes represent basic temporal properties or subgraph and edges
represent timeshift value between two nodes. Hierarchical temporal events are generated by
traversing HiTER such that invalid, or irrelevant properties are eliminated. Concurrent edge
groups make it possible to generate more comprehensive temporal properties and hierarchical
structure makes it easy to describe large design by combining multiple subgraphs. Automatically
generated temporal events describe almost all the possible temporal properties of the design
under verification.
Keywords: Functional coverage, Temporal Event

4.4 Probabilistic Regression Suites for Functional Verification [p. 49]

S. Fine, S. Ur, A. Ziv (IBM Research Laboratory, Haifa)
Random test generators are often used to create regression suites onthefly. Regression suites
are commonly generated by choosing several specifications and generating a number of tests
from each one, without reasoning which specification should be used and how many tests should
be generated from each specification. This paper describes a technique for building high quality
random regression suites. The proposed technique uses information about the probability of each
test specification covering each coverage task. This probability is used, in turn, to determine
which test specifications should be included in the regression suite and how many tests should be
generated from each specification. Experimental results show that this practical technique can be
used to improve the quality, and reduce the cost, of regression suites. Moreover, it enables better
informed decisions regarding the size and distribution of the regression suites, and the risk
involved.
Keywords: Functional Verification, Coverage Analysis, Regression Suite
Chair: Prashant Sawkar (Intel Corporation)
Organizers: Gila Kamhi, Krzysztof Kuchcinski

5.1 Modular Scheduling of Guarded Atomic Actions [p. 55]

D. L. Rosenband, Arvind (Massachusetts Institute of Technology)
A modular synthesis flow is essential for a scalable and hierarchical design methodology. This
paper considers a particular modular flow where each module has interface methods and the
internal behavior of the module is described in terms of a set of guarded atomic actions on the
state elements of the module. A module can also read and update the state of other modules but
only by invoking the interface methods of those modules. This paper extends the past work on
hardware synthesis of a set of guarded atomic actions by Hoe and Arvind to modules of such
actions. It presents an algorithm that, given the scheduling constraints on the interface methods
of the called modules, derives the "glue logic" and the scheduling constraints for the interface
methods of the calling module such that the atomicity of the guarded actions is preserved across
module boundaries. Such modules provide reusable IP which facilitates "correctness by
construction" design methodology. It also reduces compiletimes dramatically in comparison to
the compilation that flattens all the modules first.

5.2 Automatic Correct Scheduling of Control Flow Intensive
Behavioral Descriptions in Formal Synthesis [p. 61]

K. Kapp, V. Sabelfeld (University of Karlsruhe)
Formal synthesis ensures the correctness of hardware synthesis by automatically deriving the
circuit implementation by behavior preserving transformations within a theorem prover. In this
paper, we present a new approach to formal synthesis that is able to handle control flow intensive
descriptions. In particular, we consider here the scheduling phase, which is a central task in highlevel
synthesis. We describe a methodology employing a new control flow oriented
representation which allows the fully automatic scheduling of control flow intensive descriptions
in formal synthesis. To obtain scheduled circuits of high quality, the scheduling is guided by
conventional scheduling algorithms.
Keywords: Formal Synthesis, Transformational Design, Scheduling

5.3 A TimingDriven ModuleBased Chip Design Flow [p. 67]

F. Mo, R. K. Brayton (University of California at Berkeley)
A ModuleBased design flow for digital ICs with hard and soft modules is presented. Versions of
the soft modules are implemented with different area/delay characteristics. The versions
represent flexibility that can be used in the physical design to meet timing
requirements. The
flow aims at minimizing the clock cycle of the chip while providing quicker
turnaround time.
Unreliable wiring estimation is eliminated and costly iterations are reduced resulting in
substantial reductions in run time as well as a significant decrease in the clock periods.
Keywords: Design flow, Physical synthesis, Timing constraints

5.4 Timing Closure through a Globally Synchronous, Timing
Partitioned Design Methodology [p. 71]

A. Edman, C. Svensson (Link&ooul;ping University)
A method to mitigate timing problems due to global wire delays is proposed. The method
follows closely a fully synchronous design flow and utilizes only true digital library elements.
The design is partitioned into isochronous blocks at system level, where a few clock cycles
latency is inserted between the isochronous blocks. This latency is then utilized to automatically
mitigate unknown global wire delays, unknown global clock skews and other timing
uncertainties occurring in backend design. The new method is expected to considerably reduce
the timing closure effort in large high frequency digital designs in deep submicron technologies.
Keywords: Timing closure, wire delays, clock skew
Chair: Leon Stok (IBM Corporation)
Organizer: Naresh Shanbhag

6.1 Design and Reliability Challenges in Nanometer Technologies [p. 75]

S. Borkar, T. Karnik, V. De (Intel Laboratories)
CMOS technology scaling is causing the channel lengths to be subwavelength of light.
Parameter variation, caused by subwavelength lithography, will pose a major challenge for
design and reliability of future high performance microprocessors in nanometer technologies. In
this paper, we present the impact of these variations on processor functionality, predictability and
reliability. We propose design and CAD solutions for variation tolerance. We conclude this
paper with soft error rate scaling trends and soft error tolerant circuits for reliability
enhancement.
Keywords: Lowpower, variation tolerance, leakage tolerance, reliability, soft errors.

6.2 A CommunicationTheoretic Design Paradigm for Reliable SOCs [p. 76]

N. R. Shanbhag (University of Illinois at UrbanaChampaign)
Presented is a design paradigm, pioneered at the University of Illinois in 1997, for reliable and
energy efficient systemonachip (SOC) in nanometer process technologies. These technologies
are characterized by nonidealities such as coupling, leakage, soft errors, and process variations,
which contribute to a reliability problem. Increasing complexity of systemsonachip (SOC)
leads to a related power problem. The proposed paradigm provides solutions to both problems by
viewing SOCs as communication networks, and employs ideas from errorcontrol coding,
communications, and information theory in order to achieve the dual goals of reliability and
energyefficiency.
Keywords: Lowpower, systemonachip, reliability, communications, coding.

6.3 Reliable Communication in Systems on Chips [p. 77]

G. De Micheli (Stanford University)
KeywordsSoCs, VLSI, Networking

6.4 Designing Robust Microarchitectures [p. 78]

T. M. Austin (University of Michigan)
A faulttolerant approach to microprocessor design, developed at the University of Michigan, is
presented. Our approach is based on the use of insitu checker components that validate the
functional and electrical characteristics of complex microprocessor designs. Two design
techniques are highlighted: a lowcost doublesampling latch design capable of eliminating
powerhungry voltage margins, and a formally verifiable checker coprocessor that validates all
computation produced by a complex microprocessor core. By adopting a "better than worstcase"
approach to system design, it is possible to address reliability and uncertainty concerns that arise
during design, manufacturing and system operation.
Keywords: Lowpower, systemonachip, reliable microarchitecture design.

6.5 Hierarchical Application Aware Error Detection and Recovery [p. 79]

R. K. Iyer (University of Illinois at UrbanaChampaign)
Proposed is a fourtired approach to develop and integrate detection and recovery support at
different levels of the system hierarchy. The proposed mechanisms exploit support provided by
(i) embedded hardware, (ii) operating system, (iii) compiler, and (iv) application.
Keywords: Hierarchical error detection and recovery, embedded hardware, software
implemented fault tolerance.
Chair: Lucio Lanza (Lanza techVentures)
Organizer: Andrzej Strojwas
Panelists: M. Campbell (Qualcomm, Inc.), V. C. Gerousis (Infineon Tech.),
J. Hogan (Telos Venture Partners), J. Kibarian (PDF Solutions, Inc.),
M. Levitt (Cadence Design Systems, Inc.), W. Ng (Chartered Semiconductor Manufacturing, Inc.),
D. Pramanik (Synopsys, Inc.), M. Templeton (Artisan Components, Inc.)

Silicon yield once was dominated by contaminants and particulates, making yield a process
issue. But with today's electronics supply chain, multiple suspects may be indicted on
manufacturability issues. Who is responsible for preventative actions in manufacturability and
yield? The panelists, representing a foundry, a fabless company, an IP provider, two EDA
vendors, and an IC design team, will discuss the problems and the solutions for achieving
manufacturability and yield goals.
Chair: Massoud Pedram (University of Southern California)
Organizers: Luca Benini, Trevor Mudge

8.1 Memory Access Scheduling and Binding Considering Energy
Minimization in MultiBank Memory Systems [p. 81]

C.G. Lyuh (Electronics and Telecommunications Research Institute), T. Kim (Seoul National University)
Memoryrelated activity is one of the major sources of energy consumption in embedded
systems. Many types of memories used in embedded systems allow multiple operating modes
(e.g., active, standby, nap, powerdown) to facilitate energy saving. Furthermore, it has been
known that the potential energy saving increases when the embedded systems use multiple
memory banks in which their operating modes are controlled independently. In this paper, we
propose (a compilerdirected) integrated approach to the problem of maximally utilizing the
operating modes of multiple memory banks by solving the three important tasks simultaneously:
(1) assignment of variables to memory banks, (2) scheduling of memory access operations, and
(3) determination of operating modes of banks. Specifically, for an instance of tasks 1 and 2, we
formulate task 3 as a shortest path(SP) problem in a network and solved it optimally. We then
develop an SPbased heuristic that solves tasks 2 and 3 efficiently in an integrated fashion. We
then extend the proposed approach to address the limited register constraint in processor. From
experiments with a set of benchmark programs, we confirm that the proposed approach is able to
reduce the energy consumption by 15.76% over that by the conventional greedy approach.
Keywords: low energy design, scheduling, binding

8.2 Profilebased Optimal Intratask Voltage Scheduling
for Hard RealTime Applications [p. 87]

J. Seo (KAIST), T. Kim (Seoul National University), K.S. Chung (Hanyang University)
This paper presents a set of comprehensive techniques for the intratask voltage scheduling
problem to reduce energy consumption in hard realtime tasks of embedded systems. Based on
the execution profile of the task, a voltage scheduling technique that optimally determines the
operating voltages to individual basic blocks in the task is proposed. The obtained voltage
schedule guarantees minimum average energy consumption. The proposed technique is then
extended to solve practical issues regarding transition overheads, which are totally or partially
ignored in the existing approaches. Finally, a technique involving a novel extension of our
optimal scheduler is proposed to solve the scheduling problem in a discretely variable voltage
environment. In summary, it is confirmed from experiments that the proposed optimal
scheduling technique reduces energy consumption by 20.2% over that of one of the stateoftheart
schedulers [11] and, further, the extended technique in a discrete voltage environment reduces
energy consumption by 45.3% on average.
Keywords: low energy design, DVS, intratask voltage scheduling

8.3 RequirementBased Design Methods for Adaptive Communications Links [p. 93]

J. A. Carballo, K. Nowka, S.M. Yoo, I. Vo (IBM Austin Research Laboratory),
C. Cranford, R. Norman (IBM Systems and Technology)
Highspeed communications link cores must consume lowpower, feature low biterrorrates
(BER), and address many applications. We present a methodology to design adaptive link
architectures, whereby the link.s internal logic complexity, frequency, and supply are
simultaneously adapted to application requirements. The requirement space is mapped to the
design space using requirements measurement circuits and configurable logic blocks. CMOS
results indicate that power savings of 60% versus the worst case are possible, while the area
overhead is kept under 5%.
Keywords: Energy Efficient Design, Communication Architectures.

8.4 Automated Energy/Performance Macromodeling of Embedded Software [p. 99]

A. Muttreja (Princeton University), A. Raghunathan, S. Ravi (NEC Laboratories),
N. K. Jha (Princeton University)
Efficient energy and performance estimation of embedded software is a critical part of any
systemlevel design flow. Macromodeling based estimation is an attempt to speed up estimation
by exploiting reuse that is inherent in the design process. Macromodeling involves precharacterizing
reusable software components to construct highlevel models, which express the
execution time or energy consumption of a subprogram as a function of suitable parameters.
During simulation, macromodels can be used instead of detailed hardware models, resulting in
orders of magnitude simulation speedup. However, in order to realize this potential, significant
challenges need to be overcome in both the generation and use of macromodels— including how
to identify the parameters to be used in the macromodel, how to define the template function to
which the macromodel is fitted, etc. This paper presents an automatic methodology to perform
characterizationbased highlevel software macromodeling, which addresses the aforementioned
issues. Given a subprogram to be macromodeled for execution time and/or energy consumption,
the proposed methodology automates the steps of parameter identification, data collection
through detailed simulation, macromodel template selection, and fitting. We propose a novel
technique to identify potential macromodel parameters and perform data collection, which draws
from the concept of data structure serialization used in distributed programming. We utilize
symbolic regression techniques to concurrently filter out irrelevant macromodel parameters,
construct a macromodel function, and derive the optimal coefficient values to minimize fitting
error. Experiments with several realistic benchmarks suggest that the proposed methodology
improves estimation accuracy and enables wide applicability of macromodeling to complex
embedded software, while realizing its potential for estimation speedup. We describe a case
study of how macromodeling can be used to rapidly explore algorithmlevel energy tradeoffs, for
the zlib data compression library.
Keywords: Data Serialization, Embedded Software, Genetic Programming, Macromodeling,
Symbolic Regression

8.5 Coding for SystemonChip Networks: A Unified Framework [p. 103]

S. R. Sridhara, N. R. Shanbhag (University of Illinois at UrbanaChampaign)
In this paper, we present a coding framework derived from a communicationtheoretic view of a
DSM bus to jointly address power, delay, and reliability. In this framework, the data is first
passed through a nonlinear source coder that reduces self and coupling transition activity and
imposes a constraint on the peak coupling transitions on the bus. Next, a linear error control
coder adds redundancy to enable error detection and correction. The framework is employed to
efficiently combine existing codes and to derive novel codes that span a wide range of tradeoffs
between bus delay, codec latency, power, area, and reliability. Simulation results, for a 1cm 32
bit bus in a 0.18&um;m CMOS technology, show that 31% reduction in energy and 62% reduction
in energydelay product are achievable.
Keywords: Bus coding, crosstalk avoidance, lowpower, lowswing, errorcorrecting codes
Chair: Joerg Henkel (University of Karlsruhe)
Organizers: Radu Marculescu, Rainer Leupers

9.1 Abstraction of Assembler Programs for Symbolic Worst Case
Execution Time Analysis [p. 107]

T. Schuele, K. Schneider (University of Kaiserslautern)
Various techniques have been proposed to determine the worst case execution time of realtime
systems. For most of these approaches, it is not necessary to capture the complete semantics of
the system. Instead, it suffices to analyze an abstract model provided that it reflects the system's
execution time correctly. To this end, we present an abstraction technique based on program
slicing that can be used to simplify software systems at the level of assembler programs. The key
idea is to determine a minimal set of instructions such that the control flow of the program is
maintained. This abstraction is essential for reducing the runtime of the analysis algorithms, in
particular, when symbolic methods are used to perform a complete state space exploration.
Keywords: Real–time systems, worst case execution time, abstraction, program slicing,
assembler programs, symbolic simulation

9.2 Extending the Transaction Level Modeling Approach
for Fast Communication Architecture Exploration [p. 113]

S. Pasricha, N. Dutt (University of California at Irvine),
M. BenRomdhane (Conexant Systems Inc.)
SystemonChip (SoC) designs are increasingly becoming more complex. Efficient onchip
communication architectures are critical for achieving desired performance in these systems.
System designers typically use Bus Cycle Accurate (BCA) models written in high level
languages such as C/C++ to explore the communication design space. These models capture all
of the bus signals and strictly maintain cycle accuracy, which is useful for reliable performance
exploration but results in slow simulation speeds for complex designs, even when they are
modeled using high level languages. Recently there have been several efforts to use the
Transaction Level Modeling (TLM) paradigm for improving simulation performance in BCA
models. However these BCA models capture a lot of details that can be eliminated when
exploring communication architectures.
In this paper we extend the TLM approach and propose a new and faster transactionbased
modeling abstraction level (CCATB) to explore the communication design space. Our abstraction
level bridges the gap between the TLM and BCA levels, and yields an average performance
speedup of 55% over BCA models. We demonstrate how fast and accurate exploration of
tradeoffs is possible for highperformance shared bus architectures such as AMBA 2.0 and
AMBA 3.0 (AXI) in industrial strength designs at the proposed abstraction level.
Keywords: Communication Architecture Exploration, Transaction Level Modeling, Bus Cycle
Accurate Modeling, Shared Bus Architectures, AMBA

9.3 Specific Scheduling Support to Minimize the Reconfiguration Overhead
of Dynamically Reconfigurable Hardware [p. 119]

J. Resano, D. Mozos (Universidad Complutense de Madrid),
D. Verkest (IMEC vzw, Vrije Universiteit Brussel, Katholieke Universiteit Leuven),
F. Catthoor (IMEC vzw, Vrije Universiteit Brussel), S. Vernalde (IMEC vzw)
Dynamically Reconfigurable Hardware (DRHW) platforms present both flexibility and high
performance. Hence, they can tackle the demanding requirements of current dynamic multimedia
applications, especially for embedded systems where it is not affordable to include specific HW
support for all the applications. However, DRHW reconfiguration latency represents a major
drawback that can make the use of DRHW resources inefficient for highly dynamic applications.
To alleviate this problem, we have developed a set of techniques that provide specific support for
DRHW devices and we have integrated them into an existing multiprocessor scheduling
environment. In our experiments, with actual multimedia applications, we have reduced the
original overhead due to the reconfiguration latency by at least 93%.
Keywords: Dynamic reconfigurable hardware, runtime scheduling.

9.4 LODS: LocalityOriented Dynamic Scheduling for OnChip Multiprocessors [p. 125]

M. Kandemir (The Pennsylvania State University)
Current multiprocessor SoC applications like network protocol codes, multimedia processing and
baseband telecom circuits have tight timetomarket and performance constraints, which require
an efficient design cycle. Consequently, automated techniques such as those oriented towards
exploiting data locality are critical. In this paper, we demonstrate that existing loop scheduling
techniques provide performance improvements even on onchip multiprocessors, but they fall
short of generating the best results since they do not take data locality into account as an explicit
optimization parameter. Based on this observation, we propose a data localityoriented loopscheduling
algorithm. The idea is to assign loop iterations to processors in such a fashion that
each processor makes maximum reuse of the data it accesses.
Keywords: Data Locality, Parallelization, Loop Scheduling, Embedded System.

9.5 An Area Estimation Methodology for FPGA Based Designs at SystemCLevel [p. 129]

C. Brandolese, W. Fornaciari, F. Salice (Politecnico di Milano  DEI)
This paper presents a parametric area estimation methodology at SystemC level for FPGAbased
designs. The approach is conceived to reduce the effort to adapt the area estimators to the
evolutions of the EDA design environments. It consists in identifying the subset of measures
that can be derived form the system level description and that are also relevant at VHDLRT
level. Estimators' parameters are then automatically derived from a set of benchmarks.
Keywords: FPGAs, SystemC, Area metrics.
Chair: Koen Lampaert (Mindspeed Technologies Inc.)
Organizers: Georges G. Gielen, Koen Lampaert

10.1 Automated Design of Operational Transconductance Amplifiers
using Reversed Geometric Programming [p. 133]

J. P. Vanderhaegen, R. W. Brodersen (University of California at Berkeley)
We present a method for designing operational amplifiers using reversed geometric
programming, which is an extension of geometric programming that allows both convex and
nonconvex constraints. Adding a limited set of nonconvex constraints can improve the
accuracy of convex equationbased optimization, without compromising global optimality.
These constraints allow increased accuracy for critical modeling equations, such as the
relationship between gm and IDS . To demonstrate the design methodology, a foldedcascode
amplifier is designed in a 0.18 μm technology for varying speed requirements and is compared
with simulations and designs obtained from geometric programming.
Keywords: CMOS integrated circuits, operational transconductance amplifiers, reversed
geometric programming

10.2 CorrectbyConstruction LayoutCentric Retargeting of Large Analog Designs [p. 139]

S. Bhattacharya, N. Jangkrajarng, R. Hartono, C.J. R. Shi (University of Washington)
Aggressive design cycles in the semiconductor industry demand a designreuse principle for
analog circuits. The strong impact of layout intricacies on analog circuit performance
necessitates design reuse with special focus on layout aspects. This paper presents a computeraided
design tool and the methodology for a layoutcentric reuse of large analog intellectual property
blocks. From an existing layout representation, an analog circuit is retargeted to
different processes and performances; the corresponding correctbyconstruction layouts are
generated automatically and have performances comparable to manually crafted layouts. The
tool and the methodology are validated on large analog intellectualproperty blocks. While
manual redesign and relayout is known to take weeks to months, our reuse toolsuite achieves
comparable performance in hours.
Keywords: Analog Integrated Circuit Design, Analog Layout Automation, Layout Symmetry,
Analog Synthesis and Optimization.

10.3 Fast and Accurate Parasitic Capacitance Models
for LayoutAware Synthesis of Analog Circuits [p. 145]

A. Agarwal, H. Sampath, V. Yelamanchili, R. Vemuri (University of Cincinnati)
Considering layout effects early in the analog design process is becoming increasingly important.
We propose techniques for estimating parasitic capacitances based on lookup tables and multivariate
linear interpolation. These models enable fast and accurate estimation of parasitic
capacitances and are very suitable for use in a synthesis flow. A layout aware methodology for
synthesis of analog CMOS circuits using these parasitic models is presented. Results indicate
that the proposed synthesis system is fast as compared to a layoutinclusive synthesis approach.
Keywords: Analog Synthesis, Layout Aware, Parasitic Estimation

10.4 ORACLE: Optimization with Recourse of Analog Circuits Including
Layout Extraction [p. 151]

Y. Xu, L. T. Pileggi (Carnegie Mellon University), S. P. Boyd (Stanford University)
Long design cycles due to the inability to predict silicon realities is a wellknown problem that
plagues analog/RF integrated circuit product development. As this problem worsens for
technologies below 100nm, the high cost of design and multiple manufacturing spins causes
fewer products to have the volume required to support full custom implementation. Design reuse
and analog synthesis make analog/RF design more affordable; however, the increasing process
variability and lack of modeling accuracy remains extremely challenging for nanoscale
analog/RF design. We propose an analog/RF circuit design methodology ORACLE, which is a
combination of reuse and shareduse by formulating the synthesis problem as an optimization
with recourse problem. Using a twostage geometric programming with recourse approach,
ORACLE solves for both the globally optimal shared and applicationspecific variables.
Concurrently, we demonstrate ORACLE for novel metalmask configurable designs, where a
range of applications share common underlying structure and applicationspecific customization
is performed using the metalmask layers. We also include the silicon validation of the metalmask
configurable designs.
Keywords: Optimization with recourse

10.5 A Synthesis Flow Toward Fast Parasitic Closure
for RadioFrequency Integrated Circuits [p. 155]

G. Zhang (Carnegie Mellon University), A. Dengi, R. A. Rohrer (Neolinear Inc.),
R. A. Rutenbar, L. R. Carley (Carnegie Mellon University)
An electrical and physical synthesis flow for highspeed analog and radiofrequency circuits is
presented in this paper. Novel techniques aiming at fast parasitic closure are employed
throughout the flow. Parasitic corners generated based on the earlier placement statistics are
included for circuit resizing to enable parasitic robust designs.
A performancedriven placement with simultaneous fast incremental global routing is proposed
to achieve accurate parasitic estimation. Device tuning is utilized during layout to compensate
for layout induced performance degradations. This methodology allows sophisticated
macromodels of performances versus device variables and parasitics to be used during layout
synthesis to make it truly performancedriven. Experimental results of a 4GHz LNA and a mixer
demonstrate fast parasitic closure with this methodology.
Keywords: Synthesis, Sizing, Layout, Parasitic, Modeling, Radio Frequency
Chair: Eli Chiprout (Intel Corporation)
Organizers: Abhijit Dharchoudhury, Lei He

11.1 Buffer Sizing for Clock Power Minimization Subject
to General Skew Constraints [p. 159]

K. Wang, M. MarekSadowska (University of California at Santa Barbara)
In this paper, we investigate the problem of buffer sizing for clock power minimization subject to
general skew constraints. A novel approach based on sequential linear programming is presented.
By taking the firstorder Taylor's expansion of clock path delay with respect to buffer widths, the
original nonlinear problem is transformed to a sequence of linear programs, which incorporate
clock skew scheduling and buffer sizing to minimize clock power dissipation. For each linear
program, the sensitivities of clock path delay with respect to buffer widths are efficiently updated
by applying timedomain analysis to the clock network in a divideandconquer fashion. Our
approach can take process variations and power supply noise into account. We demonstrate
experimentally that the proposed technique is not only capable of effectively reducing clock
power consumption, but also able to provide more accurate delay and skew results compared to
the traditional approach.
Keywords: Clock skew scheduling, sizing, sequential linear programming

11.2 Optimal Placement of Power Supply Pads and Pins [p. 165]

M. Zhao, Y. Fu, V. Zolotov, S. Sundareswaran, R. Panda (Motorola, Inc.)
Power delivery networks of VLSI chips require adequate input supply connections to ensure
reliable performance. This paper addresses the problem of finding an optimum set of pads, pins,
and onchip voltage regulators, and their placement in a given power supply network, subject to
constraints on the voltage drops in the network and maximum currents through the pads, pins and
regulators. The problem is modeled as a mixed integer linear program using macromodeling
techniques and several heuristic techniques are proposed to make the problem tractable. The
effectiveness of the proposed techniques is demonstrated on several real chips and memories
used in lowpower and highperformance applications.
Keywords: Pad optimization, pad placement

11.3 A Stochastic Approach to Power Grid Analysis [p. 171]

S. Pant, D. Blaauw (University of Michigan), V. Zolotov, S. Sundareswaran, R. Panda (Motorola, Inc.)
Power supply integrity analysis is critical in modern high performance designs. In this paper, we
propose a stochastic approach to obtain statistical information about the collective IR and LdI/dt
drop in a power supply network. The currents drawn from the power grid by the blocks in a
design are modelled as stochastic processes and their statistical information is extracted,
including correlation information between blocks in both space and time. We then propose a
method to propagate the statistical parameters of the block currents through the linear model of
the power grid to obtain the mean and standard deviation of the voltage drops at any node in the
grid. We show that the run time is linear with the length of the current waveforms allowing for
extensive vectors, up to millions of cycles, to be analyzed. We implemented the approach on a
number of grids, including a grid from an industrial microprocessor and demonstrate its accuracy
and efficiency. The proposed statistical analysis can be use to determine which portions of the
grid are most likely to fail as well as to provide information for other analyses, such as statistical
timing analysis.
Keywords: IR drop, Ldi/dt, power supply networks.

11.4 Efficient Power/Ground Network Analysis
for Power IntegrityDriven Design Methodology [p. 177]

S.W. Wu (Elan Microelectronics Corporation), Y.W. Chang (National Taiwan University)
As technology advances, the metal width is decreasing with the length increasing, making the
resistance along the power line increase substantially. Together with the nonlinear scaling of the
threshold voltage that makes the ratio of the threshold voltage to the supply voltage rise, the
voltage (IR) drop become a serious problem in modern VLSI design. Traditional power/ground
(P/G) network analysis methods are typically very computationally expensive and thus not
feasible to be integrated into floorplanning. To make the integration of the P/G analysis with
floorplanning feasible, we need a very efficient, yet sufficiently accurate analysis method. In this
paper, we present the methods for the fast analysis of the P/G networks at the floorplanning stage
and integrate our analyzer into a commercial tool to develop a power integrity (IR drop) driven
design methodology. Experimental results based on three realworld circuit designs show that
our P/G network analyzer is accurate enough and very efficient. In particular, with our floorplanbased
P/G network analyzer, the power integritydriven design flow successfully fixed the IRdrop
errors earlier at the floorplanning stage and thus enabled the singlepass design
methodology.
Keywords: Floorplanning, Power/Ground Network

11.5 ReliabilityDriven Layout Decompaction for Electromigration
Failure Avoidance in Complex MixedSignal IC Designs [p. 181]

G. Jerke (Robert Bosch GmbH), J. Lienig (Dresden University of Technology),
J. Scheible (Robert Bosch GmbH)
The negative effect of electromigration on signal and power line lifetime and functional
reliability is an increasingly important problem for the physical design of integrated circuits. We
present a new approach that addresses this electromigration issue by considering current density
and inhomogeneous currentflow within arbitrarily shaped metallization patterns during physical
design. Our proposed methodology is based on a postroute modification of critical layout
structures that utilizes currentdensity data from a previously performed currentdensity
verification. It is especially tailored to overcome the lack of currentflow consideration within
existing routing tools. We also present experimental results obtained after successfully
integrating our methodology into a commercial IC design flow.
Keywords: Physical design, electromigration, interconnect reliability, layout decomposition,
compaction, decompaction
Chair: Kurt Keutzer (University of California at Berkeley)
Organizers: Nitin Deo, Behrozz Zahiri
Panelists: I. Bolsens (Xilinx, Inc.), J. Cong (Magma Design Automation, Inc.),
B. Gupta (STMicroelectronics), P. Lopresti (NEC Electronics America), C. Reynolds (IBM Corporation),
C. Rowen (Tensilica, Inc.), R. Simar (Texas Instruments, Inc.)


Increasing design cost and risk is causing more and more designers to build configurable
platforms that will amortize design and manufacturing costs across many generations. Several
reconfiguration schemes exist and none of them seems to be a clear winner. Most notably there
are three forms of technology: structured ASIC, configurable software programmable,
reconfigurable FPGAlike platform. New solutions are emerging and may induce drastic changes
in the current industry organization. The panel will examine factors in the success of the various
options, and look to what the future might bring.
Chair: K. S. Leung (Magma Design Automation, Inc.)
Organizers: Charles J. Alpert, Dirk Stroobandt, Raymond Nijssen

13.1 Optical Proximity Correction (OPC)Friendly Maze Routing [p. 186]

L.D. Huang (Texas Instruments), M. D. F. Wong (University of Illinois at UrbanaChampaign)
As the technology migrates into the deep submicron manufacturing (DSM) era, the critical
dimension of the circuits is getting smaller than the lithographic wavelength. The unavoidable
light diffraction phenomena in the subwavelength technologies have become one of the major
factors in the yield rate. Optical proximity correction (OPC) is one of the methods adopted to
compensate for the light diffraction effect as a post layout process. However, the process is timeconsuming
and the results are still limited by the original layout quality. In this paper, we
propose a maze routing method that considers the optical effect in the routing algorithm. By
utilizing the symmetrical property of the optical system, the light diffraction is efficiently
calculated and stored in tables. The costs that guide the router to minimize the optical
interferences are obtained from these lookup tables. The problem is first formulated as a
constrained maze routing problem, then it is shown to be a multiple constrained shortest path
problem. Based on the Lagrangian relaxation method, an effective algorithm is designed to solve
the problem.
Keywords: microlithography, VLSI, maze routing, optical system, manufacturing, OPC

13.2 Design Automation for Mask Programmable Fabrics [p. 192]

N. V. Shenoy, J. Kawa, R. Camposano (Synopsys, Inc.)
Programmable circuit design has played an important role in improving design productivity over
the last few decades. By imposing structure on the design, efficient automation of synthesis,
placement and routing is possible. We focus on a class of programmable circuits known as mask
programmable circuits. In this paper, we describe key issues in design and tool methodology that
need to be addressed in creating a programmable fabric. We construct an efficient design flow
that can explore different logic and routing architectures. The main advantage of our work is that
we tailor tools designed for standard cell design, that are readily available in the market, to work
on a programmable fabric. Our flow requires some additional software capability. A special
router that understands programmable routing constructs to complete connections is described. In
addition, a tool that packs logic efficiently after synthesis is also presented.
Keywords: Integrated circuits, Mask programmable fabrics

13.3 On Designing ViaConfigurable Cell Blocks for Regular Fabrics [p. 198]

Y. Ran, M. MarekSadowska (University of California at Santa Barbara)
In this paper we describe the design process of a viaconfigurable block for regular fabrics. The
block consists of viaconfigurable functional cells, viadecomposable flipflops, and viaconfigured
sizable repeaters. The fabric has fixed layers up to M2. An M1M2 via mask is used
to define the block's functionality. The upperlevel metals are customized. Compared to other
structures based on LUTs or PLAs, and fixed flipflops, our block has much smaller area, higher
performance and lower power consumption.
Keywords: Regular fabric, via configurable, layout

13.4 Routing Architecture Exploration for Regular Fabrics [p. 204]

V. Kheterpal, A. J. Strojwas, L. Pileggi (Carnegie Mellon University)
In an effort to control the parameter variations and systematic yield problems that threaten the
affordability of applicationspecific ICs, new forms of design regularity and structure have been
proposed. For example, there has been speculation [6] that regular logic fabrics [1] based on
regular geometry patterns [2] can offer tighter control of variations and greater control of
systematic manufacturing failures. In this paper we describe a routing framework that
accommodates arbitrary descriptions of regular and structured routing architectures. We further
propose new regular routing architectures and explore the various performance vs.
manufacturability tradeoffs. Results demonstrate that a more regular, restricted routing
architecture can provide a substantial advantage in terms of manufacturability and predictability
while incurring a moderate performance penalty.
Keywords: BEOL, Regularity

13.5 Accurate Prelayout Estimation of Standard Cell Characteristics [p. 208]

H. Yoshida, K. De, V. Boppana (Zenasis Technologies, Inc.)
With the advent of deepsubmicron technologies, it has become essential to model the impact of
physical/layout effects up front in all design flows [1]. The effect of layout parasitics is
considerable even at the intracell level in standard cells. Hence, it has become critically
important for any transistorlevel optimization to consider the effect of these layout parasitics as
an integral part of the optimization process. However, since it is not computationally feasible for
the actual layout to be a part of any such optimization procedures, we propose a technique that
estimates cell layout characteristics without actually performing the layout and subsequent
extraction steps. We demonstrate in this work that it is indeed feasible to estimate the layout
effects to get timing characteristics that are on average within about 1.5% of postlayout timing
and that the technique is thousands of times faster than the actual creation of layout.
Keywords: Standard cell, cell characterization, transistorlevel optimization
Chair: Vigyan Singhal (Jasper Design Automation, Inc.)
Organizers: Karem A. Sakallah, Rajeev Ranjan

14.1 An Efficient FiniteDomain Constraint Solver for Circuits [p. 212]

G. Parthasarathy, M. K. Iyer, K.T. Cheng, L.C. Wang (University of California at Santa Barbara)
This paper presents a novel hybrid finitedomain constraint solving engine for RTL circuits. We
describe how DPLL search is modified for search in combined integer and Boolean domains by
using efficient finitedomain constraint propagation. This enables efficient combination of
Boolean SAT and linear integer arithmetic solving techniques. We automatically use control and
datapath abstraction in RTL descriptions. We use conflictbased learning using the variables on
the boundary of control and datapath for additional performance benefits. Finally, we analyze
the hybrid constraint solver experimentally using some example circuits.
Keywords: Design Verification, Decision Procedures, Boolean Satisfiability, Integer Linear
Programming, Bitvector arithmetic

14.2 Automatic Abstraction and Verification of Verilog Models [p. 218]

Z. S. Andraus, K. A. Sakallah (University of Michigan)
Abstraction plays a critical role in verifying complex systems. A number of languages have been
proposed to model hardware systems by, primarily, abstracting away their wide datapaths while
keeping the lowlevel details of their control logic. This leads to a significant reduction in the
size of the state space and makes it possible to verify intricate control interactions formally.
These languages, however, require that the abstraction be done manually, a tedious and errorprone
process. In this paper we describe Vapor, a tool that automatically abstracts behavioral
RTL Verilog to the CLU language used by the UCLID system. Vapor performs a sound
abstraction with emphasis on minimizing false errors. Our method is fast, systematic, and
complements UCLID by serving as a backend for dealing with UCLID counterexamples.
Preliminary results show the feasibility of automatic abstraction and its utility in formal
verification.
Keywords: Register Transfer Level (RTL), Verilog, Abstraction, Logic of Counter Arithmetic
with Lambda Expressions and Uninterpreted Functions (CLU), UCLID.

14.3 Abstraction Refinement by Controllability and Cooperativeness Analysis [p. 224]

F. Y. C. Mang, P.H. Ho (Synopsys, Inc.)
We present a new abstraction refinement algorithm to better refine the abstract model for formal
property verification. In previous work, refinements are selected either based on a set of counter
examples of the current abstract model, as in [5][6][7][8][9][19][20], or independent of any
counter examples, as in [17]. We (1) introduce a new "controllability" analysis that is
independent of any particular counter examples, (2) apply a new "cooperativeness" analysis that
extracts information from a particular set of counter examples and (3) combine both to better
refine the abstract model. We implemented the algorithm and applied it to verify several real world
designs and properties. We compared the algorithm against the abstraction refinement
algorithms in [19] and [20] and the interpolationbased reachability analysis in [14]. The
experimental results indicate that the new algorithm outperforms the other three algorithms in
terms of runtime, abstraction efficiency (as defined in [19]) and the number of proven properties.
Keywords: Formal Verification, Abstraction Refinement, Controllability, Cooperativeness.

14.4 Verifying A Gigabit Ethernet Switch Using SMV [p. 230]

Y. Lu, M. Jorda (Broadcom Corporation)
We use model checking techniques to verify a switching block in a new Gigabit Ethernet switch
– BCM5690. Due to its dynamic nature, this block has been traditionally difficult to verify.
Formal techniques are far more efficient than simulation for this particular design. Among 26
design errors discovered, 22 are found using formal methods. We then improve our model
checking capability to analyze switch latency. We also use induction to avoid state explosion in
the model checker.

14.5 A General Decomposition Strategy for Verifying Register Renaming [p. 234]

H. I. Shehata, M. D. Aagaard (University of Waterloo)
This paper describes a strategy for verifying datahazard correctness of outoforder processors
that implement registerrenaming. We define a set of predicates to characterize registerrenaming
techniques and provide a set of modelchecking obligations that are sufficient to guarantee that a
registerrenaming technique satisfies datahazard correctness. We demonstrate how two register
renaming techniques (retirementregisterfile and dualRAT) instantiate our predicates, and
present model checking results for the DualRAT technique, which is based on the Intel Pentium
R 4 processor.
Keywords: Register renaming, Formal design verification, Pipelined circuits.
Chair: Faraydon Karim (STMicroelectronics)
Organizers: Adam Donlin, Grant E. Martin

15.1 An Integrated Hardware/Software Approach For RunTime
Scratchpad Management [p. 238]

P. Francesco (University of Bologna), P. Marchal (IMED vzw), D. Atienza (DACYA/UCM),
L. Benini (University of Bologna), F. Catthoor (IMEC vzw), J. M. Mendias (DACYA/UCM)
An ever increasing number of dynamic interactive applications are implemented on portable
consumer electronics. Designers depend largely on operating systems to map these applications
on the architecture. However, today's embedded operating systems abstract away the precise
architectural details of the platform. As a consequence, they cannot exploit the energy efficiency
of scratchpad memories. We present in this paper a novel integrated hardware/software solution
to support scratchpad memories at a high abstraction level. We exploit hardware support to
alleviate the transfer cost from/to the scratchpad memory and at the same time provide a highlevel
programming interface for runtime scratchpad management. We demonstrate the
effectiveness of our approach with a casestudy.
Keywords: Scratchpad, DMA, Dynamic Allocation, AMBA AHB.

15.2 MultiProfile Based Code Compression [p. 244]

E. Wanderley Netto, R. Azevedo, P. Centoducatte (UNICAMP),
G. Araujo (Seoul National University, TIMA Laboratory)
Code compression has been shown to be an effective technique to reduce code size in memory
constrained embedded systems. It has also been used as a way to increase cache hit ratio, thus
reducing power consumption and improving performance. This paper proposes an approach to
mix static/dynamic instruction profiling in dictionary construction, so as to best exploit tradeoffs
in compression ratio/performance. Compressed instructions are stored as variablesize indices
into fixedsize codewords, eliminating compressed code misalignments. Experimental results,
using the Leon (SPARCv8) processor and a program mix from MiBench and Mediabench, show
that our approach halves the number of cache accesses and power consumption while produces
compression ratios as low as 56%.
Keywords: Code compression, compression, code density.

15.3 An Efficient Scalable and Flexible Data Transfer Architecture
for Multiprocessor SoC with Massive Distributed Memory [p. 250]

S.I. Han (Seoul National University), A. Baghdadi (ENST Bretagne), M. Bonaciu (TIMA Laboratory),
S.I. Chae (Seoul National University), A. A. Jerraya (TIMA Laboratory)
Massive data transfer encountered in emerging multimedia embedded applications requires
architecture allowing both highly distributed memory structure and multiprocessor computation
to be handled. The key issue that needs to be solved is then how to manage data transfers
between large numbers of distributed memories. To overcome this issue, our paper proposes a
scalable Distributed Memory Server (DMS) for multiprocessor SoC (MPSoC). The proposed
DMS is composed of: (1) highperformance and flexible memory service access points
(MSAPs), which execute data transfers without intervention of the processing elements, (2) data
network, and (3) control network. It can handle direct massive data transfer between the
distributed memories of an MPSoC. The scalability and flexibility of the proposed DMS are
illustrated through the implementation of an MPEG4 video encoder for QCIF and CIF formats.
The experiments show clearly how DMS can be adapted to accommodate different SoC
configurations requiring various data transfer bandwidths. Synthesis results show that bandwidth
can scale up to 28.8 GB/sec.
Keywords: Multiprocessor SoC, Message passing, Data transfer architecture, Memory Server,
Network on chip, Network Interface.

15.4 OperatingSystem Controlled Network on Chip [p. 256]

V. Nollet, T. Marescaux, D. Verkest, J.Y. Mignolet, S. Vernalde (IMEC vzw)
Managing a NetworkonChip (NoC) in an efficient way is a challenging task. To succeed, the
operating system (OS) needs to be tuned to the capabilities and the needs of the NoC. Only by
creating a tight interaction can we combine the necessary flexibility with the required efficiency.
This paper illustrates such an interaction by detailing the management of communication
resources in a system containing a packetswitched NoC and a closely integrated OS. Our NoC
system is emulated by linking an FPGA to a PDA. We show that, with the right NoC support, the
OS is able to optimize communication resource usage. Additionally, the OS is able to diminish
or remove the interference between independent applications sharing a common NoC
communication resource.
Keywords: Network on Chip, Operating System, MPSoC

15.5 DyAD  Smart Routing for NetworksonChip [p. 260]

J. Hu, R. Marculescu (Carnegie Mellon University)
In this paper, we present and evaluate a novel routing scheme called DyAD which combines the
advantages of both deterministic and adaptive routing schemes. More precisely, we envision a
new routing technique which judiciously switches between deterministic and adaptive routing
based on the network's congestion conditions. The simulation results show the effectiveness of
DyAD by comparing it with purely deterministic and adaptive routing schemes under different
traffic patterns. Moreover, a prototype router based on the DyAD idea has been designed and
evaluated Compared to purely adaptive routers, the overhead of implementing DyAD is
negligible (less than 7%), while the performance is consistently better.
Keywords: NetworksonChip, SystemsonChip, router design
Chair: Alan Naumann (CoWare, Inc.)
Organizer: Ellen M. Sentovich (Cadence Design Systems, Inc.)
Speakers: J. Ahuja (Cadence Design Systems, Inc.), P. Lippe (Silicon Image),
Bernie Rosenthal (Tensilica, Inc.)

Competing in today's economy requires close examination of current business practices and
careful strategies for moving into the future. This session will examine key areas for establishing
competitive edge: globalization, patents and intellectual property management, and strategic
marketing. The talks will cover a broad spectrum of approaches applicable to design, EDA, and
IPbased companies. Conclusions will be drawn about which methods will be most successful for
moving into the next era of electronics. The final portion of the session is devoted to discussion,
debate, and Q&A.
Chair: Lucio Lanza (Lanza techVentures, Inc.)
Organizer: Ellen M. Sentovich (Cadence Design Systems, Inc.)
Speakers: R. Camposano (Synopsys, Inc.), J. Douglas (ReShape, Inc.),
A. Khan (Cadence Design Systems, Inc.)

A variety of business models in design and design automation are being employed today that
have a dramatic effect on the success of individual companies and of multiple industries within
the domain of electronics. The three talks in this session will study models for managing IP,
software licensing, design and EDA services. A variety of approaches, constraints, and case
studies will be presented in each talk, with conclusions about how to make these models most
successful for all parties involved. The final portion of the session is devoted to discussion,
debate, and Q&A.
Chair: Charles J. Alpert (IBM Corporation)
Organizer: Charles J. Alpert

16.1 Timing Closure for LowFO4 Microprocessor Design [p. 265]

D. S. Kung (IBM T.J. Watson Research Center)
In this paper, we discuss timing closure for high performance microprocessor designs.
Aggressive cycle time and deep submicron technology scaling introduce a myriad of problems
that are not present in the ASIC domain. The impact of these problems on floorplanning,
placement, clocking and logic synthesis is described. We present ideas and potential solutions for
tackling these problems.
Keywords: High Performance, FO4, Synthesis, Placement.

16.2 Forest vs. Trees: Where's the Slack? [p. 267]

P. Rodman (ReShape, Inc.)

16.3 Efficient Timing Closure Without Timing Driven Placement and Routing [p. 268]

M. Vujkovic, D. Wadkins (University of Washington), B. Swartz (InternetCAD.com, Inc.),
C. Sechen (University of Washington)
We have developed a design flow from Verilog/VHDL to layout that mitigates the timing closure
problem, while requiring no timing driven placement or routing tools. Timing issues are confined
to the cell sizer, allowing the placement algorithm to focus solely on net lengths, resulting in
superior layout densities and much lower power. The primary enablers to this new technology
are: 1) gridded transistor sizing, 2) variable die routing that allows each net to be routed in the
shortest possible length, 3) simultaneous cell placement, routing, gate sizing, and clock tree
insertion, and 4) an effective incremental (ECO) placement that preserves net lengths from one
iteration to the next. The variable die router is the key enabler. Superior placement and routing
densities result from this new variable die approach. In addition, each net is routed at or near
minimum length, and thus minor placement changes or cell size changes do not materially
impact net lengths.
Keywords: Digital design flow, gate sizing, placement and routing, timing closure.
Chair: Robert Damiano (Synopsys, Inc.)
Organizer: Francine Bacchini (Thinkbold Corporate Communications)
Panelists: B. Bentley (Intel Corp.), K. Baty (WSFDB Consulting), K. Normoyle (Azul Systems, Inc.),
M. Ishii (Sony Corp.), E. Yogev (Cisco Systems, Inc.)

Today's leading chip and system companies are faced with ever increasing design verification
challenges; industry studies reveal that as much as 50% of the total schedule is being spent in
verification. Large companies, with almost infinite resources, have shown that throwing CPU
cycles and people at the simulation problem still doesn't guarantee a level of coverage desired by
the design team.
So, what is the answer? Are assertions, faster simulators, and testbench languages the "holy
grail", or are those just microoptimizations of a methodology that is fundamentally flawed? Is
there hope for a verification methodology that completely covers a design with a predictable
verification schedule?
The panelists will describe their experiences of what works and what doesn't, providing insights
into their methodologies and philosophies.
Chair: Manmut Kandemir (Pennsylvania State University)
Organizers: Heinrich Meyr, Lothar Thiele, Mahmut Kandemir

18.1 Leakage Aware Dynamic Voltage Scaling for RealTime Embedded Systems [p. 275]

R. Jejurikar (University of California at Irvine),
C. Pereira, R. Gupta (University of California at San Diego)
A fivefold increase in leakage current is predicted with each technology generation. While
DynamicVoltage Scaling (DVS) is known to reduce dynamic power consumption, it also causes
increased leakage energy drain by lengthening the interval over which a computation is carried
out. Therefore, for minimization of the total energy, one needs to determine an operating point,
called the critical speed. We compute processor slowdown factors based on the critical speed for
energy minimization. Procrastination scheduling attempts to maximize the duration of idle
intervals by keeping the processor in a sleep/shutdown state even if there are pending tasks,
within the constraints imposed by performance requirements. Our simulation experiments show
that the critical speed slowdown results in up to 5% energy gains over a leakage oblivious
dynamic voltage scaling. Procrastination scheduling scheme extends the sleep intervals to up to 5
times, resulting in up to an additional 18% energy gains, while meeting all timing requirements.
Keywords: leakage power, critical speed, low power scheduling, realtime systems, EDF
scheduling, procrastication.

18.2 Retargetable Profiling for Rapid, Early SystemLevel Design Space Exploration [p. 281]

L. Cai, A. Gerstlauer, D. Gajski (University of California at Irvine)
Fast and accurate estimation is critical for exploration of any design space in general. As we
move to higher levels of abstraction, estimation of complete system designs at each level of
abstraction is needed. Estimation should provide a variety of useful metrics relevant to design
tasks in different domains and at each stage in the design process.
In this paper, we present such a systemlevel estimation approach based on a novel combination
of dynamic profiling and static retargeting. Coestimation of complete system implementations is
fast while accurately reflecting even dynamic effects. Furthermore, retargetable profiling is
supported at multiple levels of abstraction, providing multiple design quality metrics at each
level. Experimental results show the applicability of the approach for efficient design space
exploration.
Keywords: Profiling, Retargetable, System Level Design, Exploration.

18.3 High Level Cache Simulation for Heterogeneous Multiprocessors [p. 287]

J. J. Pieper (Carnegie Mellon University), A. Mellan (STMicroelectronics),
J. M. Paul, D. E. Thomas (Carnegie Mellon University), F. Karim (STMicroelectronics)
As multiprocessor systemsonchip become a reality, performance modeling becomes a
challenge. To quickly evaluate many architectures, some type of highlevel simulation is
required, including highlevel cache simulation. We propose to perform this cache simulation by
defining a metric to represent memory behavior independently of cache structure and back annotate
this into the original application. While the annotation phase is complex, requiring time
comparable to normal address trace based simulation, it need only be performed once per
application set and thus enables simulation to be sped up by a factor of 20 to 50 over trace based
simulation. This is important for embedded systems, as software is often evaluated against many
input sets and many architectures. Our results show the technique is accurate to within 20% of
miss rate for uniprocessors and was able to reduce the die area of a multiprocessor chip by a
projected 14% over a naive design by accurately sizing caches for each processor.
Chair: Wolfgang Roesner (IBM Corporation)
Organizers: Avi Ziv, Yaron Kashi

19.1 CommunicationEfficient Hardware Acceleration for Fast Functional Simulation [p. 293]

Y.I. Kim, W. Yang, Y.S. Kwon, C.M. Kyung (Korea Advanced Institute of Science and Technology)
This paper presents new technology that accelerates system verification. Traditional methods for
verifying functional designs are based on logic simulation, which becomes more timeconsuming
as design complexity increases. To accelerate functional simulation, hardware acceleration is
used to offload calculationintensive tasks from the software simulator. Hardware accelerated
simulation dramatically reduces the simulation time. However, the communication overhead
between the software simulator and hardware accelerator is becoming a new critical bottleneck.
We reduce the communication overhead by exploiting burst data transfer and parallelism, which
are obtained by splitting testbench and moving a part of testbench into hardware accelerator. Our
experiments demonstrated that the proposed method reduces the communication overhead by a
factor of about 40 compared to conventional hardware accelerated simulation while maintaining
the cycle accuracy and compatibility with the original testbench.
Keywords: Functional verification, simulation acceleration, communication overhead

19.2 A Fast Hardware/Software CoVerification Method for SystemOnaChip
by Using a C/C++ Simulator and FPGA Emulator with Shared Register
Communication [p. 299]

Y. Nakamura, K. Hosokawa, I. Kuroda, K. Yoshikawa (NEC Corporation),
T. Yoshimura (Waseda University)
This paper describes a new hardware/software coverification method for SystemOn–aChip,
based on the integration of a C/C++ simulator and an inexpensive FPGA emulator.
Communication between the simulator and emulator occurs via a flexible interface based on
shared communication registers. This method enables easy debugging, rich portability, and high
verification speed, at a low cost. We describe the application of this environment to the
verification of three different complex commercial SoCs, supporting concurrent hardware and
embedded software development. In these projects, our verification methodology was used to
perform complete system verification at 0.21.1 MHz, while supporting full graphical interface
functions such as "waveform" or "signal dump" viewers, and debugging functions such as "step"
or "break".
Keywords: CoVerification, FPGA Emulation, C/C++ Simulator.

19.3 CircuitAware Architectural Simulation [p. 305]

S. Lee, S. Das, V. Bertacco, T. Austin, D. Blaauw, T. Mudge (The University of Michigan)
Architectural simulation has achieved a prominent role in the system design cycle by providing
designers the ability to quickly examine a wide variety of design choices. However, the recent
trend in system design toward architectures that react to circuitlevel phenomena has outstripped
the capabilities of traditional cyclebased architectural simulators. In this paper, we present an
architectural simulator design that incorporates a circuit modeling capability, permitting
architecturallevel simulations that react to circuit characteristics (such as latency, energy, or
current draw) on a cyclebycycle basis. While these additional capabilities slow simulation
speed, we show that the careful application of circuit simulation optimizations and simulation
sampling techniques permit high levels of detail with sufficient speed to examine entire
workloads.
Keywords: Architectural simulation, Highperformance simulation, Circuit simulation
Chair: Ruiqi Tian (Motorola, Inc.)
Organizers: Charlie ChungPing Chen, Martin D. F. Wong

20.1 Toward a Methodology for ManufacturabilityDriven Design Rule Exploration [p. 311]

L. Capodieci (Advanced Micro Devices), P. Gupta, A. B. Kahng (University of California at San Diego),
D. Sylvester, J. Yang (University of Michigan at Ann Arbor)
Resolution enhancement techniques (RET) such as optical proximity correction (OPC) and
phaseshift mask (PSM) technology are deployed in modern processes to increase the fidelity of
printed features, especially critical dimensions (CD) in polysilicon. Even given these exotic
technologies, there has been momentum towards less flexibility in layout, in order to ensure
printability. However, there has not been a systematic study of the performance and
manufacturability impact of such a move towards restrictive design rules. In this paper we
present a design flow that evaluates the application of various restricted design rule (RDR) sets in
deep submicron ASIC designs in terms of circuit performance and parametric yield. Using such a
framework, process and design engineers can identify potential solutions to maximize
manufacturability by selectively applying RDRs while maintaining chip performance. In this
work we focus attention on the device layer which is the most difficult design layer to
manufacture. We quantify the performance, manufacturability and mask cost impact of several
common design rules. For instance, we find that small increases in the minimum allowable poly
line end extension beyond active provide high levels of immunity to lithographic defocus
conditions. Also, modification of the minimum field poly to diffusion spacing can provide good
manufacturability, while a single pitch single orientation design rule can reduce gate 3s
uncertainty. Both of these improve in data volume as well, with little to no performance
penalties. Reductions in data volume and worstcase edge placement error are on the order of 20
30% and 3050% respectively compared to a standard baseline design rule set.
Keywords: Process variation, Lithography, VLSI Manufacturability, OPC, RET, Yield

20.2 Phase Correct Routing for Alternating Phase Shift Masks [p. 317]

K. McCullen (IBM Microelectronics Division)
In this paper, we investigate routing restrictions that enable the generation of "correct by
construction" layouts for Dark Field AltPSM. Existing routers produce designs in which coloring
errors are numerous, and up to 15% of the pin locations in macros may cause unavoidable
coloring errors. We identify routing restrictions which produce 100% phasecorrect results, and
quantify the impact on wirelengths and via counts.
Keywords: Resolution Enhancement Techniques (RET), Lithography, Routing, Layout.

20.3 Toward a SystematicVariation Aware Timing Methodology [p. 321]

P. Gupta (University of California at San Diego), F.L. Heng (IBM T.J. Watson Research Center)
Variability of circuit performance is becoming a very important issue for ultradeep submicron
technology. Gate length variation has the most direct impact on circuit performance. Since many
factors contribute to the variability of gate length, recent studies have modeled the variability
using Gaussian distributions. In reality, the throughpitch and throughfocus variations of gate
length are systematic. In this paper, we propose a timing methodology which takes these
systematic variations into account and we show that such an approach can reduce the timing
uncertainty by up to 40%.
Keywords: Layout, Lithography, OPC, ACLV, Manufacturability

20.4 Selective GateLength Biasing for CostEffective Runtime Leakage Control [p. 327]

P. Gupta, A. B. Kahng, P. Sharma (University of California at San Diego),
D. Sylvester (University of Michigan)
With process scaling, leakage power reduction has become one of the most important design
concerns. Multithreshold techniques have been used to reduce runtime leakage power without
sacrificing performance. In this paper, we propose small biases of transistor gatelength to
further minimize power in a manufacturable manner. Unlike multiVth techniques, gatelength
biasing requires no additional masks and may be performed at any stage in the design process.
Our results show that gatelength biasing effectively reduces leakage power by up to 25% with
less than 4% delay penalty. We show the feasibility of the technique in terms of
manufacturability and pincompatibility for postlayout power optimization. We also show up to
54% reduction in leakage uncertainty due to interdie process variation in circuits when biased
gatelengths, versus only unbiased one, are used. Circuits selectively biased show much less
sensitivity to both intra and inter die variations.
Keywords: Layout, lithography, OPC, leakage, power, manufacturability
Chair: Anirudh Devgan (IBM Corporation)
Organizers: Charlie ChungPing Chen, Sani R. Nassif

21.1 FirstOrder Incremental BlockBased Statistical Timing Analysis [p. 331]

C. Visweswariah (IBM T.J. Watson Research Center),
K. Ravindran (University of California at Berkeley), K. Kalafala (IBM Microelectronics),
S. G. Walker (IBM T.J. Watson Research Center), S. Narayan (IBM Microelectronics)
Variability in digital integrated circuits makes timing verification an extremely challenging task.
In this paper, a canonical first order delay model is proposed that takes into account both
correlated and independent randomness. A novel lineartime blockbased statistical timing
algorithm is employed to propagate timing quantities like arrival times and required arrival times
through the timing graph in this canonical form. At the end of the statistical timing, the
sensitivities of all timing quantities to each of the sources of variation are available. Excessive
sensitivities can then be targeted by manual or automatic optimization methods to improve the
robustness of the design. This paper also reports the first incremental statistical timer in the
literature which is suitable for use in the inner loop of physical synthesis or other optimization
programs. The third novel contribution of this paper is the computation of local and global
criticality probabilities. For a very small cost in CPU time, the probability of each edge or node
of the timing graph being critical is computed. Numerical results are presented on industrial
ASIC chips with over two million logic gates.
Keywords: Statistical timing, incremental, variability

21.2 Fast Statistical Timing Analysis Handling Arbitrary Delay Correlations [p. 337]

M. Orshansky, A. Bandyopadhyay (The University of Texas at Austin),
An efficient statistical timing analysis algorithm that can handle arbitrary (spatial and structural)
causes of delay correlation is described. The algorithm derives the entire cumulative distribution
function of the circuit delay using a new mathematical formulation. Spatial as well as structural
correlations between gate and wire delays can be taken into account. The algorithm can handle
node delays described by nonGaussian distributions. Because the analytical computation of an
exact cumulative distribution function for a probabilistic graph with arbitrary distributions is
infeasible, we find tight upper and lower bounds on the true cumulative distribution. An efficient
algorithm to compute the bounds is based on a PERTlike single traversal of the subgraph
containing the set of N deterministically longest paths. The efficiency and accuracy of the
algorithm is demonstrated on a set of ISCAS'85 benchmarks. Across all the benchmarks, the
average rms error between the exact distribution and lower bound is 0.7%, and the average
maximum error at 95th percentile is 0.6%. The computation of bounds for the largest benchmark
takes 39 seconds.

21.3 STAC: Statistical Timing Analysis with Correlation [p. 343]

J. Le, X. Li, L. T. Pileggi (Carnegie Mellon University),
Current technology trends have led to the growing impact of both interdie and intradie process
variations on circuit performance. While it is imperative to model parameter variations for sub
100nm technologies to produce an upper bound prediction on timing, it is equally important to
consider the correlation of these variations for the bound to be useful. In this paper we present an
efficient blockbased statistical static timing analysis algorithm that can account for correlations
from process parameters and reconverging paths. The algorithm can also accommodate
dominant interconnect coupling effects to provide an accurate compilation of statistical timing
information. The generality and efficiency for the proposed algorithm is obtained from a novel
simplification technique that is derived from the statistical independence theories and principal
component analysis (PCA) methods. The technique significantly reduces the cost for mean,
variance and covariance computation of a set of correlated random variables.
Keywords: Statistical timing, process variation
Chair: Grant Martin (Tensilica, Inc.)
Organizer: Francine Bacchini (Thinkbold Corporate Communications)
Panelists: P. Paulin (STMicroelectronics), R. A. Bergamaschi (IBM), R. Pawate (Texas Instruments, India),
A. Bernstein (Intel Corp., Israel), R. Chandra (Qualcomm), M. BenRomdhane (Conexant)

Systemlevel design is being touted as the Holy Grail that the electronics industry has long
sought, but most offers have been disappointing because they seldom deliver results. Many
designers are fed up with the "Blah, Blah" on systemlevel design as they are waiting for design
facts.
Why? It seems that major breakthroughs are happening thanks to the adoption of standard
directions for modeling design at higher than RTL level. These models are called TLM and new
languages are being adopted (SystemC, SystemVerilog). The emergence of new standards may
reshape completely the way design industry is organized.
This panel will bring six speakers relating their success stories about design starting at the
systemlevel. The format is an educational Panel aimed at informing DAC attendees of the
challenges (difficulties and pitfalls) and opportunities (sizable benefits and lessons learned from
these experiences).
Chair: Patrick H. Madden (University of Kitakyushu)
Organizers: Carl Sechen, Igor L. Markov

23.1 LargeScale Placement by GridWarping [p. 351]

Z. Xiu, J. D. Ma (Carnegie Mellon University), S. M. Fowler (Intel Corp.),
R. A. Rutenbar (Carnegie Mellon University)
Gridwarping is a new placement algorithm based on a strikingly simple idea: rather than move
the gates to optimize their location, we elastically deform a model of the 2D chip surface on
which the gates have been roughly placed, "stretching" it until the gates arrange themselves to
our liking. Put simply: we move the grid, not the gates. Deforming the elastic grid is a
surprisingly simple, low dimensional nonlinear optimization, and augments a traditional quadratic
formulation. A preliminary implementation, WARP1, is already competitive with most recently
published placers, e.g., placements that average 4% better wirelength, 40% faster than
GORDIANLDOMINO.
Keywords: Algorithms, Placement

23.2 Placement Feedback: A Concept and Method for Better MinCut Placements [p. 357]

A. B. Kahng, S. Reda (University of California at San Diego)
The advent of strong multilevel partitioners has made topdown mincut placers a favored
choice for modern placer implementations. We examine terminal propagation, an important step
in mincut placers, because it is responsible for translating partitioning results into global
placement wirelength assumptions. In this work, we identify a previously overlooked problem 
ambiguous terminal propagation  and propose a solution based on the concept of feedback from
automatic control systems. Implementing our approach in Capo (version 8.7 [5, 10]) and
applying it to standard benchmark circuits yields up to 14% wirelength reductions for the IBM
benchmarks and 10% reductions for PEKO instances. Experiments also show consistent
improvements for routed wirelength, yielding up to 9% wirelength reductions with practical
increase in placement runtime. In addition, our method significantly improves routability without
building congestion maps, and reduces the number of vias.
Keywords: mincut placement, terminal propagation, feedback

23.3 QuantumDot Cellular Automata (QCA) Circuit Partitioning:
Problem Modeling and Solutions [p. 363]

D. A. Antonelli, D. Z. Chen, T. J. Dysart, X. S. Hu (University of Notre Dame),
A. B. Kahng (University of California at San Diego),
P. M. Kogge, R. C. Murphy, M. T. Niemier (University of Notre Dame)
This paper presents the QuantumDot Cellular Automata (QCA) physical design problem, in the
context of the VLSI physical design problem. The problem is divided into three subproblems:
partitioning, placement, and routing of QCA circuits. This paper presents an ILP formulation and
heuristic solution to the partitioning problem, and compares the two sets of results. Additionally,
we compare the results of a humangenerated circuit to the ILP and Heuristic formulations. The
results demonstrate that the heuristic is a practical method of reducing partitioning run time
while providing a result that is close to the optimal for a given circuit.
Keywords: Circuit Partitioning, Computer Aided Design, QuantumDot Cellular Automata
(QCA)
Chair: Luca Daniel (Massachusetts Institute of Technology)
Organizers: Byron L. Krauter, Vikram Jandhyala

24.1 PassivityPreserving Model Reduction Via a Computationally
Efficient ProjectAndBalance Scheme [p. 369]

N. Wong (The University of Hong Kong), V. Balakrishnan, C.K. Koh (Purdue University)
This paper presents an efficient twostage projectandbalance scheme for passivitypreserving
model order reduction. Orthogonal dominant eigenspace projection is implemented by
integrating the Smith method and Krylov subspace iteration. It is followed by stochastic balanced
truncation wherein a novel method, based on the complete separation of stable and unstable
invariant subspaces of a Hamiltonian matrix, is used for solving two dual algebraic Riccati
equations at the cost of essentially one. A fastconverging quadrupleshift bulgechasing SR
algorithm is also introduced for this purpose. Numerical examples confirm the quality of the
reducedorder models over those from conventional schemes.
Keywords: Model reduction, Dominant Eigenspace, Projection, Stochastic Balanced Truncation,
Riccati Equation, SR Algorithm

24.2 A Linear Fractional Transform (LFT) Based Model for Interconnect
Parametric Uncertainty [p. 375]

J. M. Wang, O. A. Hafiz (University of Arizona at Tucson), J. Li (ETOP Design Technology)
As we scale toward nanometer technologies, the increase in interconnect parameter variations
will bring significant performance variability. New design methodologies will emerge to
facilitate construction of reliable systems from unreliable nanometer scale components. Such
methodologies require new performance models which accurately capture the manufacturing
realities. In this paper, we present a Linear Fractional Transform (LFT) based model for
interconnect Parametric Uncertainty. This new model formulates the interconnect parameter
uncertainty as a repeated scalar uncertainty structure. With the help of generalized Balanced
Truncation Realization (BTR) based on Linear Matrix Inequalities (LMI's), the new model
reduces the order of the original interconnect network while preserves the stability. This paper
also shows that the LFT based model even guarantees passivity if the BTR reduction is based on
solutions to a pair of Linear Matrix Inequalities (LMI's) which generalizes Lur'e equations .
Keywords: Parametric Uncertainty, Linear Fractional Transform

24.3 Variational Delay Metrics for Interconnect Timing Analysis [p. 381]

K. Agarwal, D. Sylvester, D. Blaauw (University of Michigan), F. Liu, S. Nassif (IBM Research),
S. Vrudhula (University of Arizona)
In this paper we develop an approach to model interconnect delay under process variability for
timing analysis and physical design optimization. The technique allows for closedform
computation of interconnect delay probability density functions (PDFs) given variations in
relevant process parameters such as linewidth, metal thickness, and dielectric thickness. We
express the resistance and capacitance of a line as a linear function of random variables and then
use these to compute circuit moments. Finally, these variabilityaware moments are used in
known closedform delay metrics to compute interconnect delay PDFs. We compare the
approach to SPICE based Monte Carlo simulations and report an error in mean and standard
deviation of delay of 1% and 4% on average, respectively.

24.4 Exploiting Input Infromation in a Model Reduction Algorithm
for Massively Coupled Parasitic Networks [p. 385]

L. M. Silveira (Technical University of Lisbon), J. R. Phillips (Cadence Design Systems)
In this paper we present a model reduction algorithm that circumvents some of the issues
encountered for parasitic networks with large numbers of input/output "ports". Our approach is
based on the premise that for such networks, there are typically strong dependencies between the
input waveforms at different network "ports". We present an approximate truncated balanced
realizations procedure that, by exploiting such correlation information, produces much more
compact models compared to standard algorithms such as PRIMA.
Keywords: Model order reduction, interconnect, parasitic.
Chair: Heinrich Meyr (RWTH Aachen/CoWare, Incorporated)
Organizer: Mahmut Kandemir

25.1 Automatic Translation of Software Binaries onto FPGAs [p. 389]

G. Mittal, D. C. Zaretsky, X. Tang, P. Banerjee (Northwestern University)
The introduction of advanced FPGA architectures, with builtin DSP support, has given DSP
designers a new hardware alternative. By exploiting its inherent parallelism, it is expected that
FPGAs can outperform DSP processors. This paper describes the process and considerations for
automatically translating binaries targeted for general DSP processors into Register Transfer
Level (RTL) VHDL or Verilog code to be mapped onto commercial FPGAs. The Texas
Instruments C6000 DSP processor architecture is chosen as the DSP processor platform, and the
Xilinx Virtex II as a target FPGA. Various optimizations are discussed, including data
dependency analysis, procedure extraction, induction variable analysis, memory optimizations,
and scheduling. Experimental results on resource usage and performance are shown for ten
software binary benchmarks. Results show performance gains of 320X in the FPGA designs
over that of the DSP processors in terms of reductions of execution cycles.
Keywords: Reconfigurable computing, compiler, hardwaresoftware codesign, binary
translation, decompilation.

25.2 AreaEfficient Instruction Set Synthesis for Reconfigurable
SystemonChip Designs [p. 395]

P. Brisk, A. Kaplan, M. Sarrafzadeh (University of California at Los Angeles)
Silicon compilers are often used in conjunction with Field Programmable Gate Arrays (FPGAs)
to deliver flexibility, fast prototyping, and accelerated timetomarket. Many of these compilers
produce hardware that is larger than necessary, as they do not allow instructions to share
hardware resources. This study presents an efficient heuristic which transforms a set of custom
instructions into a single hardware datapath on which they can execute. Our approach is based on
the classic problems of finding the longest common subsequence and substring of two (or more)
sequences. This heuristic produces circuits which are as much as 85.33% smaller than those
synthesized by integer linear programming (ILP) approaches which do not explore resource
sharing. On average, we obtained 55.41% area reduction for pipelined datapaths, and 66.92%
area reduction for VLIW datapaths. Our solution is simple and effective, and can easily be
integrated into an existing silicon compiler.
Keywords: Compiler, FieldProgrammable Gate Array (FPGA), Integer Linear Programming
(ILP), Resource Sharing

25.3 Data Compression for Improving SPM Behavior [p. 401]

O. Ozturk, M. Kandemir (Pennsylvania State University), I. Demirkiran (Syracuse University),
G. Chen, M. J. Irwin (Pennsylvania State University)
Scratchpad memories (SPMs) enable fast access to timecritical
data. While prior research studied both static and dynamic SPM
management strategies, not being able to keep all hot data (i.e.,
data with high reuse) in the SPM remains the biggest problem.
This paper proposes data compression to increase the number of
data blocks that can be kept in the SPM. Our experiments with
several embedded applications show that our compressionbased
SPM management heuristic is very effective and outperforms prior
static and dynamic SPM management approaches. We also present
an ILP formulation of the problem, and show that the proposed
heuristic generates competitive results with those obtained through
ILP, while spending much less time in compilation.
Keywords
scratchpad memory, compilers, data compression
Chair: Sujit Dey (University of California at San Diego)
Organizer: Sujit Dey

26.1 Platform Based Design: Does it Answer the Entire SoC Challenge? [p. 407]

G. Smith (Gartner Dataquest)
During the mid1990s, design fads emerged as the design challenge was becoming significant
and the EDA vendors were entered the methodology business. Trying to market methodologies
helped produce today’s unfortunate design fad phenomenon.

26.2 Nomadic Platform Approach for Wireless Mobile Multimedia [p. 408]

M. Hopkins (STMicroelectronics, Inc.)

26.3 Benefits and Challenges for PlatformBased Design [p. 409]

A. SangiovanniVincentelli, L. Carloni (University of California at Berkeley),
F. De Bernardinis (University of California at Berkeley and Università di Pisa),
M. Sgroi (DoCoMo EuroLabs)
Platforms have become an important concept in the design of electronic systems. We present
here the motivations behind the interest shown and the challenges that we have to face to make
the Platformbased Design method a standard. As a generic term, platforms have meant different
things to different people. The main challenges are to distill the essence of the method, to
formalize it and to provide a framework to support its use in areas that go beyond the original
domain of application.

26.4 Trends in the Use of ReConfigurable Platforms [p. 415]

M. Baron (Microprocessor Report)
Chair: Rajeev Murgai (Fujitsu Laboratories)
Organizers: Marek Perkowski, Soha Hassoun

27.1 A Recursive Paradigm to Solve Boolean Relations [p. 416]

D. Bañeres, J. Cortadella (Universidad Politécnica de Catalunya), M. Kishinevsky (Intel Corp.)
A recursive algorithm for solving Boolean relations is presented.
It provides several features: wide exploration of solutions,
parametrizable cost function and efficiency. The experimental results
show the applicability of the method and tangible improvements
with regard to previous heuristic approaches.
Keywords: Boolean relations, decomposition, logic design.

27.2 A Robust Algorithm for Approximate Compatible Observability
Don't Care (CODC) Computation [p. 422]

N. Saluja, S. P. Khatri (University of Colorado)
Compatible Observability Don't Cares (CODCs) are a powerful means to express the flexibility
present at a node in a multilevel logic network. Despite their elegance, the
applicability of CODCs has been hampered by their computational complexity. The CODC
computation for a network involves several image computations, which require the construction
of global BDDs of the circuit nodes. The size of BDDs of circuit nodes is unpredictable, and as a
result, the CODC computation is not robust. In practice, CODCs cannot be computed for large
circuits due to this limitation.
In this paper, we present an algorithm to compute approximate CODCs (ACODCs). This
algorithm allows us to compute compatible don't cares for significantly larger designs. Our
ACODC algorithm is scalable in the sense that the user may trade off time and memory against
the accuracy of the ACODCs computed. The ACODC is computed by considering a subnetwork
rooted at the node of interest, up to a certain topological depth, and performing its don't care
computation. We prove that the ACODC is an approximation of its CODC.
We have proved the soundness of the approach, and performed extensive experiments to explore
the tradeoff between memory utilization, speed and accuracy. We show that even for small
topological depths, the ACODC computation gives very good results. Our experiments
demonstrate that our algorithm can compute ACODCs for circuits whose CODC computation
has not been demonstrated to date. Also, for a set of benchmark circuits whose CODC
computation yields an average 28% reduction in literals after optimization, our ACODC
computation yields an average 22% literal reduction. Our algorithm has runtimes which are
about 25X and memory utilization which is 33 X better that of the CODC computation of SIS.
Keywords: Compatible Observability Don't Cares (CODC), Multilevel Logic Optimization,
Logic Synthesis

27.3 A Method to Decompose MultipleOutput Logic Functions [p. 428]

T. Sasao, M. Matsuura (Kyushu Institute of Technology)
This paper shows a method to decompose a given multipleoutput
circuit into two circuits with intermediate outputs.
We use a BDD for characteristic function (BDD for CF)
to represent a multipleoutput function. Many benchmark
functions were realized by LUT cascades with intermediate
outputs. Especially, adders and a binary to BCD converter
were successfully designed. Comparison with FPGAs is also
presented.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids
General Terms
Algorithms, Performance, Experimentation, Theory
Keywords
Cascade, BDD, Characteristic function, FPGA

27.4 Symmetry Detection for Incompletely Specified Functions [p. 434]

K.H. Wang, J.H. Chen (Fu Jen Catholic University)
In this paper, we formulate symmetry detection for incompletely specified functions as an
equation without using cofactor computation and equivalence checking. Based on this equation,
a symmetry detection algorithm is proposed. This algorithm can simultaneously find nonequivalence
and equivalence symmetries. Experimental results on a set of benchmarks show that
our algorithm is indeed very effective in solving symmetry detection problem for incompletely
specified functions.
Keywords: Equivalence Symmetry, NonEquivalence Symmetry

27.5 Implicit Enumeration of Structural Changes in Circuit Optimization [p. 438]

V. N. Kravets, P. Kudva (IBM T.J. Watson Research Center)
We describe an implicit technique for enumerating structural choices in circuit optimization. The
restructuring technique relies on the symbolic statements of functional decomposition which
explores behavioral equivalence of circuit signals through rewiring and resubstitution. Using
rigid, yet practical, formulation a rich variety of restructuring candidates is computed
symbolically and applied incrementally to produce circuit changes with predictable structural
effects. The restructuring technique is used to obtain much improved delays of the already
optimized circuits along with their area savings. It is also applied to analyze benefits of
optimizing circuit topology at the early steps of synthesis targeting its routability.
Keywords: Decomposition, Technology Mapping, Physical Synthesis, Resynthesis,
Optimization
Chair: John Cohn (IBM Corporation)
Organizers: Michael Orshansky, Sudhakar Bobba

28.1 Parametric Yield Estimation Considering Leakage Variability [p. 442]

R. R. Rao (University of Michigan), A. Devgan (IBM Corporation),
D. Blaauw, D. Sylvester (University of Michigan)
Leakage current has become a stringent constraint in modern processor designs in addition to
traditional constraints on frequency. Since leakage current exhibits a strong inverse correlation
with circuit delay, effective parametric yield prediction must consider the dependence of leakage
current on frequency. In this paper, we present a new chiplevel statistical method to estimate the
total leakage current in the presence of withindie and dietodie variability. We develop a
closedform expression for total chip leakage that models the dependence of the leakage current
distribution on a number of process parameters. The model is based on the concept of scaling
factors to capture the effects of withindie variability. Using this model, we then present an integrated
approach to accurately estimate the yield loss when both frequency and power limits are
imposed on a design. Our method demonstrates the importance of considering both these limiters
in calculating the yield of a lot.
Keywords: Leakage, variability, parametric yield

28.2 A Methodology to Improve Timing Yield in the Presence of Process Variations [p. 448]

S. Raj, S. B. K. Vrudhula, J. Wang (University of Arizona)
The ability to control the variations in IC fabrication process is rapidly diminishing as feature
sizes continue towards the sub100 nm regime. As a result, there is an increasing uncertainty in
the performance of CMOS circuits. Accounting for the worst case values of all parameters will
result in an unacceptably low timing yield. Design for Variability, which involves designing to
achieve a given level of confidence in the performance of ICs, is fast becoming an indispensable
part of IC design methodology. This paper1 describes a method to identify certain paths in the
circuit that are responsible for the spread of timing performance. The method is based on
defining a disutility function of the gate and path delays, which includes both the means and
variances of the delay random variables. Based on the moments of this disutility function, an
algorithm is presented which selects a subset of paths (called undominated paths) as being most
responsible for the variation in timing performance. Next, a statistical gate sizing algorithm is
presented, which is aimed at minimizing the delay variability of the nodes in the selected paths
subject to constraints on the critical path delay and the area penalty. MonteCarlo simulations
with ISCAS'85 benchmark circuits shows that our statistical optimization approach results in
significant improvements in timing yield over traditional deterministic sizing methods.
Keywords: Timing Analysis, Timing Yield, Gate Sizing.

28.3 Novel Sizing Algorithm for Yield Improvement under Process
Variation in Nanometer Technology [p. 454]

S. H. Choi (Intel Corporation), B. C. Paul, K. Roy (Purdue University)
Due to process parameter variations, a large variability in circuit delay occurs in scaled
technologies affecting the yield. In this paper, we propose a sizing algorithm to ensure the speed
of a circuit under process variation with a certain degree of confidence while maintaining the
area and power budget within a limit. This algorithm estimates the variation in circuit delay
using statistical timing analysis considering both inter and intradie process variation and resizes
the circuit to achieve a desired yield. Experimental results on several benchmark circuits show
that one can achieve up to 19% savings in area (power) using our algorithm compared to the
worstcase design.

28.4 Statistical Timing Analysis Based on a Timing Yield Model [p. 460]

F. N. Najm (University of Toronto), N. Menezes (Intel Corporation)
Starting from a model of the withindie systematic variations using principal components
analysis, a model is proposed for estimation of the parametric yield, and is then applied to
estimation of the timing yield. Key features of these models are that they are easy to compute,
they include a powerful model of withindie correlation, and they are “fullchip” models in the
sense that they can be applied with ease to circuits with millions of components. As such, these
models provide a way to do statistical timing analysis without the need for detailed statistical
analysis of every path in the design.
Keywords: Statistical timing analysis, timing yield, principal components
Chair: Stephen A. Edwards (Columbia University)
Organizers: Reinaldo A. Bergamaschi, Stephen A. Edwards

29.1 System Design for DSP Applications in Transaction Level Modeling Paradigm [p. 466]

A. K. Deb, A. Jantsch, J. Öberg (Royal Institute of Technology)
In this paper, we systematically define three transaction level models (TLMs), which reside at
different levels of abstraction between the functional and the implementation model of a DSP
system. We also show a unique language support to build the TLMs. Our results show that the
abstract TLMs can be built and simulated much faster than the implementation model at the
expense of a reasonable amount of simulation accuracy.
Keywords: Transaction level modeling, system design, DSP, grammar

29.2 An Analytical Approach for Dynamic Range Estimation [p. 472]

B. Wu, J. Zhu, F. N. Najm (University of Toronto)
It has been widely recognized that the dynamic range information of an application can be
exploited to reduce the datapath bitwidth of either processors or ASICs, and therefore the overall
circuit area, delay and power consumption. While recent proposals of analytical dynamic range
estimation methods have shown significant advantages over the traditional profilingbased
method in terms of runtime, we argue that the rather simplistic treatment of input correlation may
lead to significant error. We instead introduce a new analytical method based on a mathematical
tool called KarhunenLoéve Expansion (KLE), which enables the orthogonal decomposition of
random processes. We show that when applied to linear systems, this method can not only lead
to much more accurate result than previously possible, thanks to its capability to capture and
propagate both spatial and temporal correlation, but also richer information than the value
bounds previously produced, which enables the exploration of interesting tradeoff between
circuit performance and signaltonoise ratio.
Keywords: Dynamic range, bitwidth, KarhunenLoéve Expansion (KLE), correlation in the
following text), has to be obtained.

29.3 Automated FixedPoint DataType Optimization Tool
for Signal Processing and Communication Systems [p. 478]

C. Shi, R. W. Brodersen (University of California at Berkeley)
A tool that automates the floatingpoint to fixedpoint conversion (FFC) process for digital signal
processing systems is described. The tool automatically optimizes fixedpoint data types of
arithmetic operators, including overflow modes, integer word lengths, fractional word lengths,
and the number systems. The approach is based on statistical modeling, hardware resource
estimation and global optimization based on an initial structural system description. The basic
technique exploits the fact that the fixed point realization is a weak perturbation of the floating
point realization which allows the development of a system model which can be used in the
optimization process.
Keywords: Fixedpoint arithmetic, optimization, communication systems, digital signal
processing, FPGA, ASIC

29.4 An Algorithm for Converting FloatingPoint Computations
to FixedPoint in MATLAB based FPGA Design [p. 484]

S. Roy, P. Banerjee (Northwestern University)
Most practical FPGA designs of digital signal processing applications are limited to fixedpoint
arithmetic owing to the cost and complexity of floatingpoint hardware. While mapping DSP
applications onto FPGAs, a DSP algorithm designer, who often develops his applications in
MATLAB, must determine the dynamic range and desired precision of input, intermediate and
output signals in a design implementation to ensure that the algorithm fidelity criteria are met.
The first step in a flow to map MATLAB applications into hardware is the conversion of the
floatingpoint MATLAB algorithm into a fixedpoint version. This paper describes an approach
to automate this conversion, for mapping to FPGAs by profiling the expected inputs to estimate
errors. Our algorithm attempts to minimize the hardware resources while constraining the
quantization error within a specified limit.
Keywords: Quantization, Quantizer, Fixed point, Floating Point

29.5 Synthesizing Interconnect Efficient Low Density Parity Check Codes [p. 488]

M. Mohiyuddin, A. Prakash, A. Aziz (The University of Texas at Austin), W. Wolf (Princeton University)
Error correcting codes are widely used in communication and storage applications. Codec
complexity has usually been measured with a software implementation in mind. A recent
hardware implementation of a Low Density Parity Check code (LDPC) indicates that
interconnect complexity dominates the VLSI cost. We describe a heuristic interconnectaware
synthesis algorithm which generates LDPC codes that use an order of magnitude less wiring with
little or no loss of coding efficiency.
Keywords: Error correcting codes, lowdensity paritycheck codes, routing congestion.
Chair: Paolo Prinetto (Politecnico di Torino)
Organizers: TM Mak, Yervant Zorian

30.1 On PathBased Learning and Its Applications in Delay Test and Diagnosis [p. 492]

L.C. Wang (University of California at Santa Barbara), T. M. Mak (Intel Corporation),
K.T. Cheng (University of California at Santa Barbara), M. S. Abadir (Motorola, Inc.)
This paper describes the implementation of a novel pathbased learning methodology that can be
applied for two purposes: (1) In a presilicon simulation environment, pathbased learning can be
used to produce a fast and approximate simulator for statistical timing simulation. (2) In postsilicon
phase, pathbased learning can be used as a vehicle to derive critical paths based on the
pass/fail behavior observed from the test chips. Our pathbased learning methodology consists of
four major components: a delay test pattern set, a logic simulator, a set of selected paths as the
basis for learning, and a machine learner. We explain the key concepts in this methodology and
present experimental results to demonstrate its feasibility and applications.
Keywords: Delay test, Statistical timing simulation, Machine learning

30.2 Efficient OnLine Testing of FPGAs with Provable Diagnosabilities [p. 498]

V. Verma (Xilinx, Inc.), S. Dutt, V. Suthar (University of Illinois at Chicago)
We present novel and efficient methods for online testing in FPGAs. The testing approach uses
a ROving TEster (ROTE), which has provable diagnosabilities and is also faster than prior
FPGA testing methods. We present 1 and 2diagnosable builtin selftester (BISTer) designs
that make up the ROTE, and that avoid expensive adaptive diagnosis. To the best of our
knowledge, this is the first time that a BISTer design with diagnosability greater than one has
been developed for FPGAs. We also develop functional testing methods that test PLBs in only
two circuit functions that will be mapped to them (as opposed to testing PLBs in all their
operational modes) as the ROTE moves across a functioning FPGA. Simulation results show that
our 1diagnosable BISTer and our functional testing technique leads to significantly more
accurate (98% (90.5%) fault coverage at a fault/defect density of 10% (25%)) and faster testanddiagnosis
of FPGAs than achieved by previous work. In general, it is expected that ROTE will
achieve high fault coverages at fault/defect densities of up to 25% using our 1diagnosable
BISTer and up to 33% using our 2diagnosable BISTer. Our methods should thus prove useful
for testing current very deep submicron FPGAs as well as future nanoCMOS and molecular
nanotechnology FPGAs in which defect densities are expected to be in the 10% range.
Keywords and Phrases: builtin selftester (BISTer), diagnosability, FPGAs, functional testing,
online testing, roving tester (ROTE).

30.3 On Test Generation for Transition Faults with Minimized Peak
Power Dissipation [p. 504]

W. Li, S. M. Reddy (University of Iowa), I. Pomeranz (Purdue University)
This paper presents a method of generating tests for transition faults using tests for stuckat faults
such that the peak power is the minimum possible using a given set of tests for stuckat faults.
The proposed method is suitable for use in testing scan designs that employ enhanced scan. The
method reduces the peak power consumption in benchmark circuits by 19% on the average with
essentially the same test set size and the same fault coverage compared to an earlier method.
Keywords: Power Dissipation, Test Generation, Transition Faults

30.4 A New State Assignment Technique for Testing and Low Power [p. 510]

S. Park, S. Cho (Hanyang University at Ansan), S. Yang (Pusan University),
M. Ciesielski (University of Massachusetts)
In order to improve the testabilities and power consumption, a new state assignment technique
based on mblock partition is introduced in this paper. The length and number of feedback cycles
are reduced with minimal switching activity on the state variables. Experiment shows significant
improvement in power dissipation and testabilities for benchmark circuits.
Keywords: State Encoding, Fault Coverage, Low power, Scan design.

30.5 Automatic Generation of Breakpoint Hardware for Silicon Debug [p. 514]

B. Vermeulen (Philips Research Laboratories), M. Z. Urfianto (Royal Institute of Technology),
S. K. Goel (Philips Research Laboratories)
Scanbased silicon debug is a technique that can be used to help find design errors in prototype
silicon more quickly. One part of this technique involves the inclusion of breakpoint modules
during the design stage of the chip. This paper focuses on an innovative approach to
automatically generate breakpoint modules by means of a breakpoint description language. This
language is illustrated using an example, and experimental results are presented that show the
efficiency and effectiveness of this new method for generating breakpoint hardware.
Chair: Aarti Gupta (NEC Corporation)
Organizers: PeiHsin Ho, Tony Ma

31.1 AMUSE: A MinimallyUnsatisfiable Subformula Extractor [p. 518]

Y. Oh, M. N. Mneimneh, Z. S. Andraus, K. A. Sakallah, I. L. Markov (University of Michigan)
This paper describes a new algorithm for extracting unsatisfiable subformulas from a given
unsatisfiable CNF formula. Such unsatisfiable "cores" can be very helpful in diagnosing the
causes of infeasibility in large systems. Our algorithm is unique in that it adapts the “learning
process” of a modern SAT solver to identify unsatisfiable subformulas rather than search for
satisfying assignments. Compared to existing approaches, this method can be viewed as a
bottomup core extraction procedure which can be very competitive when the core sizes are
much smaller than the original formula size. Repeated runs of the algorithm with different
branching orders yield different cores. We present experimental results on a suite of large
automotive benchmarks showing the performance of the algorithm and highlighting its ability to
locate not just one but several cores.
Keywords: Minimallyunsatisfiable subformula, (MUS), conjunctive normal form (CNF),
Boolean satisfiability (SAT), diagnosis.

31.2 A SATBased Algorithm for Reparameterization in Symbolic Simulation [p. 524]

P. Chauhan, E. M. Clarke, D. Kroening (Carnegie Mellon University)
Parametric representations used for symbolic simulation of circuits usually use BDDs. After a
few steps of symbolic simulation, state set representation is converted from one parametric
representation to another smaller representation, in a process called reparameterization. For large
circuits, the reparametrization step often results in a blowup of BDDs and is expensive due to a
large number of quantifications of input variables involved. Efficient SAT solvers have been
applied successfully for many verification problems. This paper presents a novel SATbased
reparameterization algorithm that is largely immune to the large number of input variables that
need to be quantified. We show experimental results on large industrial circuits and compare our
new algorithm to both SATbased Bounded Model Checking and BDD based symbolic
simulation. We were able to achieve on average 3x improvement in time and space over BMC
and able to complete many examples that BDD based approach could not even finish.
Keywords: Symbolic Simulation, SAT checkers, Bounded Model Checking, Parametric
Representation, Safety Property Checking

31.3 Exploiting Structure in Symmetry Detection for CNF [p. 530]

P. T. Darga, M. H. Liffiton, K. A. Sakallah, I. L. Markov (University of Michigan)
Instances of the Boolean satisfiability problem (SAT) arise in many areas of circuit design and
verification. These instances are typically constructed from some humandesigned artifact, and
thus are likely to possess much inherent symmetry and sparsity. Previous work [4] has shown
that exploiting symmetries results in vastly reduced SAT solver run times, often with the search
for the symmetries themselves dominating the total SAT solving time. Our contribution is
twofold. First, we dissect the algorithms behind the venerable nauty [9] package, particularly the
partition refinement procedure responsible for the majority of search space pruning as well as the
majority of run time overhead. Second, we present a new symmetrydetection tool, saucy, which
outperforms nauty by several orders of magnitude on the large, structured CNF formulas
generated from typical EDA problems.
Keywords: Boolean satisfiability (SAT), symmetry, abstract algebra, backtrack search, partition
refinement, graph automorphism.

31.4 Refining the SAT Decision Ordering for Bounded Model Checking [p. 535]

C. Wang, H. Jin, G. D. Hachtel, F. Somenzi (University of Colorado)
Bounded Model Checking (BMC) relies on solving a sequence of highly correlated Boolean
satisfiability (SAT) problems, each of which checks for the existence of counterexamples of a
bounded length. The performance of SAT search depends heavily on the variable decision
ordering. We propose an algorithm to exploit the correlation among different SAT problems in
BMC, by predicting and successively refining a partial variable ordering. This ordering is based
on the analysis of all previous unsatisfiable instances, and is combined with the SAT solver's
existing decision heuristic to determine the final variable decision ordering. Experiments on real
designs from industry show that our new method improves the performance of SATbased BMC
significantly.
Keywords: Bounded Model checking, SAT, decision heuristic

31.5 Efficient Equivalance Checking with Partitions and Hierarchical Cutpoints [p. 539]

D. Anastasakis, L. McIlwain (Synopsys, Inc.), S. Pilarski (University of Washington)
Previous results show that both flat and hierarchical methodologies present obstacles to
effectively completing combinational equivalence checking. A new approach that combines the
benefits while effectively dealing with the pitfalls of both styles of equivalence checking is
presented.
Keywords: Logic Design, Verification, Equivalence Checking
Chair: William H. Joyner, Jr. (IBM Corporation/SRC)
Organizers: Shishpal Rawat, William H. Joyner, Jr.
Panelists: J. Darringer (IBM Corporation), D. Gajski (University of California at Irvine),
P. O. Pistilli (MP Associates), H. De Man (IMEC), C. Harris (Kluwer Academic Publishers),
J. Solomon (Independent Technology Advisor)

A long, long time ago, in a laboratory far, far away, EDA researchers and developers used paper
tape instead of Linux, rubylith instead of GDS II, yellow wires instead of ten levels of metal.
Sitting around a potbellied stove, in their rocking chairs, practitioners of that era (and this) will
offer insight into why some great ideas were immediately put into practice while others stayed
on the drawing board or in the ivory tower. They will share remembrances of things past, of
simpler days when foundries made steel, when options meant CMOS or bipolar, when real parts
were measured instead of benchmarks touted. Their stories of what it was like, what has
changed, and whether the "good old days" were then or now will be followed by questions and,
possibly, answers.
Chair: Sujit Dey (University of California at San Diego)
Organizers: Sujit Dey

33.1 Offchip LatencyDriven Dynamic Voltage and Frequency
Scaling for an MPEG Decoding [p. 544]

K. Choi, R. Soma, M. Pedram (University of Southern California)
This paper describes a dynamic voltage and frequency scaling (DVFS) technique for MPEG
decoding to reduce the energy consumption using the computational workload decomposition.
This technique decomposes the workload for decoding a frame into onchip and offchip
workloads. The execution time required for the onchip workload is CPU frequencydependent,
whereas the offchip workload execution time does not change, regardless of the CPU frequency,
resulting in the maximum energy savings by setting the minimum frequency during offchip
workload execution time, without causing any delay penalty. This workload decomposition is
performed using a performancemonitoring unit (PMU) in the XScaleprocessor, which provides
various statistics such as cache hit/miss and CPU stall, due to data dependency at run time. The
onchip workload for an incoming frame is predicted using a framebased history so that the
processor voltage and frequency can be scaled to provide the exact amount of computing power
needed to decode the frame. To guarantee a quality of service (QoS) constraint, a prediction error
compensation method, called interframe compensation, is proposed in which the onchip
workload prediction error is diffused into subsequent frames such that run time frame rates
change smoothly. The proposed DVFS algorithm has been implemented on an XScalebased
Testbed. Detailed current measurements on this platform demonstrate significant CPU energy
savings ranging from 50% to 80% depending on the video clip.
Keywords: Low power, MPEG decoding, voltage and frequency scaling

33.2 EnergyAware Deterministic Fault Tolerance in Distributed
RealTime Embedded Systems [p. 550]

Y. Zhang (Duke University), R. Dick (Northwestern University), K. Chakrabarty (Duke University)
We investigate a unified approach for fault tolerance and dynamic power management in
distributed realtime embedded systems. Coordinated checkpointing is used to achieve fault
tolerance, and power management is carried out using dynamic voltage scaling. We present
feasibilityofscheduling tests for coordinated checkpointing schemes for a constant processor
speed as well as for DVSenabled processors that can operate at variable speeds. Simulation
results based on the CORDS hardware/software cosynthesis system show that, compared to
faultoblivious methods, the proposed approach significantly reduces power consumption while
guaranteeing timely task completion in the presence of faults.
Keywords: Realtime systems, fault tolerance, checkpointing, voltage scaling

33.3 Proxybased Task Partitioning of Watermarking Algorithms
for Reducing Energy Consumption in Mobile Devices [p. 556]

A. Kejariwal, S. Gupta, A. Nicolau, N. Dutt (University of California at Irvine),
R. Gupta (University of California at San Diego)
Digital watermarking is a process that embeds an imperceptible signature or watermark in a
digital file containing audio, image, text or video data. The watermark is later used to
authenticate the data file and for tamper detection. It is particularly valuable in the use and
exchange of digital media such as audio and video on emerging handheld devices. However,
watermarking is computationally expensive and adds to the drain of the available energy in
handheld devices. We present an approach in which we partition the watermarking embedding
and extraction algorithms and migrate some tasks to a proxy server. This leads to a lower energy
consumption on the handheld without compromising the security of the watermarking process.
Our results show that executing watermarking partitioned between the proxy and the handheld
reduces the total energy consumed by 80% over running it only on the handheld and improves
performance by over two orders of magnitude.
Keywords: Handhelds, Proxy, Partitioning, Watermarking

33.4 Adaptive Data Partitioning for Ambient Multimedia [p. 562]

X. Hu, R. Marculescu (Carnegie Mellon University)
In the near future, Ambient Intelligence (AmI) will become part of everyday life. Combining
featurerich multimedia with AmI (dubbed Ambient Multimedia for short) has the potential of
changing the way we perceive and interact with our environment. One major difficulty, however,
in designing Ambient Multimedia Systems (AMS) comes from the strong constraints imposed on
system resources by the AmI application requirements. In this paper, we propose a method for
mapping multimedia applications on systems with very limited resources (i.e. memory,
computing capability and battery lifetime) by combining adaptive data partitioning with dynamic
power management. The potential of the approach is illustrated through a case study of an object
tracking application running on a resourceconstrained platform.
Keywords: Ambient Intelligence, motion detection, power management

33.5 Energy Characterization of Filesystems for Diskless Embedded Systems [p. 566]

S. Choudhuri (University of California at Irvine), R. N. Mahapatra (Texas A & M University)
The need for low power, small formfactor, secondary storage devices in embedded systems has
led to the widespread use of flash memory. Energy consumption due to processor and flash for
such devices is critical to embedded system design. In this paper, we have proposed a
quantitative account of energy consumption in both processor and flash due to overhead of
filesystem related system calls. A macromodel for such energy consumption is derived using
linear regression analysis. The results describing filesystem energy consumption have been
obtained from Linux Kernel running Journaling Flash Filesystem 2 (JFFS2) and Extended 3
(Ext3) filesystems on StrongARM processor with flash as secondary storage device. Armed with
such a macromodel, a designer can choose and partition filesystem, estimate the application
energy consumption (due to processor and flash) due to filesystem during the early stage of
system design.
Keywords: diskless, flash, macromodel
Chair: Marios Papaefthymiou (University of Michigan)
Organizers: James C. Hoe, Leon Stok, Marios Papaefthymiou

34.1 A Method for Correcting the Functionality of a WirePipelined Circuit [p. 570]

V. Nookala, S. S. Sapatnekar (University of Minnesota)
As acrosschip interconnect delays can exceed a clock cycle, wire pipelining becomes essential
in high performance designs. Although it allows higher clock frequencies, it may change the
microarchitecture altogether because of the arbitrary increase in the latencies of the paths and
cycles of the circuit. This paper proposes a method to regain the functionality of a wirepipelined
circuit. In this approach, increased cycle latencies are compensated by slowing down the issue
rate of the inputs. Our method finds the optimal value of the slowdown required for a circuit as it
directly affects the throughput of the circuit. We also incorporate area minimization in our
formulation to minimize the number of extra flipflops added to the circuit. The formulation is
tested on circuits derived from ISCAS benchmarks and the results suggest that wire pipelining
increases the overall throughput in most of the cases.
Keywords: Wire pipelining, Synchronous design

34.2 A New Approach to Latency Insensitive Design [p. 576]

M. R. Casu, L. Macchiarulo (Politecnico di Torino/CERCOM)
Latency Insensitive Protocols have been proposed as a viable mean to speed up large SystemsonChip where the limit in clock frequency is given by long global wires connecting together
functional blocks. In this paper we keep the philosophy of Latency Insensitive Design and show
that a drastic simplification can be done that results in even no need to implement any kind of
protocol. By using a scheduling algorithm for the functional blocks activation we greatly reduce
the routing resources demand of the old protocol, the area occupied by the sequential elements
used to pipeline long interconnects and the complexity of the gating structure used to activate the
modules.
Keywords: Interconnections, Pipelining, SystemonChip

34.3 Prelayout Wire Length and Congestion Estimation [p. 582]

Q. Liu, M. MarekSadowska (University of California at Santa Barbara)
In this paper, we study the prelayout wire length and congestion estimation. We find that two
structural metrics, mutual contraction and net range, can be used to predict wire lengths. These
metrics have different application ranges and complement each other. We also propose a new
metric, the structural pin density, to capture the peak routing congestion of designs. Larger
maximum pin densities usually lead to larger peak congestions in circuits with similar average
congestions. We demonstrate experimentally very good correlation of our prelayout measures
with post layout interconnect lengths. We also isolate the structural netlist properties which cause
the peak congestion.
Keywords: Wire Length, Congestion, Prediction

34.4 The Best of Both Worlds: The Efficient Asynchronous Implementation
of Synchronous Specifications [p. 588]

A. Davare (University of California at Berkeley), K. Lwin (Cadence Design Systems),
A. Kondratyev (Cadence Berkeley Labs.),
A. SangiovanniVincentelli (University of California at Berkeley)
The desynchronization approach combines a traditional synchronous specification style with a
robust asynchronous implementation model. The main contribution of this paper is the
description of two optimizations that decrease the overhead of desynchronization. First, we
investigate the use of clustering to vary the granularity of desynchronization. Second, by
applying temporal analysis on a formal execution model of the desynchronized design, we
uncover significant amounts of timing slack. These methods are successfully applied to industrial
RTL designs.
Keywords: Desynchronization, Separation Analysis, Clustering

34.5 Fast Hazard Detection in Combinational Circuits [p. 592]

C. Jeong, S. M. Nowick (Columbia University)
In designing asynchronous circuits it is critical to ensure that circuits are free of hazards in the
specified set of input transitions. In this paper, two new algorithms for detecting hazards in
combinational circuits are proposed. These algorithms can be used to determine if the circuit is
free of combinational hazards without exploring all the gates in the circuits, thus providing more
efficient hazard detection. Experimental results indicate that the best new algorithm on average
visits only 20.7% of the original gates, with an average runtime speedup of 1.69 and best
speedup of 2.27 (for the largest example).
Keywords: Logic Design, Asynchronous circuits, Hazards.
Chair: Petru Eles (Linkoping University)
Organizers: Ahmed A. Jerraya, Gila Kamhi

35.1 Defect Tolerant Probabilistic Design Paradigm for Nanotechnologies [p. 596]

M. Jacome, C. He, G. de Veciana, S. Bijansky (The University of Texas at Austin)
Recent successes in the development and selfassembly of nanoelectronic devices suggest that
the ability to manufacture dense nanofabrics is on the near horizon. However, the tremendous
increase in device density of nanoelectronics will be accompanied by a substantial increase in
hard and soft faults, posing a major challenge to current design methodologies and tools. In this
paper we propose a novel probabilistic design paradigm for defective but reconfigurable
nanofabrics. The new design goal is to devise an appropriate structural/behavioral decomposition
which improves scalability by constraining the reconfiguration process, while meeting a desired
probability of successful instantiation, i.e, yield. Our approach not only addresses the scalability
problem in configuring dense nanofabrics subject to defects, but gives a rich framework in which
critical tradeoffs among performance, yield, and per chip cost can be explored. We present a
concrete instance of the approach and show extensive experimental results supporting these
claims.
Keywords: Nanotechnologies, probabilistic design, defect tolerance

35.2 ArchitectureLevel Synthesis for Automatic Interconnect Pipelining [p. 602]

J. Cong, Y. Fan, Z. Zhang (University of California at Los Angeles)
For multigigahertz synchronous designs in nanometer technologies, multiple clock cycles are
needed to cross the global interconnects, thus making it necessary to have pipelined global
interconnects. In this paper we present an architecturelevel synthesis solution to support
automatic pipelining of onchip interconnects. Specifically, we extend the recently proposed
Regular Distributed Register (RDR) microarchitecture to support interconnect pipelining. We
formulate a novel global interconnect sharing problem for global wiring minimization and show
that it is polynomial time solvable by transformation to a special case of the realtime scheduling
problem. Experimental results show that our approach matches or exceeds the RDRbased
approach in performance, with a significant wiring reduction of 15% to 21%.
Keywords: Highlevel synthesis, multicycle communication, interconnect pipelining,
scheduling, register binding

35.3 Automatic Generation of Equivalent Architecture Model from Functional
Specification [p. 608]

S. Abdi, D. Gajski (University of California at Irvine)
This paper presents an algorithm for automatic generation of an architecture model from a
functional specification, and proves its correctness. The architecture model is generated by
distributing the intended system functionality over various components in the platform
architecture. We then define simple transformations that preserve the execution semantics of
system level models. Finally, the model generation algorithm is proved correct using our
transformations. As a result, we have an automated path from a functional model of the system to
an architectural one and we need to debug and verify only the functional specification model,
which is smaller and simpler than the architecture model. Our experimental results show
significant savings in both the modeling and the validation effort.
Keywords: System level design, Formal verification, Model Refinement

35.4 DivideandConcatenate: An Architecture Level Optimization
Technique for Universal Hash Functions [p. 614]

B. Yang, R. Karri (Polytechnic University), D. A. McGrew (Cisco Systems, Inc.)
We present an architecture optimization technique called divideandconcatenate for universal
hash functions. The area of a multiplier increases quadratically and its speed increases gradually
with the operand size and two universal hash functions are equivalent if they have the same
collision probability property. Based on these observations, the divideandconcatenate approach
divides a 2wbit data path (with collision probability 22w) into two wbit data paths (each with
collision probability 2w), applies one message word to these two wbit data paths and
concatenates their results to construct an equivalent 2wbit data path (with collision probability
22w). We demonstrate this technique on Linear Congruential Hash (LCH) family. When
compared to the 100% overhead associated with duplicating a straightforward 32bit LCH data
path, the divideandconcatenate approach that uses four equivalent 8bit data paths yields a
101% increase in throughput with only 52% hardware overhead.

35.5 Performance Analysis of Different Arbitration Algorithms
of the AMBA AHB Bus [p. 618]

M. Conti, M. Caldari, G. B. Vece, S. Orcioni, C. Turchetti (Università Politecnica delle Marche)
Bus performances are extremely important in a platformbased design. System Level analysis of
bus performances gives important information for the analysis and choice between different
architectures driven by functional, timing and power constraints of the SystemonChip. This
paper presents the effect of different arbitration algorithms and bus usage methodologies on the
bus AMBA AHB performances in terms of effective throughput and power dissipation. SystemC
and VHDL models have been developed and simulations have been performed.
Keywords: AMBA AHB BUS, SystemC, Arbitration Algorithm, Performance
SPECIAL SESSION 36:
BioMEMS
Chair: Andrew B. Kahng (University of California at San Diego)
Organizer: Jacob K. White

36.1 Design Tools for BioMEMS [p. 622]

T. Korsmeyer, J. Zeng, K. Greiner (Coventor, Inc.)
Microsystems used for chemical analyses and biological assays are termed BioMEMS or labsonachip. These systems often require some of the traditional electromechanical capabilities of
MEMS, and in addition require the manipulation of fluids in either continuous flow or droplet
form. The distinction between continuous flow and droplets defines two broad categories of
BioMEMS. Different applications call for one or the other of these approaches, but in either case,
software for design and simulation can make a significant contribution to design optimization
and reduction in time to market. A computer aided design and analysis approach is presented in
which systemlevel analysis is favored over detailed analysis, although it is shown that this is not
always possible, nor preferred. Examples of the use of design and analysis software in
BioMEMS development are presented including: electrostatic actuation, a labonachip for
separation, onchip optics, a digital fluidic processor, electrospray ionization, and a twostage
chemical reaction.
Keywords: MEMS, BioMEMS, labonachip, µTAS, CAD, systemlevel modeling, BEM,
FEM

36.3 CAD Challenges in BioMEMS Design [p. 629]

J. White (Massachusetts Institute of Technology)
The ability of micromachining, or MEMS, technology to directly manipulate micron and
nanometersized objects makes it an idea technology for a wide range of biological and
biomedical applications, and has led to a subfield in bioMEMs design. Although BioMEMS is
attracting substantial attention and research, development has been hampered by the lack of
computeraided design tools. The available tools are far behind those for integrated circuit
design, and therefore successful bioMEMS designers require very sophisticated processing
expertise. It is the purpose of this paper to encourage research in this rapidly evolving computer
aided design field, by providing the briefest of summaries and an extensive set of pointers to
literature.
Chair: Rob A. Rutenbar (Carnegie Mellon University)
Organizer: Rob A. Rutenbar
Panelists: T. Bonaccio (IBM Corp.), T. Meng (Stanford University), E. Perea (STMicroelectronics),
R. Pitts (Texas Instruments, Inc.), C. Sodini (Massachusetts Institute of Technology),
J. Wieser (National Semiconductor Corp.)

Once upon a time there was a wise and benevolent ruler whose Law multiplied his subjects'
wealth and happinessabout 2X, every couple of years, but the kingdom was divided. Those in
the happy hamlet of Digital got fatter (and faster), year after year. Not so the talented artisans in
the town of Analog complained constantly about "voltage headroom", "variability", "noise",
"matching", "kT/C limits", the rising costs of supporting their neighbors' insatiable addiction to
shrinking transistors, and how the grass looked greener just over the border, in SiliconGermania.
So, what's a King to do? Will we see billion transistor chips with integrated RF made from
transistors that are 25 atoms wide? Or will the peasants in the land of Analog really revolt?
Chair: Malgorzata MarekSadowska (University of California at Santa Barbara)
Organizers: ChungKuan Cheng, Frank M. Johannes

38.1 ProfileGuided Microarchitectural Floorplanning for Deep
Submicron Processor Design [p. 634]

M. Ekpanyapong, J. R. Minz (Georgia Institute of Technology),
T. Watewai (University of California at Berkeley),
H.H. S. Lee, S. K. Lim (Georgia Institute of Technology)
As process technology migrates to deep submicron with feature size less than 100nm, global
wire delay is becoming a major hindrance in keeping the latency of intrachip communication
within a single cycle, thus decaying the performance scalability substantially. An effective
floorplanning algorithm can no longer ignore the information of dynamic communication
patterns of applications. In this paper, using the profile information acquired at the
architecture/microarchitecture level, we propose a "profileguided microarchitectural floorplanner"
that considers both the impact of wire delay and the architectural behavior, namely the intermodule
communication, to reduce the latency of frequent routes inside a processor and to
maintain performance scalability. Based on our simulation results, the profileguided method
shows a 5% to 40% average IPC improvement when clock frequency is fixed. From the
perspective of instruction throughput (in BIPS), our floorplanner is much more scalable than a
conventional wire length based floorplanner.
Keywords: Microarchitectural Planning, Floorplanning, Computer Architecture.

38.2 Floorplanning Optimization with Trajectory PiecewiseLinear
Model for Pipelined Interconnects [p. 640]

C. Long, L. J. Simonson, W. Liao, L. He (University of California at Los Angeles)
Interconnect pipelining has a great impact on system performance, but has not been considered
by automatic floorplanning. Considering interconnect pipelining, we study the floorplanning
optimization problem to minimize system CPI (cycles per instruction) and in turn maximize
system performance. We develop an efficient tablebased model called trajectory piecewise
linear (TPWL) model to estimate CPI with interconnect pipelining. Experiments show that the
TPWL model differs from cycleaccurate simulations by less than 3.0%. We integrate this model
with a simulatedannealing based floorplan optimization to obtain CPIaware floorplanning.
Compared to the conventional floorplanning to minimize area and wire length, our CPIaware
floorplanning can reduce CPI by up to 28.6% with a small area overhead of 5.69% under 100nm
technology and obtain better results under 70nm technology. To the best of our knowledge, this
paper is the first indepth study on floorplanning optimization with consideration of interconnect
pipelining.

38.3 A Packing Algorithm for NonManhattan Hexagon/Triangle
Placement Design by Using an Adaptive Otree Representation [p. 646]

J. Li, T. Yan, B. Yang, J. Yu (University of Electronic Science and Technology of China),
C. Li (Cadence Design Systems)
A nonManhattan Hexagon/Triangle Placement (HTP for short) paradigm is proposed in the
present paper. Main feature of this paradigm lies in adapting to the Y architecture which is one
of the promising nonManhattan VLSI circuit layout architectures. Aim of the HTP is to place a
set of equilateral triangles with given size onto a hexagonal chip with maximal chip area usage.
Based on the Otree representation, some adaptive packing rules are adopted to develop an
effective placement algorithm for solving the HTP problem in BBL mode. Two examples with
benchmark data transformed from the Manhattan BBL mode placement (ami33/49) are presented
to justify the feasibility and effectiveness of our algorithms. Experiment results demonstrate that
the chip area usage of 94% is achieved through simulated annealing optimization.
Keywords: VLSI circuit physical design, nonManhattan layout, Yarchitecture, Otree
representation, placement, diagonal wiring
Chair: David Hathaway (IBM Corp.)
Organizers: Kenneth L. Shepard, Sudhakar Bobba

39.1 WorstCase Circuit Delay Taking into Account Power Supply Variations [p. 652]

D. Kouroussis, R. Ahmadi, F. N. Najm (University of Toronto)
Current Static Timing Analysis (STA) techniques allow one to verify the timing of a circuit at
different process corners which only consider cases where all the supplies are low or high. This
analysis may not give the true maximum delay of a circuit because it neglects the possible
mismatch between drivers and loads. We propose a new approach for timing analysis in which
we first identify the critical path(s) of a circuit using a powersupplyaware timing model. Given
these critical paths, we then take into account how the power nodes of the gates on the critical
path are connected to the power grid, and reanalyze for the worstcase time delay. This reanalysis
is posed as an optimization problem where the complete operation of the entire circuit is
abstracted in terms of current constraints. We present our technique and report on the
implementation results using benchmark circuits tied to a number of testcase power grids.
Keywords: Static timing analysis, power grid, voltage fluctuations

39.2 Statistical Gate Delay Model Considering Multiple Input Switching [p. 658]

A. Agarwal (University of Michigan), F. Dartu (Intel Corporation), D. Blaauw (University of Michigan)
There is an increased dominance of intradie process variations, creating a need for an accurate
and fast statistical timing analysis. Most of the recent proposed approaches assume a Single
Input Switching model. Our experiments show that SIS underestimates the mean delay of a stage
by up to 20% and overestimates the standard deviation up to 26%. We also show that Multiple
Input Switching has a greater impact on statistical timing, than regular static timing analysis.
Hence, we propose a modeling technique for gate delay variability, considering MIS. Our model
can be efficiently incorporated into most of the statistical timing analysis frameworks. On
average over all test cases, our approach underestimates mean delay of a stage by 0.01% and
overestimates the standard deviation by only 2%, hence increasing the robustness to process
variations. Our modeling technique is independent of the deterministic MIS model, and we show
that its sensitivity to variations in the MIS model is small.

39.3 Static Timing Analysis using Backward Signal Propagation [p. 664]

D. Lee (University of Michigan), V. Zolotov (Motorola, Inc.), D. Blaauw (University of Michigan)
In this paper, we address the problem of signal pruning in static timing analysis (STA).
Traditionally, signals are propagated through the circuit and are pruned, such that only the signal
with the latest arrival time at each node is propagated forward. This signal pruning is a key to the
linear run time of STA. However, it was previously observed that a signal with the latest arrival
time may not be the most critical signal, as an earlier signal with a larger transition time can
result in a longer delay in the downstream logic. Hence, arrival time based pruning can result in
an optimistic delay, incorrect critical paths, and discontinuities of the delay during circuit
optimization. Although algorithms were proposed to remedy this issue, they rely on propagation
of multiple signals and have an exponential worstcase complexity. In this paper, we propose a
new timing analysis algorithm, which uses a two pass traversal of the circuit. In the initial
backward traversal, we construct delay tables which record the required time at a node as a
function of the transition time at that node. This is followed by a forward traversal where signals
are pruned not based on arrival times but based on slack. The proposed algorithm corrects the
accuracy problems of the arrival time based pruning while at the same time maintaining the
linear run time of STA. We implemented our algorithm and demonstrated its accuracy and
efficiency.
Chair: Grant E. Martin (Tensilica, Inc.)
Organizer: Grant E. Martin

40.1 Design and Implementation of the POWER5™ Microprocessor [p. 670]

J. Clabes, J. Friedrich, M. Sweet, J. DiLullo, S. Chu, D. Plass, J. Dawson, P. Muench, L. Powell,
M. Floyd, B. Sinharoy, M. Lee, M. Goulet, J. Wagoner, N. Schwartz, S. Runyon, G. Gorman
(IBM Systems Group), P. Restle (IBM Research), R. Kalla, J. McGill, S. Dodson (IBM Systems Group)
POWER5 offers significantly increased performance over previous POWER designs by
incorporating simultaneous multithreading, an enhanced memory subsystem, and extensive RAS
and power management support. The 276M transistor processor is implemented in 130nm
silicononinsulator technology with 8level of Cu metallization and operates at >1.5 GHz.
Keywords: POWER5, Microprocessor Design, Simultaneous Multithreading (SMT),
Temperature Sensor, Power Reduction, Clock Gating

40.2 A DualCore 64b UltraSPARC Microprocessor for Dense Server Applications [p. 673]

T. Takayanagi, J. L. Shin, B. Petrick, J. Su, A. S. Leon (Sun Microsystems, Inc.)
A processor core, previously implemented in a 0.25 μm Al process, is redesigned for a 0.13μm
Cu process to create a dualcore processor with 1MB integrated L2 cache, offering an efficient
performance/power ratio for computedense server applications. Deep submicron circuit design
challenges, including negative bias temperature instability (NBTI), leakage and coupling noise,
and L2 cache implementation are discussed.
Keywords: Multiprocessor, Dualcore, UltraSPARC, Dense server, Throughput computing,
Deep submicron technology, cache, leakage, Negative Bias Temperature Instability, NBTI,
Coupling noise, L2

40.3 Low Voltage Swing Logic Circuits for a Pentium® 4 Processor Integer Core [p. 678]

D. J. Deleganes, M. Barany, G. Geannopoulos, K. Kreitzer, A. P. Singh, S. Wijeratne (Intel Corporation)
The Pentium® 4 processor architecture uses a 2x frequency core clock[1] to implement low
latency integer ops. Low Voltage Swing logic circuits implemented in 90nm technology[2] meet
the frequency demands of a third generation integercore design.
Keywords: Pentium® 4 processor, integer core, senseamp, adder, rotator, microprocessor, Low
Voltage Swing, LVS
Chair: Grant E. Martin (Tensilica, Incorporation)
Organizer: Grant E. Martin

41.1 The Future Multiprocessor SystemsonChips [p. 681]

W. Wolf (Princeton University)
This paper surveys the stateoftheart and pending challenges in MPSoC design. Standards in
communications, multimedia, networking, and other areas encourage the development of high performance
platforms that can support a range of implementations of the standard. A
multiprocessor systemonchip includes embedded processors, digital logic, and mixedsignal
circuits combined into a heterogeneous multiprocessor. This mix of technologies creates a major
challenge for MPSoC design teams. We will look at some existing MPSoC designs and then
describe some hardware and software challenges for MPSoC designers.
Keywords: MPSOC, systemonchip, realtime, low power, embedded software.

41.2 Heterogeneous MPSoC  The Solution to EnergyEfficient Signal Processing [p. 686]

T. Kogel (CoWare, Inc.), H. Meyr (CoWare, Inc. and Aachen University of Technology)
To meet conflicting flexibility, performance and cost constraints of demanding signal processing
applications, future designs in this domain will contain an increasing number of application
specific programmable units combined with complex communication and memory
infrastructures. Novel architecture trends like Application Specific Instructionset Processors
(ASIPs) as well as customized buses and NetworkonChip based communication promise
enormous potential for optimization.
However, stateoftheart tooling and design practice is not in a shape to take advantage of this
advances in computer architecture and silicon technology. Currently, EDA industry develops two
diverging strategies to cope with the design complexity of such application specific,
heterogeneous MPSoC platforms. First, the Ip driven approach emphasizes the composition of
MPSoC platforms from configurable offtheshelf Intellectual Property blocks. On the other
hand, the designdriven approach strives to take design efficiency to the required level by use of
system level design methodologies and IP generation tools.
In this paper, we discuss technical and economical aspects of both strategies. Based on the
analysis of recent trends in computer architecture and system level design, we envision a handinhand approach of signal processing platform architectures and design methodology to conquer
the complexity crisis in emerging MPSoC developments.
Keywords: MPSoC, Design Space Exploration, Signal Processing, Energy Efficiency,
NetworkonChip

41.3 Flexible Architectures for Engineering Successful SOCs [p. 692]

C. Rowen, S. Leibson (Tensilica, Inc.)
This paper focuses on a particular SOC design technology and methodology, here called the
advanced or processorcentric SOC design method, which reduces the risk of SOC design and
increases ROI by using configurable processors to implement onchip functions while increasing
the SOC's flexibility through software programmability. The essential enabler for this design
methodology is automatic processor generation—the rapid and easy creation of new
microprocessor architectures, complete with efficient hardware designs and comprehensive
software tools. The high speed of the generation process and the great flexibility of the generated
architectures underpin a fundamental shift of the role of processors in system architecture.
Keywords: RTL, SOC, MPSOC, processor cores, RISC
Chair: Andrew B. Kahng (University of California at San Diego)
Organizers: Rich Goldman, Kurt Keutzer
Panelists: C. Bittlestone (Texas Instruments, Inc.), A. Bootehsaz (Synopsys, Inc.), S. Y. Borkar (Intel Corp.),
E. Chen (TSMC), L. Scheffer (Cadence Design Systems, Inc.), C. Visweswariah (IBM Corp.)

Process variations  which affect critical electrical parameters and lead to both random and
systematic changes in circuit performance – have always posed significant challenges to
semiconductor design. In the past, withindie process variation was relatively small, and methods
such as cornerbased analysis were sufficient. This allowed timing analysis tools to calculate
delays, slew times, coupling and power in a straightforward way. Today, the International
Technology Roadmap for Semiconductors suggests that the semiconductor industry's historical
ability to control process variations is under siege, for both devices and interconnects. As
statistical variation increases, will cornercasing lead to too much conservatism, and hence a
requirement for new statistical timing and noise analysis tools? In other words, is the design flow
inevitably moving to "delay is no longer a number; it's a distribution"? Or are the urgency and
the advantages of statistical timing analysis overstated?
Chair: Bill Halpin (Intel Corporation)
Organizers: Carl Sechen, Phiroze Parakh

43.1 Modeling Repeaters Explicitly Within Analytical Placement [p. 699]

P. Saxena (Intel Labs), B. Halpin (Synplicity, Inc.)
Recent works have shown that scaling causes the number of repeaters to grow rapidly. We
demonstrate that this growth leads to massive placement perturbations that break the
convergence of today’s interleaved placement and repeater insertion flows. We then present two
new force models for repeaters targeted towards analytical placement algorithms. Our
experiments demonstrate the effectiveness of our repeater modeling technique in preserving
placement convergence (often also accompanied by wirelength improvement) at the 45 and 32
nm technology nodes.
Keywords: Analytical placement, Buffering, Forcedirected placement, Interconnect, Placement,
Repeater insertion, Scaling.

43.2 Quadratic Placement Using an Improved Timing Model [p. 705]

B. Obermeier, F. M. Johannes (Technical University of Munich)
The performance of timingdriven placement methods depends strongly on the choice of the net
model. In this paper a more precise net model is presented that does not increase numerical
complexity. We introduce a method that replaces the clique model of a net by a tree model in the
quadratic placement formulation. This improvement enables us to control the length of every tree
segment separately. Furthermore, we present an analysis of the effects of every tree segment to
the net delay. The result is in turn used to control the placement engine. Our presented results are
based on legal placements. They show significant improvements over stateofthe art methods.
Keywords: timing driven placement, Quadratic placement, Steiner tree net model, sensitivity,
optimization potential

43.3 An Approach to PlacementCoupled Logic Replication [p. 711]

M. Hrkic, J. Lillis, G. Beraudo (University of Illinois at Chicago)
We present a set of techniques for placementcoupled, timingdriven logic replication. Two
components are at the core of the approach. First is an algorithm for optimal timingdriven fanin
tree embedding; the algorithm is very general in that it can easily incorporate complex objective
functions (e.g., placement costs) and can perform embedding on any graphbased target. Second
we introduce the Replication Tree which allows us to induce large fanin trees from a given
circuit which can then be optimized by the embedder. We have built an optimization engine
around these two ideas and report promising results for the FPGA domain including clock period
reductions of up to 36% compared with a timingdriven placement from VPR [12] and almost
double the average improvement of local replication from [1]. These results are achieved with
modest area and runtime overhead.
Keywords: Timing Optimization, Logic Replication, Placement, Programmable Logic
Chair: Masaharu Imai (Osaka University)
Organizers: Joachim Gerlach, Margarida Jacome

44.1 A Novel Approach for Flexible and Consistent ADLdriven ASIP Design [p. 717]

G. Braun, A. Nohl (CoWare, Inc.),
W. Sheng, J. Ceng, M. Hohenauer, H. Scharwächter, R. Leupers, H. Meyr (Institute for Integrated Systems)
Architecture description languages (ADL) have been established to aid the design of applicationspecific
instructionset processors (ASIP). Their main contribution is the automatic generation of
a software toolkit, including C compiler, assembler, linker, and instructionset simulator. Hence,
the challenge in the design of such ADLs is to unambiguously capture the architectural
information required for the toolkit generation in a single model. This is particularly difficult for
C compiler and simulator, as both require information about the instructions’ semantics,
however, while the C compiler needs to know what an instructions does, the simulator needs to
know how. Existing ADLs solve this problem by either introducing redundancy or by limiting
the language’s flexibility.
This paper presents a novel, mixedlevel approach for ADLbased instructionset description,
which offers maximum flexibility while preventing from inconsistencies. Moreover, it enables
capturing instruction and cycleaccurate descriptions in a single model. The feasibility and
design efficiency of our approach is demonstrated with a number of contemporary, realworld
processor architectures.
Keywords: ADL, ASIP, Embedded Processors

44.2 Characterizing Embedded Applications for InstructionSet
Extensible Processors [p. 723]

P. Yu, T. Mitra (National University of Singapore)
Extensible processors, which allow customization for an application domain by extending the
core instruction set architecture, are becoming increasingly popular for embedded systems.
However, existing techniques restrict the set of possible candidates for custom instructions by
imposing a variety of constraints. As a result, the true extent of performance improvement
achievable by extensible processors for embedded applications remains unknown. Moreover, it is
unclear how the interplay among these restrictions impacts the performance potential. Our
careful examination of this issue shows that significant speedup can only be obtained by relaxing
some of the constraints to a reasonable extent. In particular, to the best of our knowledge, ours is
the first work that studies the impact of relaxing control flow constraint by identifying instructions
across basic blocks and indicates 5148% relative speedup for different applications.
Keywords: Customization processors, Instructionset extensions.

44.3 Introduction of Local Memory Elements in Instruction Set Extensions [p. 729]
P. Biswas (University of California at Irvine),

V. Choudhary, K. Atasu, L. Pozzi, P. Ienne (Swiss Federal Institute of Technology),
N. Dutt (University of California at Irvine)
Automatic generation of Instruction Set Extensions (ISEs), to be executed on a custom
processing unit or a coprocessor is an important step towards processor customization. A typical
goal of a manual designer is to combine a large number of atomic instructions into an ISE
satisfying microarchitectural constraints. However, memory operations pose a challenge for
previous ISE approaches by limiting the size of the resulting instruction. In this paper, we
introduce memory elements into custom units which result in ISEs closer to those sought after by
the designers. We consider two kinds of memory elements for mapping to the specialized
hardware: small hardware tables and architecturallyvisible state registers. We devised a genetic
algorithm to specifically exploit opportunities of introducing memory elements during ISE
generation. Finally, we demonstrate the effectiveness of our approach by a detailed study of the
variation in performance, area and energy in the presence of the generated ISEs, on a number of
MediaBench, EEMBC and cryptographic applications. With the introduction of memory, the
average speedup varied from 2.7X to 5X depending on the architectural configuration with a
nominal area overhead. Moreover, we obtained an average energy reduction of 26% with respect
to a 32KB cache.
Keywords: Customizable processors, ASIPs, Instruction Set Extensions, Adhoc Functional
Units, Coprocessors, Genetic Algorithm.
Chair: Katherine Compton (University of Wisconsin)
Organizers: Jens Palsberg, Scott Hauck

45.1 FPGA Power Reduction Using Configurable DualVdd [p. 735]

F. Li, Y. Lin, L. He (University of California at Los Angeles)
Power optimization is of growing importance for FPGAs in nanometer technologies. Considering
dualVdd technique, we show that configurable power supply is required to obtain a satisfactory
performance and power tradeoff. We design FPGA circuits and logic fabrics using configurable
dualVdd and develop the corresponding CAD flow to leverage such circuits and logic fabrics.
We then carry out a highly quantitative study using area, delay and power models obtained from
detailed circuit design and SPICE simulation in 100nm technology. Compared to singleVdd
FPGAs with optimized Vdd level for the same target clock frequency, configurable dualVdd
FPGAs with full and partial supply programmability for logic blocks reduce logic power by
35.46% and 28.62% respectively and reduce total FPGA power by 14.29% and 9.04%
respectively. To the best of our knowledge, it is the first indepth study on FPGAs with
configurable dualVdd for power reduction.
Keywords: FPGA, low power, power efficient, dualVdd, configurable

45.2 MultiResource Aware Partitioning Algorithms for FPGAs
with Heterogeneous Resources [p. 741]

N. Selvakkumaran (University of Minnesota), A. Ranjan, S. Raje (HierDesign Inc.),
G. Karypis (University of Minnesota)
As FPGA densities increase, partitioningbased FPGA placement approaches are becoming
increasingly important as they can be used to provide highquality and computationally scalable
placement solutions. However, modern FPGA architectures incorporate heterogeneous resources,
which place additional requirements on the partitioning algorithms because they now need to not
only minimize the cut and balance the partitions, but also they must ensure that none of the
resources in each partition is oversubscribed. In this paper, we present a number of multilevel
multiresource hypergraph partitioning algorithms that are guaranteed to produce solutions that
balance the utilization of the different resources across the partitions. We evaluate our algorithms
on twelve industrial benchmarks ranging in size from 5,236 to 140,118 vertices and show that
they achieve minimal degradation in the mincut while balancing the various resources.
Comparing the quality of the solution produced by some of our algorithms against that produced
by hMETIS, we show that our algorithms are capable of balancing the different resources while
incurring only a 3.3%–5.7% higher cut.
Keywords: Partitioning, Placement, multiconstraint, multiresource, FPGA, hierarchical

45.3 An SoC Design Methodology Using FPGAs and Embedded Microprocessors [p. 747]

N. Ohba, K. Takano (IBM Japan Ltd.)
In System on Chip (SoC) design, growing design complexity has forced designers to start
designs at higher abstraction levels. This paper proposes an SoC design methodology that makes
full use of FPGA capabilities. Design modules in different abstraction levels are all combined
and run together in an FPGA prototyping system that fully emulates the target SoC. The higher
abstraction level design modules run on microprocessors embedded in the FPGAs, while lowerlevel
synthesizable RTL design modules are directly mapped onto FPGA reconfigurable cells.
We made a hardware wrapper that gets the embedded microprocessors to interface with the fully
synthesized modules through IBM CoreConnect buses. Using this methodology, we developed
an image processor SoC with cryptographic functions, and we verified the design by running real
firmware and application programs. For the designs that are too large to be fit into an FPGA,
dynamic reconfiguration method is used.
Keywords: SoC, ASIC, FPGA prototyping, and mixedlevel verification.
Chair: Srivaths Ravi (NEC Corporation)
Organizer: Srivaths Ravi
P. Kocher (Cryptography Research), R. Lee (Princeton University),
G. McGraw (Cigital), A. Raghunathan, S. Ravi (NEC Laboratories America)


The growing number of instances of breaches in information security in the last few years has
created a compelling case for efforts towards secure electronic systems. Embedded systems,
which will be ubiquitously used to capture, store, manipulate, and access data of a sensitive
nature, pose several unique and interesting security challenges. Security has been the subject of
intensive research in the areas of cryptography, computing, and networking. However, security is
often misconstrued by embedded system designers as the addition of features, such as specific
cryptographic algorithms and security protocols, to the system. In reality, it is an entirely new
metric that designers should consider throughout the design process, along with other metrics
such as cost, performance, and power.
This paper is intended to introduce embedded system designers and design tool developers to the
challenges involved in designing secure embedded systems. We attempt to provide a unified
view of embedded system security by first analyzing the typical functional security requirements
for embedded systems from an enduser perspective. We then identify the implied challenges for
embedded system architects, as well as hardware and software designers (e.g., tamperresistant
embedded system design, processing requirements for security, impact of security on battery life
for batterypowered systems, etc.). We also survey solution techniques to address these
challenges, drawing from both current practice and emerging research, and identify open
research problems that will require innovations in embedded system architecture and design
methodologies.
Keywords: Embedded Systems, Security, Cryptography, Security Protocols, Security
Processing, Design, Design Methodologies, Architectures, Tamper Resistance, Software Attacks,
Viruses, Trusted Computing, Digital Rights Management, Performance, Battery Life
Chair: Farid N. Najm (University of Toronto)
Organizers: Enrico Macii, Naehyuck Chang

47.1 Tradeoffs between Gate Oxide Leakage and Delay for Dual Tox Circuits [p. 761]

A. K. Sultania (University of Minnesota), D. Sylvester (University of Michigan),
S. S. Sapatnekar (University of Minnesota)
Gate oxide tunneling current (Igate) will become the dominant component of leakage in CMOS
circuits as the physical oxide thickness (Tox) goes below 15Å. Increasing the value of Tox reduces
the leakage at the expense of an increase in delay, and a practical tradeoff between delay and
leakage can be achieved by assigning one of the two permissible Tox values to each transistor. In
this paper, we propose an algorithm for dual Tox assignment to optimize the total leakage power
under delay constraints, and generate a leakage/delay tradeoff curve. As compared to the case
where all transistors are set to low Tox, our approach achieves an average leakage reduction of
83% under 100nm models.
Keywords: Leakage power, Dual Tox Circuits

47.2 Implicit Pseudo Boolean Enumeration Algorithms for Input Vector Control [p. 767]

K. Chopra, S. B. K. Vrudhula (University of Arizona)
In a CMOS combinational logic circuit, the subthreshold leakage current in the standby state
depends on the state of the inputs. In this paper we present a new approach to identify the
minimum leakage set of input vectors (MLS). Applying a vector in the MLS is known as Input
Vector Control (IVC), and has proven to be very useful in reducing gate oxide leakage and subthreshold
leakage in standby mode of operation. The approach presented here is based on
Implicit Enumeration of integervalued decision diagrams. Since the search space for minimum
leakage vector increases exponentially with the number of primary inputs, the enumeration is
done with respect to the minimum balanced cut of the digraph representation of the circuit. To
reduce the switching power dissipated when the inputs are driven to a given state (during entry
into and exit from the standby state), we extend the MLS algorithm to compute a bounded
leakage set (BLS). Given a bound of standby leakage, we present an algorithm for computing
minimal switching cost partial input vectors such that the leakage of the circuit is always less
than the upper bound.
Keywords: CMOS, Leakage, Power, Binary Decision Diagrams, Symbolic Methods, SAT

47.3 Statistical Optimization of Leakage Power Considering Process
Variations using DualVth and Sizing [p. 773]

A. Srivastava, D. Sylvester, D. Blaauw (University of Michigan)
Increasing levels of process variability in sub100nm CMOS design has become a critical
concern for performance and power constraint designs. In this paper, we propose a new
statistically aware DualVt and sizing optimization that considers both the variability in
performance and leakage of a design. While extensive work has been performed in the past on
statistical analysis methods, circuit optimization is still largely performed using deterministic
methods. We show in this paper that deterministic optimization quickly looses effectiveness for
stringent performance and leakage constraints in designs with significant variability. We then
propose a statistically aware dualVt and sizing algorithm where both delay constraints and
sensitivity computations are performed in a statistical manner. We demonstrate that using this
statistically aware optimization, leakage power can be reduced by 1535% compared to
traditional deterministic analysis. The improvements increase for strict delay constraints making
statistical optimization especially important for high performance designs.
Keywords: Leakage, variability, optimization

47.4 Leakage and CrosstalkAware Bus Encoding for Total Power Reduction [p. 779]

H. S. Deogun, R. R. Rao, D. Sylvester, D. Blaauw (University of Michigan)
Power consumption, particularly runtime leakage, in long onchip buses has grown to an
unacceptable portion of the total power budget due to heavy buffer insertion to combat RC
delays. In this paper, we propose a new bus encoding algorithm and circuit scheme for onchip
buses that eliminates capacitive crosstalk while simultaneously reducing total power. We
introduce a new buffer design approach with selective use of high threshold voltage transistors
and couple this buffer design with a novel bus encoding scheme. The proposed encoding scheme
significantly reduces total power by 26% and runtime leakage power by 42% while also
eliminating capacitive crosstalk. In addition, the proposed encoding is specifically optimized to
reduce the complexity of the encoding logic, allowing for a significant reduction in overhead
which has not been considered in previous bus encoding work.
Keywords: Leakage reduction, encoding, low power

47.5 Power Minimization using Simultaneous Gate Sizing, DualVdd
and DualVth Assignment [p. 783]

A. Srivastava, D. Sylvester, D. Blaauw (University of Michigan),
We develop an approach to minimize total power in a dualVdd and dualVth design. The
algorithm runs in two distinct phases. The first phase relies on upsizing to create slack and
maximize low Vdd assignments in a backward topological manner. The second phase proceeds
in a forward topological fashion and both sizes and reassigns gates to high Vdd to enable
significant static power savings through high Vth assignment. The proposed algorithm is
implemented and tested on a set of combinational benchmark circuits. A comparison with
traditional CVS and dualVth/sizing algorithms demonstrate the advantage of the algorithm over
a range of activity factors, including an average power reduction of 30% (50%) at high (nominal)
primary input activities.
Keywords: Power dissipation, optimization, multiple voltages
Chair: Yehea Massoud (Rice University)
Organizers: Sachin S. Sapatnekar, Jamil Kawa

48.1 Sparse Transformations and Preconditioners for Hierarchical
3D Capacitance Extraction with Multiple Dielectrics [p. 788]

S. Yan, V. Sarin, W. Shi (Texas A&M University)
Capacitance extraction is an important problem that has been extensively studied. This paper
presents a significant improvement for the fast multipole accelerated boundary element method.
We first introduce an algebraic transformation to convert the n × n dense capacitance coefficient
matrix into a sparse matrix with O(n) nonzero entries. We then use incomplete Cholesky
factorization or incomplete LU factorization to produce an effective preconditioner for the sparse
linear system. Simulation results show that our algorithm drastically reduces the number of
iterations needed to solve the linear system associated with the boundary element method. For
the k×k bus crossing benchmark, our algorithm uses 34 iterations, compared to 1020 iterations
used by the previous algorithms such as FastCap [1] and HiCap [2]. As a result, our algorithm is
220 times faster than those algorithms. Our algorithm is also superior to the multiscale method
[3] because our preconditioner reduces the number of iterations further and applies to multiple
dielectrics.
Keywords: Capacitance, extraction, interconnect, iterative methods, preconditioning

48.2 A Fast Parasitic Extractor Based on LowRank Multilevel Matrix
Compression for Conductor and Dielectric Modeling in Microelectronics
and MEMS [p. 794]

D. Gope, S. Chakraborty, V. Jandhyala (University of Washington)
Parasitic parameter extraction is a crucial issue in Integrated Circuit design. Integral equation
based solvers, which guarantee high accuracy, suffer from a time and memory bottleneck arising
from the dense matrices generated. In this paper we present a hybrid FMMQR algorithm that
combines the best features of the Fast Multipole Method and the QR based matrix compression
method to achieve faster setup and solve time and lower memory requirements. The method is
applied to extract parasitic capacitances from the layout of arbitrarily shaped conductors and
dielectrics. Examples demonstrating the accuracy and the superior time and memory
performances as compared to existing solvers are also presented.
Keywords: Parasitics, Multilevel, Lowrank, conductors and dielectrics

48.3 CHIME: Coupled Hierarchical Inductance Model Evaluation [p. 800]

S. Gupta, L. T. Pileggi (Carnegie Mellon University)
Modeling inductive effects accurately and efficiently is a critical necessity for design verification
of high performance integrated systems. While several techniques have been suggested to
address this problem, they are mostly based on sparsification schemes for the L or Linverse
matrix. In this paper, we introduce CHIME, a methodology for nonlocal inductance modeling
and simulation. CHIME is based on a hierarchical model of inductance that accounts for all
inductive couplings at a linear cost, without requiring any window size assumptions for
sparsification. The efficacy of our approach stems from representing the mutual inductive
couplings at various levels of hierarchy, rather than discarding some of them. A prototype
implementation demonstrates orders of magnitude speedup over a full, flat model and significant
accuracy improvements over a truncated model. Importantly, this hierarchical circuit simulation
capability produces a solution that is as accurate as the hierarchically extracted circuits, thereby
providing a "golden standard" against which simpler truncation based models can be validated.
KEYWORDS: Inductance Modeling, Circuit Simulation, Extraction

48.4 LargeScale FullWave Simulation [p. 806]

S. Kapur, D. E. Long (Integrand Software, Inc.)
We describe a new extraction tool, EMX (ElectroMagnetic eXtractor), for the analysis of RF,
analog and highspeed digital circuits. EMX is a fast fullwave field solver. It incorporates two
new techniques which make it significantly faster and more memoryefficient than previous
solvers. First, it takes advantage of layout regularity in typical designs. Second, EMX uses a new
method for computing the vectorpotential component in the mixed potential integral equation.
These techniques give a speedup of more than a factor of ten, together with a corresponding
reduction in memory.
Keywords: Layout extraction, multipole, integral equations

48.5 ClosedForm Expressions of Distributed RLC Interconnects
for Analysis of OnChip Inductance Effects [p. 810]

Y. Tanji (Kagawa University), H. Asai (Shizuoka University)
The closedform expressions of distributed RLC interconnects are proposed for analysis of onchip
inductance effects in order to insert optimally the repeaters. The transfer function of a
circuit with driverinterconnectload structure is approximated by the 5th order rational
functions. The step responses computed by using the proposed expressions give the good
agreement with the SPICE simulations.
Keywords: Inductance Effects, RLC Distributed Interconnects
Chair: Jordi Cortadella (University of Polytechnica De Catalunya)
Organizers: Leon Stok, Steven M. Nowick

49.1 Resynthesis for Delay Variation Tolerance [p. 814]

S.C. Chang, C.T. Hsieh, K.C. Wu (National Tsing Hua University)
Several factors such as process variation, noises, and delay defects can degrade the reliabilities of
a circuit. Traditional methods add a pessimistic timing margin to resolve delay variation
problems. In this paper, instead of sacrificing the performance, we propose a resynthesis
technique which adds redundant logics to protect the performance. Because nodes in the critical
paths have zero slacks and are vulnerable to delay variation, we formulate the problem of
tolerating delay variation to be the problem of increasing the slacks of nodes. Our resynthesis
technique can increase the slacks of all nodes or wires to be larger than a predetermined value.
Our experimental results show that additional area penalty is around 21% for 10% of delay
variation tolerance.
Keywords: Delay variation, tolerance

49.2 PostLayout Logic Optimization of Domino Circuits [p. 820]

A. Cao, C.K. Koh (Purdue University)
Logic duplication, a commonly used synthesis technique to remove trapped inverters in
reconvergent paths of Domino circuits, incurs high area and power penalties. In this paper, we
propose a synthesis scheme to reduce the duplication cost by allowing inverters in Domino logic
under certain timing constraints. In order to guarantee the robustness of such Domino circuits,
we perform the reduction of logic duplication at the physical level. Experimental results show
significant reduction in duplication cost, which translates into significant improvements in area,
power, and/or delay.
Keywords: Domino logic, synthesis, optimization, layout

49.3 Multiple Constant Multiplication By TimeMultiplexed Mapping
of Addition Chains [p. 826]

P. Tummeltshammer (University of Technology, Vienna),
J. C. Hoe, M. Püschel (Carnegie Mellon University)
An important primitive in the hardware implementations of linear DSP transforms is a circuit
that can multiply an input value by one of several different preset constants. We propose a novel
implementation of this circuit based on combining the addition chains of the constituent
constants. We present an algorithm to automatically generate such a circuit for a given set of
constants. The quality of the resulting circuits is evaluated after synthesis for a commercial 0.18
µm standard cell library. We compare the area and latency efficiency of this addition chain based
approach against a straightforward approach based on a constant table and a full multiplier.
Keywords: Addition chains, multiplierless, directed acyclic graph, fusion

49.4 Decomposing Specifications with Concurrent Outputs
to Resolve State Coding Conflicts in Asynchronous Logic Synthesis [p. 830]

H. K. Kapoor, M. B. Josephs (London South Bank University)
Synthesis of asynchronous logic using the tool Petrify requires a state graph with a complete
state coding. It is common for specifications to exhibit concurrent outputs, but Petrify is
sometimes unable to resolve the state coding conflicts that arise as a result, and hence cannot
synthesise a circuit. A pair of decomposition heuristics (expressed in the language of Delay
Insensitive Sequential Processes) are given that helps one to obtain a synthesisable specification.
The second heuristic has been successfully applied to a set of nine benchmarks to obtain
significant reductions both in area and in synthesis time, compared with synthesis performed on
the original specifications.
Keywords: Asynchronous logic synthesis, delayinsensitive decomposition

49.5 A New Heuristic Algorithm for Reversible Logic Synthesis [p. 834]

P. Kerntopf (Warsaw University of Technology)
Reversible logic has applications in many fields, including quantum computing. Synthesis
techniques for reversible circuits are not well developed, even for functions with a small number
of inputs and outputs. This paper proposes an approach to reversible logic synthesis using a new
complexity measure based on shared binary decision diagrams with complemented edges
(instead of truth tables, as in the previous algorithms). The approach can be used with arbitrary
libraries of reversible logic gates and arbitrary cost functions. Experiments show promising
results in comparison with the known approaches.
Keywords: Reversible Logic Circuits, Synthesis.

49.6 Quantum Logic Synthesis by Symbolic Reachability Analysis [p. 838]

W. N. N. Hung (Intel Corporation), X. Song, G. Yang (Portland State University),
J. Yang (Intel Corporation), M. Perkowski (Portland State University)
Reversible quantum logic plays an important role in quantum computing. In this paper, we
propose an approach to optimally synthesize quantum circuits by symbolic reachability analysis
where the primary inputs are purely binary. We present an exact synthesis method with optimal
quantum cost and a speedup method with nonoptimal quantum cost. Both our methods
guarantee the synthesizeability of all reversible circuits. Unlike previous works which use
permutative reversible gates, we use a lower level library which includes nonpermutative
quantum gates. Our approach obtains the minimum cost quantum circuits for Miller's gate, halfadder,
and fulladder, which are better than previous results. In addition, we prove the minimum
quantum cost (using our elementary quantum gates) for Fredkin, Peres, and Toffoli gates. Our
work constitutes the first successful experience of applying satisfiability with formal methods to
quantum logic synthesis.
Keywords: Reversible Logic, Quantum Computing, Formal Verification, Model Checking,
Satisfiability.
Chair: Joel R. Philips (Cadence Design Systems, Incorporated)
Organizers: Jacob, K. White, Kartikeya Mayaram

50.1 A Frequency Relaxation Approach for Analog/RF SystemLevel Simulation [p. 842]

X. Li, Y. Xu, P. Li, P. Gopalakrishnan, L. T. Pileggi (Carnegie Mellon University)
The increasing complexity of today's mixedsignal integrated circuits necessitates both topdown
and bottomup systemlevel verification. Timedomain statespace modeling and simulation
approaches have been successfully applied for such purposes (e.g. Simulink); however, analog
circuits are often best analyzed in the frequency domain. Circuitlevel analyses, such as
harmonic balance, have been successfully extended to the frequency domain [2], but these
algorithms are impractical for simulating large systems with wideband input and noise signals.
In this paper we proposed a frequencydomain approach for analog/RF systemlevel simulation
that is capable of capturing various second order effects (e.g. nonlinearity, noise, etc.) for both
timeinvariant and timevarying systems with wideband inputs. The simulator directly evaluates
the frequency domain response at each node via a relaxation scheme that is proven to be
convergent under typical circuit conditions. Our experimental results demonstrate the accuracy
and efficiency of the proposed simulator under various wideband input and noise excitations.
Keywords: SystemLevel Simulation, Analog/RF Circuits

50.2 Robust, Stable TimeDomain Methods for Solving MPDEs of Fast/Slow Systems [p. 848]

T. Mei, J. Roychowdhury (University of Minnesota),
T. S. Coffey, S. A. Hutchinson, D. M. Day (Sandia National Laboratories)
In this paper, we explore in detail the stability properties of timedomain numerical methods for
multitime partial differential equations (MPDEs). We demonstrate that simple techniques for
numerical discretization can lead easily to instability. By investigating the underlying
eigenstructure of several discretization techniques along different artificial time scales, we show
that not all combinations of techniques are stable. We identify choices of discretization method
and of step size along slow time scales that lead to robust, stable timedomain integration
methods for the MPDE. One of our results is that applying overstable methods along one timescale
can compensate for unstable discretization along others. Our novel integration schemes
bring robustness to timedomain MPDE solution methods, as we demonstrate with examples.
Keywords: MPDE, stability, eigenstructure, timedomain discretization, envelope

50.3 HighLevel Simulation of Substrate Noise in HighOhmic Substrates
with Interconnect and Supply Effects [p. 854]

G. Van der Plas (IMEC), M. Badaroglu (IMEC, K. U. Leuven),
G. Vandersteen (IMEC, Vrije Universiteit Brussel), P. Dobrovolny (IMEC),
P. Wambacq (IMEC, Vrije Universiteit Brussel), S. Donnay (IMEC), G. Gielen (ESAT, K.U. Leuven),
H. De Man (IMEC, ESAT, K.U. Leuven)
Substrate noise is a major obstacle for mixedsignal integration. In this paper we propose a fast
and accurate highlevel methodology to simulate substrate noise generated by large digital
circuits. The methodology can handle any substrate type, e.g. bulktype or EPItype, and takes
into account the effects of interconnect and supply. For each standard cell a substrate
macromodel is used in order to efficiently simulate the total system, which consists of a network
of such macromodels. For a 40K gates telecom circuit fabricated in a 0.18 μm CMOS process,
measurements indicate that substrate noise is simulated by using our methodology within 20%
error but several orders of magnitude faster in CPU time than a full SPICE simulation.
Keywords: model order reduction, modeling, substrate noise

50.4 Hierarchical Approach to Exact Symbolic Analysis of Large Analog Circuits [p. 860]

S. X.D. Tan, W. Guo, Z. Qi (University of California at Riverside)
This paper provides a novel approach to exact symbolic analysis of very large analog circuits.
The new method is based on determinant decision diagrams (DDDs) to represent symbolic
product terms. But instead of constructing DDD graphs directly from a flat circuit matrix, the
new method constructs DDD graphs in a hierarchical way based on hierarchically defined circuit
structures. The resulting algorithm can analyze much larger analog circuits exactly than before.
Theoretically, we show that exact symbolic expressions of a circuit are cancellationfree
expressions when the circuit is analyzed hierarchically. Practically we propose a novel
hierarchical DDD graph construction algorithm. Our experimental results show that very large
analog circuits, which can't be analyzed exactly before like μA725 and other unstructured
circuits up to 100 nodes, can be analyzed by the new approach for the first time. The new
approach significantly improves the exact symbolic capacity and promises huge potentials for the
new applications of symbolic analysis in analog circuit design automation.
Keywords
Behavioral Modeling, Symbolic Analysis, Circuit Simulation

50.5 An Essentially NonOscillatory (ENO) HighOrder Accurate
Adaptive Table Model for Device Modeling [p. 864]

B. Yang, B. McGaughy (Cadence Design Systems, Inc.)
Modern analytical device models become more and more complicated and expensive to evaluate
in circuit simulation. Interpolation based table lookup device models become increasingly
important for fast circuit simulation. Traditional table model trades accuracy for speed and is
only used in fastSpice simulators but not good enough for primetime Spice simulators such as
SPECTRE. We propose a novel table model technology that uses highorder essentially nonoscillatory
(ENO) polynomial interpolation in multidimensions to guarantee smoothness in
multidimensions and high accuracy in approximating i  v/q  v curves. An efficient transfinite
blending technique for the reconstruction of multidimensional tables is used. Interpolation
stencil is adaptively determined by automatic accuracy control. The method has been proved to
be superior to traditional ones and successfully applied in Spectre and Ultrasim for simulating
digital, analog, RF, and mixedsignal circuits.
Chair: Jihong Kim (Seoul National University)
Organizers: Chaitali Chakrabarti, Sarma B. Vrudhula

51.1 Theoretical and Practical Limits of Dynamic Voltage Scaling [p. 868]

B. Zhai, D. Blaauw, D. Sylvester (University of Michigan), K. Flautner (ARM Ltd.)
Dynamic voltage scaling (DVS) is a popular approach for energy reduction of integrated circuits.
Current processors that use DVS typically have an operating voltage range from full to half of
the maximum Vdd. However, it is possible to construct designs that operate over a much larger
voltage range: from full Vdd to subthreshold voltages. This possibility raises the question of
whether a larger voltage range improves the energy efficiency of DVS. First, from a theoretical
point of view, we show that for subthreshold supply voltages leakage energy becomes dominant,
making "just in time completion" energy inefficient. We derive an analytical model for the
minimum energy optimal voltage and study its trends with technology scaling. Second, we use
the proposed model to study the workload activity of an actual processor and analyze the energy
efficiency as a function of the lower limit of voltage scaling. Based on this study, we show that
extending the voltage range below 1/2 Vdd will improve the energy efficiency for most
processor designs, while extending this range to subthreshold operation is beneficial only for
very specific applications. Finally, we show that operation deep in the subthreshold voltage
range is never energyefficient.
Keywords dynamic voltage scaling, minimum energy point

51.2 Enabling Energy Efficiency in ViaPatterned Gate Array Devices [p. 874]

R. R. Taylor (Carnegie Mellon University), H. Schmit (Tabula, Inc.)
In an attempt to enable the costeffective production of lowand midvolume applicationspecific
chips, researchers have proposed a number of socalled structured ASIC architectures. These
architectures represent a departure from traditional standardcellbased ASIC designs in favor of
techniques which present more physical and structural regularity. If structured ASICs are to
become a viable alternative to standard cells, they must deliver performance and energy
efficiency which is competitive with standardcellbased design techniques. This paper focuses
on one family of structured ASICs known as viapatterned gate arrays, or VPGAs. In this paper,
we present circuit structures and power optimization algorithms which can be applied to VPGA
chips in an effort to reduce their operational power dissipation.
Keywords: Structured ASIC, VPGA, LowPower, Voltage Scaling, Power Optimization

51.3 Compact Thermal Modeling for TemperatureAware Design [p. 878]

W. Huang, M. R. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, S. Velusamy (University of Virginia)
Thermal design in sub100nm technologies is one of the major challenges to the CAD
community. In this paper, we first introduce the idea of temperatureaware design. We then
propose a compact thermal model which can be integrated with modern CAD tools to achieve a
temperatureaware design methodology. Finally, we use the compact thermal model in a case
study of microprocessor design to show the importance of using temperature as a guideline for
the design. Results from our thermal model show that a temperatureaware design approach can
provide more accurate estimations, and therefore better decisions and faster design convergence.
Keywords: temperatureaware design, temperatureaware computing, thermal model, poweraware
design, leakage, reliability.

51.4 Simultaneous Optimization of Supply and Threshold Voltages for LowPower
and HighPerformance Circuits in the Leakage Dominant Era [p. 884]

A. Basu, S.C. Lin, V. Wason (University of California at Santa Barbara),
A. Mehrotra (Berkeley Design Automation, Inc.), K. Banerjee (University of California at Santa Barbara)
Electrothermal couplings between supply voltage, operating frequency, power dissipation and
die temperature have been shown to significantly impact the energydelayproduct (EDP) based
simultaneous optimization of supply (Vdd) and threshold (Vth) voltages. We present for the first
time, the implications of an electrothermally aware EDP optimization on circuit operation in
leakage dominant nanometer scale CMOS technologies. It is demonstrated that electrothermal
EDP (EEDP) optimization restricts the operation of the circuit to a certain region in the VddVth
plane. Also, the significance of EEDP optimization has been shown to increase with increase in
leakage power and/or process variations.
Keywords: Electrothermal couplings, energy delay product, subthreshold leakage, temperature
aware design
Chair: David Blaauw (University of Michigan)
Organizers: Anirudh Devgan, Dennis Sylvester

52.1 Noise Characterization of Static CMOS Gates [p. 888]

R. Kanj (University of Illinois at UrbanaChampaign),
T. Lehner, B. Agrawal (IBM Corporation), E. Rosenbaum (University of Illinois at UrbanaChampaign)
We present new macromodeling techniques for capturing the response of a CMOS logic gate to
noise pulses at the input. Two approaches are presented. The first one is a robust mathematical
model which enables the hierarchical generation of noise abstracts for circuits composed of the
precharacterized cells. The second is a circuit equivalent model which generates accurate noise
waveforms for arbitrarily shaped and timed multipleinput glitches, arbitrary loads, and external
noise coupling.
Keywords: Cell model, noise analysis, sensitivity, circuitequivalent model, mathematical model,
simulation

52.2 A Scalable Soft Spot Analysis Methodology for Compound Noise
Effects in Nanometer Circuits [p. 894]

C. Zhao, X. Bai, S. Dey (University of California at San Diego)
Circuits using nanometer technologies are becoming increasingly vulnerable to signal
interference from multiple noise sources as well as radiationinduced soft errors. One way to
ensure reliable functioning of chips is to be able to analyze and identify the spots in the circuit
which are susceptible to such effects (called “soft spots” in this paper), and to make sure such
soft spots are “hardened” so as to resist multiple noise effects and soft errors. In this paper, we
present a scalable soft spot analysis methodology to study the vulnerability of digital ICs
exposed to nanometer noise and transient soft errors. First, we define "softness" as an important
characteristic to gauge system vulnerability. Then several key factors affecting softness are
examined. Finally an efficient Automatic Soft Spot Analyzer (ASSA) is developed to obtain the
softness distribution which reflects the unbalanced noisetolerant capability of different regions
in a design. The proposed methodology provides guidelines to reduction of severe nanometer
noise effects caused by aggressive design in the premanufacturing phase, and guidelines to
selective insertion of online protection schemes to achieve higher robustness. The quality of the
proposed softspot analysis technique is validated by HSPICE simulation, and its scalability is
demonstrated on a commercial embedded processor.
Keywords: Nanometer technology, compound noise effect, robustness, softness distribution.

52.3 A Novel Technique to Improve Noise Immunity of CMOS
Dynamic Logic Circuits [p. 900]

L. Ding, P. Mazumder (University of Michigan)
Dynamic CMOS logic circuits are widely employed in high performance VLSI chips in pursuing
very high system performance. However, dynamic circuits are inherently less resistant to noises
than static CMOS gates. With the increasing stringent noise requirement due to aggressive
technology scaling, the noise tolerance of dynamic circuits has to be first improved for the
overall reliable operation of VLSI systems. In this paper, we present a novel noisetolerant
design technique using circuitry exhibiting a negative differential resistance effect. We have
demonstrated that using the proposed method the noise tolerance of dynamic logic gates can be
improved beyond the level of static CMOS logic gates while the performance advantage of
dynamic circuits is still retained.
Keywords: Dynamic circuits, Domino logic style, Noisetolerant design, Negative differential
resistance, Digital integrated circuits

52.4 Statistical Timing Analysis in Sequential Circuit for OnChip
Global Interconnect Pipelining [p. 904]

L. Zhang, Y. Hu (University of Wisconsin), C. Chen (National Taiwan University)
With deepsubmicron (DSM) technology, statistical timing analysis becomes increasingly
crucial to characterize signal transmission over global interconnect wires. In this paper, a novel
statistical timing analysis approach has been developed to analyze the behavior of two important
pipelined architectures for multiple clockcycle global interconnect, namely, the flipflop inserted
global wire and the latch inserted global wire. We present analytical formula that is based on
parameters obtained using Monte Carlo simulation. These results enable a global interconnect
designer to explore design tradeo.s between clock frequency and probability of biterror during
data transmission.
Keywords: Statistical Timing Analysis, Interconnect Pipelining
Chair: Giovanni De Micheli (Stanford University)
Organizers: Marcello Coppola, Pai H. Chou

53.1 Debugging HW/SW Interface for MPSoC: Video Encoder
System Design Case Study [p. 908]

M.W. Youssef, S. Yoo, A. Sasongko, Y. Paviot, A. A. Jerraya (TIMA Laboratory)
This paper reports a case study of multiprocessor SoC (MPSoC) design of a complex video
encoder, namely OpenDivX. OpenDivX is a popular version of MPEG4. It requires massive
computation resources and deals with complex data structures to represent video streams. In this
study, the initial specification is given in sequential C code that had to be parallelized to be
executed on four different processors. High level programming model, namely Message Passing
Interface (MPI) was used to enable intertask communication among parallelized C code. A four
processor hardware prototyping platform was used to debug the parallelized software before final
SoC hardware is ready. The targeting of abstract parallel code using MPI to the multiprocessor
architecture required the design of an additional hardwaredependent software layer to refine the
abstract programming model. The design was made by a team work of three types of designer:
application software, hardwaredependent software and hardware platform designers. The
collaboration was necessary to master the whole flow from the specification to the platform.
The study showed that HW/SW interface debug was the most timeconsuming step. This is
identified as a potential killer for applicationspecific MPSoC design. To further investigate the
ways to accelerate the HW/SW interface debug, we analyzed bugs found in the case study and
the available debug environments. Finally, we address a debug strategy that exploits efficiently
existing debug environments to reduce the time for HW/SW interface debug.
Keywords: Multiprocessor systemonchip, Hardwaresoftware interface, Hardwaredependant
software, Debug

53.2 SUNMAP: A Tool for Automatic Topology Selection and Generation for NoCs [p. 914]

S. Murali, G. De Micheli (Stanford University)
Increasing communication demands of processor and memory cores in Systems on Chips (SoCs)
necessitate the use of Networks on Chip (NoC) to interconnect the cores. An important phase in
the design of NoCs is the mapping of cores onto the most suitable topology for a given
application. In this paper, we present SUNMAP a tool for automatically selecting the best
topology for a given application and producing a mapping of cores onto that topology. SUNMAP
explores various design objectives such as minimizing average communication delay, area,
power dissipation subject to bandwidth and area constraints. The tool supports different routing
functions (dimension ordered, minimumpath, traffic splitting) and uses floorplanning
information early in the topology selection process to provide feasible mappings. The network
components of the chosen NoC are automatically generated using cycleaccurate SystemC soft
macros from xpipes architecture. SUNMAP automates NoC selection and generation, bridging
an important design gap in building NoCs. Several experimental case studies are presented in the
paper, which show the rich design space exploration capabilities of SUNMAP.
Keywords: Systems On Chip, Networks on Chip, Topology, Mapping, SystemC.

53.3 FITS: Frameworkbased Instructionset Tuning Synthesis
for Embedded Application Specific Processors [p. 920]

A. Cheng (The University of Michigan), G. Tyson (Florida State University),
T. Mudge (The University of Michigan)
We propose a new instruction synthesis paradigm that falls between a generalpurpose embedded
processor and a synthesized application specific processor (ASP). This is achieved by replacing
the fixed instruction and register decoding of general purpose embedded processor with
programmable decoders that can achieve ASP performance with the fabrication advantages of a
mass produced single chip solution.
Keywords: 16bit ISA, Instruction synthesis, Embedded processor, ASP, Instruction encoding,
Reconfigurable processors, Configurable architecture, Code density, Lowpower, Energy
efficient

53.4 Mapping a Domain Specific Language to a Platform FPGA [p. 924]

C. Kulkarni, G. Brebner (Xilinx Inc.), G. Schelle (University of Colorado)
A domain specific language (DSL) enables designers to rapidly specify and implement systems
for a particular domain, yielding designs that are easy to understand, reason about, reuse and
maintain. However, there is usually a significant overhead in the required infrastructure to map
such a DSL on to a programmable logic device. In this paper, we present a mapping of an
existing DSL for the networking domain on to a platform FPGA by embedding the DSL into an
existing language infrastructure. In particular, we will show that, using few basic concepts, we
are able to achieve a successful mapping of the DSL on to a platform FPGA and create a reusable
structure that also makes it easy to extend the DSL. Finally we will present some results
of mapping the DSL on to a platform FPGA and comment on the resulting overhead.
Keywords: Domain Specific Language, Platform FPGA, Network Processing
Chair: Bernd Koenemann (Cadence Design)
Organizers: Erik Jan Marinissen, Seiji Kajihara

54.1 On the Generation of ScanBased Test Sets with Reachable States
for Testing under Functional Operation Conditions [p. 928]

I. Pomeranz (Purdue University)
Designfortestability (DFT ) for synchronous sequential circuits allows the generation and
application of tests that rely on nonfunctional operation of the circuit. This can result in
unnecessary yield loss due to the detection of faults that do not affect normal circuit operation.
Considering single stuckat faults in fullscan circuits, a test vector consists of a primary input
vector U and a state S . We say that the test vector consisting of U and S relies on nonfunctional
operation if S is an unreachable state, i.e., a state that cannot be reached from all the circuit
states. Our goal is to obtain test sets with states S that are reachable states. Given a test set C, the
solution we explore is based on a simulationbased procedure to identify reachable states that can
replace unreachable states in C. No modifications are required to the test generation procedure
and no sequential test generation is needed. Our results demonstrate that the proposed procedure
is able to produce test sets that detect many of the circuit faults, which are detectable using scan,
and practically all the sequentially irredundant faults, by using test vectors with reachable states.
The procedure is applicable to any type of scanbased test set, including test sets for delay faults.
Keywords: functional tests, reachable states, scan design.

54.2 Scalable Selector Architecture for XTolerant Deterministic BIST [p. 934]

P. Wohl, J. A. Waicukauski, S. Patel (Synopsys Inc.)
Xtolerant deterministic BIST (XDBIST) was recently presented as a method to efficiently
compress and apply scan patterns generated by automatic test pattern generation (ATPG) in a
logic builtin selftest architecture. In this paper we introduce a novel selector architecture that
allows arbitrary compression ratios, scales to any number of scan chains and minimizes area
overhead. XDBIST testcoverage, full Xtolerance and scanbased diagnosis ability are preserved
and are the same as deterministic scanATPG.
Keywords: testdata compression, testgeneration (ATPG).

54.3 ScanBIST Based on Transition Probabilities [p. 940]

I. Pomeranz (Purdue University)
We demonstrate that it is possible to generate a deterministic test set that detects all the
detectable single stuckat faults in a fullscan circuit such that each test contains a small number
of transitions from 0 to 1 or from 1 to 0 when considering consecutive input values. Using this
result we show that builtin testpattern generation for scan circuits can be based on transition
probabilities instead of probabilities of specific bits in the test set being 0 or 1. The resulting
approach associates only two parameters with every set of test vectors: an initial value and a
transition probability. We demonstrate that this approach is effective in detecting all the
detectable single stuckat faults in benchmark circuits.
Keywords: builtin selftest, scan design.

54.4 Combining Dictionary Coding and LFSR Reseeding for Test Data Compression [p. 944]

X. Sun, L. Kinney, B. Vinnakota (University of Minnesota)
In this paper we describe a method to combine dictionary coding and partial LFSR reseeding to
improve the compression efficiency for test data compression. We also present a fast matrix
calculation method which significantly reduces the computation time to find a solution for partial
LFSR reseeding. Experimental results on ISCAS89 benchmark circuits show that our approach is
better than either dictionary coding or LFSR reseeding, and outperforms several test data
compression methods proposed recently.
Keywords: VLSI test, BuiltIn Self Test.
Chair: Jason Cong (Magma Design Automation, Inc.)
Organizers: Jens Palsberg, Scott Hauck

55.1 Virtual Memory Window for ApplicationSpecific Reconfigurable
Coprocessors [p. 948]

M. Vuletic, L. Pozzi, P. Ienne (Swiss Federal Institute of Technology Lausanne)
Reconfigurable SystemsonChip (SoCs) on the market consist of fullfledged processors and large
FieldProgrammable GateArrays (FPGAs). The latter can be used to implement the system glue
logic, various peripherals, and applicationspecific coprocessors. Using FPGAs for applicationspecific
coprocessors has certain speedup potentials, but it is less present in practice because of
the complexity of interfacing the software application with the coprocessor. Another obstacle is
the lack of portability across different systems. In this work, we present a virtualisation layer
consisting of an operatingsystem extension and a hardware component. It lowers the complexity
of interfacing and increases portability potentials, while it also allows the coprocessor to access
the user virtual memory through a virtual memory window. The burden of moving data between
processor and coprocessor is shifted from the programmer to the operating system. Since the
virtualisation layer components hide physical details of the system, user designed hardware and
software become perfectly portable. A reconfigurable SoC running Linux is used to prove the
viability of the concept. Two applications are ported to the system for testing the approach, with
their critical functions mapped to the specific coprocessors. We show a significant speedup
compared to the software versions, while limited penalty is paid for virtualisation.
Keywords: Reconfigurable Computing, OS, Coprocessors, Codesign.

55.2 Dynamic FPGA Routing for JustinTime FPGA Compilation [p. 954]

R. Lysecky, F. Vahid, S. X.D. Tan (University of California at Riverside)
Justintime (JIT) compilation has previously been used in many applications to enable standard
software binaries to execute on different underlying processor architectures. However, embedded
systems increasingly incorporate Field Programmable Gate Arrays (FPGAs), for which the
concept of a standard hardware binary did not previously exist, requiring designers to implement
a hardware circuit for a single specific FPGA. We introduce the concept of a standard hardware
binary, using a justintime compiler to compile the hardware binary to an FPGA. A JIT
compiler for FPGAs requires the development of lean versions of technology mapping,
placement, and routing algorithms, of which routing is the most computationally and memory
expensive step. We present the Riverside OnChip Router (ROCR) designed to efficiently route
a hardware circuit for a simple configurable logic fabric that we have developed. Through
experiments with MCNC benchmark hardware circuits, we show that ROCR works well for JIT
FPGA compilation, producing good hardware circuits using an order of magnitude less memory
resources and execution time compared with the well known Versatile Place and Route (VPR)
tool suite. ROCR produces good hardware circuits using 13X less memory and executing 10X
faster than VPR's fastest routing algorithm. Furthermore, our results show ROCR requires only
10% additional routing resources, and results in circuit speeds only 32% slower than VPR's
timingdriven router, and speeds that are actually 10% faster than VPR's routabilitydriven
router.
Keywords: Place and route, justintime compilation, hardware/software partitioning, FPGA,
configurable logic, platforms, systemonachip, dynamic optimization, codesign, warp
processors.

55.3 An Efficient Algorithm for Finding Empty Space for Online FPGA Placement [p. 960]

M. Handa, R. Vemuri (University of Cincinnati)
A fast and efficient algorithm for finding empty area is necessary for online placement, task
relocation and defragmentation on a partially reconfigurable FPGA. We present an algorithm
that finds empty area as a list of overlapping maximal rectangles. Using an innovative
representation of the FPGA, we are able to predict possible locations of the maximal empty
rectangles. Worstcase time complexity of our algorithm is O(xy) where x is the number of
columns, y is the number of rows and x.y is the total number of cells on the FPGA. Experiments
show that, in practice, our algorithm needs to scan less than 15% of the FPGA cells to make a list
of all maximal empty rectangles.
Keywords: Online Placement, Partially Reconfigurable FPGAs, Reconfigurable Computing
