In this paper we present a new algorithm for computing reduced-order models of interconnect which utilizes the dominant controllable subspace of the system. The dominant controllable modes are computed via a new iterative Lyapunov equation solver, Vector ADI. This new algorithm is as inexpensive as Krylov subspace-based moment matching methods, and often produces a better approximation over a wide frequency range. A spiral inductor and a transmission line example show this new method can be much more accurate than moment matching via Arnoldi.
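For readers unfamiliar with ADI-based Lyapunov solvers, the sketch below shows a generic low-rank ADI iteration for the controllability Gramian A P + P A^T + B B^T = 0; the shifts are arbitrary illustrative values, and the paper's Vector ADI may differ in recurrence details and shift selection. The columns of the returned factor Z then span an approximation to the dominant controllable subspace used for projection.

```python
import numpy as np

def lr_adi_lyapunov(A, B, shifts):
    """Generic low-rank ADI iteration for A P + P A^T + B B^T = 0 (A stable).

    Returns Z such that P is approximated by Z @ Z.T.  Real negative shifts
    are assumed; this is an illustrative sketch, not the paper's Vector ADI.
    """
    n = A.shape[0]
    I = np.eye(n)
    p = shifts[0]
    V = np.sqrt(-2.0 * p) * np.linalg.solve(A + p * I, B)
    cols = [V]
    for p_next in shifts[1:]:
        V = np.sqrt(p_next / p) * (
            V - (p_next + p) * np.linalg.solve(A + p_next * I, V))
        cols.append(V)
        p = p_next
    return np.hstack(cols)

# Small stable example; the residual shrinks as more well-chosen shifts are used.
A = np.array([[-1.0, 0.2], [0.0, -3.0]])
B = np.array([[1.0], [1.0]])
Z = lr_adi_lyapunov(A, B, shifts=[-0.5, -1.0, -2.0])
P = Z @ Z.T
print(np.linalg.norm(A @ P + P @ A.T + B @ B.T))
```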
Since Asymptotic Waveform Evaluation (AWE) was introduced in [5], many interconnect model order reduction methods via Pade approximation have been proposed. Although the stability and precision of model reduction methods have been greatly improved, the following important question has not been answered: "What is the error bound in the time domain?". This problem is mainly caused by the "gap" between the frequency domain and the time domain, i.e. a good approximation of the transfer function in the frequency domain may not be a good approximation in the time domain. All of the existing methods approximate the transfer function directly in the frequency domain and hence cannot provide error bounds in the time domain. In this paper, we present new moment matching methods which can provide guaranteed error bounds in the time domain. Our methods are based on the classic work by Teasdale [1], which performs Pade approximation in a transformed domain obtained by the bilinear conformal transformation s = (1-z)/(1+z).
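For reference, the bilinear transformation above is an involution that maps the imaginary axis of the s-plane onto the unit circle of the z-plane (a standard property restated here, not taken from the paper):

\[
s = \frac{1-z}{1+z} \;\Longleftrightarrow\; z = \frac{1-s}{1+s},
\qquad
s = j\omega \;\Rightarrow\; |z| = \left|\frac{1-j\omega}{1+j\omega}\right| = 1 .
\]

A Pade expansion about z = 0 therefore corresponds to an expansion about s = 1, and the entire frequency axis is covered by a bounded region of the z-plane.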
A new algorithm based on Krylov subspace methods is proposed for model reduction of large nonlinear circuits. Reduction is obtained by projecting the original system, described by nonlinear differential equations, onto a subspace of lower dimension. The reduced model can be simulated using conventional numerical integration techniques. A significant reduction in computational expense is achieved because the reduced set of equations is much smaller than the original system.
Keywords: Model-reduction, nonlinear circuits, Krylov-subspace
ENOR is an innovative way to produce provably passive, reciprocal, and compact representations of RLC circuits. Beginning with the nodal equations, ENOR formulates recurrence relations for the moments that involve factorizing a symmetric, positive definite matrix; this contrasts with other RLC order reduction algorithms that require expensive LU factorization. It handles floating capacitors, inductor loops, and resistor links in a uniform way. It distinguishes between active and passive ports, performs Gram-Schmidt orthogonalization on the fly, and controls error in the time domain. ENOR is a superbly simple, flexible, and well-conditioned algorithm for lightning-fast reduction of mega-sized RLC trees, meshes, and coupled interconnects, all with excellent accuracy.
Empirical observation shows that practically encountered instances of ATPG are efficiently solvable. However, it has been known for more than two decades that ATPG is an NP-complete problem. This work is one of the first attempts to reconcile these seemingly disparate results. We introduce the concept of circuit cut-width and characterize the complexity of ATPG in terms of this property. We provide theoretical and empirical results to argue that an interestingly large class of practical circuits have cut-width characteristics which ensure a provably efficient solution of ATPG on them.
Ordered Binary Decision Diagrams (BDDs) are a data structure for the representation and manipulation of Boolean functions, often applied in VLSI CAD. The choice of variable ordering largely influences the size of the BDD; its size may vary from linear to exponential. The most successful methods for finding good orderings are based on dynamic variable reordering, i.e. exchanging of neighboring variables. This basic operation has been used in various variants, like sifting and window permutation. In this paper we show that lower bounds computed during the minimization process can speed up the computation significantly. First, lower bounds are studied from a theoretical point of view. Then these techniques are incorporated in dynamic minimization algorithms. By computing good lower bounds, large parts of the search space can be pruned, resulting in very fast computations. Experimental results are given to demonstrate the efficiency of our approach.
Recently, a number of watermarking-based intellectual property protection techniques have been proposed. Although they have been applied to different stages in the design process and have a great variety of technical and theoretical features, all of them share two common properties: they have been applied solely to optimization problems and do not involve any optimization during the watermarking process. In this paper, we propose the first set of optimization-intensive watermarking techniques for decision problems. In particular, we demonstrate how one can select a subset of superimposed watermarking constraints so that the uniqueness of the signature and the likelihood of satisfying an instance of the satisfiability problem are simultaneously maximized. We have developed three SAT watermarking techniques: adding clauses, deleting literals, and push-out and pull-back. Each technique targets a different type of signature-induced constraint superimposition on an instance of the SAT problem. In addition to comprehensive experimental validation, we theoretically analyze the potential and limitations of the proposed watermarking techniques. Furthermore, we analyze the three proposed optimization-intensive watermarking SAT techniques in terms of their suitability for copy detection.
The goal of this paper is to identify the most efficient algorithms for the optimum mean cycle and optimum cost-to-time ratio problems and compare them with the popular ones in the CAD community. These problems have numerous important applications in CAD, graph theory, discrete event system theory, and manufacturing systems. In particular, they are fundamental to the performance analysis of digital systems such as synchronous, asynchronous, data flow, and embedded real-time systems. For instance, algorithms for these problems are used to compute the cycle period of any cyclic digital system. Without loss of generality, we discuss these algorithms in the context of the minimum cycle mean problem (MCMP). We performed a comprehensive experimental study of ten leading algorithms for MCMP. We programmed these algorithms uniformly and efficiently. We systematically compared them on a test suite composed of random graphs as well as benchmark circuits. Above all, our results provide important insight into the performance of these algorithms in practice. One of the most surprising results of this paper is that Howard's algorithm, known primarily in the stochastic control community, is by far the fastest algorithm on our test suite, although the only known bound on its running time is exponential. We provide two stronger bounds on its running time.
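To make the problem concrete, the sketch below computes the minimum cycle mean with Karp's classical algorithm, one of the textbook baselines such a study would include (our illustration; Howard's algorithm itself is a policy-iteration scheme and is not reproduced here):

```python
def min_cycle_mean(n, edges):
    """Karp's algorithm: minimum mean weight of a directed cycle.

    n: number of vertices (0..n-1); edges: list of (u, v, weight).
    Returns None if the graph is acyclic.
    """
    INF = float("inf")
    # D[k][v] = minimum weight of a walk with exactly k edges ending at v
    D = [[INF] * n for _ in range(n + 1)]
    D[0] = [0.0] * n
    for k in range(1, n + 1):
        for u, v, w in edges:
            if D[k - 1][u] + w < D[k][v]:
                D[k][v] = D[k - 1][u] + w
    best = None
    for v in range(n):
        if D[n][v] == INF:
            continue
        worst = max((D[n][v] - D[k][v]) / (n - k)
                    for k in range(n) if D[k][v] < INF)
        best = worst if best is None else min(best, worst)
    return best

# Two cycles with means 1.5 and 2.0 -- the minimum cycle mean is 1.5.
print(min_cycle_mean(3, [(0, 1, 2.0), (1, 2, 1.0), (2, 0, 3.0), (1, 0, 1.0)]))
```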
IPCHINOOK is a design tool for distributed embedded systems. It gains leverage from the use of a carefully chosen set of design abstractions that raise the level of designer interaction during the specification, synthesis, and simulation of the design. IPCHINOOK focuses on a component-based approach to system building that enhances the ability to reuse existing software modules. This is accomplished through a new model for constructing components that enables composition of control-flow as well as data-flow. The designer then maps the elements of the specification to a target architecture: a set of processing elements and communication channels. IPCHINOOK synthesizes all of the detailed communication and synchronization instructions. Designers get feedback via a co-simulation engine that permits rapid evaluation. By shortening the design cycle, designers are able to more completely explore the design space of possible architectures and/or improve time-to-market. IPCHINOOK is embodied in a system development environment that supports the design methodology by integrating a user interface for system specification, simulation, and synthesis tools. By raising the level of abstraction of specifications above the low-level target-specific implementation, and by automating the generation of these difficult and error-prone details, IPCHINOOK lets designers focus on global architectural and functionality decisions.
One key issue in design flows based on reuse of third-party intellectual property (IP) components is the need to estimate the impact of component instantiation within complex designs. In this paper we introduce JavaCAD, an internet-based EDA tool built on a secure client-server architecture that enables designers to perform simulation and cost estimation of circuits containing IP components without actually purchasing them. At the same time, the tool ensures intellectual property protection for the vendors of IP components, and for the IP-users as well. Moreover, JavaCAD supports negotiation of the amount of information and the accuracy of cost estimates, thereby providing seamless transition between IP evaluation and purchase.
This paper presents a design methodology, called common-case computation (CCC), and new design automation algorithms for optimizing power consumption or performance. The proposed techniques are applicable in conjunction with any high-level design methodology where a structural register-transfer level (RTL) description and its corresponding scheduled behavioral (cycle-accurate functional RTL) description are available. It is a well-known fact that in behavioral descriptions of hardware (also in software), a small set of computations (CCCs) often accounts for most of the computational complexity. However, in hardware implementations (structural RTL or lower level), CCCs and the remaining computations are typically treated alike. This paper shows that identifying and exploiting CCCs during the design process can lead to implementations that are much more efficient in terms of power consumption or performance. We propose a CCC-based high-level design methodology with the following steps: extraction of common-case behaviors and execution conditions from the scheduled description, simplification of the common-case behaviors in a stand-alone manner, synthesis of common-case detection and execution circuits from the common-case behaviors, and composing the original design with the common-case circuits, resulting in a CCC-optimized design. We demonstrate that CCC-optimized designs reduce power consumption by up to 91.5%, or improve performance by up to 76.6% compared to designs derived without special regard for CCCs.
Gate-level voltage scaling is an approach that allows different supply voltages for different gates in order to achieve power reduction. Previous research focused on determining the voltage level for each gate and ascertaining the power saving capability of the approach via logic-level power estimation. In this paper, we present layout techniques that make the approach feasible in a cell-based design environment. A new block layout style is proposed to support voltage scaling with conventional standard cell libraries. The block layout can be automatically generated via a simulated annealing based placement algorithm. In addition, we propose a new cell layout style with built-in multiple supply rails. Using the cell layout, gate-level voltage scaling can be immediately embedded in a typical cell-based design flow. Experimental results show that the proposed techniques produce very promising results.
The advent of portable and high-density devices has made power consumption a critical design concern. In this paper, we address the problem of reducing power consumption via gate-level voltage scaling for those designs that are not under the strictest timing budget. We first use a maximum-weighted independent set formulation for voltage reduction on the non-critical part of the circuit. Then, we use a minimum-weighted separator set formulation to do gate sizing and integrate the sizing procedure with a voltage scaling procedure to enhance power saving on the whole circuit. The proposed methods are evaluated using the MCNC benchmark circuits, and an average power reduction of 19.12% over circuits using only one supply voltage has been achieved.
Dynamic power consumed in CMOS gates goes down quadratically with the supply voltage. By maintaining a high supply voltage for gates on the critical path and by using a low supply voltage for gates off the critical path, it is possible to dramatically reduce power consumption in CMOS VLSI circuits without performance degradation. Interfacing gates operating under multiple supply voltages, however, requires the use of level converters, which makes the problem difficult to model. In this paper we develop a formal model and an efficient heuristic for the use of two supply voltages in low power CMOS VLSI circuits without performance degradation. Power consumption savings of up to 25% over and above the best known existing heuristics are demonstrated for combinational circuits in the ISCAS85 benchmark suite.
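The quadratic dependence the argument rests on is the standard dynamic power relation; the two supply values below are illustrative, not taken from the paper:

\[
P_{\mathrm{dyn}} = \alpha\, C_L\, V_{DD}^{2}\, f_{\mathrm{clk}},
\qquad
\frac{P_{\mathrm{dyn}}(3.3\,\mathrm{V})}{P_{\mathrm{dyn}}(5\,\mathrm{V})} = \left(\frac{3.3}{5}\right)^{2} \approx 0.44 ,
\]

so moving an off-critical gate to the lower rail saves more than half of its dynamic power, at the cost of a level converter wherever it drives a high-voltage gate.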
This paper presents a new method for determining the widths of the power and ground routes in integrated circuits so that the area required by the routes is minimized subject to reliability constraints. The basic idea is to transform the resulting constrained nonlinear programming problem into a sequence of linear programs. Theoretically, we show that the sequence of linear programs always converges to the optimum solution of the relaxed convex problem. Experimental results demonstrate that the sequence-of-linear-programming method is orders of magnitude faster than the best-known method based on conjugate gradients, while consistently producing better optimization solutions.
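A sketch of the kind of nonlinear program involved, in one common formulation (the symbols and constraint set are ours, not necessarily the paper's): with segment widths w_i, lengths l_i, and fixed branch currents I_i,

\[
\min_{w}\; \sum_i l_i\, w_i
\quad \text{s.t.} \quad
\sum_{i \in p} \frac{\rho\, l_i}{w_i}\, |I_i| \le \Delta V_{\max}
\;\; \forall\, \text{paths } p,
\qquad
\frac{|I_i|}{w_i} \le \sigma_{\max} \;\; \forall\, i ,
\]

where the 1/w_i terms make the voltage-drop constraints nonlinear; linearizing them around the current iterate yields the sequence of linear programs.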
We propose a Full-plane AWE Routing with Driver Sizing (FAR-DS) algorithm for performance-driven routing in deep sub-micron technology. We employ a fourth-order AWE delay model in the full plane, including both Hanan and non-Hanan points. Simultaneously optimizing the driver size extends our work into a two-dimensional space, enabling us to achieve the desired balance between wire and driver cost reduction while satisfying the timing constraints. Experimental results show that, compared to SERT, our algorithm provides an average reduction of 23% in wire cost and 50% in driver cost under stringent timing constraints.
Noise, as well as area, delay, and power, is one of the most important concerns in the design of deep submicron ICs. Currently existing algorithms cannot handle simultaneous switching conditions of signals for noise minimization. In this paper, we model not only physical coupling capacitance, but also simultaneous switching behavior for noise optimization. Based on Lagrangian relaxation, we present an algorithm that can optimally solve the simultaneous noise, area, delay, and power optimization problem by sizing circuit components. Our algorithm, with linear memory requirements overall and linear runtime per iteration, is very effective and efficient. For example, for a circuit of 6144 wires and 3512 gates, our algorithm solves the simultaneous optimization problem using only 2.1 MB of memory and 47 minutes of runtime to achieve a precision within 1% error on a SUN UltraSPARC-I workstation.
During the routing of global interconnects, macro blocks form useful routing regions which allow wires to pass through but forbid buffers from being inserted, thereby restricting the possible buffer locations. In this paper, we take these buffer location restrictions into consideration and solve the simultaneous maze routing and buffer insertion problem. Given a block placement defining buffer location restrictions and a pair of pins (a source and a sink), we give a polynomial-time exact algorithm to find a buffered route from the source to the sink with minimum Elmore delay.
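For reference, the Elmore delay objective is the standard RC-tree measure (restated here, not taken from the paper): along the routed path from source s to sink t,

\[
T_{\mathrm{Elmore}}(t) = \sum_{e_k \in \mathrm{path}(s,t)} R_k\, C_k^{\downarrow},
\]

where R_k is the resistance of wire segment e_k and C_k^{↓} is the total capacitance downstream of e_k; an inserted buffer splits the route into stages whose Elmore delays, plus the buffer delays, add up.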
We study the variation of the crosstalk in a net and its neighbors when one of its trunks is perturbed, showing that the trunk's perturbation range can be efficiently divided into subintervals having monotonic or unimodal crosstalk variation. We can therefore determine the optimum trunk location without solving any non-linear equations. Using this, we construct and experimentally verify an algorithm to minimize the peak net crosstalk in a gridless channel.
Asynchronous systems are being viewed as an increasingly viable alternative to purely synchronous systems. This paper gives an overview of the current state of the art in practical asynchronous circuit and system design in four areas: controllers, datapaths, processors, and the design of asynchronous/synchronous interfaces.
A method for automating the synthesis of asynchronous control circuits from high level (CSP-like) and/or partial STG (involving only functionally critical events) specifications is presented. The method solves two key subtasks in this new, more flexible, design flow: handshake expansion, i.e. inserting reset events with maximum concurrency, and event reshuffling under interface and concurrency constraints, by means of concurrency reduction. In doing so, the algorithm optimizes the circuit both for size and performance. Experimental results show a significant increase in the solution space explored when compared to existing CSP-based or STG-based synthesis tools.
This paper describes a novel methodology for high performance asynchronous design based on timed circuits and on CAD support for their synthesis using Relative Timing. This methodology was developed for a prototype iA32 instruction length decoding and steering unit called RAPPID ("Revolving Asynchronous Pentium(R) Processor Instruction Decoder") that was fabricated and tested successfully. Silicon results show significant advantages, in particular a performance of 2.5-4.5 instructions per ns, with manageable risks using this design technology. Compared to a comparable 400 MHz clocked circuit, RAPPID achieves three times the performance and half the latency while dissipating only half the power and incurring a minor area penalty. Relative Timing is based on user-defined and automatically extracted relative timing assumptions between signal transitions in a circuit and its environment. It supports the specification, synthesis, and verification of high-performance asynchronous circuits, such as pulse-mode circuits, that can be derived from an initial speed-independent specification. Relative Timing presents a "middle ground" between clocked and asynchronous circuits, and is a fertile area for CAD development. We discuss possible directions for future CAD development.
We present a novel approach that minimizes the power consumption of embedded core-based systems through hardware/software partitioning. Our approach is based on the idea of mapping clusters of operations/instructions to a core that yields a high utilization rate of the involved resources (ALUs, multipliers, shifters, ...), thereby minimizing power consumption. Our approach is comprehensive since it takes into consideration the power consumption of a whole embedded system comprising a microprocessor core, application-specific (ASIC) core(s), cache cores, and a memory core. We report power consumption reductions of between 35% and 94% at the cost of a relatively small additional hardware overhead of less than 16k cells, while maintaining or even slightly increasing the performance compared to the initial design.
In this paper we present algorithms for the synthesis of encoding and decoding interface logic that minimizes the average number of transitions on heavily-loaded global bus lines. The approach automatically constructs low-transition-activity codes and hardware implementations of encoders and decoders, given information on word-level statistics. We present an accurate method that is applicable to low-width buses, as well as approximate methods that scale well with bus width. Furthermore, we introduce an adaptive architecture that automatically adjusts the encoding to reduce transition activity on buses whose word-level statistics are not known a priori. Experimental results demonstrate that our approach significantly outperforms low-power encoding schemes presented in the past.
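As a concrete illustration of the accurate, low-width flavor of encoder construction (our own sketch, not the paper's algorithm), the code below exhaustively picks a codeword assignment that minimizes the expected number of bus transitions under given word-level pair statistics:

```python
from itertools import permutations

def hamming(a, b):
    return bin(a ^ b).count("1")

def best_encoding(pair_prob, width):
    """Exhaustively choose codewords (a subset of the 2**width patterns)
    minimizing expected bus transitions, given joint statistics
    pair_prob[(u, v)] = Pr(previous word = u, next word = v).
    Feasible only for low-width buses, which is the regime the
    accurate method above targets.
    """
    symbols = sorted({s for pair in pair_prob for s in pair})
    best_cost, best_map = None, None
    for perm in permutations(range(2 ** width), len(symbols)):
        enc = dict(zip(symbols, perm))
        cost = sum(p * hamming(enc[u], enc[v])
                   for (u, v), p in pair_prob.items())
        if best_cost is None or cost < best_cost:
            best_cost, best_map = cost, enc
    return best_map, best_cost

# Toy 2-bit bus: 'a' and 'b' alternate frequently, so they get adjacent codes.
stats = {("a", "b"): 0.6, ("b", "a"): 0.2, ("a", "c"): 0.1, ("c", "a"): 0.1}
print(best_encoding(stats, width=2))
```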
Power efficient design of real-time systems based on programmable processors becomes more important as system functionality is increasingly realized through software. This paper presents a power-efficient version of a widely used fixed priority scheduling method. The method yields a power reduction by exploiting slack times, both those inherent in the system schedule and those arising from variations of execution times. The proposed run-time mechanism is simple enough to be implemented in most kernels. Experimental results show that the proposed scheduling method obtains a significant power reduction across several kinds of applications.
In embedded system design, the designer has to choose an on-chip memory configuration that is suitable for a specific application. To aid in this design choice, we present a memory exploration strategy based on three performance metrics, namely, cache size, the number of processor cycles, and the energy consumption. We show how the performance is affected by cache parameters such as cache size, line size, set associativity and tiling, and the off-chip data organization. We show the importance of including energy in the performance metrics, since an increase in the cache line size, cache size, tiling and set associativity reduces the number of cycles but does not necessarily reduce the energy consumption. These performance metrics help us find the minimum-energy cache configuration if time is the hard constraint, or the minimum-time cache configuration if energy is the hard constraint.
Keywords: Design automation, Low power design, Memory hierarchy, Low power embedded systems, Memory exploration and optimization, Cache simulator, Off-chip data assignment.
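The exploration loop sketched below illustrates the strategy in the abstract above: sweep cache parameters, collect both cycle and energy estimates, and keep the minimum-energy configuration that meets a cycle budget. The per-access cycle and energy numbers, the toy LRU cache model, and the trace are illustrative assumptions, not values from the paper.

```python
from itertools import product

CYCLES = {"hit": 1, "miss": 20}          # assumed costs, for illustration only
ENERGY_NJ = {"hit": 0.5, "miss": 5.0}

def simulate_cache(trace, size_kb, line_bytes, assoc):
    """Toy set-associative cache with LRU replacement (byte addresses)."""
    n_sets = max(1, (size_kb * 1024) // (line_bytes * assoc))
    sets = [[] for _ in range(n_sets)]
    hits = misses = 0
    for addr in trace:
        line = addr // line_bytes
        s = sets[line % n_sets]
        if line in s:
            hits += 1
            s.remove(line)
        else:
            misses += 1
            if len(s) >= assoc:
                s.pop(0)                  # evict the least recently used line
        s.append(line)                    # most recently used at the tail
    return hits, misses

def explore(configs, trace, cycle_budget):
    """Minimum-energy configuration among those meeting the cycle budget."""
    feasible = []
    for cfg in configs:
        hits, misses = simulate_cache(trace, **cfg)
        cycles = hits * CYCLES["hit"] + misses * CYCLES["miss"]
        energy = hits * ENERGY_NJ["hit"] + misses * ENERGY_NJ["miss"]
        if cycles <= cycle_budget:
            feasible.append((energy, cycles, cfg))
    return min(feasible, key=lambda t: (t[0], t[1])) if feasible else None

configs = [dict(size_kb=s, line_bytes=l, assoc=a)
           for s, l, a in product([1, 2, 4, 8], [16, 32, 64], [1, 2, 4])]
trace = [i % 4096 for i in range(0, 20000, 8)]   # synthetic access pattern
print(explore(configs, trace, cycle_budget=10000))
```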
Distributed computing has taken on a new importance in order to meet the requirements of users demanding information "anytime, anywhere." Inferno facilitates the creation and support of distributed services in the new and emerging world of network environments. These environments include a world of varied terminals, network hardware, and protocols. The Namespace is a critical Inferno concept that enables the participants in this network environment to deliver resources to meet the many needs of diverse users. This paper discusses the elements of the Namespace technology. Its simple programming model and network transparency are demonstrated through the design of an application that can have components on several different nodes in a network. The simplicity and flexibility of the solution are highlighted.
Keywords: Inferno, InfernoSpaces, distributed applications, Styx, networking protocols.
You read about it everywhere: distributed computing is the next revolution, perhaps relegating our desktop computers to the museum. But in fact the age of distributed computing has been around for quite a while. Every time we withdraw money from an ATM, start our car, use our cell phone, or microwave our dinner, microprocessors are at work performing dedicated functions. These are just a few examples of the thousands of "embedded systems." Until recently the vast majority of these embedded systems used 8- and 16-bit microprocessors, requiring little in the way of sophisticated software development tools, including an operating system (OS). But the breaking of the $5 threshold for 32-bit processors is now driving an explosion in high-volume embedded applications. And a new trend towards integrating a full system-on-a-chip (SOC) promises a further dramatic expansion for 32-bit embedded applications as we head into the 21st century.
This paper gives an overview of the Jini(TM) architecture, which provides a federated infrastructure for dynamic services in a network. Services may be large or small.
Keywords: Jini, Java, networks, distribution, distributed computing
The complexity of large-scale multiprocessors has burdened the design and verification process making complexity-effective functional verification an elusive goal. We propose a solution to the verification of complex systems by introducing an abstracted verification environment called Raven. We show how Raven uses standard C/C++ to extend the capability of contemporary discrete-event logic simulators. We introduce new data types and a diagnostic programming interface (DPI) that provide the basis for Raven. Finally, we show results from an interconnect router ASIC used in a large-scale multiprocessor.
The Advanced VLIW architecture of the Equator MAP1000 processor has many features that present significant verification challenges. We describe a functional verification methodology to address this complexity. In particular, we present an efficient method to generate directed assembly tests and a novel technique using the processor itself to control self-tests and check the results at speed using native instructions only. We also describe the use of emulation in both pre-silicon and post-silicon verification stages.
In this paper, we demonstrate a method for generating assembler test programs that systematically probe the microarchitecture of a PowerPC superscalar processor. We show innovations such as ways to build small models of large designs, to predict with cycle accuracy the movement of instructions through the pipes (taking into account stalls and dependencies), and to generate test programs such that each reaches a new microarchitectural state. We compare our method to the established practice of massive random generation and show that the quality of our tests, as measured by transition coverage, is much higher. The main contribution of this paper is not in theory, as the theory has been discussed in previous papers, but in describing how to translate this theory into practice, a task that was far from trivial.
In this paper, we describe a fast and convenient verification methodology for microprocessors that uses large, real application programs as test vectors. The verification environment is based on automatic consistency checking between the golden behavioral reference model and the target HDL model, which are run in a hand-shaking fashion. In conjunction with the automatic comparison facility, a new HDL saver is proposed to accelerate the verification process. The proposed saver allows a 'restart' from the nearest checkpoint before the point of inconsistency detection, regardless of whether any modification of the source code has been made. This contrasts with conventional savers, which do not allow a restart after a design change or debugging modification. We have proved the effectiveness of the environment by applying it to a real-world example, i.e., a Pentium-compatible processor design process. It was shown that HDL verification with the proposed saver can be faster and more flexible than the hardware emulation approach. In short, it was demonstrated that restartability with source code modification capability is very important in obtaining short debugging turnaround times by eliminating a large number of redundant simulations.
This paper addresses test generation for design verification of pipelined microprocessors. To handle the complexity of these designs, our algorithm integrates high-level treatment of the datapath with low-level treatment of the controller, and employs a novel "pipe-frame" organization that exploits high-level knowledge about the operation of pipelines. We have implemented the proposed algorithm and used it to generate verification tests for design errors in a representative pipelined microprocessor.
Keywords: design verification, sequential test generation, high-level test generation, pipelined microprocessors.
This paper describes the efforts made and the results of creating an Architecture Validation Suite for the PowerPC architecture. Although many functional test suites are available for multiple architectures, little has been published on how these suites are developed and how their quality should be measured. This work provides some insights for approaching the difficult problem of building a high quality functional test suite for a given architecture. By defining a set of generic coverage models that combine program-based, specification-based, and sequential bug-driven models, it establishes the groundwork for the development of architecture validation suites for any architecture.
This paper studies a projection technique based on block Krylov subspaces for the computation of reduced-order models of multiport RLC circuits. We show that these models are always passive, yet they still match at least half as many moments as the corresponding reduced-order models based on matrix-Padé approximation. For RC, RL, and LC circuits, the reduced-order models obtained by projection and matrix-Padé approximation are identical. For general RLC circuits, we show how the projection technique can easily be incorporated into the SyMPVL algorithm to obtain passive reduced-order models, in addition to the high-accuracy matrix-Padé approximations that characterize SyMPVL, at essentially no extra computational cost. Connections between SyMPVL and the recently proposed reduced-order modeling algorithm PRIMA are also discussed. Numerical results for interconnect simulation problems are reported.
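For orientation, the passivity-preserving projection works by congruence, in PRIMA-style notation (a sketch of the standard setup rather than the exact SyMPVL formulation): with circuit equations C dx/dt = -G x + B u, y = L^T x and V an orthonormal basis of a block Krylov subspace,

\[
\tilde{C} = V^{T} C V, \qquad \tilde{G} = V^{T} G V, \qquad \tilde{B} = V^{T} B, \qquad \tilde{L} = V^{T} L ,
\]

and because congruence preserves C ⪰ 0 and G + G^T ⪰ 0 for RLC-generated matrices, the reduced model is passive by construction.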
As interconnect feature sizes continue to scale to smaller dimensions, long interconnect can dominate the IC timing performance, but the interconnect parameter variations make it difficult to predict these dominant delay extremes. This paper presents a model order-reduction technique for RLC interconnect circuits that includes variational analysis to capture manufacturing variations. Matrix perturbation theory is combined with dominant-pole-analysis and Krylov-subspace-analysis methods to produce reduced-order models with direct inclusion of statistically independent manufacturing variations. The accuracy of the resulting variational reduced-order models is demonstrated on several industrial examples.
The problem of computing rational function approximations to tabulated frequency data is of paramount importance in the modeling arena. In this paper we present a method for generating a state space model from tabular data in the frequency domain that solves some of the numerical difficulties associated with the traditional fitting techniques used in linear least squares approximations. An extension to the MIMO case is also derived.
High-level synthesis operates on internal models known as control/data flow graphs (CDFG) and produces a register-transfer-level (RTL) model of the hardware implementation for a given schedule. For high-level synthesis to be efficient it has to estimate the effect that a given algorithmic decision (e.g., scheduling, allocation) will have on the final hardware implementation (after logic synthesis). Currently, this effect cannot be measured accurately because the CDFGs are very distinct from the RTL/gate-level models used by logic synthesis, precluding interaction between high-level and logic synthesis. This paper presents a solution to this problem consisting of a novel internal model for synthesis which spans the domains of high-level and logic synthesis. This model is an RTL/gate-level network capable of representing all possible schedules that a given behavior may assume. This representation allows high-level synthesis algorithms to be formulated as logic transformations and effectively interleaved with logic synthesis.
In this paper, we establish a theoretical framework for a new concept of scheduling called soft scheduling. In contrast to traditional schedulers, referred to as hard schedulers, soft schedulers make soft decisions, i.e., decisions that can be adjusted later. Soft scheduling has the potential to alleviate the phase coupling problem that has plagued traditional high-level synthesis (HLS), HLS for deep submicron design, and VLIW code generation. We then develop a specific soft scheduling formulation, called threaded schedule, for which a linear-time algorithm that is optimal (in the sense of online optimality) is guaranteed.
Finding the minimum column multiplicity for a bound set of variables is an important problem in Curtis decomposition. To investigate this problem, we compared two graph-coloring programs: one exact, and one based on heuristics which can nevertheless give provably exact results on some types of graphs. These programs were incorporated into the multi-valued decomposer MVGUD. We proved that exact graph coloring is not necessary for high-quality functional decomposers. Thus we improved the speed of solving the column multiplicity problem by orders of magnitude, with very little or no sacrifice of decomposition quality. Comparison of our experimental results with competing decomposers shows that for nearly all benchmarks our solutions are the best, with run times that are usually modest.
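A minimal sketch of the heuristic side of such a comparison (our own illustration, not the programs used in the paper): a largest-degree-first greedy coloring of the column incompatibility graph, whose color count upper-bounds the column multiplicity.

```python
def greedy_coloring(vertices, edges):
    """Largest-degree-first greedy coloring of an incompatibility graph."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for v in sorted(vertices, key=lambda x: -len(adj[x])):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# Columns of a bound-set table; an edge marks two incompatible columns.
cols = ["c0", "c1", "c2", "c3"]
incompatible = [("c0", "c1"), ("c1", "c2"), ("c2", "c0")]
print(max(greedy_coloring(cols, incompatible).values()) + 1)   # 3 colors
```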
The application of retiming and clock skew scheduling for improving the operating speed of synchronous circuits under setup and hold constraints is investigated in this paper. It is shown that when both long and short paths are considered, circuits optimized by the simultaneous application of retiming and clock scheduling can achieve shorter clock periods than optimized circuits generated by applying either of the two techniques separately. A mixed-integer linear programming formulation and an efficient heuristic are given for the problem of simultaneous retiming and clock skew scheduling under setup and hold constraints. Experiments with benchmark circuits demonstrate the efficiency of this heuristic and the effectiveness of the combined optimization. All of the test circuits show improvement. For more than half of them, the maximum operating speed increases by more than 21% over the optimized circuits obtained by applying retiming or clock skew scheduling separately.
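For context, the setup and hold constraints that such a formulation builds on are the standard ones (restated here; the paper's mixed-integer formulation additionally carries integer retiming variables): for a register-to-register path from i to j with maximum delay D_ij and minimum delay d_ij, clock period T, and skews s_i, s_j,

\[
s_i + D_{ij} + t_{\mathrm{setup}} \le s_j + T \quad\text{(setup)},
\qquad
s_i + d_{ij} \ge s_j + t_{\mathrm{hold}} \quad\text{(hold)} .
\]

Retiming changes which register-to-register paths exist and hence which (D_ij, d_ij) pairs appear, which is what couples the two optimizations.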
Retiming is an optimization technique for synchronous circuits introduced by Leiserson and Saxe in 1983. Although powerful, retiming is not very widely used because it does not satisfactorily handle circuits whose registers have load-enable, synchronous, and asynchronous set/clear inputs. We propose an extension of retiming whose basis is the classification of registers into register classes. The new approach, called multiple-class retiming, handles circuits with an arbitrary number of register classes. We present results on a set of industrial FPGA designs showing the effectiveness and efficiency of multiple-class retiming.
We present a novel approach to performance optimization by integrating retiming and resynthesis. The approach is oblivious of register boundaries during resynthesis. In addition, it guides resynthesis by a criterion that is directly tied to the performance target. The proposed approach obtains provable results. Experimental results further demonstrate the effectiveness of our approach.
Sequential logic optimization based on the extraction of computational kernels has proved to be very promising when the target is power minimization. Efficient extraction of the kernels is at the basis of the optimization paradigm; the extraction procedures proposed so far exploit common logic synthesis transformations, and thus assume the availability of a gate-level description of the circuit being optimized. In this paper we present exact and approximate algorithms for the automatic extraction of computational kernels directly from the functional specification of a RTL component. We show the effectiveness of such algorithms by reporting the results of an extensive experimentation we have carried out on a large set of standard benchmarks, as well as on some designs with known functionality.
It is generally believed that there will be little more variety in CPU architectures, and thus the design of Instruction-Set Architectures (ISAs) will have no role in the future of embedded CPU design. Nonetheless, it is argued in this paper that architectural variety will soon again become an important topic, with the major motivation being increased performance due to the customization of CPUs to their intended use. Five major barriers that could hinder customization are described, including the problems of existing binaries, toolchain development and maintenance costs, lost savings/higher chip cost due to the lower volumes of customized processors, added hardware development costs, and some factors related to the product development cycle for embedded products. Each is discussed, along with potential, sometimes surprising, solutions.
Keywords: Embedded processors, custom processors, instruction-level parallelism, VLIW, mass customization of toolchains
Operating systems and development tools can impose overly general requirements that prevent an embedded system from achieving its hardware performance entitlement. It is time for embedded processor designers to become more involved with system software and tools.
Keywords: Digital signal processors, instruction set architecture, compiler, real-time operating system, software configuration.
In this paper, we present a complete chip design method which incorporates a soft-macro resynthesis method in interaction with chip floorplanning for area and timing improvements. We develop a timing-driven design flow to exploit the interaction between HDL synthesis and physical design tasks. During each design iteration, we resynthesize soft macros with either a relaxed or a tightened timing constraint which is guided by the post-layout timing information. The goal is to produce area-efficient designs while satisfying the timing constraints. Experiments on a number of industrial designs have demonstrated that by effectively relaxing the timing constraint of the non-critical modules and tightening the timing constraint of the critical modules, a design can achieve 13% to 30% timing improvements with little to no increase in chip area.
We present an ordered tree, O-tree, structure to represent non-slicing floorplans. The O-tree uses only n(2 + ceil(lg n)) bits for a floorplan of n rectangular blocks. We define an admissible placement as a compacted placement in both the x and y directions. For each admissible placement, we can find an O-tree representation. We show that the number of possible O-tree combinations is O(n! 2^(2n-2) / n^1.5). This is very concise compared to a sequence-pair representation, which has O((n!)^2) combinations. The approximate ratio of sequence-pair to O-tree combinations is O(n^2 (n/4e)^n). The complexity of the O-tree is even smaller than that of a binary tree structure for slicing floorplans, which has O(n! 2^(5n-3) / n^1.5) combinations. Given an O-tree, it takes only linear time to construct the placement and its constraint graph. We have developed a deterministic floorplanning algorithm utilizing the structure of the O-tree. Empirical results on MCNC benchmarks show promising performance with an average 16% improvement in wire length and 1% less dead space over a previous CPU-intensive cluster refinement method.
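The stated ratio can be checked with Stirling's approximation (a worked verification, not taken from the paper):

\[
\frac{(n!)^2}{\,n!\,2^{2n-2}/n^{1.5}\,}
= \frac{n!\,n^{1.5}}{2^{2n-2}}
\approx \frac{\sqrt{2\pi n}\,(n/e)^{n}\,n^{1.5}}{2^{2n-2}}
= O\!\left(n^{2}\left(\frac{n}{4e}\right)^{n}\right).
\]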
This paper addresses the problem of device-level placement for analog layout. Different from most of the existing approaches, which employ simulated annealing optimization algorithms operating on flat Gelatt-Jepsen spatial representations [2], we use a more recent topological representation called the sequence-pair [7], which has the advantage of not being restricted to slicing floorplan topologies. In this paper, we explain how specific features essential to analog placement, such as the ability to deal with symmetry and device-matching constraints, can be easily handled by employing the sequence-pair representation. Several analog examples substantiate the effectiveness of our placement tool, which is already in use in an industrial environment.
Our problem consists of a partially ordered set of tasks communicating over a shared bus which are to be mapped to a heterogeneous multiprocessor system. The goal is to minimize the makespan while satisfying constraints implied by data dependencies and exclusive resource usage. We present a new efficient heuristic approach based on list scheduling and genetic algorithms, which finds the optimum in a few seconds on average, even for large examples (up to 96 tasks) taken from [3]. The superiority of our algorithm compared to several other algorithms is demonstrated.
Keywords: heterogeneous system design, heuristic, genetic algorithms, list scheduling
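The decoding step of such a combined approach can be sketched as follows (our own illustration under simplifying assumptions: communication over the shared bus is ignored, and the genetic algorithm is represented only by the priority order it would evolve). A genetic algorithm would then search over the order, evaluating each chromosome with this decoder.

```python
def list_schedule(deps, exec_time, order):
    """Greedy list scheduling of tasks on heterogeneous processors.

    deps[t]         -- set of predecessor tasks of t
    exec_time[t][p] -- runtime of task t on processor p
    order           -- task priority list (e.g. a GA chromosome)
    Returns (makespan, processor assignment).
    """
    n_proc = len(next(iter(exec_time.values())))
    proc_free = [0.0] * n_proc
    finish, assignment = {}, {}
    remaining = list(order)
    while remaining:
        # highest-priority task whose predecessors have all finished
        t = next(t for t in remaining if deps[t] <= finish.keys())
        remaining.remove(t)
        ready = max((finish[d] for d in deps[t]), default=0.0)
        # place the task where it would finish earliest
        best_p = min(range(n_proc),
                     key=lambda p: max(ready, proc_free[p]) + exec_time[t][p])
        start = max(ready, proc_free[best_p])
        finish[t] = start + exec_time[t][best_p]
        proc_free[best_p] = finish[t]
        assignment[t] = best_p
    return max(finish.values()), assignment

exec_time = {"a": [2, 3], "b": [4, 2], "c": [3, 3]}
deps = {"a": set(), "b": set(), "c": {"a", "b"}}
print(list_schedule(deps, exec_time, order=["a", "b", "c"]))
```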
This paper presents a new scheduling algorithm that maximizes the performance of a design under resource constraints in high-level synthesis. The algorithm tries to achieve maximal utilization of resources and minimal waste of clock slack time. Moreover, it exploits the technique of bit-level chaining to target high-speed designs. The algorithm tries non-integer multiple-cycling and chaining, which allows multiple-cycle execution of chained operations, to further increase the performance at the cost of a small increase in the complexity of the control unit. Experimental results on several datapath-intensive designs show significant improvement in execution time over conventional scheduling algorithms.
This paper presents a technique for highly constrained event sequence scheduling. System resource protocols as well as an external interface protocol are described by non-deterministic finite automata (NFA). All valid schedules which adhere to interfacing constraints and resource bounds for flow-graph-described behavior are determined exactly. A model and scheduling results are presented for an extensive design example.
Keywords: Interface protocols, protocol-constrained scheduling, automata.
Emerging design problems are prompting the use of code motion and speculative execution in high-level synthesis to shorten schedules and meet tight time-constraints. However, some code motions are not worth doing from a worst-case execution perspective. We propose a technique that selects the most promising code motions, thereby increasing the density of optimal solutions in the search space.
Although model checking is an exhaustive formal verification method, a bug can still escape detection if the erroneous behavior does not violate any verified property. We propose a coverage metric to estimate the "completeness" of a set of properties verified by model checking. A symbolic algorithm is presented to compute this metric for a subset of the CTL property specification language. It has the same order of computational complexity as a model checking algorithm. Our coverage estimator has been applied in the course of some real-world model checking projects. We uncovered several coverage holes including one that eventually led to the discovery of a bug that escaped the initial model checking effort.
Symbolic techniques have undergone major improvements in the last few years. Nevertheless they are still limited by the size of the involved BDDs, and extending their applicability to larger and real circuits is a key issue. Within this framework, we introduce "activity profiles" as a novel technique to characterize transition relations. In our methodology a learning phase is used to collect activity measures, related to time and space cost, for each BDD node of the transition relation. We use inexpensive reachability analysis as the learning technique, and we operate within the inner steps of image computations involving the transition relation and state sets. This information can be used for several purposes. In particular, we present an application of activity profiles in the field of reachability analysis itself. We propose transition relation subsetting and partial traversals of the state transition graph. We show that a sequence of partial traversals is able to complete a reachability analysis problem with smaller memory requirements and improved time performance.
Approximate reachability techniques trade off accuracy for the capacity to deal with bigger designs. Cho et al. [4] proposed partitioning the set of state bits into mutually disjoint subsets and doing symbolic forward reachability on the individual subsets to obtain an over-approximation of the reachable state set. Recently, this was improved upon [7] by dividing the set of state bits into various subsets that could possibly overlap, and doing symbolic reachability over the overlapping subsets. In this paper, we further improve on this scheme by augmenting the set of state variables with auxiliary state variables. These auxiliary state variables are added to capture some important internal conditions in the combinational logic. Approximate symbolic forward reachability on overlapping subsets of this augmented set of state variables yields much tighter approximations than earlier methods.
In this paper, we study the application of propositional decision procedures in hardware verification. In particular, we apply bounded model checking, as introduced in [1], to equivalence and invariant checking. We present several optimizations that reduce the size of generated propositional formulas. In many instances, our SAT-based approach can significantly outperform BDD-based approaches. We observe that SAT-based techniques are particularly efficient in detecting errors in both combinational and sequential designs.
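For reference, the bounded model checking formulation of [1] reduces invariant checking to satisfiability of an unrolled formula of roughly the following shape (notation simplified; the optimizations described above shrink this formula):

\[
\mathrm{BMC}_k(P) \;=\; I(s_0)\;\wedge\;\bigwedge_{i=0}^{k-1} T(s_i, s_{i+1})\;\wedge\;\bigvee_{i=0}^{k} \neg P(s_i),
\]

which is satisfiable exactly when the invariant P can be violated within k steps; combinational and sequential equivalence checking arise as the special case where P asserts that the two designs produce equal outputs.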
We present a framework for rapidly exploring the design space of low power application-specific programmable processors (ASPP), in particular media processors. We focus on a category of processors that are programmable yet optimized to reduce power consumption for a specific set of applications. The key components of the framework presented in this paper are a retargetable instruction level parallelism (ILP) compiler, processor simulators, a set of complete media applications written in a high level language and an architectural component selection algorithm. The fundamental idea behind the framework is that with the aid of a retargetable ILP compiler and simulators it is possible to arrange architectural parameters (e.g., the issue width, the size of cache memory units, the number of execution units, etc.) to meet low power design goals under area constraints.
Successful exploration of system-level design decisions is impossible without fast and accurate estimation of the impact on the system cost. In most multimedia applications, the dominant cost factor is related to the organization of the memory architecture. This paper presents a systematic approach which allows effective system-level exploration of memory organization design alternatives, based on accurate feedback by using our earlier developed tools. The effectiveness of this approach is illustrated on an industrial application. Applying our approach, a substantial part of the design search space has been explored in a very short time, resulting in a cost-efficient solution which meets all design constraints.
The realization of new MPEG-4 functionality, applicable to 3D graphics texture compression and image database access over the Internet, is demonstrated in a PC-based compression system. Applying our system-level design methodologies effectively removes all implementation bottlenecks. A first-of-a-kind ASIC, called Ozone, accelerates the Embedded Zero Tree based encoding and is capable of compressing 30 color CIF images per second.
A fully digital QAM16 burst receiver ASIC is presented. The BO4 receiver demodulates at 10 Mbit/s and uses an advanced signal processing architecture that performs per-burst automatic equalization. It is a critical building block in a broadband access system for HFC networks. The chip was designed using a C++ based flow and is implemented as an 80-Kgate, 0.7-micron CMOS standard cell design.
In this paper, we present a new multilevel k-way hypergraph partitioning algorithm that substantially outperforms the existing state-of-the-art K-PM/LR algorithm for multi-way partitioning, both for optimizing local as well as global objectives. Experiments on the ISPD98 benchmark suite show that the partitionings produced by our scheme are on average 15% to 23% better than those produced by the K-PM/LR algorithm, both in terms of the hyperedge cut as well as the (K-1) metric. Furthermore, our algorithm is significantly faster, requiring 4 to 5 times less time than that required by K-PM/LR.
We illustrate how technical contributions in the VLSI CAD partitioning literature can fail to provide one or more of: (i) reproducible results and descriptions, (ii) an enabling account of the key understanding or insight behind a given contribution, and (iii) experimental evidence that is not only contrasted with the state-of-the-art, but also meaningful in light of the driving application. Such failings can lead to reporting of spurious and misguided conclusions. For example, new ideas may appear promising in the context of a weak experimental testbed, but in reality do not advance the state of the art. The resulting inefficiencies can be detrimental to the entire research community. We draw on several models (chiefly from the metaheuristics community) [5] for experimental research and reporting in the area of heuristics for hard problems, and suggest that such practices can be adopted within the VLSI CAD community. Our focus is on hypergraph partitioning.
We empirically assess the implications of fixed terminals for hypergraph partitioning heuristics. Our experimental testbed incorporates a leading-edge multilevel hypergraph partitioner [14] [3] and IBM-internal circuits that have recently been released as part of the ISPD-98 Benchmark Suite [2, 1]. We find that the presence of fixed terminals can make a partitioning instance considerably easier (possibly to the point of being "trivial"): much less effort is needed to stably reach solution qualities that are near best-achievable. Toward development of partitioning heuristics specific to the fixed-terminals regime, we study the pass statistics of flat FM-based partitioning heuristics. Our data suggest that with more fixed terminals, the improvements in a pass are more likely to occur near the beginning of the pass. Restricting the length of passes - which degrades solution quality in the classic (free-hypergraph) context - is relatively safe for the fixed-terminals regime and considerably reduces run time of our FM-based heuristic implementations. We believe that the distinct nature of partitioning in the fixed-terminals regime has deep implications (i) for the design and use of partitioners in top-down placement, (ii) for the context in which VLSI hypergraph partitioning research is pursued, and (iii) for the development of new benchmark instances for the research community.
This paper presents two primary results relevant to physical design problems in CAD/VLSI through a case study of the linear placement problem. First, a local search mechanism which incorporates a neighborhood operator based on constraint relaxation is proposed. The strategy exhibits many of the desirable features of analytical placement while retaining the flexibility and non-determinism of local search. The second and orthogonal contribution is in netlist clustering. We characterize local optima in the linear placement problem through a simple visualization tool, the displacement graph. This characterization reveals the relationship between clusters and local optima and motivates a dynamic clustering scheme designed specifically for escaping such local optima. Promising experimental results are reported.
Variants of delay-cost functions have been used in a class of technology mapping algorithms [1, 2, 3, 4]. We illustrate that in an industrial environment the delay-cost function can grow unboundedly and lead to very large run times. The key contribution of this work is a novel bounded compression algorithm. We introduce the concept of an alpha delay-cost curve (alpha-DC-curve) that requires up to exponentially fewer delay-cost points to be stored compared to the delay function. We prove that the solution obtained by this exponential compaction of the delay function is within alpha% of the optimal solution. We also suggest a large set of CAD applications which may benefit from using the alpha-DC-curve. Finally, we demonstrate the effectiveness of our compaction scheme on one such application, namely technology mapping for low power. Experimental results in an industrial environment show that we are more than 17 times faster than [2] on certain MCNC circuits.
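One simple way to realize such bounded compression is geometric pruning of the Pareto curve, sketched below (our own illustration; the paper's alpha-DC-curve construction and its exact guarantee may differ). Keeping one representative per geometric delay bucket leaves at most about log base (1 + alpha/100) of d_max/d_min points, which is the source of the exponential reduction.

```python
def compress_dc_curve(points, alpha):
    """Prune a Pareto delay-cost curve so that every dropped point (d, c)
    has a kept point (d2, c2) with d2 <= d * (1 + alpha/100) and c2 <= c.

    points: list of (delay, cost) Pareto points (lower is better for both).
    """
    pts = sorted(points)                  # increasing delay => decreasing cost
    kept = []
    bucket_start, rep = pts[0][0], pts[0]
    for p in pts[1:]:
        if p[0] > bucket_start * (1.0 + alpha / 100.0):
            kept.append(rep)              # cheapest point of the closed bucket
            bucket_start = p[0]
        rep = p                           # Pareto order: later points are cheaper
    kept.append(rep)
    return kept

curve = [(1.0, 40), (1.05, 38), (1.1, 37), (2.0, 20), (2.05, 19), (4.0, 5)]
print(compress_dc_curve(curve, alpha=10))   # [(1.1, 37), (2.05, 19), (4.0, 5)]
```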
In this paper we study the technology mapping problem for FPGAs with nonuniform pin delays and fast interconnects. We develop the PinMap algorithm to compute the delay-optimal mapping solution for FPGAs with nonuniform pin delays in polynomial time based on efficient cut enumeration. Compared with FlowMap [5], which does not consider the nonuniform pin delays, PinMap is able to reduce the circuit delay by 15% without any area penalty. For mapping with fast interconnects, we present two algorithms: an iterative refinement based algorithm, named ChainMap, and a Boolean matching based algorithm, named HeteroBM, which combines the Boolean matching techniques proposed in [2] and [3] and the heterogeneous technology mapping mechanism presented in [1]. It is shown that both ChainMap and HeteroBM are able to significantly reduce the circuit delay by making efficient use of the FPGA fast interconnect resources.
High performance circuit techniques such as domino logic have migrated from the microprocessor world into more mainstream ASIC designs. The problem is that domino logic comes at a heavy cost in terms of total power dissipation. For mobile and portable devices such as laptops and cellular phones, high power dissipation is an unacceptable price to pay for high performance. Hence, we study synthesis techniques that allow designers to take advantage of the speed of domino circuits while at the same time minimizing total power consumption. Specifically, in this paper we present three results related to automated phase assignment for the synthesis of low power domino circuits: (1) we demonstrate that the choice of phase assignment at the primary outputs of a circuit can significantly impact power dissipation in the domino block; (2) we propose a method for efficiently estimating power dissipation in a domino circuit; and (3) we apply the method to determine a phase assignment that minimizes power consumption in the final circuit implementation. Preliminary experimental results on a mixture of public domain benchmarks and real industry circuits show potential power savings as high as 34% over the minimum-area realization of the logic. Furthermore, the low power synthesized circuits still meet timing constraints.
We introduce SImulation Verification with Augmentation (SIVA), a tool for checking safety properties on digital hardware designs. SIVA integrates simulation with symbolic techniques for vector generation. Specifically, the core algorithm uses a combination of ATPG and BDDs to generate input vectors which cover behavior not excited by simulation. Experimental results demonstrate considerable improvement in state space coverage compared with either simulation or formal verification in isolation.
Keywords: Formal verification, ATPG, simulation, BDDs, coverage.
Symbolic methods are often considered the state-of-the-art technique for validating digital circuits. Due to their complexity and unpredictable run-time behavior, however, their potential is currently limited to small-to-medium circuits. Logic simulation privileges capacity: it is nicely scalable, flexible, and has predictable run-time behavior. For this reason, it is the common choice for validating large circuits. Simulation, however, typically visits only a small fraction of the state space: the discovery of bugs heavily relies on the expertise of the designer of the test stimuli. In this paper we consider a symbolic simulation approach to the validation problem. Our objective is to trade off between formal and numerical methods in order to simulate a circuit with a "very large number" of input combinations and sequences in parallel. We demonstrate larger capacity with respect to symbolic techniques and better efficiency with respect to cycle-based simulation. We show that it is possible to symbolically simulate very large trace sets in parallel (over 100 symbolic inputs) for the largest ISCAS benchmark circuits, using 96 Mbytes of memory.
We study the applicability of the logic of Positive Equality with Uninterpreted Functions (PEUF) [2][3] to the verification of pipelined microprocessors with very large Instruction Set Architectures (ISAs). Abstraction of memory arrays and functional units is employed, while the control logic of the processors is kept intact from the original gate-level designs. PEUF is an extension of the logic of Equality with Uninterpreted Functions, introduced by Burch and Dill [4], that allows us to use distinct constants for the data operands and instruction addresses needed in the symbolic expression for the correctness criterion. We present several techniques that make PEUF scale very efficiently for the verification of pipelined microprocessors with large ISAs. These techniques are based on allowing a limited form of non-consistency in the uninterpreted functions, representing initial memory state and ALU behaviors. Our tool required less than 30 seconds of CPU time and 5 MB of memory to verify a 5-stage MIPS-like pipelined processor that implements 191 instructions of various classes. The verification was done by correspondence checking - a formal method, where a pipelined microprocessor is compared against a non-pipelined specification.
We describe the use of parametric representations of Boolean predicates to encode data-space constraints and significantly extend the capacity of formal verification. The constraints are used to decompose verifications by sets of case splits and to restrict verifications by validity conditions. Our technique is applicable to any symbolic simulator. We illustrate our technique on state-of-the-art Intel (R) designs, without removing latches or modifying the circuits in any way.
Vertical benchmarks are complex system designs represented at multiple levels of abstraction. More effective than component-based CAD benchmarks, vertical benchmarks enable quantitative comparison of CAD techniques within or across design flows. This work describes the notion of vertical benchmarks and presents our benchmark, which is based on a commercial DSP, by comparing two alternative design flows.
Much effort in hardware/software co-design has been devoted to developing "push-button" types of tools for automatic hardware/software partitioning. However, given the highly complex nature of embedded system design, user-guided design exploration can be more effective. In this paper, we propose a framework for designer-assisted partitioning that can be used in conjunction with any given search strategy. A key component of this framework is the visualization of the design space, without enumerating all possible design configurations. Furthermore, this design space representation provides a straightforward way for a designer to identify promising partitions and hence guide the subsequent exploration process. Experiments have shown the effectiveness of this approach.
This paper describes a new design flow that significantly
reduces time-to-market for highly
complex multiprocessor-based System-On-Chip
designs. This flow, called Fast Prototyping,
enables concurrent hardware and software
development, early verification and productive
re-use of intellectual property. We describe how, using this innovative system design flow, which combines technologies such as C modeling, emulation, hard Virtual Component re-use, and CoWare N2C(TM), we achieved better productivity on a multi-processor SOC design.
1.1 Keywords:
System design, Hardware/Software (HW/SW) co-design,
Virtual Component (VC) re-use, Fast Prototyping, system
verification, system modeling.
Verification is one of the most critical and time-consuming tasks in today's design processes. This paper demonstrates the verification process of an 8.8 million gate design using HW simulation and cycle-simulation-based HW/SW co-verification. The main focuses are the overall methodology, testbench management, the verification task itself, and defect management. The chosen verification process was a real success: the quality of the designed hardware and software was increased, and the time needed for integration and test of the design in the context of the overall system was greatly reduced.
The dual-threshold technique has been proposed to reduce leakage power in low-voltage, low-power circuits by applying a high threshold voltage to some transistors in non-critical paths, while a low threshold is used in critical path(s) to maintain performance. The Mixed-Vth (MVT) static CMOS design technique allows different thresholds within a logic gate, thereby increasing the number of high-threshold transistors compared to the gate-level dual-threshold technique. In this paper, a methodology for MVT CMOS circuit design is presented. Different MVT CMOS circuit schemes are considered, and three algorithms are proposed for transistor-level threshold assignment under performance constraints. Results indicate that the MVT CMOS design technique can provide about 20% more leakage reduction than the corresponding gate-level dual-threshold technique.
We present a new approach for estimation and optimization
of the average stand-by power dissipation in
large MOS digital circuits. To overcome the complexity
of state dependence in average leakage estimation,
we introduce the concept of "dominant leakage states"
and use state probabilities. Our method achieves
speed-ups of 3 to 4 orders of magnitude over exhaustive
SPICE simulations while maintaining accuracies
within 9% of SPICE. This accurate estimation is used
in a new sensitivity-based leakage and performance
optimization approach for circuits using dual-Vt processes.
In tests on a variety of industrial circuits, this
approach was able to obtain 81-100% of the performance
achievable with all low-Vt transistors, but with
1/3 to 1/6 the stand-by current.
Keywords:
Low-power design, Dual-Vt, Leakage
The state dependence of leakage can be exploited to obtain modest leakage savings in CMOS circuits. However, one can modify circuits considering state dependence and achieve larger savings. We identify a low leakage state and insert leakage control transistors only where needed. Leakage levels are on the order of 35% to 90% lower than those obtained by state dependence alone.
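To make the preceding idea concrete, here is a minimal Python sketch of selecting a low-leakage input state by searching over primary-input vectors and summing a per-gate, state-dependent leakage table. The toy circuit, the leakage numbers, and the random search are illustrative assumptions, not the authors' procedure (which additionally inserts leakage control transistors where needed).

import random

# Hypothetical per-gate leakage (arbitrary units) indexed by the gate's
# local input state; real numbers would come from device characterization.
NAND2_LEAKAGE = {(0, 0): 0.5, (0, 1): 1.2, (1, 0): 1.1, (1, 1): 4.0}

def nand2(a, b):
    return int(not (a and b))

def circuit_leakage(pi):
    # Toy 3-gate circuit: two NAND2s feeding a third NAND2.
    a, b, c, d = pi
    g1, g2 = nand2(a, b), nand2(c, d)
    return (NAND2_LEAKAGE[(a, b)] + NAND2_LEAKAGE[(c, d)] +
            NAND2_LEAKAGE[(g1, g2)])

def find_low_leakage_state(trials=1000):
    best_vec, best_leak = None, float("inf")
    for _ in range(trials):
        vec = tuple(random.randint(0, 1) for _ in range(4))
        leak = circuit_leakage(vec)
        if leak < best_leak:
            best_vec, best_leak = vec, leak
    return best_vec, best_leak

print(find_low_leakage_state())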
We propose a method for power optimization that considers glitch reduction by gate sizing based on statistical estimation of glitch transitions. Our method reduces not only capacitive and short-circuit power consumption but also the power dissipated by glitches, which has not been addressed previously. The effect of our method is verified experimentally using 8 benchmark circuits with a 0.6 um standard cell library. Our method reduces power dissipation beyond that of the minimum-sized circuits by 9.8% on average and 23.0% at maximum. We also verify that our method is effective under manufacturing variation.
This paper describes a method of optimally sizing digital circuits on a static-timing basis. All paths through the logic are considered simultaneously and no input patterns need be specified by the user. The method is unique in that it is based on gradient-based, nonlinear optimization and can accommodate transistor-level schematics without the need for pre-characterization. It employs efficient time-domain simulation and gradient computation for each channel-connected component. A large-scale, general-purpose, nonlinear optimization package is used to solve the tuning problem. A prototype tuner has been developed that accommodates combinational circuits consisting of parameterized library cells. Numerical results are presented.
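As a hedged illustration of gradient-based sizing, the Python sketch below minimizes a toy RC-style gate-chain delay plus an area penalty using finite-difference gradients; the delay model and all parameters are illustrative assumptions, not the paper's transistor-level simulation and gradient computation.

def path_delay(w, r0=1.0, c_in=1.0, c_load=10.0):
    # Toy model: stage i has driver resistance r0/w[i] and drives the next
    # stage's input capacitance c_in*w[i+1]; the last stage drives c_load.
    delay = 0.0
    for i in range(len(w)):
        load = c_in * w[i + 1] if i + 1 < len(w) else c_load
        delay += (r0 / w[i]) * load
    return delay

def size_chain(n=4, area_weight=0.05, lr=0.05, iters=2000, eps=1e-4):
    # Minimize delay + area_weight * total_width by plain gradient descent
    # with numerical (finite-difference) gradients.
    w = [1.0] * n
    for _ in range(iters):
        base = path_delay(w) + area_weight * sum(w)
        grad = []
        for i in range(n):
            w[i] += eps
            grad.append((path_delay(w) + area_weight * sum(w) - base) / eps)
            w[i] -= eps
        w = [max(0.1, wi - lr * gi) for wi, gi in zip(w, grad)]
    return w, path_delay(w)

print(size_chain())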
Partitioning and clustering are crucial steps in circuit layout for handling the large-scale designs enabled by deep submicron technologies. Retiming is an important sequential logic optimization technique for reducing the clock period by optimally repositioning flip-flops [7]. In our exploration of a logical and physical co-design flow, we developed a highly efficient algorithm for combining retiming with circuit partitioning or clustering for clock period minimization. Compared with the recent result by Pan et al. [10] on quasi-optimal clustering with retiming, our algorithm is able to reduce both runtime and memory requirements by one order of magnitude without losing quality. Our results show that our algorithm can be over 1000X faster for large designs.
In this paper we present a new synthesis and layout approach that avoids the normal iterations between synthesis, technology mapping and layout, and increases routing by abutment. It produces shorter and more predictable delays, and sometimes even layouts with reduced areas. This scheme equalizes delays along different paths, which makes low granularity pipelining a reality, and hence we can clock these circuits at much higher frequencies, compared to what is possible in a conventionally designed circuit. Since any circuit can be clocked at a fixed rate, this method does not require timing-driven synthesis. We propose the logic and layout synthesis schemes and algorithms, discuss the physical layout part of the process, and support our methodology with simulation results.
This paper presents a solution to the problem of performance-driven buffered routing tree generation in electronic circuits. Using a novel bottom-up construction algorithm and a local neighborhood search strategy, this method finds the best solution in an exponential-size solution sub-space in polynomial time. The output is a hierarchical buffered rectilinear Steiner routing tree that connects the driver of a net to its sink nodes. The two variants of the problem, i.e. maximizing the driver required time subject to a total buffer area constraint and minimizing the total buffer area subject to a minimum driver required time constraint, are handled by propagating three-dimensional solution curves during the construction phase. Experimental results demonstrate the effectiveness of this technique compared to other solutions for this problem.
Buffer insertion has become a critical step in deep submicron design, and several buffer insertion/sizing algorithms have been proposed in the literature. However, most of these methods use simplified interconnect and gate delay models. These models may lead to inferior solutions since the optimized objective is only an approximation for the actual delay. We propose to integrate accurate wire and gate delay models into Van Ginneken's buffer insertion algorithm [18] via the propagation of moments and driving point admittances up the routing tree. We have verified the effectiveness of our approach on an industry design.
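For readers unfamiliar with the base algorithm, the following Python sketch shows the classic Van Ginneken bottom-up dynamic program under the Elmore model, keeping non-dominated (downstream capacitance, required time) candidates; the paper's contribution is to replace the Elmore updates with propagated moments and driving-point admittances, which this sketch does not attempt. The wire, buffer, and driver parameters are illustrative assumptions.

def prune(cands):
    # Keep non-dominated candidates: smaller cap and larger required time win.
    cands.sort(key=lambda cq: (cq[0], -cq[1]))
    kept, best_q = [], float("-inf")
    for c, q in cands:
        if q > best_q:
            kept.append((c, q))
            best_q = q
    return kept

def add_wire(cands, r_w, c_w):
    # Propagate candidates through a wire segment using the Elmore delay.
    return prune([(c + c_w, q - r_w * (c_w / 2.0 + c)) for c, q in cands])

def add_buffer_option(cands, r_b=100.0, c_b=5.0, t_b=20.0):
    # Optionally insert a buffer: upstream then sees only c_b, and the
    # required time drops by the buffer delay t_b + r_b * downstream cap.
    best_buffered = max(q - t_b - r_b * c for c, q in cands)
    return prune(list(cands) + [(c_b, best_buffered)])

def merge(cands_a, cands_b):
    # Combine candidate lists of two subtrees at a branch point.
    return prune([(ca + cb, min(qa, qb))
                  for ca, qa in cands_a for cb, qb in cands_b])

# Example: a sink (10 fF load, 500 ps required time) reached through two
# wire segments with one legal buffer site in between.
cands = [(10.0, 500.0)]
cands = add_wire(cands, r_w=50.0, c_w=8.0)
cands = add_buffer_option(cands)
cands = add_wire(cands, r_w=50.0, c_w=8.0)
print(max(q - 30.0 * c for c, q in cands))   # slack seen by a 30 ohm driver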
As CMOS technology enters the deep submicron design era, the lateral inter-wire coupling capacitance becomes the dominant part of the load capacitance and makes RC delay on bus structures very data-dependent. Reducing the cross-coupling capacitance is crucial for achieving high-speed as well as lower-power operation. In this paper, we propose two interconnect layout design methodologies for minimizing the "coupling effect" in the design of full-custom datapaths. First, we describe a control signal ordering scheme which was shown to reduce switching power consumption by 10% and wire delay by 15% for a given set of benchmark examples. Second, a track assignment algorithm based on evolutionary programming was used to minimize the cross-coupling capacitance. Experimental results have shown that chip performance improvements of as much as 40% can be obtained using the proposed interconnect schemes in various stages of datapath layout optimization.
We propose a new VLSI layout methodology which addresses the main problems faced in Deep Sub-Micron (DSM) integrated circuit design. Our layout "fabric" scheme eliminates the conventional notion of power and ground routing on the integrated circuit die. Instead, power and ground are essentially "pre-routed" all over the die. By a clever arrangement of power/ground and signal pins, we almost completely eliminate the capacitive effects between signal wires. Additionally, we get a power and ground distribution network with a very low resistance at any point on the die. Another advantage of our scheme is that the arrangement of conductors ensures that on-chip inductances are uniformly negligible. Finally, characterization of the circuit delays, capacitances and resistances becomes extremely simple in our scheme, and needs to be done only once for a design. We show how the uniform parasitics of our fabric give rise to a reliable and predictable design. We have implemented our scheme using public domain layout software. Preliminary results show that it holds much promise as the layout methodology of choice in DSM integrated circuit design.
In this paper, we introduce a simple procedure to predict wiring delay in bi-directional buses and a way of properly sizing the driver for each of its ports. In addition, we propose a simple calibration procedure to improve its delay prediction over the Elmore delay of the RC tree. The technique is fast, accurate, and well suited for implementation in a floorplanner during behavioral synthesis.
Keywords:
RC wiring delay, High-Level Synthesis, Floorplanning, Buffer
Optimization, Interconnect optimization.
Recently, several algorithms for interconnect optimization via repeater insertion and wire sizing have appeared based on the Elmore delay model. Using the Devgan noise metric [6], a noise-aware repeater insertion technique has also been proposed recently. Recognizing the conservatism of these delay and noise models, we propose a moment-matching based technique for interconnect optimization that allows much higher accuracy while preserving the hierarchical nature of Elmore-delay-based techniques. We also present a novel approach to noise computation that accurately captures the effect of several attackers in linear time with respect to the number of attackers and wire segments. Our practical experiments with industrial nets indicate that the reduction in error afforded by these more accurate models justifies the increase in runtime for aggressive designs, which are our targeted domain. Our algorithm yields delay and noise estimates within 5% of circuit simulation results.
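Since this abstract and several others take the Elmore delay of an RC tree as the baseline model, a short, self-contained Python sketch of how it is computed may be useful; the node values below are illustrative only.

class Node:
    # An RC tree node: grounded capacitance, resistance of the wire segment
    # from its parent, and child nodes.
    def __init__(self, name, cap, res, children=None):
        self.name, self.cap, self.res = name, cap, res
        self.children = children or []

def downstream_cap(node):
    node.total_cap = node.cap + sum(downstream_cap(ch) for ch in node.children)
    return node.total_cap

def elmore(node, delay_from_source=0.0, out=None):
    # Elmore delay to a node = sum over its source-to-node path of
    # (segment resistance) * (total capacitance downstream of that segment).
    if out is None:
        out = {}
    d = delay_from_source + node.res * node.total_cap
    out[node.name] = d
    for ch in node.children:
        elmore(ch, d, out)
    return out

# Usage: a 10 ohm driver resistance feeding two leaves.
leaf1, leaf2 = Node("n1", 2e-15, 20.0), Node("n2", 3e-15, 30.0)
root = Node("root", 1e-15, 10.0, children=[leaf1, leaf2])
downstream_cap(root)
print(elmore(root))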
This paper reports two sets of important results in our exploration of an interconnect-centric design flow for deep submicron (DSM) designs: (i) We obtain efficient yet accurate wiring area estimation models for optimal wire sizing (OWS). We also propose a simple metric to guide area-efficient performance optimization; (ii) Guided by our interconnect estimation models, we study the interconnect architecture planning problem for wire-width designs. We achieve a rather surprising result which suggests that two pre-determined wire widths per metal layer are sufficient to achieve near-optimal performance. This result will greatly simplify the routing architecture and tools for DSM designs. We believe that our interconnect estimation and planning results will have a significant impact on DSM designs.
We propose a new specification environment for system-level design called ECL. It combines the Esterel and C languages to provide a more versatile means for specifying heterogeneous designs. It can be viewed as the addition to C of explicit constructs from Esterel for waiting, concurrency and pre-emption, and thus makes these operations easier to specify and more apparent. An ECL specification is compiled into a reactive part (an extended finite state machine representing most of the ECL program), and a pure data looping part, thus nicely supporting a mix of control and data. The reactive part can be robustly estimated and synthesized to hardware or software, while the data looping part is implemented in software as specified.
Many embedded systems are implemented with a set of alternative function variants to adapt the system to different applications or environments. This paper proposes a novel approach for the coherent representation and selection of function variants in the different phases of the design process. In this context, the modeling of re-configuration of system parts is supported in a natural way. Using a real example from the video processing domain, the approach is explained and validated.
The increasing size and complexity of designs is making the use of hardware description languages (HDLs), such as Verilog and VHDL, more prevalent. They are able to describe both the initial design and intermediate representations of the design as it is readied for fabrication. For large designs, there inevitably are problems with the tool flow that require custom tools to be created. These tools must be able to access and modify the HDL for the design, requirements that often dwarf the tools’ actual functionality, making them difficult to create without a large effort or cutting corners. During the FLASH project at Stanford we created Vex -- a toolbox of components for dealing with Verilog, tied together with an interactive scripting language -- that simplifies the creation of these tools. It was used to create a number of tools that were critical to our design's tape-out and has also been useful in creating design exploration and research tools.
Today's complex design processes feature large numbers of varied, interdependent constraints, which often cross interdisciplinary boundaries. Therefore, a computer-supported constraint management methodology that automatically detects violations early in the design process, provides useful violation notification to guide redesign efforts, and can be integrated with conventional CAD software can be a great aid to the designer. We present such a methodology and describe its implementation in the Minerva II design process manager, along with an example design session.
The many levels of metal used in aggressive deep submicron process technologies have made fast and accurate capacitance extraction of complicated 3-D geometries of conductors essential, and many novel approaches have been recently developed. In this paper we present an accelerated boundary-element method, like the well-known FASTCAP program, but instead of using an adaptive fast multipole algorithm we use a numerically generated multiscale basis for constructing a sparse representation of the dense boundary-element matrix. Results are presented to demonstrate that the multiscale method can be applied to complicated geometries, generates a sparser boundary-element matrix than the adaptive fast multipole method, and provides an inexpensive but effective preconditioner. Examples are used to show that the better sparsification and the effective preconditioner yield a method that can be 25 times faster than FASTCAP while still maintaining accuracy in the smallest coupling capacitances.
Circuit parasitic extraction problems are typically formulated using discretized integral equations that use basis functions defined over tesselated surface meshes. The Fast Multipole Method (FMM) accelerates the solution process by rapidly evaluating potentials and fields due to these basis functions. Unfortunately, the FMM suffers from the drawback that its efficiency degrades if the surface mesh has disparately-sized elements in close proximity to each other. Closely-spaced non-uniformly sized elements can appear in realistic situations for a variety of reasons: owing to mesh refinement, due to accurate modeling requirements for fine structural features, and because of the presence of thin doubly-walled structures. In this paper, modifications to the standard multilevel FMM are presented that permit efficient potential and field evaluation over specific non-uniform meshes. The efficiency of the new technique is demonstrated through examples involving large surface meshes with non-uniformly sized elements in close proximity.
Due to interactions through the common silicon substrate, the layout and placement of devices and substrate contacts can have significant impacts on a circuit's ESD (Electrostatic Discharge) and latchup behavior in CMOS technologies. Proper substrate modeling is thus required for circuit-level simulation to predict the circuit's ESD performance and latchup immunity. In this work we propose a new substrate resistance network model, and develop a novel substrate resistance extraction method that accurately calculates the distribution of injection current into the substrate during ESD or latchup events. With the proposed substrate model and resistance extraction, we can capture the three-dimensional layout parasitics in the circuit as well as the vertical substrate doping profile, and simulate these effects on circuit behavior at the circuit-level accurately. The usefulness of this work for layout optimization is demonstrated with an industrial circuit example.
This paper introduces a continuous-time, controllable Markov process model of a power-managed system. The system model is composed of the corresponding stochastic models of the service queue and the service provider. The system environment is modeled by a stochastic service request process. The problem of dynamic power management in such a system is formulated as a policy optimization problem and solved using an efficient "policy iteration" algorithm. Compared to previous work on dynamic power management, our formulation allows better modeling of the various system components, the power-managed system as a whole, and its environment. In addition, it captures dependencies between the service queue and service provider status. Finally, the resulting power management policy is asynchronous and hence more power-efficient and more useful in practice. Experimental results demonstrate the effectiveness of our policy optimization algorithm compared to a number of heuristic (time-out and N-policy) algorithms.
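Policy iteration itself is standard; the sketch below shows it on a tiny, discrete-time stand-in for the paper's continuous-time power-management model. The states, transition probabilities, and cost numbers are made-up placeholders, not data from the paper.

GAMMA = 0.95
STATES = ["sleep", "idle", "busy"]
ACTIONS = ["keep", "switch"]        # "switch" toggles the power state

P = {   # P[s][a][s']: transition probabilities (illustrative)
    "sleep": {"keep":   {"sleep": 0.9, "idle": 0.1, "busy": 0.0},
              "switch": {"sleep": 0.1, "idle": 0.8, "busy": 0.1}},
    "idle":  {"keep":   {"sleep": 0.0, "idle": 0.7, "busy": 0.3},
              "switch": {"sleep": 0.8, "idle": 0.1, "busy": 0.1}},
    "busy":  {"keep":   {"sleep": 0.0, "idle": 0.4, "busy": 0.6},
              "switch": {"sleep": 0.0, "idle": 0.4, "busy": 0.6}},
}
COST = {  # per-step power plus latency penalty (illustrative)
    "sleep": {"keep": 0.1, "switch": 1.0},
    "idle":  {"keep": 1.0, "switch": 0.5},
    "busy":  {"keep": 2.0, "switch": 2.5},
}

def policy_iteration():
    policy = {s: "keep" for s in STATES}
    V = {s: 0.0 for s in STATES}
    while True:
        for _ in range(500):      # approximate policy evaluation
            V = {s: COST[s][policy[s]] + GAMMA *
                    sum(P[s][policy[s]][t] * V[t] for t in STATES)
                 for s in STATES}
        improved = {s: min(ACTIONS, key=lambda a: COST[s][a] + GAMMA *
                           sum(P[s][a][t] * V[t] for t in STATES))
                    for s in STATES}
        if improved == policy:     # greedy policy is stable: optimal
            return policy, V
        policy = improved

print(policy_iteration())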
In this work we propose a technique for spatial and temporal partitioning of a logic circuit based on node activity computed using simulation at a higher level of abstraction. Only those components that are activated by a given input vector are added to the detailed simulation netlist. The methodology is suitable for parallel implementation in a multi-processor environment and allows arbitrary switching between fast and detailed levels of abstraction during the simulation run. The experimental results obtained on a significant set of benchmarks show that it is possible to obtain a considerable reduction in both CPU time and memory occupation while maintaining a considerable degree of accuracy. Furthermore, the proposed technique fits easily into existing industrial design flows.
Many modern multimedia applications such as image and video processing
are characterized by a unique combination of arithmetic and
computational features: fixed-point arithmetic, a variety of short data
types, high degree of instruction-level parallelism, strict
timing constraints, and high computational requirements. Computationally
intensive algorithms usually increase a device's power dissipation, which is often critical to the efficiency of many communications and multimedia
applications. Although recently virtually all general-purpose processors
have been equipped with multiprecision operations, the current generation
of behavioral synthesis tools for application-specific systems does not
utilize this power/performance optimization paradigm.
In this paper, we explore the potential of using multiple precision
arithmetic units to effectively support synthesis of low-power
application-specific integrated circuits. We propose a new architectural
scheme for collaborative addition of sets of variable-precision data.
We have developed a novel resource allocation and computation assignment
methodology for a set of multiple precision arithmetic units. The
optimization algorithms explore the trade-off between allocating low-width bus structures and executing multiple-cycle operations. Experimental
results indicate strong advantages of the proposed approach.
This paper summarizes the verification effort of a
complex ASIC designated to be an "all in one"
ISDN network router. This ASIC is unique because
it actually consists of many independent components,
called "cores" (including the processor). The
integration of these components onto one chip
results in an ISOC (Integrated System On a Chip).
Verifying an ISOC is virtually impossible without a proper methodology. This
paper presents the methodology developed for verifying
the router. In particular, the verification
method as well as the tools that were built to execute
this method are presented. Finally, a summary of
the verification results is given.
1.1 Keywords:
Systems on chip, verification, test and debugging.
This paper presents ICEBERG, a synthesis tool for embedded in-circuit emulators (ICEs), which are part of the development environment for microcontroller (or microprocessor)-based systems (PIPER-II). The tool inserts and integrates the necessary in-circuit emulation circuitry into a given RTL core of a microcontroller, thus turning the core into an embedded ICE. The ICE, based on the IEEE 1149.1 JTAG architecture, provides standard debugging mechanisms, including boundary scan paths, partial scan paths, single stepping, internal resource monitoring and modification, breakpoint detection, and mode switching between debugging and free-running modes. ICEBERG has been successfully applied to synthesize the embedded ICE for an industrial microcontroller, the HT48100, from its RTL core.
The purpose of this paper is to develop a flexible design-for-test methodology for testing a core-based system on chip (SOC). The novel feature of the approach is the use of an embedded microprocessor/memory pair to test the remaining components of the SOC. Test data is downloaded using DMA techniques directly into memory, while the microprocessor uses the test data to test the core. The test results are transferred to a MISR for evaluation. The approach has several important advantages over conventional ATPG, such as achieving at-speed testing, not limiting the chip speed to the tester speed during test, and achieving great flexibility, since most of the testing process is based on software. Experimental results on an example system are discussed.
This paper explores a standard-cell design methodology based on netlist partitioning as a solution to the lack of convergence in the conventional methodology in deep submicron technologies. A synthesized design block is partitioned along unpredictable nets that are identified from the netlist structure. The size of each partition is restricted so that the longest possible local net in a partition can be sufficiently driven by an average library gate, hence allowing statistical wire-load modeling for the local nets. The block is resynthesized using a hybrid wire-load model that takes into account accurate wire-load information on the unpredictable nets derived after floorplanning the partitions, and uses custom statistical wire-load models within each partition. Final placement is restricted to respect the initial floorplan. The methodology was implemented using existing commercial tools for synthesis and layout. Experimental results show high correlation between synthesis estimates and post-placement measurements of wire-loads and gate delays with the new methodology. The trade-offs of partitioning, current limitations of the methodology, and future work to overcome these limitations are also discussed.
This paper describes the experience and the lessons learned during the design of an ATM traffic shaper circuit using behavioral synthesis. The experiment is based on a comparison of the results of two parallel design flows starting from the same specification. The first used a classical design method based on RTL synthesis. The second design flow is based on behavioral synthesis. The experiment has shown that behavioral synthesis is able to produce an efficient design in terms of gate count and timing while bringing a threefold reduction in design effort compared to the RTL design methodology.
Due to the unavoidable need for system debugging, performance tuning, and adaptation to new standards, the engineering change (EC) methodology has emerged as one of the crucial components in synthesis of systems-on-chip. We introduce a novel design methodology which facilitates design-for-EC and post-processing to enable EC with minimal perturbation. Initially, as a synthesis pre-processing step, the original design specification is augmented with additional design constraints which ensure flexibility for future correction. Upon alteration of the initial design, a novel post-processing technique achieves the desired functionality with a near-minimal perturbation of the initially optimized design. The key contribution we introduce is a constraint manipulation technique which enables reduction of an arbitrary EC problem into its corresponding classical synthesis problem. As a result, in both pre- and post-processing for EC, classical synthesis algorithms can be used to enable flexibility and perform the correction process. We demonstrate the developed EC methodology on a set of behavioral and system synthesis tasks.
Reconfigurable Computing is emerging as an important new organizational structure for implementing computations. It combines the post-fabrication programmability of processors with the spatial computational style most commonly employed in hardware designs. The result changes traditional "hardware" and "software" boundaries, providing an opportunity for greater computational capacity and density within a programmable media. Reconfigurable Computing must leverage traditional CAD technology for building spatial designs. Beyond that, however, reprogrammablility introduces new challenges and opportunities for automation, including binding-time and specialization optimizations, regularity extraction and exploitation, and temporal partitioning and scheduling.
We present an automated temporal partitioning and loop transformation approach for developing dynamically reconfigurable designs starting from behavior-level specifications. An Integer Linear Programming (ILP) model is formulated to achieve near-optimal latency designs. We also present a loop restructuring method to achieve maximum throughput for a class of DSP applications. This restructuring transformation is performed on the temporally partitioned behavior and results in near-optimal throughput. We discuss efficient memory mapping and address generation techniques for the synthesis of reconfigurable designs. A case study on the Joint Photographic Experts Group (JPEG) image compression algorithm demonstrates the effectiveness of our approach.
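One plausible shape for such a temporal-partitioning ILP, written here only as an illustration (the authors' exact model is not reproduced in the abstract): binary variables x_{v,t} place operation v into temporal segment t of the reconfigurable device,

\min\; L \quad \text{s.t.} \quad \sum_{t} x_{v,t} = 1 \;\;\forall v, \qquad \sum_{t} t\,x_{u,t} \le \sum_{t} t\,x_{v,t} \;\;\forall (u \to v) \in E, \qquad \sum_{v} a_v\, x_{v,t} \le A \;\;\forall t, \qquad L \ge \sum_{t} t\, x_{v,t} \;\;\forall v,

where E is the set of data dependences, a_v the area of operation v, A the device capacity available per configuration, and L the index of the last segment used, a proxy for latency when each segment (including its reconfiguration) takes bounded time.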
This work presents an overview of the principles that underlie the
speed-up achievable by dynamic hardware reconfiguration,
proposes a more precise taxonomy for the execution models for
reconfigurable platforms, and demonstrates the advantage of
dynamic reconfiguration in the new implementation of a
neighborhood image processor, called DRIP. It achieves real-time performance and is 3 times faster than its pipelined, non-reconfigurable version.
Keywords:
Reconfigurable architecture, image processing, FPGA
We present a novel formulation, called the WaMPDE, for solving systems with forced autonomous components. An important feature of the WaMPDE is its ability to capture frequency modulation (FM) in a natural and compact manner. This is made possible by a key new concept: that of warped time, related to normal time through separate time scales. Using warped time, we obtain a completely general formulation that captures complex dynamics in autonomous nonlinear systems of arbitrary size or complexity. We present computationally efficient numerical methods for solving large practical problems using the WaMPDE. Our approach explicitly calculates a time-varying local frequency that matches intuitive expectations. Applied to VCOs, WaMPDE-based simulation results in speedups of two orders of magnitude over transient simulation.
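The warped-time idea can be stated compactly (a standard way of writing warped multi-time representations; the authors' WaMPDE details may differ): the solution is represented by a multivariate waveform \hat{x}, periodic in its warped argument, evaluated along a trajectory set by the local frequency \omega(t),

x(t) \;\approx\; \hat{x}\bigl(t, \phi(t)\bigr), \qquad \phi(t) = \int_0^{t} \omega(s)\,ds .

Because \phi stretches or compresses the fast time scale, slow changes in \omega(t) capture frequency modulation without resolving every fast cycle.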
Design of communications circuits often requires computing steady-state responses to multiple periodic inputs of differing frequencies. Mixed frequency-time (MFT) approaches are orders of magnitude more efficient than transient circuit simulation, and perform better on highly nonlinear problems than traditional algorithms such as harmonic balance. We present algorithms for solving the huge nonlinear equation systems the MFT approach generates from practical circuits.
Matrix-implicit Krylov-subspace methods have made it possible to efficiently compute the periodic steady-state of large circuits using either the time-domain shooting-Newton method or the frequency-domain harmonic balance method. However, the harmonic balance methods are not so efficient at computing steady-state solutions with rapid transitions, and the low-order integration methods typically used with shooting-Newton methods are not so efficient when high accuracy is required. In this paper we describe a Time-Mapped Harmonic Balance method (TMHB), a fast Krylov-subspace spectral method that overcomes the inefficiency of standard harmonic balance in the case of rapid transitions. TMHB features a non-uniform grid to resolve the sharp features in the signals. Results on several examples demonstrate that the TMHB method achieves several orders of magnitude improvement in accuracy compared to the standard harmonic balance method. The TMHB method is also several times faster than the standard harmonic balance method in reaching identical solution accuracy.
As the sizes of general and special purpose processors increase rapidly, generating high-quality manufacturing tests which can be run at native speeds is becoming a serious problem. One solution is a novel method for functional test generation in which a transformed module, embodying functional constraints described using virtual logic, is built manually. Test generation is then performed on the transformed module using commercial tools, and the transformed module patterns are translated back to the processor level. However, the technique is useful only if the virtual logic can be generated automatically. This paper describes an automatic functional constraint extraction algorithm and a procedure to build the transformed module. We describe the tool, FALCON, used to extract the functional constraints of a given embedded module from a Verilog RTL model. The constraint extraction for embedded modules of benchmark processors using FALCON takes only a few seconds. We show that this method can generate functional patterns in a time several orders of magnitude less than one using a conventional, flat view of the circuit.
We describe a property-based test generation procedure that uses static compaction to generate test sequences that achieve high fault coverage at a low computational complexity. A class of test compaction procedures is proposed and used in the property-based test generator. Experimental results indicate that these compaction procedures can be used to implement the proposed test generator to achieve high fault coverage with relatively small run times.
In this paper, we present multiple error diagnosis algorithms to overcome two significant problems associated with current error diagnosis techniques targeting large circuits: their use of limited error models and a lack of solutions that scale well for multiple errors. Our solution is a non-enumerative analysis technique, based on logic simulation (3-valued and symbolic), for simultaneously analyzing all possible errors at sets of nodes in the circuit. Error models are introduced in order to address the "locality" aspect of error location and to identify sets of nodes that are "local" with respect to each other. Theoretical results are provided to guarantee the diagnosis of modeled errors, and robust diagnosis approaches are shown to address the cases when errors do not correspond to the modeled types. Experimental results on benchmark circuits demonstrate accurate and extremely rapid location of errors of large multiplicity.
Validation of RTL circuits remains the primary bottleneck in improving design turnaround time, and simulation remains the primary methodology for validation. Simulation-based validation has suffered from a disconnect between the metrics used to measure the error coverage of a set of simulation vectors, and the vector generation process. This disconnect has resulted in the simulation of virtually endless streams of vectors which achieve enhanced error coverage only infrequently. Another drawback has been that most error coverage metrics proposed have either been too simplistic or too inefficient to compute. Recently, an effective observability-based statement coverage metric was proposed along with a fast companion procedure for evaluating it. The contribution of our work is the development of a vector generation procedure targeting the observability-based statement coverage metric. Our method uses repeated coverage computation to minimize the number of vectors generated. For vector generation, we propose a novel technique to set up constraints based on the chosen coverage metric. Once the system of interacting arithmetic and Boolean constraints has been set up, it can be solved using hybrid linear programming and Boolean satisfiability methods. We present heuristics to control the size of the constraint system that needs to be solved. We present experimental results which show the viability of automatically generating vectors using our approach for industrial RTL circuits. We envision our system being used during the design process, as well as during post-design debugging.
This paper describes a two-state methodology for register
transfer level (RTL) logic simulation in which the use of the X-state
is completely eliminated inside ASIC designs. Examples
are presented to show the gross pessimism and optimism that
occurs with the X in RTL simulation. Random two-state
initialization is offered as a way to detect and diagnose startup
problems in RTL simulation. Random two-state initialization (a)
is more productive than the X-state in gate-level simulation, and
(b) provides better coverage of startup problems than X-state in
RTL simulation. Consistent random initialization is applied (a)
as a way to duplicate a startup state using a slower diagnosis-oriented
simulator after a faster detection-oriented simulator
reports the problem, and (b) to verify that the problem is
corrected for that startup state after the design change intended to
fix the problem. In addition to combining the earlier ideas of
two-state simulation, and random initialization with consistent
values across simulations, an original technique for treatment of
tri-state Z's arriving into a two-state model is introduced.
Keywords:
RTL, simulation, 2-state, X-state, pessimism, optimism, random,
initialization.
This paper presents a new approach for extracting timing information defined in a simulation vector set at the register transfer level (RTL) and reusing it in the behavioral specification. Using a VHDL RTL simulation vector set and a VHDL behavioral specification as input, the timing information is extracted and, together with the specification, transformed into a Partial Order based Model (POM). The POM expressing the timing information is then mapped onto the specification POM. The result contains the behavioral specification and the RTL timing and is retransformed into a corresponding VHDL specification. Additionally, timing information contained in the specification can be checked using the RTL simulation vectors.
Satisfiability (SAT) is a computationally expensive algorithm central to many CAD and test applications. In this paper, we present the architecture of a new SAT solver using reconfigurable logic. Our main contributions include new forms of massive fine-grain parallelism and structured design techniques based on iterative logic arrays that reduce compilation times from hours to a few minutes. Our architecture is easily scalable. Our results show several orders of magnitude speed-up compared with a state-of-the-art software implementation, and with a prior SAT solver using reconfigurable hardware.
In this paper, we introduce a new approach for locating and diagnosing faults in combinational circuits. The approach is based on automatically designing a circuit which implements a closest-match fault location algorithm specialized for the combinational circuit under diagnosis (CUD). This approach eliminates the need for large storage required by a software based fault diagnosis. In this paper, we show the approach's feasibility in terms of hardware resources, speed, and how it compares with software based techniques.
Configurable computing machines are an emerging class of hybrid architectures where a field programmable gate array (FPGA) component is tightly coupled to a general-purpose microprocessor core. In these architectures, the FPGA component complements the general-purpose microprocessor by enabling a developer to construct application-specific gate-level structures on-demand while retaining the flexibility and rapid reconfigurability of a fully programmable solution. High computational performance can be achieved on the FPGA component by creating custom data paths, operators, and interconnection pathways that are dedicated to a given problem, thus enabling similar structural optimization benefits as ASICs. In this paper, we present a new programming environment for the development of applications on this new class of configurable computing machines. This environment enables developers to develop hybrid hardware/software applications in a common integrated development framework. In particular, the focus of this paper is on the hardware compilation part of the problem starting from a software-like algorithmic process-based specification.
As we move to the 0.18 µm node and beyond, the dominant trend in device and process technology is a simple continuation of several decades of scaling. However, some serious challenges to straightforward scaling are on the horizon. This paper will review the present status of process technology and examine the likely departures from scaling in the various areas. The 0.18 µm node is seeing the first major new materials introduced into the Si process for many years in the interconnect, and major departures from the traditional process are being actively considered for the transistor. However, it is probable that continued scaling will continue to dominate advanced processes for several generations to come.
This paper reviews the recent advances of SOI for digital CMOS VLSI applications with particular emphasis on the design issues and advantages resulting from the unique SOI device structure. The technology/device requirements and design issues/challenges for high-performance, general-purpose microprocessor applications are differentiated with respect to low-power portable applications. Particular emphasis is placed on the impact of the floating body in partially-depleted devices on circuit operation, stability, and functionality. Unique SOI design aspects such as the parasitic bipolar effect and hysteretic VT variation are addressed. Circuit techniques to improve noise immunity and global design issues are discussed.
Closed-form solutions for the 50% delay, rise time, overshoots, and settling time of signals in an RLC tree are presented. These solutions have the same accuracy characteristics as the Elmore delay model for RC trees and preserve the simplicity and recursive characteristics of the Elmore delay. The solutions introduced here consider all damping conditions of an RLC circuit, including the underdamped response, which is not considered by the classical Elmore delay model due to the non-monotone nature of the response. Also, the solutions have significantly improved accuracy as compared to the Elmore delay for an overdamped response. The solutions introduced here for RLC trees can be practically used for the same purposes that the Elmore delay is used for RC trees.
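For reference, the Elmore (RC) delay that these RLC closed forms generalize is the standard expression

T_{D,i} \;=\; \sum_{k} R_{ik}\, C_k ,

where C_k is the capacitance at node k and R_{ik} is the resistance shared between the source-to-i and source-to-k paths; the RLC solutions above keep this path-based, recursive structure while accounting for inductance and damping.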
A closed form expression for the propagation delay of a CMOS gate driving a distributed RLC line is introduced that is within 5% of dynamic circuit simulations for a wide range of RLC loads. It is shown that the traditional quadratic dependence of the propagation delay on the length of an RC line approaches a linear dependence as inductance effects increase. The closed form delay model is applied to the problem of repeater insertion in RLC interconnect. Closed form solutions are presented for inserting repeaters into RLC lines that are highly accurate with respect to numerical solutions. An RC model as compared to an RLC model creates errors of up to 30% in the total propagation delay of a repeater system. Considering inductance in repeater insertion is also shown to significantly save repeater area and power consumption. The error between the RC and RLC models increases as the gate parasitic impedances decrease which is consistent with technology scaling trends. Thus, the importance of inductance in high performance VLSI design methodologies will increase as technologies scale.
The concept of improving the timing behavior of a circuit by relocating registers is called retiming and was first presented by Leiserson and Saxe. They showed that the problem of determining an equivalent minimum-area (total number of registers) circuit is polynomial-time solvable. In this work we show how this approach can be reapplied in the DSM domain when area-delay trade-offs and delay constraints are considered. The main result is that the concavity of the trade-off function allows this DSM problem to be cast into a classical minimum-area retiming problem, which is solvable in polynomial time.
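For context, the classical Leiserson-Saxe minimum-area retiming problem referred to here can be written as (standard formulation, not specific to this paper)

\min \sum_{e=(u\to v)} \bigl(w(e) + r(v) - r(u)\bigr) \quad \text{s.t.} \quad w(e) + r(v) - r(u) \ge 0 \;\;\forall e=(u\to v), \qquad r(u) - r(v) \le W(u,v) - 1 \;\;\text{whenever } D(u,v) > c ,

where w(e) is the register count on edge e, r(v) the integer retiming lag of vertex v, c the target clock period, and W(u,v), D(u,v) the minimum register count and maximum combinational delay over all u-to-v paths. The objective counts the total registers after retiming; the paper shows that its DSM area-delay problem can be cast back into this form.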
A method that characterizes the timing of Intellectual
Property (IP) blocks while taking into account IP
functionality is presented. IP blocks are assumed to
have multiple modes of operation specified by the
user. For each mode, our method calculates IO path
delays and timing constraints to generate a timing
model. The method thus captures the mode-dependent
variation in IP delays which, according to our
experiments, can be as high as 90%. The special manner
in which delay calculation is performed guarantees
that IP delays are never underestimated. The
resulting timing models are also compacted through a
process whose accuracy is controlled by the user.
1.1 Keywords:
Timing analysis, false path, functional (mode) dependency,
IP characterization.
We present a new algorithm for detecting both combinationally and sequentially false timing paths, one in which the constraints on a timing path are captured by justifying symbolic functions across latch boundaries. We have implemented the algorithm and we present, here, the results of using it to detect false timing paths on a recent PowerPC microprocessor design. We believe these are the first published results showing the extent of the false path problem in industry. Our results suggest that the reporting of false paths may be compromising the effectiveness of static timing analysis.
In this paper, we present a new method for built-in self-testable data path synthesis based on integer linear programming (ILP). Our method performs system register assignment, built-in self-test (BIST) register assignment, and interconnection assignment concurrently to yield optimal designs. Our experimental results show that our method successfully synthesizes BIST circuits for all six circuits in our experiments. All the BIST circuits have lower area overhead than those generated by existing high-level BIST synthesis methods.
Keywords:
high-level BIST synthesis, built-in self-test, BIST, ILP.
In this paper, we propose a general test application scheme for existing scan-based BIST architectures. The objective is to further improve the test quality without inserting additional logic to the Circuit Under Test (CUT). The proposed test scheme divides the entire test process into multiple test sessions. A different number of capture cycles is applied after scanning in a test pattern in each test session to maximize the fault detection for a distinct subset of faults. We present a procedure to find the optimal number of capture cycles following each scan sequence for every fault. Based on this information, the number of test sessions and the number of capture cycles after each scan sequence are determined to maximize the random testability of the CUT. We conduct experiments on ISCAS89 benchmark circuits to demonstrate the effectiveness of our approach.
We describe an on-chip test generation scheme for synchronous sequential circuits that allows at-speed testing of such circuits. The proposed scheme is based on loading of (short) input sequences into an on-chip memory, and expansion of these sequences on-chip into test sequences. Complete coverage of modeled faults is achieved by basing the selection of the loaded sequences on a deterministic test sequence T0 , and ensuring that every fault detected by T0 is detected by the expanded version of at least one loaded sequence. Experimental results presented for benchmark circuits show that the length of the sequence that needs to be stored at any time is on the average 10% of the length of T0 , and that the total length of all the loaded sequences is on the average 46% of the length of T0.
The paper addresses the problem of analyzing the performance degradation caused by noise in power supply lines for deep submicron CMOS devices. We first propose a statistical modeling technique for the power supply noise, including inductive ΔI noise and power net IR voltage drop. The model is then integrated with a statistical timing analysis framework to estimate the performance degradation caused by the power supply noise. Experimental results of our analysis framework, validated by HSPICE, for benchmark circuits implemented in both 0.25 µm, 2.5 V and 0.55 µm, 3.3 V technologies are presented and discussed. The results show that on average, with the consideration of this noise effect, the circuit critical path delays increase by 33% and 18%, respectively, for circuits implemented in these two technologies.
In deep submicron technology, IR-drop and clock skew issues become more crucial to the functionality of the chip. This paper presents a floorplan-based power and clock distribution methodology for ASIC design. From the floorplan and the estimated power consumption, the power network size is determined at an early design stage. Next, without a detailed gate-level netlist, clock interconnect sizing and the number and strength of clock buffers are planned for balanced clock distribution. This early planning methodology at the full-chip level enables us to fix the global interconnect issues before the detailed layout composition is started.
Many design for test techniques for analog circuits are ineffective at detecting multiple parametric faults because either their accuracy is poor, or the circuit is not tested in the configuration it is used in. We present a DFT scheme that offers the accuracy needed to test high-quality circuits. The DFT scheme is based on a circuit that digitally measures the ratio of a pair of capacitors. The circuit is used to completely characterize the transfer function of a switched capacitor circuit, which is usually determined by capacitor ratios. In our DFT scheme, capacitor ratios can be measured to within 0.01% accuracy, and filter parameters can be shown to be satisfied to within 0.1% accuracy. A filter can be shown to satisfy all its functional specifications through this characterization process. We believe the accuracy of our scheme is at least an order of magnitude greater than that offered by any other scheme reported in the literature.
The assumption in moving system modelling to higher levels is that this improves the design process by allowing exploration of the architecture, providing an unambiguous specification, and catching system errors early. We used the interface-based, high-level abstractions of VHDL+ in a real design, in parallel with the actual project, to investigate the validity of these claims.
Standard interfaces for hardware reuse are currently defined at the structural level. In contrast to this, our contribution defines the reuse interface at the behavioral register transfer (RT) level. This promotes direct reuse of functionality and avoids the integration problems of structural reuse. We present an object oriented reuse interface in C++ and show the use of it within two real-life designs.
In this paper a newly developed object model is presented which allows hardware/software systems to be described in all their parts. An adaptation of the JavaBeans component model allows different kinds of reuse to be combined in one unified language. A model-based design flow and
some tools are presented and applied to a JPEG
example.
1.1 Keywords:
Object oriented hardware modeling, simulation, codesign.
While the number of embedded systems in consumer electronics is growing dramatically, several trends can be observed which challenge traditional codesign practice: An increasing share of functionality of such systems is implemented in software; flexibility or reconfigurability is added to the list of non-functional requirements. Moreover, networked embedded systems are equipped with communication capabilities and can be controlled over networks. In this paper, we present a suitable methodology and a set of tools targeting these novel requirements. JACOP is a codesign environment based on Java and supports specification, co-synthesis and prototyping of networked embedded systems.
This tutorial paper surveys the potential implications of subwavelength optical lithography for new tools and flows in the interface between layout design and manufacturability. We review control of optical process effects by optical proximity correction (OPC) and phase-shifting masks (PSM), then focus on the implications of OPC and PSM for layout synthesis and verification methodologies. Our discussion addresses the necessary changes in the design-to-manufacturing flow, including infrastructure development in the mask and process communities, evolution of design methodology, and opportunities for research and development in the physical layout and verification areas of EDA.
Software synthesis from a concurrent functional specification is a key problem in the design of embedded systems. A concurrent specification is well-suited for medium-grained partitioning. However, in order to be implemented in software, concurrent tasks need to be scheduled on a shared resource (the processor). The choice of the scheduling policy mainly depends on the specification of the system. For pure dataflow specifications, it is possible to apply a fully static scheduling technique, while for algorithms containing data-dependent control structures, like the if-then-else or while-do constructs, the dynamic behaviour of the system cannot be completely predicted at compile time and some scheduling decisions have to be made at run-time. For such applications we propose a quasi-static scheduling (QSS) algorithm that generates a schedule in which run-time decisions are made only for data-dependent control structures. We use Free Choice Petri Nets (FCPNs) as the underlying model and define quasi-static schedulability for FCPNs. The proposed algorithm is complete, in that it can solve QSS for any FCPN that is quasi-statically schedulable. Finally, we show how to synthesize from a quasi-static schedule a C code implementation that consists of a set of concurrent tasks.
This paper presents a new algorithm for exact estimation of the minimum memory size required by programs dealing with array computations. Memory size is an important factor affecting the area and power cost of memory units. For programs dealing mostly with array computations, memory cost is a dominant factor in the overall system cost. Thus, exact estimation of the memory size required by a program is necessary to provide quantitative information for making high-level design decisions. Based on a formulation of live variable analysis, our algorithm transforms minimum memory size estimation into an equivalent problem: counting integer points in the intersection/union of mappings of parameterized polytopes. A heuristic is then proposed to solve the counting problem. Experimental results show that the algorithm achieves the exactness traditionally associated with totally unrolling loops while keeping the computational complexity low by preserving the original loop structure.
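The paper's estimator counts integer points in parameterized polytopes symbolically; the underlying liveness idea can be illustrated with an explicit, enumerated Python sketch, where the def/last-use times are hypothetical.

def min_memory(def_time, last_use):
    # Minimum words needed = maximum number of simultaneously live values.
    # def_time / last_use map each value (e.g. an array element) to the step
    # at which it is produced and the step of its last read.
    events = []
    for v in def_time:
        events.append((def_time[v], 1))        # value becomes live
        events.append((last_use[v] + 1, -1))   # value dies after last use
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak

# Example: a[i] is produced at step i and last read at step i + 2 (N = 6),
# so at most 3 elements of a[] are ever live at once.
N = 6
defs = {("a", i): i for i in range(N)}
uses = {("a", i): i + 2 for i in range(N)}
print(min_memory(defs, uses))    # -> 3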
Fixed-point DSPs are a class of embedded processors with highly irregular architectures. This irregularity makes it difficult to generate high-quality machine code from programming languages such as C. In this paper we present a novel constraint driven approach to code selection for irregular processor architectures, which provides a twofold improvement of earlier work. First, it handles complete data flow graphs instead of trees and thereby generates better code in presence of common subexpressions. Second, the presented technique is not restricted to computation of a single solution, but it generates alternative solutions. This feature enables the tight coupling of different code generation phases, resulting in better exploitation of instruction-level parallelism. Experimental results indicate that our technique is capable of generating machine code that competes well with handwritten assembly code.
Generation of optimized DSP code from a high level language
such as C is very time consuming since current DSP compilers are
generally unable to produce efficient code. We present a software
estimation methodology from a C description that supports rapid development of DSP applications. Our tool VESTIM
provides both a performance evaluation for assembly code
generated by the compiler and an estimation of an optimized
assembly code. Blocks of applications G.721 and G.728 have been
evaluated using VESTIM. Results show that estimations are very
accurate and allow software development time to be significantly
reduced.
Keywords:
DSP, Code generation, Performance Estimation.
In this paper, we describe the software environment for Daytona, a single-chip, bus-based, shared-memory, multiprocessor DSP. The software environment is designed around a layered architecture. Tools at the lower layer are designed to deliver maximum performance and include a compiler, debugger, simulator, and profiler. Tools at the higher layer focus on improving the programmability of the system and include a run-time kernel and parallelizing tools. The run-time kernel includes a low-overhead, preemptive, dynamic scheduler with multiprocessor support that guarantees real-time performance to admitted tasks.
1.1 Keywords:
Multiprocessor DSP, media processor, software environment, run-time
kernel, RTOS
A number of researchers have proposed using digital marks to provide ownership identification for intellectual property. Many of these techniques share three specific weaknesses: complexity of copy detection, vulnerability to mark removal after revelation for ownership verification, and mark integrity issues due to partial mark removal. This paper presents a method for watermarking field programmable gate array (FPGA) intellectual property (IP) that achieves robustness by responding to these three weaknesses. The key technique involves using secure hash functions to generate and embed multiple small marks that are more detectable, verifiable, and secure than existing IP protection techniques.
Keywords:
Field programmable gate array (FPGA), intellectual property
protection, watermarking
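As a rough, hedged illustration of the multi-mark idea (the exact mark format, keying, and embedding sites are specific to the paper and the target device, and the helper below is hypothetical), a secure hash can expand an author's signature into many small, independently detectable marks:

import hashlib

def small_marks(signature: str, key: str, n_marks: int, bits_per_mark: int = 16):
    # Derive n_marks short mark values from the signature with SHA-256.
    # Each mark could be embedded independently (e.g. in otherwise unused
    # LUT configuration), so partial removal still leaves verifiable marks.
    # The FPGA embedding step itself is device-specific and omitted here.
    marks = []
    for i in range(n_marks):
        digest = hashlib.sha256(f"{key}|{signature}|{i}".encode()).digest()
        marks.append(int.from_bytes(digest, "big") & ((1 << bits_per_mark) - 1))
    return marks

print(small_marks("(c) 2024 Example Corp, design XYZ", "verification-key", 4))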
We present a methodology for the watermarking of synchronous sequential circuits that makes it possible to identify the authorship of designs by imposing a digital watermark on the state transition graph of the circuit. The methodology is applicable to sequential designs that are made available as firm Intellectual Property (IP), the designation commonly used to characterize designs specified as structural descriptions or circuit netlists. The watermarking is obtained by manipulating the state transition graph of the design in such a way as to make it exhibit a chosen property that is extremely rare in non-watermarked circuits, while, at the same time, not changing the functionality of the circuit. This manipulation is performed without ever actually computing this graph in either implicit or explicit form. We present both theoretical and experimental results that show that the watermarking can be created and verified efficiently.
While previous watermarking-based approaches to intellectual property protection (IPP) have asymmetrically emphasized the IP provider's rights, the true goal of IPP is to ensure the rights of both the IP provider and the IP buyer. Symmetric fingerprinting schemes have been widely and effectively used to achieve this goal; however, their application domain has been restricted only to static artifacts, such as image and audio. In this paper, we propose the first generic symmetric fingerprinting technique which can be applied to an arbitrary optimization/synthesis problem and, therefore, to hardware and software intellectual property. The key idea is to apply iterative optimization in an incremental fashion to solve a fingerprinted instance; this leverages the optimization effort already spent in obtaining a previous solution, yet generates a uniquely fingerprinted new solution. We use this approach as the basis for developing specific fingerprinting techniques for four important problems in VLSI CAD: partitioning, graph coloring, satisfiability, and standard-cell placement. We demonstrate the effectiveness of our fingerprinting techniques on a number of standard benchmarks for these tasks. Our approach provides an effective tradeoff between runtime and resilience against collusion.
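To make the incremental idea concrete for one of the four problems, graph coloring, the sketch below is a hypothetical rendering (not the authors' code): a buyer-specific hash selects a few extra edge constraints, and a previously computed coloring is repaired locally instead of being recolored from scratch.

import hashlib

def greedy_repair(graph, colors, changed):
    # Locally recolor only the vertices whose constraints changed, reusing
    # the existing solution: the incremental optimization step.
    for v in changed:
        used = {colors[u] for u in graph[v]}
        colors[v] = next(c for c in range(len(graph)) if c not in used)
    return colors

def fingerprint_coloring(graph, base_colors, buyer_id, n_constraints=2):
    colors, vertices, changed = dict(base_colors), sorted(graph), []
    for i in range(n_constraints):
        h = hashlib.sha256(f"{buyer_id}|{i}".encode()).digest()
        u, v = vertices[h[0] % len(vertices)], vertices[h[1] % len(vertices)]
        if u != v and v not in graph[u]:
            graph[u].add(v); graph[v].add(u)     # buyer-specific extra constraint
            if colors[u] == colors[v]:
                changed.append(v)
    return greedy_repair(graph, colors, changed)

g = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}       # tiny graph, adjacency sets
base = {0: 0, 1: 1, 2: 0, 3: 1}                  # previously computed 2-coloring
print(fingerprint_coloring(g, base, buyer_id="buyer-42"))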
The economic viability of the reusable core-based design paradigm depends on the development of techniques for intellectual property protection. We introduce the first dynamic watermarking technique for protecting the value of intellectual property of CAD and compilation tools and reusable core components. The essence of the new approach is the addition of a set of design and timing constraints which encodes the author's signature. The constraints are selected in such a way that they result in minimal hardware overhead while embedding the signature which is unique and difficult to detect, remove and forge. We establish the first set of relevant metrics which forms the basis for the quantitative analysis, evaluation, and comparison of watermarking techniques. We develop a generic approach for signature data hiding in designs, which is applicable in conjunction with an arbitrary behavioral synthesis task, such as scheduling, assignment, allocation, and transformations. Error correcting codes are used to augment the protection of the signature data from tampering attempts. On a large set of design examples, studies indicate the effectiveness of the new approach in the sense that the signature data, which are highly resilient, difficult to detect and remove, and yet easy to verify, can be embedded in designs with very low hardware overhead.
This work describes the design and implementation of an energy-efficient, scalable encryption processor that utilizes variable voltage supply techniques and a high-efficiency embedded variable output DC/DC converter. The resulting implementation dissipates 134 nJ/bit at VDD = 2.5 V when encrypting at its maximum rate of 1 Mb/s using a maximum datapath width of 512 bits. The embedded converter achieves an efficiency of 96% at this peak load. The processor is 2-3 orders of magnitude more energy efficient than optimized assembly code running on a low-power processor such as the StrongARM.
In this paper, we consider the problem of maximizing the battery life (or duration of service) in battery-powered CMOS circuits. We first show that the battery efficiency (or utilization factor) decreases as the average discharge current from the battery increases. The implication is that the battery life is a super-linear function of the average discharge current. Next we show that even when the average discharge current remains the same, different discharge current profiles (distributions) may result in very different battery lifetimes. In particular, the maximum battery life is achieved when the variance of the discharge current distribution is minimized. Analytical derivations and experimental results underline the importance of modeling the battery-hardware system as a whole and provide a more accurate basis (i.e., the product of battery discharge time and circuit delay) for comparing various low-power optimization methodologies and techniques targeted toward battery-powered electronics. Finally, we calculate the optimal value of Vdd for a battery-powered VLSI circuit so as to minimize the product of the battery discharge time and the circuit delay.
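The super-linear dependence can be illustrated with a simple rate-dependent capacity model of the Peukert type; the exponent and capacity below are generic assumptions for illustration, not parameters from the paper. Battery life falls faster than 1/I as the average current I grows, and a flatter profile with the same average outlasts a bursty one.

# Illustrative Peukert-style model: the effective charge drawn per step grows
# super-linearly with the instantaneous current, so higher-variance profiles
# (same mean) drain the battery sooner. ALPHA and CAPACITY are assumed values.

ALPHA = 1.3            # Peukert-like exponent (> 1)
CAPACITY = 1000.0      # nominal capacity, in mA*h at a 1 mA reference rate

def life_hours(profile_mA, dt_hours=1.0):
    drawn, t = 0.0, 0.0
    while True:
        for i in profile_mA:                      # the profile repeats until empty
            drawn += (i ** ALPHA) * dt_hours      # effective charge for this step
            t += dt_hours
            if drawn >= CAPACITY:
                return t

flat   = [10.0, 10.0]    # constant 10 mA
bursty = [19.0, 1.0]     # same 10 mA average, higher variance
print("flat profile  :", life_hours(flat), "h")
print("bursty profile:", life_hours(bursty), "h")   # noticeably shorter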
This paper presents a methodology for cycle-accurate simulation of energy dissipation in embedded systems. The ARM Ltd. [1] instruction-level cycle-accurate simulator is extended with energy models for the processor, the L2 cache, the memory, the interconnect and the DC-DC converter. A SmartBadge, which can be seen as an embedded system consisting of a StrongARM-1100 processor, memory and a DC-DC converter, is used to evaluate the methodology with the Dhrystone benchmark. We compared performance and energy computed by our simulator with measurements in hardware and found them in agreement within a 5% tolerance. The simulation methodology was applied to design exploration for enhancing a SmartBadge with a real-time MPEG feature.
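At its core, such a simulator is a per-cycle accumulation of component energies driven by the instruction-level simulator's activity record. The stripped-down sketch below shows only that bookkeeping; the component names and per-cycle energy numbers are placeholders, not the SmartBadge models.

# Skeleton of cycle-accurate energy accounting: each simulated cycle the ISS
# reports which components were active and their energy models are summed.
# Per-cycle energies and the converter loss fraction are illustrative only.

DCDC_LOSS_FRACTION = 0.10
ENERGY_PER_CYCLE_NJ = {
    "cpu_active": 1.20, "cpu_idle": 0.30, "cache_access": 0.80,
    "memory_access": 4.50, "interconnect": 0.25,
}

def simulate_energy(cycle_activity):
    # cycle_activity: iterable of sets naming the components active in each cycle
    total = 0.0
    for active in cycle_activity:
        cycle_energy = sum(ENERGY_PER_CYCLE_NJ[c] for c in active)
        total += cycle_energy * (1.0 + DCDC_LOSS_FRACTION)   # DC-DC converter loss
    return total

trace = [{"cpu_active", "cache_access"},
         {"cpu_active", "memory_access", "interconnect"},
         {"cpu_idle"}]
print("energy (nJ):", simulate_energy(trace))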
Power consumed by the clock of large high-performance VLSIs can be reduced by adopting the Globally Asynchronous, Locally Synchronous (GALS) design style. GALS has small overheads for global asynchronous communication and local clock generation. We propose methods to a) evaluate the benefits of GALS and account for its overheads, which can be used as the basis for partitioning the system into an optimal number and size of synchronous blocks, and b) automate the synthesis of the global asynchronous communication. Three realistic ASICs, ranging in complexity from 1 to 3 million gates, were used to evaluate GALS benefits and overheads. The results show an average clock-power saving of about 70% with negligible overheads.
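A first-order way to weigh the benefit against the overhead is to compare the switched capacitance of one global clock tree with that of the local trees plus the asynchronous-port and local-clock-generator overheads. The sketch below uses invented numbers purely to show the bookkeeping; it is not the evaluation model of the paper.

# First-order clock-power comparison for a GALS partitioning, P = f * C * Vdd^2
# per clock tree. All capacitances, frequencies and the 5% overhead are guesses.

VDD = 1.8

def clock_power(f_hz, c_farads):
    return f_hz * c_farads * VDD ** 2

blocks = [(400e6, 100e-12), (400e6, 80e-12), (200e6, 150e-12)]   # (f, local tree C)
global_tree = clock_power(400e6, 600e-12)                 # single synchronous clock
local_trees = sum(clock_power(f, c) for f, c in blocks)
overhead = 0.05 * local_trees                             # async ports, local oscillators
print("synchronous clock power (mW):", round(global_tree * 1e3, 1))
print("GALS clock power        (mW):", round((local_trees + overhead) * 1e3, 1))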
Chatoyant models free-space opto-electronic components and systems and performs simulations and analyses that allow designers to make informed system-level trade-offs. Recently, the use of MEMS bulk and surface micro-machining technology has enabled the fabrication of micro-optical-mechanical systems. This paper presents our models for diffractive optics and new analysis techniques which extend Chatoyant to support optical MEMS design. We show these features in the simulation of two optical MEMS systems.
Keywords:
Optical MEMS, MEMS-CAD, MOEMS, micro-optics
This paper presents a comprehensive analysis of the thermal effects in advanced high-performance interconnect systems arising due to self-heating under various circuit conditions, including electrostatic discharge. Technology (Cu, low-k, etc.) and scaling effects on the thermal characteristics of the interconnects, and on their electromigration reliability, have been analyzed simultaneously; this has important implications for providing robust yet aggressive deep-submicron interconnect design guidelines. Furthermore, the impact of these thermal effects on the design (driver sizing) and on the optimization of the interconnect length between repeaters for upper-level signal lines is investigated.
A 550MHz 64b PowerPC processor was developed for fabrication in Silicon-On-Insulator (SOI) technology from a processor previously designed and fabricated in bulk CMOS [1]. Both the design and the associated CAD methodology (point tools, flow, and models) were modified to handle demands specific to SOI technology. The challenge was to improve the cycle time by adapting the circuit design, timing, and chip integration methodologies to accommodate effects unique to SOI.
The increasing complexity and geographical separation of design data, tools and teams has created a need for a collaborative and distributed design environment. In this paper we present a framework that enables collaborative and distributed Web-based CAD, in which the designers can collaborate on a design and efficiently utilize existing design tools on the Internet. The framework includes a Java-based hierarchical collaborative schematic/block editor with interfaces to distributed Web tools and cell libraries, infrastructure to store and manipulate design objects, and protocols for tool communication, message passing and collaboration.
Inductance effects in on-chip interconnects have become significant for specific cases such as clock distributions and other highly optimized networks [1,2]. Designers and CAD tool developers are searching for ways to deal with these effects. Unfortunately, accurate on-chip inductance extraction and simulation in the general case are much more difficult than capacitance extraction. In addition, even if ideal extraction tools existed, most chip designers have little experience designing with lossy transmission lines. This tutorial will attempt to demystify on-chip inductance through the discussion of several illustrative examples analyzed using full-wave extraction and simulation methods. A specialized PEEC (Partial Element Equivalent Circuit) method tailored for chip applications was used for most cases. Effects such as overshoot, reflections, frequency dependent effective resistance and inductance will be illustrated using animated visualizations of the full-wave simulations. Simple examples of design techniques to avoid, mitigate, and even take advantage of on-chip inductance effects will be described.
In this survey paper we describe the combination of discretized integral formulations, sparsification techniques, and Krylov-subspace based model-order reduction that has led to robust tools for automatic generation of macromodels representing the distributed RLC effects in 3-D interconnect. A few computational results are presented, mostly to point out the problems yet to be addressed.
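As a bare-bones instance of the Krylov-subspace reduction step, the numpy sketch below projects a descriptor system E dx/dt = A x + b u, y = c^T x onto an orthonormal basis of the Krylov subspace generated by A^{-1}E and A^{-1}b (moment matching about s = 0). Sparse factorization, multiport inputs, and passivity preservation, which the surveyed tools provide, are deliberately omitted.

import numpy as np

def krylov_reduce(E, A, b, c, q):
    # Reduce E*dx/dt = A*x + b*u, y = c @ x by orthogonal projection onto
    # K_q(A^{-1} E, A^{-1} b). The dense inverse stands in for a sparse LU
    # solve; assumes q << n and no breakdown in the Gram-Schmidt recursion.
    Ainv = np.linalg.inv(A)
    v = Ainv @ b
    V = [v / np.linalg.norm(v)]
    for _ in range(q - 1):
        w = Ainv @ (E @ V[-1])
        for vj in V:                      # Gram-Schmidt orthogonalization
            w -= (vj @ w) * vj
        V.append(w / np.linalg.norm(w))
    V = np.column_stack(V)
    return V.T @ E @ V, V.T @ A @ V, V.T @ b, V.T @ c

rng = np.random.default_rng(0)
n = 6
E, A = np.eye(n), -np.eye(n) - 0.1 * rng.standard_normal((n, n))
b, c = rng.standard_normal(n), rng.standard_normal(n)
Er, Ar, br, cr = krylov_reduce(E, A, b, c, q=3)
print(Ar.shape)                           # (3, 3) reduced-order system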
IC inductance extraction generally produces either port inductances based on simplified current path assumptions or a complete partial inductance matrix. Combining either of these results with the IC interconnect resistance and capacitance models significantly complicates most IC design and verification methodologies. In this tutorial paper we will review some of the analysis and verification problems associated with on-chip inductance, and present a subset of recent results for partially addressing the challenges which lie ahead.
Keywords:
Interconnect; Inductance; Model Order Reduction.
As the family of Alpha microprocessors continues to scale into more advanced technologies with very high frequency edge rates and multiple layers of interconnect, the issue of characterizing inductive effects and providing a chip-wide design methodology becomes an increasingly complex problem. To address this issue, a test chip has been fabricated to evaluate various conductor configurations and verify the correctness of the simulation approach. The implementation of and results from this test chip are presented in this paper. Furthermore, the analysis has been extended to the upcoming EV7 microprocessor, and important aspects of the derivation of its design methodology, as it pertains to these inductive effects, are discussed.
Keywords:
Alpha microprocessor, semiconductor, interconnect, buses,
inductance, resistance, capacitance, RLC, noise, cross-talk,
transmission line.
We present a system that automatically generates a cycle-accurate and bit-true Instruction Level Simulator (ILS) and a hardware implementation model given a description of a target processor. An ILS can be used to obtain a cycle count for a given program running on the target architecture, while the cycle length, die size, and power consumption can be obtained from the hardware implementation model. These figures allow us to accurately and rapidly evaluate target architectures within an architecture exploration methodology for system-level synthesis. In an architecture exploration scheme, both the ILS and the hardware model must be generated automatically, else a substantial programming and hardware design effort has to be expended in each design iteration. Our system uses the ISDL machine description language to support the automatic generation of the ILS and the hardware synthesis model, as well as other related tools.
This paper presents the machine description language LISA for the generation of bit- and cycle-accurate models of DSP processors. Based on a behavioral operation description, the architectural details and pipeline operations of modern DSP processors can be covered. Beyond the behavioral model, LISA descriptions include other architecture-related information such as the instruction set. The information provided by LISA models enables automatic generation of simulators and assemblers, which are essential elements of DSP software development environments. To prove the applicability of our approach, a realized model of the Texas Instruments TMS320C6201 DSP is presented, and derived LISA code examples are given.
The growing requirement to design high-performance systems correctly and in a short time forces the use of IPs in many designs. In this paper, we propose a new approach to select the optimal set of IPs and interfaces that makes the application program meet the performance constraints in ASIP designs. The proposed approach selects IPs while taking their interfaces into account, and supports executing parts of a task in the kernel as software code concurrently with other parts in IPs; previous state-of-the-art approaches neither consider IPs and interfaces simultaneously nor support such concurrent execution. Experimental results on real applications show that the proposed approach is effective in making application programs meet the performance constraints using IPs.
Analog synthesis tools have failed to migrate into mainstream use primarily because of difficulties in reconciling the simplified models required for synthesis with the industrial-strength simulation environments required for validation. MAELSTROM is a new approach that synthesizes a circuit using the same simulation environment created to validate the circuit. We introduce a novel genetic/annealing optimizer, and leverage network parallelism to achieve efficient simulator-in-the-loop analog synthesis.
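The simulator-in-the-loop structure amounts to an optimizer whose cost function is simply a call into the production simulation environment. The skeleton below shows plain simulated annealing around a stubbed simulator call; it is a sketch of the structure only, since MAELSTROM's optimizer combines genetic and annealing moves and distributes evaluations over a network of workstations.

import math, random

def simulate(sizes):
    # Stub standing in for a full simulator evaluation of the sized circuit;
    # returns a scalar cost (lower is better). Toy cost with optimum at 3.0.
    return sum((s - 3.0) ** 2 for s in sizes)

def anneal(n_params, steps=2000, t0=10.0):
    x = [random.uniform(1.0, 10.0) for _ in range(n_params)]
    cost = simulate(x)
    for k in range(steps):
        t = t0 * (1.0 - k / steps) + 1e-6                    # cooling schedule
        cand = [max(0.1, xi + random.gauss(0.0, 0.5)) for xi in x]
        c = simulate(cand)                                   # simulator in the loop
        if c < cost or random.random() < math.exp((cost - c) / t):
            x, cost = cand, c
    return x, cost

print(anneal(n_params=4))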
This paper presents a novel approach for synthesis of analog systems from behavioral VHDL-AMS specifications. We implemented this approach in the VASE behavioral-synthesis tool. The synthesis process produces a netlist of electronic components that are selected from a component library and sized such that the overall area is minimized and the rest of the performance constraints, such as power, slew-rate, bandwidth, etc., are met. The gap between system-level specifications and implementations is bridged using a hierarchically organized design-space exploration methodology. Our methodology performs a two-layered synthesis, the first layer being architecture generation and the second component synthesis and constraint transformation. For architecture generation we suggest a branch-and-bound algorithm, while component synthesis and constraint transformation use a Genetic Algorithm-based heuristic. Crucial to the success of our exploration methodology is a fast and accurate performance estimation engine that embeds technology process parameters, SPICE models for basic circuits and performance composition equations. We present a telecommunication application as an example to illustrate our synthesis methodology, and show that constraint-satisfying designs can be synthesized in a short time and with reduced designer effort.
This paper presents a method to reduce the complexity of a linear or linearized (small-signal) analog circuit. The reduction technique, based on quality-error ranking, can be used as a standard reduction engine that ensures the validity of the resulting network model in a specific (set of) design point(s) within a given frequency range and a given magnitude and phase error. It can also be used as an analysis engine to extract symbolic expressions for poles and zeroes. The reduction technique is driven by analysis of the signal flow graph associated with the network model. Experimental results show the effectiveness of the approach.
We present our practical experience in the modeling and integration of cycle/phase-accurate instruction set architecture (ISA) models of digital signal processors (DSPs) with other hardware and software components. A common approach to the modeling of processors for HW/SW co-verification relies on instruction-accurate ISA models combined (i.e. wrapped) with the bus interface models (BIM) that generate the clock/phase-accurate timing at the component's interface pins. However, for DSPs and new microprocessors with complex architectural features this approach is from our perspective not acceptable. The additional extensive modeling of the pipeline and other architectural details in the BIM would force us to develop two detailed processor models with a complex BIM API between them. We therefore propose an alternative approach in which the processor ISAs themselves are modeled in a full cycle/phase-accurate fashion. The bus interface model is then reduced to just modeling the connection to the pins. Our models have been integrated into a number of cycle-based and event-driven system simulation environments. We present one such experience in incorporating these models into a VHDL environment. The accuracy has been verified cycle-by-cycle against the gate/RTL level models. Multi-processor debugging and observability into the precise cycle-accurate processor state is provided. The use of co-verification models in place of the RTL resulted in system speedups up to 10 times, with the cycle-accurate ISA models themselves reaching performances of up to 123K cycles/sec.
One possible solution to the verification crisis is to bridge the gap between formal verification and simulation by using hybrid techniques. This paper presents a study of such a functional verification methodology that uses coverage of formal models to specify tests. This was applied to a modern superscalar microprocessor and the resulting tests were compared to tests generated using existing methods. The results showed some 50% improvement in transition coverage with less than a third the number of test instructions, demonstrating that hybrid techniques can significantly improve functional verification.
1.1 Keywords:
Functional verification, test generation, formal models,
transition coverage
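The notion of transition coverage on a formal model can be made concrete with a small finite-state model: a test is an input sequence, and the goal is to exercise every (state, input) transition of the abstract model with as few inputs as possible. The toy sketch below is hypothetical and not the microprocessor model or test generator from the paper.

# Greedily extend one input sequence until every transition of a toy FSM model
# has been exercised; real coverage-directed generation works on much richer
# formal models and emits instruction sequences rather than abstract inputs.

FSM = {  # (state, input) -> next state
    ("IDLE", "req"): "BUSY", ("IDLE", "nop"): "IDLE",
    ("BUSY", "done"): "IDLE", ("BUSY", "nop"): "BUSY",
}

def generate_tests(start="IDLE"):
    uncovered, state, seq = set(FSM), start, []
    while uncovered:
        choices = [inp for (s, inp) in uncovered if s == state]   # uncovered first
        inp = choices[0] if choices else next(i for (s, i) in FSM if s == state)
        uncovered.discard((state, inp))
        seq.append(inp)
        state = FSM[(state, inp)]
    return seq

tests = generate_tests()
print(tests, "-> covers all", len(FSM), "transitions")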
Dynamic-current based test techniques can potentially address the drawbacks of traditional and Iddq test methodologies. The quality of dynamic current based test is degraded by process variations in IC manufacture. The energy consumption ratio (ECR) is a new metric that improves the effectiveness of dynamic current test by reducing the impact of process variations by an order of magnitude. We address several issues of significant practical importance to an ECR-based test methodology. We use the ECR to test a low-voltage submicron IC with a microprocessor core. The ECR more than doubles the effectiveness of the dynamic current test already used to test the IC. The fault coverage of the ECR is greater than that offered by any other test, including Iddq. We develop a logic-level fault simulation tool for the ECR and techniques to set the threshold for an ECR-based test process. Our results demonstrate that the ECR offers the potential to be a high-quality low-cost test methodology. To the best of our knowledge, this is the first dynamic-current based test technique to be validated with manufactured ICs.
This paper describes a physical model for spiral inductors on silicon which is suitable for circuit simulation and layout optimization. Key issues related to inductor modeling, such as skin effect and silicon substrate loss, are discussed. An effective ground shield is devised to reduce substrate loss and noise coupling. A practical design methodology based on the trade-off between the series resistance and oxide capacitance of an inductor is presented. This method is applied to optimize inductors in state-of-the-art processes with multilevel interconnects. The impact of interconnect scaling, copper metallization and low-K dielectric on the achievable inductor quality factor is studied.
1.1 Keywords:
Spiral inductor, quality factor, skin effect, substrate loss,
substrate coupling, patterned ground shield, interconnects
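A first-cut feel for the series-resistance side of that trade-off comes from a single series L-R calculation in which the metal resistance rises with frequency through the skin depth. The geometry and process numbers below are generic assumptions, not values or the complete model from the paper, and the oxide/substrate branch that limits Q near self-resonance is omitted.

import math

MU0 = 4e-7 * math.pi        # H/m
RHO = 1.7e-8                # ohm*m, copper-like resistivity (assumed)

def series_resistance(length, width, thickness, freq):
    # Common skin-effect approximation R = rho*l / (w * delta * (1 - e^{-t/delta}))
    delta = math.sqrt(2 * RHO / (2 * math.pi * freq * MU0))   # skin depth
    return RHO * length / (width * delta * (1.0 - math.exp(-thickness / delta)))

def series_q(L, length, width, thickness, freq):
    # Q of the series L-R branch alone; oxide and substrate parasitics omitted.
    return 2 * math.pi * freq * L / series_resistance(length, width, thickness, freq)

# assumed geometry: 8 nH spiral, 6 mm unrolled length, 10 um wide, 2 um thick metal
for f in (0.5e9, 1e9, 2e9):
    print(f / 1e9, "GHz  Q =", round(series_q(8e-9, 6e-3, 10e-6, 2e-6, f), 1))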
At present there are two common types of integrated circuit inductor simulation tools. The first type is based on the Greenhouse method [1] and obtains a solution in a fraction of a second; however, because it does not use solutions of the inductor charge and current distributions, it has limited accuracy. The second type, method of moments (MoM) solvers, determines the charge and current variations by decomposing the inductor into thousands of sub-elements and solving a matrix. However, this process takes between minutes and hours to obtain a reasonably accurate solution. In this paper, we present a series of algorithms for solving inductors whose radius is small compared to the wavelength of the electrical signal; these algorithms equal or exceed the accuracy of MoM solvers, but obtain their solutions in roughly one second.
We present an efficient method for optimal design and synthesis of CMOS inductors for use in RF circuits. This method uses the physical dimensions of the inductor as the design parameters and handles a variety of specifications, including a fixed value of inductance, minimum self-resonant frequency, minimum quality factor, etc. Geometric constraints that can be handled include maximum and minimum values for every design parameter and a limit on total area. Our method is based on formulating the design problem as a special type of optimization problem called geometric programming, for which powerful and efficient interior-point methods have recently been developed. This allows us to solve the inductor synthesis problem globally and extremely efficiently. Also, we can rapidly compute globally optimal trade-off curves between competing objectives such as quality factor and total inductor area. We have fabricated a number of inductors designed by the method, and found good agreement between the experimental data and the specifications predicted by our method.
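To show what such a geometric-programming formulation looks like in practice, the sketch below poses a toy version using cvxpy's geometric-programming mode. The monomial inductance and resistance models, their coefficients, and the specification numbers are assumptions made for illustration; they are not the expressions or the solver used in the paper, and the integrality of the turn count is relaxed.

import cvxpy as cp

K = 1.2e-6                       # assumed fit constant for L ~ K * n^2 * d_avg

n     = cp.Variable(pos=True)    # number of turns (relaxed to a real value)
d_avg = cp.Variable(pos=True)    # average diameter (m)
w     = cp.Variable(pos=True)    # trace width (m)

L_model    = K * n**2 * d_avg        # crude monomial inductance model (assumed)
area       = d_avg**2                # proxy for total inductor area
resistance = 0.02 * n * d_avg / w    # series-resistance proxy (assumed)

constraints = [
    L_model >= 8e-9,             # required inductance
    resistance <= 5.0,           # limits the Q degradation
    w >= 3e-6, w <= 30e-6,       # design-rule bounds
    n >= 2, n <= 6,
]
prob = cp.Problem(cp.Minimize(area), constraints)
prob.solve(gp=True)              # solved globally as a geometric program
print("turns:", round(float(n.value), 2),
      " d_avg (um):", round(float(d_avg.value) * 1e6, 1),
      " w (um):", round(float(w.value) * 1e6, 1))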