Ultralow power wireless communications, such as paging receivers operating from a single cell, have prompted the development and discovery of circuit techniques and architectures which lower power consumption of RF ICs ten-fold or more compared to today's norms. This paper illustrates many of these design principles, mainly in a CMOS context. It argues for seeking strategic combinations of high quality off-chip passives with RF integrated circuits, and searching for better architectures in wireless receivers (and transmitters) to lower power. The principles are illustrated with specific examples.
This paper presents a 1 V digital signal processor used in the Danalogic hearing aid manufactured by GN Danavox. The processor is the first general purpose programmable processor used in behind-the-ear and in-the-ear hearing aid applications. It is integrated with memories in a 0.5 um CMOS process with standard thresholds. At 2 MHz processing speed, the processor consumes 800 uA from a single cell battery. Using a dual multiply-accumulate architecture, the processor executes a 256 point block floating-point FFT in just 2900 instruction cycles.
1-V ultra low-power SRAM circuit techniques are described for word-bit configurable memory macrocells. A shared bitline SRAM cell architecture with modified address assignment is proposed to reduce wasted memory-cell current to zero while suppressing the area penalty. For the new SRAM cell design, we devise a multiplexer-merged charge-transfer amplifier for highly sensitive read operation and a bitline precharge scheme with an equalizing line for high-speed write-recovery operation. A 1-V operating 64-kb (2kw x 16b x 2) test chip was designed using a 0.35-um multithreshold-voltage CMOS (MTCMOS) logic process. The simulated power dissipation is 1/4 (486 mW) that of the conventional 1-V word-bit configurable SRAM macrocell with a 13% area increase.
Retractile clock-powered logic is presented as a low-overhead energy-recovery logic style. It uses energy-efficient clock-steering circuits, pass-transistor logic, and a four-phase clocking scheme to recover energy from all circuit nodes but the latches. A 16-bit retractile clock-powered adder is described and evaluated through HSPICE simulations. The simulation results indicate that this approach can offer superior energy versus delay performance but the benefit depends strongly on the switching activity of the clock-powered nodes.
This paper describes the impact of crosstalk noise on low power design techniques based on voltage scaling. It is shown that this power saving strategy aggravates the crosstalk noise problem and reduces circuit noise immunity. A new energy-efficient, noise-tolerant dynamic circuit technique is presented to address this problem. In a 0.35 um CMOS technology and at a given supply voltage, the proposed technique provides an improvement in noise immunity of 1.8X (for an AND gate) and 2.5X (for an adder carry chain) over domino at the same speed. We use this fact to operate the noise-tolerant circuit at a lower supply voltage to obtain energy savings of about 30%, while expending 30% more area. Also, to achieve a given noise immunity, the proposed technique consumes 40% less energy compared to existing noise-tolerance techniques.
In this paper, we propose a framework for low-energy digital signal processing (DSP) where the supply voltage is scaled beyond the critical voltage required to match the critical path delay to the throughput. This deliberate introduction of input-dependent errors leads to degradation in the algorithmic performance, which is compensated for via algorithmic noise-tolerance (ANT) schemes. The resulting setup, which comprises the DSP architecture operating at sub-critical voltage and the error control scheme, is referred to as soft DSP. It is shown that technology scaling renders the proposed scheme more effective as the delay penalty suffered due to voltage scaling reduces due to short channel effects. The effectiveness of the proposed scheme is also enhanced when arithmetic units with a higher "delay-imbalance" are employed. A prediction based error-control scheme is proposed to enhance the performance of the filtering algorithm in the presence of errors due to soft computations. For a frequency selective filter, it is shown that the proposed scheme provides 60% reduction in energy dissipation for filter bandwidths up to 0.5 pi (where 2 pi corresponds to the sampling frequency fs) over that achieved via conventional voltage scaling, with a maximum of 0.5dB degradation in the output signal-to-noise ratio (SNRsigma). It is also shown that the proposed algorithmic noise-tolerance schemes can be used to improve the performance of DSP algorithms in the presence of bit-error rates of up to 10^-3 due to deep submicron (DSM) noise.
Turbo codes have become popular for next-generation wireless communication systems because of their remarkable coding performance. One problem in decoding turbo codes at the receiver is the complexity and high power consumption, since multiple iterations of the Soft Output Viterbi Algorithm (SOVA) have to be carried out to decode a data frame. In this paper, we address the issues of reducing the complexity and power consumption of the turbo code decoder. An approach using cyclic redundancy checking (CRC) to adaptively terminate the SOVA iterations of each frame is presented. This results in a system with a variable workload, in which the amount of computation required differs from one data frame to the next. Dynamic voltage scaling is then used to further reduce the power consumption. However, since the workload is not yet known at the time when the data is being decoded, optimum voltage assignment is not feasible. In this work, we propose two heuristic algorithms to assign the supply voltage for different decoding iterations. Simulation results show that a significant reduction in power consumption is achieved compared with a system using a fixed supply voltage.
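The CRC-based early termination described above can be sketched as follows. This is only an illustration under stated assumptions: the list of per-iteration outputs is a hypothetical stand-in for a SOVA decoder, `zlib.crc32` is used as a convenient CRC rather than whatever code the authors chose, and the voltage-assignment heuristics are not modeled.

```python
import zlib

def decode_with_crc_stop(estimates, expected_crc, max_iterations):
    """Consume per-iteration decoder outputs until the CRC matches.

    estimates       -- iterable of bytes; a hypothetical stand-in for the hard
                       decisions a SOVA decoder would produce after each iteration
    expected_crc    -- CRC-32 of the transmitted payload (assumed known here; a
                       real frame would carry its own CRC bits)
    max_iterations  -- upper bound on iterations, as in a fixed-iteration decoder
    """
    used = 0
    payload = None
    for payload in estimates:
        used += 1
        # Stop early on a CRC match, or give up at the iteration limit.
        if zlib.crc32(payload) == expected_crc or used >= max_iterations:
            break
    return payload, used


sent = b"frame-0042"
crc = zlib.crc32(sent)
# Hypothetical decoder outputs: wrong after iteration 1, correct from iteration 2.
outputs = [b"frame-0O42", b"frame-0042", b"frame-0042", b"frame-0042"]
payload, used = decode_with_crc_stop(outputs, crc, max_iterations=4)
```

In this toy run the frame needs two of the four allowed iterations, which is the kind of per-frame workload variation that dynamic voltage scaling can then exploit.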
Spread spectrum systems are being widely deployed today and are becoming more prevalent as most next-generation wireless systems are adopting it for their common air interface. These systems include the digital cellular IS-95A/B/C, IEEE 802.11 wireless local area networks, as well as third-generation wideband code-division multiple access systems. In spread-spectrum systems, the receiver must synchronize to the transmitted pseudo-noise (PN) code to obtain the performance improvement achieved through spreading. Since PN acquisition must process the spread-spectrum signal at a speed much faster than the transmitted data rate, its energy consumption can become significant and should be minimized for portable applications. Typically, either matched filters or serial correlators are used to acquire the PN code timing. This paper describes a hybrid PN acquisition architecture which employs both matched filters and serial correlators to achieve lower energy consumption and a fast acquisition time compared to the traditional approaches of using either matched filters or serial correlators alone. The hybrid architecture has been implemented in RTL VHDL and synthesized down to gate level in a 0.5-micron CMOS library. Synthesis results show a factor of four reduction in energy for the hybrid scheme as compared to the matched filter architecture and a factor of two reduction in energy as compared to the serial architecture.
A system is proposed to convert ambient mechanical vibration
into electrical energy for use in powering autonomous low-power electronic
systems. The energy is transduced through the use of a variable capacitor,
which has been designed with MEMS (microelectromechanical systems) technology.
A low-power controller IC has been fabricated in a 0.6um CMOS process
and has been tested and measured for losses. Based on the tests, the system
is expected to produce 8uW of usable power.
Keywords - Energy Conversion, MEMS, Low-Power, Self-Powered
A variable supply-voltage (VS) scheme with a high power-conversion-efficiency
DC-DC converter is presented. A new pulse width modulation (PWM) circuit
for the DC-DC converter is proposed to reduce both power consumption
and chip area. The power conversion efficiency reaches up to 95%, and the
area is less than half of the conventional design. The VS scheme contains
critical path replica circuits of an MPEG-4 codec LSI, and its output
voltage is controlled by monitoring delay time of the replica circuits.
Consequently the VS scheme can automatically generate minimal internal
supply voltage that meets the demand from the operation frequency of an
MPEG-4 codec LSI. The advantages of this circuit are successfully demonstrated
through fabrication of a test chip using a 0.3 um CMOS technology.
Keywords: DC-DC, low power, low voltage, PWM, variable supply voltage.
Several new building blocks are demonstrated, which enable low-power (1.1-1.8V) analog functionality in a single-poly, digital CMOS process. These cells facilitate the integration of analog converters on system-on-a-chip ICs without adding any extra cost to the process. A voice A/D, designed with these circuits, exhibited an SNR of 68 dB at an analog supply voltage of 1.1V, and 75dB at 1.8V. This is despite the noisy digital environment of an on-chip DSP operating at 60 MHz and a digital supply voltage of 2.5V.
In this paper, we propose a technique that uses an additional mini cache, the L0-Cache, located between the instruction cache (I-Cache) and the CPU core. This mechanism can provide the instruction stream to the data path and, when managed properly, it can effectively eliminate the need for high utilization of the more expensive I-Cache. In this work, we propose, implement, and evaluate a series of run-time techniques for dynamic analysis of the program instruction access behavior, which are then used to proactively guide the access of the L0-Cache. The basic idea is that only the most frequently executed portions of the code should be stored in the L0-Cache since this is where the program spends most of its time. We present experimental results to evaluate the effectiveness of our scheme in terms of performance and energy dissipation for a series of SPEC95 benchmarks. We also discuss the performance and energy tradeoffs that are involved in these dynamic schemes.
Modern microprocessors employ one or two levels of on-chip
caches to bridge the burgeoning speed disparities between the
processor and the RAM. These SRAM caches are a major
source of power dissipation. We investigate architectural
techniques, that do not compromise the processor cycle time,
for reducing the power dissipation within the on-chip cache
hierarchy in superscalar microprocessors. We use a detailed
register-level simulator of a superscalar microprocessor that
simulates the execution of the SPEC benchmarks and SPICE
measurements for the actual layout of a 0.5 micron, 4-metal
layer cache, optimized for a 300 MHz clock. We show that a
combination of subbanking, multiple line buffers and bit-line
segmentation can reduce the on-chip cache power dissipation
by as much as 75% in a technology-independent manner.
Key words: Low power caches, power estimation.
Turbo codes are the most recent breakthrough in coding theory. Although their decoding algorithm is highly data dominated, no systematic memory optimization study has been performed yet. We have applied the IMEC Data Transfer and Storage Exploration (DTSE) methodology to the MAP (Maximum A Posteriori) class of turbo decoding algorithms. We present an extensive overview of our optimizations and tradeoffs, which result in a parametric family of new optimized algorithms. The optimal choice of parameters depends on the specific turbo code and on the implementation target, which can be either hardware or software.
This paper describes a new mixed-swing topology for dual-rail domino logic that results in a simultaneous energy and delay reduction. HSPICE simulation results for a 1-bit full adder cell show a 24% delay decrease and a 24% energy reduction for the mixed-swing topology compared to standard dual-rail domino. Energy and delay trends with supply voltage scaling are also presented for the adder cell. An 8-bit by 8-bit multiplier design with mixed-swing dual-rail domino adders is presented. Simulation results show this implementation to be 10% faster with an 18% energy savings.
The charge recovery databus is a scheme which reduces energy consumption through the application of adiabatic circuit techniques. Previous work [2] gives a solid theoretical analysis of this scheme, including quantitative data assuming random bus values. We extend this earlier work by presenting a quantitative analysis of the charge recovery databus using 15 benchmarks and 4 high-level bus coding schemes. We show that a very simple implementation of the charge recovery databus is capable of reducing average energy consumption by 28% beyond traditional high-level bus encoding techniques.
A new dynamic differential logic family, Short-Circuit Current
Logic (SC2L), is proposed for low-power high-performance
applications. It achieves low-power consumption by using an
aggressively reduced logic swing without requiring restoration
circuitry. Using a 0.35 um CMOS technology and a nominal
supply voltage of 3.3V, an SC2L full-adder 8 carry ripple adder
(CRA) is implemented. It offers an order of magnitude less
power-delay product than several other logic families.
Keywords:
Digital circuits, high-performance, low-power, low swing logic
In this paper, we propose a 'conforming inverted data store' scheme for reducing the power consumption in memory components. It reduces the power consumption by conforming memory contents to the precharging value of the memory. It selectively stores normal or inverted data so as to reduce the total number of accessed bits that differ from the precharging value. In this way, bitline toggling during memory access is minimized, which ultimately contributes to a reduction in power consumption. We develop two practical implementations of the proposed method, namely vertical-strip and horizontal-strip inversion schemes. Simulation results indicate that the strip-based inversion schemes reduce power by up to 50%.
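The selective-inversion idea can be illustrated with a minimal sketch. It assumes one inversion flag per word and a precharge value of all ones; both are simplifications chosen for illustration, and the function names are hypothetical. The strip-based organizations of the paper are not reproduced here.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def encode_word(word, width, precharge_bit=1):
    """Return (stored_word, invert_flag).

    The word is stored inverted when more than half of its bits differ from
    the precharge value, so fewer bitlines toggle when it is accessed.
    The extra flag bit is an assumed overhead of this simplified scheme.
    """
    mask = (1 << width) - 1
    precharge = mask if precharge_bit else 0
    differing = popcount((word ^ precharge) & mask)
    if 2 * differing > width:
        return (~word) & mask, 1
    return word & mask, 0


def decode_word(stored, invert_flag, width):
    """Undo the optional inversion applied by encode_word."""
    mask = (1 << width) - 1
    return (~stored) & mask if invert_flag else stored & mask


stored, flag = encode_word(0b00000001, width=8)   # 7 of 8 bits differ from all-ones
restored = decode_word(stored, flag, width=8)
```

Here the word is stored as `0b11111110` with the flag set, so only one data bit differs from the precharge value instead of seven.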
Numerous efforts in balancing the trade-off
between power, area and performance have
been made in the medium performance, medium
power region of the design spectrum.
However, not much study has been done at the
two extreme ends of the design spectrum,
namely the ultra-low power with acceptable
performance at one end (the focus of this paper),
and high performance with power within
limit at the other. One solution to achieve the
ultra-low power requirement is to operate the
digital logic gates in subthreshold region. We
analyze both CMOS and Pseudo-NMOS logic
families operating in subthreshold region. We
compare the results with CMOS in normal
strong inversion region and with other known
low-power logic, namely, energy recovery logic.
Our results show an energy per switching reduction
of two orders of magnitude for an 8x8
carry save array multiplier when it is operated
in subthreshold region.
Keywords:
Ultra-low power, digital logic, subthreshold circuits
Adiabatic circuits offer a promising alternative to conventional circuitry for low energy design. Their operation is nevertheless subject to fundamental energy-speed trade-offs, just like any other physical realization of boolean logic. Thus, adiabatic circuits with very low energy consumption at low frequencies fail to function at high operating frequencies. Conversely, high-speed adiabatic circuits tend to be dissipative at low clock rates. This paper describes SCAL, a single-phase source-coupled adiabatic logic family that operates efficiently across a wide range of operating frequencies. In layout-based simulations with 0.5um CMOS process parameters, pipelined carry-lookahead adders developed in our logic function correctly from 10MHz up to 280MHz. Our SCAL adders are less dissipative than corresponding designs in alternative adiabatic families that remain functional across the same frequency range. Moreover, they are about as dissipative as other adiabatic circuits that are geared towards very efficient operation at low frequencies. In comparison with their CMOS counterparts, our SCAL adders are 3 to 10 times more energy efficient.
Data referencing during program execution can be a significant source of energy consumption, especially for data-intensive programs. In this paper, we propose an approach to minimize such energy consumption by allocating data to proper registers and memory. Through careful analysis of boundary conditions between consecutive blocks, our approach efficiently handles various control structures including branches, merges and loops, and achieves superior allocation results for the whole program. The computational cost of our approach for solving the global register allocation problem is rather low compared with known approaches, while the quality of our results is very encouraging.
In this paper, we propose a modeling approach for the average power consumption of macro-blocks that are typically used in digital signal processing (DSP) systems, such as adders, multipliers and delay elements, in terms of their input/output signal switching statistics. The resulting power macro-model, consisting of a quadratic or cubic equation in four variables, can be used to estimate the average power consumed in the macro-block for any given input/output signal statistics. This enables high-level power estimation and allows one to compare the power performance of different competing DSP systems during high-level synthesis. This approach has been implemented and models have been built and tested for many macro-blocks.
A new method, called DVDV, for low-power design of high-performance
CMOS logic circuits is presented. DVDV utilizes
a library of gates with dual supply voltages (Vdd) and
dual threshold voltages (Vth) to achieve high-performance
with minimum dynamic and leakage power. A Depth-First-Search
(DFS) based heuristic for DVDV node assignment is
described. Exercising the techniques on a set of benchmarks
shows significant power savings over the dual-Vdd
(with a single Vth) scheme, and faster speeds than those possible
with the dual-Vth (and a single Vdd) approach.
Keywords:
low-power, low-voltage, logic design, CMOS, dual-Vdd, dual-Vth
This paper describes a completely on-chip voltage regulation technique for locally generating an adaptive low voltage power supply rail from a given higher voltage power supply without requiring any external component. The on-chip regulator, based on delay servoing, primarily comprises a critical path replica, a charge pump and a high performance voltage buffer, which is the most critical component of the design. Simulation results in a 0.5 um CMOS process demonstrate that the buffer offers a low DC output impedance, a high degree of voltage regulation (output ripple of 12% of Vdd) and a superior line regulation (up to the maximum clock frequency of 50MHz) even under strongly varying load conditions. The regulator response for a typical worst case load exhibits a maximum voltage fluctuation of 4% of Vdd with a reasonably fast response time.
Owing to their higher output dynamic range, two-stage amplifiers may become an interesting alternative to cascoded single-stage amplifiers for low voltage switched capacitor applications. Therefore, a comparison of the minimum power consumption of both approaches, based on an optimisation methodology, is presented. It is worked out which amplifier type should be used to achieve minimum power consumption for a given supply voltage, capacitor ratio and desired settling precision.
In this paper we present an approach to calculate
lower and upper bounds for the switching
activity in scheduled data flow graphs. The
technique can be used to prune the design space
in high level synthesis for low power before allocation
and binding of functional units and registers.
The low power allocation and binding
problem is formulated. It is shown that this
problem can be relaxed to the bipartite
weighted matching problem which is solvable
in O(n^3) time, where n is the number of functional
units or registers, respectively. The application
of the technique on benchmarks shows the
tightness of the bounds. Most of the investigated
bounds were less than 1% off the minimum
and maximum solutions, respectively.
Keywords: High-level power estimation, bounds
estimation
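To make the relaxed binding problem concrete, the sketch below finds a minimum-weight perfect matching between operations and functional units by brute force. The cost matrix is hypothetical (entry `cost[i][j]` stands for the switching activity incurred when operation `i` is bound to unit `j`), and the exhaustive search is used only for clarity; it takes factorial time, whereas polynomial-time matching algorithms such as the Hungarian method exist.

```python
from itertools import permutations

def min_cost_binding(cost):
    """Exhaustively find the cheapest one-to-one binding.

    cost[i][j] is an assumed switching-activity weight for binding
    operation i to functional unit j. Returns (total, assignment), where
    assignment[i] is the unit chosen for operation i.
    """
    n = len(cost)
    best_total, best_assignment = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_total is None or total < best_total:
            best_total, best_assignment = total, perm
    return best_total, best_assignment


# Hypothetical 3x3 weight matrix.
weights = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, assignment = min_cost_binding(weights)
```

For this matrix the cheapest binding assigns operation 0 to unit 1, operation 1 to unit 0 and operation 2 to unit 2, for a total weight of 5.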
We present a novel macromodeling technique for estimating the energy dissipated in a logic circuit for every input vector pair (we call this the energy-per-cycle). The macromodel is based on classifying the input vector pairs on the basis of their Hamming distances and using a different equation-based macromodel for every Hamming distance. The variables of our macromodel are the zero-delay transition counts at three logic levels inside the circuit. We present an automatic characterization process by which such macromodels can be constructed. This energy-per-cycle macromodel provides a transient energy waveform, and can also be used to estimate the moving average energy over any time window. This approach has been implemented and models have been built and tested for many circuits. The average error observed in estimating the energy-per-cycle is under 20%. The model can also be used to measure the long-term average power, with an observed error of under 10% on average.
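The classification step and the moving-average use of the per-cycle estimates can be sketched as below. This is an illustration only: input vectors are assumed to be packed into integers, the function names are hypothetical, and the per-distance energy equations themselves (which depend on the characterized transition counts) are not modeled.

```python
from collections import defaultdict

def hamming_distance(u, v):
    """Number of bit positions in which two input vectors differ."""
    return bin(u ^ v).count("1")


def group_pairs_by_hamming(vectors):
    """Classify consecutive input-vector pairs by their Hamming distance.

    Each class would be served by its own equation-based macromodel.
    """
    groups = defaultdict(list)
    for prev, curr in zip(vectors, vectors[1:]):
        groups[hamming_distance(prev, curr)].append((prev, curr))
    return dict(groups)


def moving_average(values, window):
    """Average of each run of `window` consecutive energy-per-cycle values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


stream = [0b0000, 0b0011, 0b0111, 0b0111]
classes = group_pairs_by_hamming(stream)
# Hypothetical energy-per-cycle estimates, one per vector pair.
smoothed = moving_average([1.0, 2.0, 3.0, 4.0], window=2)
```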
In this work we propose an exact technique for efficient computation of signal statistics during high-level synthesis for low power of general control-dominated designs. Our approach does not require iterative simulation: simulation is performed once and for all to collect boundary information that will be repeatedly exploited for computing signal statistics for alternative implementations.
The objective of this paper is to present an analytic
technique for power analysis under non-stationary
conditions. We use the transitive closure calculation to
identify the transient component in the behavior of the target
machine and then, based on the fundamental matrix and a
symbolic approach (or support from simulation), we find the
actual power distribution that corresponds to the transient
regime. The present technique complements the current
techniques (either for average or peak power estimation) to
handle the case when transient effects exist and cannot be
ignored.
Keywords:
power consumption, transient regime, Markov chains
The use of dual threshold voltages can significantly reduce the static power dissipated in CMOS VLSI circuits. With the supply voltage at 1V and threshold voltage as low as 0.2V, the subthreshold leakage power of transistors starts dominating the dynamic power. Also, a large number of devices often spend a long time in a standby mode where the leakage power is the only source of power consumption. We present a near-optimal approach to synthesize low static power CMOS VLSI circuits with two threshold voltages that reduces power consumption compared with a previous approach by up to 29.45%. We also present a technique which finds static power optimal configurations for CMOS VLSI circuits when an arbitrary number of threshold voltages is allowed.
Clock networks account for a significant fraction of the power dissipation of a chip and are critical to the performance. This paper presents theory and algorithms for building a low power clock tree. Two low power schemes are used: a reduced swing scheme and one using multiple supply voltages. We analyze the issue of tree construction and present conclusions relevant to various technology generations according to the National Technology Roadmap for Semiconductors (NTRS). Our experimental results show that power can be reduced by an average of 45% for a 0.25 um technology using multiple supply voltages, and by 31% using reduced swing buffers.
We developed a methodology and tools for synthesizing monotonic static CMOS networks, which consist of alternating low-skewed and high-skewed static gates. When used with a dual VT process, monotonic static CMOS can simultaneously reduce standby static power and increase performance by using low VT devices in the evaluation networks and making all other devices high VT. Experimental results show monotonic static CMOS to be 1.67 times faster than traditional static CMOS.
We present a novel input pattern generator for dynamic power network simulation. The obtained patterns successfully identify critical voltage drop areas for a set of industrial designs, which are difficult to find using functional vectors. The search engine of the pattern generator for worst-case IR voltage drop is based on a multi-objective genetic algorithm. To achieve high coverage of critical voltage drop cells, we propose to model the search criteria as the maximum weighted matching of a bipartite graph, and guide the search direction according to the matching results. Experimental results show that, compared with the other approaches, our patterns give a higher coverage of critical voltage drop cells.
From Devices to Systems: Re-Directing the Future of Low Power Design. Moderator: Massoud Pedram
We discuss key barriers to continued scaling of supply voltage and technology
for microprocessors to achieve low-power and high-performance. In particular,
we focus on short-channel effects, device parameter variations, excessive
subthreshold and gate oxide leakage, as the main obstacles dictated by
fundamental device physics. Functionality of special circuits in the
presence of high leakage, SRAM cell stability, bit line delay scaling, and
power consumption in clocks & interconnects, will be the primary design
challenges in the future. Soft error rate control and power delivery
pose additional challenges. All of these problems are further compounded
by the rapidly escalating complexity of microprocessor designs. The
excessive leakage problem is particularly severe for battery-operated,
high-performance microprocessors.
Keywords: Microprocessor, VLSI design, memory, low-power design.
A recent trend in low power design has been the employment of reduced precision processing methods for decreasing arithmetic activity and average power dissipation. Such designs can trade off power and arithmetic precision as system requirements change. This work explores the potential of Distributed Arithmetic (DA) computation structures for low power precision-on-demand computation. We present two proof-of-concept VLSI implementations whose power dissipation changes according to the precision of the computation performed.
Gating the clock is an important technique used in low
power design to disable unused modules of a circuit. Gating
can save power both by preventing unnecessary activity in
the logic modules and by eliminating power dissipation
in the clock distribution network. There is an inherent
pitfall though in implementing gating groups for hierarchical
gated clock distribution because the groups are typically
developed at the logic level with no information of the physical
layout of the clocktree. Depending on the distribution of
underlying sinks, maintaining gating groups can cause a
wiring overhead that is potentially greater than the savings
due to reduced switching. We look at modifications of zero-skew
tree algorithms to consider both the physical and logical
aspects of hierarchical gating. The algorithms are
applied to data taken from a low power ASIC design. The
best gated clocktree is created using both physical and logical
information.
Keywords: clocktree, clockgating, low power, physical
design
While guarded evaluation has proven an effective energy saving
technique in arithmetic circuits, good methodologies do not exist
for determining when and how to guard for maximal savings.
Three new internal guarding techniques are presented in adders
that increase energy savings up to 38% over existing external
guarding techniques. This allows guarded evaluation to be effective
at duty cycles as much as 20% higher than are currently practical.
A modeling methodology is presented defining the energy
and energy delay of a unit in a generic application space. These
models can easily be incorporated into an automated selection
technique to determine the optimal guarded implementation. This
technique is tested on a DSP ASIP, increasing overall energy
savings by preventing unnecessary guarding. The data is generalized
and it is observed that guarding is most beneficial when the
ratio of guarding transistors to driven computational transistors is
1/10 or lower.
Keywords:
Guarded evaluation, low power design, datapath energy modeling.
This work presents the design of an energy efficient
FPGA architecture. Significant reduction
in the energy consumption is achieved by tackling
both circuit design and architecture optimization
issues concurrently. A hybrid
interconnect structure incorporating Nearest
Neighbor Connections, Symmetric Mesh Architecture,
and Hierarchical connectivity is used.
The energy of the interconnect is also reduced
by employing low-swing circuit techniques.
These techniques have been employed to design
and fabricate an FPGA. Preliminary analysis
shows an energy improvement of more than an
order of magnitude when compared to existing
commercial architectures.
Keywords:
FPGA, low power, low swing signalling
The goal of a dynamic power management policy is to reduce the power consumption of an electronic system by putting system components into different states, each representing a certain performance and power consumption level. The policy determines the type and timing of these transitions based on the system history, workload and performance constraints. In this paper, we propose a new abstract model of a power-managed electronic system. We formulate the problem of system-level power management as a controlled optimization problem based on the theories of continuous-time Markov decision processes and stochastic networks. This problem is solved exactly and efficiently using a "policy iteration" approach. Our method is compared with existing heuristic approaches for different workload statistics. Experimental results show that the power management method based on a Markov decision process outperforms heuristic approaches in terms of power dissipation savings for a given level of system performance.
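For readers unfamiliar with policy iteration, the sketch below applies it to a toy two-state power manager. It is a deliberately simplified stand-in, not the formulation of the paper: it uses a discrete-time, discounted Markov decision process rather than a continuous-time one, and the states, transition probabilities, costs and discount factor are all hypothetical values chosen for illustration.

```python
def q_value(s, a, V, P, cost, gamma):
    """Expected discounted cost of taking action a in state s, then following V."""
    return cost[s][a] + gamma * sum(p * V[t] for t, p in enumerate(P[s][a]))


def evaluate_policy(policy, P, cost, gamma, tol=1e-10):
    """Iteratively compute the value of a fixed policy (Gauss-Seidel sweeps)."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            new_v = q_value(s, policy[s], V, P, cost, gamma)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


def policy_iteration(P, cost, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = [0] * len(P)
    while True:
        V = evaluate_policy(policy, P, cost, gamma)
        improved = []
        for s in range(len(P)):
            q = [q_value(s, a, V, P, cost, gamma) for a in range(len(P[s]))]
            improved.append(min(range(len(q)), key=q.__getitem__))
        if improved == policy:
            return policy, V
        policy = improved


# Hypothetical model: state 0 = active, state 1 = sleep;
# action 0 = go/stay active, action 1 = go/stay asleep (deterministic here).
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
# Assumed per-step costs combining power, transition and latency penalties.
cost = [[1.0, 0.5],
        [1.5, 0.3]]
policy, V = policy_iteration(P, cost, gamma=0.9)
```

With these assumed costs the procedure settles on sleeping in both states. A real power manager would derive the transition model and costs from workload statistics and performance constraints.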
The purpose of this paper is to report the power and performance of an application on a real system as the CPU frequency varies. Previous work in CPU speed-setting considered only the power of the CPU and only CPUs that vary supply voltage with frequency. This work takes a broader approach, considering total system power, battery capacity and main memory bandwidth. The results, which are up to a factor of four less than ideal, show that all three must be considered when setting the CPU speed, whether the speed is fixed at a single value or varied dynamically during operation.
We propose a technique for reducing the energy required by firmware code to execute on embedded systems. The method is based on the idea of compressing the most commonly executed instructions so as to reduce the energy dissipated in memory accesses. Instruction decompression is performed on the fly by a hardware module located between processor and memory: no changes to the processor architecture are required. Hence, our technique is well-suited for systems employing IP cores whose internal architecture cannot be modified. We describe a number of decompression schemes and architectures that effectively trade off hardware complexity for memory energy and bandwidth reduction, as proved by experimental data collected by executing several sample programs.
Energy-efficient design of battery-powered embedded systems demands optimizations in both hardware and software. In this work we leverage cycle-accurate energy consumption models to explore compiler and source code optimizations aimed at reducing energy consumption. In addition, we extend cycle-accurate architectural power simulation with battery models that provide battery lifetime estimates. The enhanced simulator and software optimizations are used to study and optimize the power dissipation of SmartBadge, a wearable system based on the ARM microprocessor developed by HP Laboratories. We found that standard compiler optimizations give less than 1% energy savings. Source code optimizations are capable of up to 90% energy savings. In addition, our analysis of battery lifetime for the MPEG decoder implemented on the SmartBadge shows that battery efficiency varies greatly with discharge currents on a cycle-by-cycle basis and can cause up to 16% reduction in battery lifetime.
A new compact physics-based Alpha-Power Law MOSFET Model is introduced to enable projections of low power circuit performance for future generations of technology by linking the simple mathematical expressions of the original Alpha-Power Law Model with their physical origins. The new model, verified by HSPICE simulations and measured data, includes: 1) a subthreshold region of operation for evaluating the on/off current trade-off that becomes a dominant low power design issue as technology scales, 2) the effects of vertical and lateral high field mobility degradation and velocity saturation, and 3) threshold voltage roll-off. Model projections for MOSFET CV/I indicate a 2X-performance opportunity compared to NTRS extrapolations for the 250, 180, and 150nm generations subject to maximum leakage current estimates of the roadmap. NTRS and model calculations converge at the 70nm technology generation, which exhibits pronounced on/off current interdependence for low power gigascale integration (GSI).
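For orientation, the relations below give the commonly quoted form of the original Alpha-Power Law and the CV/I delay metric it feeds. They are the textbook expressions, not the new physics-based model of this paper; the symbols ($\alpha$ for the velocity-saturation index, $S$ for the subthreshold swing, $C$ for the load capacitance) are assumed notation.

```latex
\[
\begin{aligned}
I_{\mathrm{on}} &\propto (V_{DD} - V_T)^{\alpha}, \qquad 1 \le \alpha \le 2,\\
\tau &\approx \frac{C\,V_{DD}}{I_{\mathrm{on}}}
      \;\propto\; \frac{C\,V_{DD}}{(V_{DD} - V_T)^{\alpha}},\\
I_{\mathrm{off}} &\propto 10^{-V_T / S}.
\end{aligned}
\]
```

Lowering $V_T$ shortens $\tau$ but raises $I_{\mathrm{off}}$ exponentially, which is the on/off current trade-off the abstract refers to.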
This paper investigates the basic mechanisms of hysteretic delay and noise margin variations for floating-body Partially-Depleted SOI CMOS domino circuits in detail. Three cases, based on whether the input signals are "domino input signals" from other domino circuits; "static input signals" from static circuits or latches; or a combination of "domino and static input signals" are examined and differentiated. It is shown that hysteretic delay variation is larger and noise margin worse for the latter case with "mixed domino and static input signals." Although the delay and noise margin disparities between the three types of input signals are significant at the beginning of the clock cycles, they converge as the circuit approaches steady-state.
Scaling of supply voltage (Vdd) is essential for controlling active power dissipation in complex digital circuits. Transistor threshold voltage (Vt) variation is one of the key limiters to Vdd scaling. Several adaptive body biasing schemes have been proposed earlier to reduce the impact of die-to-die Vt variation. Unfortunately, body bias degrades short channel effect (SCE) in the MOSFET. As technology is scaled down, this adverse effect of body biasing poses an increasingly serious challenge to controlling SCE and results in worse within-die Vt variation. The scaling trends of body bias values required to reduce die-to-die Vt variations and the resulting increase in within-die Vt variation are presented across three different technology generations.
Autonomous transceivers working in the ISM UHF bands must combine a long
battery lifetime with a small overall volume, which implies cutting the
receive power consumption to less than 1 mW. Ultimately, this goal
will only be reached by using original topologies and lowering the supply
voltage down to single battery cell operation.
An RF front-end and a power amplifier (PA) designed for the 433 MHz European
ISM band are presented. Both RF building blocks have been integrated in a
standard 0.5 um digital CMOS process with 0.65 V threshold voltages. The
front-end includes an LNA and a downconverter mixer. It achieves a total
double sideband (DSB) noise figure of 9 dB, with a dynamic range of 85 dB
for a 60 kHz bandwidth, while dissipating only 250 uW at 1.2 V supply voltage.
The PA includes two fully integrated Class A stages together with an output
Class C amplifier. It achieves a +4 dBm output power with a 15% overall
efficiency under 1.2 V supply voltage.
Keywords: Low Power, Low Voltage, UHF band, Low Noise Amplifier (LNA),
Mixer, Flicker Noise, Power Amplifier (PA).
Motivated by the emerging needs for low power, low cost narrow-band wireless communication systems, the first micropower RFIC front-end has been implemented in standard CMOS technology. The front-end, an LNA combined with a down-conversion mixer, has been designed and fabricated in a HP 0.8 µm CMOS process. This mandates the use of high-Q discrete inductors to provide sufficient gain for the LNA. Employing these design methods, the front-end supply current is less than 110 µA with a 3 V supply voltage for operation at 450 MHz. High-Q inductors have been manufactured using low-temperature co-fired ceramic (LTCC) technology. The front-end's gain is 25 dB with an IIP3 of -15 dBm. This is the lowest current consumption reported to date for a CMOS front-end operating at this frequency.
A differentially controlled monolithic LC-VCO along with a differential charge pump are used to implement a differential PLL for substrate noise immunity. The differential VCO control is achieved with minimal increase in the power consumption and without sacrificing the tuning range. In a 0.5 um CMOS technology the measured VCO phase noise is -119 dBc at 1.0 MHz and the tuning range is 26% of the 1.25 GHz center frequency, at a total current consumption of 4.0 mA from a 3 V supply. The common mode rejection of the VCO control lines is more than 2000 at DC. The new differential charge pump architecture provides common mode correction without the need for a clean reference.
A low-power, high-speed logic style using Passive Precharge and Rippled Power is proposed. Ultra-low threshold voltage (Vt) devices permit high speed operation, while the heavy leakage current pre-charges dynamic nodes. High Vt devices prevent leakage through the logic. The high Vt devices provide power to evaluate a sequence of logic gates and are activated in series for periods of time which are short relative to the clock period. The power effectively ripples through the logic path. These innovations combine to produce low power circuits that maintain very high speeds. A 16 bit by 16 bit multiplier was simulated in HSPICE using this logic style. We achieved a clock rate of 1 GHz with a latency of 1.3 ns. At that clock frequency the power dissipation is 10.9 mW.
We demonstrate that there is an optimum reverse body bias, unique to any technology generation, that minimizes the standby leakage power consumption of an IC design implemented in that technology. We also show: (1) the optimum reverse body bias value reduces by ~2X per technology generation, and (2) the maximum achievable leakage power reduction by reverse body biasing diminishes by ~4X per generation under a constant-field technology scaling scenario. The optimum point occurs as a result of reduction in subthreshold leakage with applied reverse bias. Therefore, new junction engineering techniques to reduce the bulk band-to-band tunneling leakage current component across the junction are needed to preserve the effectiveness of reverse body biasing for standby leakage control in future technologies.
As we approach Gigascale Integration, chip power consumption is becoming a critical system parameter. Clock-gating idle units provides needed reductions in power consumption. However, it introduces inductive noise that can limit voltage scaling. This paper introduces an architectural approach for reducing inductive noise due to clock-gating through gradual activation/deactivation of units. This technique provides a 2x reduction in ground bounce on a 16 bit ALU simulated in SPICE, while reducing simulated SPEC95 performance by less than 5% on a typical superscalar architecture.
Dynamic logic circuits [2] are used in high-performance circuits due to their speed and area advantage over static CMOS circuits. One well-known dynamic logic family is the domino CMOS family, which, however, suffers from its inability to perform inversions. Various methods have been proposed to overcome this restriction. One such method is the dual-output domino logic family. In the standard dual-output domino logic gate shown in Figure 1, each dual-output gate consists of two standard domino logic gates, producing the output, R, and its complement, R̄. The advantage of the dual-output.
This paper presents low-power asynchronous barrel shifters for variable length encoders and decoders useful in portable applications using multimedia standards. Our approach is to create multi-level asynchronous barrel shifters optimized for the skewed shift control statistics often found in these codecs. For common shifts, data passes through one level, whereas for rare shifts, data passes through multiple levels. We compare our optimized designs with the straightforward asynchronous and synchronous designs. Both pre- and post-layout HSPICE simulation results indicate that, compared to their synchronous counterparts, our designs provide over a 40% savings in average energy consumption for a given average performance.
Various high-speed techniques have been developed
for multipliers, but with the increasing
popularity of mobile computing, a recent goal
has been to minimize power dissipation. A popular
delay-reduction technique applied to
adder circuits is polarity inversion of bits. As
this optimization reduces transistor count, it
also has the potential for lowering power dissipation,
and can be effectively applied to Wallace
tree partial product reduction stages. We
illustrate how this technique reduces power,
interconnect capacitance, and chip area. Power
reduction of up to 25% is achieved.
Keywords: Multiplier, low power, inverse polarity.
A fair amount of work has been done in recent years on reducing power consumption
in caches by using a small instruction buffer placed between the execution
pipe and a larger main cache [1,2,6]. These techniques, however, often degrade
the overall system performance. In this paper, we propose using a small
instruction buffer, also called a loop cache, to save power. A loop cache has
no address tag store. It consists of a direct-mapped data array and a loop
cache controller. The loop cache controller knows precisely whether the next
instruction request will hit in the loop cache, well ahead of time. As a
result, there is no performance degradation.
Keywords: Low cost, low power, embedded systems, small program loops,
instruction buffering.
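A tagless loop cache of this kind can be sketched as a small state machine driven by the fetch address stream. The trigger used here (a taken short backward branch starts a fill pass, and the same branch taken again activates the buffer) is an assumption for illustration; the abstract does not say how the controller decides. Addresses are word indices and loop bodies are assumed to be straight-line code.

```python
def simulate_loop_cache(pcs, capacity=16):
    """Count fetches served from a tagless loop cache for a PC trace.

    Hypothetical controller: IDLE -> FILL on a taken backward branch that
    spans fewer than `capacity` words, FILL -> ACTIVE when that same branch
    is taken again, back to IDLE when the fetch stream leaves the loop.
    The controller tracks the loop bounds itself, so no tag compare is needed.
    """
    state, start, end = "IDLE", 0, 0
    hits = 0
    prev = None
    for pc in pcs:
        taken_back = prev is not None and 0 <= prev - pc < capacity
        same_loop = taken_back and (pc, prev) == (start, end)
        sequential = prev is not None and pc == prev + 1 and pc <= end
        if state == "ACTIVE" and (same_loop or sequential):
            hits += 1                      # served from the loop cache
        elif state == "FILL" and same_loop:
            state = "ACTIVE"               # buffer now holds the whole loop
            hits += 1
        elif state == "FILL" and sequential:
            pass                           # still copying from the main cache
        elif taken_back:
            state, start, end = "FILL", pc, prev
        else:
            state = "IDLE"
        prev = pc
    return hits
```

For a four-instruction loop executed ten times, the first iteration runs from the main cache, the second fills the buffer, and the remaining eight iterations (32 fetches) are served from the loop cache in this model.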
A methodology for power efficient partitioning of real-time data-dominated system specifications is presented. The proposed methodology aims at reducing the memory requirements in realizations of such applications by applying extensive code transformations in the initial system specification before partitioning over processors. This reorganization basically aligns the data production and consumption between the different procedures of the initial specification thus reducing the memory size requirements (and the resulting power) of the system's realizations especially those in the interfaces between different processors. The main novel contribution is that performance issues are explicitly taken into account during power oriented system-level transformations. The proposed methodology can be applied both in a parallel (programmable) processor context and also in heterogeneous hardware-software architectures.
This paper proposes a new approach using way prediction for achieving high performance and low energy consumption of set-associative caches. By accessing only the single predicted cache way, instead of accessing all the ways in a set, the energy consumption can be reduced. This paper shows that the way-predicting set-associative cache improves the ED (energy-delay) product by 60-70% compared to a conventional set-associative cache.
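The energy and delay effect of way prediction can be illustrated with a toy simulation. The prediction policy (most recently used way), the one extra cycle on a misprediction, and the use of way probes as an energy proxy are assumptions of this sketch, not details taken from the paper; cache miss penalties are not modeled.

```python
def way_predicting_cache(addresses, num_sets=4, ways=4):
    """Return (way_probes, cycles) for a way-predicting set-associative cache.

    A conventional cache in the same toy model would use
    len(addresses) * ways probes and len(addresses) cycles.
    """
    tags = [[None] * ways for _ in range(num_sets)]
    order = [list(range(ways)) for _ in range(num_sets)]  # front = LRU, back = MRU
    probes = cycles = 0
    for addr in addresses:
        s, tag = addr % num_sets, addr // num_sets
        predicted = order[s][-1]              # assumed policy: predict the MRU way
        if tags[s][predicted] == tag:
            probes += 1                       # only the predicted way is read
            cycles += 1
            way = predicted
        else:
            probes += ways                    # predicted way, then the remaining ways
            cycles += 2                       # assumed one-cycle misprediction penalty
            if tag in tags[s]:
                way = tags[s].index(tag)      # hit in a non-predicted way
            else:
                way = order[s][0]             # miss: refill the LRU way
                tags[s][way] = tag
        order[s].remove(way)
        order[s].append(way)                  # mark the way as most recently used
    return probes, cycles
```

Multiplying the two returned counts gives a crude energy-delay figure that can be compared against the conventional cache; how large the gain is depends entirely on how often the prediction is correct.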
A distributed hypermedia system that supports collaboration is an emerging platform for creation, discovery, management and delivery of information. We present an approach to low power system design space exploration for distributed hypermedia applications. Traditionally, low power design and synthesis of application specific programmable processors has been done in the context of a given number of operations required to complete a task. Our approach utilizes the modern advances in compiler technology and architectural enhancements that are well matched to the compiler technology. This work is, to the best of our knowledge, the first attempt to address the need for synthesis of low power hypermedia processors. Also, this is the first work to address power efficiency through exploiting the instruction level parallelism (ILP) found in hypermedia tasks by a production-quality ILP compiler. Using the developed framework we conduct an extensive exploration of the low power system design space for a hypermedia application under area and throughput constraints. The framework introduced in this paper is very valuable in making early low power design decisions such as architectural configuration trade-offs, including the cache and issue width trade-off under area and throughput constraints, and the number of branch units and issue width.
In this paper, we present CubicPower, which is a dynamic power estimator based on Verilog/VHDL simulators. We propose the power characterization model and the probabilistic contribution measure (PCM) algorithm to calculate the actual power consumption of cell instances with given switching information. In addition to PCM, the state dependency and non-switching activity of gates are taken into account for more accurate power estimation. Experimental results of CubicPower show less than 10% error compared with the results of PowerMill simulation and the measured values of the IMS test equipment. Due to the PCM algorithm, CubicPower is more accurate than the leading commercial dynamic power estimator at the gate level and is 2-3 orders of magnitude faster than PowerMill.
This paper reviews specific circuit styles and strategies employed in the design of CMOS VLSI on partially-depleted (PD) SOI. These strategies address issues and problems that arise on PD SOI circuits (mainly due to the floating-body effect) such as delay hysteresis, noise margin reduction, etc. These circuit approaches also try to utilize SOI-specific properties to achieve a larger performance gain than that of a simple re-map of a bulk design to SOI. Although many aspects of CMOS design pertaining to SOI will be covered, the emphasis will be on dynamic and static circuits and high-performance SRAMs.
This tutorial presents a cohesive view of power-conscious system-level design. We consider systems as consisting of a hardware platform executing software programs. We address the problems of power estimation and minimization for such systems. We consider the major constituents of systems: processors, memories and communication resources. We analyze power dissipation in these components and we survey computer-aided power reduction techniques. We also consider global system-level control schemes, such as dynamic power management. We conclude by pointing out further research problems which are still open in this domain.