# A Reliable Low-Power Fast Skew-Compensation Circuit

Yi-Ming Wang

Dept. of Electrical Engineering, Chung-Cheng University, Chia-Yi, 621, Taiwan

Tel: +886-5-2720411 ext. 23282 Fax: +886-5-2720862 e-mail: elc@vlsi.ee.ccu.edu.tw

Abstract - A reliable low-power fast skew-compensation circuit is proposed. Operating on a clock with a 50% duty cycle, the new design is more reliable than conventional SMD-based circuits [1]-[3], which can operate only on a pulsed clock. The new circuit also achieves phase locking within two clock cycles. The test circuit works successfully between 600-MHz $\sim$ 800-MHz with a power consumption of 25-$\mu$W/MHz $\sim$ 36-$\mu$W/MHz. When measured at 616.9-MHz and 791.6-MHz, the static phase error is 76.8-ps and 124.5-ps, respectively.

## I. Introduction

The de-skewing clock buffer is an important building block for high-performance *SOC* chips. For applications that need fast locking, mirror-type skew-compensation circuits, which perform open-loop operation, are commonly adopted. These include the Synchronous Mirror Delay (*SMD*) [1], the Interleaved SMD (*ISMD*) [2], and the Direct-skew-detect SMD (*DSMD*) [3].

However, SMD-based circuits need to operate on a pulsed clock, and four problems are associated with the pulsed clock. First, generating a narrow-pulse clock is particularly difficult if the cell delay is made very small to obtain good phase resolution. Second, the pulse width may be shrunk by the logic operation. Third, even if the pulsed clock is successfully propagated to the output of the circuit, the narrow pulse may easily be filtered out by the following RC interconnections. Last, the pulse width is sensitive to process-voltage-temperature-loading (*PVTL*) variations. It is therefore hard to obtain a robust design that operates on a pulsed clock.

In this work, we propose a new reliable low-power fast skew-compensation circuit. The new design rests on three concepts. The first is a new measure-and-compensate architecture. With this architecture, the proposed circuit not only achieves fast locking but also removes the need for a pulsed clock, together with all the related problems. Moreover, the new architecture is very power efficient because far fewer devices are active in the phase-locked state. The second is a frequency-independent phase adjustment method, which shortens the delay line to half the length needed in conventional circuits. The shortened delay line also reduces the maximal power consumption. Hence, the new circuit is named the half-delay-line skew-compensation circuit (HDSC). Last, the new design keeps all the coarse delay cells as small as possible to save power, and uses a fine delay cell to reduce the static phase error.

The rest of this paper is organized as follows. Section II describes the architecture design and operating principles. Experimental results are presented in Section III. Conclusions are given in the last section.

## II. Architecture Design

The block diagram of the circuit, which contains an *HDSC* and a clock driver, is shown in Fig. 1. After the circuit is enabled, the *HDSC* enters the (skew) measurement phase and then the (clock) synchronization phase to achieve fast locking. The detailed operations are described below with the aid of Fig. 2(a).

#### A. Measurement Phase

The signals TDC\_start, TDC\_stop, and $\overline{Init}/Sel$ are initially set low, and the external clock signal passes through path 1. When the clock signal reaches the output of the clock driver, the first positive edge of CK\_int pulls the signal TDC\_start high. After TDC\_start goes high, the next positive edge of CK\_ext pulls another signal, TDC\_stop, high. The paths for generating TDC\_start and TDC\_stop are designed to be balanced. Therefore, the interval between the positive edges of TDC\_start and TDC\_stop is just the difference between the clock cycle time and the clock skew, i.e. $T_{ck} - tsk = T_{ck} - (td1 + td2 + td4 + 2 \cdot td3)$, where td3 and td4 are the delay times of a multiplexer and the phase adjuster (PA), respectively. For the moment we assume that td4 is zero; the details of the PA are described later. This time interval is exactly the delay that the delay line should provide.

Jinn-Shyan Wang

Dept. of Electrical Engineering, Chung-Cheng University, Chia-Yi, 621, Taiwan

Tel: +886-5-2720411 ext. 33202 Fax: +886-5-2720862 e-mail: ieegsw@ccu.edu.tw

Fig. 1. The block diagram containing an *HDSC* and a clock driver.

Fig. 2. Operations of *HDSC* (a) without and (b) with phase adjuster.

Before the *HDSC* is enabled, the selection signals $(S_0, S_1, S_2, ..., S_{n-1})$ are preset to (1, 0, 0, ..., 0). So, when TDC\_start goes high, this wave propagates along path 2 of Fig. 1 and enters the delay line at the first delay cell. A delayed version of TDC\_start is used as the selection signal $\overline{Init}/Sel$. If CK\_ext is high when $\overline{Init}/Sel$ goes high, the output of MUX2 is kept high. If CK\_ext is low when $\overline{Init}/Sel$ goes high, the output of MUX2 is pulled low. One possible pattern of the delay-cell outputs $D_0 \sim D_{n-1}$ is drawn in Fig. 3. $D_0 \sim D_{n-1}$ are sent to the input terminals of the Time-to-Digital Converter (TDC), which is composed of a set of positive-edge-triggered flip-flops (PETDFFs) triggered by TDC\_stop. When the TDC captures the outputs of the delay cells, the wave front of TDC\_start has reached the output of the m-th delay cell. This means $m \cdot tdc = T_{ck} - tsk$, which is the amount of delay to be provided by the delay line. In the example of Fig. 3, we assume m=3 and $(D_0, D_1, D_2, D_3, ..., D_{n-1})=(0, 1, 1, 0, ..., 0)$, where $D_0$ has been pulled down by the wave front of CK\_ext.
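The measurement phase can be illustrated with a small behavioural sketch. It is an idealized model, not the circuit itself: the function names and the delay values for `t_ck`, `td1`, `td2`, and `td3` are hypothetical, and only the 220-ps cell delay and the seven cells are taken from Table 1. The model also assumes that a cell output reads as 1 as soon as the TDC\_start wave front has reached it.

```python
def measurement_interval_ps(t_ck, td1, td2, td3, td4=0):
    """Interval between the rising edges of TDC_start and TDC_stop.

    Follows T_ck - tsk with tsk = td1 + td2 + td4 + 2*td3 (all in ps).
    """
    t_sk = td1 + td2 + td4 + 2 * td3
    return t_ck - t_sk


def tdc_capture(interval_ps, cell_delay_ps, n_cells):
    """Idealized delay-cell outputs sampled at the rising edge of TDC_stop.

    Cell i reads as 1 when the TDC_start wave front has already reached
    its output, i.e. when (i + 1) * cell_delay_ps <= interval_ps.
    """
    m = min(interval_ps // cell_delay_ps, n_cells)
    return [1 if i < m else 0 for i in range(n_cells)]


# Hypothetical delays in ps; only the cell delay and cell count are from Table 1.
interval = measurement_interval_ps(t_ck=1500, td1=200, td2=300, td3=100)
outputs = tdc_capture(interval, cell_delay_ps=220, n_cells=7)
print(interval, outputs)  # 800 [1, 1, 1, 0, 0, 0, 0]
```

With these numbers the wave front has passed three cells (m = 3). The sketch does not model the leading cells being pulled low again by CK\_ext, as happens to $D_0$ in Fig. 3; that effect does not move the position of the rightmost '1'.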



| CKT bit | 0 | 1 | 2 | 3 | 4 | ••• | n-3 | n-2 | n-1 |
|---|---|---|---|---|---|---|---|---|---|
| TDC outputs (at the rising edge of TDC\_stop) | 0 | 1 | 1 | 0 | 0 | ••• | 0 | 0 | 0 |
| Priority encoder outputs | 0 | 0 | 1 | 0 | 0 | ••• | 0 | 0 | 0 |
| Bit-reverser outputs | 0 | 0 | 0 | 0 | 0 | ••• | 1 | 0 | 0 |

Fig. 3. Operation of the delay control unit.

#### B. Synchronization Phase

The purpose of the synchronization phase is to determine where the active delay cells should be placed in the steady state, and to align the rising edges of the external and internal clock signals by the end of this phase. One possible placement is to let the clock signal enter the delay line at the first delay cell and tap it out at the output of the m-th delay cell. However, such a design is not power-efficient because the clock signal still propagates through all the other delay cells. The *HDSC* adopts a power-efficient design in which the clock signal passes only through the required number of delay cells nearest the output of the delay line. In this manner, the preceding cells are kept quiet to save power. Two more steps are needed to convert the selection signals; the conversion procedure is explained below using the example in Fig. 3.

1. A priority encoder is used to find the position of the rightmost '1' in the outputs of the *TDC*. The outputs of the priority encoder are shown in the table attached to Fig. 3.
2. Then, a bit-reverser reverses the order of the outputs of the priority encoder to obtain the desired selection signals, as shown in the last row of the table attached to Fig. 3.
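The two conversion steps can be sketched in a few lines. This is a behavioural illustration only; the function names are hypothetical, and a seven-cell line is assumed so that the vectors match the pattern of Fig. 3.

```python
def priority_encode(tdc_outputs):
    """One-hot vector marking the rightmost (highest-index) '1'."""
    one_hot = [0] * len(tdc_outputs)
    for i in range(len(tdc_outputs) - 1, -1, -1):
        if tdc_outputs[i]:
            one_hot[i] = 1
            break
    return one_hot


def bit_reverse(bits):
    """Reverse the bit order to obtain the selection signals."""
    return bits[::-1]


def active_cells(select):
    """Cells between the selected entry point and the end of the line."""
    return len(select) - select.index(1) if 1 in select else 0


tdc = [0, 1, 1, 0, 0, 0, 0]        # TDC outputs as in Fig. 3, with n = 7
encoded = priority_encode(tdc)     # [0, 0, 1, 0, 0, 0, 0]
select = bit_reverse(encoded)      # [0, 0, 0, 0, 1, 0, 0]
print(encoded, select, active_cells(select))
```

The clock then enters the delay line at cell n-3 and passes through the last three cells only, which equals m = 3 in the example, while the four preceding cells stay quiet.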

As described before, the TDC captures the outputs of the delay cells at the second positive edge of CK\_ext. During the second cycle of CK\_ext, the wave front of CK\_ext travels along path 3 of Fig. 1. The delay time of the TDC is designed to be smaller than (td3+td1), so the selection signals are settled before the wave front enters the delay line. The path from CK\_ext to CK\_int therefore has a total propagation delay of $T_{ck}$, and the clock skew is eliminated at the end of the second clock cycle.

To shorten the delay line and to reduce the static phase error, a phase adjuster (PA) and a fine delay cell (fdc) are added to the HDSC, as shown in Fig. 1. The fdc provides a small delay offset, set to about half of the coarse cell delay, which reduces the maximal static phase error to half of the coarse cell delay. In practice, the maximal static phase error in the HDSC is estimated to be slightly more than half of the coarse cell delay, owing to the loading imbalance between the input buffer and the dummy input buffer.
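The effect of the half-cell offset can be checked numerically. The sketch below assumes that the coarse delay line quantizes the required delay by truncation (the TDC only counts cells the wave front has fully passed) and that the fdc simply adds a fixed offset. The function name and the sign convention of the error are hypothetical, and the 220-ps cell delay is taken from Table 1.

```python
def static_phase_error_ps(required_ps, cell_delay_ps, fine_offset_ps=0):
    """Required delay minus the delay actually provided by the line."""
    m = required_ps // cell_delay_ps           # cells captured by the TDC
    provided = m * cell_delay_ps + fine_offset_ps
    return required_ps - provided


CELL = 220                       # coarse cell delay in ps (Table 1)
sweep = range(CELL, 7 * CELL)    # required delays in 1-ps steps

worst_without = max(abs(static_phase_error_ps(d, CELL)) for d in sweep)
worst_with = max(abs(static_phase_error_ps(d, CELL, CELL // 2)) for d in sweep)
print(worst_without, worst_with)  # 219 110
```

Under these assumptions the worst-case error drops from almost one cell delay to half a cell delay, i.e. 110 ps for a 220-ps cell.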

To see how the PA works, let us temporarily ignore the fdc placed before the PA. As illustrated in Fig. 2(b), an additional control signal $\overline{0^\circ}/180^\circ$ indicates whether the clock skew is longer than half of the cycle time. When the clock skew is longer than half of the cycle time ($\overline{0^\circ}/180^\circ$ = '0'), the PA generates an in-phase signal and the clock path is the same as in the design without the PA. Otherwise, the PA generates an out-of-phase signal, and TDC\_stop is pulled high by the following negative edge of CK\_ext rather than by the positive edge. Hence, the interval between the positive edges of TDC\_start and TDC\_stop is now shorter by half of the clock cycle than in the design without the PA. With the PA, the delay line length and the maximal power consumption are therefore reduced at the same time. Finally, a dummy PA (DPA) is added in the clock's measurement path to compensate for the intrinsic delay caused by the PA in the synchronization path.
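The reason the PA halves the required delay line can be seen from a minimal model of the decision. The sketch ignores the fdc and the intrinsic PA delay, uses integer picoseconds, and assumes a hypothetical cycle time of 1600 ps; the function name is likewise an assumption for illustration.

```python
def delay_line_setting(t_ck_ps, t_sk_ps):
    """Return (PA mode, delay the delay line must provide in ps)."""
    if 2 * t_sk_ps > t_ck_ps:
        # Skew longer than half a cycle: same as the design without PA.
        return "in-phase", t_ck_ps - t_sk_ps
    # Otherwise TDC_stop is raised by the negative edge of CK_ext,
    # so the measured interval is half a cycle shorter.
    return "out-of-phase", t_ck_ps - t_sk_ps - t_ck_ps // 2


T_CK = 1600  # hypothetical cycle time in ps (even, so T_CK // 2 is exact)
print(delay_line_setting(T_CK, 1000))  # ('in-phase', 600)
print(delay_line_setting(T_CK, 300))   # ('out-of-phase', 500)

skews = range(1, T_CK)
longest_with_pa = max(delay_line_setting(T_CK, s)[1] for s in skews)
longest_without_pa = max(T_CK - s for s in skews)
print(longest_with_pa, longest_without_pa)  # 799 1599
```

In this model the delay line never has to provide more than half a cycle, whereas without the PA it would have to cover almost a full cycle.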

In summary, the *HDSC* operates on a normal clock signal and so avoids all the problems of the pulsed clock. The *HDSC* is power efficient in that the clock signal in the steady state passes through the input buffer and the clock driver only once, and the circuits that are not on the steady-state clock path are kept quiet.

## III. Experimental Results

An *HDSC* test chip has been implemented to verify the feasibility of the new design techniques. The experiment focuses on verifying the *HDSC* at high frequencies. The cell delay was made somewhat larger to shorten the delay line, and the frequency range was also narrowed so that the delay line, and with it the power consumption, could be reduced further. Chip features are summarized in Table 1. The chip microphotograph and the measurement setup are shown in Fig. 4(a) and 4(b), respectively. The probed waveforms at clock frequencies of 616.9-MHz and 791.6-MHz are shown in Fig. 4(c) and Fig. 4(d), respectively.



Fig. 4. (a) The chip photograph, (b) the measurement setup, and probed waveforms at (c) 616.9 MHz and (d) 791.6 MHz.

The estimated maximal static phase error is a little larger than 110-ps, while the measured static phase errors at 616.9-MHz and 791.6-MHz are 76.8-ps and 124.5-ps, respectively. Both the number of active delay cells and the power consumption depend on the amount of clock skew, and thus on the loading of the clock signal imposed by the probing conditions. At a clock frequency of 791.6-MHz, the power consumption is between 19.86-mW and 28.55-mW. This means that the *HDSC* consumes only 25-$\mu$W/MHz $\sim$ 36-$\mu$W/MHz.
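The per-megahertz figures follow from dividing the two power values quoted above by the clock frequency:

$$
\frac{19.86\ \text{mW}}{791.6\ \text{MHz}} \approx 25.1\ \mu\text{W/MHz}, \qquad \frac{28.55\ \text{mW}}{791.6\ \text{MHz}} \approx 36.1\ \mu\text{W/MHz}
$$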

TABLE 1 SUMMARY OF CHIP FEATURES

| Feature | Value |
|---|---|
| Process | 0.35-μm 1P4M CMOS |
| Core size | 300-μm × 620-μm |
| Supply voltage | 3.0 V |
| Number of delay cells | 7 |
| Cell delay | 220 ps |
| Simulated frequency range | 600-MHz ~ 800-MHz |
| Locking cycles @ $f=791.6$ MHz and $tsk = 5.13$ ns | 16 |
| Estimated maximal phase error | > 110 ps |
| Measured static phase error, f = 616.9 MHz | 76.8 ps |
| Simulated power consumption, f = 616.9 MHz, number of active delay cells = 1 | 15.29-mW |
| Simulated power consumption, f = 616.9 MHz, number of active delay cells = 7 | 21.94-mW |
| Measured static phase error, f = 791.6 MHz | 124.5 ps |
| Simulated power consumption, f = 791.6 MHz, number of active delay cells = 1 | 19.86-mW |
| Simulated power consumption, f = 791.6 MHz, number of active delay cells = 7 | 28.55-mW |

#### References

- [1] Takanori Saeki, et al., "A 2.5-ns clock access, 250-MHz, 256-Mb SDRAM with synchronous mirror delay," *IEEE J. Solid-State Circuits*, vol. 31, no. 11, pp. 1656-1668, Nov. 1996.
- [2] Kihyuk Sung, et al., "Low power clock generator based on an area-reduced interleaved synchronous mirror delay scheme," *2002 IEEE International Symposium on Circuits and Systems*, pp. 671-674, 2002.
- [3] Takanori Saeki, et al., "A direct-skew-detect synchronous mirror delay for application-specific integrated circuits," *IEEE J. Solid-State Circuits*, vol. 34, no. 3, pp. 372-379, Mar. 1999.