Multi-Frequency Wrapper Design and Optimization for Embedded Cores Under Average Power Constraints

Qiang Xu
ECE Department
McMaster University
ON L8S 4K1, Canada
xuq3@mcmaster.ca

Nicola Nicolici
ECE Department
McMaster University
ON L8S 4K1, Canada
nicola@ece.mcmaster.ca

Krishnendu Chakrabarty
ECE Department
Duke University
Durham, NC 27708, USA
krish@ee.duke.edu

ABSTRACT

This paper presents a new method for designing test wrappers for embedded cores with multiple clock domains. By exploiting the use of multiple shift frequencies, the proposed method improves upon a recent wrapper design method that requires a common shift frequency for the scan elements in the different clock domains. We present an integer linear programming (ILP) model that can be used to minimize the testing time for small problem instances. We also present an efficient heuristic method that is applicable to large problem instances, and which yields the same (optimal) testing time as ILP for small problem instances. Compared to recent work on wrapper design using a single shift frequency, we obtain lower testing times and the reduction in testing time is especially significant under power constraints.

Categories and Subject Descriptors
B.7.2 B.7.3 [Integrated Circuits]: [Design Aids, Reliability and Testing]

General Terms
Algorithms, Performance, Design, Reliability

Keywords
Wrapper Design, Multiple Clock Domains, Scan Control Unit

1. INTRODUCTION

Modern systems-on-a-chip (SOCs) use embedded cores that operate internally with multiple clock domains. For example, for a digital video processing SOC reported in [19], the number of clock domains for each core ranges from 2 to 12. In addition, some cores may operate internally at very high rates, typically employing phase-locked loops (PLLs) to generate on-chip clocks from far slower external reference signals. For these high-performance cores with an increasing number of clock domains, there are two major test challenges: (i) traditional techniques (e.g., $I_{DDQ}$ or functional testing) used for detecting timing-related defects are less effective [7, 13]; (ii) clock skew during test might corrupt test data and render the test useless [17]. Therefore, to ensure high-quality defect screening, it is essential that core tests can be conducted at rated speed without clock skew problems.

Many solutions for scan-based at-speed testing have been introduced [7, 13, 18]. In addition, several techniques [14, 17] have been proposed to test designs with multiple clock domains. However, regardless of their effectiveness, these endeavors mainly consider testing at the chip level and they need to be adapted for testing core-based SOCs, which employ a “divide and conquer” test strategy at core level. On the other hand, most of the existing test wrapper architectures and wrapper design algorithms [2, 5, 6, 9, 11] are only applicable to single-frequency embedded core test. Cumbersome and invasive design techniques such as the insertion of anti-skew latches are needed to make these techniques applicable to current-generation embedded cores. The forthcoming P1500 standard does not provide any direct or non-invasive support for the modular testing of cores with multiple clock domains.

To the best of our knowledge, [21] provides the only strategy in the literature for at-speed testing of cores with multiple clock domains. This solution describes a P1500-compliant wrapper [10], which effectively solves the clock skew problem. However, its limitation lies in the fact that different clock domains share the same shift clock. Consequently, because this single shift frequency directly impacts the tradeoff between the average power consumption and scan time, excessive test application time (TAT) may result under tight average power constraints. Elevated average power can cause structural damage to the silicon, bonding wires, or the package. It also adds to the thermal load that must be transported away from the core under test. In addition, if all the flip-flops update their state on the same clock edge during shift, the simultaneous switching noise can cause a large voltage drop that may lead to erroneous data transfer, thus invalidating the testing process [12].

The above problems can be addressed by allowing distinct shift frequencies for scan chains in different clock domains, which can be in the range of tens to hundreds of MHz [16, 20] based on the scan chain design and test power requirements. For example, if each clock domain can operate at a distinct shift frequency, lower TAT may be achieved under tight power constraints. Furthermore, by introducing a phase offset between the shift clocks used for different clock domains, the number of flip-flops that latch values at the same time is limited to the number of flip-flops per clock domain, thus avoiding an excessive voltage drop on power/ground lines. Therefore, in this paper, we propose a power-constrained wrapper for cores with multiple clock domains by extending the design procedure from [21] so that different clock domains can use distinct shift clock signals. Note that these distinct shift clock signals are generated inside the proposed core wrapper. Therefore, unlike in [16], we do not require the tester to shift data at multiple frequencies. Many low- and medium-end testers are not equipped with advanced port scalability features, which allow groups of channels to be driven at different data rates. Thus, the proposed core wrapper allows testing at multiple scan data rates even with less expensive testers. Furthermore, the proposed solution provides an added degree of freedom for trading off power dissipation against test time, without changing the width of tester lines connected to the core. This will, in turn, enable better design space exploration of system-level test schedules when both power constraints and the depth of the tester buffers need to be accounted for.

The rest of this paper is organized as follows. Section 2 introduces a novel scan control unit for cores with multiple clocks. Power-constrained wrapper optimization by exploiting multiple shift frequencies is described in Section 3. Experimental results and conclusion are described in Sections 4 and 5 respectively.

2. DESIGN OF SCAN CONTROL UNIT

In this section, we first briefly review the multi-frequency wrapper described in [21], which forms the basis of the proposed solution. Next we discuss the new features of the scan control unit that are required for multiple shift clocks.

In [21], to avoid clock skew during the shift phase in scan testing, logic blocks belonging to different clock domains are grouped as different virtual cores (VCs); see Figure 1. For each VC, a single-frequency virtual wrapper, containing the wrapper scan chains (WSCs) for the respective group, is assigned. At-speed capture for transition-hazard clock domains is controlled by the scan control block, which avoids clock skew during the capture phase. In addition, advanced ATPG techniques that are able to handle multi-frequency designs, as described in [3, 8], are used for test pattern generation. The virtual wrapper is connected to the core interface through internal virtual test bus (VTB) lines. To trade off the TAT against test power, the number of internal VTB lines ($W_{vtb}$) is not necessarily the same as the external test access mechanism (TAM) width assigned to the core ($W_{ext}$). Instead, the bandwidth¹ matching technique [4] is utilized to map the external TAM wires to the internal VTB lines. That is, by introducing frequency converters VTB-DIU (VTB-MIU) on the input (output) of the core under test, the internal VTB lines are able to operate at a lower frequency $f_s$ that satisfies $W_{ext} \times f_t \geq W_{vtb} \times f_s$, where $f_t$ is the tester frequency. It is important to note that the at-speed test is controlled by an on-chip high-speed clock (e.g., from a PLL) instead of the tester and, consequently, the proposed technique is particularly relevant when used in conjunction with low-speed testers. To save hardware overhead, both $f_t$ and $f_s$ are determined by dividing $f_{TCK}$ (frequency of TCK, driven by the highest-speed functional clock) by powers of 2. Note that the above is just a brief description of the previously proposed multi-frequency wrapper architecture; further details can be found in [21].

¹ Bandwidth is defined as the product of the width and the frequency of a scan architecture.
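The bandwidth-matching rule above can be illustrated with a short sketch. It assumes the condition that the internal bandwidth (number of VTB lines times internal shift frequency) must not exceed the external bandwidth (TAM width times tester frequency), and that the internal frequency is obtained by dividing the TCK frequency by a power of 2. The function name `max_shift_frequency`, the `max_div_exp` limit, and the numeric values are hypothetical and not taken from [21].

```python
def max_shift_frequency(w_ext, f_tester, w_vtb, f_tck, max_div_exp=8):
    """Return the fastest internal shift frequency of the form f_tck / 2**d
    whose internal bandwidth (w_vtb * f) fits in the external bandwidth
    (w_ext * f_tester), or None if no divider up to 2**max_div_exp works."""
    external_bw = w_ext * f_tester
    for d in range(max_div_exp + 1):
        f_shift = f_tck / 2 ** d
        if w_vtb * f_shift <= external_bw:
            return f_shift
    return None


# Hypothetical setup: 4 TAM wires at 100 MHz, 16 VTB lines, TCK at 200 MHz.
f_internal = max_shift_frequency(4, 100e6, 16, 200e6)
print(f_internal / 1e6, "MHz")
```

Widening the internal bus therefore lowers the shift frequency (and, with it, the shift power) while the tester-side width stays fixed.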

The scan control block is a major part of the wrapper, which provides the scan enable and shift/capture clock signals to all the VCs. Figure 2(b) depicts the block diagram of the proposed scan control design. When compared to [21] (Figure 2(a)) in which the different VCs share the same shift clock, the proposed clock division unit outputs multiple shift clock signals for different VCs. From the timing diagram shown in Figure 3, we can observe that virtual core 3 is shifted at 50MHz while virtual cores 1 and 2 are shifted at 100MHz. We can also observe that the clock phases of Gated_clock[1] and Gated_clock[2] are different. As a result, the capability to generate distinct shift clocks for different VCs not only expands the solution space during wrapper optimization (as detailed in the following section), but it also decreases the peak power consumption during the scan shift phase. Another novel feature of the proposed scan control unit is that the scan phase is controlled solely on-chip, i.e., it does not need the scan enable signal provided from the automatic test equipment (ATE), as is the case in the previous approach [21]. Since current-generation SOCs may contain tens or even hundreds of cores, if the scan enable signals for all the cores are provided from the ATE, the number of pins available for test data will be reduced, thus increasing TAT [1]. Since the start of the test can be determined by decoding the wrapper instruction and because the length of each test pattern is known, all the scan enable signals can be generated internally. As shown in Figure 2(b), the TestStart signal obtained by detecting the change of the wrapper mode to INTEST is used to control the capture finite state machine (FSM) and the mux control unit, which is mainly composed of counters that control the change between the shift and capture phases.
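The effect of phase-shifted shift clocks on simultaneous switching can be sketched with a small model. This is only an illustration under stated assumptions: each virtual core's gated clock is modeled as one active edge every `divider` TCK cycles at offset `phase`, and the flip-flop counts, dividers, and the function name `peak_simultaneous_ffs` are hypothetical rather than taken from the paper's design.

```python
def peak_simultaneous_ffs(domains, cycles=16):
    """domains: list of (num_flip_flops, divider, phase) tuples.
    A domain is clocked in TCK cycle t when t % divider == phase.
    Returns the largest number of flip-flops clocked in the same cycle."""
    peak = 0
    for t in range(cycles):
        active = sum(n for n, divider, phase in domains if t % divider == phase)
        peak = max(peak, active)
    return peak


# Two cores at TCK/2 and one at TCK/4 (hypothetical flip-flop counts).
aligned = [(1000, 2, 0), (800, 2, 0), (500, 4, 0)]
staggered = [(1000, 2, 0), (800, 2, 1), (500, 4, 1)]
print(peak_simultaneous_ffs(aligned), peak_simultaneous_ffs(staggered))
```

In this toy model the staggered configuration still lets the slowest core share a cycle with one of the faster ones, because two TCK/2 clocks already occupy both available phase slots; the peak is nonetheless lower than with aligned edges.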

In terms of DFT area, the capture window size and the number of clock domains decide the hardware overhead of the scan control block, which is generally less than 10% of the size of the existing P1500 and scan logic (e.g., when scan and P1500 introduce over 4000 gates, the scan control block adds less than 400 gates [21]). For complex cores with hundreds of thousands of gates this is insignificant, when compared to the benefits of facilitating at-speed multi-frequency test of IP-protected cores.

3. WRAPPER OPTIMIZATION

Since the new scan control design enables the scan chains for different clock domains to shift data at distinct frequencies, thereby saving TAT under tight power constraints, we propose a new wrapper optimization procedure to minimize TAT. The problem can be stated as follows:

Problem $P_{mfw-optim}$. Given the test set parameters for the multi-frequency core, including

- the number of clock domains $N_C$;
- for each clock domain (virtual core) $i$, the number of primary inputs, primary outputs, and bidirectional I/Os, the number of scan chains and scan chain lengths for fixed-length scan chains (or the number of scan cells when scan chains are flexible), the number of test patterns and the average power consumption $P_i$;
- the maximum allowed average test power $P_{ave}$;
- the ATE shift frequency $f_t$;

...
3.1 Wrapper Optimization using an ILP Model

Suppose the possible shift frequencies for each VC are \( f_i \in \{ F_1, F_2, \ldots, F_M \} \), which satisfy (i) \( F_{k+1} = \frac{F_k}{2} \), \( k \in \{ 1, 2, \ldots, M - 1 \} \) (the "divided by a power of 2" relationship guarantees easy hardware implementation); (ii) \( F_1 \times 1 + F_M \times (N_C - 1) \leq f_t \times W_{ext} \), i.e., the external scan bandwidth is at least the internal bandwidth when the number of VTB lines for every VC is 1 and one clock domain shifts at \( F_1 \), while all the other clock domains shift at \( F_M \); and (iii) when all VCs shift at \( F_M \), the maximum allowed average test power \( P_{ave} \) is not violated. Hence, when the number of trial frequencies \( M \) is given (we assume \( M = 4 \) in this paper), the values of \( F_1, \ldots, F_M \) can be pre-determined based on the above constraints.
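One possible way to pre-determine the candidate frequencies from conditions (i) to (iii) is sketched below. The search strategy (start from a top frequency and halve until condition (ii) holds) is an assumption made for illustration; the paper only states the conditions. The names `candidate_frequencies`, `power_feasible`, and `f_top` are hypothetical.

```python
def candidate_frequencies(f_tester, w_ext, n_domains, f_top, m=4):
    """Return [F_1, ..., F_M] with F_{k+1} = F_k / 2 (condition i), where F_1
    is the largest value f_top / 2**d that satisfies
    F_1 + F_M * (n_domains - 1) <= f_tester * w_ext (condition ii)."""
    f1 = f_top
    while True:
        ladder = [f1 / 2 ** k for k in range(m)]
        if ladder[0] + ladder[-1] * (n_domains - 1) <= f_tester * w_ext:
            return ladder
        f1 /= 2  # the left-hand side shrinks towards 0, so the loop terminates


def power_feasible(power_at_lowest, p_ave):
    """Condition (iii): all domains shifting at F_M must respect P_ave.
    power_at_lowest lists each domain's power when shifted at F_M."""
    return sum(power_at_lowest) <= p_ave


# Seven clock domains, two TAM wires, 100 MHz tester, 200 MHz top clock.
ladder = candidate_frequencies(100e6, 2, 7, 200e6)
print([f / 1e6 for f in ladder])
```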

Let \( W_i \) denote the number of virtual test bus lines assigned to clock domain \( i \). The maximum possible value of \( W_i \) is \( W_{\text{max}} = \frac{f_t}{F_M} \times W_{ext} - N_C + 1 \), which is reached when all domains shift at \( F_M \) and every other domain holds a single VTB line. We are able to pre-calculate \( T_i(F_k, j) \), which is the test application time for clock domain \( i \) when \( W_i \) is equal to \( j \) and \( f_i \) is equal to \( F_k \). We consider that the given value of \( P_i \) is the power consumption for domain \( i \) when shifted at \( F_M \). Let us define the binary variable \( \delta_{ij} \) as \( \delta_{ij} = 1 \) only if \( W_i = j \), where \( j \in \{ 1, 2, \ldots, W_{\text{max}} \} \). In addition, let us define the binary variable \( \theta_{ik} \) as \( \theta_{ik} = 1 \) only if domain \( i \) is given a shift frequency \( F_k \), where \( k \in \{ 1, 2, \ldots, M \} \). Then the TAT of the core is:

\[
T_{\text{core}} = \max_i \left \{ \sum_{j=1}^{W_{\text{max}}} \sum_{k=1}^{M} \delta_{ij} \theta_{ik} T_i(F_k, j) \right \}
\]

The following constraints must be satisfied:

1. \( \sum_{j=1}^{W_{\text{max}}} \delta_{ij} = 1, \; 1 \leq i \leq N_C \), i.e., every virtual core is assigned exactly one VTB width.

2. \( \sum_{k=1}^{M} \theta_{ik} = 1, \; 1 \leq i \leq N_C \), i.e., every virtual core is shifted at exactly one frequency.

3. \( \sum_{i=1}^{N_C} \sum_{k=1}^{M} 2^{M-k} \theta_{ik} P_i \leq P_{ave} \), i.e., the maximum allowed average test power is not exceeded.

4. \( \sum_{i=1}^{N_C} W_i \times f_i \leq W_{ext} \times f_t \), i.e., the external scan bandwidth is not exceeded.

Since \( W_i = \sum_{j=1}^{W_{\text{max}}} \delta_{ij} \, j \) and

\[ f_i = \sum_{k=1}^{M} \theta_{ik} F_k = \sum_{k=1}^{M} \theta_{ik} 2^{M-k} F_M, \]

constraint 4 can be converted to:

\[ \sum_{i=1}^{N_C} \sum_{j=1}^{W_{\text{max}}} \sum_{k=1}^{M} 2^{M-k} \delta_{ij} \theta_{ik} \, j \leq W_{ext} \times \left( \frac{f_t}{F_M} \right) \]

Figure 2: Comparison of Scan Control Blocks

Figure 3: Comparison of Timing Diagrams

The non-linear term \( \delta_{ij} \theta_{ik} \) must be linearized so that linear programming tools can be used to solve this problem. This is done by introducing a new binary variable \( \lambda_{ijk} = \delta_{ij} \theta_{ik} \) with additional constraints, which yields the following ILP model:

**Objective:** Minimize \( \max_i \left\{ \sum_{j=1}^{W_{\text{max}}} \sum_{k=1}^{M} \lambda_{ijk} T_i(F_k, j) \right\} \), subject to the following constraints:

1. \( \sum_{j=1}^{W_{\text{max}}} \delta_{ij} = 1, \quad 1 \leq i \leq N_C \)
2. \( \sum_{k=1}^{M} \theta_{ik} = 1, \quad 1 \leq i \leq N_C \)
3. \( \sum_{i=1}^{N_C} \sum_{k=1}^{M} 2^{M-k} \theta_{ik} P_i \leq P_{ave} \)
4. \( \sum_{i=1}^{N_C} \sum_{j=1}^{W_{\text{max}}} \sum_{k=1}^{M} 2^{M-k} \lambda_{ijk} \, j \leq W_{ext} \times \left( \frac{f_t}{F_M} \right) \)
5. \( \delta_{ij} + \theta_{ik} - \lambda_{ijk} \leq 1, \quad 1 \leq i \leq N_C, \quad 1 \leq j \leq W_{\text{max}}, \quad 1 \leq k \leq M \)
6. \( \delta_{ij} + \theta_{ik} - 2 \lambda_{ijk} \geq 0, \quad 1 \leq i \leq N_C, \quad 1 \leq j \leq W_{\text{max}}, \quad 1 \leq k \leq M \)
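That constraints 5 and 6 pin the auxiliary variable to the product of the two binary variables can be checked by enumerating all four binary combinations. The helper name `feasible_lambdas` is hypothetical; the two inequalities are those of constraints 5 and 6.

```python
from itertools import product


def feasible_lambdas(delta, theta):
    """Binary values of lambda allowed by constraints 5 and 6 for a given
    binary pair (delta, theta)."""
    return [lam for lam in (0, 1)
            if delta + theta - lam <= 1          # constraint 5
            and delta + theta - 2 * lam >= 0]    # constraint 6


for delta, theta in product((0, 1), repeat=2):
    # Exactly one value survives, and it equals the product delta * theta.
    print(delta, theta, feasible_lambdas(delta, theta))
```

Constraint 5 forces the auxiliary variable to 1 when both inputs are 1, and constraint 6 forces it to 0 when at least one input is 0.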

It should be noted that, because \( \delta_{ij} \) and \( \theta_{ik} \) are binary, constraints 5 and 6 effectively enforce \( \lambda_{ijk} = \delta_{ij} \theta_{ik} \). Counting the binary variables \( \delta_{ij} \), \( \theta_{ik} \) and \( \lambda_{ijk} \), the model has \( N_C \times (W_{\text{max}} + M + W_{\text{max}} \times M) \) variables, and constraints 1 to 6 contribute \( 2 \times N_C \times (1 + W_{\text{max}} \times M) + 2 \) constraints. Because both numbers can easily be in the range of thousands for a core with a large \( N_C \) and/or \( W_{ext} \), using an ILP solver to obtain the optimal TAM configuration requires large computation time. Hence, in the next section we introduce an efficient heuristic for problem \( P_{mfw-optim} \), which can achieve near-optimal results within seconds.
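For very small instances, the optimization problem can also be explored by exhaustive enumeration, which is useful for sanity-checking a solver setup. The sketch below is not the ILP formulation itself: it enumerates every (VTB width, frequency) pair per domain, applies the power and bandwidth constraints, and keeps the assignment with the smallest maximum TAT. The function `tat` is a hypothetical stand-in for the pre-calculated \( T_i(F_k, j) \) table (balanced wrapper chains plus one capture cycle per pattern), and all numeric values are invented for illustration.

```python
from itertools import product
from math import ceil


def tat(cells, patterns, width, freq):
    """Hypothetical stand-in for T_i(F_k, j): balanced wrapper chains."""
    return (ceil(cells / width) + 1) * patterns / freq


def exhaustive_optimum(domains, freqs, w_ext, f_tester, p_ave):
    """domains: list of (scan_cells, patterns, power_at_lowest_frequency).
    freqs: [F_1, ..., F_M], each half of the previous one.
    Returns (t_core, assignment) with assignment[i] = (width, freq_index)."""
    m = len(freqs)
    # Upper bound on lines per domain: everyone at F_M, others hold one line.
    w_max = int(f_tester / freqs[-1] * w_ext) - len(domains) + 1
    choices = [(w, k) for w in range(1, w_max + 1) for k in range(m)]
    best = None
    for assign in product(choices, repeat=len(domains)):
        # Power doubles with each frequency step above F_M (index m - 1).
        power = sum(p * 2 ** (m - 1 - k)
                    for (_, _, p), (_, k) in zip(domains, assign))
        bandwidth = sum(w * freqs[k] for w, k in assign)
        if power > p_ave or bandwidth > w_ext * f_tester:
            continue
        t_core = max(tat(c, n, w, freqs[k])
                     for (c, n, _), (w, k) in zip(domains, assign))
        if best is None or t_core < best[0]:
            best = (t_core, assign)
    return best


# Two hypothetical clock domains, two candidate frequencies.
domains = [(400, 10, 100), (100, 10, 50)]
t_core, assignment = exhaustive_optimum(domains, [50e6, 25e6], 4, 50e6, 250)
print(t_core, assignment)
```

In this toy instance the power budget rules out shifting both domains at the higher frequency, and the best assignment found mixes the two frequencies. The search space grows exponentially with the number of domains, which is why such enumeration is only practical for tiny cases.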

### 3.2 Heuristic for Wrapper Optimization

The algorithm for core wrapper design with multiple shift frequencies (CWDMSF) takes as inputs the tester frequency (\( f_t \)), the test parameters of the multi-frequency core (\( C \)), the TAM width (\( W_{ext} \)), the pre-determined possible shift frequencies (\( \{F_1, \ldots, F_M\} \)), the number of clock domains \( N_C \) and the maximum test power consumption \( P_{ave} \), and it outputs the wrapper design \( VC \), including the shift frequency \( f_{s_i} \) and the number of VTB lines \( VTB_i \) for each virtual core \( VC_i \). The pseudocode for this procedure is shown in Figure 4.

The algorithm initializes the virtual cores by assigning to each VC the inputs, the scan chains and the outputs which operate in its clock domain (line 1). In line 2 all the VTB lines are initialized to operate at the lowest possible frequency \( F_{min} \) (i.e., \( F_M \)). Line 3 computes the power consumption \( P_{curr} \) (at this moment \( P_{curr} \) is the total power consumption when every clock domain is shifted at \( F_M \)), and if \( P_{curr} > P_{ave} \) the program exits because it cannot satisfy the power consumption constraint (line 4). Otherwise, each virtual core \( VC_i \) is first allocated one VTB line and then a single-frequency core wrapper design algorithm SFCWD (e.g., Design_wrapper [2]) is performed to get an initial testing time (lines 5-8), which is used as the starting point for virtual test bus line allocation (lines 9-21).

Depending on \( N_{TB} \), the algorithm proceeds as follows. First, all the virtual cores are sorted based on their TAT and the bottleneck VC (with the longest TAT) is identified (line 11). Then the following steps iteratively assign the remaining VTB lines to virtual cores. The basic idea is to assign more virtual test bus lines to the bottleneck virtual core and at the same time try different possible shift frequencies. Although increasing the frequency will lower the TAT, if the current bottleneck VC is assigned a higher frequency without considering the increase in power, a suboptimal solution may be obtained because the available power budget for the next iteration is reduced. To account for this problem, we build a cost function that combines TAT and power, and we select the shift frequency that obtains the minimum cost instead of the minimum TAT. This is done in Algorithm 2 (Figure 5), which assigns VTB lines to the bottleneck VC. \( NoWeights \) power weights in the cost function are tried and we select the one which gives the shortest TAT (line 21).
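The bottleneck-driven idea described above can be sketched as a simplified greedy loop. This is not the CWDMSF procedure of Figures 4 and 5: it only tries two moves for the bottleneck VC (one more VTB line, or the next higher frequency), uses a hypothetical TAT model `tat` in place of SFCWD, and uses an illustrative cost that adds a weighted power term to the relative TAT. All names and numeric values are assumptions.

```python
from math import ceil


def tat(cells, patterns, width, freq):
    """Hypothetical stand-in for the TAT returned by SFCWD."""
    return (ceil(cells / width) + 1) * patterns / freq


def greedy_wrapper(domains, freqs, w_ext, f_tester, p_ave, power_weight=0.0):
    """domains: list of (scan_cells, patterns, power_at_lowest_frequency).
    Returns (t_core, state) with state[i] = [vtb_lines, freq_index],
    or None if even the lowest frequency violates a constraint."""
    m = len(freqs)
    state = [[1, m - 1] for _ in domains]  # one line each, lowest frequency

    def power(s):
        return sum(p * 2 ** (m - 1 - k) for (_, _, p), (_, k) in zip(domains, s))

    def bandwidth(s):
        return sum(w * freqs[k] for w, k in s)

    def times(s):
        return [tat(c, n, w, freqs[k]) for (c, n, _), (w, k) in zip(domains, s)]

    if power(state) > p_ave or bandwidth(state) > w_ext * f_tester:
        return None
    while True:
        t = times(state)
        b = t.index(max(t))                       # bottleneck virtual core
        w, k = state[b]
        best = None
        for cand in ([w + 1, k], [w, k - 1]):     # add a line / raise frequency
            if cand[1] < 0:
                continue
            trial = state[:b] + [cand] + state[b + 1:]
            if power(trial) > p_ave or bandwidth(trial) > w_ext * f_tester:
                continue
            new_t = times(trial)[b]
            cost = new_t / t[b] + power_weight * power(trial) / p_ave
            if new_t < t[b] and (best is None or cost < best[0]):
                best = (cost, cand)
        if best is None:                          # no feasible improving move
            return max(t), state
        state[b] = best[1]


result = greedy_wrapper([(400, 10, 100), (100, 10, 50)], [50e6, 25e6], 4, 50e6, 250)
print(result)
```

Every accepted move increases the internal bandwidth in use, which is bounded by the external bandwidth, so the loop terminates. A larger `power_weight` makes the sketch more reluctant to spend the power budget on the current bottleneck.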

In Algorithm 2, only one VTB line that operates at \( F_{min} \) is assigned each time. As a result, the bottleneck VC is first transformed to a temporary VC which operates at \( F_{min} \) (line 4). The cost function is built as in line 11, in which \( normalWeight \) is a constant used to scale the TAT and the power consumption to comparable values. In our experiments, we select \( NoWeights = 100 \) and \( normalWeight = 200 \) to limit the run time to a few seconds. Inside the inner loop (lines 8-19), the algorithm selects the shift frequency that minimizes the cost and at the same time satisfies the power constraint (lines 12, 15). Whenever a VTB line is assigned, SFCWD is performed again to get the new testing time (line 9). The procedure exits when the TAT of the bottleneck VC is reduced or all the virtual test bus lines are assigned with no TAT reduction.

Algorithm 2: AssignVTBtoVC

INPUT: \( VC_{comp}, P_{ave}, P_{min}, N_{max}, powerWeight \)

OUTPUT: \( VC_{comp}, VTB_{comp}, N_{max} \)

1. \( T_{temp} = T_{orig} \)
2. \( P_{ave} = P_{comp}; \; P_{other} = P_{comp} - P_{orig} \)
3. while \( (T_{temp} \geq T_{orig}) \) && \( (N_{temp} > N_{comp}) \) do
4. \( f_{temp} = F_{min}; \; VTB_{temp} = VTB_{orig} \times \frac{f_{orig}}{F_{min}} \)
5. \( VTB_{comp}{+}{+}; \; N_{comp}{+}{+} \)
6. \( notTrials = M \)
7. \( minCost = \infty \)
8. while \( (notTrials > 0) \) do
9. do SFCWD
10. compute \( T_{temp}, P_{temp} \)
11. build the cost function
12. \( currCost = (T_{temp} - T_{orig}) \times \frac{P_{ave} + powerWeight}{T_{temp} - T_{orig}} \)
13. if \( (T_{temp} < T_{orig} \) && \( currCost < minCost) \) then
14. record the current virtual core wrapper design
15. if \( (VTB_{temp} \bmod 2 = 0 \) && \( P_{other} + 2 \cdot P_{temp} < P_{ave}) \) then
16. \( f_{temp} = f_{temp} \times 2 \)
17. \( VTB_{temp} = VTB_{temp} / 2 \)
18. else
19. break
20. return \( T_{temp}, f_{temp}, VTB_{temp}, N_{max} \)

Figure 5: Procedure for Assigning VTB Lines to the Bottleneck Virtual Core.

The worst-case complexity of the single frequency wrapper design algorithm Design_wrapper is shown to be \( O(sc \cdot \log sc + sc \cdot W_{\text{ext}}) \) in [2], where \( sc \) is the number of internal scan chains. The worst-case complexity of the proposed CWDMSF algorithm is \( O(N_{c} \cdot \sum_{i=1}^{N_{c}} \log sc_{i} + W_{\text{ext}} \cdot sc_{\text{max}} \cdot \log sc_{\text{max}} + W_{\text{ext}} \cdot sc_{\text{max}}) \), where \( sc_{i} \) and \( sc_{\text{max}} \) are the number of internal scan chains for clock domain \( i \) and the maximum number of scan chains of all clock domains, respectively. The computational complexity is therefore quadratic in the number of clock domains and the number of external TAM wires. For a core with a fixed number of clock domains, it is quadratic in the number of external TAM wires.

4. EXPERIMENTAL RESULTS

To illustrate the importance of employing multiple shift frequencies in the wrapper architecture, this section compares the wrapper design algorithm proposed in this paper with the one based on a single shift frequency reported in [21]. Benchmark SOCs available in the public domain do not contain clock domain information about the embedded cores. Therefore, we present results here for a hypothetical, but representative, multi-frequency embedded core. The hTCADT00 core used in [21] does not have a large number of clock domains and flip-flops. In order to show the TAT variations under power constraints, we have constructed a complex hypothetical multi-frequency core hCADT01. This core has seven clock domains as shown in Table 1, where \( f_{\text{func}} \) denotes the functional frequency; \( N_{I}, N_{O}, N_{B}, \) and \( N_{S} \) are the number of inputs, outputs, bidirectionals and scan chains in the specific clock domain, respectively; the length of each scan chain in clock domain \( i \) is shown in column \( SC_{\text{length}, i} \); and \( P \) is the power consumption when shifting at 100MHz and is calculated as \( P_{f} = 5 \cdot SC_{\text{length}, i} \cdot (f_{f} \in SC_{\text{length}, i}) \). We assume the power consumption of a VC is proportional to the number of memory elements in it. Note that since the maximum internal frequency for the experimental core is \( f_{\text{max}} = 200MHz \), and we assume that the maximum frequency of the tester is 120MHz in our experiments, the tester will shift test data at \( f_{t} = 100MHz \), thus synchronizing with a division of \( f_{\text{max}} \).

Table 2 compares the test application time of hCADT01 when different power constraints \( P_{ave} \) are considered. \( T_{[21]} \) denotes the TAT for the single-frequency shift architecture from [21] and \( T_{\text{new}} \) stands for the TAT obtained by the multi-frequency shift architecture from this paper, derived using the heuristic approach from Section 3.2. \( \Delta T \) is computed as \( \Delta T = \frac{T_{\text{new}} - T_{[21]}}{T_{[21]}} \). Even when there is no power constraint (i.e., \( P_{ave} \) is infinite), we can observe that the shifting time is reduced for almost all the given TAM widths from 1 to 16. We can also observe that the proposed architecture leads to much shorter TAT when the power constraint is tighter. For example, when the given TAM width is \( W_{\text{ext}} = 6 \) and the power constraint is \( P_{ave} = 1500 \), \( T_{\text{new}} \) is only half of \( T_{[21]} \). This is because all the VCs are constrained to shift at 12.5MHz to meet the power requirements in the single-frequency shift architecture from [21], and clock domain 5 (\( f_{\text{func}} = 50MHz \)) dominates with TAT = 41.68\( \mu s \). For the architecture proposed in this paper, clock domain 5 is able to shift at 25MHz, which results in TAT = 20.84\( \mu s \) while still meeting the power constraint.
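As a worked check of this example, using only the two TAT values quoted above (41.68 µs for the single-frequency architecture and 20.84 µs for the proposed one) and the two shift frequencies of clock domain 5:

\[
\Delta T = \frac{20.84 - 41.68}{41.68} = -0.5, \qquad 41.68\,\mu s \times \frac{12.5\,\text{MHz}}{25\,\text{MHz}} = 20.84\,\mu s
\]

That is, a 50% reduction, which is what one would expect when the shift frequency of the dominating clock domain doubles, assuming its wrapper scan chain lengths stay the same.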

We have also implemented the ILP method using a public linear programming solver, lp_solve [15]. We obtain the same results as the heuristic method when \( W_{\text{ext}} \leq 4 \). When the external TAM width is larger, lp_solve does not run to completion in 10 hours, using a 900MHz Pentium III PC with 256MB memory. The execution time of the heuristic is, however, only a few seconds. Nevertheless, the ILP method is useful in that it shows the heuristic yields optimal results for \( W_{\text{ext}} \leq 4 \). In addition, for \( W_{\text{ext}} > 4 \), the lower bounds are obtained using LP-relaxation (the variables \( \theta_{ik} \) and hence also \( \lambda_{ijk} \) in the ILP model were "relaxed" to reals). Due to the nature of LP-relaxation these lower bounds are not "tight", which implies that they may not be reachable with integer values. The lower bounds for both \( W_{\text{ext}} \leq 4 \) (obtained through ILP) and \( W_{\text{ext}} > 4 \) (from LP-relaxation) are shown in columns \( T_{\text{LB}} \) of Table 2, from which we can observe that the proposed heuristic generates values close to them.

5. CONCLUSION

We have presented a new method for designing test wrappers for embedded cores with multiple clock domains, by allowing scan chains in different clock domains to shift test data at distinct frequencies. We also proposed an ILP model and efficient heuristics to optimize the wrapper in terms of testing time. Experimental results have been presented for a hypothetical, but representative, multi-frequency embedded core. Compared to recent work on wrapper design using a single shift frequency, we obtain lower testing times, and the reduction is especially significant under power constraints.

6. REFERENCES

Table 1: hCADT01 Clock Domain Information

<table>
<thead>
<tr>
<th>W_ext</th>
<th>W_in</th>
<th>P_ave = 1500</th>
<th>P_ave = 3000</th>
<th>P_ave = 4500</th>
<th>P_ave = ∞</th>
</tr>
</thead>
<tbody>
<tr>
<td>16</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>6.72</td>
</tr>
<tr>
<td>15</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>6.91</td>
</tr>
<tr>
<td>14</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>7.41</td>
</tr>
<tr>
<td>13</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>8.17</td>
</tr>
<tr>
<td>12</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>8.77</td>
</tr>
<tr>
<td>11</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>9.20</td>
</tr>
<tr>
<td>10</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>9.62</td>
</tr>
<tr>
<td>9</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>10.42</td>
</tr>
<tr>
<td>8</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>12.82</td>
</tr>
<tr>
<td>7</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>14.73</td>
</tr>
<tr>
<td>6</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>17.52</td>
</tr>
<tr>
<td>5</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>20.84</td>
</tr>
<tr>
<td>4</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>24.05</td>
</tr>
<tr>
<td>3</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>27.56</td>
</tr>
<tr>
<td>2</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>31.76</td>
</tr>
<tr>
<td>1</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>36.02</td>
</tr>
<tr>
<td>0</td>
<td>20.84</td>
<td>20.84 -50</td>
<td>10.42</td>
<td>10.42 -50</td>
<td>41.68</td>
</tr>
</tbody>
</table>

Table 2: Comparison of Test Application Time for hCADT01 with Different Power Constraints


