# Compact 12-Port Multi-Bank Register File Test-Chip in 0.35µm CMOS for Highly Parallel Processors

Tetsuya Sueyoshi, Hiroshi Uchida, Hans Jürgen Mattausch, and Tetsushi Koide

Research Center for Nanodevices and Systems
Hiroshima University
1-4-2 Kagamiyama, Higashi-Hiroshima, 739-8527, Japan
Phone: +81-824-24-6265
Fax: +81-824-22-7185

email: {sueyoshi, hjm, koide}@sxsys.hiroshima-u.ac.jp

Abstract - We designed a compact, high-speed, low-power bank-type 12-port register file test chip for highly parallel processors in a 0.35µm CMOS technology. Compared with the conventional 12-port-cell-based register file, this full-custom test chip design achieves a 72% smaller area, a 25% shorter access cycle time, and a 62% lower power consumption.

## I. INTRODUCTION

Important techniques for improving processor performance exploit opportunities to execute instructions in parallel. Conventional superscalar processors, used in today's PCs and notebooks, detect independent instructions in the normal program flow and can execute up to 4 of them in parallel. A newer development, used in so-called Simultaneous Multi-Threading (SMT) processors, divides the processor's workload into largely independent program parts called threads. By executing these threads in parallel, SMT processors further increase the total number of instructions executed in parallel. Both the number of simultaneously executed instructions and the number of simultaneously active threads are expected to keep increasing. To enable this development, the processor's register file must supply a sufficient number of access ports as well as registers (also called entries). For example, a register file that supports an 8-issue/4-thread SMT processor has to be equipped with 24 ports and 512 entries [1]. If the conventional multiport-cell-based architecture is used for such a register file, area, access time, and power consumption increase unacceptably [2]. We propose to solve these problems with a multiport register file that has a multi-bank structure. Here we report the design of a test chip in a 0.35µm CMOS technology, which verifies the effectiveness of our proposal [3,4].

## II. MULTI-BANK REGISTER FILE ARCHITECTURE

We have selected the *Hierarchical Multiport-memory Architecture (HMA)* [2] shown in Fig. 1 for the bank-type register file. HMA reduces the area of the multiport memory by using banks built from 1-port cells. The main difference between HMA and the conventional crossbar architecture is that HMA realizes the crossbar function in distributed form, with a *1:N port convertor* attached to each bank, rather than in the conventional centralized form. In this way the number of transistors and global wirings can be reduced without degrading the functionality [5]. To resolve access conflicts to the banks, an *access conflict management circuit* is included on the 2nd hierarchy level of HMA. It detects conflicting accesses to the banks for each port and decides which of these accesses are permitted and which are prohibited.

Yosuke Mitani and Tetsuo Hironaka

Faculty of Computer Sciences
Hiroshima City University
3-4-1 Ozuka-Higashi, Asaminami-ku, 731-3194, Japan
Phone: +81-82-830-1566
Fax: +81-82-830-1792

email: hironaka@csys.ce.hiroshima-cu.ac.jp

The block diagram of the proposed HMA register file is shown in Fig. 2. At the beginning of an access cycle, the access conflict management circuit and the row/column bank selectors determine which port can access which bank. For a read access, an address is supplied at the selected read port and the data from the accessed register in the bank is transferred to the read port. For a write access, address and data are supplied at the selected write port and the data is written into the accessed register of the bank.
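The per-cycle decision of which port may access which bank can be illustrated with a small behavioral sketch. It is not the circuit used on the test chip: the register-to-bank mapping (low-order interleaving) and the fixed-priority policy (lowest port index wins) are assumptions made only for illustration, and the names `bank_of` and `arbitrate` are hypothetical.

```python
from typing import Dict, List, Optional

NUM_BANKS = 4  # bank count of the test chip described in the text


def bank_of(register: int, num_banks: int = NUM_BANKS) -> int:
    """Map a register index to a bank.

    Low-order interleaving is an assumption; the paper does not
    specify the mapping used by the row/column bank selectors.
    """
    return register % num_banks


def arbitrate(requests: List[Optional[int]],
              num_banks: int = NUM_BANKS) -> List[bool]:
    """Grant at most one port per bank in a single access cycle.

    requests[p] is the register addressed by port p, or None if the
    port is idle. Fixed priority (lowest port index wins) is an
    assumed policy, not the documented one.
    """
    owner: Dict[int, int] = {}          # bank -> port that was granted
    grants = [False] * len(requests)
    for port, reg in enumerate(requests):
        if reg is None:
            continue
        bank = bank_of(reg, num_banks)
        if bank not in owner:           # bank still free in this cycle
            owner[bank] = port
            grants[port] = True
    return grants


# Ports 0 and 1 both address bank 0 (registers 0 and 4), so port 1 is denied.
print(arbitrate([0, 4, 1, None]))
```

A denied port has to repeat its access in a later cycle, which is the source of the performance loss discussed next.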

The main problem of a bank-type register file is the possibility of access conflicts. If two or more ports access the same bank simultaneously, only one port can access the bank and the others cannot. Consequently, the processing of the related instructions is delayed and the processor performance is degraded. To handle this problem we developed an efficient register-access scheduling methodology, which includes, for example, register renaming, combination of accesses to the same register, and register access queues [4]. To examine the effectiveness of this scheduling methodology and to determine the required number of banks, we simulated the performance of our bank-type register file in comparison to an ideal register file. The simulation result for a register file with 12 ports (8 read ports, 4 write ports) and 128 entries is shown in Fig. 3. Such a register file is typical for a processor executing 4 instructions in parallel. The simulation shows that only 4 banks are needed to reach an instruction processing performance nearly equivalent to that of the ideal 12-port register file.
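One of the scheduling measures named above, the combination of accesses to the same register, can be sketched as follows. This is only an illustration under stated assumptions, not the methodology of [4]: the function name `denied_reads` is hypothetical, the bank mapping is again assumed to be low-order interleaving, and each bank is assumed to serve the register that was requested first.

```python
from collections import defaultdict
from typing import Dict, List


def denied_reads(read_regs: List[int], num_banks: int = 4,
                 combine: bool = True) -> int:
    """Count read requests that cannot be served in the current cycle.

    read_regs lists the registers requested by the active read ports.
    With combine=True, all requests for the register served by a bank
    share that single bank access (assumed behavior).
    """
    per_bank: Dict[int, List[int]] = defaultdict(list)
    for reg in read_regs:
        per_bank[reg % num_banks].append(reg)   # assumed bank mapping

    denied = 0
    for regs in per_bank.values():
        if combine:
            served = regs.count(regs[0])  # duplicates ride along
        else:
            served = 1                    # strictly one port per bank
        denied += len(regs) - served
    return denied


# Registers 5, 5, and 9 all map to bank 1; register 2 maps to bank 2.
print(denied_reads([5, 5, 9, 2], combine=False))  # 2 requests denied
print(denied_reads([5, 5, 9, 2], combine=True))   # 1 request denied
```

In this toy case, combining the two reads of register 5 halves the number of denied requests, which hints at why such measures reduce the number of banks needed.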

## III. DESIGN COMPARISON AND TEST-CHIP FABRICATION

We have designed the 12-port HMA register file with 4 banks and 128 32-bit registers, as well as the conventional register file with 12-port memory cells, in a 0.35µm CMOS technology with 3 metal layers [4]. The layout comparison of both designs is shown in Fig. 4. An L-shaped floorplan was applied to the banks of the HMA register file to enable the routing of global wirings for outputs and inputs on top of the banks. As shown in Fig. 5, a dense design with minimized length and area consumption of the wiring could be achieved. The photomicrograph of the fabricated HMA register file in 0.35µm CMOS with 3 metal layers is depicted in Fig. 6. Compared with the conventional design, the area of the HMA register file is reduced by 72%. This is explained by the quadratic area increase of the multiport memory cell in the conventional design [2], which makes a 12-port cell about 40 times larger than a 1-port cell. In the HMA register file, on the other hand, the area overhead of the 1:N port convertor for a bank is only 40%. Because of the smaller area of the HMA register file, the parasitic capacitances in its access path are much smaller and its global wires are much shorter. Consequently, access cycle time as well as power dissipation are also improved.

The summary of the register file design comparison, including measured data, is shown in Table I. In this table, (a) and (b) give simulation results from the layout, and (c) gives the measurement results of the fabricated test chip. The bank-type HMA register file is verified to achieve large improvements, reducing the area by 72%, the access cycle time by 25%, and the power dissipation by 62% when compared with the conventional 12-port-memory-cell-based register file. This demonstrates the effectiveness of the proposed HMA-based register file architecture.
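The quoted percentages follow directly from the values in Table I. The short check below recomputes them; the variable and function names are chosen only for this illustration. Note that Table I lists no measured area, so the area reduction is taken from the layout values in column (b), while cycle time and power use the measured values in column (c).

```python
# Values copied from Table I.
conventional = {"area_mm2": 7.84, "cycle_ns": 14.0, "power_mw": 543.0}   # (a)
hma_simulated = {"area_mm2": 2.23, "cycle_ns": 10.7, "power_mw": 212.0}  # (b)
hma_measured = {"cycle_ns": 10.5, "power_mw": 204.0}                     # (c)


def reduction_percent(baseline: float, value: float) -> int:
    """Relative reduction of `value` against `baseline`, in whole percent."""
    return round(100.0 * (1.0 - value / baseline))


area = reduction_percent(conventional["area_mm2"], hma_simulated["area_mm2"])
cycle = reduction_percent(conventional["cycle_ns"], hma_measured["cycle_ns"])
power = reduction_percent(conventional["power_mw"], hma_measured["power_mw"])

print(area, cycle, power)  # 72 25 62
```

Using the simulated HMA values of column (b) instead gives 24% for the cycle time and 61% for the power dissipation, so simulation and measurement agree closely.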

## IV. CONCLUSIONS

A bank-type multiport register file architecture for highly parallel processors has been proposed and successfully verified with a 12-port test chip with 128 registers in a 0.35µm CMOS technology. Detailed analysis showed that large improvements in area (72%), access cycle time (25%), and power consumption (62%) are achieved in comparison to the conventional 12-port-cell-based register file. The bank-type register file was also verified to be applicable in superscalar processors without loss in processor performance when an appropriate access scheduling methodology is applied.



Fig.1. Hierarchical Multiport-memory Architecture (HMA).



Fig.3. Evaluation of register access scheduling efficiency for a 12-port bank-type register file. An execution time of 1 corresponds to an ideal 12-port register file.



Fig.4. Layout comparison in a 0.35µm CMOS technology (12 ports, 128 registers). (a) Multiport SRAM cell based register file (conventional). (b) Hierarchical bank-type register file (proposal).

## ACKNOWLEDGMENTS

The test chip has been fabricated in the chip fabrication program of VDEC, the University of Tokyo, in collaboration with Rohm Co., Toppan Printing Co., and Cadence Design Systems, Inc.

This research is supported by the Semiconductor Technology Academic Research Center (STARC), Yokohama, Japan.

## REFERENCES

- [1] R. P. Preston, et al., "Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading," Dig. of Tech. Papers, ISSCC2002, pp. 334-335, 2002.
- [2] H. J. Mattausch, et al., "Area-efficient multi-port SRAMs for on-chip data-storage with high random-access bandwidth and large storage capacity," IEICE Trans. Electron., Vol. E84-C, No. 3, pp. 410-417, 2001.
- [3] Y. Mitani, et al., "Access conflict resolution methods for superscalar processors with multi-bank register file," Tech. Rep. of IEICE, ARC-2002-150, pp. 41-46, 2002 (in Japanese).
- [4] H. Uchida, et al., "Small-area multi-port register files with multi-bank structure," Tech. Rep. of IEICE, ICD-2002-155, pp. 175-180, 2002 (in Japanese).
- [5] S. Fukae, et al., "Optimized bank-based multi-port memories through a hierarchical multi-bank structure," Proc. of SASIMI2003, pp. 323-330, 2003.



Fig. 2. Designed hierarchical bank-type register file architecture (4 write ports, 8 read ports, 128 registers).







Fig.6. Chip photo of the fabricated bank-type 12-port register file test chip in a 0.35µm, 3-metal CMOS technology.

Table I. Design-comparison results between conventional 12-port-cell-based and hierarchical bank-type 12-port register files. (a) and (b) are circuit-simulation results and (c) are measured results of the test chip.

|                                         | (a) 12-port SRAM-cell-based register file (simulated) | (b) 12-port HMA register file (simulated) | (c) 12-port HMA register file (measured) |
|-----------------------------------------|-------------------------------------------------------|-------------------------------------------|------------------------------------------|
| Area [mm<sup>2</sup>] (ratio)           | 7.84 (1)                                              | 2.23 (0.28)                               |                                          |
| Cycle time [ns] (ratio)                 | 14 (1)                                                | 10.7 (0.76)                               | 10.5 (0.75)                              |
| Power dissipation (@50MHz) [mW] (ratio) | 543 (1)                                               | 212 (0.39)                                | 204 (0.38)                               |