A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Roman Lysecky, Frank Vahid*

Department of Computer Science and Engineering
University of California, Riverside
{rlysecky, vahid}@cs.ucr.edu

*Also with the Center for Embedded Computer Systems at UC Irvine

Abstract
In previous work, we showed the benefits and feasibility of having a processor dynamically partition its executing software such that critical software kernels are transparently partitioned to execute as a hardware coprocessor on configurable logic – an approach we call warp processing. The configurable logic place and route step is the most computationally intensive part of such hardware/software partitioning, normally running for many minutes or hours on powerful desktop processors. In contrast, dynamic partitioning requires place and route to execute in just seconds and on a lean embedded processor. We have therefore designed a configurable logic architecture specifically for dynamic hardware/software partitioning. Through experiments with popular benchmarks, we show that by specifically focusing on the goal of software kernel speedup when designing the FPGA architecture, rather than on the more general goal of ASIC prototyping, we can perform place and route for our architecture 50 times faster, using 10,000 times less data memory, and 1,000 times less code memory, than popular commercial tools mapping to commercial configurable logic. Yet, we show that we obtain speedups (2x on average, and as much as 4x) and energy savings (33% on average, and up to 74%) when partitioning even just one loop, which are comparable to commercial tools and fabrics. Thus, our configurable logic architecture represents a good candidate for platforms that will support dynamic hardware/software partitioning, and enables ultra-fast desktop tools for hardware/software partitioning, and even for fast configurable logic design in general.

Keywords
Hardware/software partitioning, FPGA fabric, configurable logic, synthesis, place and route, platforms, system-on-a-chip, dynamic optimization, codesign, self-improving chips, just-in-time compilation, warp processors, reconfigurable computing.

1. Introduction
Dynamic software optimization is becoming an increasingly popular method of improving software performance and power. In dynamic optimization, a user executes a standard binary program on a processor system. The processor system itself then monitors the executing binary, detects the frequently executed kernels, and optimizes those kernels. Existing optimizations include dynamic recompilation and caching of previous binary translation results [3][12][18]. Dynamic optimizations can be carried out by extra tasks sharing the processor and/or by extra hardware.

One advantage of dynamic optimization is that dynamic optimization fits seamlessly into traditional software design flows, requiring no special desktop tools and no special profiling step. While designers familiar with hardware design flows may not mind an extra tool or profiling step, the vast majority of software design flows are solidly established and not open to such changes. Further advantages include that dynamic optimization can be based on real runtime data, large amounts of such data, and changing runtime data, all of which represent additional benefits compared to desktop-based optimization. A drawback is that the optimization algorithms may have to be less powerful than desktop algorithms. Nevertheless, dynamic optimizations continue to increase in popularity.

Current dynamic software optimizations typically exhibit performance and power improvements on the order of 10-20%. However, with the advent of single-chip platforms having both microprocessors and configurable logic on the same chip, like Triscend’s E5 and A7 [29] platform, Altera’s Excalibur [1], Atmel’s Field Programmable System Level Integrated Circuit (FPSLIC) [2], and Xilinx’s Virtex-II Pro [33], a far more powerful optimization has become possible. Re-implementing the software kernels as a hardware coprocessor on the configurable logic, known as hardware/software partitioning, can result in overall software speedups of 200%-1000% [4][9][10][11][17][30], as well as reducing system energy [15][16][26][31].

Until recently, hardware/software partitioning has only been implemented as a desktop CAD tool, typically incorporated into a software compiler that partitions high-level code, like C or C++. Recently, we showed [27] that desktop hardware/software partitioning could be done starting from binaries rather than from high-level code, with competitive resulting performance and energy. Additionally, a recently introduced commercial tool performs coprocessor synthesis from standard software binaries [8]. Binary-level partitioning approaches can produce excellent results by using decompilation techniques to retrieve most of the high-level information typically lost at the binary level [7]. Such binary-level partitioning opens the door to dynamic hardware/software partitioning, in which an executing binary is dynamically optimized by moving software kernels to configurable logic – a process we call warp processing, since performance and energy are automatically warped during software execution.

Given the critical kernels of an application, dynamic partitioning requires decompilation, compiler optimization, behavioral synthesis, logic synthesis, and finally placement and routing onto a configurable logic architecture. While implementing all those tools on-chip for dynamic execution on a lean processor may at first sound absurd given the long computation times and huge resource usage of those tools’ desktop counterparts, we have previously shown the feasibility of implementing many such tools on-chip [20][21][25]. The key is to recognize that dynamic tools need only focus on speeding up
kernels, which typically consist of only a few dozen lines of code and result in hardware consisting of only 10,000 to 30,000 gates. Furthermore, dynamic tools map to only one target technology. In contrast, desktop tools must handle much bigger designs and must be much more general.

In [25], we presented the benefits and feasibility of dynamic partitioning using prototype tools, executing on a lean embedded processor and producing good results for a number of popular embedded system benchmarks. However, our earlier work in dynamic hardware/software partitioning used a very basic configurable logic architecture as a proof-of-concept. That architecture only supported combinational logic, could only implement loops with sequential memory accesses, and incurred a large routing overhead. Nevertheless, place and route was still the most computationally expensive step (as is also true using desktop tools) in dynamic partitioning – an order of magnitude more expensive than all the other steps combined. Therefore, we set out to design a new configurable logic architecture and underlying configurable logic fabric; along with a modified place and route algorithm, that together would support a much larger range of benchmarks while requiring reasonable computation time and memory resources.

In this paper, we present our warp configurable logic architecture (WCLA) for dynamic hardware/software partitioning, specifically targeted at speeding up critical loops of embedded systems applications. We did so in part by evaluating digital signal processors (DSP) and incorporating into our architecture features from the DSP domain specifically designed at increasing loop performance, such as data address generators and loop control hardware. Additionally, we analyzed potential architectural features of our configurable logic architecture with regards to the impacts on place and route tools.

In this paper, we summarize related work, introduce our WCLA, and provide results comparing our tools and architecture to a commercial tool and configurable logic, showing that we obtain similar speedups and energy savings, yet use orders of magnitude less runtime, data memory, and code memory.

2. Previous Work
Many configurable logic architectures have been developed to increase embedded software performance. These approaches use Field-Programmable Gate Arrays (FPGAs), reconfigurable computing, or even custom ASIC processors to achieve improvements in software execution. Existing approaches for improving embedded software performance can be classified as general (FPGAs), fine-grained configurable systems, coarse-grained configurable system, and custom ASIC processors.

Many techniques have been proposed for hardware/software partitioning. One common method for implementing hardware in such an approach is using traditional commercially available FPGAs. However, traditional FPGAs are not well suited for use in dynamic hardware/software partitioning. Traditional FPGAs are typically designed to handle an extremely wide variety of designs and are frequently used to prototype ASIC circuits. To support these vastly different designs, FPGA vendors, such as Xilinx [33] and Altera [1], design FPGAs with complex logic cells having embedded sequential components, large routing resources, large input/output resources, capabilities to support sequential logic, etc. However, in a dynamic hardware/software partitioning approach, traditional FPGAs provide more capabilities than are needed. A configurable logic architecture for implementing critical loops typically has a very simple interface to the main processor and memory and thus does not require general input/output capabilities of traditional FPGAs. Critical loops often consist of simple combinational logic or small sequential circuits and thus do not require large numbers of logic cells. Furthermore, due to their complexity, traditional FPGAs require complex synthesis, technology mapping, and place and route tools, which are not targeted for very fast execution.

Many researchers have developed techniques using FPGAs or fine-grained configurable logic in ways to improve software performance through partitioning. DISC is a run-time configurable system that dynamically swaps in hardware regions into an FPGA when needed during software execution [32]. Chimaera is a similar approach that uses an FPGA as a coprocessor tightly integrated into a processor’s datapath [13]. The Garp project couples an extended MIPS processor with a reconfigurable coprocessor under direct control of the software executing on the processor [14]. Although fine-grained configurable systems have shown very good speedups, their use in a dynamic hardware/software partitioning is limited by their reliance upon complex FPGA architectures. Furthermore, with respect to improving performance of critical loops, approaches that tightly integrate the configurable logic within the datapath are not able to eliminate loop overhead, consisting of branches, comparisons, and memory address calculations.

Other approaches for increasing embedded software performance rely upon coarse-grained configurable logic architecture. MorphSys is a reconfigurable computing platform that incorporates a RISC processor with an array of reconfigurable processing components [19]. The configurable components are coarse-grained ALU-like components that perform operations including two-operand logic functions, arithmetic functions, and multiply-accumulate.

For accelerating the execution of critical software loops, coarse-grained configurable logic architectures are limited in the number of applications that can be mapped to them. To overcome the limitations of coarse-grained configurable logic, some approaches have proposed using heterogeneous configurable logic consisting of coarse-grained units along with fine-grained configurable logic. The Chameleon [24] project and the Pleiades [31] project both propose using heterogeneous configurable logic in conjunction with a general-purpose processor. These approaches benefit from using custom designed coarse-grained units to handle commonly used operations while supporting custom operations, such as bit manipulation, using an FPGA. However, we cannot include coarse-grained functional units to support all operations frequently found within critical loops. Hence, in developing a configurable logic architecture for warp processing, specifically targeting speeding up critical loops, we must carefully analyze the possible inclusion of any coarse-grained units.

Tensilica has developed the Xtensa architecture that allows designers to customize the Xtensa processor by adding custom instructions using the Tensilica Instruction Extension (TIE) language [28]. After describing the newly added instructions, a designer is provided with a synthesizable description of the extended processor along with the associated software development tools. However, for optimizing the performance of critical software loops within embedded applications, the Xtensa processor suffers from the same drawbacks of tightly integrated fine-grained configurable logic approaches.
3. Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Our original dynamic-partitioning-oriented configurable logic architecture (CLA) presented in [25] incorporated a configurable logic fabric comprised of 3-input 2-output lookup tables surrounded by routing resources. While the original CLA was capable of supporting several embedded applications, the amount of routing resources used for those applications was quite large. Furthermore, some benchmarks with larger critical loops would not route using that architecture. Table 1 presents the routability and percent of routing resources used for 13 embedded benchmarks from NetBench, MediaBench, EEMBC, and Powerstone for our original CLA. While the original CLA was developed as a proof-of-concept, the routability of such an architecture is limited. The original CLA only supports five of the 13 embedded benchmarks, and on average routing for the five routable benchmarks requires 57% of the total routing resources available. Hence, we see a need to develop a configurable logic architecture and underlying configurable logic fabric that can support a larger range of applications while being simple enough to allow for lean on-chip synthesis and place and route tools.

Table 1: Routability (Routable) and percent routing resources used (% Routing) for the original CLA and our WCLA.

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Original CLA</th>
<th>SCLF</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Routable % Routing</td>
<td>Routable % Routing</td>
</tr>
<tr>
<td>brev</td>
<td>yes 67%</td>
<td>yes 12%</td>
</tr>
<tr>
<td>g3fax1</td>
<td>yes 76%</td>
<td>yes 18%</td>
</tr>
<tr>
<td>g3fax2</td>
<td>yes 40%</td>
<td>yes 14%</td>
</tr>
<tr>
<td>url</td>
<td>yes 40%</td>
<td>yes 14%</td>
</tr>
<tr>
<td>logmin</td>
<td>yes 63%</td>
<td>yes 14%</td>
</tr>
<tr>
<td>pktflow</td>
<td>no</td>
<td>yes 14%</td>
</tr>
<tr>
<td>canrrd</td>
<td>no</td>
<td>yes 14%</td>
</tr>
<tr>
<td>bitmp</td>
<td>no</td>
<td>yes 12%</td>
</tr>
<tr>
<td>tblook</td>
<td>no</td>
<td>yes 47%</td>
</tr>
<tr>
<td>ttsprk</td>
<td>no</td>
<td>yes 47%</td>
</tr>
<tr>
<td>matrix01</td>
<td>no</td>
<td>yes 61%</td>
</tr>
<tr>
<td>idctrn01</td>
<td>no</td>
<td>yes 22%</td>
</tr>
<tr>
<td>g721</td>
<td>no</td>
<td>yes 9%</td>
</tr>
<tr>
<td><strong>Average:</strong></td>
<td><strong>57%</strong></td>
<td><strong>23%</strong></td>
</tr>
</tbody>
</table>

Since we are targeting critical loops that usually iterate many times before completion, our WCLA must be able to access memory and to control the execution of the loop. One approach to handle memory accesses and loop control is to implement a finite state machine (FSM). However, using a FSM would require a configurable logic fabric supporting sequential logic circuits and would further require more complex synthesis, technology mapping, and place and route tools that must now consider scheduling and timing constraints. Instead, we found that DSP processors typically contain data address generators and loop control hardware to achieve a zero loop overhead, meaning that cycles are not wasted computing loop bounds and sequential memory addresses. We include a DADG with LCH in our warp configurable logic architecture to handle all memory accesses as well as to control the execution of the loop. However, our DADG is restricted to memory accesses that follow a regular access pattern.

Figure 1 shows the overall organization of our proposed warp-processor configurable logic architecture (WCLA) for dynamic hardware/software partitioning, consisting of a data address generator (DADG) with loop control hardware (LCH), three input and output registers, a 32-bit multiplier-accumulator (MAC), and our simple configurable logic fabric (SCLF). Our configurable logic architecture handles all memory accesses to and from the configurable logic using the data address generator, which is capable of generating addresses for up to three distinct arrays. Furthermore, the data retrieved and stored to and from each array is located within one of the three registers Reg0, Reg1, and Reg2. These three registers also act as the inputs to our configurable logic fabric and can be mapped as inputs to the 32-bit (MAC) or directly mapped to the configurable logic fabric. Finally, we connect the outputs from our configurable logic fabric as inputs to the three registers using a dedicated bus.

As mentioned earlier, we must carefully analyze the addition of any coarse-grained hardware components within our warp configurable logic architecture. We found that there are many operations typically seen within critical software loops, including addition, subtraction, multiplication, etc. Most of these operations are easily implemented using fine-grained configurable logic. However, multipliers that operate within a single cycle are large and require many interconnects within the internal components. Furthermore, while we often see multiplications in critical code regions, they are often in the form of a multiply-accumulate operation. Implementing a multiplier with a small configurable logic fabric is generally slow and requires a large amount of logic and routing resources. Therefore, we include a 32-bit multiplier-accumulator within our configurable logic architecture to help conserve resources while reducing technology mapping and place and route execution times required for multipliers.
Figure 2(a) shows our SCLF consisting of an array of combinational logic blocks (CLB) surrounded by switch matrices (SM) for routing between CLBs. Each CLB is connected to a single switch matrix to which all inputs and outputs of the CLB are connected. We handle routing between CLBs using the switch matrices, which can route signals in one of four directions to an adjacent SM (represented as solid lines) or to a SM two rows apart vertically or two columns apart horizontally (represented as dashed lines).

Choosing the proper size for the CLBs is important, as the CLB size directly impacts area resources and delays within our configurable logic fabric. Several studies have analyzed the impacts of CLB size on both area and timing [6][23]. These studies have shown that look-up tables (LUT) with five or six inputs result in circuits with the best performance, and LUTs with less than three inputs result in significantly worse performance. Another study analyzed the impacts on cluster sizes of CLBs on speed and area of various circuits [22]. The cluster size of a CLB is the number of single output LUTs with the CLB. Their findings indicate that cluster sizes of 3 to 20 LUTs were feasible, and a cluster size of eight produced the best tradeoff between area and delay of the final circuits. However, while we would like to incorporate large cluster sizes within our configurable logic fabric, such clusters allow more flexibility during technology mapping and placement phases during dynamic partitioning, which in turn requires more complex technology mapping and placement algorithms to handle the added complexity.

Figure 2(b) shows our combinational logic block architecture. Each CLB consists of two 3-input 2-output LUTs, which provides the equivalent of a CLB consisting of four 3-input single output LUTs, and therefore should exhibit a reasonable tradeoff between area and delay. We chose 3-input 2-output LUTs to simplify our technology mapping and placement algorithms by restricting the choices our tools will analyze in determining the final circuit configuration. Additionally, the CLBs are capable of supporting carry chains through direct connections between horizontally adjacent CLBs and within the CLBs through internal connections between adjacent LUTs. Hardware components, such as adders, comparators, etc., frequently require carry logic and so providing support for carry chains simplifies the required routing for many hardware circuits.

Finally, Figure 2(c) shows our switch matrix architecture. Each switch matrix is connected using eight channels on each side of the switch matrix, four short channels routing between adjacent nodes and four long channels routing between every other switch matrix. Routing through the switch matrix can only connect a wire from one side with a given channel to another wire on the same channel but a different side of the switch matrix. Additionally, each of the four short channels is paired with a long channel and can be connected together within the switch matrix (indicated as a circle where two channels intersect) allowing wires to be routed using short and long connections. Designing the switch matrix in this manner simplifies the routing algorithm by only allowing the router to route a wire using a single pair of channels throughout the configurable logic fabric.

Commercially available FPGAs consist of similar routing resources but typically are capable of routing between switch matrices much further apart and often include routing channels spanning an entire row or column. While such routing resources are beneficial in terms of creating compact designs with less routing overhead, the flexible routing resources require complex place and route tools that are not amenable to on-chip execution. Therefore, we chose to limit the complexity of routing resources to allow for simplified place and route algorithms.

Finally, we developed a set of dynamic hardware/software partitioning tools, the Riverside On-Chip Partitioning (ROCPAR) tools, including synthesis, technology mapping, placement, and routing, based on the algorithms presented in [25] for our WCLA. While our tools still incorporate the same greedy algorithms of the original tools, we updated the algorithms to take advantage of the improved routing resources and larger CLBs within our SCLF. By designing a configurable logic architecture specifically for dynamic hardware/software partitioning, we have expanded the range the applications that a dynamic hardware/software partitioning approach can support. As shown in Table 1, our WCLA uses much less routing resources, using an average of 22% of the routing resources available for the given examples. Furthermore, for the five examples supported by the original architecture, our WCLA uses on average only 14% of the available resources, which corresponds to a 4X improvement over the original CLA, for the same silicon area.

4. Results

We compare a dynamic hardware/software partitioning using our WCLA with a typical hardware/software partitioning approach targeting a Xilinx Virtex-E FPGA, comparing speedup and energy reduction for 13 embedded systems benchmarks from NetBench, MediaBench, EEMBC, and Powerstone. Our experimental framework consists of an ARM7 processor executing at 75 MHz coupled with either our WCLA or a Xilinx Virtex-E series FPGA. Furthermore, our WCLA executes at a fixed frequency of 60 MHz due to the current implementation of our MAC and DADG, while
the FPGA executes at the highest frequency possible for each design when synthesized and mapped using Xilinx ISE 4.1 [33]. For each benchmark, we determined the single most critical loop and partitioned the critical loop to hardware either using our ROCPAR or synthesizing a custom VHDL implementation of the critical loop. While partitioning a single critical loop produces good results, the speedups would be even greater if we considered multiple loops.

Figure 3 and Figure 4 highlight the speedup and energy reduction of our WCLA and the Xilinx FPGA for all 13 benchmarks. In determining speedups of the two approaches, all software execution times were determined using the SimpleScalar simulator [5] ported for the ARM instruction set. Additionally, we determined execution times for the hardware implementations using a high-level simulator for our configurable logic architecture and using VHDL simulations with the appropriate clock frequency for each implementation determined during synthesis. We calculated the energy required for each partitioned application using the equations in Figure 5. We used the Xilinx Virtex Power Estimator along with information provided by Xilinx ISE to determine total power consumed by the FPGA when active as well as the overall static power. The approach used by the Xilinx Power Estimator consists of providing information including the number of LUTs, number of flip-flops, average switching within the FPGA, and clock frequency to determine the power consumed by the FPGA. We implemented a similar estimation approach to determine the power consumed by our WCLA. We implemented a small version of our configurable logic architecture in VHDL and synthesized the design using Synopsys Design Compiler targeting the UMC 0.18 μm technology library provided by Artisan Components. Using gate-level simulations, we determined the power consumed by individual components within our configurable logic architecture.

Although our WCLA is much simpler than the Xilinx FPGA, on average our WCLA achieved a speedup of 2.1 with an average energy reduction of 33.1%. These results are very close to the average speedup of 2.2 and energy reduction of 36.2% achieved using a Xilinx FPGA. We initially thought that while our configurable logic architecture would not produce results better than a general FPGA, our configurable logic architecture should result in less overall energy consumption compared with an FPGA. However, for several benchmarks, including tblook, ttsprk, matrix01, and idctrn01, our WCLA had a higher energy consumption than the Xilinx FPGA. We determined that for matrix01 and idctrn01, the high energy consumption was mainly caused by the use of our embedded multiplier, which has a higher energy consumption than the dedicated multiplier support within the Xilinx FPGA. Additionally, all four benchmarks had large energy consumption resulting from a large usage of routing resources, indicating the need to develop new place and route algorithms that can run on-chip environment while producing good results — a task we are working on.

We also evaluated our ROCPAR tools, comparing them with Xilinx ISE 4.1. Table 2 displays the average data memory usage and code size in kilobytes and the average execution times in seconds of Xilinx ISE and ROCPAR executing on a 1.4 GHz Pentium workstation. Table 2 also displays execution time in seconds for ROCPAR executing on a 75 MHz ARM7 processor. On average, executing our simplified tools in an embedded environment requires less than 2 seconds, which is quite feasible. Furthermore, the maximum data memory required was on average less than 10 kilobytes. While Xilinx ISE was never designed to execute on-chip, the large data memory requirements indicate that the algorithms and data structures used by these tools are not suitable for on-chip execution either. Thus, new synthesis, technology mapping, and place and route tools are required for a dynamic hardware/software partitioning approach.

**5. Conclusions**

Dynamic hardware/software partitioning represents a far more powerful dynamic optimization than currently proposed dynamic software optimizations, the former achieving 200%-400% performance improvements rather than the typical 10%-20% of the latter — with greater improvements easily possible by partitioning more than one loop. The hardest step of dynamic partitioning is placing and routing onto a configurable logic

![Figure 3: Speedups for HW/SW partitioning using our WCLA and a Xilinx Virtex-E FPGA.](image)

![Figure 4: Percent energy reduction for HW/SW partitioning using our WCLA and a Xilinx Virtex-E FPGA.](image)

![Figure 5: Equation for determining energy consumption after hardware/software partitioning.](image)

### Table 2: Average data memory usage (kilobytes), code size (kilobytes), and execution time (seconds) of Xilinx ISE 4.1 and ROCPAR executing on a PC and a 75 MHz ARM7.

<table>
<thead>
<tr>
<th></th>
<th>Data Memory</th>
<th>Instruction Memory</th>
<th>Execution Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>Xilinx ISE (PC)</td>
<td>54384</td>
<td>58700</td>
<td>9.1</td>
</tr>
<tr>
<td>ROCPAR (PC)</td>
<td>6</td>
<td>57</td>
<td>0.2</td>
</tr>
<tr>
<td>ROCPAR (ARM7)</td>
<td>6</td>
<td>57</td>
<td>1.4</td>
</tr>
</tbody>
</table>
fabric. We have simultaneously designed a new configurable logic architecture and simple configurable logic fabric along with accompanying place and route algorithms specifically intended for dynamic partitioning. Our architecture includes data address generators for accessing up to three distinct arrays in memory, loop control hardware to control loop iterations, a 32-bit multiplier-accumulator, and a configurable logic fabric with simple combinational logic blocks and routing resources. We have shown that our architecture, fabric, and accompanying algorithms result in place and route that is 50 times faster than commercial tools, using 10,000 times less data memory and 1,000 times less code memory, and running in a reasonable time of just a few seconds on a small embedded microprocessor. Yet, we also showed that our architecture and lean tools still obtain comparable speedups to commercial configurable logic and tools, with an average software speedup of 2.1 and energy savings of 33% when partitioning even just one loop – speedups and savings will be even more when additional loops are considered. We are continuing to improve our architecture, fabric, and tool set, to handle a larger set of applications.

6. Acknowledgements

This research was supported in part by the National Science Foundation (CCR-0203829, CCR-9876006) and by the Semiconductor Research Corporation (grant CSR-2002-RJ-1046G).

7. References