# UCRIVERSITY OF CALIFORNIA COMPUTER SCIENCE & RENGINEERING

# Compiling Code Accelerators for FPGAs

### Walid Najjar

*Computer Science & Engineering University of California Riverside* 

### Outline

### Introduction

- From glue logic to accelerators
- FPGA Accelerator Platforms

### 

- The Front-end
- The Back-end
- Applications
- Compilers for FPGA

W. Najjar



UCR

### Outline

UCR

- FPGA Primer
- Historical Evolution
- FPGA: A New HPC Platform?
- Analysis of the Speedup
- Conclusion





























### A historical perspective

The ages of FPGA evolution<sup>‡</sup>

- 1984 -- 1991 Age of Invention
- 1992 -- 1999 Age of Expansion
- + 2000 -- 2007 Age of Accumulation • 2008 -- 2015 Age of Specialization

<sup>‡</sup> Adapted from Steve Trimberger, Xilinx, Keynote address at FPL 2007,

Amsterdam, August 2007.

### W. Najjar

### **Age of Invention**

### 

UCR

- Technology
  - FPGAs used and designed as glue logic chips
  - Tight technology constraints
- Applications
  - FPGAs much smaller than most applications problem size

UCR

- Efficiency on chip is key
- Tools
  - Design automation is secondary



### Age of Accumulation

- Technology
  - FPGAs rapidly climb the process curves
  - Become the cutting edge devices
- Applications
  - Variety of applications drive introduction of
     CPUs, memory, DSP, special arithmetic, high-performance I/O
  - Complete system on chip (first SoCs?)
- Tools
  - Design tools must address system level issues

UCR

W. Najjar

### Next age?

### 

- Specialized acceleration
  - Customized acceleration circuit tailored to a specific code
  - Reconfigurable, static and dynamic
  - Steamed data to/from FPGA
- Applications
  - Image, signal and video processing
  - Security (encryption), intrusion detection

UCR

- Data mining
- Numerical applications

W. Najjar

### Outline

UCR

FPGA Primer

Historical Evolution

- FPGA: A New HPC Platform?
- Analysis of the Speedup
- Conclusion

### **FPGA: A New HPC Platform?**

David Strensky, FPGAs Floating-Point Performance -- a pencil and paper evaluation, in HPCwire.com Comparing a dual core Opteron to FPGA on fp performance: •Opteron: 2.5 GHz, 1 add and 1 mult per cycle. 2.5 x 2 x 2 = 10 Gflops •FPGAs Xilinx V4 and V5 with DSP cores Speed Logic DSP48 BRAM Total (MHz) (slices) (slices) 18bit/36bit Kbits Virtex 4

|           | LX160    | 500 | 67,584 | 96  | 288/0   | 5,185  |
|-----------|----------|-----|--------|-----|---------|--------|
|           | LX200    | 500 | 89,088 | 96  | 366/0   | 6,048  |
|           | Virtex 5 |     |        |     |         |        |
|           | LX220    | 550 | 34,560 | 128 | 384/192 | 6,912  |
|           | LX330    | 550 | 51,840 | 192 | 576/288 | 10,368 |
|           |          |     |        |     |         |        |
| W. Najjar | UCR      |     |        |     |         |        |











### Outline

UCR

22

- FPGA Primer
- Historical Evolution
- FPGA: A New HPC Platform?
- Analysis of the Speedup
- Conclusion

| Analysis of the speedup                                                                                                                                                                            |                                         |  |  |  |  |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------|--|--|--|--|
| 0320000032000003                                                                                                                                                                                   | occesse occes                           |  |  |  |  |
| <ul> <li>Consider a loop</li> <li>N is the number of iterations</li> </ul>                                                                                                                         | $CPUcycles = I \times N \times CPI$     |  |  |  |  |
| <ul> <li>I is the number of CPU instructions<br/>per iteration</li> <li>O is the number of arithmetic or<br/>logic operations per iteration</li> <li>S = I - O is the number of support</li> </ul> | $FPGAcycles = \frac{N}{P} + k$          |  |  |  |  |
| instructions per iteration<br>• Index arithmetic<br>• Loop count<br>• Control operations<br>• Load and store                                                                                       | $Speedup = rac{CPUcycles}{FPGAcycles}$ |  |  |  |  |
| <ul> <li>Loop is mapped on FPGA unrolled P<br/>times</li> <li>k is number of stages in loop body<br/>pipeline on FPGA. Assume N/P &gt;&gt; k</li> </ul>                                            |                                         |  |  |  |  |
| W. Najjar UCR                                                                                                                                                                                      | 23                                      |  |  |  |  |



### **Inefficiency factor**

| Benchmarks            | CPU         | Ratio of iteration<br>Level parallelism | Inefficiency<br>factor |
|-----------------------|-------------|-----------------------------------------|------------------------|
| Prewitt edge          | MIPS        | 8                                       | 8.64                   |
| detection             | Pentium     | 8                                       | 6.19                   |
|                       | VLIW        | 2                                       | 7.02                   |
| Wavelet<br>transform  | MIPS        | 8                                       | 12.5                   |
|                       | Pentium     | 8                                       | 7.51                   |
|                       | VLIW        | 2                                       | 15.0                   |
| Max filter            | MIPS        | 8                                       | 46.7                   |
|                       | Pentium     | 8                                       | 26.3                   |
|                       | VLIW        | 2                                       | 27.6                   |
| From Guo et al. in 3  | 004 Sump Or | EPCA February 2004                      |                        |
| r rom Guo et al. In 2 | Gynip. Of   | TT GA, Footaaly 2004                    |                        |



### **Support instructions**

W. Najjar

|          |            | FPGA   | MIPS                    | Pentiu<br>m | VLIW   | Ratio<br>range |  |
|----------|------------|--------|-------------------------|-------------|--------|----------------|--|
|          |            | Memo   | Memory operations/pixel |             |        |                |  |
| Prewit   | t Load     | 0.125  | 8                       | 13          | 8      | 64 - 124       |  |
|          | Store      | 0.125  | 1                       | 7           | 1      | 8 - 56         |  |
| Wavele   | t Load     | 0.125  | 12                      | 14          | 8.75   | 96 - 112       |  |
|          | Store      | 0.125  | 7                       | 7           | 1      | 8 - 56         |  |
| Max filt | er Load    | 0.125  | 9                       | 9           | 9      | 72             |  |
|          | Store      | 0.125  | 1                       | 1           | 1      | 8              |  |
| Ratio    | of mem ops | on CPU | to mem                  | i ops on F  | PGA: 8 | to 124         |  |
| ar       |            |        | UCR                     |             |        |                |  |





### Outline

- FPGA Primer
- Historical Evolution
- FPGA: A New HPC Platform?
- Analysis of the Speedup
- Conclusion

W. Najjar

### **Summary of Advantages**

UCR

- Very large degree of on chip parallelism
  - 100s of concurrent iterations
  - Assuming enough memory or I/O bandwidth to supply data
- Very large on-chip storage
  - Reduces the pressure on memory bandwidth
    Fewer support instructions

UCR

- More efficient computations
  - Variable bit-width datapath
  - Table lookup for small operations
  - (known at compile time)







### Model 1

### 

UCR

- Embedded hard or soft CPU(s)
  - Xilinx: PPC400, MicroBlaze or PicoBlaze
    Altera: NIOS or NIOS II
- On chip bus
  - Interface to external memory via FPGA
- Advantage: cost, size and power
- Embedded systems

W. Najjar



UCR

### Model 3

### 

- Medium to high-end HPC systems
  - Very fast network
  - Large number of FPGAs
  - Very large memory
- Examples

W. Najjar

- Cray XD1 (defunct)
- SGI Altix 4700 with RASC blade
- SRC (new SRC 7)
- Cray XT3 with DRC modules (future)

UCR

### **Accelerator Platforms**

- SGI Altix 4700
  - Shared memory machine, fast interconnect: 12.8 GB/sec

UCR

- Itanium 2, 1.6 GHz •
- RASC RC100 Blade: 2 Virtex 4 LX200
- Memory size independent of number of CPUs Xtremedata XD1000
  - Altera Stratix II drop-in for AMD Opteron
  - Integrated interface to Hypertransport
     16 bits @ 800 M transfers/sec

  - Memory interface
  - 128 bits DDR-333up to 4 x 4 GB ECC
  - Flash memory
     For FPGA configuration or data









### **RASC Interfaces** Three mechanism Address shared memory: One page Direct I/O to local SRAM: Double buffered Streaming TIO ASIC connecting FPGA to external System via NUMAlink GB/s 72 72 GB/s M E M M E M Core Services 36 36 Algorithm 0

3.2 GB/s @

UCR

16MB

W. Najjar

Virtex-4 LX 200

3.2 GB/s 200MHz

: 16MF





























# BEE3 Highlights A Xilinx Virtex 5 V5 is a major improvement (65nm) 6-input LUT (64 bit DP RAM) Better Block RAMs Improved interconnect Better signal integrity 8 Infiniband/CX4 channels 4 x8 PCI Express low profile slots











### ROCCC

### **Riverside Optimizing Compiler for** Configurable Computing

Code acceleration

- By mapping of circuits to FPGA
- Achieve same speed as hand-written VHDL codes Improved productivity

  - Allows design and algorithm space exploration
- Keeps the user fully in control
  - We automate only what is very well understood

UCR





### Focus

### 

- Extensive compile time optimizations
  - Maximize parallelism, speed and throughput
  - Minimize area and memory accesses

### Optimizations

- Loop level: fine grained parallelism
- Storage level: compiler configured storage for data reuse

UCR

• Circuit level: expression simplification, pipelining

5





UCR

- Extensive compiler optimizations and transformations
- Analysis and hardware support for data reuse
- Efficient code generation and pipelining
- Import of existing IP cores
- Support for dynamic partial reconfiguration



























### Introduction

- The candidate codes to be mapped to HW are the most frequently executed loops
- Candidate loop bodies should conform to the following specifications:
  - No function calls that cannot be inlined
  - No pointers that cannot be de-aliased
  - No break, continue, switch/case, jump, goto statements

UCR

- Simple for loop headers as in:
- for(i = some\_lower\_boundl; i< some\_upper\_bound; i=i+step)</pre> Unroll counts should perfectly divide the loop trip count

W. Najjar







### **ROCCC's Restrictions**

- Does not compile arbitrary C code
  - No pointers
  - No explicit control statements
  - i.e. break, continue, return, exit, etc.
  - No function calls except ROCCC recognized macros e.g. ROCCC\_min, ROCCC\_max, ROCCC\_create\_lookup\_table)
    - ROCCC tries to inline all function calls except the macros recognized by ROCCC
  - Simple loop headers
    - loop lower bound, upper bound and step counts are compile time knows constants
  - Unroll counts should perfectly divide the loop trip count

UCR

 All array index expressions are in form loop\_counter+/step\_count

W. Najjar

### **User Interface**

### 

- Designating the candidate loop nest
- •begin\_hw() end\_hw() : dummy calls
- Loop labels:
- •to identify the loop for transformations to be applied to that particular loop
- Specifying the transformations
- •ROCCC does not employ an automatic parallelizer
- •User stays in control, specify transformations for each loop The circuit size
   Choice of hardware-efficient algorithm for the mapped software
   Performance/throughput of the generated circuit
- •.pass file: A text file that lists the transformations that are to be
- applied to the candidate loop nest through loop labels

UCR

W. Najjar

### User Interface: .pass file Allows the user decide on the loop level transformations & the parameters to these transformations • fully unroll <loop\_label>|<max\_iteration\_count> • partially unroll <loop\_label> <unroll\_factor> • generate tile <loop\_label1> <loop\_label2> <tile\_size1> <tile\_size2> • generate systolic array <loop\_label1> <loop\_label2> <systolic\_array\_size>

UCR

| Use                          | r Interface: FIR                                                                                                      |   |
|------------------------------|-----------------------------------------------------------------------------------------------------------------------|---|
| 0380                         | 199901589990158999015899991                                                                                           | Þ |
| begin_1<br>Ll: for<br>end_hw | hw();<br>c(i=0; i<=511; i=i+1)<br>B[i] = T[0]*A[i] + T[1]*A[i+1] + T[2]*A[i+2] +<br>T[3]*A[i+3] + T[4]*A[i+4];<br>(); |   |
|                              | .pass file                                                                                                            |   |
|                              | L1: partially unroll 15                                                                                               |   |
|                              |                                                                                                                       |   |
| W. Najjar                    | UCR                                                                                                                   | 7 |





### **Compiler Transformations**

### 

Pre-Optimization Passes Array Access •Switch Case to If Statements Transformations •Do While to While Statement Constant Propagation of "const" qualified Arrays Inlining Pass •Control Flow Analysis RAW/WAW Elimination Data Flow Analysis Scalar Replacement •Use/Def & Def/Use Chain •Feedback Array Access Builder Elimination +Lookup Table Expansion Array Renaming •DFA Expansion +Systolic Array Generation

**Compiler Transformations** 

### 

UCR

- Global Transformations
   Constant Propagation, Folding and Elimination of Algebraic Identities
- Identities •Code Hoisting &Sinking •Copy & Reverse Copy Propagation
  - Loop Fusing
     Loop Interchanging

Loop Level

Transformations •Partial & Full Loop Unrolling

- Loop Invariant Code Motion
  - Loop Peeling
    - Loop Normalization
    - Loop Tiling / Strip Mining
       Loop Unswitching

7

 \*Division & Multiplication By Constants Approximation
 \*Loop Un

UCR

If Conversion
Reduction Parallelization

•Common Subexpression Elimination

•Dead & Unreachable Code Elimination

Scalar Renaming

W. Najjar

W. Najjar

Phase Order of Transformations
1) Preprocessing Passes
2) Global & Loop Level Transformations
3) Array Transformations
4) Global Transformations
5) Output (Hi-CIRRF) Generation

### **GCC Preprocessing Passes**

### acc\_preprocess

- Adjusts the gcc generated SUIF code to conform to the SUIF's internally expected format
- Translates gcc AST to SUIF AST

### dismantle\_call\_expressions

- Exists for backward compatibility reasons
  - If not called, the gcc generated SUIF code would not work w/ the SUIF optimizations generated pre-gcc front end

**Preprocessing Passes** 

### 

UCR

- preprocessing\_hw\_sw\_boundary\_mark
   Marks the code section in between the begin\_hw() and end\_hw() calls in the source code to be sent to HW
- preprocessing\_roccc\_inline
   Inlines all function calls in C leaving only the calls to ROCCC defined macros
- preprocessing\_dowhile\_to\_while\_transform
   Converts do while() statements to equivalent if while() statement sequence

UCR

### preprocessing\_LUT\_decs

• Processes the user defined create\_lut(<const int array>) to later generate an LUT definition in HI-CIRRF

7

W. Najjar

W. Najjar

W. Najjar

### **Control Flow Analysis**

### control\_flow\_solve

- Builds control flow graph over the SUIF IR using annotations
- Should be called prior to any global optimization passes
  Preceding passes
- None

### control\_build\_loop\_info

- Light weight version of control\_flow\_solve
- Should be called prior to any loop optimization pass
- Preceding passes

None

UCR

### **Data Flow Analysis**

### ataflow\_solve

- Builds the dataflow graph for later transformations
  - Should be called prior to any global optimization passes
    Preceding pass
  - control\_flow\_solve

### ud\_du\_chain\_builder

- Builds the use/def-def/use chains for further
- optimizations

W. Najjar

Should be called prior to any global optimization passes
Preceding pass

UCR

dataflow\_solve

- Global Transformations

  Global\_constant\_propagate

  Propagates constants over the SUIF IR

  Preceding pass

  ud\_du\_chain\_builder

  Global\_constant\_fold

  Folds the constants w/ in SUIF expressions
  Eliminates algebraic identities
  - Should be used alternating w/ the global\_constant\_propagate
     Preceding pass

     ud\_du\_chain\_builder

UCR









## 









| Example:                                              | Dead Code Elimination                                                                                                 |
|-------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| If a variab<br>is last def<br>or until a<br>is remove | e (a) is never used from the time it<br>ned until either the program ends<br>s redefined, the unused definition<br>l. |
| c =;<br>a = a+5;<br>return c;                         | a = a+5;<br>b = c+7;<br>a = b+2;                                                                                      |
| W. Najjar                                             | UCR                                                                                                                   |



































| Example: Lo                                                                                              | op fusio                                       | n                                                                                                 |
|----------------------------------------------------------------------------------------------------------|------------------------------------------------|---------------------------------------------------------------------------------------------------|
| 0380000038                                                                                               |                                                | 00000000000000                                                                                    |
| Loop Fusion help<br>loop overhead<br>combining the b<br>loop.                                            | os reduce redu<br>and redund<br>oodies of mult | undancy by eliminating<br>ant computations by<br>iple loops into a single                         |
| for i1=1 to n do<br>A[i1] = A[i1] + k<br>endfor<br>for i2=1 to n-1 do<br>D[i2] = A[i2] * B[i2]<br>endfor | Peeling + fusion                               | A[1] = A[1] + k<br>for i3=1 to n-1 do<br>A[i3+1] = A[i3+1] + k<br>D[i3] = A[i3] * B[i3]<br>endfor |
| W. Najjar                                                                                                | UCR                                            |                                                                                                   |









### **Output Generation**

Several passes help generate the output and the sequence of these passes should stay the same as given in the gen-datapath under the utils directory
 The only exception is for systolic array generation. To enable call:

 array\_feedback\_eliminate 1

UCR

To disable: array\_feedback\_eliminate 0










































# **Lookup Tables Applications**

- Arrays of run-time constants
- Non linear access to arrays
- Found in applications such
  - Cryptography: Key dependent substitution boxes
  - Image processing: Color palettes
  - Bioinformatics : Score matrices for proteins
  - Scientific computing: Retrieving the sine of a number from a table instead of computing it each time

UCR

W. Najjar

#### Problem

#### 

- In software such accesses occur in expressions of the from
  - Table[ Input\_stream[ index\_expr ] ]:
- ROCCC compiles the above expression by creating a hardware LUT for it.
- ROCCC only assumes that:
  - Table is a const qualified array
  - whose values are known at compile time (for now)

UCR

User code

W. Najjar

# 

#### Identify the array

- ROCCC\_create\_lookup\_table(int id, ...)
- Placed prior to the loop body where the actual array access occurs
- Access the array
  - ROCCC\_lookup\_in\_table(int id, ...)
  - Looks up the input stream contents in the table defined by the ROCCC\_create\_lookup\_table call

UCR

# **Types of LUTs supported**

01589390158939015893901589390

- 1D LUT
   Cryptography applications as key dependent substitution boxes
  - Trigonometric functions in scientific applications.
  - Table [ Input\_stream [ index\_expr ] ]
- ID LUT w/ CAM
  - Bioinformatics applications where the input stream contents does not directly address the table itself

UCR

- if(input\_stream[index\_expr] == 'A') return Table[1];
   The above supression can be reduced to a single
- The above expression can be reduced to a single ROCCC\_create\_lookup\_table call in ROCCC

W. Najjar

# Types of LUTs supported 2D LUT: Table [Input\_stream1 [index\_expr1]][ Input\_stream2 [index\_expr2]]

- 2D LUT w/ CAM
  - Found in bioinformatics applications, for instance the lookup operations to BLOSUM and PAM matrices fall in this category.
  - if(S[index\_expr] == 'A' && T[index\_expr] == 'T') return Table[1][3];
  - The above expression again is reduced to a single ROCCC\_create\_lookup\_table call in ROCCC

UCR





































| Co        | Comparison - Clock Rate |        |       |                         |                                                        |  |  |  |  |
|-----------|-------------------------|--------|-------|-------------------------|--------------------------------------------------------|--|--|--|--|
|           | Code                    | Xilinx | ROCCC | %Clock                  | Comparable                                             |  |  |  |  |
|           | bit_correlat<br>or      | 212    | 144   | 0.679                   | clock rates                                            |  |  |  |  |
|           | mul_acc                 | 238    | 238   | 1.000                   |                                                        |  |  |  |  |
|           | udiv                    | 216    | 272   | 1.259                   |                                                        |  |  |  |  |
|           | square root             | 167    | 220   | 1.317                   |                                                        |  |  |  |  |
|           | cos                     | 170    | 170   | 1.000                   | 1                                                      |  |  |  |  |
|           | FIR                     | 185    | 194   | 1.049                   |                                                        |  |  |  |  |
|           | DCT                     | 181    | 133   | 0.735                   |                                                        |  |  |  |  |
|           | Wavelet*                | 104    | 101   | 0.971                   | 1                                                      |  |  |  |  |
|           | (* hand written VH      | IDL)   |       | Xilinx I<br>Xilinx Virt | •<br>SE 5.1i and IP core 5.1i<br>ex-II xc2v2000-5 FPGA |  |  |  |  |
| W. Najjar |                         |        | UCR   |                         | 13<br>0                                                |  |  |  |  |

Г



|              | _         |       |              |         |
|--------------|-----------|-------|--------------|---------|
| Code         | Xilinx IP | ROCCC | %Area(slice) | Averag  |
| bit_correlat | 9         | 19    | 2.11         |         |
| or           |           |       |              | area    |
| mul_acc      | 18        | 59    | 3.28         |         |
| udiv         | 144       | 495   | 3.44         | factor: |
| square root  | 585       | 1199  | 2.05         | luctori |
| cos          | 150       | 150   | 1.00         |         |
| FIR          | 270       | 293   | 1.09         |         |
| DCT          | 412       | 724   | 1.76         |         |
| Wavelet*     | 1464      | 2415  | 1.65         |         |

























# **Reduction in memory accesses**

# Up to 98% eliminated

- Data reused in smart buffer for sliding window applications
- Only data that is re-fetched is the bottom row(s) of a window

#### Automated

- Compiler generates smart buffer based on
  - Window size in x and y
  - Stride of window in x and y
  - Number of data values (pixels) per word fetched

UCR

Fetch bandwidth into FPGA















































































# Smart Buffer Performance

|            |                   | constant | variable | complex | 2D_lowpass | motion    |
|------------|-------------------|----------|----------|---------|------------|-----------|
|            |                   | FIR      | FIR      | FIR     | filter     | detection |
| Terrent    | Area (slices)     | 156      | 159      | 132     | 325        | 327       |
| Input      | # of regs         | 5        | 5        | 6       | 16         | 16        |
| buffer     | # of states       | 14       | 14       | 8       | 18         | 18        |
| A          | Bus size (bits)   | 8        | 8        | 16      | 16         | 16        |
| Terrent    | Area (slices)     |          | 159      |         |            | 150       |
| huffer     | # of regs         |          | 5        |         |            | 4         |
| Duffer     | # of states       |          | 14       |         |            | 4         |
| Б          | Bus size (bits)   |          | 8        |         |            | 16        |
| Outrast    | Area (slices)     | 11       | 11       | 12      | 73         | 73        |
| buffer     | # of regs         | 1        | 1        | 2       | 2          | 2         |
| C          | # of states       | 1        | 1        | 2       | 2          | 2         |
| C          | Bus size (bits)   | 8        | 8        | 8       | 16         | 16        |
| Die d      | Area (slices)     | 43       | 5 mltpl  | 99      | 144        | 164       |
| Data-path  | Bit size          | 8        | 8        | 8       | 8          | 8         |
| Overall    | area (slices)     | 210      | 329      | 243     | 542        | 714       |
| Clock      | rate (MHz)        | 94       | 68       | 85      | 69         | 42        |
| Execution  | n time (cycles)   | 262      | 1019     | 260     | 5980       | 5986      |
| Throughput | (iteration/cycle) | 0.96     | 0.25     | 0.48    | 0.16       | 0.16      |
|            |                   |          |          |         |            |           |
|            |                   | L        | ICR      |         |            |           |































| xperimental Results |         |              |        |      |       |           |  |  |
|---------------------|---------|--------------|--------|------|-------|-----------|--|--|
|                     |         |              | Cordic | DCT8 | FFT16 | RS-encode |  |  |
|                     |         | Area (slice) | 2      | 55   | 532   | 53        |  |  |
| In                  | put     | area (%)     | 0.3    | 6.7  | 24    | 64        |  |  |
| VVIC                | wiappei | addtl. cycle | 1      | 1    | 1     | 1         |  |  |
|                     | Output  | Area (slice) | 2      | 426  | 290   | 9         |  |  |
| Ou                  |         | area (%)     | 0.3    | 52   | 13    | 11        |  |  |
| VVI C               | ippei   | addtl. cycle | 1      | 1    | 1     | 1         |  |  |
| -                   |         | area (slice) | 663    | 817  | 2183  | 83        |  |  |
|                     | otal    | clock (MHz)  | 123    | 68.7 | 45.0  | 96.4      |  |  |
| Ch                  | circuit | total cycles | 23     | 23   | 200   | 20        |  |  |
| Nailar IICD         |         |              |        |      |       |           |  |  |



# **Dynamic Partial Reconfiguration**

#### Wrapped cores

- Are well defined and bounded entities
- Multiple cores can share a same wrapper
- Ideal set-up for dynamic partial reconfiguration

# Dynamic Partial Reconfiguration

- Core selection under software control
- With compiler support
- Implemented with JTAG and SelectMAP

# DPR Results

W. Najjar

#### 

UCR

| Design               | No.<br>slices | Bit stream<br>size<br>(Kbits) | Prog. time<br>JTAG<br>(ms) | Prog. time<br>SelectMAP<br>(ms) |   |
|----------------------|---------------|-------------------------------|----------------------------|---------------------------------|---|
| Static configuration | 13699         | 1415                          | 2318                       | 4                               | 5 |
| DCT8 partial         | 378           | 216                           | 354                        | 7.                              | 3 |
| FFT8 partial         | 512           | 426                           | 698                        | 14.                             | 3 |
| iiio partiai         | 512           | 420                           | 098                        | 14.                             | 5 |
|                      |               |                               |                            |                                 |   |
|                      |               |                               |                            |                                 |   |
|                      |               |                               |                            |                                 |   |
|                      |               |                               |                            |                                 |   |
| √ajjar               |               | UCR                           |                            |                                 |   |









# **NAMD** Computation

- Two types of forces
  - Van Der Waal and Electrostatic
- Two different calculations of forces
   Bonded
  - Forces between atoms in the same molecule
     Nonbonded
    - Forces between atoms in different molecules

UCR

#### W. Najjar

# Main Computation Loop for each timestep { for each atom { for every other atom { sum electrostatic forces sum Van Der Waal forces } } W. Najjar UCR

# NAMD's Optimizations Computations for reasonable timeframes could take weeks to months At some distance, the contributions of forces due to atoms becomes insignificant Apply distance cutoffs Apply periodic cutoffs Only calculate forces every N timesteps Results in 60 different variations of the innermost loop with slight differences to calculations performed























| NAMD R | esults |
|--------|--------|
|--------|--------|

#### 

| FPGA<br>Implementation |         | Speedup over<br>Itanium 2 1.6 GHz |
|------------------------|---------|-----------------------------------|
| Single<br>precision    | 149 MHz | 799.4                             |
|                        | 100 MHz | 535.6                             |
| Double<br>precision    | 168 MHz | 450.1                             |
|                        | 100 MHz | 267.8                             |

#### FPGA:

| <u>Itanium:</u>                          | FPGA:                         |
|------------------------------------------|-------------------------------|
| <ul> <li>Ideal: one full EPIC</li> </ul> | •Enough bandwidth for single  |
| instruction/cycle                        | precision                     |
| <ul> <li>Measured: actual</li> </ul>     | •Double precision: two cycles |
| execution time                           | for data for each iteration   |
|                                          |                               |

UCR



# **Memory Bandwidth Issues**

#### 

- Memory is the bottleneck
  - Single precision requires 48 bytes per cycle
  - Double precision vectors requires 96 bytes per cycle
  - SGI-RASC can feed Single precision once per cycle • We need 5.856 GB/s for single precision
  - Double precision needs two cycles to collect data, before iteration can start

UCR

W. Najjar

W. Najjar



UCR

W. Najjar

63





#### Smith-Waterman Code

- Dynamic Programming
  - Used in protein modeling, bio-informatics, data mining ...
  - A wave-front algorithm with two input strings A[i,j] = F(A[i,j-1], A[i-1,j-1], A[i-1,j])

F = CostMatrix(A[i,0],A[0,j])

#### Our Approach

- "Chunk" the input strings in fixed sizes k
- Build a k x k template hardware by compiling two nested loops (k each) and fully unrolling both.
- Host strip mines the two outer loops over this

UCR

template.



























































| Feedback Store | Elimination                                                                                                                                                                                    |         |
|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
|                | <pre>for(i=1; i<n; <="" a00="A[i-1](j-1];" a01="A[i-1](j];" a10="A[i](j-1];" a20="A[i+1](j-1];" for(j="1;" i="i+k)" j="j+1)" j<n;="" s0="S[j-1];" t0="T[i-1];" th="" {=""><th></th></n;></pre> |         |
| W. Najjar      | UCR                                                                                                                                                                                            | 20<br>8 |


































| Carriele Websamers |                      |         |            |            |            |  |  |  |  |
|--------------------|----------------------|---------|------------|------------|------------|--|--|--|--|
|                    | Xeon Itanium V4LX200 |         |            |            |            |  |  |  |  |
| Cells              |                      |         | 200        | 512        | 1024       |  |  |  |  |
| Area (%)           |                      |         | 2.8%       | 6.9%       | 17%        |  |  |  |  |
| Clock              | 2.8<br>GHz           | 1.6 GHz | 191<br>MHz | 188<br>MHz | 174<br>MHz |  |  |  |  |
| GCUPS              | 0.049                | 0.084   | 38.2       | 96.25      | 178.64     |  |  |  |  |
| Speedup            | 1                    | 1.7     | 780        | 1964       | 3645       |  |  |  |  |

Γ









|         | Dynamic Time warping |           |                |                |  |  |  |
|---------|----------------------|-----------|----------------|----------------|--|--|--|
|         | xeon                 | Itanium 2 | V4LX200        |                |  |  |  |
|         |                      |           | 256 cells, 73% | 512 cells, 61% |  |  |  |
|         |                      |           | 24 bit data    | 12 bit data    |  |  |  |
| Clock   | 2.8 GHz              | 1.6 GHz   | 42 MHz         | 52 MHz         |  |  |  |
| GCUPS   | 0.028                | 0.071     | 10.8           | 26.6           |  |  |  |
| Speedup | 1                    | 2.5       | 385            | 950            |  |  |  |
|         |                      |           |                |                |  |  |  |
|         |                      |           |                |                |  |  |  |
|         |                      |           |                |                |  |  |  |



## **Examples**

## ~`\$\$\$\$@\$~\$\$\$\$@\$~\$\$\$\$@\$~\$\$\$\$@\$@

- Molecular dynamics
  - NAMD code
- Bioinformatics
  - Using Smith-Waterman, a dynamic programming
    Similar: dynamic time warping, motif discovery
- Networking (Virtex II Pro)
  Intrusion detection using Bloom Filter
  Probabilistic exact string matching
- PCRE Matching
  - Perl Compatible Regular Expressions

W. Najjar

## **Bloom filter**

UCR

- Is a data structure used to test set membership of an element
- Bloom filter has an array of N elements all of which are set to '0' initially
- The members of the set are inserted to the filter using multiple hash functions.
- Each hash function returns a unique value in the range of 0 to N-1.

UCR

During insertion, all locations returned by the hash function are set to 1.

W. Najjar



## **Bloom filter for virus detection**

- Ours is the first bloom filter based virus detection code automatically generated from C.
- Each Signature Processing Engine (SPE) contains the generated bloom filter code and is used to detect signatures
   Bloom filter output contains false positives. Hence a RAM is
- Bloom filter output contains faise positives. Hence a RAM is used for absolute string comparison and to eliminate false positives.



22









UCR

W. Najjar



























