TARCAD: A Template Architecture for Reconfigurable Accelerator Designs
Muhammad Shafiq, Miquel Pericàs, Nacho Navarro, Eduard Ayguadé
Barcelona Supercomputing Center and Dept. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain
{muhammad.shafiq, miquel.pericas}@bsc.es, nacho@ac.upc.edu, eduard.ayguade@bsc.es

Abstract—In the race towards computational efficiency, accelerators are achieving prominence. Among the different types, accelerators built from reconfigurable fabric, such as FPGAs, have tremendous potential due to the ability to customize the hardware to the application. However, the lack of a standard design methodology hinders the adoption of such devices and makes portability and reusability across designs difficult. In addition, the generation of highly customized circuits does not integrate nicely with high level synthesis tools. In this work, we introduce TARCAD, a template architecture for designing reconfigurable accelerators. TARCAD enables high customization of the data management and compute engines while retaining a programming model based on generic programming principles. The template features generality and scalable performance over a range of FPGAs. We describe the template architecture in detail and show how to implement five important scientific kernels on it: MxM, the Acoustic Wave Equation, FFT, SpMV and Smith-Waterman. TARCAD is compared with other High Level Synthesis models and is evaluated against GPUs, an architecture that is far less customizable and, therefore, easier to target from a simple and portable programming model. We analyze the TARCAD template and compare its efficiency on a large Xilinx Virtex-6 device to that of several recent GPU studies.

I. INTRODUCTION

The integration levels of current FPGA devices have advanced to the point where all functions of a complex application kernel can be mapped onto a single chip. However, these high density FPGAs appear simply as a sea of logic slices and embedded cores such as general purpose processors, multipliers/adders, multi-ported SRAMs and DSP slices. Currently, everything depends on the FPGA application designer and how well he maps an application to the device. This practice is problematic for several reasons. First, it is a low-level approach that requires a great deal of effort to map the complete application. Second, reusability of modules across projects is significantly reduced. And, last but not least, it is difficult to scientifically compare hardware implementations that adhere to different high-level organizations and interfaces. This emphasizes the need to abstract out these particular hardware structures in a standard architectural design framework. Most of the studies that have ported applications to multiple accelerator architectures (for example, Cope et al. [1], Garland et al. [2] or Shafiq et al. [3]) identify two factors as the most critical ones for achieving high performance for an application. The first factor is the intrinsic

parallelism available in the algorithm being mapped on the accelerator. The second factor is how efficiently the designer arranges the data to be fed to the computational resources. FPGAs have the potential to exploit both of these factors in a highly optimized way. However, FPGAs will not become mainstream accelerators if they are unable to overcome the long-standing challenge of implementing applications in a well defined, simple and efficient way. A plethora of application kernels from the HPC domain have been ported to reconfigurable devices, but most designs are specialized to a single environment due to the lack of a standard design methodology. This work is a step towards the harmonization of data-flow architectures for FPGA-based applications written in HDLs (e.g. Verilog, VHDL) and High Level Languages (HLLs). The architectures generated by HLL-to-HDL/netlist tools (such as ROCCC [4] or GAUT [5]) also follow a simplified and standardized compilation target, but they have been designed specifically as compiler targets, which limits their applicability for HDL designers. In addition, these models are too constrained to support the complex memory organizations or unorthodox compute engines which are often required to best exploit FPGAs. This work proposes an architectural template named TARCAD that allows FPGAs to be exploited efficiently while being supported by a simple programming methodology. TARCAD not only enables HDL designers to work on a highly customizable architecture, it also defines a set of interfaces that make it attractive as a target for an HLL-to-HDL compilation infrastructure. This paper discusses the generic architectural layout of the TARCAD template for reconfigurable accelerators. The proposed architecture is based on decoupling the computations from the data management of the application kernels, a concept reminiscent of Smith's Decoupled Access Execute (DAE) architectures [6]. This makes it possible to independently design specialized architectures for both parts of the kernel in a data-flow envelope supported by our architectural layout. Computation scales depending on the size of the FPGA or the achievable bandwidth from the specialized memory configuration that feeds the compute part. We evaluate the architectural efficiency of an FPGA device for several applications using TARCAD and compare it with GPUs. This is an interesting comparison because both platforms require applications with data level parallelism and kernels free of control divergence.

Figure 1. The compute models: (a) a generic GPU, (b) ROCCC, (c) GAUT.

II. THE TARCAD ARCHITECTURE

A. Accelerator Models for Supercomputing

The TARCAD proposal targets both HDL accelerator designers, by providing them with a standard accelerator design framework, and High Level Synthesis (HLS) tool developers, by giving them a standard layout onto which to map applications. HLS tools define an architectural framework into which they map algorithmic descriptions. The compute models of two such tools, ROCCC [4] and GAUT [5], are shown in Figure 1(b) and (c), together with the compute model of GPUs in Figure 1(a). The basic compute model of ROCCC requires streaming data input from an external host. This data is stored in smart buffers before being consumed by the compute units and again before being sent back to main memory. The GAUT architecture, on the other hand, provides an external interface to access data based on data pointers. The memory model of GAUT is simple and can keep large chunks of data using BRAM as buffer memory. Another architecture that is highly popular nowadays, the Graphics Processing Unit (GPU), uses thread indexes to access data in up to five dimensions. A large number of execution threads helps to hide external memory access latencies by allowing threads to execute based on data availability.

The TARCAD architectural layout provides a generic design framework to map application specific accelerators onto reconfigurable devices. The micro-architectural details of the TARCAD layout are presented in Figure 2. As the figure shows, the TARCAD layout can be partitioned into four main blocks and their constituent sub-blocks. A detailed description of these main blocks (External Memory Interface, Application Specific Data Management Block, Algorithm Compute Back-End and the Event Managing Block) follows.

B. The External Memory Interface

In general, accelerators work on large contiguous data sets or streams of data. However, data accesses within a data set, or across multiple data sets of an algorithm, are not always straightforward. Therefore, accelerators can be made more efficient by providing some external support to manage data accesses in a more regular way. TARCAD supports a Programmable Memory Controller (PMC) as an external interface to the main memory. This controller is inspired by the work of Hussain et al. [7]. It helps to transfer pattern-based chunks of data between the accelerator and the global memory. Among other features, the PMC improves accelerator kernel performance by providing programmable strided accesses. This makes it possible for the PMC to directly handle 1D, 2D and 3D tiling of large data sets rather than doing the same in software on the host processor.
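To illustrate the kind of access pattern the PMC can take over from the host, the short Python sketch below generates the strided request addresses covering one 2D or 3D tile of a larger row-major array. It is a minimal software model written for illustration only; the function name, parameters and request format are assumptions made here, not the actual PMC programming interface.

    def tile_requests(base, dims, elem_size, tile_origin, tile_shape):
        """Generate (address, length) requests covering one tile of a row-major
        1D/2D/3D array. dims, tile_origin and tile_shape are (z, y, x) tuples;
        each request covers one contiguous x-run of the tile."""
        dz, dy, dx = dims
        oz, oy, ox = tile_origin
        tz, ty, tx = tile_shape
        run_bytes = tx * elem_size
        for z in range(oz, oz + tz):
            for y in range(oy, oy + ty):
                addr = base + ((z * dy + y) * dx + ox) * elem_size
                yield (addr, run_bytes)

    # Example: a 64x64x64 tile at origin (0, 64, 128) of a 512^3 volume of
    # 4-byte elements (illustrative values only).
    reqs = list(tile_requests(0x0, (512, 512, 512), 4, (0, 64, 128), (64, 64, 64)))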

Figure 2. TARCAD architectural layout.

C. The Application Specific Data Management Block

TARCAD's application specific data management block helps to arrange data for efficient usage inside the computations. This block consists of four sub-blocks, identified in Figure 2 as the Data-Set (DS) Manager, the Configurable Memory Input Control, the Algorithm Specific Memory Layout and the Programmable Data Distributer. Among these sub-blocks, the Algorithm Specific Memory Layout (mL) plays a central role in designing an efficient accelerator by providing re-arrangement and reuse of data for the compute blocks. A memory layout can be common to various applications, as shown by Shafiq et al. [8]. TARCAD could adopt such a common memory layout as well, but in this paper we only consider memory layouts that are customized per application using the block RAMs (BRAMs) of the device. The pattern of writing data into a customized memory layout can be very different from the pattern of reading from it. A simple example is the memory layout for the FFT (decimation in time) architecture, where data is written sequentially but read out in bit-reversed order. Therefore, TARCAD keeps separate write and read interfaces (CFG MEM-IN-CONTROL and the Programmable Data Distributer) to the memory layout block, as shown in Figure 2. The configurable memory input control (CFG MEM-IN-CONTROL) is used to write data to the memory layout. It is based on a finite state machine (FSM) and works according to a preset design. This memory input control expects various streams of independent data sets through the streaming FIFO channels (DS-ix). Each DS-ix can have multiple sub-channels to consume the peak external bandwidth; however, all sub-channels within a DS-ix represent the same data set.
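As a concrete illustration of the decoupled write and read patterns mentioned above, the following Python sketch generates the two address sequences for the FFT (decimation in time) example: sequential writes into the memory layout and bit-reversed reads out of it. It is only an illustrative software model under the assumption of a power-of-two transform size; it does not describe the actual CFG MEM-IN-CONTROL or Data Distributer implementation.

    def bit_reverse(index, bits):
        """Reverse the lowest `bits` bits of `index`."""
        rev = 0
        for _ in range(bits):
            rev = (rev << 1) | (index & 1)
            index >>= 1
        return rev

    def fft_layout_addresses(n):
        """Write addresses are sequential; read addresses are bit-reversed
        (decimation-in-time input ordering). n must be a power of two."""
        bits = n.bit_length() - 1
        write_order = list(range(n))
        read_order = [bit_reverse(i, bits) for i in range(n)]
        return write_order, read_order

    # For n = 8: writes 0..7, reads 0, 4, 2, 6, 1, 5, 3, 7.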

The Data-Set Manager provides a command and data interface between the reconfigurable device and the external-to-device PMC unit. This Data-Set Manager helps to fill the DS-ix streaming FIFOs. On the reading side of the memory layout, the Programmable Data Distributer is used, which is also an FSM. However, it is programmable in the sense of distributing different sets of data to the different instantiations of the same compute block (see Section II-D).

D. The Algorithm Compute Back-End

The compute Back-End consists of Branch-Handlers, Compute Block Instantiations and the Configurable Memory Output Control. The compute block is the main part of this Back-End and it can have multiple instantiations for an algorithm. Each instantiation of the compute block interfaces with the Programmable Data Distributer through its Branch-Handler. These Branch-Handlers are essentially FIFO buffers that support data pre-fetch in order to avoid a time penalty in case of a branch divergence in the compute block. The TARCAD architecture expects a compute block to be a combination of arithmetic compute units with minimal complexity in the flow of data inside the compute block. All compute blocks either keep a small set of their computational results in the local memory (LM), shareable with other instantiations, or forward the results to the configurable memory output control (CFG MEM-OUT-CONTROL). CFG MEM-OUT-CONTROL collects data from the compute blocks for a specific output data set (DS-Ox). The results collected at CFG MEM-OUT-CONTROL are either routed back to the global memory by the Data-Set Manager or written back to the CFG MEM-IN-CONTROL.

E. The Event Managing Block

The role of the Event Manager is to guide and monitor the kernel mapped on TARCAD. The Event Manager can be an FSM or a simple processor with multiple interrupt inputs. In our current work, we consider the Event Manager to be an FSM. In general, each event in the Event Manager guides and monitors a single phase of kernel execution. The Event Manager is initialized by the user before the execution of a kernel. It holds information such as the set of events (signals from various blocks) for each phase, the input/output memory pointers and the data sizes for the different data sets used in the execution of each phase of a kernel. The Event Manager monitors the execution of the kernel and takes actions on the appropriate events. The actions consist of exchanging information (setting/getting state data) with all the other state-machine based blocks. The Event Manager keeps a set of counters shared across all phases and a set of per-phase registers initialized by the user.

F. TARCAD Implementation

The motive behind the TARCAD layout is to support efficient mapping of application specific accelerators onto reconfigurable devices. These application specific mappings of various designs require the ability to physically change or

scale the data paths, the FSMs, the special memory layouts and the compute blocks. For a reconfigurable device these changes can be made only at compile time. Therefore, we implement TARCAD using a template expansion method. This is a metaprogramming (i.e. code generation) process that generates a specific HDL description of the accelerator based on the TARCAD layout. The template expansion is provided by our prototype translator, the Design of Accelerators by Template Expansion (DATE) system [9]. This is an in-house research tool to support template based expansions of high level domain abstractions. The block diagram in Figure 3 shows the environment of the DATE system.

Figure 3. TARCAD implementation: environment of the DATE system. The user supplies the compute unit's annotated HDL, the memory layout definitions and an accelerator specific parameter set; the DATE translator combines these with the TARCAD block template library to produce the TARCAD-mapped HDL.

The main inputs from the user to the DATE system are the annotated HDL template code for the compute block and the data flow definitions for the memory layout. The annotations used in coding the HDL are similar to those used in the DATE templates [9]. A set of parameters is also passed to the DATE translator to adjust and generate the other HDL design modules from the TARCAD templates for the various blocks maintained inside the TARCAD template library. For example, some important parameters related to the Event Manager are the total number of phases through which a kernel will execute, the total repetitions of a phase, the maximum number of events connected to that phase, the total number of data pointers used in the phase and the equations for the memory block accesses of each pointer in the phase. However, the actual list of data pointers, the monitoring and activation events and the events' target blocks are initialized through special commands directly by the Data-Set Manager at execution start-up or during run-time.
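The exact input format of the DATE translator is not specified here, but conceptually the accelerator specific parameter set resembles a small configuration record. The Python sketch below is a hypothetical example of such a record for one Event Manager phase, using the matrix-multiplication access equation discussed in Section III-A as the example; all field names are assumptions chosen to mirror the parameters listed above, not the real DATE syntax.

    # Hypothetical accelerator-specific parameter set for the DATE translator.
    # Field names are illustrative; they mirror the Event Manager parameters
    # described in the text (phases, repetitions, events, pointers, equations).
    event_manager_params = {
        "num_phases": 1,
        "phases": [
            {
                "repetitions": 1,
                "max_events": 2,                 # e.g. a request and an end event
                "num_data_pointers": 2,          # e.g. matrix A and matrix B
                "access_equations": [
                    "FSa = ISa + i * SSra",      # fetch source for matrix A rows
                    "FSb = ISb",                 # fetch source for matrix B
                ],
            }
        ],
    }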

DSA
FB

CFG MEM | IN | CTR

--- a21 a1m -- a12 a11 bn1 --- b31 b21 b11 bn2 --- b32 b22 b12
-----------

br0ins-0 br0ins-1 br0ins-br0ins--

1 ISa

DSB
FB

bnp --- b3p b2p b1p
BRAM Based FIFOs

FB

= A_pointer 2 ISb = B_pointer 3 SSra = A_row_size 4 SSmb = B_matrix_size 5 loop(EVre) : 6 if (EVrr) : i=0 ; i++ 7 FSa = ISa + i x SSra 8 FSaz = SSra 9 FSb = ISb 10 FSbz = SSmb 11 end_if 12 end_loop

(a)

(b)

matrix-B is scattered around the multiple circular buffers equal to the number of compute block instantiations in the back-end. Therefore, the dot product of an element from the row of Matrix-A is done with multiple columns of MatrixB. Each instantiation of the compute block accumulates the results for the element wise dot product of a row (Matrix-A) and a column (Matrix-B). B. Acoustic Wave Equation (AWE) Solver A common method to solve the Acoustic Wave Equation (AWE) numerically consists of applying a stencil operator followed by a time integration step. Some details on the AWE solver and its implementations are described by Araya et al. [10]. In our TARCAD based mapping of the AWE solver, the two volumes of previous data sets for the time integration part are forwarded to the compute block by using simple FIFO channels in the TARCAD’s memory layout. However, our implementation of the stencil operations follows the memory layout of an 8 × 9 × 8 odd symmetric 3D stencil as shown by Shafiq et al. [3]. In our TARCAD based mapping of the AWE kernel, we consider real volumes of data that are normally larger than the internal memory layout of the accelerator. Therefore, a large input volume is partitioned into its sub-volumes as shown in Figure-5(a). A sub-volume block also needs to copy the so-called ”ghost points” (input points that belong to the neighboring sub-volume). For example, Block 7 shown in Figure 5(a) needs to be fetched as an extended block that includes ghost points from the neighboring Blocks 2, 6, 12 and 8. However, these ghost points are only required for the current volume being used in stencil computations. The TARCAD layout supports offloading the management of block-based data accesses to the programmable memory controller (PMC). In the AWE case, for simplicity, TARCAD accesses the same pattern of the extended sub-volumes from all three input volumes. The CFG MEM-IN-CONTROL discards the ghost points accessed for the two previous volumes used in time integration. The PMC is programmed by the host to access the three volumes of data –block by block– on the request of the Event Manager. The example pseudo code for the FSM of Event Manager is shown in Figure-5(b). In the first three lines of the pseudo code, the FSM does an initialization of the initial source pointers (ISx) for the three input volumes. In the next line, a reset to zero of
Z=M Partitioned Blocks 0 5 Y=P=∞ 10 1 6 11 2 7 12 3 8 13 4 9 14
1 ISa

Figure 4. MM-M : (a) Matrices elements distribution into application specific memory layout and (b) The pseudo code for matrices data accesses by the Event Manager

for the data fetch requests is shown in Figure-4(b). In order to make it clear, the FSM actions are non-blocking (i.e simultaneous but based on conditions) and purpose of the sequential pseudo code is just to give the basic idea of the mechanism. The structure of this FSM already exists as a template in the DATE Translator library (Figure-3). However, an arbitrary number of registers to keep kernel specific information are created from the parameterized information at the translation time. For example in Figure4(b), ISa and ISb are the registers created for the initial source pointers to access matrices from external memory. FSa and FSb are the tuple registers for the fetch source pointers (the current pointers). FSaz and FSbz represent the registers for the fetch sizes of data. The source size registers are mentioned as SSra and SSmb. The external parameters to the DATE System also include simple computational equations to generate data accesses in big chunks, like “F Sa = ISa + i × SSra“ where “i“ is taken as internal incremental variable. The parameterized inputs also creates two events, the ”row request event“ (EVrr) and the “rows end event“ (EVre) coming from the CFG MEM-IN-Control and CFG MEM-OUT-Control respectively. These events are monitored at the Event Manager. At run time, the FSM of the Event Manager corresponding to the pseudo code shown in Figure-4(b) initializes the registers ISa, ISb, SSra and SSrb. This is done by using special initialization commands from an external host. These commands are decoded by the DATA Set Manager and forwarded to the Event Manager. The DATA Set Manager can also hold multiple requests from the Event Manager and forward these requests consecutively to the programmable memory controller (PMC). As in lines 5 and 6 of the pseudo code, the Event Manager monitors the event signals EVrr and EVre and sends the tuples of data for the external memory fetch pointers and their sizes to the Data Set Manager along with necessary control signals. This starts fetching of data by the PMC from both matrices A and B in the external memory. The physical data transactions are directly handled by the Data Set Manager and the CFG MEM-IN/OUT-Controls. The FSMs at CFG MEM-IN/OUTControls are also built based on their own parameterized information and take care for the generation of events EVrr and EVre at the appropriate execution time. During the run, one row of matrix-A is fetched from the external memory into a single circular buffer and used element by element in each cycle while the fetched row from

X= N

= V1_pointer 2 ISb = V2_pointer 3 ISc = V3_Pointer 4 BnV1=BnV2=BnV3= 0 5 loop(EVbe) : 6 if (EVrb) : 7 FSa = ISa 8 FBv1 = BnV1++ 9 FSb = ISb 10 FBv2 = BnV2++ 11 FSc = ISb 12 FBv3 = BnV3++ 13 end_if 14 end_loop

(a)

(b)

Figure 5. 3D-Stencil for : odd symmetric 3D stencil, (a) The large input volume partitioned into sub volumes (b) The pseudo code for sub-volume accesses by the Event Manager

External Memory
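To make the ghost-point handling above more concrete, the following Python sketch computes the extended index range of one sub-volume, clamped at the volume boundary. It is a simplified software illustration only: the function and parameter names are invented for this example, and the halo width is a generic parameter rather than the exact extent implied by the 8 × 9 × 8 stencil.

    def extended_block_range(block_xz, block_size, volume_size, halo):
        """Return ((x_lo, x_hi), (z_lo, z_hi)) index bounds of a sub-volume
        extended by `halo` ghost points on each side, clamped to the volume.
        block_xz is the (x, z) block index; Y is streamed and not partitioned."""
        (bx, bz) = block_xz
        (nx, nz) = volume_size
        x_lo = max(bx * block_size - halo, 0)
        x_hi = min((bx + 1) * block_size + halo, nx)
        z_lo = max(bz * block_size - halo, 0)
        z_hi = min((bz + 1) * block_size + halo, nz)
        return (x_lo, x_hi), (z_lo, z_hi)

    # Example: 320-point blocks of a 1600 x 1600 X-Z plane with a 4-point halo.
    print(extended_block_range((1, 2), 320, (1600, 1600), 4))
    # -> ((316, 644), (636, 964))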

Similar to the MM-M kernel case, the Event Manager of AWE monitors two events. One event, "Block Ends" (EVbe), is sourced from the CFG MEM-OUT-CONTROL and ends the execution of the kernel, while the other event, "Block Request" (EVrb), comes from the CFG MEM-IN-CONTROL and initiates a new request for a block. Inside the control structure, the FSM updates three tuples of parameters corresponding to the three input volumes. Each tuple consists of the base pointer of the volume (FSx) and the block number (FBvx). These tuples are used by the Data-Set Manager to access external data through the programmable memory controller. The flow of data between the Data-Set Manager and the CFG MEM-IN-CONTROL is synchronized with handshake signals between the two interfaces.

C. Smith-Waterman (SW)

The implementation of the Smith-Waterman algorithm results in a systolic array of processing cells. This kind of data flow is also well suited to the compute blocks of the TARCAD architecture. The left part of Figure 6 shows a TARCAD based systolic array of processing cells, mapped by joining a number of compute blocks to run the SW kernel. Each of the compute blocks consists of an algorithm specific processing cell. This processing cell, in our case, follows the Smith-Waterman compute architecture proposed by Hasan et al. [11]. The input data for a compute block constitutes a single branch set that consists of Ax and By (the two sequences) and Mup and MDiag (the top and diagonal elements) from the similarity matrix. MLD represents the current data passed through the LM to the next compute block as the left-side matrix-M data. This data word is also passed in a staircase flow to be used as a diagonal data element. The generic layout of the compute block in TARCAD is shown in Figure 6 (right). Each compute block keeps a dual ported local memory (LM) to communicate low latency data with other compute blocks. Each word of this local memory is accompanied by a valid bit which describes the validity of the data written to it. This valid bit is invalidated by the receiving compute block. If there is more than one receiving block, only one of them drives the invalidation port of the source compute block and the others work synchronously with it. Inside a compute block, the LM is written as a circular buffer; therefore the invalidation of the valid bit cannot create read/write hazards on the LM data between the source and destination for a few consecutive cycles. The width and depth of the LM are parameterized and can be decided at translation time. Moreover, each compute block also has a local memory read and invalidate control (LM R/I Ctrl), which helps to read and invalidate a word of the source block's LM. The read word is placed into a FIFO which is readable by the compute block's algorithm specific processing cell.
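For reference, the per-cell update that such a processing cell implements is the standard Smith-Waterman recurrence with a linear gap penalty, sketched below in Python. This is the textbook formulation and is only meant to clarify what Mup, MDiag and the left value contribute to each cell; the exact scoring scheme of the cell design in [11] may differ.

    def sw_cell(a, b, m_diag, m_up, m_left, match=2, mismatch=-1, gap=1):
        """One Smith-Waterman cell update with a linear gap penalty.
        m_diag, m_up and m_left are the similarity-matrix values coming from the
        diagonal, top and left neighbours (MDiag, Mup and the LM-forwarded MLD)."""
        substitution = match if a == b else mismatch
        return max(0,
                   m_diag + substitution,
                   m_up - gap,
                   m_left - gap)

    # Each systolic cell applies this update once per cycle as the two sequences
    # stream through the array; the running maximum gives the alignment score.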
Figure 6. Smith-Waterman: (left) the systolic array of compute blocks and (right) the architectural support for inter-compute-block communication (local memory LM with valid bits and the LM read/invalidate control).

D. Fast Fourier Transform (FFT)

The TARCAD layout is flexible and can also integrate third party cores. For the FFT case, Figure 7 shows how TARCAD interfaces with an FFT core generated by Xilinx CoreGen [12]. TARCAD interfaces with and controls the single or multiple input/output data streams corresponding to one or more instantiations of the FFT cores.

Figure 7. Mapping an existing FFT core on TARCAD.

E. Sparse Matrix-Vector Multiplication (SpMV)

In our TARCAD based mapping of the SpMV kernel, we use an efficient architecture based on a row-interleaved input data flow described by Dickov et al. [13]. TARCAD's FSM in the CFG MEM-IN-CONTROL takes a standard generic sparse matrix format and converts it internally to the row-interleaved format before feeding the compute block. However, this methodology needs to know in advance (at translation time) the maximum possible number of non-zero elements in any row of the matrix. This information helps the translator to correctly estimate the maximum number of rows that can be decoded and maintained inside the SpMV memory layout.
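The row-interleaved flow described above can be modelled in software roughly as follows: non-zero elements from a window of rows are emitted round-robin, so that consecutive elements delivered to an accumulating pipeline belong to different rows. The Python sketch below is only a simplified illustration of the idea behind [13], written against a CSR-style input; the actual hardware format and the conversion performed by CFG MEM-IN-CONTROL are not specified here.

    def row_interleave(values, col_idx, row_ptr, rows_in_flight):
        """Yield (row, column, value) triples from CSR arrays, interleaving the
        non-zeros of `rows_in_flight` consecutive rows in round-robin order."""
        n_rows = len(row_ptr) - 1
        for base in range(0, n_rows, rows_in_flight):
            window = list(range(base, min(base + rows_in_flight, n_rows)))
            cursors = {r: row_ptr[r] for r in window}
            pending = True
            while pending:
                pending = False
                for r in window:
                    if cursors[r] < row_ptr[r + 1]:
                        k = cursors[r]
                        cursors[r] += 1
                        pending = True
                        yield (r, col_idx[k], values[k])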

F. Multiple Kernels on TARCAD

TARCAD can handle multiple algorithms working at the same time. In general, each algorithm should be maintained with separate data paths, memory organization and compute units. Only the data requests to the global memory (through the Data-Set Manager) are shared. However, design schemes like the spatially mapped, shared memory layout recently presented by Shafiq et al. [8] could help to use shared data for certain kernels with different types of compute block instantiations.

IV. EVALUATION METHODOLOGY

To evaluate the TARCAD system, we simulate the mappings of the application kernels presented in Section III using a Xilinx Virtex-6 XC6VSX475T device.

Table I. Applications mapped to TARCAD using Virtex-6 & ISE 12.4.

    Application   Compute Blocks   Freq (MHz)   DSP48E1    Slices    BRAMs (36Kb)
    M-M Mul       403              105          2015       49757     432
    AWE Solver    22               118          2008       45484     677
    SW            4922             146          2012       63989     85
    FFT           4-48             125          2016-472   59K-48K   0-1060
    SpMVM         134              115          2010       33684     516

The HDL designs were placed and routed using the ISE 12.4 environment. The Virtex-6 device used in our evaluations has a very large number (more than 2K) of DSP48E1 modules. Therefore, we instantiated the maximum possible number of compute blocks for each kernel and used the device's maximum post place-and-route operational frequency for all the back-end instantiations. The external memory support for TARCAD depends on the board design. In our simulated evaluations we use an aggressive external memory interface with multiple memory controllers, providing an aggregate peak bandwidth between 100 GB/s and 144 GB/s. This external memory interface performance is similar to what GPUs can achieve today. In our evaluations, the efficiency of the application kernels mapped on the TARCAD layout is compared with state of the art implementations of the same kernels on various GPU devices. The choice of the GPU based implementation is based on two points: the GPU implementation should be the best available one for the best possible GPU device, and we should be able to reproduce the same input test data for the TARCAD based implementations. The architectural efficiencies shown in Section V (Figure 8(b)-(f)) are defined differently for kernels using floating point computations and kernels using cell updates:

Arch. efficiency for kernels with FP operations = GFLOPS achieved / device max. GFLOPS

Arch. efficiency for kernels with CUPS = CUPS achieved / operational frequency
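The two efficiency metrics above can be written directly as code. The small Python helpers below simply restate those formulas; they are included only to make the units explicit.

    def arch_efficiency_flops(gflops_achieved, device_peak_gflops):
        """Architectural efficiency for floating-point kernels:
        achieved GFLOPS divided by the device's peak GFLOPS."""
        return gflops_achieved / device_peak_gflops

    def arch_efficiency_cups(cups_achieved, operational_freq_hz):
        """Architectural efficiency for cell-update kernels (e.g. Smith-Waterman):
        achieved cell updates per second divided by the operational frequency,
        i.e. cell updates per cycle."""
        return cups_achieved / operational_freq_hz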

V. RESULTS & DISCUSSION

The overall performance (Figure 8(a)) of the various kernels mapped on TARCAD remained below 100 GFLOPS (or GCUPS for SW). This is considerably lower than the reference performance on GPUs. In fact, this is expected, as current reconfigurable technology operates at an order of magnitude lower operational frequency (see Table I) for the mapped designs. However, the efficiency of the TARCAD mapped applications is quite promising due to the customized arrangement of data and compute blocks. In the following we discuss the efficiency of each kernel. In support of the discussion, the total number of compute units instantiated, their operational frequencies and the chip resource usage are given in Table I. The numbers for FFT correspond to implementations from 128 points to 65536 points, and the frequency given is the lowest value.

1) Matrix-Matrix Multiplication (MM-M): In the case of MM-M, we can observe from Figure 8(b) that the efficiency of the TARCAD based implementation is on average 4 times higher than that of the GPU. However, for smaller matrices the efficiency is relatively lower because of two factors: either the number of columns in matrix B is less than 403 (the total number of compute block instantiations), or the number of columns is not a multiple of 403. Both cases lead to sub-optimal usage of the available compute units on TARCAD.

2) Acoustic Wave Equation (AWE) Solver: The TARCAD mapped memory layout for the AWE kernel can handle sub-volumes of size 320 × 320 × ∞ in the Z, X and Y axes respectively. The results for AWE (Figure 8(c)) show that the TARCAD based AWE kernel efficiency reaches 14 times that of the GPU based implementation. However, it drops to as low as 5× for 384-point 3D volumes. This is because 384 is not a multiple of the basic size (320 × 320 × ∞) of the AWE specialized memory layout, which incurs a large data and computational overhead. This penalty shrinks as the size of the actual input volumes increases.

3) Smith-Waterman (SW): The Smith-Waterman implementation on TARCAD is approximately 3 times more efficient (Figure 8(d)) than the referenced GPU based implementation. In fact, this edge in architectural efficiency is purely a result of the customized mapping of the computing cells and the systolic array. The front-end data management only buffers new sequences for comparison and feeds back the results from the cells on the boundary of the systolic array through the CFG MEM-OUT-CTRL, Data-Set Manager and CFG MEM-IN-CTRL path.

4) Fast Fourier Transform (FFT): The memory requirement of the floating point, streaming based implementation of Xilinx's FFT core increases rapidly for larger numbers of points. In the TARCAD based mapping, the instantiations of the FFT kernel for 16384 or more points are limited by the total available BRAM of the device. This limitation is apparent in the plot shown in Figure 8(e). For lower numbers of points (8192 or fewer), the number of FFT compute block instantiations is dictated by the total number of DSP48E modules available on the device.

5) Sparse Matrix-Vector Multiplication (SpMV): In the SpMV mapping on TARCAD, we modified the original design of Dickov et al. [13] into a special yet generic compute block for handling any kind of Laplacian data. This design has a three-point front-end which accumulates three dot products at a time from a row. However, inefficiencies for this Laplacian specific compute block appear when the number of non-zero diagonals in the Laplacian matrix is not a multiple of 3.

VI. RELATED WORK

The topic of developing a compute template for FPGAs is related to many areas of research. In Section II-A we already presented a brief look into several recent developments that are directly related to this paper. Here we provide a succinct overview of other related developments.

Figure 8. (a) Performance numbers for the TARCAD based kernels (GFLOPS, and GCUPS for SW) using the Virtex-6 XC6VSX475T device, and (b)-(f) the architectural efficiency for: (b) MM-M (GPU: Tesla C2050 in [14]), (c) AWE / 3D Reverse Time Migration (GPU: Tesla C1060 in [10]), (d) SW (GPU: Tesla C1060 in [15]), (e) FFT (GPU: Tesla C2050 in [14]), (f) SpMV (GPU: GTX 280, cache enabled, in [16]).

TARCAD defines both a high-level model for the computation flow and a strategy for organizing resources, managing the parallelism in the implementation, and facilitating optimization and design scaling. Following DeHon's taxonomy, these two correspond to the fields of compute models and system architectures [17]. Compute models include abstractions such as Dataflow, Sequential Control or Hoare's CSP. System architectures roughly consist of Dataflow Machines, von Neumann computers and data parallel architectures such as SIMD, SPMD or SIMT machines. Field Programmable Gate Arrays offer a raw and unconfigured computation substrate that allows all of the previous models to be mapped on a chip. This provides great flexibility but, as already discussed in this paper, at the cost of many design overheads. The raw logic and routing hardware also creates performance bottlenecks, as it imposes a considerable area penalty and limits the frequency at which a circuit can operate. As a consequence, many researchers have attempted to reduce the flexibility and improve frequency by designing new reconfigurable hardware with reduced interconnection networks and more full-custom functional units. Such chips are often called Coarse-Grained Reconfigurable Architectures (CGRAs). Similarly to TARCAD, they also define stricter compute models and system architectures. PipeRench [18], MuCCRA [19] and ADRES [20] are examples of CGRA architectures. A related class of architectures are the so-called Massively Parallel Processor Arrays (MPPAs), which are similar to CGRAs but include complete, although very simple, processors instead of the functional units featured within CGRAs. PACT XPP [21] is an example of an MPPA-style architecture.

Defining a compute model and a system architecture is not specific to chip design alone. Several efforts have concentrated on defining environments in which to accommodate

FPGA chips. Kelm et al. [22] used a model based on local input/output buffers on the accelerator with DMA support to access external memory. Brandon et al. [23] propose a platform-independent approach by managing a virtual address space inside their accelerator. Several commercially available machines, such as the SGI Altix 4700 [24] or the Convey HC-1 [25], propose system level models to accelerate application kernels using FPGAs. These models combine a CPU with one or multiple FPGAs connected over a system bus. Another option is to integrate the CPU and the FPGA directly in a single chip. Several research projects have covered this possibility. In the Chimaera architecture [26], the accelerator targets special instructions that tell the microprocessor to execute the accelerator function. The accelerator in the Molen processor [27] uses exchange registers which get their data from the processor register file. The major FPGA vendors are now introducing new FPGAs that include processor cores together with FPGA logic and that are specifically designed to be used with High Level Synthesis tools. The Zynq-7000 family of devices is a recent commercial architecture that combines 7-Series reconfigurable logic with ARM cores.

VII. CONCLUSIONS

In this paper we have presented our developments towards a unified accelerator design framework for FPGAs that improves FPGA design productivity and portability without constraining customization. The evaluation on several scientific kernels shows that the template makes efficient use of resources and achieves good performance. In this work we have focused on showing how these properties are achieved for HDL-based designs. Our TARCAD design also considers adoption by High Level Synthesis tools as a main goal, in order to provide interoperability and

high customization to such tools. In the future we plan to analyze this possibility in more detail. Although we have shown that TARCAD is more efficient than GPUs, its final performance is often lower due to the slower operational frequencies of FPGAs. Designing a coarse-grained reconfigurable architecture (CGRA) based on the TARCAD architecture is an interesting idea to improve final performance that could be explored in the future.

REFERENCES
[1] B. Cope, P. Y. Cheung, W. Luk, and L. Howes, "Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case Study," IEEE Transactions on Computers, August 2009.
[2] M. Garland and D. B. Kirk, "Understanding throughput-oriented architectures," Commun. ACM, vol. 53, pp. 58-66, November 2010.
[3] M. Shafiq, M. Pericàs, R. de la Cruz, M. Araya-Polo, N. Navarro, and E. Ayguadé, "Exploiting Memory Customization in FPGA for 3D Stencil Computations," IEEE FPT, December 2009.
[4] B. Buyukkurt, J. Cortes, J. Villarreal, and W. A. Najjar, "Impact of high-level transformations within the ROCCC framework," ACM Trans. Archit. Code Optim., December 2010.
[5] P. Coussy and D. Heller, "GAUT - High-Level Synthesis tool from C to RTL."
[6] J. E. Smith, "Decoupled access/execute computer architectures," in Proceedings of the 9th Annual Symposium on Computer Architecture (ISCA '82), Los Alamitos, CA, USA: IEEE Computer Society Press, 1982, pp. 112-119.
[7] T. Hussain, M. Pericàs, and E. Ayguadé, "Reconfigurable Memory Controller with Programmable Pattern Support," HiPEAC WRC, Heraklion, Crete, January 2011.
[8] M. Shafiq, M. Pericàs, N. Navarro, and E. Ayguadé, "FEM: A Step Towards a Common Memory Layout for FPGA Based Accelerators," 2010.
[9] M. Shafiq, M. Pericàs, N. Navarro, and E. Ayguadé, "A Template System for the Efficient Compilation of Domain Abstractions onto Reconfigurable Computers," HiPEAC WRC, Heraklion, Crete, January 2011.
[10] M. Araya-Polo, J. Cabezas, M. Hanzich, M. Pericàs, F. Rubio, I. Gelado, M. Shafiq, E. Morancho, N. Navarro, E. Ayguadé, J. M. Cela, and M. Valero, "Assessing Accelerator-Based HPC Reverse Time Migration," IEEE Transactions on Parallel and Distributed Systems, vol. 22, pp. 147-162, 2011.
[11] L. Hasan, Y. M. Khawaja, and A. Bais, "A Systolic Array Architecture for the Smith-Waterman Algorithm with High Performance Cell Design," Proceedings of the IADIS European Conference on Data Mining, 2008.
[12] Xilinx, ISE Design Suite CORE Generator IP Updates. [Online]. Available: http://www.xilinx.com/ipcenter/coregen/updates.htm

[13] B. Dickov, M. Pericàs, N. Navarro, and E. Ayguadé, "Row-interleaved streaming data flow implementation of Sparse Matrix Vector Multiplication in FPGA," in 4th Workshop on Reconfigurable Computing (WRC-2010), 2010.
[14] NVIDIA, "Tesla C2050 Performance Benchmarks," Tech. Rep., 2010. [Online]. Available: www.siliconmechanics.com/files/C2050Benchmarks.pdf
[15] NVIDIA, "CUDASW++ on Tesla GPUs," 2010. [Online]. Available: http://www.nvidia.com/object/swplusplus_on_tesla.html
[16] N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," NVIDIA Technical Report NVR-2008-004, Dec. 2008.
[17] S. Hauck and A. DeHon, Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation, November 2007.
[18] S. C. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. R. Taylor, and R. Laufer, "PipeRench: a co/processor for streaming multimedia acceleration," in Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA '99), Washington, DC, USA: IEEE Computer Society, 1999, pp. 28-39.
[19] Y. Saito, T. Sano, M. Kato, V. Tunbunheng, Y. Yasuda, M. Kimura, and H. Amano, "MuCCRA-3: a low power dynamically reconfigurable processor array," in Proceedings of the 2010 Asia and South Pacific Design Automation Conference (ASP-DAC '10), Piscataway, NJ, USA: IEEE Press, 2010, pp. 377-378.
[20] J. Bormans, "ADRES Architecture - Reconfigurable Array Processor," Chip Design Magazine, November 2006.
[21] V. Baumgarte, G. Ehlers, F. May, A. Nückel, M. Vorbach, and M. Weinhardt, "PACT XPP - a self-reconfigurable data processing architecture," J. Supercomput., vol. 26, pp. 167-184, September 2003.
[22] J. Kelm, I. Gelado, K. Hwang, D. Burke, S.-Z. Ueng, N. Navarro, S. Lumetta, and W.-m. Hwu, "Operating System Interfaces: Bridging the Gap between CPU and FPGA Accelerators," poster at the International Symposium on FPGAs (FPGA '07), 2007.
[23] A. Brandon, I. Sourdis, and G. N. Gaydadjiev, "General Purpose Computing with Reconfigurable Acceleration," International Conference on Field Programmable Logic and Applications, 2010.
[24] SGI, "Reconfigurable Application-Specific Computing User Guide," Tech. Rep., 2008.
[25] Convey Computer Corporation, "The Convey HC-1: The World's First Hybrid-Core Computer," HC-1 Data Sheet, 2008.
[26] S. Hauck, T. W. Fry, M. M. Hosler, and J. P. Kao, "The Chimaera reconfigurable functional unit," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, pp. 206-217, 2004.
[27] S. Vassiliadis, S. Wong, G. Gaydadjiev, K. Bertels, G. Kuzmanov, and E. M. Panainte, "The MOLEN Polymorphic Processor," IEEE Transactions on Computers, vol. 53, pp. 1363-1375, 2004.

Similar Documents

Premium Essay

Science

...Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science Science science science science science...

Words: 460 - Pages: 2

Premium Essay

The Importance Of Science In Science

...Our project specifically focused on the DNA sequence analysis of the genes in duckweed and how those genes fit into bioremediation. As an intimidated and shy freshman with a strong yearning to be part of the science community at my school, decided to join the intensive program with barely any knowledge about biology. It didn't seem like a smart move at the time, but I'm glad that I had persistence because I learned so much about the field through the guidance from the team. Exploring PCR and Restriction Digests, my group and I were able to publish new proteins on the national GenBank. I learned more about biology and how to work/perform in a team. This motivated me to join the Science/Biology Olympiads where we do independent research and come together as a team to compete. I found that these opportunities along with the research in my science classes not only help me learn actual science but provide me with valuable life skills that will help in the...

Words: 918 - Pages: 4

Premium Essay

Science

...The word 'science' is derived from the Latin word 'scientia' which means knowledge. Therefore, science is about gaining knowledge either through observing, studying, experience, or practice. Entire knowledge acquired through science is about discovering truths, finding facts, uncovering phenomenon hidden by the nature. Observations and experimentation, in science, support in describing truth and realities through systematic processes and procedures. For me, science is an intellectual set of activities designed to uncover information about anything related to this world in which we live. The information gathered is organized through scientific methods to form eloquent patterns. In my opinion the primary objective of science is to gather information and to distinguish the order found between facts. What Science Means to Me as an Upcoming Scientist Science exposes several ideas along with significant themes so that I could test them independently and without any bias to arrive at solid conclusion. For this purpose exchange of data and materials is necessary. I am able to generate real and tangible facts supported by reliable evidence. Work of scientist is based on theoretical science. It means, in theoretical science, there is only a sign, just a hint on which discoveries could be made, facts could be found. While studying science I am always working for determining truth, based on my perceptions, judgment, observation, experience, and knowledge collected through several means...

Words: 1529 - Pages: 7

Free Essay

Science

...Science is the concerted human effort to understand, or to understand better, the history of the natural world and how the natural world works, with observable physical evidence as the basis of that understanding1. It is done through observation of natural phenomena, and/or through experimentation that tries to simulate natural processes under controlled conditions. (There are, of course, more definitions of science.) Consider some examples. An ecologist observing the territorial behaviors of bluebirds and a geologist examining the distribution of fossils in an outcrop are both scientists making observations in order to find patterns in natural phenomena. They just do it outdoors and thus entertain the general public with their behavior. An astrophysicist photographing distant galaxies and a climatologist sifting data from weather balloons similarly are also scientists making observations, but in more discrete settings. The examples above are observational science, but there is also experimental science. A chemist observing the rates of one chemical reaction at a variety of temperatures and a nuclear physicist recording the results of bombardment of a particular kind of matter with neutrons are both scientists performing experiments to see what consistent patterns emerge. A biologist observing the reaction of a particular tissue to various stimulants is likewise experimenting to find patterns of behavior. These folks usually do their work in labs and wear impressive...

Words: 306 - Pages: 2

Premium Essay

Science

...Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science Science...

Words: 462 - Pages: 2

Premium Essay

Science

...Chapter 132 - Science and Technology Section SCIENCE AND TECHNOLOGY Science and technology provide people with the knowledge and tools to understand and address many of the challenges. Students must be provided with opportunities to access, understand, and evaluate current information and tools related to science and technology if they are to be ready to live in a 21st century global society. The study of science and technology includes both processes and bodies of knowledge. Scientific processes are the ways scientists investigate and communicate about the natural world. The scientific body of knowledge includes concepts, principles, facts, laws, and theories about the way the world around us works. Technology includes the technological design process and the body of knowledge related to the study of tools and the effect of technology on society. Science and technology merge in the pursuit of knowledge and solutions to problems that require the application of scientific understanding and product design. Solving technological problems demands scientific knowledge while modern technologies make it possible to discover new scientific knowledge. In a world shaped by science and technology, it is important for students to learn how science and technology connect with the demands of society and the knowledge of all content areas. It is equally important that students are provided with learning experiences that integrate tools, knowledge, and processes of science and technology...

Words: 8232 - Pages: 33

Free Essay

Science

...Blueprint to address Australia’s lack of science strategy unveiled Chief scientist makes series of recommendations to improve the country’s skills in science, technology, engineering and maths Australia’s chief scientist, Ian Chubb, has unveiled a blueprint to address Australia’s lack of a science strategy, with proposals aimed at improving skills, supporting research and linking scientific work to other countries. Chubb has made a series of recommendations to the federal government to increase focus on science, technology, engineering and maths skills. The strategy is partially aimed at addressing the declining number of students taking advanced maths in year 11 and 12, as well as the shortage of qualified maths and science teachers. Chubb said each primary school should have at least one specialist maths and science teacher, a policy currently used in South Australia and Victoria. This would be encouraged by improving incentives, including pay, for teachers. Other recommendations include supporting research potential, improving research collaboration with other countries and doing more to stress the importance of science to businesses and students. Chubb said: “We are the only OECD country without a science or technology strategy. Other countries have realised that such an approach is essential to remaining competitive in a world reliant on science and science-trained people. “Science is infrastructure and it is critical to our future. We must align our scientific effort...

Words: 510 - Pages: 3

Free Essay

Science

...1. Describe how fishing has changed at Apo Island, and the direct and indirect effects on people’s lives. Apo Island’s marine preserve allows fishing with hand-held lines, bamboo traps, large mesh nets, spear fishing without SCUBA gear, and hand netting. Fishing with dynamite, cyanide, trawling, and Muro-ami are forbidden. This has increased fish populations and made it easier to catch the fish needed to support a family. The healthy reef community now attracts ecotourists and provides jobs for islanders. 2. What are some basic assumptions of science? 3. Distinguish between a hypothesis and a theory. A hypothesis is the second step from the scientific method that forms an educated guess based off an observation. A theory is the information that was gathered to support the proof of an observation and confirms the hypothesis. 4. Describe the steps in the scientific method. 7. What’s the first step in critical thinking? The first step in critical thinking is 8. Distinguish between utilitarian conservation and biocentric preservation. Name two environmental leaders associated with each of these philosophies. Biocentric preservation emphasizes the fundamental right of living organisms to exist and to pursue their own good. While utilitarian conservation emphasized that resources should be used for the greater good for the greatest number for the longest time. Two environmental leaders associated with the biocentric preservation philosophy are John Muir...

Words: 294 - Pages: 2

Free Essay

Science

...SCIENCE My second month in Gusa Regional Science High School! Do you want to know what are the activities and what have I learn this month? As we all know this month is “Nutrition Month,” so I am excited what are the activities that would be held in celebrating the nutrition month. Come! and let us know what happened this July. On the first day of July we answer our wortext. We answer page 17, 1-5 in ½ lengthwise. The next day we had a contest about the scientist. We were gouped into two groups, group a and group b. Group a scored 27 while group b scored 31. Group b win with the score of 31, while group a lose with the score of 27. Group a’s punishment is they have to dance. The boys did it but the girls pleaded that they will just sing rather than dance. Teacher Cass agreed, and in the middle of singing “Nasayo Na Ang Lahat,”Teacher Cass gestured to the boys to join the girls singing. The boys didn’t insist in joining the girls. On Thursday, the rain was falling hard so teacher Cass is the one who come to us. We were trapped in Teacher Lory’s classroom. We had another game same us what we did yesterday. This time its boys vs girls. The girls won the game and as expected boys got a punishment. Their punishment was they did a fashion show. Some...

Words: 372 - Pages: 2

Premium Essay

Science

...Science: A Blessing Or A Curse Everything in the universe has its uses and abuses. The same applies to science. Science has revolutionized human existence and has made it happier and more comfortable. Modern science has many wonders. Electricity is one of its greatest wonders. It is a source of energy. It can run any type of machinery. With the help of electricity, we can light our rooms, run buses and trains and machinery, lift water for irrigation and can accomplish a multitude of other tasks. Much of the progress that mankind has made in different fields, right from the stone age to the modern age, is due to the progress made in the field of science. Not only material progress but also the mental outlook of man has been influenced by it. Agriculture, business, transport, communication and medicine, to name a few, are all highly indebted to the wonders of science. We have become scientifically much more advanced than our ancestors. This is because the world has undergone a tremendous change because of the rapid strides made by science and technology. The discovery and development of a large number of powerful energy sources – coal, petroleum, natural gas, electricity etc. – have enabled humanity to conquer the barriers of nature. All these have facilitated the growth of fast modes of transport and communication, which have metamorphosed the world into a global village. Science has given man the means of travelling to the moon. Science is a great help in the agricultural field...

Words: 2098 - Pages: 9

Premium Essay

Science

...Blessing of Science Blessings of science are numerous. Science has completely changed the living style of man. Man is now living in a totally different age. From home to office, from farm to factory, from village to town, in short, everywhere in life we can now see the unlimited blessings of science. At home, we find that science has provided many comforts to human beings. Whether in the kitchen or the lounge, the shaker, the chopper, the toaster and many other appliances have brought a revolution to the working of a kitchen and the life of a housewife. Although it is a fact that science cannot fight fate and often fails to defeat nature, it has done a lot to minimize nature’s disastrous effects. Scientists have invented machines such as the air conditioner and the heater that can give comfort to man in hot summers and in extreme winters respectively. Now there are instruments which can warn man against floods, earthquakes and windstorms. After getting such warnings, human beings are able to take preventive measures. Travelling and transportation were very difficult and painful in the past, but now the miracles of science have made travelling a luxury. There is now a variety of means of transportation, such as buses, cars, trains and aeroplanes, that have decreased distances and made the journey a comfort. Now hundreds of people can travel from their own country to another in one train or one aeroplane. The distance that could be covered by the people in the months...

Words: 1787 - Pages: 8

Premium Essay

Science

...Advantage Science gives us safe food, free from harmful bacteria, in clean containers or hygienic tins. It also teaches us to eat properly, indicating a diet balanced in protein and carbohydrate and containing vitamins. The result is freedom from disease and prolonged life. In pre-scientific days, food was monotonous and sometimes dangerous; today it is safe and varied. It is varied because, through improved sea, land and air transport, food can now be freely imported and exported. Science has also improved clothing and made it more appropriate for climatic and working conditions. Man-made fibers and versatile spinning machines today enable us to dress in clothes both comfortable and smart without being expensive. Home, school and office all bear witness to the progress and application of science. Nowadays, most homes possess electric lighting and cooking, but many also have washing machines, vacuum cleaners and kitchen appliances, all designed to increase comfort and cleanliness and reduce drudgery. Science produces the fan which cools the air, the machinery which makes the furniture and fabrics, and a hundred and one other features for good living. Books and papers are at school, and again everything from the piece of chalk to the closed-circuit television used for instruction is the direct or indirect result of scientific progress. Learning is therefore easier. And clerical work is made far more speedy and efficient by the office typewriter, quite apart from the hundreds of...

Words: 572 - Pages: 3

Free Essay

Science

...One of the major shortcomings of science supposedly is a lack of communication between scientists and the general public. Many argue that too often, science is presented only in written academic journals that are not easily obtained by the general public. This is discussed on a daily basis and was argued in the aftermath of the 2011 earthquake and tsunami in Japan, as well as in ongoing debates about other scientific theories and ideas. However, people fail to realize a few things. One of the major ones is that events like earthquakes and tsunamis simply cannot be predicted. You cannot blame scientists for not being able to predict an earthquake the way a meteorologist can predict weather events. Scientists can study things like seismic activity, and they can make assumptions as to what may happen should an earthquake of a high magnitude hit and cause something catastrophic like a tsunami. Yet some fail to realize that safety measures were taken, and even inspectors who visited the Fukushima nuclear power plant asked Japanese authorities to increase safety measures further. According to a France24 news article written three months after the catastrophe, “A three-page summary was issued at the end of the 18-member team’s May 24-June 2 inspector mission to Japan. It said the country underestimated the threat from tsunamis to the Fukushima plant and urged sweeping changes to its regulatory system. Japanese authorities have been criticised for...

Words: 769 - Pages: 4

Premium Essay

Science

...In this essay I will focus on the events surrounding the regulation of Alar (daminozide) up to and including 1985, as a case study of knowledge and decision-making amidst uncertainty (418-19). I pick this time period in particular because it is when the NRDC and other public interest groups began their campaign in protest against the EPA's decision not to ban Alar. My analysis of the events surrounding Alar will take shape around a critique of Michael Fumento's article "Environmental Hysteria: The Alar Scare," in which he paints the NRDC as "fanatics" launching a "smear campaign" not founded in any rational decision-making. This is an important argument to counter, because it has not only been taken up by many to condemn citizen-group action in the case of Alar, but also to criticize their activities in many other regulatory processes. The chief framework used to devalue public action in these cases is the technocratic model, wherein it is believed that decisions are best made by objective, rational experts acting on scientific knowledge. In this case, we can see a perfect example of a decision made by scientific experts, in accordance with the technocratic model. Fumento and other supporters of the technocratic model privilege the scientific knowledge of bodies such as the Scientific Advisory Panel in this case over other forms of knowledge. He denounces the NRDC as fanatics based on his claim that they acted in spite of, and in contradiction to, scientific...

Words: 2159 - Pages: 9

Premium Essay

Science

...Scientific papers are for sharing your own original research work with other scientists or for reviewing the research conducted by others. As such, they are critical to the evolution of modern science, in which the work of one scientist builds upon that of others. To reach their goal, papers must aim to inform, not impress. They must be highly readable — that is, clear, accurate, and concise. They are more likely to be cited by other scientists if they are helpful rather than cryptic or self-centered. Scientific papers typically have two audiences: first, the referees, who help the journal editor decide whether a paper is suitable for publication; and second, the journal readers themselves, who may be more or less knowledgeable about the topic addressed in the paper. To be accepted by referees and cited by readers, papers must do more than simply present a chronological account of the research work. Rather, they must convince their audience that the research presented is important, valid, and relevant to other scientists in the same field. To this end, they must emphasize both the motivation for the work and the outcome of it, and they must include just enough evidence to establish the validity of this outcome. Papers that report experimental work are often structured chronologically in five sections: first, Introduction; then Materials and Methods, Results, and Discussion (together, these three sections make up the paper's body); and finally, Conclusion. The Introduction...

Words: 373 - Pages: 2