FPGA-Specific Custom Arithmetic Datapath Design
Florent de Dinechin, Bogdan Pasca

To cite this version:
Florent de Dinechin, Bogdan Pasca. FPGA-Specific Custom Arithmetic Datapath Design. 2010. ensl-00542396

HAL Id: ensl-00542396
https://hal-ens-lyon.archives-ouvertes.fr/ensl-00542396
Submitted on 2 Dec 2010

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
FPGA-Specific Custom Arithmetic Datapath Design

Florent de Dinechin, Bogdan Pasca
LIP (ENSL-CNRS-Inria-UCBL), Ecole Normale Superieure de Lyon
46 allée d’Italie, 69364 Lyon Cedex 07, France
Email: {florent.de.dinechin, bogdan.pasca}@ens-lyon.org

Abstract—This paper presents FloPoCo, a framework for easily designing custom arithmetic datapaths for FPGAs. Its main features are: an important basis of highly optimized arithmetic operators, a unique methodology for frequency-directed pipelining the designed circuits and a flexible test-bench generation suite for numerically validating the designs. The framework was tested for designing several complex arithmetic operators, this paper presenting the architecture and results for the exponential operator. Synthesis results capture the designed operator’s flexibility: automatically optimized for several Altera and Xilinx FPGAs, wide range of target frequencies and several precisions ranging from single to quadruple precision.

Keywords—FloPoCo; framework; arithmetic circuit; pipelining; flexibility;

I. INTRODUCTION

In the last years the trend of using FPGAs for prototyping ASICs has shifted towards FPGAs as application accelerators. There is a wide range of applications that can benefit FPGA acceleration but the best-suited candidates applications are intrinsically parallel and require complex and exotic arithmetic operations using custom precisions.

The potential FPGA speedup over microprocessor systems can go beyond 3 orders of magnitude, depending on the application. However, translating even the best acceleration candidate into an optimal FPGA design is a tedious task. The first attempts for application acceleration using FPGAs boiled down to a manual, low-level circuit description. FPGA tools have come a long way since then, but even today, although laborious, describing circuits using low-level primitives is still done performance-critical circuit parts.

At the other end of the spectrum, new products targeting portability and productivity were developed [1], [2], [3], [4], [5]. These products are capable of automatically generating circuit descriptions for algorithms described in subset of the C language. Out of these tools, only [1] and [4] are capable of dealing with floating-point numbers. Numerous compiler optimization techniques are implemented, but most of the time the result is significantly slower and larger than manual design.

Arithmetic datapath design for microprocessors includes several constraints among which the fixed operators implemented in hardware (for floating-point: +, -, *, /) and their working precisions, usually single-precision (SP) and double-precision (DP). Matching these operators and the available precisions will generally yield a good design. Trying to optimize the datapath using exotic precisions will bring no improvement.

Due to their reconfigurability, FPGAs have virtually no constraints. However, in order to benefit from the last drop of performance the user must understand the problem well enough so he can give a rough bound on the output precision. From this specification, the circuit can be implemented working with non standard operators at custom precisions yielding significant speedups over traditional design [6].

We seek to provide an extensible open-source framework for efficiently building arithmetic pipelines on FPGAs. It can be seen as a hybrid between the two ends of the spectrum, automatizing parts of the design process of high-performance arithmetic operators. The development effort of the arithmetic pipeline will be a parametrized design in: input/output precision, deployment FPGA and objective frequency.

II. ARITHMETIC OPERATORS

In this work we consider arithmetic operators as being circuits that can be described by a function \( f(X) = Y \) where \( X = x_0, ..., x_{j-1} \) is a set of inputs and \( Y = y_0, ..., y_{j-1} \) is a set of outputs. Any sequential code without feedback loops performing some computations fits this description. Take for example the circuit performing the complex multiplication: \((a+ bj)(c+ dj)\). The circuit inputs are \( a, b, c, d \) and the output is the pair \( r = ac - bd, i = ad + bc \). As it can be seen in further sections, the restriction to this class of circuits allows for a finer modeling of arithmetic circuits. A simple extension to the introduced framework will allow us to model circuits having feedback loops.

A. FPGA-specific arithmetic operator design

Two of the factors characterizing the quality of arithmetic operators on FPGAs are circuit area and operating frequency. Unfortunately, there is a monotonic dependency between the two: the faster a circuit is, the more resources it takes. As circuit frequency \( f \) is part of the project’s specifications, our job as designers is to build the smallest circuit matching this frequency. The task gets even more complex if we introduce productivity in the equation.

One solution is to use high-performance off-the-shelf IP cores for modeling the circuit. This solution will give the correct performance at the expense of pipeline depth and circuit area as the obtained frequency is usually overestimated. Another way to do this is assembling custom components built for the same frequency \( f \). The circuit proposed in [7] for
evaluating \(x^2 + y^2 + z^2\) in DP uses a custom 3-input FP adder\(^1\) and squarers. The circuit consumes 40-50\% less logic and 25-33\% less embedded multipliers on a Xilinx Virtex4 device than assembling high-performance off-the-shelf IP cores from Xilinx Coregen \[8\].

The target of the FloPoCo is to cover a broad range of abstraction levels, allowing the user to obtain the required performances as productively as possible, without overengineering the solution. For complying with these demands, our framework should:

- provide quality implementations of all the basic off-the-shelf additonSizeblocks available in commercial core generators and more
- provide the mechanisms for easily connecting and synchronizing these blocks
- enhance productivity by employing reusability. Each operator described using this framework will be part of an available operators basis.
- be expressive enough to capture low-level FPGA-specific architectural optimizations when needed
- employ frequency-directed pipelining for minimizing circuit area and pipeline depth
- encourage parametric description of circuits so they can easily be retuned to changing precision requirements
- allow to easily retarget existing operator descriptions to new FPGAs.

In a more visual representation, Figure 1, FloPoCo should be able to cover a whole range of abstraction levels in the circuit description allowing much better productivity than hardware description languages while being able to offer the same set of performances.

\(^1\)The classical implementation of FP adders has two paths, one of which is for the case of subtraction. As the inputs are all positive this path is manually trimmed-out at design time.

III. The Framework

A. Initial off-the-shelf tools

1) Adder: Integer addition is used as a building block in many coarser operators. In hardware description languages (HDL) addition is expressed as “+” and is implemented as a ripple-carry adder. Although FPGAs are enhanced to better support this type of adder, long additions require pipelining for reaching high frequencies. Examples which require large adders include integer multipliers, most FP operators,\(^2\) and modular adders used in some cryptographic applications.\(^3\) FloPoCo offers three different implementations for pipelined adders \[10\]. In the multidimensional space \((f,FPGA,circuit area)\) our framework transparently chooses the best architecture for a given scenario.

2) Multiplier: Multiplication is a pervasive operation, and in an FPGA it should be adapted to its context as soon as this may save resources. Recent FPGAs embed a large number of Digital Signal Processing (DSP) blocks, which include small multipliers.

The automated generation of larger multipliers using the embedded multipliers and adders present in the DSP blocks of current FPGAs can be expressed as a tiling problem, where a tile represents a hardware multiplier, and super-tiles represent combinations of several hardware multipliers and adders, making efficient use of the DSP internal resources \[11\]. This technique allows building high performance multipliers while minimizing DSP block count. Reducing DSP-block count can also be implemented using the Karatsuba-Ofman algorithm which trades multiplications for additions. The available multipliers using this technique have the DSP cost reduced from 4 to 3, from 9 to 6, or from 16 to 10 on a Virtex4 FPGA \[12\].

Our framework offers all these types of multipliers, therefore offering the designer a large space of different trade-offs between latency, logic and DSP count.

3) Squarer: Squaring is fairly common in FPGA-accelerated computations, as it appears in norms, statistical computations, polynomial evaluation, etc. A dedicated squarer saves as many DSP blocks as the Karatsuba-Ofman algorithm, but without its overhead. FloPoCo implements squarers as presented in \[12\]. By using squarers instead of off-the-shelf Coregen Multipliers the evaluation of \(x^2 + y^2 + z^2\) was reduced from 27 to 18 DSPs for DP and from 20 to 9 DPSs for SP \[7\].

4) Truncated Multipliers: Truncated multipliers \[13\] discard some of the lower bits of the mantissa to save hardware resources. For a FP multiplier, the impact of this truncation can be kept small enough to ensure last-bit accuracy (or faithful rounding) instead of IEEE-754-compliant correct rounding. This small accuracy lost may be compensated by a larger mantissa size. However, it is also perfectly acceptable in situations

\(2\)In floating-point, the demand in precision is now moving from double (64-bit) to the recently standardized quadruple precision (128-bit format, including 112 bits for the significand) \[9\]

\(3\)In elliptic-curve cryptography, the size of modular additions is currently above 150 bits for acceptable security.
where a bound on the relative error of the multiplication is enough to ensure the numerical quality of the result. This is for instance the case of polynomial approximation of functions: it is possible to build high-quality functions out of truncated multipliers [14]. The implementation of truncated multipliers is an important step towards efficient implementations of elementary functions up to quadruple precision on FPGAs. FloPoCo offers the implementation of truncated multipliers as described in [11].

5) Constant Multiplier: Multiplication by a constant is a pervasive operation. It often occurs in scientific computing codes, and is at the core of many signal-processing filters. It is also useful to build larger operators: architectures for exponential, logarithm and trigonometric functions [15], [16] all involve multiplication by a constant. A single unoptimised exponential, logarithm and trigonometric functions [15], [16] is an important step towards efficient implementations of constant multipliers, both integer and correctly rounded FP is that presented in [17].

6) Function Evaluator: Polynomial approximation is a very general technique for the evaluation of a wide class of numerical functions of one variable. FloPoCo provides an architecture generator that inputs the specification of a function and outputs a synthesizable description of an architecture evaluating this function with guaranteed accuracy. Our current implementation [14] improves upon the literature in two aspects: it uses better polynomials, thanks to recent advances related to constrained-coefficient polynomial approximation [18], and it refines the error analysis of polynomial evaluation to reduce the size of the multipliers used.

B. Synchronization mechanisms

Arithmetic operators as defined in Section II have no feedback loops. Such an operator can be seen as a flow of operations propagating from the inputs towards the outputs. In order to obtain high throughput these operators are deeply pipelined. Such operators can be obtained by assembling elementary operators built for the same frequency \( f \) and synchronizing the datapath accordingly using registers. As the pipeline depth of sub-components is a depends on the target FPGA, frequency \( f \) and precision, generic synchronization mechanisms are required in order to connect these components.

Hardware circuits usually have several several parallel executing threads which may interact. HDLs, being concurrent languages\(^4\) make it natural to describe combinatorial circuits of this type. The triviality comes from the fact that all computations are performed at the same clock cycle. For pipelined designs one has to describe the computations performed at each clock cycle. The synchronization between signals is a tedious and error-prone task. Moreover, minor modifications to the pipelining depth of components usually leads to synchronization problems.

Say for instance that we need to parallelly evaluate the second-degree polynomial: \( a_2x^2 + a_1x + a_0 \) (Figure 2).

\[ x \] \[ a_2 \] \[ \times \] \[ \times \] \[ a_2x^2 \] \[ a_1x \] \[ + \] \[ a_0 \] \[ \downarrow \]

Fig. 2. Parallel evaluation of the polynomial \( a_2x^2 + a_1x + a_0 \)

When performing the addition \( a_2x^2 + (a_1x + a_0) \) we need to synchronize the two input signals. The pipeline depths of the two signals are \( d_1 = d_{\text{square}} + d_{\text{multiplier}} \) and \( d_2 = d_{\text{multiplier}} + d_{\text{adder}} \). This yields three cases:

- \( d_1 > d_2 \) need to delay the FPAdder output with \( d_1 - d_2 \) cycles
- \( d_2 > d_1 \) need to delay the FPMultiplier output with \( d_2 - d_1 \) cycles
- \( d_1 = d_2 \) computation can be performed directly as the two signals are already synchronized.

Formally, when needing to synchronize several signals the procedure consists in determining the maximum pipeline depth of all signals and delaying the rest for the corresponding clock cycles.

Figure 3 presents the FloPoCo parametric code for the floating-point version of this circuit. The instantiated FP primitives, FPSquarer, FPMultiplier and FPAdder are also parametrized function of target FPGA (the target parameter), exponent width (\( wE \)) and fraction width (\( wF \)). The synchronization between the different threads is done using two functions:

\[ \text{syncCycleFromSignal}("s") \] : advances the current cycle in describing the circuit to the declaration cycle of signal "s" (production cycle)
\[ \text{setCycleFromSignal}("s") \] : sets the current cycle to the declaration cycle of signal "s". This is useful when describing new threads for which the current cycle is less than our current active cycle (Figure 3, line \[ \text{setCycleFromSignal}("al"); \])

C. Sub-cycle accurate data-path design

All basic FloPoCo operators have flexible pipelines, adapting to the user specified frequency, target FPGA properties and input/output precisions. By registering the inputs of these components we guarantee that the critical path remains smaller than \( 1/f \). However, this is not optimal. Consider the following case depicted in Figure 4. For the cases when \( \text{additionSize} \) and \( f \) are small enough registering the addition in not needed which leads to an overpipelined circuit.

In order to avoid overpipelining and therefore wasting resources \( \text{IntConstMultiplier} \) should be able to adapt its
first pipeline stage for the case when there is no register level at its input. This is accomplished by feeding the delay of the previous computation (addition size `additionSize`) to the constructor of `IntConstMult` (`target->adderDelay(additionSize)`). In turn, `IntConstMultiplier` reports the delay of the circuit at its outputs.

**D. Transaction level pipelining**

We have presented so far a series of operators and mechanisms for facilitating arithmetic operator design having flexible pipelines. In this section we present an automation of these techniques, having an inspiration from database transactions.

In databases, a transaction is unit of work (series of statements) that are treated atomically, that is, either they are all executed or the transaction is rolled-back. In our case, a transaction is composed of a number of arithmetic operations starting from a previous register level contributing to the critical path delay. A transaction is valid while the critical path of the operations declared within is $\leq \frac{1}{f}$. The condition that makes a transaction invalid gives us the precise position where to insert the register level. In this case a new transaction is started having a delay equal to the delay of the operation which invalidated the transaction.

Figure 5 shows the generic optimal pipelining for the code in Figure 4. The `manageCriticalPath()` function verifies if the addition delay added to the current critical path exceeds $1/f$. If this is true, a register level is inserted automatically and the new critical path becomes equal the adder delay. Otherwise, the critical path is updated with the adder delay.

The `getCriticalPath()` parameter fed to the `IntConstMult` constructor represents the delay information at the multiplier input. The multiplier has to cope with the new information and adjust the delay of its first pipeline level accordingly.

The last line of code updates the critical path delay at the output of the constant multiplier. If the multiplier architecture ends in a register level, this value will be 0, otherwise this will be the delay of the circuit starting from the last register level in `IntConstMult`.

**E. Reality check:** $e^x$

This methodology has been checked on a complex arithmetic operator $e^x$ having the architecture depicted in Figure 6. A detailed description this architecture can be found in [19].
our framework is capable of generating all the necessary for size, using specialized operators etc. Once design is finished, for FPGA deployment by adjusting intermediate computation mathematical specification. The architecture is then optimized feature of FloPoCo. Arithmetic circuit design starts from its precision (QP) in the IEEE754-2000 FP standard [9]. From SP to DP and even up to the newly introduced quadruple operator instances can be generated for precisions ranging is flexible in precision. A large number of different exponential architecture is optimized for all these targets. Moreover, the circuit optimized. A whole set of FPGAs are currently supported proprietary MegaWizard Core Generator from Altera is highly operator yield better results for Altera FPGAs, where the arithmetic optimizations allowed by FloPoCo makes this 52 levels for 400MHz to to 23 for 200MHz.

Testing the designed arithmetic circuits is an essential feature of FloPoCo. Arithmetic circuit design starts from its mathematical specification. The architecture is then optimized for FPGA deployment by adjusting intermediate computation size, using specialized operators etc. Once design is finished, our framework is capable of generating all the necessary for testing the implementation results to a series of stimuli against the response of the mathematical model to that set of stimuli.

In order to get testing support, each operator has to be annotated with an emulate() function. The purpose of this function is to emulate the behavior of the mathematical specification of the circuit. Say for instance the circuit computes \( e^x \) in floating-point having the architecture presented in figure 6. The emulate() function does not mimic this complex architecture, increasing the probability of doing the same errors in software as in hardware, but mimics the specification: for a given floating-point number \( x \) the output of the value is \( e^x \) for the given precision. The code of the emulate function for the \( e^x \) function is presented in Figure 8.

As it can be clearly seen in Figure 8, the specification testing, creates a test-bench which tests errors in software as in hardware, but mimics the specification: for a given floating-point number \( x \) the output of the value is \( e^x \) for the given precision. The code of the emulate function for the \( e^x \) function is presented in Figure 8.

Test-bench suites can then be generated for all operators implementing the emulate function. By default, FloPoCo uses a basic random-number generator to generate input values in the emulate function. The output value(s) (faulty rounding allows two possible results, \( \text{fpfr\_exp}(rd, x, \text{GMP\_RNDU}) \) for this input is then computed and packed into the the test-case ( \( tc\_\text{-addExpectedOutput}("R", \text{svRD}) \)). A list of test-cases form a test-bench. Exhaustive testing, also called soak testing, creates a test-bench which tests all the possible combinations of the inputs. This gives the absolute guarantee that an operator is fully compliant to the
specifications. Unfortunately, this type of testing is restricted to only a hand-full of precisions. Generally, it is feasible to test only a small fraction of the total number of tests. Therefore, the problem boils down to choosing the test-vectors which best test the given operator.

For some operators such as fixed-point +, *, floating-point *, the test-vectors can be generated using the classical random-number generators. The probability of testing all the data-paths of the circuit suffices. Other floating-point operations are more sensitive:

- +. The architecture usually consists of two main data-paths, one for the case when the difference in exponents is $\in \{-1, 0, 1\}$. The probability of generating a test-vector which tests this data-path using an random-number generator with a uniform distribution is approximatively 1/85 for single-precision and 1/706 for double-precision.

- $e^x$. The exponential returns zero for input numbers smaller than $\log(2(2^{wF-1} - 1))$, and should return $+\text{inf}$ for all inputs larger than $\log((2-2^{-wF} \cdot 2^{wF-2} - 1))$. In single precision the set of input numbers on which a computation will take place is just [88.03, 88.72]. In addition, as for small $x$ we have $e^x \approx 1 + x + x^2/2$, the exponential will return 1 for all the input $x$ smaller that $2^{-wF - 2}$. One consequence is that the testing of a floating-point exponential operator should focus on the this range of the input. More details can be found in [19].

FloPoCo offers the possibility of overriding the default behavior of of filling the test-cases using random-numbers having a uniform distribution. The corresponding function for $e^x$ if given in Figure 9. This new version of the random generator function generates $1/8$ truly random inputs, and for the rest of $7/8$ the exponent is generated such that $x \in [X_{\min}, X_{\max}]$. 

---

<table>
<thead>
<tr>
<th>Precision</th>
<th>FPGA</th>
<th>Tool</th>
<th>Performance (MHz)</th>
<th>Logic Usage (A1LUs</th>
<th>Reg.</th>
<th>Slice</th>
<th>DSPs</th>
<th>Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>(8,23)</td>
<td>StratixIII</td>
<td>Altera MegaWizard</td>
<td>274</td>
<td>17</td>
<td>527</td>
<td>900</td>
<td>-</td>
<td>19 18-bit elem.</td>
</tr>
<tr>
<td></td>
<td>ours</td>
<td>391</td>
<td>6</td>
<td>852</td>
<td>374</td>
<td>-</td>
<td>2 18-bit elem.</td>
<td>2 M9K</td>
</tr>
<tr>
<td></td>
<td>405</td>
<td>7</td>
<td>519</td>
<td>382</td>
<td>-</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>(11,52)</td>
<td>VirtexIV</td>
<td>Altera MegaWizard</td>
<td>213</td>
<td>25</td>
<td>2941</td>
<td>1476</td>
<td>-</td>
<td>58 18-bit elem.</td>
</tr>
<tr>
<td></td>
<td>ours</td>
<td>327</td>
<td>29</td>
<td>1307</td>
<td>3757</td>
<td>-</td>
<td></td>
<td>10 DSP48E1</td>
</tr>
<tr>
<td></td>
<td>256</td>
<td>15</td>
<td>1437</td>
<td>1984</td>
<td>-</td>
<td>22 18-bit elem.</td>
<td>10 M9K</td>
<td></td>
</tr>
<tr>
<td></td>
<td>187</td>
<td>45</td>
<td>2454</td>
<td>2300</td>
<td>1579</td>
<td>14 DSP48E1</td>
<td>5 BRAM</td>
<td></td>
</tr>
<tr>
<td></td>
<td>ours</td>
<td>395</td>
<td>69</td>
<td>8071</td>
<td>7725</td>
<td>-</td>
<td>71 DSP48E1</td>
<td>123 BRAM</td>
</tr>
</tbody>
</table>

---

Fig. 7. Component hierarchy for $e^x$ DP, on a Virtex4 for frequency left) $f = 400MHz$, right) $f = 200MHz$
where $X_{\text{min}} = 2^{-E_0}$ and $X_{\text{max}} = (2 - 2^{-w_F}) \cdot 2^{w_E} - 1 - E_0$.

For the case of the floating-point addition one could decide that testing the two data-paths with the same probability suffices. Implementing this change is trivial, but might not be enough. Consider the extreme case $X + (-X)$. This causes a massive cancellation of the mantissas and is therefore a difficult case to cover. Probabilistically, this has a $1/2^{w_F}$ chances of happening with a uniform distribution.

In order to capture all these corner-cases, FloPoCo allows manually defining a set of standard test-cases which make it possible to specify the extreme cases.

G. Framework extensions

1) Managing feedback loops: Up to this point we have constrained our definition of arithmetic operator to functions. In fact, the current implementation of FloPoCo can also manage feedback loops. This is especially important as the accumulation circuit which falls in this category is considered by many the 5th basic operation. The subtlety in this case is using a signal which may be declared cycles later. Say for example that the accumulation circuit takes 5 pipeline stages. The result signal of this accumulation is declared only at cycle 5 in the design, however, this input needs to be fed back to the first cycle, at the accumulator input. As using a signal cycles before it’s declared leads to errors in designs not having feedback loops, at circuit generation time our framework signals to the user, as a warning, the signals having this property. If indeed the signals are feedback signal this may be ignored. Otherwise, the described circuit may not be what the user planed for.

2) An extension tool for VLSI ALU design: The initial purpose of FloPoCo was to provide a flexible environment for describing purely arithmetic operators for FPGAs. Nevertheless, FloPoCo may be extended so to be used in VLSI ALU design. The extension is in fact a simplification of all basic components for the VLSI target. The VHDL code generated for the basic operators will simply be “+,-,*”.

FloPoCo will be used perform and initial pipelining of the ALU. The code generated will then be passed through VLSI specific tools which replace the “+,-,*” operators by VLSI-specific instantiations and perform register retiming.

3) Backend for HLS: Our framework can also be used as a beck-end for high-level synthesis as it offers an important basis of arithmetic operators optimized for different types of contexts. The tool itself is open-source and extensible allowing an on-demand update of the available operator basis. This is as flexible as being able to add a new instruction to the instruction set of a microprocessor. Work is undergoing in experimenting in this research direction.

IV. Conclusion

With the features provided by the FloPoCo framework, flexible arithmetic datapath design for FPGAs is finally ready for prime-time. The important operator basis together with the novel methodology in pipelining allows developing state-of-the-art arithmetic operators with a productivity never before encountered. Mathematically oriented test-bench generation support allows a better validation of designed circuits in a shorter time. Synthesis results confirm the flexibility and performance of the operators developed using our framework. Preliminary efforts confirm the possible extensions for the framework for VLSI ALU design. Moreover, work is currently undergoing in linking FloPoCo as a back-end for HLS.

use section* for acknowledgement

ACKNOWLEDGMENT

This work was partly supported by the XtremeData university programme, the ANR EVAFlo and TCHATER projects, and StoneRidge Technology.

REFERENCES


