(M, p, k)-Friendly Points: A Table-based Method to Evaluate Trigonometric Function
Dong Wang, Jean-Michel Muller, Nicolas Brisebarre, Milos Ercegovac

To cite this version:
Dong Wang, Jean-Michel Muller, Nicolas Brisebarre, Milos Ercegovac. (M, p, k)-Friendly Points: A Table-based Method to Evaluate Trigonometric Function. IEEE Transactions on Circuits and Systems Part 2 Analog and Digital Signal Processing, Institute of Electrical and Electronics Engineers (IEEE), 2014, 61 (9), pp.711-715. 10.1109/TCSII.2014.2331094 . ensl-01001673

HAL Id: ensl-01001673
https://hal-ens-lyon.archives-ouvertes.fr/enl-01001673
Submitted on 5 Jun 2014

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
(\(M, p, k\))-Friendly Points: A Table-based Method to Evaluate Trigonometric Function

Dong Wang, Jean-Michel Muller, Nicolas Brisebarre, and Miloš D. Ercegovac

Abstract—Linear (order-one) function evaluation schemes, such as bipartite and multipartite tables, are usually effective for low precision approximations. For high output precision, the lookup table size is often too large for practical use. This study investigates the so-called \((M, p, k)\) scheme that reduces the range of input argument to a very small interval so that trigonometric functions can be approximated with very small lookup tables and a few additions/subtractions. An optimized hardware architecture is presented and implemented in both FPGA device and standard cell based technology. Experimental results show that the proposed scheme achieves more than 50\% reduction in total chip area compared with the best linear approach for 24-bit evaluation.

Index Terms—Bipartite table, trigonometric function evaluation, field-programable gate array (FPGA).

I. INTRODUCTION

Trigonometric functions are extensively used in computer graphics, digital signal processing, communication systems, and robotics, to name a few fields of application. Hardware function generators are usually desirable because of their major advantages in speed over software solutions and their potential to save power by avoidance of the use of hundreds of general-purpose instructions.

In 1984, Sunderland et al. [3] considered approximating a 12-bit sine function in hardware with a input argument \(x\) less than \(\pi/2\) with the use of tables. They proposed to evenly split \(x\) (in binary form) into three 4-bit sub-words, i.e., \(x = x_0 + x_1 + x_2\), where \(x_0 < \pi/2\), \(x_1 < 2^{-4}\pi/2\) and \(x_2 < 2^{-8}\pi/2\), and use the following equation

\[
\sin(x_0 + x_1 + x_2) \approx \sin(x_0 + x_1) + \cos(x_0)\sin(x_2) \tag{1}
\]

By doing so, instead of using one large table with 12 address bits, two small tables that each has 8 address bits are needed: one for \(\sin(x_0 + x_1)\) and one for \(\cos(x_0)\sin(x_2)\). In 1995, Das Sarma and Matula [4] introduced a new scheme to evaluate reciprocal functions. Two tables are addressed in parallel by the partitioned input argument. The outputs are then summed and rounded to the desired accuracy. This scheme is called the bipartite method. Later, Schulte and Stine [5] generalized such a scheme to other elementary functions and termed it the symmetric bipartite table method (SBTM), which also generalizes the scheme of Sunderland et al. Compared with straightforward tabulation, the bipartite approach uses only two tables with \(2q\) address bits instead of one table with \(3q\) address bits (assuming that the input argument is split into three \(q\)-bit sub-words).

An enhanced scheme was then proposed by Dinechin and Tisserand [6], who divided the input argument into more than three sub-words, so that multiple small tables can be built. The scheme is thereby named as multipartite table-lookup method (MTM). Both SBTM and MTM are linear (order-one) approximations that possess an inherent limitation: to evaluate functions with high precision, the input argument needs to be reduced to very small values [2]. Very recently, Brisebarre, Ercegovac and Muller [1] proposed a scheme called \((M, p, k)\)-friendly points to approximate trigonometric functions by using two small multipartite tables and only a few additions. However, no analysis of the error bound or practical hardware implementations were conducted to evaluate the effectiveness of this scheme. Low and Jong [2] revisited the tables-and-additions strategy and successfully proposed an integrated add-table lookup-add (iATA) approach, which could save 20\% to 60\% of memory compared with MTM. Other approaches, such as LUT cascades [7] and order-two piecewise polynomial approximation and interpolation approaches [8] are also studied in the literature.

In this study, we investigate and develop implementations of the \((M, p, k)\) scheme. Section II introduces the proposed algorithm, rectify prior incorrect definitions and performs an analysis of the error bound. Section III describes an optimized hardware architecture that efficiently implements the proposed algorithm. Section IV provides the FPGA and ASIC based implementation results. Section IV concludes the paper.

II. THE \((M, p, k)\) SCHEME

A. \((M, p, k)\)-Friendly Points and Angles

Given an \(n\)-bit input angle \(x\) in the range of \([0, \pi/2]\), we aim to evaluate the trigonometric functions \(\sin(x)\) and \(\cos(x)\) with \(p\)-bit output precision. Assuming that \(\hat{x}\) approximates \(x\) (we will see later how \(\hat{x}\) is obtained), we define \(\theta = x - \hat{x}\). When \(\theta\) is much smaller than \(\pi/2\), \(\sin(\theta)\) and \(\cos(\theta)\) can be efficiently approximated with high accuracy with the use of the aforementioned table-and-addition-based schemes. We can then obtain \(\sin(x)\) and \(\cos(x)\) in the form of

\[
\sin(x) = \sin(\hat{x}) \cdot \cos(\theta) + \cos(\hat{x}) \cdot \sin(\theta) \tag{2}
\]

\[
\cos(x) = \cos(\hat{x}) \cdot \cos(\theta) - \sin(\hat{x}) \cdot \sin(\theta)
\]

Instead of directly implementing (2) in hardware with the use of four multipliers, the \((M, p, k)\) scheme uses a different
TABLE I
The First and Last Entries of Table T0 for $n = p = 24$, $m = 8$, $k = 7$ and $r = 7$. 

| Index | $x_0,x_1 \cdots x_7$ | $a$ | $b$ | $\hat{x}_T$ | $|\hat{x}_T - x_0,x_1 \cdots x_7| | \cdot 10^d$ |
|-------|-----------------|---|---|------------|-----------------|
| 0     | 00000000        | 256| 1  | 3.9062e+003 | 1.9868e+008     |
| 1     | 00000001        | 256| 3  | 1.1718e+002 | 5.3639e+007     |
| 2     | 00000010        | 468| 9  | 1.9228e+002 | 3.0285e+004     |
| ...   | ...             | ...| ...| ...        | ...             |
| 200   | 11001000        | 2  | 481| 1.5666e+000 | 2.3209e+004     |
| 201   | 11001001        | 0  | 1  | 1.5708e+000 | 3.4224e+003     |

is denoted as $D_r$. If $D_r$ is smaller than $2^{-r-1}$, then all the $x$ within that region will be at a distance

$$D_r + 2^{-r-1} \leq 2^{-r}$$

from $\hat{x}_T$. Consequently, the value of $\theta = x - \hat{x}_T$ has an absolute value less than $2^{-r}$. If there are no such $\hat{x}_T$ found, $M$ and $k$ will be enlarged and another round of search is performed. As long as $r$ is sufficiently large, we can build small bipartite tables to generate $\sin(\theta)$ and $\cos(\theta)$ with high output precision. Table I provides a numerical example of the selected $(M,p,k)$-friendly points and angles for $n = p = 24$, $m = 8$, $k = 7$ and $r = 7$. The search process, performed off-line, has a time complexity of $O(M^2)$ for given $r$.

C. Computation Steps of the Algorithm

Assuming that the correct $(M,p,k)$-friendly points $(a,b)$ and angles $\hat{x}_T$ are stored in a lookup table T0, the following computation steps are used to implement (2):

1) Subtraction of $\theta = x - \hat{x}_T$ is performed, and the values of $\sin(\theta)$ and $\cos(\theta)$ are looked up in two bipartite tables;

2) The following is computed:

$$S = b \cdot \cos(\theta) + a \cdot \sin(\theta)$$

$$C = a \cdot \cos(\theta) - b \cdot \sin(\theta)$$

Assuming that $a$ and $b$ are stored in radix-4 with a digit set $\{-2, -1, 0, 1, 2\}$, each multiplication can be implemented in $\lceil m/2 \rceil$ additions/subtractions;

3) $S \cdot z$ and $C \cdot z$ are performed to implement (2). Because $z$ is in the canonical representation and the number of nonzero bits is bounded by $k$, the two multiplications can also be simplified to $2(k+1)$ additions/subtractions.

We note that the efficiency of the proposed scheme relies on how small $\theta$ can be for not-too-large values of parameters $M$ and $k$. Generally, for a given $n$ and $p$, multiple pairs of parameters can be selected. For instance, we list several possible values of the parameters for $n = 24$ in Table II. The optimal choice of the parameters is determined by the hardware cost and performance. A scheme is proposed in Section III-B to address this design issue.

D. Error Bound

By denoting $\hat{z}$, $\hat{\sin}(\theta)$ and $\hat{\cos}(\theta)$ as the exact values of $1/\sqrt{a^2 + b^2}$, $\sin(\theta)$ and $\cos(\theta)$, respectively, three truncation errors can be expressed in $e_z = (z - \hat{z})$, $e_a = |\hat{\sin}(\theta) - \sin(\theta)|$ and $e_b = |\hat{\cos}(\theta) - \cos(\theta)|$. The resulting error bounds are $e_z \leq 2^{-r}$ and $e_a,e_b \leq 2^{-2r}$.
and \( e_c = [\cos(\theta) - \cos(\theta)] \). Then, the total approximation error entailed by the proposed algorithm is bounded by
\[
|\cos(x) - \cos(z)| = \left| (\tilde{z} - z) [a \cos(\theta) - b \sin(\theta)] \right|
\]
\[
+ [az(\cos(\theta) - \cos(\theta)) + b(a \sin(\theta) - \sin(\theta))] \leq \sqrt{a^2 + b^2} \cdot \tilde{z} = \sqrt{(\tilde{z} - z)(\tilde{z} + z)} < 2^m \sqrt{2} \cdot |e_c| + |e_c| + |e_s|
\]
where \( \tilde{z} \) is the exact value of the final result obtained by infinite precision computations. From (3), we already know that \( |e_c| < 2^{-(p+m+3)} \). If both outputs of the sine and cosine bipartite tables have a maximum absolute error of \( 2^{-p+2} \), the absolute value of the total error is bounded by \( 2^{-p} \). Therefore, the precision of the results is only determined by the parameters \( n \) and \( p \).

### III. HARDWARE ARCHITECTURE

#### A. Proposed Architecture

Fig. 1 shows the top-level hardware architecture that implements the proposed algorithm. Table T0 stores the precomputed \((M, p, k)\)-friendly points \((a, b)\), angles \( \hat{x}_T \), and associated \( z \). SBT0 and SBT1 are two small bipartite lookup tables that generate \( \sin(\theta) \) and \( \cos(\theta) \).

In Table T0, the \( m \)-bit integers \( a \) and \( b \) are recoded in radix-4 with digit set \([-2, -1, 0, 1, 2]\). To reduce the area of shifter units, we introduce another constraint on \( z \), and only those \( \hat{x}_T \)'s whose associated \( z \) satisfies this extra condition are tabulated. Fig. 2 shows that \( z \) is divided into the lower \((|p + m + 2|/2)\)-bit and higher \((|p + m + 2|/2)\)-bit sub-words, and the new constraint requires that the number of nonzero \( z_i \)'s in the two sub-words is not larger than \( \lfloor k/2 \rfloor \) and \( \lfloor k/2 \rfloor \), respectively. Usually, a new round of search for \( \hat{x}_T \) should be performed. The value of \( m \) might need to be enlarged to find a sufficient number of \( \hat{x}_T \)'s.

After recoding, \( z \) has \( k + 1 \) fields. The first field (Field-0) has \( \log_2(e) \) bits and is reserved for the leading “1”. The other fields either have \( \lfloor \log_2 (p + m + 2) \rfloor + 1 \) bits or \( \lfloor \log_2 (p + m + 2) \rfloor + 1 \) bits, in which one sign bit is added at the most significant bit. Those fields that have no corresponding nonzero \( z_i \) are filled with all “1”, which represents multiplication by zero.

The multiple generation module (MGEN) generates \( \lfloor m/2 \rfloor \) “multiples” by implementing a simple logic as described in [11]. The multi-operand adder (MOPADD) has a commutative tree structure based on [3:2] carry-save adders when standard cell based implementations are targeted. For FPGA implementation, an [N:3] redundant adder structure is proposed in this study to efficiently utilize the carry-propagate adders (CPAs) and ternary adder structures on FPGAs with dedicated carry-chains that can provide carry propagation by more than one order of magnitude faster compared with the use of general logic resources [14]. Fig. 3 shows an example of a [9:3] compressor for the proposed redundant adder structure. The compressor consists of two diagonally deployed linear arrays (only one is shown in Fig. 3) of CPAs. The structure can be efficiently mapped into two arrays of CPAs, in which the fast carry-chain (blue lines) is efficiently utilized. The proposed redundant adder significantly reduces the critical path delay of the proposed architecture.

The right-shift barrel shifter (BSFT) has an internal structure that merges two consecutive levels of 2-to-1 shift operations into a single stage [15]. Efficiently mapping each stage into 4-to-1 multiplexors can reduce critical path delay without increasing the shifter area. Sign extension is performed after shifting.

#### B. Selecting the Optimal Parameters

From Section III-A, we know that the total size of table T0 and two bipartite tables can be estimated by
\[
S_{T0} = 2^{r+1} \cdot \left\lceil \frac{m}{2} \cdot 6 + p + \log_2 (p + m + 2) \cdot k \right\rceil
\]
\[
S_{SBT} = 2^{2(n-1-r)/3} + 1 \cdot \left\lfloor \left( (p + 4) + p/4 \right) \right\rfloor
\]
where \( S_{T0} \) and \( S_{SBT} \) denote the memory cost in bits. Applying the precomputed parameter values in Table II, we draw the diagram of Fig. 4 to illustrate the variation trend of the lookup table size with parameter \( r \). An optimal value for \( r \) to minimize the memory usage can be observed.

<table>
<thead>
<tr>
<th>( M )</th>
<th>( 5 )</th>
<th>( 6 )</th>
<th>( 7 )</th>
<th>( 8 )</th>
<th>( 9 )</th>
<th>( 10 )</th>
<th>( 11 )</th>
</tr>
</thead>
<tbody>
<tr>
<td>128</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>na</td>
<td>na</td>
<td>na</td>
<td>na</td>
</tr>
<tr>
<td>256</td>
<td>7</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>na</td>
<td>na</td>
<td>na</td>
</tr>
<tr>
<td>512</td>
<td>6</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>na</td>
<td>na</td>
</tr>
<tr>
<td>1024</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>na</td>
</tr>
<tr>
<td>2048</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>7</td>
<td>8</td>
<td>9</td>
</tr>
</tbody>
</table>

**TABLE II**

VALUES OF PARAMETERS \( k, M \) AND \( r \) FOR \( n = 24, p = 24 \).
Similarly, one can estimate the area of the multi-operand adders and shifters by

\[
S_{\text{ADD}} = \left[2\left\lfloor m/2 \right\rfloor - 2\right] \cdot (p + m + 2) \cdot NR
\]

\[
S_{\text{SFT}} = \left[\frac{m}{2}\right] \cdot (p + 2) + 2 \cdot \log_2(e) \cdot (p + m + 2) + \left[\log_2(p + m + 2)\right] \cdot (p + m + 2) \cdot k
\]

where \(S_{\text{ADD}}\) is calculated in numbers of FA used, whereas \(S_{\text{SFT}}\) is counted in numbers of 2-to-1 multiplexors. \(NR\) denotes the ratio of the area of one 2-to-1 multiplexor to that of one FA. In this study, we assume that \(NR = 1.6\) for ASIC and \(NR = 1\) for FPGA implementation. After the optimal \(r\) is selected, another diagram can be drawn to find the optimal \(m\). Usually, multiple local optimal values can be considered. For the case of \(n = 24\), they are \(m = 7, 9, 11\). This diagram is not presented because of limited space.

The critical path delay of the second pipeline stage is substantially affected by the value of \(m\), whereas that of the third stage is determined by \(k\). Table II shows that the value of \(k\) gradually increases as \(m\) decreases. Therefore, in this study, we select the pair of \(m\) and \(k\) that minimize the value of \(m + k\) (i.e., \(m = 9, k = 7\) for the presented instance).

### IV. Implementation Results

The proposed hardware architecture was written in RTL-level System Verilog. Functional simulations were performed with Cadence NC-sim 5.4. The verified designs were implemented in two technologies: 1) Virtex-II XC2V1000-FG456-5 FPGA device synthesized and fitted with the ISE 8.1i tool\(^4\) and 2) TSMC 65nm CLN65G standard cell library synthesized by Synopsys Design Compiler C-2010.03-SP1, placed-and-routed by SoC Encounter 10.1.

Table III first compares the estimated hardware costs of the proposed design with three different schemes, namely, the multipartite table (MTM [6]), piecewise table lookup approximation (Sasao [7]) and order-two interpolation (Lee [8]). The memory resource is counted in bit, while other supporting arithmetic/logic units are measured in the numbers of FA. The proposed design is observed to achieve 79% and 73% reduction in memory resource compared with [6] and [7] for 24-bit evaluations, respectively. The increase in logic resources are 8.2× and 1.5×. Because of the degree-2 interpolation algorithm adopted, the scheme of [8] can further compress the lookup tables used. However, due to the large multipliers used, the growth in logic resources is significant. More accurate comparison of the total circuit area are made after the designs are mapped on a specific technology.

Two design instances with precisions \(n = 16\) and \(n = 24\) were finally implemented. Tables IV and V report the implementation results. To enable direct comparison architectures that do not utilize lookup tables, all ROMs were synthesized into pure combinational logic cells (ASIC) or distributed ROM structures (FPGA) as suggested by [8]. The MTM designs were generated in VHDL by using the program provided by [12] and implemented in the same technologies. A balanced two-stage pipeline structure is implemented such that each stage has a similar delay to the proposed design.

Both tables show that the total area of the proposed design is almost two times larger than that of MTM for 16-bit evaluation. It is revealed that, when using the proposed algorithm to perform low precision evaluations, reducing the lookup table size does not satisfactorily compensate for the hardware cost introduced by the extra arithmetic operations. For 24-bit evaluation, the advantage of the proposed architecture is obvious: it achieves more than 63% reduction in the total circuit area over MTM for both pipelined and non-pipelined FPGA based implementations. It is also observed that the CPAs in the proposed architecture introduces a 42% longer critical latency. However, a simple pipelining scheme enables both scheme working at a similar frequency at the cost of a few registers. For ASIC designs, the area reduction is 51%, and the

---

\(^4\)The old device and compilation tool were selected for a fair comparison with the reference designs [2], [9]. Optimization goal and effort were set as “Speed” and “High”, respectively.
latency gap also decreases to 17%. Our approach consumes a 42% lesser chip area than the newly reported iATA [2] scheme, but their critical path delays are close.

CORDIC-based designs are compared with the proposed scheme in Table IV. Standard parallel CORDIC structures are implemented with the Xilinx IPcore generator [13]. For both 16- and 24-bit evaluations, the proposed design achieves more than 70% reduction in the critical path delay. The main contribution comes from the optimized multi-operand redundant adder structure presented in this paper. Compared with the radix-4 CORDIC design [9], our proposed architecture is both superior in critical path delay and total circuit area. Because of the limited information, however, we cannot directly compare with the schemes presented in [7] and [8].

V. CONCLUSION

This brief has presented a new scheme to compute sine and cosine functions. The main contribution of our work includes rectifying prior incorrect definitions, conducting an analysis on the error bound and developing an optimal hardware architecture that efficiently implements the proposed algorithm. The 16- and 24-bit design instances are coded in Verilog HDL and mapped on Xilinx FPGA and 65nm standard cell based technology. Comparison of our proposed work with multiparite and CORDIC-based designs shows that a considerable reduction in chip area and critical path delay is achieved for 24-bit evaluations.

REFERENCES