Fastest way to compute Arctan2 on an FPGA

Question

Rip Van Winkle here -- is the fastest and least-footprint way to compute an arctan on an FPGA still to use CORDIC?

Or is there a way to leverage block RAM and DSP blocks to speed up the process and/or reduce its size?

Have you seen [this question](https://dsp.stackexchange.com/q/28814/4298)? — Matt L., Feb 01 '22 at 16:10
I hadn't -- my GoogleFu is abysmal, so even though I tried searching for it I still failed. — TimWescott, Feb 01 '22 at 16:33
[This](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6928950/) is an interesting article. It suggests a rational approximation combined with a LUT. But more interestingly, it compares several approaches from the literature. — Matt L., Feb 01 '22 at 19:34
Wow, that article is really great @MattL. I almost think it's worth posting as a (short) answer rather than leaving it buried in the comments. Understood that it is just a link, but it's very salient and full of good info for future reference. — Dan Boschen, Feb 01 '22 at 23:16
Florent de Dinechin and Matei Istoan, "Hardware implementations of fixed-point Atan2", In *22nd IEEE Symposium on Computer Arithmetic*, June 2015, pp. 34-41: "This work essentially focuses on FPGAs. An unexpected result is that, even on modern FPGAs enhanced with DSP blocks and memories, CORDIC is a clear winner." — njuffa, Feb 02 '22 at 08:59
@njuffa I haven’t read the article yet (and looks good) but I wouldn't expect there to be any contest in favor of the CORDIC if speed is the primary concern. — Dan Boschen, Feb 02 '22 at 13:14

score 15 · Answer 1 · edited Feb 01 '22 at 20:01

If you're willing to tolerate a possible absolute error of 0.26 degrees, you could use the following (from Chapter 13 of my "Understanding Digital Signal Processing" book):

The product 0.28125$Q^2$ is equal to (1/4+1/32)$Q^2$, so you can implement the product by adding $Q^2$ shifted right by two bits to $Q^2$ shifted right by five bits.
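The shift-and-add product can be checked with a short Python sketch. The formula itself is not reproduced in the text here, so `atan_approx` assumes the commonly published form $\theta \approx IQ/(I^2 + 0.28125\,Q^2)$ radians for $|Q| \le I$. That form and both function names are assumptions made for illustration, not a quotation from the book.

```python
import math

def times_0p28125(q2: int) -> int:
    """Multiply a non-negative integer by 0.28125 = 1/4 + 1/32 using two shifts and an add."""
    return (q2 >> 2) + (q2 >> 5)

def atan_approx(i: int, q: int) -> float:
    """ASSUMED form of the approximation: I*Q / (I^2 + 0.28125*Q^2), in radians.
    Intended for I > 0 and |Q| <= I, i.e. -45 to +45 degrees."""
    return (i * q) / (i * i + times_0p28125(q * q))

# Sweep the -45..+45 degree range with a fixed I and record the worst error in degrees.
I_FIXED = 4096
worst_deg = max(
    abs(math.degrees(atan_approx(I_FIXED, q) - math.atan2(q, I_FIXED)))
    for q in range(-I_FIXED, I_FIXED + 1)
)
print(f"worst-case error over the sweep: {worst_deg:.3f} degrees")
```

The only division left is the final one; the 0.28125 scaling costs no multiplier.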

Here is the error curve, over the range of -45 degrees to +45 degrees

Answered Feb 01 '22 at 07:16 by Richard Lyons; edited Feb 01 '22 at 20:01 by lennon310.

The division will be quite slow on an FPGA. – jpa Feb 01 '22 at 15:57
Unless you use CORDIC I don't think there's a way to avoid a divide with an atan2. – TimWescott Feb 01 '22 at 16:12
@TimWescott You can use a Maclaurin series to approximate the division. But I have (in another answer) an approximation which avoids the problem. – Graham Feb 01 '22 at 16:18
@Graham: In terms of $x$ your method doesn't need a division, but with a power series approximation you'll always have a division if you're only given $I$ and $Q$ values. I think that is what Tim refers to. – Matt L. Feb 01 '22 at 19:37

score 11 · Answer 2 · edited Feb 02 '22 at 15:07

For some work fairly recently, I had a need for fast and accurate trig approximations which were at least C1 and ideally C2 continuous, because discontinuities in the first and second differentials were undesirable for our application. I fitted polynomials to achieve this, working off a basic idea by OlliW.

For arctan, as with @RichardLyons' answer, I calculated this for octants 1 and 8 (to avoid having to fit a curve that tends to infinity). The standard trig identities then apply for using this in other octants. I normalised the output to fractions of a full turn, so $y = \pm 0.125$ corresponds to $\pm 45$ degrees, and converting to degrees or radians needs only one extra multiplication. The available fits are then

$y = 0.157153894 x - 0.041596524 x^3 + 0.00944263 x^5$
$y = 0.159154943 x - 0.051601768 x^3 + 0.023449972 x^5 - 0.006003146 x^7$
$y = 0.159154943 x - 0.052424634 x^3 + 0.027564301 x^5 - 0.011763207 x^7 + 0.002468598 x^9$

In the units of $y$ above, worst-case errors are (respectively) 3.45e-4, 5.96e-5 and 1.10e-5. Multiply by 360 for the error in degrees, or by $2\pi$ for the error in radians: that's a worst-case error of about 0.12 degrees for the 5th-order fit.

This can be calculated effectively with an FPGA, using (respectively) 4, 5 or 6 cycles around an "$(A*B)+C$" DSP slice, or with a DSP.
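A minimal Python sketch of that evaluation order, using Horner's rule in $x^2$ so that every step is a single multiply-add of the $(A*B)+C$ kind. The names `FITS`, `atan_poly`, `mac` and `worst_error` are hypothetical, and floating point stands in for the fixed-point arithmetic a real DSP slice would use. As listed, the coefficients return 0.125 at $x = 1$, so the reference used for the error sweep is $\tan^{-1}(x)/(2\pi)$.

```python
import math

# Coefficients of x, x^3, x^5, ... exactly as listed for each fit order.
FITS = {
    5: (0.157153894, -0.041596524, 0.00944263),
    7: (0.159154943, -0.051601768, 0.023449972, -0.006003146),
    9: (0.159154943, -0.052424634, 0.027564301, -0.011763207, 0.002468598),
}

def atan_poly(x, order=5):
    """Evaluate the fit for -1 <= x <= 1; return (y, number of multiply-add steps)."""
    macs = 0

    def mac(a, b, c):
        # One pass through an (A*B)+C block.
        nonlocal macs
        macs += 1
        return a * b + c

    coeffs = FITS[order]
    x2 = mac(x, x, 0.0)                  # x^2
    acc = coeffs[-1]                     # start from the highest-order coefficient
    for c in reversed(coeffs[:-1]):
        acc = mac(acc, x2, c)            # acc = acc * x^2 + next coefficient
    return mac(acc, x, 0.0), macs        # final multiply by x restores the odd powers

def worst_error(order, steps=2000):
    """Largest |fit - atan(x)/(2*pi)| on a uniform grid over -1..1."""
    return max(
        abs(atan_poly(k / steps, order)[0] - math.atan(k / steps) / (2 * math.pi))
        for k in range(-steps, steps + 1)
    )

for order in (5, 7, 9):
    y, macs = atan_poly(1.0, order)
    print(order, macs, y, worst_error(order))
```

The step counter comes out at 4, 5 and 6 for the three fits, one step for $x^2$, one per remaining coefficient, and one for the final multiply by $x$.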

Note that the nature of the equations as odd functions gives us the following for free:-

${\tt atan}(-x) = -{\tt atan}(x)$
${\tt atan}(0) = 0$
$\left.\frac{d^2y}{dx^2}\right|_{x=0} = 0$

For anyone wanting to derive this themselves, this comes from solving simultaneous equations where:-

For 5th order, ${\tt atan}$ values are correct at 30 degrees and 45 degrees, and the 1st differential is correct at 45 degrees. This makes the fit C1 continuous when mapping to other octants.
7th order adds that the 1st differential is correct at zero.
9th order adds that the 2nd differential is correct at 45 degrees. This makes the fit C2 continuous when mapping to other octants.
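
As a sketch of that derivation for the 5th-order case, the three conditions form a 3x3 linear system in the coefficients of $x$, $x^3$ and $x^5$. The target function used here is $\tan^{-1}(x)/(2\pi)$, which is an assumption inferred from the listed coefficients (they sum to 0.125 at $x = 1$); `solve` is a small hypothetical Gaussian-elimination helper written out so the example needs no external library.

```python
import math

def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                      # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

TURN = 2 * math.pi                   # assumed output scaling: one full turn = 1.0
t30 = math.tan(math.radians(30))

rows = [
    [t30, t30**3, t30**5],           # value of the fit at 30 degrees
    [1.0, 1.0, 1.0],                 # value of the fit at 45 degrees (x = 1)
    [1.0, 3.0, 5.0],                 # first derivative of the fit at x = 1
]
rhs = [
    math.radians(30) / TURN,         # atan is exact at 30 degrees
    math.radians(45) / TURN,         # atan is exact at 45 degrees
    1.0 / (1.0 + 1.0**2) / TURN,     # d/dx atan(x) = 1/(1 + x^2), evaluated at x = 1
]
c1, c3, c5 = solve(rows, rhs)
print(c1, c3, c5)
```

The higher-order fits follow the same pattern with one extra row per added condition.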

Dan Boschen · Answer 3 · 2022-02-01T12:42:48.013

To add to Richard's good answer please see this other post for additional estimators.

A CORDIC doesn't need many resources in an FPGA: just an I, a Q and a phase accumulator, plus a very small look-up table, and notably no multipliers. It's an iterative algorithm with an approximate precision of $2^{-N-3}$ radians for $N$ iterations, and the look-up table only needs to hold the $N$ rotation angles, one per iteration ($\tan^{-1}(2^{-n}),\ n = 0 \ldots N-1$). If the additional time for iteration is available, it is a viable candidate for implementation when resources and/or power are at a premium in dedicated hardware solutions (hence its prevalence in current use for Bluetooth receivers), especially if there are no dedicated multipliers available.

A high-level summary of the algorithm is shown below. For an ATAN2 result we operate the CORDIC in "Vectoring Mode": we use the CORDIC to rotate an unknown vector until the angle of the rotated vector is 0 degrees, as indicated by Q=0, and that way get both the angle from the phase accumulator and the magnitude from I. The CORDIC as depicted below has a range of +/-90°. Extending this to +/-180° is done by a +/-j rotation (which simply means swapping I and Q and changing the sign of one of them).
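
A minimal Python sketch of vectoring mode, offered as one possible rendering rather than a reference design. The names `N_ITER`, `ATAN_LUT`, `GAIN` and `cordic_atan2` are hypothetical, floats stand in for fixed-point registers, and the multiplications by `2.0 ** -k` stand in for the right shifts that hardware would use.

```python
import math

N_ITER = 16
# Look-up table: one rotation angle per iteration, atan(2^-k).
ATAN_LUT = [math.atan(2.0 ** -k) for k in range(N_ITER)]
# Accumulated CORDIC gain (about 1.647); only needed to rescale the magnitude.
GAIN = 1.0
for k in range(N_ITER):
    GAIN *= math.sqrt(1.0 + 4.0 ** -k)

def cordic_atan2(q, i):
    """Return (angle in radians, magnitude) of the vector I + jQ."""
    angle = 0.0
    if i < 0:
        # Pre-rotate by -j or +j to bring the vector into the right half plane.
        if q >= 0:
            i, q, angle = q, -i, math.pi / 2      # (I + jQ) * (-j) = Q - jI
        else:
            i, q, angle = -q, i, -math.pi / 2     # (I + jQ) * (+j) = -Q + jI
    for k in range(N_ITER):
        di, dq = q * 2.0 ** -k, i * 2.0 ** -k     # shifted copies of Q and I
        if q >= 0:
            i, q = i + di, q - dq                 # rotate clockwise, driving Q toward 0
            angle += ATAN_LUT[k]
        else:
            i, q = i - di, q + dq                 # rotate counter-clockwise
            angle -= ATAN_LUT[k]
    return angle, i / GAIN

theta, mag = cordic_atan2(1.0, -1.0)
print(math.degrees(theta), mag)
```

Each iteration needs only two shifts, three add/subtracts and one table read, which is why no multiplier is involved.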

The CORDIC is the algorithm of choice when time for successive iterations is available and there are no multipliers. If a complex multiplier is available, then consider simply using a small look-up table for the $N$ binary-weighted rotation angles $\pi/2^n$, which reaches any desired phase precision $\pi/2^N$ over $\pm \pi$. As with "Vectoring Mode" in a CORDIC, iterate over the $N$ steps, each time rotating in the direction given by the sign of $Q$ so as to drive $Q$ to zero while accumulating the phase. This gives higher precision in fewer steps because the rotations are truly binary weighted (unlike the CORDIC's), and it does not have the 1.647 CORDIC gain (if that was even an issue).
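
One reading of that multiplier-based scheme, sketched in Python. The function name `atan2_binary_rotations` is hypothetical, and the look-up table is assumed to hold the unit phasors $\cos(\pi/2^n) \mp j\sin(\pi/2^n)$; they are generated on the fly with `cmath.rect` here only for brevity.

```python
import cmath
import math

def atan2_binary_rotations(q, i, n_steps=16):
    """Return (angle in radians, magnitude) of I + jQ using one complex multiply per step."""
    z = complex(i, q)
    angle = 0.0
    for n in range(1, n_steps + 1):
        step = math.pi / 2 ** n                  # binary-weighted angle: pi/2, pi/4, ...
        if z.imag >= 0:
            z *= cmath.rect(1.0, -step)          # rotate clockwise by a unit phasor
            angle += step
        else:
            z *= cmath.rect(1.0, step)           # rotate counter-clockwise
            angle -= step
    # Unit phasors add no gain, so the real part is already the magnitude.
    return angle, z.real

theta, mag = atan2_binary_rotations(-1.0, -1.0)
print(math.degrees(theta), mag)
```

Because each step halves the remaining uncertainty, the residual angle after $N$ steps is at most $\pi/2^N$, and no pre-rotation is needed to cover the full circle.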

Also for consideration as an alternate to the CORDIC for hardware implementations is the BKM algorithm.

"and notably no multipliers" -- yes. But it's notable that just about any FPGA these days is sprinkled liberally with "DSP blocks" that implement "$(A \cdot B) + C$" in a hard-coded block. — TimWescott, Feb 01 '22 at 16:57
@TimWescott right- I wouldn’t see any good reason to implement a CORDIC in an FPGA fabric but may be attractive for a low end microcontroller or hard ASIC implementation for extremely low cost low power receivers (where you see their use today). In an FPGA I would consider the multiplier based option I described. — Dan Boschen, Feb 01 '22 at 17:20
