Rip Van Winkle here -- is the fastest and least-footprint way to compute an arctan on an FPGA still to use CORDIC?
Or is there a way to leverage block RAM and DSP blocks to speed and/or reduce the size of the process?
Rip Van Winkle here -- is the fastest and least-footprint way to compute an arctan on an FPGA still to use CORDIC?
Or is there a way to leverage block RAM and DSP blocks to speed and/or reduce the size of the process?
If you're willing to tolerate a possible absolute error of 0.26 degrees, you could use the following (from Chapter 13 of my "Understanding Digital Signal Processing" book):
The product 0.28125$Q^2$ is equal to (1/4+1/32)$Q^2$, so you can implement the product by adding $Q^2$ shifted right by two bits to $Q^2$ shifted right by five bits.
Here is the error curve, over the range of -45 degrees to +45 degrees
For some work fairly recently, I had a need for fast and accurate trig approximations which were at least C1 and ideally C2 continuous, because discontinuities in the first and second differentials were undesirable for our application. I fitted polynomials to achieve this, working off a basic idea by OlliW.
For arctan, as with @RichardLyons' answer, I calculated this for octants 1 and 8 (to avoid having to fit a curve that tends to infinity). The standard trig theorems then apply for using this in other octants. I normalised the output of the calculation to +/-1 = +/-45deg, so that scaling for degrees or radians simply requires an extra scaling. Then available fits are
For $y$ normalised to +/-1, worst-case errors are (respectively) 3.45e-4, 5.96e-5 or1.10e-6. In degrees (multiply $y$ by 45 for error in degrees, or by $\pi/4$ for error in radians), that's a worst-case error of 0.016 degrees for the 5th-order fit.
This can be calculated effectively with an FPGA, using (respectively) 4, 5 or 6 cycles around an "$(A*B)+C$" DSP slice, or with a DSP.
Note that the nature of the equations as odd functions gives us the following for free:-
For anyone wanting to derive this themselves, this comes from solving simultaneous equations where:-
For 5th order, ${\tt atan}$ values are correct at 30 degrees and 45 degrees, and the 1st differential is correct at 45 degrees. This makes the fit C1 continuous when mapping to other octants.
7th order adds that the 1st differential is correct at zero.
9th order adds that the 2nd differential is correct at 45 degrees. This makes the fit C2 continuous when mapping to other octants.
To add to Richard's good answer please see this other post for additional estimators.
There really isn't much resource requirements for a CORDIC in an FPGA: Just an I, Q and phase accumulator and a very small look-up-table and notably no multipliers. It's an iterative algorithm with an approximate precision of $2^{-N-3}$ radians for $N$ iterations, and the look-up-table size need is just the ATAN2 results for only $N$ angles corresponding to each iteration ($\pi/4^n, n=1\ldots N$). If the additional time for iteration is available, it would be a viable candidate for implementation if resources and/or power were a premium in dedicated hw solutions (hence it's prevalence in current use for Bluetooth receivers), especially if there are no dedicated multipliers available.
A high level summary of the algorithm is shown below where for an ATAN2 result we would operate the CORDIC in "Vectoring Mode" where we would use the CORDIC to rotate an unknown vector until the resulting angle of the rotated vector is 0 degrees as given by Q=0 and that way get both angle from the phase accumulator and magnitude from I. The CORDIC as depicted below has a range of +/-90°. Extending this to +/-180° is done by a +/-j rotation (which means simply swap I and Q and change the sign).
The CORDIC is the algorithm of choice when successive iteration time is available and there are no multipliers. If a complex multiplier is available then consider simply using a small Look-up table for the ATAN2 results of the $N$ binary weighted iterations down to any desired phase precision over $\pm \pi$ as $\pi/2^N$, and as done with "Vectoring Mode" in a CORDIC iterate over $N$ steps going in the direction based on the sign of $Q$ driving it to zero while accumulating the phase. This would have higher precision in less steps due to true binary weighted rotations (unlike the CORDIC) and not have the 1.647 CORDIC gain (if that was even an issue).
Also for consideration as an alternate to the CORDIC for hardware implementations is the BKM algorithm.