I realise this question is somewhere in between Math StackExchange and StackOverflow. So forgive me if this is too much of a practical question, it would probably too theoretical elsewhere.
I am looking for the quickest way to compute the square of a number, modulo a medium-sized Mersenne prime. More specifically, let $P = 2^p - 1$ be a Mersenne prime. For the sake of argument, we can set, e.g., $p = 521$. Let $x$ be an arbitrary number between $1$ and $P$. I want to compute $x^2 \bmod P$.
I need to do this using only 32-bit integers. So, $x$ could for example be encoded in an array of $17$ 32-bit integers, and I'd need another array as output.
How would I go about doing this? I know I could use, e.g., FFT, but for not-that-big numbers it feels like an overkill maybe? I'm thinking Karatsuba, but I'm wondering whether there are shortcuts that either leverage the Mersenne modulo or the fact that it's a square I'm doing, and not a generic multiplication?