
The paper Random Fourier Features for Large-Scale Kernel Machines by Ali Rahimi and Ben Recht makes use of Bochner's theorem, which says (in layman's terms) that the Fourier transform $p(w)$ of a shift-invariant kernel $k(x,y)$ is a probability distribution.

The kernel can therefore be expressed as the inverse Fourier transform of $p(w)$:

$\begin{eqnarray} k(x,y) &=& \int_{\mathbb{R}^d} p(w) e^{j w^T (x-y)} dw \\ &=& \mathbb{E}_w[\psi_w(x) \psi_w(y)^*] \end{eqnarray}$

where $\psi_w(x) = e^{j w^T x}$, and $\psi_w(y)^* = e^{-j w^T y}$ is the complex conjugate.
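For concreteness, here is a quick Monte Carlo sanity check of this identity (a minimal sketch of my own, not from the paper: it assumes the Gaussian kernel $k(x,y) = e^{-\|x-y\|^2/2}$, whose spectral density $p(w)$ is the standard normal, and all variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 5, 200_000

x = rng.normal(size=d)
y = rng.normal(size=d)

# For the Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2), Bochner's theorem
# gives p(w) = N(0, I_d); average the complex features over sampled w.
w = rng.normal(size=(n_samples, d))
mc_estimate = np.mean(np.exp(1j * w @ (x - y)))

exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)
# Real parts should agree; the imaginary part is ~0 up to sampling noise.
print(mc_estimate.real, exact)
```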

The statement the paper makes at this point is that, since $p(w)$ is real and even, the complex exponentials can be replaced with cosines, giving

$k(x,y) = \mathbb{E}_w[z_w(x) z_w(y)]$

where $z_w(x) = \sqrt{2} \cos(w^T x)$.

I do not understand where this comes from.

From what I understand about Fourier Transforms, $p(w)$ is real and even for real and even $k(x,y)$.

Therefore, it should actually be,

$\begin{eqnarray} k(x, y) &=& \mathbb{E}_w[\cos(w^T (x-y))] \\ &=& \mathbb{E}_w[\cos(w^T x) \cos(w^T y) + \sin(w^T x) \sin(w^T y)] \\ &=& \mathbb{E}_w[z_w(x)^T z_w(y)] \end{eqnarray}$

where $z_w(x) = [\cos(w^T x), \sin(w^T x)]^T$.
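And indeed this paired construction checks out numerically (again a hedged sketch under the same Gaussian-kernel assumption as above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 5, 200_000

x = rng.normal(size=d)
y = rng.normal(size=d)

w = rng.normal(size=(n_samples, d))  # p(w) = N(0, I) for the Gaussian kernel
zx = np.stack([np.cos(w @ x), np.sin(w @ x)])  # z_w(x) = [cos(w'x), sin(w'x)]
zy = np.stack([np.cos(w @ y), np.sin(w @ y)])
mc_estimate = np.mean(np.sum(zx * zy, axis=0))  # estimates E[z_w(x)' z_w(y)]

exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)
print(mc_estimate, exact)  # the two should agree up to Monte Carlo noise
```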

What am I missing?

Apoorv

1 Answer


The direct Fourier interpretation would indeed be $[\cos(w^T x), \sin(w^T x)]$, as you've listed.

You've actually slightly misunderstood the paper's proposal for cosine-only features, though: they use $\sqrt{2} \cos(w^T x + b)$, with $b \sim \mathrm{Uniform}[0, 2 \pi]$. Then, by the product-to-sum identity $2 \cos A \cos B = \cos(A - B) + \cos(A + B)$, \begin{align} \sqrt{2}\cos(w^T x + b) \, \sqrt{2}\cos(w^T y + b) &= \cos((w^T x + b) - (w^T y + b)) + \cos((w^T x + b) + (w^T y + b)) \\&= \cos(w^T (x - y)) + \cos(w^T (x + y) + 2 b) ,\end{align} and we have \begin{align} \mathbb E_{w,b}\left[ 2 \cos(w^T x + b) \cos(w^T y + b) \right] &= \mathbb E_{w,b}\left[ \cos(w^T (x - y)) + \cos(w^T (x + y) + 2 b) \right] \\&= \mathbb E_w \left[ \cos(w^T (x - y)) \right] + \mathbb E_{w,b}\left[ \cos(w^T (x + y) + 2 b) \right] \\&= k(x, y) + 0 .\end{align} The random offset $b$ makes the second term zero.
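If it helps, the expectation above is easy to verify numerically (a minimal sketch, not code from the paper; it assumes the Gaussian kernel so that $p(w)$ is standard normal):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 5, 200_000

x = rng.normal(size=d)
y = rng.normal(size=d)

w = rng.normal(size=(n_samples, d))            # spectral density of the Gaussian kernel
b = rng.uniform(0, 2 * np.pi, size=n_samples)  # random offsets

zx = np.sqrt(2) * np.cos(w @ x + b)
zy = np.sqrt(2) * np.cos(w @ y + b)
# E[z(x) z(y)] = E[cos(w'(x-y))] + E[cos(w'(x+y) + 2b)] ~= k(x, y) + 0
mc_estimate = np.mean(zx * zy)

exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)
print(mc_estimate, exact)
```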

This form is somewhat more convenient, in that you get one feature per sampled frequency. But it turns out that, for a fixed total number of feature dimensions, you get slightly better kernel approximations by using the $[\cos, \sin]$ pairs without the random offset than by using single cosine features with the offset – see chapter 3 of my thesis, which fixes a few slight errors in our earlier paper. A rough experiment comparing the two is sketched below.
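To illustrate that comparison (my own hypothetical experiment, not taken from the thesis; it assumes the Gaussian kernel, with both maps using the same total number of dimensions $D$):

```python
import numpy as np

rng = np.random.default_rng(1)
d, D = 5, 512  # D total feature dimensions for both maps
x = rng.normal(size=d)
y = rng.normal(size=d)
exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)

def paired_features(v, w):
    # D/2 frequencies -> D dims of [cos, sin] pairs; scaled so z(x)'z(y) -> k(x, y)
    return np.concatenate([np.cos(w @ v), np.sin(w @ v)]) / np.sqrt(w.shape[0])

def offset_features(v, w, b):
    # D frequencies with random phases -> D cosine dims
    return np.sqrt(2.0 / w.shape[0]) * np.cos(w @ v + b)

errs_pair, errs_off = [], []
for _ in range(500):
    w_half = rng.normal(size=(D // 2, d))
    w_full = rng.normal(size=(D, d))
    b = rng.uniform(0, 2 * np.pi, size=D)
    errs_pair.append(paired_features(x, w_half) @ paired_features(y, w_half) - exact)
    errs_off.append(offset_features(x, w_full, b) @ offset_features(y, w_full, b) - exact)

# The paired features typically show somewhat lower error variance.
print(np.std(errs_pair), np.std(errs_off))
```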

Danica
  • Dougal, it seems like you've done a lot of work in this area. If you have time, I'd appreciate it if you could answer my RFF question here: https://stats.stackexchange.com/questions/440633/ – jds Dec 17 '19 at 16:12
  • Hey, could you clarify why $\mathbb{E}_{w,b}[\cos(w^T(x+y) + 2b)] = 0$? What am I missing here... – ec2604 Jul 22 '20 at 12:02
  • 1
    @ec2604, first do $$\mathbb{E}_{w,b}[ \cos(w^T(x+y) + 2b) ] = \mathbb{E}_w\left[ \mathbb{E}_b[ \cos(w^T(x+y) + 2 b) ] \right].$$ The inner expectation, then just the uniform average of the cosine function from $w^T(x+y)$ to $w^T(x+y) + 4 \pi$. This goes over two complete periods, and the average value of cosine over a period is 0, so the inner expectation is 0 for any value of $w^T(x+y)$ – so then the expectation over $w$ of something that's always 0 is just 0. – Danica Jul 22 '20 at 19:32