Say that `mat1` is $n \times d$ and `mat2` is $m \times d$.
Recall that the Gaussian RBF kernel is defined as $k(x, y) = \exp\left( - \frac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right)$.
But we can write $\lVert x - y \rVert^2$ as $(x - y)^T (x - y) = x^T x + y^T y - 2 x^T y$. The code uses this decomposition.
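As a quick sanity check of that identity (a toy NumPy snippet, not part of the code in question):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)

lhs = np.sum((x - y) ** 2)         # ||x - y||^2 computed directly
rhs = x @ x + y @ y - 2 * (x @ y)  # x^T x + y^T y - 2 x^T y
print(np.isclose(lhs, rhs))        # True
```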
First, the `trnorms1` vector stores $x^T x$ for each row $x$ of `mat1`, and `trnorms2` stores $y^T y$ for each row $y$ of `mat2`.
Then, the `k1` matrix is obtained by multiplying the $n \times 1$ column of $x^T x$ values by a $1 \times m$ row of ones, giving an $n \times m$ matrix in which each $x_i^T x_i$ is repeated across row $i$, so that `k1[i, j]` is $x_i^T x_i$. The next line does the same thing for the $y$ norms, repeating each $y_j^T y_j$ down column $j$, so that `k2[i, j]` is $y_j^T y_j$.
`k` is then their sum, so that `k[i, j]` is $x_i^T x_i + y_j^T y_j$. The next line subtracts twice the product of the data matrices (`mat1` times the transpose of `mat2`), so that `k[i, j]` becomes $x_i^T x_i + y_j^T y_j - 2 x_i^T y_j = \lVert x_i - y_j \rVert^2$.
Then the code multiplies by $-\frac{1}{2 \sigma^2}$ and finally takes the elementwise $\exp$, which gives the Gaussian kernel matrix.
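Putting the steps together, here is a minimal NumPy sketch of the computation as described above; the names (`trnorms1`, `k1`, etc.) mirror the walk-through, though the code in your snippet may be written slightly differently:

```python
import numpy as np

def rbf_kernel_explicit(mat1, mat2, sigma):
    """Gaussian RBF kernel via ||x - y||^2 = x^T x + y^T y - 2 x^T y."""
    n, m = mat1.shape[0], mat2.shape[0]

    # Squared norms of each row: x^T x and y^T y.
    trnorms1 = np.sum(mat1 ** 2, axis=1)       # shape (n,)
    trnorms2 = np.sum(mat2 ** 2, axis=1)       # shape (m,)

    # k1[i, j] = x_i^T x_i, repeated across each row.
    k1 = trnorms1[:, None] @ np.ones((1, m))   # shape (n, m)
    # k2[i, j] = y_j^T y_j, repeated down each column.
    k2 = np.ones((n, 1)) @ trnorms2[None, :]   # shape (n, m)

    # k[i, j] = x_i^T x_i + y_j^T y_j - 2 x_i^T y_j = ||x_i - y_j||^2
    k = k1 + k2 - 2 * (mat1 @ mat2.T)

    # Scale by -1 / (2 sigma^2) and exponentiate elementwise.
    return np.exp(-k / (2 * sigma ** 2))
```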
If you dig into the scikit-learn implementation, it's exactly the same, except:
- It's parameterized instead with $\gamma = \frac{1}{2 \sigma^2}$.
- It's written in much better Python, without wasting memory all over the place or doing the computations in a needlessly slow way.
- It's broken up into helper functions.
But, algorithmically, it's doing the same basic operations.
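For comparison, `sklearn.metrics.pairwise.rbf_kernel` computes the same matrix directly; here's a quick check against the decomposition above, using the $\gamma = \frac{1}{2 \sigma^2}$ correspondence (toy data, not from the original question):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
mat1 = rng.normal(size=(10, 3))
mat2 = rng.normal(size=(7, 3))
sigma = 1.5

# Squared distances via the same decomposition as above.
sq_dists = (
    np.sum(mat1 ** 2, axis=1)[:, None]
    + np.sum(mat2 ** 2, axis=1)[None, :]
    - 2 * mat1 @ mat2.T
)
K_manual = np.exp(-sq_dists / (2 * sigma ** 2))

# scikit-learn's rbf_kernel computes exp(-gamma * ||x - y||^2).
K_sklearn = rbf_kernel(mat1, mat2, gamma=1.0 / (2 * sigma ** 2))

print(np.allclose(K_manual, K_sklearn))  # True
```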