I have a numpy array with m columns and n rows, the columns being dimensions and the rows datapoints.

I now need to calculate kernel values for each combination of data points.

For a linear kernel $K(\mathbf{x}_i,\mathbf{x}_j) = \langle \mathbf{x}_i,\mathbf{x}_j \rangle$ I can simply do `dot(X, X.T)`.

How can I efficiently calculate all values of the Gaussian kernel $K(\mathbf{x}_i,\mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i-\mathbf{x}_j\|_2^2}{s^2}\right)$ for a given s?

Danica
Peter Smit
    Well if you don't care too much about a factor of two increase in computations, you can always just do $\newcommand{\m}{\mathbf} \m S = \m X \m X^T$ and then $K(\m x_i, \m x_j ) = \exp( - (S_{ii} + S_{jj} - 2 S_{ij})/s^2 )$ where, of course, $S_{ij}$ is the $(i,j)$th element of $\m S$. This is probably *not* the most numerically stable, either, though. – cardinal Sep 20 '11 at 13:46
    (Years later) for large sparse arrays, see [sklearn.metrics.pairwise.pairwise_distances.html](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html) in scikit-learn . – denis Dec 20 '15 at 14:27
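cardinal's Gram-matrix trick from the comment above can be sketched as follows (a minimal illustration assuming `X` is a NumPy array; the function name and random test data are not from the original comment):

```python
import numpy as np

def gaussian_kernel_gram(X, s):
    # S = X X^T; then ||x_i - x_j||^2 = S_ii + S_jj - 2 S_ij
    S = X @ X.T
    d = np.diag(S)
    sq_dists = d[:, None] + d[None, :] - 2 * S
    return np.exp(-sq_dists / s ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 2))
K = gaussian_kernel_gram(X, s=1.5)  # 4x4, symmetric, ones on the diagonal
```

As cardinal notes, forming the squared distances this way can lose precision for nearly identical points, since it subtracts large, nearly equal quantities.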

5 Answers

I think the main problem is computing the pairwise distances efficiently. Once you have those, the rest is element-wise.

To do this, you probably want to use scipy. The function `scipy.spatial.distance.pdist` does what you need, and `scipy.spatial.distance.squareform` will possibly ease your life.

So if you want the kernel matrix, you do:

from scipy.spatial.distance import pdist, squareform
import numpy as np

# X is an NxD matrix, where N is the number of items and D their dimensionality
X = loaddata()
pairwise_dists = squareform(pdist(X, 'euclidean'))
K = np.exp(-pairwise_dists ** 2 / s ** 2)

Documentation for both functions can be found in the `scipy.spatial.distance` reference.
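As a quick illustration of the shapes involved (a minimal sketch with random data, not from the original answer): `pdist` returns a condensed vector of the n(n-1)/2 distinct pairwise distances, and `squareform` expands it to the full symmetric n×n matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))  # 4 points in 3 dimensions

d = pdist(X, 'euclidean')        # condensed: shape (4*3/2,) == (6,)
D = squareform(d)                # full symmetric matrix: shape (4, 4)

s = 2.0
K = np.exp(-D ** 2 / s ** 2)     # Gaussian kernel matrix; diagonal is all ones
```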

netok
bayerj
    It seems to me that bayerj's answer requires some small modifications to fit the formula, in case somebody else needs it : `K = scipy.exp(-pairwise_dists**2 / s**2)` – chloe Feb 10 '14 at 16:26
  • If anyone is curious, the algorithm used by `pdist` is very simple: it's just a C-implemented loop that directly computes distances in [the obvious way](https://github.com/scipy/scipy/blob/v0.19.0/scipy/spatial/src/distance_impl.h#L41), the looping being done [here](https://github.com/scipy/scipy/blob/v0.19.0/scipy/spatial/src/distance_impl.h#L421); no fancy vectorization or anything beyond whatever the compiler can accomplish automatically. – Danica Mar 25 '17 at 02:27

As a small addendum to bayerj's answer, scipy's `pdist` function can directly compute squared Euclidean norms by calling it as `pdist(X, 'sqeuclidean')`. The full code can then be written more efficiently as

from scipy.spatial.distance import pdist, squareform
import numpy as np

# X is an NxD matrix, where N is the number of items and D their dimensionality
X = loaddata()
pairwise_sq_dists = squareform(pdist(X, 'sqeuclidean'))
K = np.exp(-pairwise_sq_dists / s ** 2)
tenedor

You can also write the square form by hand:

import numpy as np

def vectorized_RBF_kernel(X, sigma):
    # This is equivalent to computing the kernel on every pair of examples
    X2 = np.sum(X ** 2, axis=1)                   # squared norm of each row, shape (n,)
    K0 = X2[:, None] + X2[None, :] - 2 * X @ X.T  # squared pairwise distances, shape (n, n)
    K = np.exp(-K0 / sigma ** 2)
    return K

P.S. This runs about 30% slower than the pdist-based version.

spetz911
  • This, which is the method suggested by cardinal in the comments, could be sped up a bit by using inplace operations. It's [how scikit-learn does it](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L222), with [an `einsum` call](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/extmath.py#L72) for your `X2`. – Danica Mar 25 '17 at 02:18
  • You wrote: `K0 = X2 + X2.T - 2 * X * X.T` – how can this work with X and X.T having different dimensions? – matekrk Jan 16 '22 at 00:17
import numpy as np
from sklearn.svm import SVR

def my_kernel(X, Y):
    K = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            K[i, j] = np.exp(-1 * np.linalg.norm(x - y) ** 2)
    return K

clf = SVR(kernel=my_kernel)

which is equivalent to

clf=SVR(kernel="rbf",gamma=1)

You can efficiently calculate the RBF kernel with the code above. Note that gamma is 1 here; in terms of the question's kernel, gamma = 1/s², so gamma = 1 corresponds to s = 1.
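As a quick sanity check (a minimal sketch with random data, not from the original answer), the double loop above agrees with the vectorized form $\exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$ for $\gamma = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))

# double loop, as in my_kernel above (gamma = 1)
K_loop = np.array([[np.exp(-np.linalg.norm(x - y) ** 2) for y in X] for x in X])

# vectorized: exp(-gamma * squared pairwise distances) with gamma = 1/s**2 = 1
sq_norms = np.sum(X ** 2, axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
K_vec = np.exp(-sq_dists)

print(np.allclose(K_loop, K_vec))  # True (up to floating point)
```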

John
  • Welcome to our site! We have a slightly different emphasis to Stack Overflow, in that we generally have less focus on code and more on underlying ideas, so it might be worth annotating your code or giving a brief idea what the key ideas to it are, as some of the other answers have done. That would help explain how your answer differs to the others. – Silverfish Mar 24 '17 at 01:41
  • This will be much slower than the other answers because it uses Python loops rather than vectorization. – Danica Mar 25 '17 at 02:13

I think this will help:

from numpy import exp
from numpy.linalg import norm

def GaussianKernel(v1, v2, sigma):
    return exp(-norm(v1 - v2, 2) ** 2 / (2. * sigma ** 2))
sashkello
Kernel
Welcome to the site @Kernel. You can display mathematics by putting the expression between $ signs and using LaTeX-like syntax. And you can display code (with syntax highlighting) by indenting the lines by 4 spaces. See the markdown editing [help](http://stats.stackexchange.com/editing-help) for formatting guidelines, and the [faq](http://stats.stackexchange.com/faq) for more general ones. – Antoine Vernet Jan 11 '13 at 15:24
    Doesn't this just echo what is in the question? – whuber Jan 11 '13 at 16:09