I have a numpy array with m columns and n rows, the columns being dimensions and the rows datapoints.

I now need to calculate kernel values for each combination of data points.

For a linear kernel $K(\mathbf{x}_i,\mathbf{x}_j) = \langle \mathbf{x}_i,\mathbf{x}_j \rangle$ I can simply do `dot(X, X.T)`.

How can I efficiently calculate all values of the Gaussian kernel $K(\mathbf{x}_i,\mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i-\mathbf{x}_j\|_2^2}{s^2}\right)$ for a given s?

Danica
Peter Smit
    Well if you don't care too much about a factor of two increase in computations, you can always just do $\newcommand{\m}{\mathbf} \m S = \m X \m X^T$ and then $K(\m x_i, \m x_j ) = \exp( - (S_{ii} + S_{jj} - 2 S_{ij})/s^2 )$ where, of course, $S_{ij}$ is the $(i,j)$th element of $\m S$. This is probably *not* the most numerically stable, either, though. – cardinal Sep 20 '11 at 13:46
    (Years later) for large sparse arrays, see [sklearn.metrics.pairwise.pairwise_distances.html](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html) in scikit-learn . – denis Dec 20 '15 at 14:27
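cardinal's Gram-matrix trick from the comment above can be sketched as follows (a minimal illustration assuming `X` is a NumPy array; the function name and random test data are not from the original comment):

```python
import numpy as np

def gaussian_kernel_gram(X, s):
    # S = X X^T; then ||x_i - x_j||^2 = S_ii + S_jj - 2 S_ij
    S = X @ X.T
    d = np.diag(S)
    sq_dists = d[:, None] + d[None, :] - 2 * S
    return np.exp(-sq_dists / s ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 2))
K = gaussian_kernel_gram(X, s=1.5)  # 4x4, symmetric, ones on the diagonal
```

As cardinal notes, forming the squared distances this way can lose precision for nearly identical points, since it subtracts large, nearly equal quantities.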

5 Answers

I think the main problem is computing the pairwise distances efficiently. Once you have those, the rest is element-wise.

To do this, you probably want to use scipy. The function `scipy.spatial.distance.pdist` does what you need, and `scipy.spatial.distance.squareform` will possibly ease your life.

So if you want the kernel matrix, you do:

from scipy.spatial.distance import pdist, squareform
import numpy as np

# X is an NxD matrix, where N is the number of items and D their dimensionality
X = loaddata()
pairwise_dists = squareform(pdist(X, 'euclidean'))
K = np.exp(-pairwise_dists ** 2 / s ** 2)

Documentation for both functions can be found in the `scipy.spatial.distance` reference.
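As a quick illustration of the shapes involved (a minimal sketch with random data, not from the original answer): `pdist` returns a condensed vector of the n(n-1)/2 distinct pairwise distances, and `squareform` expands it to the full symmetric n×n matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))  # 4 points in 3 dimensions

d = pdist(X, 'euclidean')        # condensed: shape (4*3/2,) == (6,)
D = squareform(d)                # full symmetric matrix: shape (4, 4)

s = 2.0
K = np.exp(-D ** 2 / s ** 2)     # Gaussian kernel matrix; diagonal is all ones
```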

netok
bayerj
    It seems to me that bayerj's answer requires some small modifications to fit the formula, in case somebody else needs it : `K = scipy.exp(-pairwise_dists**2 / s**2)` – chloe Feb 10 '14 at 16:26
  • If anyone is curious, the algorithm used by `pdist` is very simple: it's just a C-implemented loop that directly computes distances in [the obvious way](https://github.com/scipy/scipy/blob/v0.19.0/scipy/spatial/src/distance_impl.h#L41), the looping being done [here](https://github.com/scipy/scipy/blob/v0.19.0/scipy/spatial/src/distance_impl.h#L421); no fancy vectorization or anything beyond whatever the compiler can accomplish automatically. – Danica Mar 25 '17 at 02:27

As a small addendum to bayerj's answer, scipy's `pdist` function can directly compute squared Euclidean norms by calling it as `pdist(X, 'sqeuclidean')`. The full code can then be written more efficiently as

from scipy.spatial.distance import pdist, squareform
import numpy as np

# X is an NxD matrix, where N is the number of items and D their dimensionality
X = loaddata()
pairwise_sq_dists = squareform(pdist(X, 'sqeuclidean'))
K = np.exp(-pairwise_sq_dists / s ** 2)
tenedor

You can also write the square form by hand:

import numpy as np

def vectorized_RBF_kernel(X, sigma):
    # This is equivalent to computing the kernel on every pair of examples
    X2 = np.sum(X ** 2, axis=1)                   # squared norm of each row, shape (n,)
    K0 = X2[:, None] + X2[None, :] - 2 * X @ X.T  # squared pairwise distances, shape (n, n)
    K = np.exp(-K0 / sigma ** 2)
    return K

P.S. This runs about 30% slower than the pdist-based version.

spetz911
  • This, which is the method suggested by cardinal in the comments, could be sped up a bit by using inplace operations. It's [how scikit-learn does it](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L222), with [an `einsum` call](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/extmath.py#L72) for your `X2`. – Danica Mar 25 '17 at 02:18
  • You wrote: `K0 = X2 + X2.T - 2 * X * X.T` – how can this work with X and X.T having different dimensions? – matekrk Jan 16 '22 at 00:17
import numpy as np
from sklearn.svm import SVR

def my_kernel(X, Y):
    K = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            K[i, j] = np.exp(-1 * np.linalg.norm(x - y) ** 2)
    return K

clf = SVR(kernel=my_kernel)

which is equivalent to

clf=SVR(kernel="rbf",gamma=1)

You can efficiently calculate the RBF kernel with the code above. Note that gamma is 1 here; in terms of the question's kernel, gamma = 1/s², so gamma = 1 corresponds to s = 1.
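As a quick sanity check (a minimal sketch with random data, not from the original answer), the double loop above agrees with the vectorized form $\exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$ for $\gamma = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))

# double loop, as in my_kernel above (gamma = 1)
K_loop = np.array([[np.exp(-np.linalg.norm(x - y) ** 2) for y in X] for x in X])

# vectorized: exp(-gamma * squared pairwise distances) with gamma = 1/s**2 = 1
sq_norms = np.sum(X ** 2, axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
K_vec = np.exp(-sq_dists)

print(np.allclose(K_loop, K_vec))  # True (up to floating point)
```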

John
  • Welcome to our site! We have a slightly different emphasis to Stack Overflow, in that we generally have less focus on code and more on underlying ideas, so it might be worth annotating your code or giving a brief idea what the key ideas to it are, as some of the other answers have done. That would help explain how your answer differs to the others. – Silverfish Mar 24 '17 at 01:41
  • This will be much slower than the other answers because it uses Python loops rather than vectorization. – Danica Mar 25 '17 at 02:13

I think this will help:

from numpy import exp
from numpy.linalg import norm

def GaussianKernel(v1, v2, sigma):
    return exp(-norm(v1 - v2, 2) ** 2 / (2. * sigma ** 2))
sashkello
Kernel
Welcome to the site @Kernel. You can display mathematics by putting the expression between $ signs and using LaTeX-like syntax. And you can display code (with syntax highlighting) by indenting the lines by 4 spaces. See the markdown editing [help](http://stats.stackexchange.com/editing-help) for formatting guidelines, and the [faq](http://stats.stackexchange.com/faq) for more general ones. – Antoine Vernet Jan 11 '13 at 15:24
    Doesn't this just echo what is in the question? – whuber Jan 11 '13 at 16:09