
So... I have been trying to make a radial basis kernel for hours, but I am not sure what my final matrix should look like. I have 30 features and 200000 data points. Should my matrix K be 200000*200000 or 30*30?

My code so far produces 30*30:

import numpy as np

def build_kernel(x, func, option):
    x = x.T
    K = np.zeros([x.shape[0], x.shape[0]])
    for i in range(x.shape[0]):
        xi = x[i, :]
        for j in range(x.shape[0]):
            xj = x[j, :]
            K[i, j] = func(xi, xj, option)
    return K

def radial_basis(xi, xj, gamma):
    r = np.exp(-gamma * np.linalg.norm(xi - xj) ** 2)
    return r

My goal is to use the kernel trick in ridge regression, as explained here: http://www.ics.uci.edu/~welling/classnotes/papers_class/Kernel-Ridge.pdf

But I have no idea how to implement this manually (I have to do it manually for school!).

Does somebody know how to do such a thing? :)

Thanks!

  • I thought the trick in the kernel trick was that you do not have to explicitly evaluate the new coordinates in the transformed space, just the inner products. Try searching for "kernel method SVM" or "kernel trick". You should see why the word trick is used... – Beyer Oct 29 '16 at 14:18

1 Answer

The kernel function compares data points, so it would be $200,000 \times 200,000$. (It seems that your data in x is stored as instances by features, but then you do x = x.T for some reason, swapping it. The matrix you've computed isn't anything meaningful as far as I know.)

That's going to be very challenging to work with on a normal personal computer; if you just removed the x = x.T line so that your code computed the proper thing, the matrix K would be 298 GB in memory! (Plus, the way you've implemented it with Python nested loops and 40 billion calls to the function radial_basis, it's going to take a long time to compute even if you do have that much memory.)

This is an example of a situation where directly using the kernel trick is, frankly, a bad idea.

If you're dead-set on doing kernel ridge regression, there are various approximations you can make to make it computationally reasonable on that size of data, and I can point you to some of them. But it seems unlikely that a school assignment would really require you to do that.
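As an aside, if you do end up building the kernel matrix on a dataset small enough to fit in memory (a few thousand points, say), it is much faster to vectorize the pairwise squared distances with the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b than to loop in Python. A minimal sketch (the function and variable names here are just for illustration), assuming the data is stored as instances by features, i.e. n × d:

import numpy as np

def rbf_kernel_matrix(X, Y, gamma):
    # RBF kernel between the rows of X (n x d) and the rows of Y (m x d),
    # computed without Python loops via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_X = np.sum(X ** 2, axis=1)[:, np.newaxis]   # shape (n, 1)
    sq_Y = np.sum(Y ** 2, axis=1)[np.newaxis, :]   # shape (1, m)
    sq_dists = sq_X + sq_Y - 2.0 * X @ Y.T         # shape (n, m), pairwise squared distances
    np.maximum(sq_dists, 0.0, out=sq_dists)        # clip tiny negatives caused by rounding
    return np.exp(-gamma * sq_dists)

The training kernel is then rbf_kernel_matrix(X_train, X_train, gamma), and the same function gives the train-to-test kernel needed for predictions.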

Danica
  • Thank you. My school did not directly ask me to use this technique! But I thought the whole goal of the kernel trick was to use a higher-dimensional space without actually having to compute the features in that space, so I thought it could make my code faster :P And I can't use libraries... :) I have two questions: 1 - Is not having to compute the higher-dimensional features only relevant for the test set? Do you compute K for the training set and keep the same K for the test set? 2 - You said this is not a good problem for the kernel trick... what would be a good problem? One with less training data? – Albert James Teddy Oct 29 '16 at 13:51
  • Oh, and yes, if you have any further reading on the approximation techniques you are referring to, that would be great! – Albert James Teddy Oct 29 '16 at 13:56
  • 2
    @AlbertJamesTeddy You're right about the point of the kernel trick, but the problem is that it removes dependence on the feature dimension (allowing you to use the infinite-dimensional RBF kernel), but worsens the dependence on the number of data points $n$. That's great for $n$ up to a few thousand, but bad for larger $n$ like yours. – Danica Oct 29 '16 at 14:15
  • 2
    The two main approximation techniques are the Nyström method and random Fourier features. Both are algorithmically pretty simple but the math behind why they work is a little complicated depending on what background you have. Here's a blog post talking about the two: http://peekaboo-vision.blogspot.co.uk/2012/12/kernel-approximations-for-efficient.html – Danica Oct 29 '16 at 14:17
  • Thank you, the post is great. One last question: let's say I have 1000 data points and I delete the x = x.T line, so I have K as 1000*1000. What do I do with it? w = inv(λ*I + Φ*Φ.T)*Φ*y = Φ*inv(Φ.T*Φ + λ*I)*y – Albert James Teddy Oct 29 '16 at 14:36
  • 1
    @AlbertJamesTeddy Look at equation (6) in the notes you linked. – Danica Oct 29 '16 at 14:37
  • Yes... the κ(x) = K(x_i, x) notation is confusing to me. Does it mean $y_\text{new} = y_\text{old} (K - \lambda I)^{-1} K(x_i, x)$? What does $K(x_i, x)$ represent in code? – Albert James Teddy Oct 29 '16 at 14:53
  • 2
    It's $y_\text{pred} = y_\text{train} (K_\text{train,train} + \lambda I)^{-1} K_\text{train,pred}$. You need to compute the RBF kernel from each of the training points to your test point. – Danica Oct 29 '16 at 14:58
  • Thank you! That does make more sense, and looks quite intensive using for loops in Python even with fewer data points... The trick is not as magic as I first hoped! :) So for 1000 training examples and 4000 test points this gives dimensions: [1*4000] = [1*1000] * [1000*1000] * [1000*4000] – Albert James Teddy Oct 29 '16 at 15:06
  • 1
    Don't use for loops! If you vectorize everything in numpy this will be quick on a regular computer. – Danica Oct 29 '16 at 15:07
  • Great thread. Thank you both, Dougal and Albert. Quick question... when you say $K_{train, pred}$, do you mean you're constructing another Kernel from a different combination of the input and output variables? – TheProletariat Jul 28 '19 at 19:19
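Putting the comment thread together, here is a minimal sketch of the kernel ridge regression predictions from equation (6) of the linked notes, written with y as a column vector, i.e. y_pred = K_pred,train (K_train,train + λI)^{-1} y_train. The helper rbf_kernel_matrix is the vectorized kernel sketched under the answer, and all names and values here are illustrative rather than taken from the posts:

import numpy as np

def kernel_ridge_fit(X_train, y_train, gamma, lam):
    # Dual coefficients: alpha = (K + lam*I)^{-1} y, where K is the n x n training kernel.
    K = rbf_kernel_matrix(X_train, X_train, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(X_train.shape[0]), y_train)
    return alpha

def kernel_ridge_predict(X_train, alpha, X_test, gamma):
    # Each prediction is a kernel-weighted sum over the training points:
    # y_pred = K_test,train @ alpha, where K_test,train is m x n.
    K_test = rbf_kernel_matrix(X_test, X_train, gamma)
    return K_test @ alpha

For the 1000-train / 4000-test example above, K_train,train is 1000*1000 and K_test,train is 4000*1000, so y_pred comes out as a length-4000 vector.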
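And for the random Fourier features approximation mentioned above (Rahimi & Recht), a minimal sketch under the same illustrative naming: for the kernel exp(-gamma*||x-y||^2), draw frequencies from N(0, 2*gamma*I) and phases uniformly on [0, 2π); ordinary ridge regression on the resulting D-dimensional features then approximates kernel ridge regression at roughly O(n*D^2 + D^3) cost instead of O(n^3).

import numpy as np

def random_fourier_features(X, W, b):
    # Feature map z(x) with z(x) . z(y) ~= exp(-gamma * ||x - y||^2),
    # where the columns of W were drawn from N(0, 2*gamma*I) and b from Uniform(0, 2*pi).
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

# Draw the random projection once and reuse it for training and test data.
rng = np.random.default_rng(0)
d, D, gamma, lam = 30, 500, 0.1, 1.0            # D, gamma, lam are illustrative values
W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

# Z_train = random_fourier_features(X_train, W, b)   # (n, D) instead of (n, n)
# w = np.linalg.solve(Z_train.T @ Z_train + lam * np.eye(D), Z_train.T @ y_train)
# y_pred = random_fourier_features(X_test, W, b) @ w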