If you are doing kernel regression or Gaussian process regression, there are alternatives that are better than grid search because they allow direct optimization of the bandwidth. I suggest trying two approaches:
Maximum likelihood estimate. If you have a sample $D = (X, \mathbf{y}) = \{(x_i, y_i)\}_{i = 1}^n$ with inputs $x_i \in \mathbb{R}^d$ and outputs $y_i \in \mathbb{R}$, you can compute the kernel matrix $K = \{k_{ij}\}_{i, j = 1}^n$ with $k_{ij} = k_{\sigma}(x_i, x_j)$. Then the log likelihood (up to an additive constant) is:
$$
L(D, \sigma) = -\frac{1}{2} \left[\log \det(K) + \mathbf{y}^T K^{-1} \mathbf{y} \right].
$$
If you maximize $L(D, \sigma)$ w.r.t. $\sigma$ you get the maximum likelihood estimate of $\sigma$ for Gaussian process regression. The likelihood is differentiable, and the optimization problem is well behaved (though not convex); see the sketch below.
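Here is a minimal sketch of that step, assuming an RBF kernel $k_\sigma(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$ with a small jitter term; the toy data and the helper names (`rbf_kernel`, `neg_log_likelihood`) are illustrative assumptions, not part of the original answer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rbf_kernel(X, sigma):
    # k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)) from pairwise squared distances.
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2.0 * sigma**2))

def neg_log_likelihood(log_sigma, X, y, noise=1e-3):
    sigma = np.exp(log_sigma)                      # optimize in log-space to keep sigma > 0
    K = rbf_kernel(X, sigma) + noise * np.eye(len(y))
    L = np.linalg.cholesky(K)                      # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    log_det = 2.0 * np.sum(np.log(np.diag(L)))     # log det(K) from the Cholesky factor
    return 0.5 * (log_det + y @ alpha)             # negative of L(D, sigma), up to a constant

# Toy data, only to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

res = minimize_scalar(neg_log_likelihood, bounds=(-5, 5), method="bounded", args=(X, y))
print("ML estimate of sigma:", np.exp(res.x))
```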
LOO cross-validation. For kernel methods the leave-one-out cross-validation error can be computed in closed form, without refitting the model $n$ times. See the question "Gaussian process regression: leave-one-out prediction" and the sketch below.
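A minimal sketch of that closed-form LOO error, again assuming an RBF kernel and an assumed jitter term `noise` (the helper mirrors the one in the previous snippet); the residual formula $y_i - \mu_{-i} = [K^{-1}\mathbf{y}]_i / [K^{-1}]_{ii}$ is the standard GP leave-one-out identity:

```python
import numpy as np

def rbf_kernel(X, sigma):
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2.0 * sigma**2))

def loo_mean_squared_error(sigma, X, y, noise=1e-3):
    K = rbf_kernel(X, sigma) + noise * np.eye(len(y))
    K_inv = np.linalg.inv(K)             # one O(n^3) inversion, no per-fold refitting
    alpha = K_inv @ y
    residuals = alpha / np.diag(K_inv)   # closed-form LOO residual for each point
    return np.mean(residuals**2)
```

You can then minimize this one-dimensional objective over $\sigma$ exactly as in the likelihood sketch.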
Note that both of these approaches require inverting $K$, so the computational complexity of each optimization step is $O(n^3)$.
However, there are faster approximations; see e.g. Random Fourier Features, an easy-to-implement approach that reduces this computational cost. In that case the complexity scales as $O(m^2 n)$, where $m$ is a parameter you can set significantly smaller than $n$.
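A minimal sketch of Random Fourier Features for an RBF kernel with bandwidth $\sigma$; the number of features `m` and the ridge penalty `lam` are illustrative assumptions. Fitting costs $O(n m^2 + m^3)$ instead of $O(n^3)$:

```python
import numpy as np

def rff_ridge(X, y, m=200, sigma=1.0, lam=1e-3, seed=0):
    """Ridge regression on m random Fourier features approximating the RBF kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, m)) / sigma        # frequencies ~ N(0, sigma^{-2} I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)      # random phases
    Z = np.sqrt(2.0 / m) * np.cos(X @ W + b)       # features with z(x)^T z(x') ~ k_sigma(x, x')
    # Solve the m x m ridge system (Z^T Z + lam I) beta = Z^T y: O(n m^2 + m^3).
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y)
    predict = lambda X_new: np.sqrt(2.0 / m) * np.cos(X_new @ W + b) @ beta
    return predict
```

Bandwidth selection (e.g. by cross-validation) then only requires refitting this cheap approximate model for each candidate $\sigma$.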