
I'm constructing an optimization (Bayesian optimization) algorithm using Java code. I have created the program, but the similarity values between input vectors in the kernel equation do not translate into the output similarity expected between vectors that should be "similar". I suspect this has to do with the weighting of the differences between the components of the vectors, because the parameter ranges of the different components are of completely different magnitudes (for example, one parameter has a range of 0.0 - 0.9 and another has a range of 100 - 500000).

I guess my question falls into two parts. First, how do I weight each of the components of the input vectors evenly? Second, do I make the hyperparameters (the width variable and sigma) vectors or scalar values?

I've been using this function, which I found in this other question (it is helpful, but does not fully answer any of my questions): Kernels in Gaussian Processes

$$f(x_i,x_k)=\sigma^2 \exp\left(-\frac{1}{2 \ell^2} \sum_{j=1}^q (x_{i,j} - x_{k,j})^2 \right)$$
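For concreteness, here is a minimal Java sketch of that kernel as written, with `sigma` and `lengthScale` as scalar hyperparameters; the class and method names are just illustrative, not from my program:

```java
// Minimal sketch of the squared-exponential kernel above, assuming the
// inputs are plain double[] vectors of equal length.
public final class SquaredExponentialKernel {
    private final double sigma;        // output scale (sigma in the formula)
    private final double lengthScale;  // length-scale (l), shared by all dimensions

    public SquaredExponentialKernel(double sigma, double lengthScale) {
        this.sigma = sigma;
        this.lengthScale = lengthScale;
    }

    public double value(double[] xi, double[] xk) {
        double sumSq = 0.0;
        for (int j = 0; j < xi.length; j++) {
            double d = xi[j] - xk[j];
            sumSq += d * d;
        }
        return sigma * sigma * Math.exp(-sumSq / (2.0 * lengthScale * lengthScale));
    }
}
```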


1 Answer

As you've written it here, $\sigma$ and $\ell$ are scalars. You could use a similar kernel, sometimes called an "Automatic Relevance Determination" (ARD) kernel, where $\ell$ is a vector of the same dimensionality as the data points:

$$f(x_i, x_k) = \sigma^2 \exp\left( - \frac{1}{2} \sum_{j=1}^q \left( \frac{x_{i,j} - x_{k,j}}{\ell_j} \right)^2 \right)$$

This allows hyperparameter optimization to select the right weight for each dimension. You're right to be concerned about using a single $\ell$ when one dimension has a range 500,000 times the other one: that kernel will "care" about the bigger dimension 500,000 times as much as it does about the smaller dimension.
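As a sketch of how the ARD version differs in code (assuming the same plain `double[]` representation as the scalar sketch above; the names are illustrative, not from the original program), the only change is that each dimension is divided by its own length-scale:

```java
// Sketch of the ARD kernel: same form as the squared-exponential kernel,
// but each dimension j is scaled by its own length-scale l_j before
// squaring and summing.
public final class ArdKernel {
    private final double sigma;          // output scale (sigma)
    private final double[] lengthScales; // one l_j per input dimension

    public ArdKernel(double sigma, double[] lengthScales) {
        this.sigma = sigma;
        this.lengthScales = lengthScales;
    }

    public double value(double[] xi, double[] xk) {
        double sumSq = 0.0;
        for (int j = 0; j < xi.length; j++) {
            double d = (xi[j] - xk[j]) / lengthScales[j];
            sumSq += d * d;
        }
        return sigma * sigma * Math.exp(-0.5 * sumSq);
    }
}
```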

A reasonable thing to do with your data is to standardize it so each dimension is on the same scale. This is the same thing as using an ARD kernel with each $\ell_j$ set to the product of some global scale $\ell$ and the standard deviation of the data in the $j$th dimension.
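A rough sketch of that standardization step, assuming the training data is held as a `double[][]` with one row per point (again, the method name is just for illustration):

```java
// Sketch: standardize each dimension to zero mean and unit standard deviation,
// so a single scalar length-scale treats all dimensions on the same footing.
public static double[][] standardize(double[][] data) {
    int n = data.length, q = data[0].length;
    double[] mean = new double[q], std = new double[q];
    for (double[] row : data)
        for (int j = 0; j < q; j++) mean[j] += row[j] / n;
    for (double[] row : data)
        for (int j = 0; j < q; j++) std[j] += (row[j] - mean[j]) * (row[j] - mean[j]) / n;
    for (int j = 0; j < q; j++) std[j] = Math.sqrt(std[j]);

    double[][] out = new double[n][q];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < q; j++)
            out[i][j] = (data[i][j] - mean[j]) / std[j]; // assumes std[j] > 0
    return out;
}
```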

  • I was playing around with some of the things that you said, and I decided to scale each of the dimensions by dividing the component differences by their parameter ranges. Basically, each difference becomes a fraction of its parameter's range. I then took the sum of the squares of the ratios and replaced the sigma expression with the expression $p/(d-s) - p/d$, where $p$ is a new parameter I call similarity because it governs how similar two vectors are, $d$ is the number of dimensions of the vector, and $s$ is the sum of the squares. It seems to be working, but is this mathematically sound? – Cooper Scher Aug 22 '18 at 15:45
  • @Cooper Making sure I understand: you're using the kernel $$k(x, y) = \frac{p}{d - \sum_{i=1}^d \left( \frac{x_i - y_i}{\mathrm{max}_i - \mathrm{min}_i} \right)^2} - \frac{p}{d}?$$ If that's right, that's a problem; $k(x,x) = 0$ which means your "similarity" measure is actually a dissimilarity, and I don't think a GP is going to be sensible. Or are you doing $\sigma^2 \exp(- \mathrm{that})$? – Danica Aug 22 '18 at 19:39
  • I'm doing the latter, although I did keep the $\frac{1}{2\ell^2}$ term, so it's $\sigma^2 \exp\left(-\frac{1}{2\ell^2} \cdot \mathrm{that}\right)$. Does that make sense mathematically? – Cooper Scher Aug 24 '18 at 14:34
  • Ah, okay, this is far more likely to be legitimate. (Though note that you don't need both $p$ and $\ell$, they do the same thing....) It [turns out](https://link.springer.com/chapter/10.1007/978-3-540-28649-3_27) that this is valid if and only if $\mathrm{that}$ is what's called a Hilbertian metric, i.e. there is some [Hilbert space](https://en.wikipedia.org/wiki/Hilbert_space) whose metric agrees with $\mathrm{that}$. I don't see an obvious way to check whether or not this is true in your case. One thing you can do: (...) – Danica Aug 24 '18 at 14:43
  • (...) make some kernel matrices $K_{ij} = k(x_i, x_j)$ and check whether they're [positive semi-definite](https://en.wikipedia.org/wiki/Positive-definite_matrix), i.e. have no negative eigenvalues. You might get some `-1e-8` eigenvalues from numerical error, but if there are any truly negative eigenvalues, then the kernel is not valid and your GP will [not be well-posed](https://stats.stackexchange.com/a/335512/9964). – Danica Aug 24 '18 at 14:45 (a Java sketch of this check appears after these comments)
  • Thank you for your help, and this is what the graph of the kernel looks like with the parameter t representing similarity (https://www.desmos.com/calculator/xzfz5uiskl). I've kept the ℓ term only as a constant because it gives the similarity parameter a nicer range of valid values. Also, I've only been working with positive semi-definite matrices because of the code I've been using. – Cooper Scher Aug 24 '18 at 14:51
  • @CooperScher Right, but you need to make sure that the kernel matrices actually *are* psd, not just assume that they are. :) [Also, slight mistake in the comment above: it's valid iff $\mathrm{that}$ is a _squared_ Hilbertian metric.] – Danica Aug 24 '18 at 14:52
  • Maybe I should clarify that my program will throw an error and break if it encounters a non psd matrix while doing an eigenvalue decomposition (which it does every iteration of the optimizer). – Cooper Scher Aug 24 '18 at 14:58
  • Okay, then you're probably fine. Any kernel whose kernel matrices are always psd is mathematically valid. – Danica Aug 24 '18 at 14:59
  • Just wondering: what if that condition (always producing psd matrices) is only true for a certain range of values of $p$ (the similarity parameter)? I noticed, for both my kernel and the squared exponential I was originally using, that there seemed to be a bound on when the kernels would produce negative eigenvalues and throw an error in my program. If I'm using similarity values within that range, is it still mathematically valid? – Cooper Scher Aug 24 '18 at 15:07
  • That's actually a strong sign that the kernel is *not* actually valid (unless they're just numerical-error negative eigenvalues, which could be fixed by adding a small amount to the diagonal of the kernel matrix); with the particular form of your kernel, if it's squared-Hilbertian for one $p$ then it will be for all $p > 0$, though technically it's *possible* (but I don't think likely) for it to be valid for some $p$ and not others if it's not squared-Hilbertian. That said, I don't know if anything in practice will break if all the matrices you actually deal with are PSD. – Danica Aug 24 '18 at 15:12
  • I was thinking that too, but I saw the same behavior with the squared exponential kernel, which to my understanding is a valid kernel. Also, while running the program, the results seem to be at least somewhat decent before any major tuning of the parameters. – Cooper Scher Aug 24 '18 at 15:15
  • If you saw it with the SE kernel, then it's either numerical error or a bug. Numerical error is pretty likely. – Danica Aug 24 '18 at 15:16
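As referenced in the comments above, here is a rough Java sketch of the suggested eigenvalue check, using Apache Commons Math for the eigendecomposition; the `Kernel` interface and the tolerance are assumptions for illustration, not part of the original program:

```java
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.EigenDecomposition;

// Sketch of the PSD check from the comments: build a kernel matrix from some
// sample points and look for genuinely negative eigenvalues.
public final class PsdCheck {

    /** Hypothetical kernel interface matching k(x, y). */
    public interface Kernel {
        double value(double[] x, double[] y);
    }

    public static boolean isPsd(Kernel k, double[][] points, double tolerance) {
        int n = points.length;
        double[][] km = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                km[i][j] = k.value(points[i], points[j]);

        EigenDecomposition eig = new EigenDecomposition(new Array2DRowRealMatrix(km));
        for (double lambda : eig.getRealEigenvalues()) {
            // Eigenvalues around -1e-8 are likely numerical error; anything
            // clearly below -tolerance suggests the kernel is not valid.
            if (lambda < -tolerance) return false;
        }
        return true;
    }
}
```

If truly negative eigenvalues appear only from numerical error, adding a small jitter to the diagonal of the kernel matrix (as mentioned in the comments) is a common fix.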