I was wondering if someone could flesh out the probabilistic interpretation of using the Radial Basis Function (RBF) kernel to compute a probability relating an observation to some reference value.
My question is partially motivated by the top answer in this Reddit thread:
> The RBF kernel is a standard kernel function in $R^n$ space because it has just one free parameter, $\gamma$, and satisfies the condition $K(x,x') = K(x',x)$. More specifically, one way to think of the RBF kernel is that if we assume $x'$ is characteristic of some Gaussian distribution (it is the mean value of that distribution), then $RBF(x,x')$ is the probability that $x$ is another sample from that distribution. In this interpretation, $\gamma$ is related to the tunable variance of that distribution.
Does this mean that if we have an observation $\mathbf{s}$ and we want to know whether $\mathbf{s}$ was generated by a source $\mathbf{q}$ (i.e., whether $\mathbf{s}$ is a noisy version of $\mathbf{q}$), then we can say:
$$P(\mathbf{s} \text{ generated by } \mathbf{q}) \propto \exp(-\gamma\, d(\mathbf{s},\mathbf{q}))$$ $$P(\mathbf{s} \text{ belongs to a Gaussian region defined by } \mathbf{q}) \approx \exp(-\gamma\, d(\mathbf{s},\mathbf{q}))$$
where $d(\mathbf{s},\mathbf{q}) = \lVert \mathbf{s}-\mathbf{q}\rVert^2$ is the squared Euclidean distance between $\mathbf{s}$ and $\mathbf{q}$ (the usual RBF form), and $\gamma$ is as described in the quote above.
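For what it's worth, here is a small numerical sketch of the interpretation I have in mind (all names here are my own, not from any particular source): if we set $\gamma = 1/(2\sigma^2)$, then $\exp(-\gamma\lVert\mathbf{s}-\mathbf{q}\rVert^2)$ is exactly an isotropic Gaussian likelihood $N(\mathbf{s};\, \mathbf{q},\, \sigma^2 I)$ with the normalizing constant dropped, so the ratio density/kernel is the same constant for every $\mathbf{s}$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.7
gamma = 1.0 / (2.0 * sigma**2)   # the assumed correspondence gamma = 1/(2*sigma^2)
n = 3                            # dimension of s and q
q = rng.normal(size=n)           # the "source" / mean vector

def rbf(s, q, gamma):
    # RBF kernel with squared Euclidean distance
    return np.exp(-gamma * np.sum((s - q) ** 2))

def gauss_pdf(s, q, sigma):
    # isotropic multivariate Gaussian density N(s; q, sigma^2 I)
    norm = (2 * np.pi * sigma**2) ** (len(s) / 2)
    return np.exp(-np.sum((s - q) ** 2) / (2 * sigma**2)) / norm

# The ratio pdf/kernel should be the same constant for every observation s,
# i.e. the kernel is the Gaussian likelihood up to normalization.
ratios = []
for _ in range(5):
    s = rng.normal(size=n)
    ratios.append(gauss_pdf(s, q, sigma) / rbf(s, q, gamma))

assert np.allclose(ratios, ratios[0])
```

So the kernel value is proportional to the Gaussian density of $\mathbf{s}$ under mean $\mathbf{q}$, but (being an unnormalized density) it is not itself a probability. Is that the right way to read the quote?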
Does this all seem consistent? That is, does this probability follow directly from the RBF kernel comparing an observation to some mean value (or reference/source value)?
Any references/links to tutorials are most welcome.