For a while, it seemed like Fisher kernels might become popular, as they offered a way to construct kernels from probabilistic models. However, I've rarely seen them used in practice, and I have it on good authority that they tend not to work very well. They rely on the computation of the Fisher information - quoting Wikipedia:
> the Fisher information is the negative of the expectation of the second derivative with respect to θ of the natural logarithm of f. Information may be seen to be a measure of the "curvature" of the support curve near the maximum likelihood estimate (MLE) of θ.
As far as I can tell, this means that the kernel function between two points is then the distance along this curved surface - am I right?
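To make my reading concrete: as I understand the Jaakkola and Haussler construction, each instance x is mapped to its score vector U_x = ∇_θ log p(x | θ̂) at the MLE, and the kernel is the inner product K(x, y) = U_x^T I^{-1} U_y weighted by the inverse Fisher information I. Here is a rough numerical sketch of that reading - the univariate Gaussian model and all the specifics below are just my own toy example, not taken from any paper or library:

```python
import numpy as np

# Toy generative model: univariate Gaussian, theta = (mu, log_sigma).
def log_lik(x, theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((x - mu) / sigma) ** 2

def score(x, theta, eps=1e-5):
    """Fisher score U_x: numerical gradient of log p(x | theta) w.r.t. theta."""
    g = np.zeros(len(theta))
    for i in range(len(theta)):
        tp = np.array(theta, dtype=float)
        tm = np.array(theta, dtype=float)
        tp[i] += eps
        tm[i] -= eps
        g[i] = (log_lik(x, tp) - log_lik(x, tm)) / (2 * eps)
    return g

def fisher_kernel(x, y, theta_hat, data):
    """K(x, y) = U_x^T I^{-1} U_y, with I estimated empirically at theta_hat."""
    scores = np.stack([score(d, theta_hat) for d in data])
    info = scores.T @ scores / len(data)   # empirical Fisher information matrix
    return score(x, theta_hat) @ np.linalg.pinv(info) @ score(y, theta_hat)

# Fit theta by maximum likelihood, then evaluate the kernel between two points.
data = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=200)
theta_mle = np.array([data.mean(), np.log(data.std())])
print(fisher_kernel(0.5, 1.5, theta_mle, data))
```

If that reading is right, everything about the kernel is determined by the per-instance score vectors and the Fisher information evaluated at a single point estimate.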
However, this could be problematic for use in kernel methods, as:
- The MLE might be a very bad estimate for a given model
- The curvature of the support curve around the MLE might not be of any use for discriminating between instances, for example if the likelihood surface were very peaked
- This seems to throw away a lot of information about the model
If this is the case, are there any more modern ways of constructing kernels from probabilistic models? For example, could we use a hold-out set to obtain MAP estimates and use those in the same way? What other notions of distance or similarity from probabilistic models could be used to construct a (valid) kernel function?
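To be clearer about the hold-out/MAP suggestion, this is roughly what I have in mind (again a toy Gaussian model, this time with a conjugate prior on the mean; every detail here is invented purely to illustrate the question): fit the parameters by MAP on a held-out split, then build the same score-based kernel on the remaining data.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=1.0, scale=2.0, size=400)
holdout, rest = data[:200], data[200:]   # parameters are fit only on the hold-out split

# Toy model: Gaussian with known sigma, unknown mean mu, and a Normal(0, tau^2) prior on mu.
sigma, tau = 2.0, 5.0
n = len(holdout)
mu_map = holdout.sum() / (n + sigma**2 / tau**2)   # closed-form MAP estimate of mu

def score(x):
    """Gradient of log p(x | mu) with respect to mu, evaluated at the MAP estimate."""
    return (x - mu_map) / sigma**2

fisher_info = 1.0 / sigma**2   # exact per-observation Fisher information for this model

def map_anchored_kernel(x, y):
    """Same form as the Fisher kernel, K(x, y) = U_x * I^{-1} * U_y, but anchored at mu_map."""
    return score(x) * (1.0 / fisher_info) * score(y)

print(map_anchored_kernel(rest[0], rest[1]))
```

I have no idea whether anchoring the scores at a MAP estimate like this keeps whatever properties make the Fisher kernel attractive in the first place, which is really what I'm asking.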