For a while, it seemed like Fisher kernels might become popular, as they offered a way to construct kernels from probabilistic models. However, I've rarely seen them used in practice, and I have it on good authority that they tend not to work very well. They rely on the computation of the Fisher information - quoting Wikipedia:
> the Fisher information is the negative of the expectation of the second derivative with respect to θ of the natural logarithm of f. Information may be seen to be a measure of the "curvature" of the support curve near the maximum likelihood estimate (MLE) of θ.
As far as I can tell, this means that the kernel function between two points is then the distance along this curved surface - am I right?
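To make my reading concrete: as I understand the Jaakkola and Haussler construction, each instance x is mapped to its score vector U_x = ∇_θ log p(x | θ̂) at the MLE, and the kernel is the inner product K(x, y) = U_x^T I^{-1} U_y weighted by the inverse Fisher information I. Here is a rough numerical sketch of that reading - the univariate Gaussian model and all the specifics below are just my own toy example, not taken from any paper or library:

```python
import numpy as np

# Toy generative model: univariate Gaussian, theta = (mu, log_sigma).
def log_lik(x, theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((x - mu) / sigma) ** 2

def score(x, theta, eps=1e-5):
    """Fisher score U_x: numerical gradient of log p(x | theta) w.r.t. theta."""
    g = np.zeros(len(theta))
    for i in range(len(theta)):
        tp = np.array(theta, dtype=float)
        tm = np.array(theta, dtype=float)
        tp[i] += eps
        tm[i] -= eps
        g[i] = (log_lik(x, tp) - log_lik(x, tm)) / (2 * eps)
    return g

def fisher_kernel(x, y, theta_hat, data):
    """K(x, y) = U_x^T I^{-1} U_y, with I estimated empirically at theta_hat."""
    scores = np.stack([score(d, theta_hat) for d in data])
    info = scores.T @ scores / len(data)   # empirical Fisher information matrix
    return score(x, theta_hat) @ np.linalg.pinv(info) @ score(y, theta_hat)

# Fit theta by maximum likelihood, then evaluate the kernel between two points.
data = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=200)
theta_mle = np.array([data.mean(), np.log(data.std())])
print(fisher_kernel(0.5, 1.5, theta_mle, data))
```

If that reading is right, everything about the kernel is determined by the per-instance score vectors and the Fisher information evaluated at a single point estimate.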
However, this could be problematic for use in kernel methods, as:
- The MLE might be a very bad estimate for a given model
- The curvature of the support curve around the MLE might not be of any use for discriminating between instances, for example if the likelihood surface were very peaked
- This seems to throw away a lot of information about the model
If this is the case, are there any more modern ways of constructing kernels from probabilistic models? For example, could we use a hold-out set to obtain MAP estimates and use those in the same way? What other notions of distance or similarity from probabilistic models could be used to construct a (valid) kernel function?
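To be clearer about the hold-out/MAP suggestion, this is roughly what I have in mind (again a toy Gaussian model, this time with a conjugate prior on the mean; every detail here is invented purely to illustrate the question): fit the parameters by MAP on a held-out split, then build the same score-based kernel on the remaining data.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=1.0, scale=2.0, size=400)
holdout, rest = data[:200], data[200:]   # parameters are fit only on the hold-out split

# Toy model: Gaussian with known sigma, unknown mean mu, and a Normal(0, tau^2) prior on mu.
sigma, tau = 2.0, 5.0
n = len(holdout)
mu_map = holdout.sum() / (n + sigma**2 / tau**2)   # closed-form MAP estimate of mu

def score(x):
    """Gradient of log p(x | mu) with respect to mu, evaluated at the MAP estimate."""
    return (x - mu_map) / sigma**2

fisher_info = 1.0 / sigma**2   # exact per-observation Fisher information for this model

def map_anchored_kernel(x, y):
    """Same form as the Fisher kernel, K(x, y) = U_x * I^{-1} * U_y, but anchored at mu_map."""
    return score(x) * (1.0 / fisher_info) * score(y)

print(map_anchored_kernel(rest[0], rest[1]))
```

I have no idea whether anchoring the scores at a MAP estimate like this keeps whatever properties make the Fisher kernel attractive in the first place, which is really what I'm asking.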