7

I am interested in estimating the gradient of the log probability distribution $\nabla\log p(x)$ when $p(x)$ is not analytically available but is only accessed via samples $x_i \sim p(x)$.

There seem to be various possible solutions utilizing nearest neighbors, kernel estimates, etc. See 'The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition', for example.

I am not very knowledgeable about this specific problem, so I am looking for a method that is widely adopted in the statistics community for high-dimensional distributions and has desirable computational properties. A pointer to an open-source implementation of the method would also be extremely useful.

jkt
    You may want to look at what people do in the Approximate Bayesian Computation (ABC) literature, in particular, I dimly recall a Hamiltonian Monte Carlo version of ABC being described in one paper, which would have needed a solution for exactly that problem. – Björn Apr 14 '17 at 05:08

2 Answers

8

This problem is sometimes called score estimation, because $\nabla_x \log p(x)$ is the score of whatever model $p$ with respect to a hypothetical location parameter. I've also seen it called the Hyvärinen score, after Aapo Hyvärinen's score matching technique for fitting (unnormalized) statistical models.

Some flavors of score estimation:

From a density estimate

Just do any flavor of density estimation, $\hat p(x)$, and differentiate its log, perhaps using automatic differentiation as in the other answer.
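
For instance, once you have any differentiable estimate of $\log \hat p(x)$, an autodiff framework hands you the score in a couple of lines. A minimal sketch, assuming PyTorch and using a stand-in standard-normal log-density in place of a real fitted estimate:

```python
import torch

def log_phat(x):
    # Placeholder log-density (standard normal, up to an additive constant);
    # swap in whatever differentiable density estimate you have fitted.
    return -0.5 * (x ** 2).sum()

x = torch.tensor([0.5, -1.0], requires_grad=True)
log_phat(x).backward()   # autodiff through the log-density
print(x.grad)            # score estimate at x; here it is exactly -x
```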

  • Kernel density estimation might be reasonable in fairly low dimensions (a short numerical sketch appears after this list):

$$ \hat p(x) = \frac1n \sum_{i=1}^n k(X_i, x) \qquad \nabla_x \log \hat p(x) = \frac{\frac1n \sum_{i=1}^n \nabla_x k(X_i, x)}{\frac1n \sum_{i=1}^n k(X_i, x)} .$$

  • Deep density estimation models are a nice fit for this setup since they already use autodiff. One good recent model is

    Papamakarios, Pavlakou, and Murray. Masked Autoregressive Flow for Density Estimation. NIPS 2017.

    This type of approach is more computationally expensive than KDE but can yield excellent results, particularly if you have a fair amount of data and a GPU. Code is available from the authors.
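
To make the KDE formula above concrete, here is a minimal NumPy sketch assuming a Gaussian kernel; the bandwidth h is a placeholder you would have to tune (e.g. by cross-validation), and keep in mind that the estimate degrades quickly beyond a handful of dimensions:

```python
import numpy as np

def kde_score(x, samples, h):
    """Score of a Gaussian-kernel KDE, i.e. grad_x log phat(x).

    x:       (d,) query point
    samples: (n, d) draws X_i from p
    h:       bandwidth; k(X_i, x) = exp(-||x - X_i||^2 / (2 h^2))
    """
    diffs = samples - x                              # rows are X_i - x
    logw = -np.sum(diffs ** 2, axis=1) / (2 * h ** 2)
    w = np.exp(logw - logw.max())                    # kernel weights, numerically stabilized
    # grad_x k(X_i, x) = k(X_i, x) (X_i - x) / h^2 for the Gaussian kernel,
    # so the score is a weighted average of (X_i - x) / h^2.
    return (w @ diffs) / (h ** 2 * w.sum())

# Usage: for a standard normal the true score is -x.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 2))
print(kde_score(np.array([0.5, -1.0]), X, h=0.3))    # roughly [-0.5, 1.0]
```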

From an unnormalized log-density estimate

An unnormalized log-density is sometimes called an energy (conventionally up to a sign, via $p(x) \propto e^{-E(x)}$); differentiating an estimate of the unnormalized log-density gives an estimate of the score.

  • Kernel exponential families, which are fit via score matching, are a reasonably natural fit for score estimation. (They're hard to normalize, but luckily that doesn't matter for the score.) There's some evidence that these scale better than KDE in moderate dimensions, at least for certain kinds of densities. For a recent, computationally reasonable-ish approach, see

    Sutherland (ahem), Strathmann, Arbel, and Gretton. Efficient and principled score estimation with Nyström kernel exponential families. AISTATS 2018.

    These have some nice theoretical properties but, in my opinion as one of the authors of that paper: practical application still requires a lot of manual work in choosing the kernel and hyperparameters (something we're working on now). Code is available from the authors, but email me if you want to actually try it. :)

  • A more recent approach trains a network to directly output the energy, using a variant of score matching. I'm not familiar with the practicalities here, but it seems potentially promising:

    Saremi, Mehrjou, Schölkopf, and Hyvärinen. Deep Energy Estimator Networks. arXiv, May 2018.

Direct score estimators

You can also train a model to just directly output the score.

One problem with this approach is that you might or might not get something that is actually the gradient of some scalar function, i.e. a valid gradient field; the first approach below certainly suffers from that problem, and I'm not sure about the later ones.

  • In theory, if you train a denoising autoencoder $r(x + \sigma \varepsilon) \approx x$, where $\varepsilon \sim \mathcal N(0, I)$, then as $\sigma \to 0$ you have $\frac{r(x + \sigma \varepsilon) - x}{\sigma^2} \to \nabla_x \log p(x)$, under some assumptions about the capacity of the autoencoder architecture and your ability to optimize it.

    The estimator as written in this paper doesn't seem to work especially well (see the experimental results of the kernel exp family and deep energy estimator network papers above), but it might work well with a more complex network structure and more careful parameter tuning / etc. I wouldn't particularly recommend this strategy, but it's interesting.

    Alain and Bengio. What Regularized Auto-Encoders Learn from the Data Generating Distribution. JMLR 2014.

  • The following paper (in Section 3.1) proposes an estimator specifically for $\nabla_x \log p(X_i)$, given only $X_i \sim p$, based on inverting Stein's identity (and also depending on a kernel choice). I'm not familiar with its practical performance, but I imagine it should be okay if you choose a reasonable kernel and can get away with only evaluating the score at the $X_i$. (A rough sketch of this estimator appears after this list.)

    Li and Turner. Gradient Estimators for Implicit Models. ICLR 2018.

  • The following paper is a sequel of sorts to the previous one; it gives a different (Stein-based) estimator that can be evaluated at out-of-sample points and provides some theoretical guarantees. I haven't read it yet, but it's high on the list:

    Shi, Sun, and Zhu. A Spectral Approach to Gradient Estimation for Implicit Distributions. arXiv, June 2018.
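
For what it's worth, here is my rough reading of the Li and Turner estimator (their Section 3.1) as a NumPy sketch, assuming an RBF kernel; the bandwidth sigma and ridge parameter eta are placeholders you would need to tune, so check it against the paper before relying on it:

```python
import numpy as np

def stein_score_estimate(X, sigma=1.0, eta=0.1):
    """Rough sketch of the Stein gradient estimator (Li & Turner, Sec. 3.1).

    X:     (n, d) samples X_i ~ p
    sigma: RBF kernel bandwidth, k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    eta:   ridge regularizer added to the kernel matrix
    Returns an (n, d) array whose i-th row estimates grad_x log p(X_i).
    """
    n, d = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # (n, n) squared distances
    K = np.exp(-sq / (2 * sigma ** 2))
    # <grad, K>_i = sum_k k(X_i, X_k) (X_i - X_k) / sigma^2 for the RBF kernel
    grad_K = (K.sum(axis=1, keepdims=True) * X - K @ X) / sigma ** 2
    # G_hat = -(K + eta I)^{-1} <grad, K>
    return -np.linalg.solve(K + eta * np.eye(n), grad_K)

# Usage: for a standard normal, row i should be roughly -X[i].
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
G = stein_score_estimate(X, sigma=1.0, eta=0.1)
print(G[:3], -X[:3])
```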

Danica
2

To get things started and also give some suggestions for open-source implementations: I know that Theano (which is accessed via Python) allows one to compute gradients numerically, using theano.gradient.numeric_grad (http://deeplearning.net/software/theano/library/gradient.html#module-theano.gradient).

There is also a Python package called PyMC3, which allows one to implement Bayesian methods in Python and makes heavy use of Theano. It offers Hamiltonian Monte Carlo; from its getting-started guide (https://pymc-devs.github.io/pymc3/notebooks/getting_started.html):

PyMC3, Stan (Stan Development Team, 2014), and the LaplacesDemon package for R are currently the only PP packages to offer HMC.


This might help you to get some of the suggestions implemented.
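
As a minimal illustration of what Theano's gradient machinery gives you (note that this presumes you can write down the log-density symbolically, which is exactly the catch the OP raises in the comments below; the Gaussian here is just a stand-in):

```python
import theano
import theano.tensor as T

x = T.vector('x')                                 # symbolic input
logp = -0.5 * T.sum(x ** 2)                       # stand-in log-density; replace with your model
score = theano.function([x], T.grad(logp, x))     # compile the gradient of log p w.r.t. x
print(score([0.5, -1.0]))                         # -> approximately [-0.5, 1.0]
```

PyMC3 builds its model log-probabilities as Theano expressions under the hood, and this same gradient machinery is what powers its Hamiltonian Monte Carlo sampler.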

MJW
  • I will try to add some more to my answer this afternoon. – MJW Apr 19 '17 at 10:07
  • Thanks for the answer. I am aware of automatic differentiation systems like tf, theano, torch, etc. However, all these systems require a compact form for the model. In my case I don't know anything about p(x); however, I have access to samples from it. My question is then whether we can approximate $\nabla \log p(x)$ via these samples. – jkt Apr 19 '17 at 21:18