Suppose you are trying to estimate the pdf of a random variable $X$, for which you have a large number of i.i.d. samples $\{X_i\}_{i=1}^{n}$ (i.e. $n$ is very large, think thousands to millions).
One option is to estimate the mean and variance, and just assume the distribution is Gaussian.
At the other extreme, one can use a kernel density estimate to get something more accurate (especially with so much data).
The problem is that I need to evaluate the resulting pdf very fast. If I assume the pdf is Gaussian, then evaluating $f_X(x)$ is very fast, but the estimate might not be accurate. A kernel density estimate, on the other hand, will be far too slow to use, since each evaluation requires a pass over all $n$ samples.
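To make the two extremes concrete, here is a minimal sketch (assuming Python with NumPy/SciPy; the heavy-tailed sample data is just a placeholder) contrasting the cheap Gaussian fit with a kernel density estimate:

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
samples = rng.standard_t(df=5, size=100_000)   # placeholder heavy-tailed data

# Gaussian assumption: two parameters, O(1) cost per query point.
mu, sigma = samples.mean(), samples.std(ddof=1)

def fast_pdf(x):
    return norm.pdf(x, loc=mu, scale=sigma)

# KDE: flexible, but each evaluation sums over all n samples -> O(n) per query.
kde = gaussian_kde(samples)

x = np.linspace(-5, 5, 1000)
gauss_vals = fast_pdf(x)   # essentially instantaneous
kde_vals = kde(x)          # noticeably slower as n grows
```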
So the question is: what are common ways to get pdf estimates that are more general than Gaussians, but in an incremental fashion? Ideally, I'd like a model with a number of parameters (say $k$) that can be used to trade off estimation accuracy against evaluation speed.
Possible directions I thought about are:
Estimate the moments of the distribution, and construct the pdf from these moments alone; $k$ here is the number of moments. But then, what is the model for the pdf based on these moments?
Gaussian mixtures with $k'$ components (here $k = 3k' - 1$, since for each component we keep a mean, a variance and a weight, and the weights sum to one). Is this a good idea? (A rough sketch of this option is given below.)
Any other ideas are welcome.
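For concreteness, here is the sketch of the mixture option mentioned above (assuming scikit-learn is available; the data and the choice $k' = 4$ are placeholders): fit a Gaussian mixture with $k'$ components via EM, then evaluate the density. Once fitted, the per-query cost grows with $k'$, not with $n$.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
samples = rng.standard_t(df=5, size=100_000).reshape(-1, 1)  # placeholder data

k_prime = 4                                    # number of mixture components
gmm = GaussianMixture(n_components=k_prime, random_state=0)
gmm.fit(samples)

def mixture_pdf(x):
    """Evaluate the fitted mixture density at the points x (shape (m,))."""
    return np.exp(gmm.score_samples(np.asarray(x).reshape(-1, 1)))

x = np.linspace(-5, 5, 1000)
pdf_vals = mixture_pdf(x)                      # cost scales with k', not n
```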
Thanks!
Related question: ML estimation.
Update / clarification:
Thanks for all the answers so far.
I really need the pdf (not the cdf, and not samples from the distribution). Specifically, I am using the scalar pdf estimates for Naive Bayes (NB) classification and regression: given the label, each feature has its own pdf, and the NB assumption is that the features are conditionally independent given the label. So in order to calculate the posterior (the probability of the label given the feature values), I need each of these pdfs evaluated at the observed feature values; a sketch of how they enter the posterior is below.
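A minimal sketch of that setup (my own illustration, not a definitive implementation; the helper names `fit_nb` and `nb_posterior` are hypothetical): here each class-conditional feature density is a plain Gaussian, but any of the faster-to-evaluate pdf models discussed above could be plugged in instead.

```python
import numpy as np
from scipy.stats import norm

def fit_nb(X, y):
    """X: (n, d) features, y: (n,) integer labels. Returns per-class priors
    and per-feature (mean, std) for Gaussian class-conditional densities."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    params = {c: (X[y == c].mean(axis=0), X[y == c].std(axis=0, ddof=1))
              for c in classes}
    return priors, params

def nb_posterior(x, priors, params):
    """Posterior over classes for one feature vector x, using
    log p(c | x) ∝ log p(c) + sum_j log f_j(x_j | c)."""
    classes = list(priors)
    log_post = np.array([
        np.log(priors[c])
        + norm.logpdf(x, loc=params[c][0], scale=params[c][1]).sum()
        for c in classes
    ])
    log_post -= log_post.max()     # for numerical stability
    post = np.exp(log_post)
    return dict(zip(classes, post / post.sum()))
```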