Quite simply: I have some probability distribution $p(x)$; how can I measure whether one empirical density (a set of delta masses) is a better approximation to it than another? I know that KL divergence is a well-accepted measure between two continuous densities, but it's not clear how to apply it to a set of samples.
1 Answer
For visualization purposes, try a Q-Q plot, which is a plot of the quantiles of your data against the quantiles of the expected distribution.
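For a 1D target with a known quantile function, the Q-Q plot can be sketched as follows (a minimal SciPy example; the standard normal here is just a stand-in target):

```python
import numpy as np
from scipy import stats

# Hypothetical data: 500 draws we want to check against a standard normal.
rng = np.random.default_rng(0)
samples = rng.normal(size=500)

# Plotting positions: one probability level per sorted data point.
probs = (np.arange(1, len(samples) + 1) - 0.5) / len(samples)
data_quantiles = np.sort(samples)                  # empirical quantiles
theory_quantiles = stats.norm.ppf(probs)           # analytic quantiles

# For a good fit, (theory_quantiles, data_quantiles) lies near the line y = x,
# e.g. with matplotlib: plt.plot(theory_quantiles, data_quantiles, '.')
```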
If you want a statistical test, the Kolmogorov-Smirnov statistic provides a non-parametric test for whether the data come from $p(x)$, using the maximum difference in the empirical and analytic cdf.
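A minimal sketch of the K-S comparison using `scipy.stats.kstest`, with a standard normal target and two hypothetical sample sets of different quality:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
good = rng.normal(size=1000)           # drawn from the target N(0, 1)
bad = rng.normal(loc=0.5, size=1000)   # shifted: a worse approximation

# kstest returns the max |empirical cdf - analytic cdf| and a p-value.
stat_good, p_good = stats.kstest(good, 'norm')
stat_bad, p_bad = stats.kstest(bad, 'norm')

# The better sample set is the one with the smaller K-S statistic.
```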
Of course, you could also evaluate the log-probability of your data under the two distributions: $L_1 = \sum_i \log p_1(X_i)$ vs. $L_2 = \sum_i \log p_2(X_i)$, and take whichever is larger. This is equivalent to maximum-likelihood density comparison. (However, this may not be valid if $p_1$ and $p_2$ are distributions fit to your data, especially if they have different numbers of fitted parameters; in that case you want to do "model comparison", and there are a variety of tools for this: AIC, BIC, Bayes factors, the likelihood-ratio test, cross-validation, etc.)
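The log-probability comparison can be sketched like so (the two normal candidates are purely illustrative; note the sums are over *log*-densities):

```python
import numpy as np
from scipy import stats

# Hypothetical data, actually drawn from N(0, 1).
rng = np.random.default_rng(2)
x = rng.normal(size=200)

# Candidate densities: p1 = N(0, 1), p2 = N(1, 1).
L1 = np.sum(stats.norm.logpdf(x, loc=0.0, scale=1.0))
L2 = np.sum(stats.norm.logpdf(x, loc=1.0, scale=1.0))

# Prefer whichever density gives the larger total log-probability.
better = "p1" if L1 > L2 else "p2"
```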

+1 great answer, but it looks like I need the cdf of $p(x)$ to compute the Kolmogorov-Smirnov statistic. Unfortunately I don't have that; in fact, I only have $p(x)$ up to a normalizing constant (which is why I'm approximating it with samples). – fairidox Jul 08 '11 at 08:04
Also, I have a question about log-likelihood. Imagine I have a complex multi-modal function, and a set of samples all concentrated around one high-density point. This is clearly a poor approximation, but it would have a large likelihood; in fact, the maximum-likelihood set of points is $N$ points all exactly at the global max. – fairidox Jul 08 '11 at 08:25
Ah, OK, I understand your question a bit better. You're right that $N$ points at the mode will have the highest likelihood (making the likelihood a poor statistic to use if your goal is to get a "typical" sample from $p(x)$). I should have said that Q-Q and K-S both (typically) get applied to 1D distributions. Is that the case here? – jpillow Jul 08 '11 at 09:03
If you can't evaluate $p(x)$ but _can_ evaluate something proportional to it, then you might try running an MCMC sampler for a really long time. Use those samples to compute your "true" cdf. Then plot that cdf relative to the cdf of your two (presumably smaller) samples, and you'll have something similar to the Q-Q plot (some people call it a K-S plot). Although: if your problem is that you don't know how well an MCMC sampler is working in the first place, then this may not solve your problem! – jpillow Jul 08 '11 at 09:05
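The MCMC-based "K-S plot" idea in the comment above might look roughly like this (a toy random-walk Metropolis sampler on a hypothetical unnormalized Gaussian; the names `log_p_unnorm` and `ecdf` are illustrative):

```python
import numpy as np

# Hypothetical target, known only up to a constant: unnormalized N(0, 1).
def log_p_unnorm(x):
    return -0.5 * x * x

# Simple random-walk Metropolis sampler (a sketch; run far longer in practice).
def metropolis(n_steps, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0
    out = np.empty(n_steps)
    for i in range(n_steps):
        prop = x + step * rng.normal()
        # Accept with probability min(1, p(prop)/p(x)); the constant cancels.
        if np.log(rng.uniform()) < log_p_unnorm(prop) - log_p_unnorm(x):
            x = prop
        out[i] = x
    return out

reference = metropolis(50_000)   # long run: a stand-in for the "true" cdf

def ecdf(sample, grid):
    # Fraction of the sample at or below each grid point.
    return np.searchsorted(np.sort(sample), grid, side='right') / len(sample)

grid = np.linspace(-3, 3, 100)
# Plot ecdf(candidate, grid) against ecdf(reference, grid) for each candidate
# sample set; deviation from y = x indicates the worse approximation.
```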
Final thought: if your density is 1D (or even low-D) and you can evaluate it up to a constant of proportionality, you can still construct the CDF by evaluating $p(x)$ at evenly-spaced increments and normalizing by the sum of all values $p(x_i)$ (i.e., so that the cdf goes from 0 to 1). – jpillow Jul 08 '11 at 09:07
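The grid-based construction from the comment above, as a minimal NumPy sketch (the unnormalized Gaussian is just a stand-in for whatever density you can evaluate up to a constant):

```python
import numpy as np

# Hypothetical density known only up to proportionality: p(x) ∝ exp(-x²/2).
def p_unnorm(x):
    return np.exp(-0.5 * x ** 2)

# Evaluate on an even grid covering the region of interest, then normalize
# the cumulative sum so the cdf runs from ~0 up to exactly 1.
grid = np.linspace(-5, 5, 1001)
vals = p_unnorm(grid)
cdf = np.cumsum(vals)
cdf /= cdf[-1]
```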
Accepted the answer; I think the K-S statistic is what I need after all. Ultimately I'm looking to test the quality of a particular sampling technique, and for this analysis I can stick to functions where the cdf is known, so estimating it online is not really required. – fairidox Jul 08 '11 at 19:33