How to choose the most appropriate distribution for a given vector in R?

Question

When creating a Monte Carlo simulation model for a variable, a critical step is to choose the distribution that best fits the variable's probability density.

I generally do this by looking at the density plot and determining what distribution best fits the density shape. For (a very lame) example, this…

x <- rnorm(1000)
plot(density(x))

… appears to be a normal distribution (but only because it's a random sample from the normal distribution).

However, when dealing with real-world data, it's difficult to know which of the 17 built-in distributions best represents the shape of the data.

For example, this data…

data <- c(6.515, 0.243, 0.725, 2.276, 1.456, 4.047, 0.766, 0.29, 2.368, 
0.543, 2.223, 0.488, 0.47, 3.511, 0.544, 4.191, 0.414, 0.704, 
4.917, 0.434, 0.773, 0.477, 3.257, 0.415, 1.921, 0.278, 3.159, 
4.193, 0.132, 1.109, 1.538, 4.088, 0.468, 0.047, 2.204, 3.765, 
0.168, 2.231, 0.164, 0.371, 2.33, 4.458, 0.046, 1.195, 1.714, 
1.046, 1.915, 2.66, 5.409, 0.466)

plot(density(data))

[Plot: kernel density estimate of data]

… seems like it could be best modeled with the chi-squared distribution, but it could also be a gamma distribution.

The only way I've found to fit the best type of model is to overlay a bunch of different possible distributions until I see one that visually matches (or comes close). But surely there's a more numerical, official way to do that, right?

Is there a systematic, non-visual (and automated) way to find the best distribution for a given variable? Is there a function in some R package that runs through different distributions to check their goodness of fit, or is that terribly inefficient?

I was hoping there was a function in some package (or something I could write) that would let me run a test to figure out the best distribution. I don't know how often I would need to do this, but I think it would be worth it to have a function that could do something like this… — Andrew, Feb 03 '12 at 04:48
If there was some nice and efficient way to find the best goodness-of-fit test, that would be ideal, obviously… — Andrew, Feb 03 '12 at 04:50
Related questions can be found with a search for one commonly recommended curve-fitting solution, Eureqa: http://stats.stackexchange.com/search?q=Eureqa. Please don't take this reference as a recommendation, though: it is not at all clear why you must use a parametric distribution in your simulation. Why not resample from the data you have? — whuber, Feb 03 '12 at 15:21

score 7 · Accepted Answer · answered Feb 04 '12 at 14:54

When deciding on a distribution, the science is more important than the tests. Think about what led to the data, and what values are possible, likely, and meaningful. The formal tests can find obvious differences, but often cannot rule out distributions that are similar (and note that the chi-squared distribution is a special case of the gamma distribution). Look at this quick simulation (and try it with other values):

> mean(replicate(1000, ks.test( rt(5000, df=20), pnorm )$p.value)<0.05)
[1] 0.111

ks.test detects the difference between a t-distribution with 20 df and a standard normal only 11% of the time, even with a sample size of 5000.
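The remark above that the chi-squared distribution is a special case of the gamma can be checked numerically. This is a minimal sketch; the degrees of freedom k and the evaluation grid are arbitrary choices made for illustration.

```r
# A chi-squared distribution with k df is a gamma with shape k/2 and rate 1/2.
k <- 3
grid <- seq(0.1, 10, by = 0.1)

# Largest absolute gap between the two densities over the grid
max_diff <- max(abs(dchisq(grid, df = k) -
                    dgamma(grid, shape = k / 2, rate = 1 / 2)))
max_diff
```

Since the two families are nested, choosing between "chi-squared" and "gamma" for the data in the question is really a question of whether to restrict the gamma parameters.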

If you really want to test the distributions, then I would suggest using the vis.test function in the TeachingDemos package. Instead of rigid tests of exact fit, it presents a plot of the original data mixed in with similar plots from the candidate distribution and asks you (or another viewer) to pick out the plot of the original data. If you cannot distinguish visually between your data and the simulated data, then the candidate distribution is probably a reasonable starting point (but this does not rule out other possible distributions; think about which ones make the most sense scientifically).
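The idea behind such a visual test can be sketched in base R. This is not vis.test itself, only an illustration of the lineup principle. The vector obs is stand-in data simulated so the block runs on its own, and the gamma candidate with method-of-moments parameters is an assumption made for this sketch.

```r
# Lineup sketch: hide the real data among panels simulated from the candidate.
set.seed(42)
obs <- rgamma(50, shape = 1, rate = 0.5)   # stand-in for the observed data

# Candidate distribution: gamma fitted by the method of moments (assumption)
shape_hat <- mean(obs)^2 / var(obs)
rate_hat  <- mean(obs) / var(obs)

n_panels <- 9
true_pos <- sample(n_panels, 1)            # panel that will hold the real data

op <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
for (i in seq_len(n_panels)) {
  y <- if (i == true_pos) obs else
    rgamma(length(obs), shape = shape_hat, rate = rate_hat)
  hist(y, main = paste("Panel", i), xlab = "", col = "grey")
}
par(op)

# Make a guess from the plot first, then reveal the answer:
true_pos
```

If viewers pick the real panel no more often than chance (1 in 9 here), the candidate is at least visually compatible with the data.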

Another approach would be to generate your new data from the density estimate of your original data. The logspline package for R has functions to estimate the density, then generate random data from that estimate. Alternatively, generating data from a kernel density estimate means selecting a point from your data, then generating a random value from the kernel centered on that point. This can be as simple as selecting a random sample from the data with replacement, then adding small normal deviates to the values.
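The resample-then-jitter recipe just described might look like the following. It is a sketch under assumptions: obs is stand-in data, r_kde is a hypothetical helper name, and the bandwidth comes from bw.nrd0(), the default rule used by density().

```r
# Smoothed bootstrap: draw from a Gaussian kernel density estimate.
set.seed(1)
obs <- rexp(50, rate = 0.5)                 # stand-in for the observed data

r_kde <- function(n, x, bw = bw.nrd0(x)) {
  centers <- sample(x, n, replace = TRUE)   # pick data points at random
  centers + rnorm(n, mean = 0, sd = bw)     # add a small normal deviate
}

sim <- r_kde(10000, obs)
c(observed = mean(obs), simulated = mean(sim))
```

Note that for strictly positive data this can produce a few negative values near zero, so a transformation (for example, jittering on the log scale) may be worth considering.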

Xi'an · Answer 2 · 2012-02-04T16:36:49.790

4

There is no reason one of the "official" distributions would fit your data. The most relevant statistical test for checking fit to a distribution is the Kolmogorov-Smirnov test. E.g.,

> x=rnorm(133) 
> ks.test(x,"pnorm",mean(x),sd(x))

        One-sample Kolmogorov-Smirnov test

data:  x 
D = 0.0388, p-value = 0.9882
alternative hypothesis: two-sided

(with the caveat that the p-value does not account for the parameter estimation).
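The effect of that caveat can be seen in a small simulation. The sketch below (sample size and number of replications are arbitrary choices) draws truly normal data and compares how often ks.test rejects at the 5% level when the parameters are estimated from the sample versus when they are fixed in advance.

```r
set.seed(123)
n_rep <- 2000

# Parameters estimated from each sample (the usage shown above)
p_est <- replicate(n_rep, {
  x <- rnorm(100)
  ks.test(x, "pnorm", mean(x), sd(x))$p.value
})

# Parameters specified in advance (the case the test is designed for)
p_known <- replicate(n_rep, ks.test(rnorm(100), "pnorm", 0, 1)$p.value)

reject_est   <- mean(p_est < 0.05)
reject_known <- mean(p_known < 0.05)
c(estimated = reject_est, known = reject_known)
```

With estimated parameters the rejection rate under the null should fall well below the nominal 5%, which is another way of saying that the reported p-values are too large.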

Edited: To find the proper p-value, one can use a Monte Carlo experiment: produce samples $x$ from the hypothesized distribution and, for each of them, compute the ks.test distance after re-estimating the parameters. This sample of distances can then be used to find an empirical p-value:

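# x0 is the observed sample (see the examples below)
nsim <- 10^2
ksdist <- numeric(nsim)
for (t in seq_len(nsim)) {
  # simulate from the fitted normal, then re-estimate its parameters
  x <- rnorm(length(x0), mean(x0), sd(x0))
  ksdist[t] <- ks.test(x, "pnorm", mean(x), sd(x))$statistic
}
# proportion of simulated distances exceeding the observed one
empvalue <- mean(ksdist > ks.test(x0, "pnorm", mean(x0), sd(x0))$statistic)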

For instance,

> x0=rt(123,df=4)
> empvalue
[1] 0.02
> ks.test(x0,"pnorm",mean(x0),sd(x0))$p
[1] 0.2996538

and

> x0=rnorm(321)
> empvalue
[1] 0.1
> ks.test(x0,"pnorm",mean(x0),sd(x0))$p
[1] 0.568052

shows how the simulation corrects the improper p-value. (This exercise is usually part of my final exam in exploratory statistics.)

edited Feb 04 '12 at 16:36

answered Feb 03 '12 at 06:29

Xi'an


This use of ks is incorrect, as its manual page points out: when you set the reference mean and sd to those of the sample, you greatly inflate the p-values. (Do a simulation and check!) That's why the p-value in your example is so unusually high. – whuber Feb 03 '12 at 15:12
@whuber: I agree! That's why I mentioned the caveat... – Xi'an Feb 03 '12 at 15:18
Ah, sorry. I did not understand the caveat in that sense. Why, then, did you provide an example you knew to be an incorrect application as a reply to the question? – whuber Feb 03 '12 at 15:24
@Xi'an No need to do the simulations here! For normality testing you can use the 'nortest' package, which contains (among others) the Lilliefors test. – Anton Korobeynikov Feb 04 '12 at 21:36
