I am trying to figure out the best distribution to fit to some data, and I'm not sure whether what I am doing is statistically correct. My data consist of 20 samples per year over 10 years. For each sample I have run a distribution-fitting algorithm (using fitdistr() in R) to get the estimated parameters for each type of distribution. I am testing gamma, chi-squared, Weibull and lognormal distributions.

My next step was to run a Kolmogorov-Smirnov test on each sample, with the distribution parameters set to the values estimated from that same sample. I was going to find which distribution was the overall 'best' (lowest average p-value for all 200 samples), and say that this was the distribution my data described. I have read that using the KS test in this way is incorrect and that the resulting p-values will be unreliable.

I'm not sure whether I can use the KS test in this way, or whether I should use maximum likelihood estimation instead.

  • In addition to @Ezekiel2517's points, note that you'd have to use a bootstrapped version of the Kolmogorov-Smirnov test if the parameters from each model are estimated from the data. – Scortchi - Reinstate Monica Apr 10 '13 at 16:29
  • From what I gather, I would run `fitdistr()` on the sample, then compute the AIC from the output of `fitdistr()`. So I'm not sure why I'd need to bootstrap the KS test if I am no longer using it? – D'Arcy Mulder Apr 10 '13 at 19:01
  • Sorry: I meant that it was an additional issue with your original idea, not an additional thing to do after calculating AICs. – Scortchi - Reinstate Monica Apr 11 '13 at 08:27
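
To illustrate the bootstrapped Kolmogorov-Smirnov test mentioned in the comments, here is a minimal sketch of a parametric bootstrap for the gamma case. The simulated data, the helper name `ks_fitted_gamma` and the choice of `B = 500` replicates are assumptions made for illustration only; nothing here comes from the original poster's data.

```r
library(MASS)  # provides fitdistr()

set.seed(42)
# Simulated stand-in for one sample; replace with the real observations
x <- rgamma(20, shape = 2, rate = 0.5)

# Hypothetical helper: KS distance between a sample and a gamma
# distribution whose parameters are estimated from that same sample
ks_fitted_gamma <- function(v) {
  est <- suppressWarnings(fitdistr(v, "gamma"))$estimate
  unname(ks.test(v, "pgamma",
                 shape = est[["shape"]], rate = est[["rate"]])$statistic)
}

d_obs <- ks_fitted_gamma(x)
est   <- suppressWarnings(fitdistr(x, "gamma"))$estimate

# Simulate from the fitted model, refit on each simulated sample,
# and recompute the statistic to build its null distribution
B <- 500
d_boot <- replicate(B, tryCatch(
  ks_fitted_gamma(rgamma(length(x),
                         shape = est[["shape"]], rate = est[["rate"]])),
  error = function(e) NA_real_  # skip replicates where the optimiser fails
))

ok <- !is.na(d_boot)
# Bootstrap p-value: share of simulated statistics at least as large as observed
p_boot <- (1 + sum(d_boot[ok] >= d_obs)) / (1 + sum(ok))
p_boot
```

The refit inside each replicate is the important part: it reproduces the fact that the parameters were estimated from the data, which is what makes the standard KS p-value unreliable.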

1 Answer

Indeed, that is not a formal comparison. First of all, fitdistr() estimates the parameters by maximum likelihood, so you are already using a maximum likelihood approach. See: http://stat.ethz.ch/R-manual/R-patched/library/MASS/html/fitdistr.html.

The formal way to compare these models is to employ a model selection technique such as AIC, BIC, DIC or some other.
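
As a rough sketch of what an AIC comparison could look like for a single sample: the data below are simulated, and the starting value and lower bound for the chi-squared fit are arbitrary assumptions (fitdistr() needs a `start` list for that family).

```r
library(MASS)  # provides fitdistr()

set.seed(1)
# Simulated stand-in for one sample; replace with the real observations
x <- rgamma(20, shape = 2, rate = 0.5)

# Fit each candidate family by maximum likelihood
fits <- list(
  gamma     = fitdistr(x, "gamma"),
  weibull   = fitdistr(x, "weibull"),
  lognormal = fitdistr(x, "lognormal"),
  # chi-squared has no automatic starting value; the sample mean is used
  # here as an assumed start, and the lower bound keeps df positive
  chisq     = fitdistr(x, "chi-squared",
                       start = list(df = mean(x)), lower = 0.01)
)

# AIC() works on fitdistr objects through their logLik() method
aic <- sapply(fits, AIC)
sort(aic)  # smaller is better
```

Only differences in AIC between models fitted to the same data are meaningful; the absolute values depend on the sample.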

Finally (you have probably considered this), your observations appear to be indexed by time, which may be relevant to take into account.

  • In addition, this question seems to be a duplicate of http://stats.stackexchange.com/q/45033/24160 – Ezekiel2517 Apr 10 '13 at 16:46
  • Thanks! I ran the AIC, and found values for every sample. I was hoping you could give me a bit of insight into what I should do with the AIC values next. As with the p-values, I was planning on averaging the AIC values for each year (for each distribution). Then I would choose the best distribution as the one with the lowest average AIC. I'm not familiar with this value, so I don't know if averaging like this is valid. – D'Arcy Mulder Apr 10 '13 at 18:18
  • @darcy.mulder AIC has one term proportional to the log-likelihood and a bias-correction term proportional to the number of parameters. So you can just sum the AICs for each model to get an overall AIC. (If you're using the second-order AICc, sum the log-likelihoods & recalculate the bias-correction term based on the total no. parameters & the overall sample size.) You might also want to see if you could fit some common parameter across all datasets. – Scortchi - Reinstate Monica Apr 11 '13 at 09:34
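
Written out in symbols, the summing rule from the last comment looks like this. The notation is an assumption introduced here, not taken from the thread: for one candidate family, sample i has n_i observations, maximised log-likelihood \hat{\ell}_i and k_i fitted parameters, and each sample is treated as independent with its own parameters.

```latex
\begin{align*}
\mathrm{AIC}_i &= -2\hat{\ell}_i + 2k_i \\
\mathrm{AIC}_{\text{total}} &= \sum_i \mathrm{AIC}_i
  = -2\sum_i \hat{\ell}_i + 2\sum_i k_i \\
\mathrm{AICc}_{\text{total}} &= -2\sum_i \hat{\ell}_i + 2K + \frac{2K(K+1)}{n - K - 1},
  \qquad K = \sum_i k_i, \quad n = \sum_i n_i
\end{align*}
```

If every family is fitted to the same set of samples, ranking families by the average AIC gives the same order as ranking by the sum, since the two differ only by a constant factor. The AICc correction is not additive in this way, which is why it is recomputed from the totals K and n, and it is only defined when n > K + 1.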