If I have a sample:
set.seed(0)
x <- rlnorm(500)
Then I can use the fit.distr function to find the best fit among two candidate distributions, e.g.
library(MASS)
find.bestfit <- function(x){
logN <- fitdistr(x, "lognormal")
gam <- fitdistr(x, "gamma")
ans <- ifelse(AIC(logN) < AIC(gam), "logN", "gam")
return(ans)
}
find.bestfit(x)
[1] "logN"
However, there is some probability that I will not recover the "true" distribution that was sampled (in this case "lognormal" was used to simulate x
). How can I calculate this probability?
I have only gotten so far as to consider using a bootstrap approach, but I am not familiar with this technique and am not sure exactly where to start:
## create an empty vector
fit.samps <- rep(NA, 100)
## determine fit to subsamples from original distribution
for(i in 1:100){
fit.samps[i] <- find.bestfit(sample(x, 10))
}
I suspect that my approach is wrong, because the sample size is arbitrary, and ultimately, the best fit distribution based on the fitdistr
function should be selected most of the time.
I would appreciate some pointers on how I might apply the bootstrap approach to answer this question.