How to bootstrap the best fit distribution to a sample?

Question

If I have a sample:

set.seed(0)
x <- rlnorm(500)

Then I can use the fitdistr function to find the best fit between two candidate distributions, e.g.

library(MASS)
find.bestfit <- function(x){
   logN <- fitdistr(x, "lognormal")
   gam  <- fitdistr(x, "gamma")
   ans <- ifelse(AIC(logN) < AIC(gam), "logN", "gam")
   return(ans)
}

find.bestfit(x)
[1] "logN"

However, there is some probability that I will not recover the "true" distribution that was sampled (in this case "lognormal" was used to simulate x). How can I calculate this probability?

I have only gotten so far as to consider using a bootstrap approach, but I am not familiar with this technique and am not sure exactly where to start:

## create an empty vector
fit.samps <- rep(NA, 100)
## determine fit to subsamples from original distribution
for(i in 1:100){
  fit.samps[i] <- find.bestfit(sample(x, 10))
}

I suspect that my approach is wrong, because the sample size is arbitrary, and ultimately, the best fit distribution based on the fitdistr function should be selected most of the time.

I would appreciate some pointers on how I might apply the bootstrap approach to answer this question.

score 7 · Accepted Answer · answered Sep 04 '12 at 15:02

Since you know that $X$ is either lognormal or gamma, you can use a parametric bootstrap instead of the nonparametric version that you proposed. You would then resample from the fitted distribution instead, and compute the probability that find.bestfit gives the right answer.

This probability will depend on whether $X$ is lognormal or gamma, so you have to make two separate computations.

Here is a way to do this in R:

library(MASS)  # provides fitdistr()
# find.bestfit() from the question must already be defined

n <- 500 # Sample size
B <- 100 # Number of bootstrap samples

set.seed(0)
x <- rlnorm(n)

## Create an empty vector
fit.samps <- rep(NA, B)

####

# LOGNORMAL DISTRIBUTION

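# Lognormal parameters (named vector: meanlog, sdlog):
lnpar <- fitdistr(x, "lognormal")$estimate

# Determine fit to parametric bootstrap samples from the fitted lognormal.
# Pass meanlog and sdlog by name so that both fitted parameters are used.
for(i in 1:B){
  fit.samps[i] <- find.bestfit(rlnorm(n, meanlog = lnpar["meanlog"], sdlog = lnpar["sdlog"]))
}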

# Probability of correct classification if lognormal:
sum(fit.samps=="logN")/B

####

# GAMMA DISTRIBUTION

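# Gamma parameters (named vector: shape, rate):
gammapar <- fitdistr(x, "gamma")$estimate

# Determine fit to parametric bootstrap samples from the fitted gamma.
# Pass shape and rate by name so that both fitted parameters are used.
for(i in 1:B){
  fit.samps[i] <- find.bestfit(rgamma(n, shape = gammapar["shape"], rate = gammapar["rate"]))
}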

# Probability of correct classification if gamma:
sum(fit.samps=="gam")/B

For $n=500$ these probabilities are both virtually 1. For $n\approx 50$ (or less), though, you get different probabilities.
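
To see how the classification probability depends on the sample size, the same parametric bootstrap can be repeated for several values of $n$. The following is only a sketch under some assumptions: `p.correct` is a hypothetical helper (not part of MASS), the grid of sample sizes is arbitrary, and `try()` is used because the gamma fit may occasionally fail to converge on small samples.

```r
library(MASS)

# Selector from the question: pick the candidate with the lower AIC.
find.bestfit <- function(x) {
  logN <- fitdistr(x, "lognormal")
  gam  <- fitdistr(x, "gamma")
  if (AIC(logN) < AIC(gam)) "logN" else "gam"
}

# Hypothetical helper: share of B simulated samples of size n for which
# the selected model equals `truth`. Failed fits are recorded as NA and dropped.
p.correct <- function(n, B, rsim, truth) {
  picks <- replicate(B, {
    res <- try(find.bestfit(rsim(n)), silent = TRUE)
    if (inherits(res, "try-error")) NA_character_ else res
  })
  mean(picks == truth, na.rm = TRUE)
}

set.seed(0)
x <- rlnorm(500)
lnpar    <- fitdistr(x, "lognormal")$estimate
gammapar <- fitdistr(x, "gamma")$estimate

# Simulators drawing from the two fitted distributions
sim.logN <- function(m) rlnorm(m, meanlog = lnpar[["meanlog"]], sdlog = lnpar[["sdlog"]])
sim.gam  <- function(m) rgamma(m, shape = gammapar[["shape"]], rate = gammapar[["rate"]])

sizes  <- c(25, 50, 100, 500)  # arbitrary grid of sample sizes
p.logN <- sapply(sizes, function(n) p.correct(n, 100, sim.logN, "logN"))
p.gam  <- sapply(sizes, function(n) p.correct(n, 100, sim.gam, "gam"))

data.frame(n = sizes, p.logN = p.logN, p.gam = p.gam)
```

With only 100 replicates per cell the estimates are fairly noisy, so a larger B is advisable if the probabilities are to be reported.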

Michael R. Chernick · Answer 2 · 2012-09-03T19:25:43.257

The bootstrap can be used for this, although it is not commonly done. The approach would be to sample with replacement $n$ times from your sample of size $n$. Each time you sample with replacement, you compute the goodness-of-fit statistics for the competing distributions and pick the distribution that fits best. Take the number of times distribution A is selected, divided by the total number of bootstrap samples, to get an estimate of the probability that distribution A will be selected.
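
The procedure described above might look roughly like this in R. This is a sketch, not a definitive implementation: it reuses the AIC-based `find.bestfit` from the question as the goodness-of-fit criterion, the value of `B` is an assumption, and `try()` is added only to guard against occasional optimiser failures in the gamma fit.

```r
library(MASS)

# Selector from the question: pick the candidate with the lower AIC.
find.bestfit <- function(x) {
  logN <- fitdistr(x, "lognormal")
  gam  <- fitdistr(x, "gamma")
  if (AIC(logN) < AIC(gam)) "logN" else "gam"
}

set.seed(0)
x <- rlnorm(500)

B <- 100        # number of bootstrap resamples (assumed value)
n <- length(x)  # each resample has the same size as the original sample

# Resample n observations with replacement and record the selected model.
fit.samps <- replicate(B, {
  res <- try(find.bestfit(sample(x, n, replace = TRUE)), silent = TRUE)
  if (inherits(res, "try-error")) NA_character_ else res
})

# Proportion of resamples in which each candidate is selected
prop.table(table(fit.samps))
```

Unlike the loop in the question, each resample here has size $n$ and is drawn with replacement, so the sample size is no longer an arbitrary choice.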
