I have a dataset with a large sample size (around 80,000). I would like to test whether the data follow a certain distribution. I can fit a distribution, such as a log-normal or gamma, to the entire dataset in R, for example with the fitdist function from the fitdistrplus package, and I can look at diagnostic plots to evaluate whether the fit is good. However, with this much data I cannot usefully apply a goodness-of-fit test such as the Kolmogorov-Smirnov or Anderson-Darling test, because the large sample size makes these tests so sensitive that even slight deviations in my sample lead to rejection of the null hypothesis at p = 0.05.
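For illustration only (this is not part of my analysis code below, and the data here are simulated just for this sketch), here is roughly what I mean by the tests being too sensitive at this sample size:
# Sketch: a one-sample Kolmogorov-Smirnov test on 80,000 observations
# rejects even when the data deviate only slightly from the null distribution
set.seed(123)
n <- 80000
z <- rgamma(n, shape = 2, rate = 1) + runif(n, min = 0, max = 0.2)
# Test against a plain gamma(shape = 2, rate = 1); the small uniform shift
# is typically enough to produce a tiny p-value at this sample size
ks.test(z, "pgamma", shape = 2, rate = 1)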
As a result, I am thinking of bootstrapping my dataset: drawing many sub-samples, conducting the goodness-of-fit test on each sub-sample, and then evaluating the proportion of sub-samples with a p-value smaller than 0.05. If the p-value is not smaller than 0.05 most of the time, I will conclude that my data follow the fitted distribution.
Below is sample code in R:
# Load the packages for distribution fitting and goodness-of-fit testing
library(fitdistrplus)
library(goftest)
# Set seed and generate simulated data
set.seed(1)
s <- rgamma(80000, shape = 2, rate = 1)
# Add some random noise to the data
y <- runif(80000, min = 0, max = 0.2)
x <- s + y
# Fit a distribution to x
fit_x <- fitdist(x, distr = "gamma")
# Diagnostic plots for the fitted distribution
plot(fit_x)
# Apply the Anderson-Darling test to see whether the distribution of x matches the fitted theoretical distribution
ad.test(x, null = "pgamma", shape = fit_x$estimate[["shape"]], rate = fit_x$estimate[["rate"]])
# Anderson-Darling test of goodness-of-fit
# Null hypothesis: Gamma distribution
# with parameters shape = 2.29115085990351, rate = 1.09151800140921
# Parameters assumed to be fixed
#
# data: x
# An = 14.253, p-value = 7.5e-09
# The p-value is very small, so the test rejects the fitted gamma distribution
### Bootstrap the data and apply the Anderson-Darling test to each sub-sample
result <- numeric() # A vector to store the p-values
B <- 10000 # Number of bootstrap sub-samples
for (i in 1:B){
temp <- sample(x, size = 500, replace = TRUE)
temp_p <- ad.test(temp, null = "pgamma", shape = fit_x$estimate[["shape"]],
rate = fit_x$estimate[["rate"]])
result[[i]] <- temp_p[["p.value"]]
}
# The percentage of sub-samples with a p-value smaller than 0.05
sum(result < 0.05)/length(result) * 100
# [1] 5.84
Given that the p-value is smaller than 0.05 only 5.84% of the time, I would like to conclude that my original dataset likely follows a gamma distribution.
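As a quick additional check (a minimal sketch reusing the result vector from the loop above), I can also look at the whole distribution of the sub-sample p-values rather than only the 5% cut-off:
# Sketch: inspect the sub-sample p-values; if the fitted gamma is adequate,
# they should not pile up near zero
hist(result, breaks = 20, main = "Sub-sample Anderson-Darling p-values",
     xlab = "p-value")
summary(result)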
Please let me know whether the proposed steps make sense or whether there are any concerns.
Here is a related post on Cross-Validated (How to bootstrap the best fit distribution to a sample?).
Edit
I realized that I did not conduct the Anderson-Darling test correctly; please see my answer below (https://stats.stackexchange.com/a/466589/152507). In this example, I should have set estimated = TRUE, because I am testing against distribution parameters that were estimated from my original data.
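For reference, a minimal sketch of the corrected call (reusing x and fit_x from the code above): the estimated = TRUE argument of goftest::ad.test tells the test that the gamma parameters were estimated from the same data rather than fixed in advance.
# Corrected Anderson-Darling call: account for the gamma parameters having
# been estimated from the same data (estimated = TRUE)
ad.test(x, null = "pgamma",
        shape = fit_x$estimate[["shape"]],
        rate = fit_x$estimate[["rate"]],
        estimated = TRUE)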