I have a dataset with a large sample size (around 80,000). I would like to test whether the data follow a certain distribution. I can fit a distribution, such as a log-normal or gamma, to the entire dataset in R, for example with the fitdist function from the fitdistrplus package, and I can look at diagnostic plots to evaluate whether the fit is good. However, with this much data I cannot usefully apply a goodness-of-fit test such as the Kolmogorov-Smirnov or Anderson-Darling test, because the large sample size makes these tests so sensitive that even slight deviations in my sample lead to rejection of the null hypothesis at p = 0.05.
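For illustration only (this is not part of my analysis code below, and the data here are simulated just for this sketch), here is roughly what I mean by the tests being too sensitive at this sample size:
# Sketch: a one-sample Kolmogorov-Smirnov test on 80,000 observations
# rejects even when the data deviate only slightly from the null distribution
set.seed(123)
n <- 80000
z <- rgamma(n, shape = 2, rate = 1) + runif(n, min = 0, max = 0.2)
# Test against a plain gamma(shape = 2, rate = 1); the small uniform shift
# is typically enough to produce a tiny p-value at this sample size
ks.test(z, "pgamma", shape = 2, rate = 1)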
As a result, I am thinking of bootstrapping my dataset: drawing many sub-samples, conducting the goodness-of-fit test on each sub-sample, and then evaluating the proportion of sub-samples with a p-value smaller than 0.05. If the p-value is not smaller than 0.05 most of the time, I will conclude that my data follow the fitted distribution.
Below is sample code in R:
# Load the packages for distribution fitting and goodness-of-fit testing
library(fitdistrplus)
library(goftest)
# Set seed and generate simulated data
set.seed(1)
s <- rgamma(80000, shape = 2, rate = 1)
# Add some random noise to the data
y <- runif(80000, min = 0, max = 0.2)
x <- s + y
# Fit a distribution to x
fit_x <- fitdist(x, distr = "gamma")
# Diagnostic plots for the fitted distribution
plot(fit_x)
# Apply the Anderson-Darling test to see whether the distribution of x matches the fitted theoretical distribution
ad.test(x, null = "pgamma", shape = fit_x$estimate[["shape"]], rate = fit_x$estimate[["rate"]])
# Anderson-Darling test of goodness-of-fit
# Null hypothesis: Gamma distribution
# with parameters shape = 2.29115085990351, rate = 1.09151800140921
# Parameters assumed to be fixed
#
# data: x
# An = 14.253, p-value = 7.5e-09
# The p-value is very small, so the test rejects the fitted gamma distribution
### Bootstrap the data and apply the Anderson-Darling test to each sub-sample
result <- numeric() # A vector to store the p-values
B <- 10000 # Number of bootstrap sub-samples
for (i in 1:B){
temp <- sample(x, size = 500, replace = TRUE)
temp_p <- ad.test(temp, null = "pgamma", shape = fit_x$estimate[["shape"]],
rate = fit_x$estimate[["rate"]])
result[[i]] <- temp_p[["p.value"]]
}
# The percentage of sub-samples with a p-value smaller than 0.05
sum(result < 0.05)/length(result) * 100
# [1] 5.84
Given that the p-value is smaller than 0.05 only 5.84% of the time, I would like to conclude that my original dataset likely follows a gamma distribution.
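As a quick additional check (a minimal sketch reusing the result vector from the loop above), I can also look at the whole distribution of the sub-sample p-values rather than only the 5% cut-off:
# Sketch: inspect the sub-sample p-values; if the fitted gamma is adequate,
# they should not pile up near zero
hist(result, breaks = 20, main = "Sub-sample Anderson-Darling p-values",
     xlab = "p-value")
summary(result)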
Please let me know whether the proposed steps make sense or whether there are any concerns.
Here is a related post on Cross-Validated (How to bootstrap the best fit distribution to a sample?).
Edit
I realized that I did not conduct the Anderson-Darling test correctly; please see my answer below (https://stats.stackexchange.com/a/466589/152507). In this example, I should have set estimated = TRUE, because I am testing against distribution parameters that were estimated from my original data.
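For reference, a minimal sketch of the corrected call (reusing x and fit_x from the code above): the estimated = TRUE argument of goftest::ad.test tells the test that the gamma parameters were estimated from the same data rather than fixed in advance.
# Corrected Anderson-Darling call: account for the gamma parameters having
# been estimated from the same data (estimated = TRUE)
ad.test(x, null = "pgamma",
        shape = fit_x$estimate[["shape"]],
        rate = fit_x$estimate[["rate"]],
        estimated = TRUE)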