I've read that the KS test is biased when the distribution's parameters are estimated from the same data it is then tested against. This post on Cross Validated led me to the paper, which describes a workaround using synthetic data sets. I've been implementing this for a Weibull distribution in MATLAB (as part of my master's thesis) and would like to verify that what I've done is correct. A snapshot of the procedure described in Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2009):
" our procedure is as follows. First, we fit our empirical data to the power-law model using the methods of Section 3 and calculate the KS statistic for this fit. Next, we generate a large number of power-law distributed synthetic data sets with scaling parameter α and lower boud xmin equal to those of the distribution that best fits the observed data. We fit each synthetic data set individually to its own power-law model and calculate the KS statistic for each one relative to its own model. Then we simply count what fraction of the time the resulting statistic is larger than the value for the empirical data. This fraction is our p-value. Note crucially that for each synthetic data set we compute the KS statistic relative to the best-fit power law for that data set, not relative to the original distribution from which the data set was drawn. In this way we ensure that we are performing for each synthetic data set the same calculation that we performed for the real data set, a crucial requirement if we wish to get an unbiased estimate of the p-value. "
The pseudocode of my implementation is as follows (a MATLAB sketch follows the list):
1. Fit a Weibull distribution to the data in X (X is a 1-by-n vector that contains my data values). For this I use the MATLAB function fitdist().
2. Create m synthetic data sets using the distribution from step 1. For this I use the MATLAB function random().
3. For each of the m synthetic data sets:
   a) fit a Weibull distribution to the synthetic data set, again via fitdist();
   b) perform the KS test, using the kstest() function, and store the p-value.
4. Compute the new p-value by:
   a) counting all instances where the p-value of a synthetic data set is greater than the p-value of the initial data set;
   b) dividing the count from a) by m to get the adjusted (unbiased, or at least less biased?) p-value.
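For concreteness, here is a minimal MATLAB sketch of these steps (the variable names and m = 1000 are just my own choices):

```matlab
% X is my 1-by-n data vector
n = numel(X);
m = 1000;                                 % number of synthetic data sets (arbitrary choice)

% Step 1: fit a Weibull distribution to the observed data
pd0 = fitdist(X(:), 'Weibull');           % fitdist expects a column vector
[~, p0] = kstest(X(:), 'CDF', pd0);       % p-value of the original fit

% Steps 2 and 3: generate m synthetic data sets, refit each one,
% and run the KS test against that data set's own fit
pSynth = zeros(m, 1);
for i = 1:m
    xs = random(pd0, n, 1);               % synthetic sample, same size as X
    pdS = fitdist(xs, 'Weibull');         % step 3a: fit to the synthetic data itself
    [~, pSynth(i)] = kstest(xs, 'CDF', pdS);  % step 3b: KS test vs its own fit
end

% Step 4: fraction of synthetic p-values larger than the original p-value
pAdjusted = sum(pSynth > p0) / m;
```

One thing I notice while writing this out: the quote counts how often the synthetic KS *statistic* exceeds the empirical one (kstest's third output would give that statistic), whereas my step 4 compares p-values, which is part of what I'm unsure about.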
It's mainly part 4 (corresponding to the bold part in the quote) that I'm not sure is correct. Also, I'm not sure whether random() really creates a synthetic data set, since it just "draws random numbers from the specified distribution". My guess is that if you ask random() for a sufficiently large sample, it will automatically have the same features as the specified distribution and thus qualify as a synthetic data set?
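As a quick sanity check of that guess, one could compare a large random() sample against the fitted distribution (pd0 from the sketch above); the moment methods mean() and std() also work on the distribution object:

```matlab
% Draw a large synthetic sample and compare its moments with the fitted distribution
xs = random(pd0, 1e5, 1);
fprintf('sample mean %.3f vs. distribution mean %.3f\n', mean(xs), mean(pd0));
fprintf('sample std  %.3f vs. distribution std  %.3f\n', std(xs), std(pd0));
```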
Any help would be greatly appreciated. Small disclaimer: I'm an industrial engineering major, so less technical explanations are preferred. ;)
Finally, there is an implementation of this procedure in R (a language I'm not familiar with) for the power-law distribution; see this link. Maybe it helps in answering my question, or serves as a future reference for someone else.