I've read that the KS test is biased when the distribution's parameters are estimated from the same data it is then tested against. This post on Cross Validated led me to the paper, which describes a workaround using synthetic data sets. I've been implementing this for a Weibull distribution in MATLAB (as part of my master's thesis) and would like to verify that what I've done is correct. A snapshot of the procedure described in Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2009):
" our procedure is as follows. First, we fit our empirical data to the power-law model using the methods of Section 3 and calculate the KS statistic for this fit. Next, we generate a large number of power-law distributed synthetic data sets with scaling parameter α and lower boud xmin equal to those of the distribution that best fits the observed data. We fit each synthetic data set individually to its own power-law model and calculate the KS statistic for each one relative to its own model. Then we simply count what fraction of the time the resulting statistic is larger than the value for the empirical data. This fraction is our p-value. Note crucially that for each synthetic data set we compute the KS statistic relative to the best-fit power law for that data set, not relative to the original distribution from which the data set was drawn. In this way we ensure that we are performing for each synthetic data set the same calculation that we performed for the real data set, a crucial requirement if we wish to get an unbiased estimate of the p-value. "
The pseudocode of my implementation is as follows (a MATLAB sketch follows the list):
1. Fit a Weibull distribution to the data in X (X is a 1-by-n vector that contains my data values). For this I use the MATLAB function fitdist().
2. Create m synthetic data sets using the distribution from step 1. For this I use the MATLAB function random().
3. For each of the m synthetic data sets:
   a) fit a Weibull distribution to the synthetic data set, again via fitdist();
   b) perform the KS test, using the kstest() function, and store the p-value.
4. Compute the new p-value by:
   a) counting all instances where the p-value of a synthetic data set is greater than the p-value of the initial data set;
   b) dividing the count from a) by m to get the adjusted (unbiased, or at least less biased?) p-value.
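For concreteness, here is a minimal MATLAB sketch of these steps (the variable names and m = 1000 are just my own choices):

```matlab
% X is my 1-by-n data vector
n = numel(X);
m = 1000;                                 % number of synthetic data sets (arbitrary choice)

% Step 1: fit a Weibull distribution to the observed data
pd0 = fitdist(X(:), 'Weibull');           % fitdist expects a column vector
[~, p0] = kstest(X(:), 'CDF', pd0);       % p-value of the original fit

% Steps 2 and 3: generate m synthetic data sets, refit each one,
% and run the KS test against that data set's own fit
pSynth = zeros(m, 1);
for i = 1:m
    xs = random(pd0, n, 1);               % synthetic sample, same size as X
    pdS = fitdist(xs, 'Weibull');         % step 3a: fit to the synthetic data itself
    [~, pSynth(i)] = kstest(xs, 'CDF', pdS);  % step 3b: KS test vs its own fit
end

% Step 4: fraction of synthetic p-values larger than the original p-value
pAdjusted = sum(pSynth > p0) / m;
```

One thing I notice while writing this out: the quote counts how often the synthetic KS *statistic* exceeds the empirical one (kstest's third output would give that statistic), whereas my step 4 compares p-values, which is part of what I'm unsure about.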
It's mainly part 4 (corresponding to the bold part in the quote) that I'm not sure is correct. Also, I'm not sure whether random() really creates a synthetic data set, since it just "draws random numbers from the specified distribution". My guess is that if you ask random() for a sufficiently large sample, it will automatically have the same features as the specified distribution and thus qualify as a synthetic data set?
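As a quick sanity check of that guess, one could compare a large random() sample against the fitted distribution (pd0 from the sketch above); the moment methods mean() and std() also work on the distribution object:

```matlab
% Draw a large synthetic sample and compare its moments with the fitted distribution
xs = random(pd0, 1e5, 1);
fprintf('sample mean %.3f vs. distribution mean %.3f\n', mean(xs), mean(pd0));
fprintf('sample std  %.3f vs. distribution std  %.3f\n', std(xs), std(pd0));
```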
Any help would be greatly appreciated. Small disclaimer: I'm an industrial engineering major, so less technical explanations are preferred. ;)
Finally, there is an implementation of this procedure in R (a language I'm not familiar with) for the power-law distribution; see this link. Maybe it helps in answering my question, or serves as a future reference for someone else.