Python statsmodels has an implementation of Lilliefors' test for goodness of fit (i.e. for the case where the parameters of the distribution were obtained by fitting the data rather than pre-determined, as in the original Kolmogorov-Smirnov test). The implementation is based on tabulated p-value distributions from 10 million simulations for different sample sizes and covers normal and exponential distributions. In Wilks' book "Statistical Methods in the Atmospheric Sciences", some tabulated values are provided for gamma distributions with different shape parameters.
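For context, this is how I currently use the existing interface (a minimal sketch; as far as I understand the API, pvalmethod='table' selects the tabulated p-value distributions mentioned above):

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

# Draw a sample whose parameters are *not* known in advance;
# lilliefors() fits them internally and looks up the p-value
# in the tabulated simulation results.
rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=200)

ksstat, pvalue = lilliefors(x, dist='norm', pvalmethod='table')
```

Only 'norm' and 'exp' are accepted for dist, which is exactly why I want to extend the tables to gamma distributions.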
I would like to re-create and extend these tables using the lilliefors_critical_value_simulation.py script provided in the statsmodels package. In principle this is rather straightforward, but one step in the code is unclear to me, namely the z-normalisation of the random samples drawn in each experiment. For the normal and exponential distributions, the code reads:
if sim_type == 'normal':
    std = sample.std(0, ddof=1)
    z = (sample - mu) / std
    cdf_fn = stats.norm.cdf
elif sim_type == 'exponential':
    z = sample / mu
    cdf_fn = stats.expon.cdf
Indeed, this scaling is pretty straightforward. However, for gamma distributions a z-score is not well defined (I also came across the term probability integral transform, but that discussion was likewise limited to normal distributions). In my research I found that people recommend using quantiles instead of the standard deviation. I am now wondering how critical the z-scaling is for the derivation of the p-value statistics, and whether a pseudo z-scale (x - mu) / (75%-ile - 25%-ile) can safely be used in the context of simulating the p-value distribution.
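To make the idea concrete, here is a minimal sketch of the pseudo z-scale I have in mind (pseudo_z is my own name; the interquartile-range denominator is exactly the assumption I am asking about, not anything taken from the statsmodels source):

```python
import numpy as np

def pseudo_z(sample):
    # Hypothetical robust scaling for a gamma sample: center on the
    # sample mean and divide by the interquartile range
    # (75%-ile - 25%-ile) instead of the standard deviation.
    mu = sample.mean()
    q25, q75 = np.percentile(sample, [25, 75])
    return (sample - mu) / (q75 - q25)

rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=1.5, size=100)
z = pseudo_z(sample)  # centered at 0 by construction
```

The question is whether feeding such a z into the simulation yields critical values that are actually valid for the gamma case.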
Further information about the implementation details:
The original source code of lilliefors_critical_value_simulation.py can be found at https://github.com/statsmodels/statsmodels/blob/master/statsmodels/stats/tests/results/lilliefors_critical_value_simulation.py. The relevant part begins in line 57, where the CDF of the z-normalized sample is calculated. Then plus and minus are constructed as [1/N, ..., N/N] and [0, ..., (N-1)/N], respectively, and the differences plus - cdf and cdf - minus are evaluated. The result for this specific sample is then calculated as the maximum absolute difference (concatenating d_plus and d_minus via the somewhat abstruse np.c_ syntax; see https://numpy.org/doc/stable/reference/generated/numpy.c_.html). These results are collected for all sample sizes to be tabulated. Later on (line 83 ff.) the critical values are calculated by evaluating the percentiles of the results.
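The steps described above can be sketched as follows (my own condensation of the linked script for a one-dimensional sample, not the source verbatim):

```python
import numpy as np
from scipy import stats

def lilliefors_d(z, cdf_fn):
    # CDF of the sorted, z-normalized sample
    n = z.shape[0]
    cdf = cdf_fn(np.sort(z))
    # plus = [1/N, ..., N/N], minus = [0, ..., (N-1)/N]
    plus = np.arange(1.0, n + 1.0) / n
    minus = np.arange(0.0, n) / n
    d_plus = plus - cdf
    d_minus = cdf - minus
    # maximum absolute difference over the two concatenated columns
    return np.abs(np.c_[d_plus, d_minus]).max()

rng = np.random.default_rng(1)
sample = rng.standard_normal(50)
z = (sample - sample.mean()) / sample.std(ddof=1)
d = lilliefors_d(z, stats.norm.cdf)
```

Collecting this D statistic over many simulated samples and taking percentiles then yields the critical-value tables.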
(Or has someone already computed extensive tables for gamma distributions and made them available somewhere?)
References:
How can one compute Lilliefors' test for arbitrary distributions?