I read the following procedure for performing a KS test with estimated parameters:
Testing whether data follows T-Distribution
At first I couldn't make heads or tails of it, partly because I mistook the language for R, while I guess it's actually MATLAB (thanks to @Glen_b for noticing my error). Now I understand more of it, but there's still a step that baffles me. According to the definition of the p-value (the probability of getting a statistic at least as extreme as the one observed in the sample, under the null hypothesis), I thought this would work:
- choose a family of distributions $\{f(x|\theta_1,\ldots,\theta_n)\}$ indexed by parameters $\theta_1,\ldots,\theta_n$
- from my original sample $S$ of size $N$ ($N$ and $n$ are not related, apart from the obvious condition $N>n$), estimate the parameters $\hat{\theta_1},\ldots,\hat{\theta_n}$ using the method of moments, maximum likelihood, etc. The corresponding distribution from the family, $\hat{f}(x)=f(x|\hat{\theta_1},\ldots,\hat{\theta_n})$, is my null distribution.
- generate $M$ random samples of size $N$ from $\hat{f}$. For each random sample $i$, compute the KS distance $K_i$ between that sample and $\hat{f}$
- sort the $K_i$ and tabulate the empirical CDF at various $x$ as $G(x)=\frac{M_x}{M}$, where $M_x$ is the number of samples with $K_i<x$
- from the tabulated $G(x)$, interpolate or use splines to compute the probability of a KS distance as large as or larger than the one between the original sample $S$ and $\hat{f}$, i.e. the p-value $1-G(K_{\mathrm{obs}})$, where $K_{\mathrm{obs}}$ is that observed distance (a sketch of this in code follows the list)
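In code, the procedure I have in mind would look roughly like this (a minimal sketch, not tested; I'm assuming `S` holds my original sample and `null_pd` the fitted null distribution $\hat{f}$, to match the variable names in the code quoted below, and I replace the interpolation of the last step with the plain empirical tail fraction):

% sketch of my proposed procedure: the null distribution stays fixed throughout
M = 999;
N = numel(S);
[~,~,K_obs] = kstest(S,'CDF',null_pd);          % KS distance between S and f-hat
K = zeros(M,1);
for i = 1:M
    bsample = random(null_pd,N,1);              % sample of size N from the *fixed* f-hat
    [~,~,K(i)] = kstest(bsample,'CDF',null_pd); % KS distance of bootstrap sample from the same f-hat
end
pvalue = mean(K >= K_obs);                      % fraction of bootstrap distances at least as large

Note that at no point do I re-estimate the parameters: every bootstrap sample is compared to the same $\hat{f}$ that was fitted to $S$.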
However, it seems to me that the MATLAB code in the above link does something different, in the bootstrap bit:
% get KS-test critical values by parametric bootstrapping from estimated parameters
m=999;                 % number of bootstrap samples
r=random(null_pd,n,m); % n-by-m matrix: m samples of size n from the fitted null distribution
stats = zeros(m,1); % store test statistics
est_pd = makedist('tlocationscale');
opts = statset(statset('tlsfit'),'MaxIter',1000);
opts = statset(opts,'MaxFun',2000);
for i=1:m
    bsample = r(:,i);
    % re-fit the t location-scale parameters to this bootstrap sample and
    % compute its KS statistic against the re-fitted distribution
    [~,~,stats(i)] = kstest(bsample,'CDF',est_pd.fit(bsample,'options',opts));
end
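From here I assume the linked answer turns these bootstrap statistics into a critical value or p-value roughly along these lines (this part is my reading, not shown in the quoted excerpt; `S` and `null_pd` as above):

[~,~,K_obs] = kstest(S,'CDF',null_pd); % observed KS distance of the original sample from the fitted null
cv = quantile(stats,0.95);             % bootstrap critical value at the 5% level
pvalue = mean(stats >= K_obs);         % bootstrap p-value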
I am not that familiar with MATLAB for statistical analysis, but my understanding is that, for each sample drawn from the null distribution, the code is re-estimating the parameters of the distribution... why is that? Given the definition of the p-value, what's wrong with my approach?