
I was looking at this page on the Lilliefors test's Monte Carlo implementation. I don't understand this sentence:

There is random error in this calculation from the simulation. However, because of the trick of adding 1 to the numerator and denominator in calculating the P-value it can be used straight without regard for the randomness.

What do they mean by the trick of adding 1 to the numerator and denominator?

The relevant piece of code is here:

n <- length(x)                  # x: the observed data (defined earlier on the page)
nsim <- 4999
d.star <- double(nsim)
for (i in 1:nsim) {
    x.star <- rnorm(n)          # simulate a dataset under the null
    d.star[i] <- fred(x.star)   # fred: the test statistic (defined earlier on the page)
}
hist(d.star)
abline(v = d.hat, lty = 2)      # d.hat: the observed statistic, fred(x)
## simulation-derived P-value
pval <- (sum(d.star > d.hat) + 1) / (nsim + 1)
Aksakal
  • Can you add the relevant context here? – gung - Reinstate Monica Jan 03 '15 at 20:13
  • 4
    Looks like [Laplace smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) for the Monte Carlo estimator of the probabilities, which shrinks it towards 1/2; main effect is probably to avoid ever getting a p-value of 0, as @Tim noted (though there's no risk of dividing by 0 as he said unless you're doing 0 simulations). I don't really see why this allows you to use it "without regard for the randomness", though. – Danica Jan 03 '15 at 20:44
  • 2
    Have you written Geyer directly to ask what the sentence means? – Alexis Jan 03 '15 at 21:00
  • @Alexis, nope, but it's a good idea. – Aksakal Jan 03 '15 at 21:04
  • @Dougal, yes, it does look like Laplace smoothing. It's not clear why he's applying it here. – Aksakal Jan 03 '15 at 23:13

2 Answers


The explanation on the referenced page is

Under the null hypothesis the probability $\Pr(P \le k / n_\text{sim})$ is exactly $k / n_\text{sim}$ when both the randomness in the data and the randomness in the simulation are taken into account.

To understand this, we must look at the code, of which the key lines (considerably abbreviated) are

fred <- function(x) {ks.test(...)$statistic}  # Apply a statistical test to an array
d.hat <- fred(x)                              # Apply the test to the data
d.star <- apply(matrix(rnorm(n*nsim), n, nsim),
                2, fred)                      # Apply the test to nsim simulated datasets
pval <- (sum(d.star > d.hat) + 1) / (nsim + 1)  # Estimate a simulation p-value

The salient problem is that the code does not match the quotation. How can we reconcile them? One attempt begins with the last half of the quotation. We might interpret the procedure as comprising the following steps:

  1. Collect independently and identically distributed data $X_1, X_2, \ldots, X_n$ according to some probability law $G$. Apply a test procedure $t$ (implemented in the code as fred) to produce the number $T_0=t(X_1, \ldots, X_n)$.

  2. Generate via computer $N=n_\text{sim}$ comparable datasets, each of size $n$, according to a null hypothesis with probability law $F$. Apply $t$ to each such dataset to produce $N$ numbers $T_1,T_2,\ldots,T_N$.

  3. Compute $$P = \left(\sum_{i=1}^N I(T_i \gt T_0) + 1\right) / (N+1).$$

    ("$I$" is the indicator function implemented by the vector-valued comparison d.star > d.hat in the code.) The right hand side is understood to be random by virtue of the simultaneous randomness of $T_0$ (the actual test statistic) and the randomness of the $T_i$ (the simulated test statistics).

To say that the data conform to the null hypothesis is to assert that $F=G$. Pick a test size $\alpha$, $0 \lt \alpha \lt 1$. Multiplying both sides by $N+1$ and subtracting $1$ shows that the chance that $P\le \alpha$ for any number $\alpha$ is the chance that no more than $(N+1)\alpha - 1$ of the $T_i$ exceed $T_0$. This says merely that $T_0$ lies within the top $(N+1)\alpha$ of the sorted set of all $N+1$ test statistics.

Since (by construction) $T_0$ is independent of all the $T_i$, when $F$ is a continuous distribution this chance will be the fraction of the total represented by the integer part $\lfloor (N+1)\alpha\rfloor$; that is, $$\Pr(P\le \alpha)=\frac{\lfloor(N+1)\alpha\rfloor}{N+1} \approx \alpha,$$ and it will be exactly equal to $\alpha$ provided $(N+1)\alpha$ is a whole number $k$; that is, when $\alpha = k/(N+1)$.

This certainly is one of the things we want to be true of any quantity that deserves to be called a "p-value": it should have a uniform distribution on $[0,1]$. Provided $N+1$ is fairly large, so that any $\alpha$ is close to some fraction of the form $k/(N+1) = k/(n_\text{sim}+1)$, this $P$ will have close to a uniform distribution. (To learn about additional conditions required of a p-value, please read the dialog I posted on the subject of p-values.)
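
This near-uniformity is straightforward to check empirically by repeating the whole procedure many times. In the sketch below the statistic (the absolute sample mean) and the small $N$ are chosen purely so the double loop runs quickly; neither comes from the referenced page.

## Check that Pr(P <= k/(N+1)) = k/(N+1) under the null
set.seed(1)
n <- 20; N <- 99; reps <- 2000
t.stat <- function(x) abs(mean(x))        # cheap stand-in statistic
pvals <- replicate(reps, {
    T0 <- t.stat(rnorm(n))                # "observed" data, null true
    T.sim <- replicate(N, t.stat(rnorm(n)))
    (sum(T.sim > T0) + 1) / (N + 1)
})
mean(pvals <= 5 / (N + 1))                # alpha = k/(N+1) with k = 5; expect about 0.05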

Evidently the quotation should use "$n_\text{sim}+1$" instead of "$n_\text{sim}$" wherever it appears.

whuber

I believe that here, 1 is added to both the numerator and the denominator because the observed statistic is included in the reference distribution; if this is the case, it's because of the "at least as large" part of the definition of a p-value.

I don't know for sure because the text seems to be saying something different, but that would be why I'd do it.
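
To see the arithmetic, re-using the names from the question's code: pooling the observed statistic into the reference set and counting the values at least as large as it gives exactly the question's formula, since a continuous statistic produces no ties.

## Pool d.hat into the reference distribution and count values at least as large;
## with no ties this equals (sum(d.star > d.hat) + 1) / (nsim + 1)
pooled <- c(d.star, d.hat)
pval <- sum(pooled >= d.hat) / length(pooled)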

Glen_b
  • (+1) The "at least as large" part is a property of a *one-sided* Lilliefors test. It cannot refer to the general hypothesis-testing setup, where "at least as extreme" has to be interpreted in the sense of likelihood ratios rather than test statistics. – whuber Jan 04 '15 at 05:33
  • 1
    @whuber I don't see how I can agree. Not all tests are likelihood ratio tests; when they're not LRTs, what relevance can interpreting it in terms of likelihood ratios have? – Glen_b Jan 04 '15 at 07:10
  • The principal relevance has to do with admissibility. The example I work through at http://stats.stackexchange.com/a/130772 was specifically chosen to illustrate how, at least in some cases, "at least as large" (when interpreted in terms of the magnitude of the test statistic) produces just about the worst possible test. – whuber Jan 04 '15 at 16:17
  • 1
    @whuber It certainly can do. But consider, for example, a Wilcoxon-Mann-Whitney (or indeed, permutation tests more widely). There are any number of perfectly reasonable tests in wide use that are neither a Lilliefors test nor a likelihood ratio test. When there's a clear alternative against which power is desired, it's often possible to construct a meaningful test statistic where the ordering on the sample space given by the test statistic makes perfect sense and has reasonable properties in a wide range of alternatives. – Glen_b Jan 04 '15 at 16:21
  • Thank you for expanding on the idea; it is very helpful. But I think my main point stands, because even with those examples you are not ultimately appealing to the magnitude of any test statistic as being inherently dispositive: you need to justify the use of "at least as large" with some other principle, whether it be expressed in terms of likelihoods, risk functions, or Bayesian posteriors. Thus the meaning of "at least as large" is far subtler than most readers would infer--and it is not part of the *definition* of a p-value, but only an incidental consequence in particular cases. – whuber Jan 04 '15 at 16:31
  • 1
    Certainly when coming up with a test statistic which corresponds to (in the sense of taking more extreme values, whether larger, smaller or both) the kind of alternative one is interested in, one is appealing to "the kind of alternative one is interested in" -- but even if one were to be using an inadmissible (indeed, even a useless) test, the principle I outline in my answer of including the observed sample in the simulated results would still apply. Once you have an ordering, even if it's not the best one, when calculating p-values, the observed case still would belong in the count. – Glen_b Jan 04 '15 at 16:48
  • 2
    @whuber we may not be so far apart now. In choosing a reasonable test statistic we would certainly want to appeal to *something*. But once we have a test statistic (as we must have by the time we're simulating under the null), we've already done that. And once we have, the reason why we would include the observed case in our calculation of the p-value is because of what a p-value is. – Glen_b Jan 04 '15 at 16:55
  • 1
    I don't think we have any differences at all. (Note that my own answer makes it clear that including the observed sample in the count is appropriate.) My comment was not directed at your answer to the question (with which I agree and upvoted), but only at the problematic phrase "at least as large." I see that phrase misinterpreted in so many places on this site (and elsewhere) that I wanted to draw readers' attention to what it must *really* mean. – whuber Jan 04 '15 at 17:17