
I want to check the normality assumption of my series

I am trying to figure out how the p-value of the Anderson-Darling test is computed in the R packages goftest and nortest. According to the R help, one is computed from Table 4.9 in Stephens (1986) and the other with the Marsaglia routine. But I cannot figure out why they are so different, which one is more accurate, and when I should use each of them.

library(goftest)
library(nortest)

x = iris$Sepal.Length

goftest::ad.test(x, "pnorm", mean = mean(x), sd = sd(x))
nortest::ad.test(x)

The result of goftest package:

    Anderson-Darling test of goodness-of-fit
    Null hypothesis: Normal distribution
    with parameters mean = 5.84333333333333, sd = 0.828066127977863

data:  x
An = 0.8892, p-value = 0.4208

and the nortest package:

    Anderson-Darling normality test

data:  x
A = 0.8892, p-value = 0.02251
dangulo
  • Without consulting the manual pages, it strikes me that your use of `goftest` is likely incorrect. I would expect it to base its p-value on a *prespecified* normal distribution. By offering it a normal distribution whose parameters have been estimated from the data, you rig the test strongly in favor of finding a "good fit," which can explain the large p-value. Might I therefore suggest reading the documentation for both tests closely? – whuber Feb 14 '19 at 22:06
  • Thank you for your reply, but according to the documentation `nortest::ad.test()` only has one parameter https://www.rdocumentation.org/packages/nortest/versions/1.0-4/topics/ad.test – dangulo Feb 14 '19 at 22:32
  • Right: that means that internally it performs the parameter estimation and therefore knows how to cope with the fact that the parameters were estimated from the data. My suspicion is that `goftest::ad.test` does not know that, which is why it gets the wrong p-value. That also explains why it expects you to supply the parameters, for otherwise it has no way of knowing precisely which Normal distribution you specified beforehand. – whuber Feb 14 '19 at 22:57
  • Thank you, but I guess the key point is how they obtain the distribution of the statistic: `nortest::ad.test` is using some kind of approximation with the table, whereas `goftest::ad.test` is computing the exact value. – dangulo Feb 14 '19 at 23:12
  • I don't think that's the case: an approximation would not be so grossly bad. – whuber Feb 14 '19 at 23:55
  • Thank you, the thing is that apparently I did the same; as you can see, the statistic value is the same, but in one case the null hypothesis is accepted and in the other it is rejected. I cannot figure out which one I should trust. – dangulo Feb 15 '19 at 06:19

1 Answer


The Anderson-Darling test is a good test--but it has to be correctly applied and, as in most distributional tests, there is a subtle pitfall. Many analysts have fallen for it.


Distributional tests are Cinderella tests. In the fairy tale, the prince's men searched for a mysterious princess by comparing the feet of young ladies to the glass slipper Cinderella had left behind as she fled the ball. Any young lady whose foot fit the slipper would be a "significant" match to the slipper. This worked because the slipper had a specific shape, length, and width and few feet in the realm would likely fit it.

Image from Disney.com, "Cinderella Birthstone Glass Slipper by Arribas Brothers"

Imagine a twist on this fairy tale in which Cinderella's ugly stepsisters got wind of the search. Before the prince's men arrived, the stepsisters visited Ye Olde Paylesse Glasse Slipper Store and chose a slipper off the rack to match the width and length of the eldest sister's foot. Because glass slippers are not malleable and this one was not specifically tailored to the foot, it was not perfect--but at least the shoe fit.

Returning to their home just as the prince's men arrived, the older sister distracted the men while the younger surreptitiously substituted the slipper she had just purchased for the one brought by the men. When the men offered this slipper to the older sister, they were astonished that it fit her warty foot. Nevertheless, because she had passed the test she was promptly brought before the prince.

I hope the prince, like you, was sufficiently sharp-eyed and clever to recognize that the ugly lady brought before him could not possibly be the same beautiful girl he had danced with at the ball. He wondered, though, how the slipper could possibly fit her foot.


That is precisely what this question asks: how can the goftest version of the A-D test make the dataset seem such a good match to the Normal distribution "slipper" while the nortest version indicates it is not? The answer is that the goftest version was offered a distribution constructed to match the data, while the nortest version anticipates that subterfuge and compensates accordingly, as did the prince.


The goftest version of the Anderson-Darling test asks you to provide the glass slipper in the form of "distributional parameters," which are perfect analogs of length and width. This version compares your distribution to the data and reports how well it fits. As applied in the question, though, the distributional parameters were tailored to the data: they were estimated from them. In exactly the same way the ugly stepsister's foot turned out to be a good match to the slipper, so will these data appear to be a good match to a Normal distribution, even when they are (to any sufficiently sharp-eyed prince) obviously non-Normal.
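As a minimal sketch of the distinction (the N(6, 1) null below is an arbitrary, hypothetical choice, not anything specified in the question), compare testing against a distribution fixed in advance with testing against one whose parameters were estimated from the data:

library(goftest)
x <- iris$Sepal.Length

# Prespecified slipper: parameters chosen before looking at the data
# (the values 6 and 1 are hypothetical, for illustration only).
goftest::ad.test(x, "pnorm", mean = 6, sd = 1)

# Tailored slipper: parameters estimated from the data themselves, as in
# the question, which invalidates the reported p-value.
goftest::ad.test(x, "pnorm", mean = mean(x), sd = sd(x))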

The nortest version assumes you are going to cheat in this way and it automatically compensates for that. It reports the same degree of fit--that's why the two A-D statistics of 0.8892 shown in the question are the same--but it assesses its goodness in a manner that compensates for the cheating.
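As a quick check of that claim, one can extract the statistic from each package and confirm the two agree (a small sketch using the same data as the question):

library(goftest)
library(nortest)
x <- iris$Sepal.Length

# Both packages compute the same Anderson-Darling statistic for these data;
# only the p-values they attach to it differ.
goftest::ad.test(x, "pnorm", mean = mean(x), sd = sd(x))$statistic  # An = 0.8892
nortest::ad.test(x)$statistic                                       # A  = 0.8892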


It is possible to adjust the goftest version to produce a correct p-value. One way is through simulation: by repeatedly offering samples from a Normal distribution to the test, we can discover the distribution of its test statistic (or, equivalently, the distribution of the p-values it assigns to that statistic). Only p-values that are extremely low relative to this null distribution should count as evidence that the shoe fits.

Here is a histogram of 100,000 iterations of this simulation.

Figure: histogram of the simulated goftest::ad.test p-values for Normal data, with values below 0.42 flagged as "Extreme."

P-values are supposed to have a uniform distribution (when the null hypothesis is true), in which case the histogram would be almost level. This distribution is decidedly non-uniform.

The vertical line is at $0.42,$ the p-value reported for the original data. The values smaller than it are shown in red as "extreme." Only $0.0223$ of them are this extreme. (The first three decimal places of this number should be correct.) This is the correct p-value for this test. It is in close agreement with the value of $0.02251$ reported by nortest. (In this case, because I performed such a large simulation, the simulated p-value should be considered more accurate than the computed p-value.)


The R code that produced the simulation and the figure is reproduced below, but with the simulation size set to 1,000 rather than 100,000 so that it will execute in less than one second.

library(ggplot2)
x = iris$Sepal.Length
p.value <- goftest::ad.test(x, "pnorm", mean = mean(x), sd = sd(x))$p.value
#
# Simulation study.
#
mu <- mean(x)
sigma <- sd(x)
sim <- replicate(1e3, {
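  # Re-estimate the parameters from each simulated sample, exactly as was
  # done with the original data, so the simulation reproduces the "tailoring."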
  x.sim <- rnorm(length(x), mu, sigma)
  goftest::ad.test(x.sim, "pnorm", mean = mean(x.sim), sd = sd(x.sim))$p.value
})

p.value.corrected <- mean(sim <= p.value)
message("Corrected p-value is ", p.value.corrected, 
        " +/- ", signif(2 * sd(sim <= p.value) / sqrt(length(sim)), 2))
#
# Display the p-value distribution.
#
X <- data.frame(p.value=sim, Status=ifelse(sim <= p.value, "Extreme", "Not extreme"))
ggplot(X, aes(p.value)) + 
  xlab("goftest::ad.test p-value") + ylab("Count") + 
  geom_histogram(aes(fill=Status), binwidth=0.05, boundary=0, alpha=0.65) + 
  geom_vline(xintercept=p.value, color="#404040", size=1) + 
  ggtitle("A-D 'p-values' for Simulated Normal Data",
          "`goftest` calculations")
whuber
  • (+1) Some readers may wonder whether it makes a difference simulating from other distributions consonant with the null hypothesis, rather than that particular one. It doesn't, for EDF-based GoF test statistics in general, provided that the estimated parameters are those of a location-scale family. (See https://stats.stackexchange.com/q/110272/17230 & https://stats.stackexchange.com/q/237779/17230). – Scortchi - Reinstate Monica Feb 15 '19 at 20:31
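A minimal sketch illustrating that point (the helper ad.stat is introduced here purely for illustration), under the assumption that the parameters are re-estimated from each simulated sample: the simulated null distribution of the A-D statistic is essentially the same whichever Normal distribution we draw from.

library(goftest)
set.seed(1)
ad.stat <- function(mu, sigma, n = 150, nsim = 1000) {
  replicate(nsim, {
    y <- rnorm(n, mu, sigma)
    # Parameters are re-estimated from each sample, as in the answer's simulation.
    goftest::ad.test(y, "pnorm", mean = mean(y), sd = sd(y))$statistic
  })
}

# Two very different Normal distributions give essentially identical
# null distributions for the statistic.
quantile(ad.stat(0, 1),    c(0.5, 0.9, 0.95, 0.99))
quantile(ad.stat(100, 25), c(0.5, 0.9, 0.95, 0.99))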