6

I have a number of samples that I would like to test for normality. One of the samples exceeds 5,000 data points, the limit up to which R's shapiro.test() accepts samples. This is the data:

# three lognormal samples of different sizes
c1 <- exp(rnorm(505))
c2 <- exp(rnorm(550))
c3 <- exp(rnorm(5500))

cluster.data <- c(c1, c2, c3)
cluster.factors <- c(rep("Cluster_1", length(c1)), 
                     rep("Cluster_2", length(c2)),
                     rep("Cluster_3", length(c3)))

# set up data for test:
cluster.df <- data.frame(cluster.data, cluster.factors)

To circumvent the 5,000 restriction, would it be statistically acceptable if I run the test on smallish subsamples of the data only? Here, for example, I draw a subsample of size 500 for all three variables:

tapply(cluster.df[,1], cluster.df[,2], function(x) shapiro.test(sample(x, 500)))

And the test returns significant results for all three:

$Cluster_1

    Shapiro-Wilk normality test

data:  sample(x, 500)
W = 0.59561, p-value < 2.2e-16


$Cluster_2

    Shapiro-Wilk normality test

data:  sample(x, 500)
W = 0.57891, p-value < 2.2e-16


$Cluster_3

    Shapiro-Wilk normality test

data:  sample(x, 500)
W = 0.67686, p-value < 2.2e-16

4 Answers

14

I have comments on five levels.

  1. On this evidence this is a limitation of a particular R function, shapiro.test(), and need not imply that there aren't other ways to do it in R, on which I can't advise specifically. It may or may not be of practical relevance to you that such a limit does not apply in all software. For example, the Stata command swilk isn't limited in quite this way, but the manuals and the command output warn that P-value calculation can't be trusted much for sample sizes above about 5000. (EDIT: this paragraph edited 26 January 2021 in the light of @Ben Bolker's comment and separate answer.)

  2. I can't comment on why that particular function won't perform, but the larger question is why you are doing this kind of testing at all. A good reason not to care is generic: for sample sizes of that order, or even larger, such tests are arguably fairly useless, as even minute deviations from normality will qualify as significant at conventional levels (a short simulation after this list illustrates this, together with the quantile plot mentioned in point 4). More specifically: why is it important or interesting to test for normality? People often apply such tests to marginal distributions given a widespread myth that marginal normality is a requirement for very many procedures. Where normality is a relevant assumption, or ideal condition, it usually applies to distributions conditional on a structure of mean outcomes or responses.

  3. In response to your specific query of whether subsampling is acceptable, the serious reply in return is acceptable in what sense? A personal reply: as a reader, author and reviewer of statistical papers, and as a statistical journal editor, my reaction would be to suggest that such subsampling is at best awkward and at worst an avoidance of the main issue, which would be to find an implementation without such a limit, or more likely to think about the distribution in different terms.

  4. As often emphasised on CV, and elsewhere, the most helpful and informative way to check departure from normality is a normal quantile plot, often also called a normal probability plot, a normal scores plot, or a probit plot. Such a plot not only provides a visual assessment of degree of non-normality, it makes precise in what sense there are departures from the ideal shape. The lack of an associated P-value is not in practice much of a loss, although the procedure may be given some inferential impetus through confidence levels, simulations and so forth. (EDIT 26 January 2021: yet other terms are Gaussian percentile plot and Gaussian probability plot.)

  5. Specifically, your examples consist of generating lognormal samples and then establishing that indeed they fail to qualify as normal with P-values $\ll 10^{-15}$. That has to seem puzzling, but be reassured that with larger samples your P-values will be, or should be, even more minute, subject to a machine-level question of the minimum reportable P-value here. Conversely, it may well be that your real problem lies elsewhere and these examples are no more than incidental illustrations.
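
To make points 2 and 4 concrete, here is a minimal editorial sketch (not part of the original answer; the choice of a t distribution with 10 degrees of freedom as the "nearly normal" example is an assumption made purely for illustration):

set.seed(42)

# A sample whose departure from normality is mild (slightly heavy tails):
x <- rt(4999, df = 10)

# At n just under 5,000, shapiro.test() will typically flag even this
# slight departure as highly significant:
shapiro.test(x)

# A normal quantile plot shows *how* the sample departs from normality
# (here, in the tails), which a bare p-value cannot:
qqnorm(x)
qqline(x)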

Nick Cox
    +1. But maybe it would also help to point out the obvious: if one takes a single random subsample, tests it, and rejects the null, then the null can be rejected for the entire sample. Given the expectation of rejecting the null on a suitably large sample, this could be an effective strategy--FWIW. A second obvious point to make explicitly would be to ask what's the point of conducting a distributional test in the first place. – whuber Jan 24 '20 at 14:23
    @whuber Thanks for comments. I have tinkered with my answer to reflect your "second obvious point". – Nick Cox Jan 24 '20 at 15:54
  • (One year later!) I disagree with point 1 here (see my answer below), although the rest is fine. – Ben Bolker Jan 26 '21 at 00:52
  • @Ben Bolker Thanks for your helpful answer. – Nick Cox Jan 26 '21 at 08:19
8

A small historical note/correction: contrary to what is said in (or might be inferred from) other answers here and elsewhere, the limitation of R's Shapiro-Wilk test to <=5000 observations is not:

  • an accidental limitation in R's implementation
  • an intentional limitation (as possibly suggested here) imposed to protect users from performing questionable tests

The limitation occurs because R refuses to provide a $p$-value for a range where the original function was not validated. In contrast, the original implementation by Royston (1995), and Stata's swilk function, do provide the p-value but give an error code/warning saying that the $p$-values may not be reliable.

The $p$-values corresponding to a given $W$ statistic are hard to compute: there is a whole series of papers in the literature (see refs below) using sophisticated mathematical techniques to come up with approximations that are computationally efficient and sufficiently accurate over a given range of $n$ to provide reliable estimates of the p-values of the Shapiro-Wilk statistic. Royston (1995) says:

All calculations are carried out for samples larger than 5000, but IFAULT is returned as 2. Although $W$ will be calculated correctly, the accuracy of its $P$-value cannot be guaranteed.

In other words, this is outside the range for which Royston and other authors have painstakingly constructed efficient functions that give good approximations to the $p$-value corresponding to a given value of $W$.

I suspect that the implementations of Shapiro-Wilk $p$-values in modern statistics packages are all based on the Fortran code described in Royston (1995). If you wanted to compute reliable Shapiro-Wilk $p$-values for samples with $n>5000$ (ignoring all the advice given here and elsewhere about why Normality testing on very large data sets is usually just silly), you would have to go back to the papers by Royston 1992 and Verrill and Johnson 1988 and re-do/extend those methods to larger values of $n$ — not a project for the faint-hearted.
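
As a quick illustration of that hard limit, here is a small sketch (the exact error text may differ across R versions):

# shapiro.test() refuses samples larger than 5,000 outright, rather than
# returning a p-value together with a warning; capture the error to see this:
tryCatch(shapiro.test(rnorm(5001)),
         error = function(e) conditionMessage(e))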


  • Royston, Patrick. “Approximating the Shapiro-Wilk W-Test for Non-Normality.” Statistics and Computing 2, no. 3 (September 1, 1992): 117–19. https://doi.org/10.1007/BF01891203.
  • ———. “Remark AS R94: A Remark on Algorithm AS 181: The W-Test for Normality.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 44, no. 4 (1995): 547–51. https://doi.org/10.2307/2986146.
  • Verrill, Steve, and Richard A. Johnson. “Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality.” Journal of the American Statistical Association 83, no. 404 (1988): 1192–97. https://doi.org/10.2307/2290156.
Ben Bolker
    You cited the manuals for Stata 13, which come up as the first hit in a Google search; I think that is why they get cited. The up-to-date version is always https://www.stata.com/manuals/rswilk.pdf. Otherwise good point, but (1) Stata 16.1 won't refuse to calculate the `swilk` test for $n > 5000$, but it does warn that the P-values can't be trusted; (2) if a given deviation from normality is significant at conventional levels for lower sample sizes, it is also significant at such levels for larger sample sizes. Cue usual comments about significance tests and P-values. – Nick Cox Jan 26 '21 at 08:13
    From experience the main point is that at sample sizes around 5000, the test is over-sensitive and therefore fairly useless. It's indeed hard to get a P-value for even larger sample sizes, but I don't think that undermines or contradicts any of the other serious reservations about this test. If anyone wants to take the line that unreliable P-values mean that you should not do the test with larger sample sizes, that seems to be a consistent if purist view. – Nick Cox Jan 26 '21 at 09:35
    Specifically, I did not and would not ever imply "laziness or poor programming" in this kind of case. At most, my guess would be proper diffidence about approximations being used where they are not known to work adequately. – Nick Cox Jan 26 '21 at 09:46
  • thanks @NickCox, I will update accordingly. – Ben Bolker Jan 26 '21 at 14:34
  • Your comments about Stata 16.1 are useful (don't seem to be explicitly stated in the PDF you linked - I searched for "5000"). I'm not really surprised that R-core has decided to err on the side of "purist" (refusing to report p-values that might be inaccurate, even with a warning) – Ben Bolker Jan 26 '21 at 15:03
    I ran the command on a dataset to see what happens. Stata users can see the code using `viewsource swilk.ado`. – Nick Cox Jan 26 '21 at 16:37
1

I think Nick Cox points out some of the difficulties with the approach.

A possible alternate recommendation would be to use another normality test. In classes I took we used a test based on skewness and kurtosis due to D'Agostino for larger samples. I implemented these tests in my lolcat statistical package. Consider:

#Install/load step
require(devtools)
install_github("burrm/lolcat")
require(lolcat)

set.seed(1)

#Normal distribution - no rejection
zz <- rnorm(5500)
skewness.test(zz)
kurtosis.test(zz)

# Log normal distribution - rejection on both skewness and kurtosis
zz1 <- exp(zz)  # lognormal sample derived from zz
skewness.test(zz1)
kurtosis.test(zz1)

Interestingly enough, even with a sample size of 5500, these skewness/kurtosis tests would likely not reject for a truly normal sample. A lognormal sample would most likely be rejected, even at substantially lower sample sizes. As an example:

> set.seed(1)
> 
> #Normal distribution - no rejection
> zz <- rnorm(5500)
> skewness.test(zz)

    D'Agostino Skewness Normality Test

data:  input data
skewness = -0.035209, null hypothesis skewness = 0, p-value = 0.286
alternative hypothesis: true skewness is not equal to 0
95 percent confidence interval:
 -0.09992690  0.02950877
sample estimates:
   skewness           z      se.est     root.b1 
-0.03520907 -1.06683621  0.03301991 -0.03519946 

> kurtosis.test(zz)

    D'Agostino Kurtosis Normality Test

data:  input data
kurtosis = -0.052102, null hypothesis kurtosis = 0, p-value = 0.4362
alternative hypothesis: true kurtosis is not equal to 0
95 percent confidence interval:
 -0.18151406  0.07731029
sample estimates:
   kurtosis           z      se.est          b2 
-0.05210189 -0.77868046  0.06602783  2.94685476 

> 
> # Log normal distribution - rejection on both skewness and kurtosis
> zz1 <- exp(zz)
> skewness.test(zz1)

    D'Agostino Skewness Normality Test

data:  input data
skewness = 5.2214, null hypothesis skewness = 0, p-value < 2.2e-16
alternative hypothesis: true skewness is not equal to 0
95 percent confidence interval:
 5.156675 5.286111
sample estimates:
   skewness           z      se.est     root.b1 
 5.22139319 63.31231869  0.03301991  5.21996907 

> kurtosis.test(zz1)

    D'Agostino Kurtosis Normality Test

data:  input data
kurtosis = 61.259, null hypothesis kurtosis = 0, p-value < 2.2e-16
alternative hypothesis: true kurtosis is not equal to 0
95 percent confidence interval:
 61.13006 61.38888
sample estimates:
   kurtosis           z      se.est          b2 
61.25946799 44.06817706  0.06602783 64.20270103 
Mike Burr
    There are good reasons to regard skewness and kurtosis as (sometimes) helpful measures of distribution shape but a poor basis for testing. That’s why the Shapiro-Wilk test and some others don’t use them. Specifically even if the parent is normal, sample skewness and kurtosis approach their asymptotic sampling distributions extraordinarily slowly. I don’t recall whether the D’Agostino test is smart about this, but many skewness and kurtosis tests are not. – Nick Cox Jan 26 '20 at 10:22
0

I apologize for reviving an old thread, but I came across this in a search and I wanted to add my input in case others have the same question. Nick Cox has provided some good input to this question, but I would like to present an answer from a different point of view. In a standardized environment, such as a corporation, it is common to have to meet certain constraints and criteria such as p-values. While I absolutely agree that a graphical assessment such as a normality plot should be completed, these types of analysis are difficult to write policy around. Establishing quantified constraints is important in a regulated environment, and a p-value is a reasonable solution to this.

Sampling a portion of the sample is indeed a reasonable solution. Think of it this way: You have a manufacturing system in which parts are constantly being produced, let's say legos. Millions (billions?) are created, which is the population. You select a barrel to perform your assessment on, the barrel contains, well let's say 100,000 legos. That's far more than you need! You grab a scoop and pull out about 100 legos. See what you've done? You took a sample, the barrel, and then you re-sampled, the scoop. Now in this example I've done a terrible job of randomizing the components, but I think it still makes a good illustration that we all do this sort of thing every day, whether it's with physical components or sampling a list of data.

So to summarize, it's definitely acceptable to sample your sample, but you want to make sure to get enough random samples that you're not only representing the larger sample datapoints, but the population in general. Your example of using 500 datapoints is still a huge sample to analyze.
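
If you do subsample, one way to guard against an unlucky single draw is to repeat the subsampling and check that the conclusion is stable across draws. Here is a minimal sketch using the data frame from the question (the 20 repeats and the subsample size of 500 are arbitrary choices, made purely for illustration):

set.seed(123)

# Draw 20 independent subsamples of 500 per cluster and collect the
# Shapiro-Wilk p-values; rows are clusters, columns are repeated draws:
pvals <- replicate(20,
  tapply(cluster.df[, 1], cluster.df[, 2],
         function(x) shapiro.test(sample(x, 500))$p.value))

# Summarise the p-values for each cluster across the repeated subsamples:
apply(pvals, 1, summary)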

Porter
    The trouble, though, is in determining the correct sample size. Unlike, for example, a t-test, where it is fairly straightforward to calculate the required sample size for a given effect size, for a more general test like this the effect size is far from straightforward. In the absence of a power calculation, concluding “$p>0.05$, quality control is good” is dubious. – Dave Feb 14 '21 at 15:59