28

I have a sample size of 6. In such a case, does it make sense to test for normality using the Kolmogorov-Smirnov test? I used SPSS. I have a very small sample size because it takes a long time to obtain each observation. If it doesn't make sense, what is the smallest sample size for which testing does make sense?

Note: I did an experiment related to source code. The sample is the time spent coding in one version of the software (version A). I actually have another sample of size 6, which is the time spent coding in another version of the software (version B).

I would like to do hypothesis testing using a one-sample t-test to test whether the time spent in code version A differs from the time spent in code version B (this is my H1). A precondition of the one-sample t-test is that the data to be tested have to be normally distributed. That is why I need to test for normality.

whuber
BB01
  • http://en.wikipedia.org/wiki/Statistical_significance – RockScience Aug 08 '11 at 11:24
  • I really like whuber's answer in general (as it applies to statistical tests and small samples). In this case, however, I think the OP should be encouraged to give more details about the context. Without more information, I think Joris Meys's answer above is justified. – user603 Aug 08 '11 at 14:44
  • 6
    I, for one, have difficulty imagining a context in which n=6 and normality would be an hypothesis worth testing. I fear this is a case of an inexperienced user doing multiple hypothesis testing (run a regression then test for normality of residuals) and that we are addressing the symptoms but ignoring the skeletons in the closet, so to speak. – user603 Aug 08 '11 at 14:45
  • 3
    @user It's unfair to speculate about the questioner. Let's address the question, shall we? So, suppose you plan to compute an upper prediction limit for a value that will be used to make a costly decision. The value of the PL will be sensitive to normality assumptions. You're pretty sure the data generating process is non-normal, but data are expensive and time-consuming to generate. Previous experiments suggest $n=6$ will be sufficiently powerful to reject normality. (I have just described a standard framework for groundwater monitoring programs in the US.) – whuber Aug 08 '11 at 14:59
  • 3
    User603 (re your first comment): I would like to point out that @Joris has not supplied an answer, nor is his comment accompanied with any justification whatsoever. If an emphatic "no" is a valid general answer to this question, let's see it written down as such, with a supporting argument, so it can be evaluated up and down by the community. – whuber Aug 08 '11 at 15:03
  • 2
    @whuber : I added an argument for the emphatic "no". – Joris Meys Aug 08 '11 at 15:46
  • 1
    @Joris Thank you! That is helpful and illuminating. – whuber Aug 08 '11 at 16:00
  • 1
    Somehow related: [Normality Testing: 'Essentially Useless?'](http://stats.stackexchange.com/questions/2492/normality-testing-essentially-useless) – nico Aug 08 '11 at 16:04
  • @nico Good reference, thanks. The accepted answer (by Joris Meys, BTW) has clear examples of the situation for *large* $n$, where the difficulties become reversed: normality tests become *too* powerful. – whuber Aug 08 '11 at 16:11
  • This is off topic b/c OP asks about t-tests and normality. But, is it worth discussing some non-parametric alternatives that don't assume normality? And (just a shot in the dark), is there any chance that the same subjects are giving you both an A sample and a B sample? If so then some other tests that leverage this could have more power in the scientific question you're interested in. whuber and @Joris *wonderful* answers! Noobs like me can look to these as examples of how to try to be useful on this site. – ImAlsoGreg Aug 08 '11 at 23:03
  • @whuber: thanks for the example. I was wondering, is a rejection of normality, in the case you present, an end in itself? That is, isn't an $H_1$ assumed a priori in this case? Consider the following example: suppose in one sample you reject normality because 2 observations are located very far to the right of the others, showing evidence of a right skew in toxin concentrations, whereas in another sample the test rejects normality because all the observations are piled near one another ('platykurtic'). – user603 Aug 09 '11 at 09:17
  • My question then is: in the first case, one would presumably conclude that there is already enough evidence of more risk than normality would imply, and therefore no need to carry out further sampling. But what about the second case? In that hypothetical, would you interpret the second case as commanding the same course of action (no more samples are needed)? If not, aren't you then implicitly using the prior information that, say, the sample is either normal or right-skewed when carrying out the normality test in your example? – user603 Aug 09 '11 at 09:18
  • What my question tries to clarify is this: a test for normality does indeed make sense even when n is as small as 6, provided one has prior information about the likely alternative (and part of your answer points this out). But the need for that information (about the assumed likely alternative) is precisely the reason I would have preferred the OP to give more context to his question, for as Joris Meys's answer indicates, the safe course of action (when n=6) very much depends on how one weights the respective importance of Type I and Type II errors. – user603 Aug 09 '11 at 09:23
  • 1
    @user Excellent points in your last three comments. Answering the questions about risk assessment would take too long and take us too far afield; I'll have to be content to point out that testing distributions is not an objective, but is required (by law!) as part of assessing the performance characteristics of the decision procedure, whether it be determining that contamination has been released into the environment or simply computing a UCL of a mean for further analysis. Normality testing is conducted as an ongoing way to identify potential failures of a predetermined decision procedure. – whuber Aug 09 '11 at 13:52
  • Is the post's title (vs. text) the issue? @whuber answers that you can test w/ controlled size even with $n=6$. Joris answers (essentially) that you shouldn't do model selection this way. I think they both gave great, correct answers to different questions (title vs. text). I'd also suggest [Good's textbook](http://vk.cs.umn.edu/mikes/books/good/book.pdf) for small-$n$ tests (permutation, bootstrap), if you are not bound by government regulations (which have made me want to retreat to the theoretical world of statistics for the rest of today). – David M Kaplan Aug 09 '11 at 16:29
  • @David I'm having trouble interpreting either the title or the text as primarily concerned with "model selection." The text asks about finding an appropriate decision procedure. I'll grant that one could conceive of this process as an iterative one of model selection and procedure selection, but it doesn't have to be (and in formal applications, such as drug trials, regulatory compliance, etc. it cannot involve model selection at all). Of course if the question were re-interpreted as "is n=6 enough for model selection," the answer often (but not always!) is "no." – whuber Aug 09 '11 at 16:38

3 Answers

39

Yes.

All hypothesis tests have two salient properties: their size (or "significance level"), a number which is directly related to confidence and expected false positive rates, and their power, which expresses the chance of false negatives. When sample sizes are small and you continue to insist on a small size (high confidence), the power gets worse. This means that small-sample tests usually cannot detect small or moderate differences. But they are still meaningful.

The K-S test assesses whether the sample appears to have come from a Normal distribution. A sample of six values will have to look highly non-normal indeed to fail this test. But if it does, you can interpret this rejection of the null exactly as you would interpret it with higher sample sizes. On the other hand, if the test fails to reject the null hypothesis, that tells you little, due to the high false negative rate. In particular, it would be relatively risky to act as if the underlying distribution were Normal.

One more thing to watch out for here: some software uses approximations to compute p-values from the test statistics. Often these approximations work well for large sample sizes but act poorly for very small sample sizes. When this is the case, you cannot trust that the p-value has been correctly computed, which means you cannot be sure that the desired test size has been attained. For details, consult your software documentation.

Some advice: The KS test is substantially less powerful for testing normality than other tests specifically constructed for this purpose. The best of them is probably the Shapiro-Wilk test, but others commonly used and almost as powerful are the Shapiro-Francia and Anderson-Darling.
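For example, a minimal sketch in R (the Shapiro-Wilk test is in base R; the Shapiro-Francia and Anderson-Darling tests are assumed here to be available from the add-on nortest package):

# install.packages("nortest")  # assumed add-on package providing sf.test() and ad.test()
library(nortest)

x <- rnorm(6)       # replace with your own sample of 6 observations

shapiro.test(x)     # Shapiro-Wilk (base R)
sf.test(x)          # Shapiro-Francia (nortest)
# ad.test(x)        # Anderson-Darling (nortest); this implementation may require a slightly larger sample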

This plot displays the distribution of the Kolmogorov-Smirnov test statistic in 10,000 samples of six normally-distributed variates:

[Histogram of the Kolmogorov-Smirnov test statistic in 10,000 samples of size 6]

Based on 100,000 additional samples, the upper 95th percentile (which estimates the critical value for this statistic for a test of size $\alpha=5\%$) is 0.520. An example of a sample that passes this test is the dataset

0.000, 0.001, 0.002, 1.000, 1.001, 1000000

The test statistic is 0.5 (which is less than the critical value). Such a sample would be rejected using the other tests of normality.
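For those who want to reproduce this, here is a minimal R sketch of the simulation described above (the exact percentile will vary slightly from run to run; the 0.520 figure quoted came from the larger run of 100,000 samples):

set.seed(1)  # arbitrary seed, just for reproducibility of this sketch
ks.stat <- replicate(10000, ks.test(rnorm(6), "pnorm")$statistic)
hist(ks.stat, main = "KS statistic, samples of size 6", xlab = "D")
quantile(ks.stat, 0.95)          # estimate of the critical value for a 5% test

x <- c(0.000, 0.001, 0.002, 1.000, 1.001, 1000000)
ks.test(x, "pnorm")$statistic    # D = 0.5, below the critical value: not rejected
shapiro.test(x)                  # by contrast, a test such as Shapiro-Wilk rejects this sample easily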

whuber
  • 10
    I think any distribution that gives a sig. result with N = 6 will be so non normal that it will pass the IOTT with flying colors - that's the interocular trauma test. It hits you between the eyes. – Peter Flom Aug 08 '11 at 13:24
  • 2
    @Peter If you were to rephrase this comment, it would be correct. After all, many $N=6$ samples from a normal distribution will look perfectly normal, so clearly "any" is too strong a quantifier. What you meant to say is that there's a good chance that a random sample with $N=6$ will be clearly non-normal when plotted in a reasonable way (*e.g.*, probability plot) but will not be rejected by this test. – whuber Aug 08 '11 at 13:27
  • Just for fun, I tried set.seed(3833782) x – Peter Flom Aug 08 '11 at 13:36
  • 4
    @Peter Good! A KS test for normality has rejected a uniform sample. That's what one hopes. – whuber Aug 08 '11 at 14:12
  • 3
    `set.seed(140);x=rnorm(6);ks.test(x,pnorm)` produces `p-value = 0.0003255`. Of course I had to try it with 140 seeds before I found this... – Spacedman Aug 09 '11 at 15:22
22

As @whuber asked in the comments, here is a validation for my categorical NO. Edit: this is done with the Shapiro test, as the one-sample KS test is in fact used incorrectly here. whuber is correct: for correct use of the Kolmogorov-Smirnov test, you have to specify the distributional parameters rather than estimate them from the data. This is, however, what statistical packages like SPSS do for a one-sample KS test.

You are trying to say something about the distribution, and you want to check whether you can apply a t-test. So this test is done to confirm that the data do not depart from normality strongly enough to invalidate the underlying assumptions of the analysis. Hence, you are not interested in the Type I error, but in the Type II error.

Now one has to define "significantly different" to be able to calculate the minimum n for acceptable power (say 0.8). With distributions, that's not straightforward to define. Hence, I didn't answer the question, as I can't give a sensible answer apart from the rule-of-thumb I use: n > 15 and n < 50. Based on what? Gut feeling basically, so I can't defend that choice apart from experience.

But I do know that with only 6 values your Type II error is bound to be almost 1, making your power close to 0. With 6 observations, the Shapiro test cannot distinguish between a normal, Poisson, uniform, or even exponential distribution. With a Type II error that is almost 1, your test result is meaningless.

To illustrate normality testing with the Shapiro test:

shapiro.test(rnorm(6))        # a sample from a normal distribution
shapiro.test(rpois(6, 4))     # a sample from a Poisson distribution
shapiro.test(runif(6, 1, 10)) # a sample from a uniform distribution
shapiro.test(rexp(6, 2))      # a sample from an exponential distribution
shapiro.test(rlnorm(6))       # a sample from a log-normal distribution

The only case where about half of the p-values turn out smaller than 0.05 is the last one, which is also the most extreme case.
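Each call above tests only a single random sample; to see the rejection rates behind that statement, one can repeat the tests many times, for example (a quick sketch; results vary from run to run):

# Estimated proportion of p-values below 0.05 over 1000 simulated samples of size 6
mean(replicate(1000, shapiro.test(rlnorm(6))$p.value) < 0.05)        # log-normal
mean(replicate(1000, shapiro.test(rexp(6, 2))$p.value) < 0.05)       # exponential
mean(replicate(1000, shapiro.test(runif(6, 1, 10))$p.value) < 0.05)  # uniform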


If you want to find out the minimum n that gives you the power you want with the Shapiro test, you can run a simulation like this:

# For each sample size i from 5 to 50, estimate the power of the Shapiro test
# against an exponential alternative from 100 simulated samples.
results <- sapply(5:50, function(i){
  p.value <- replicate(100, {
    y <- rexp(i, 2)
    shapiro.test(y)$p.value
  })
  pow <- sum(p.value < 0.05) / 100   # rejection rate at alpha = 0.05
  c(i, pow)
})

which gives you a power analysis like this:

[Plot: simulated power of the Shapiro test against an exponential alternative, as a function of n]

from which I conclude that you need a minimum of roughly 20 values to distinguish an exponential from a normal distribution in 80% of the cases.

Code for the plot:

plot(lowess(results[2,]~results[1,],f=1/6),type="l",col="red",
    main="Power simulation for exponential distribution",
    xlab="n",
    ylab="power"
)
Joris Meys
  • 1
    You're quite right: power is a concern in tests of distribution with small $n$. However, you seem to have set the logic of hypothesis testing on its head: tests never "confirm" the null hypothesis; they can only reject it. Thus, anyone seeking to use a test for confirmation will either be deluded or disappointed. As a counterpoint to your carefully chosen alternatives--and to refute that "type II error is bound to be almost 1"--try performing a test against a lognormal(1,2) distribution with $n=6$. (BTW, note that KS is not designed for testing discrete distributions like the Poisson.) – whuber Aug 08 '11 at 16:07
  • @whuber : I hope you did notice I didn't take the sd into account in my analysis (and hence I did make a mistake when the sd is not 1, actually). With the updated code (which does take the sd into account), the power graph for the lognormal (rlnorm()) is the same: 20+ values are needed for a power > 0.8 – Joris Meys Aug 08 '11 at 16:34
  • 2
    @whuber : regarding the logic of hypothesis testing on its head : in which case are you interested in the alternative hypothesis? In all applications of these tests I've seen, people are interested in the confirmation of the null : my data do not differ significantly from a normal distribution. Which is why I emphasize the type II-error. – Joris Meys Aug 08 '11 at 16:39
  • 4
    See my comments to the OP concerning groundwater monitoring. Typically people are interested in *rejecting* one or both of two default assumptions: normality and lognormality. Because this is done under strict regulatory supervision, eyeballing a probability plot (which is a powerful tool for experienced IOTT practitioners like @Peter Flom) does not suffice: formal tests are needed. A similar application occurs in human health risk assessment; US EPA guidance documents specifically contemplate tests with $n$ as low as $5$. See http://www.epa.gov/oswer/riskassessment/pdf/ucl.pdf, *e.g.*. – whuber Aug 08 '11 at 16:53
  • If I understand the code right, it appears you're getting the wrong power because the KS test is not being correctly applied: it's not valid to use `ks.test` with estimates of mean and sd derived from the data. "If a single-sample test is used, the parameters specified in ... must be pre-specified and not estimated from the data." -- http://www.stat.psu.edu/~dhunter/R/html/stats/html/ks.test.html – whuber Aug 08 '11 at 19:33
  • @whuber : fair enough. Same thing with the shapiro test. Same result. – Joris Meys Aug 08 '11 at 19:58
  • 4
    To get back to the title: is it meaningful to test for normality with small sample sizes? In some cases it is, especially when testing against strongly skewed alternatives. (SW has 80% power at $n=8$ against an LN(1,2) alternative, e.g.) Low power against many alternatives when $n$ is small is something normality tests share, to one degree or another, with any hypothesis test. That does not preclude its use. Thus, an unqualified "no" is, to put it mildly, unfair to the test. More generally, it suggests we shouldn't ever use hypothesis tests on "small" samples ever. That sounds too Draconian. – whuber Aug 08 '11 at 20:29
  • 3
    @whuber : We'll have to agree to differ. I'm not completely a fan of EPA (and definitely not of FDA) guidelines. I've seen this abused once too often to still believe in its usefulness. Chance is a weird thing, and with only 6 cases highly unpredictable. I don't believe you can say anything about a complex function like a PDF based on only 6 observations. YMMV – Joris Meys Aug 08 '11 at 20:54
  • 1
    The assumption that within-group distributions *are normal* within the t-test seems like an impossible requirement. As @whuber points out, isn't acceptance of the $H_0$ in hopes of satisfying that requirement **always** tantamount to flipping hypothesis testing logic on its head? A valid alternative might be, if t-test could specify the minimum deviation from normal that could lead to spurious t-test positives, then the power of the normality test *could* be used to make legal statements about whether the assumptions are met. ? – ImAlsoGreg Aug 08 '11 at 22:54
  • @ImAlsoGreg : the hypothesis is that the within-group distributions are *drawn from* a close-to-normal distribution. Acceptance of the H0 is a valid approach, used often in pharmaceutics (this medicine is not worse than its predecessor and has fewer side effects) and requires strict assessment of power. How else would you statistically validate a hypothesis that two things don't differ enough to be important? – Joris Meys Aug 09 '11 at 08:26
  • @Joris: Right. But specification of a minimum interesting difference is a fundamental part of power analysis I thought. Is it true that you can never show that some statistic computed from group A is exactly the same as that from group B, only that they must be less different than some minimum interesting difference? And likewise, you can never show a distribution is perfectly normal, only that it doesn't differ from normality by more than a minimum interesting amount? So I'm wondering what is that maximum amount of non-normality that still satisfies the normality assumption in t-test? – ImAlsoGreg Aug 09 '11 at 13:52
  • 5
    @ImAlso The t-test can tolerate a *lot* of non-normality if it's fairly symmetric, but it can't tolerate too much asymmetry. (Indeed, a skewness test for normality might actually be a better option in the O.P. than the K-S test, for just this reason.) This points out one of the biggest differences between goodness of fit tests and other hypothesis tests: there is a huge space of possible alternatives and the GoF tests tend to be good against certain of them but not against others. You can't make them work well against all alternatives. – whuber Aug 10 '11 at 22:30
  • @whuber [this comment](http://stats.stackexchange.com/questions/13983/is-it-meaningful-to-test-for-normality-with-a-very-small-sample-size-e-g-n/13986#comment24585_13992) and [this one](http://stats.stackexchange.com/questions/13983/is-it-meaningful-to-test-for-normality-with-a-very-small-sample-size-e-g-n/13986#comment24613_13992) are so insightful that I believe they deserve to be incorporated into your actual answer. – Silverfish Jan 12 '16 at 21:18
  • @Silverfish He can incorporate that in his answer. The OP clearly stated that he's interested in that normality test in the context of a T-test. In that context, you want to check whether your data doesn't deviate enough from normality to make the T-test lose power. Hence you're interested in the null hypothesis and not the alternative. Hence my focus on the power of these tests. And hence my answer that with N=6, even the most powerful tests can't distinguish between a normal and a heavily skewed distribution. In that context it does not make sense. – Joris Meys Jan 13 '16 at 13:51
-2

The question posed here reflects some misconception about why a normality check is required for a sample size of 6. The main objective is "to test whether the time spent in code version A differs from the time spent in code version B or not (this is my H1)". When the word "differ" is used, is it a one-tailed test? Testing for normality is a second step. The first step is to check whether the test has the predetermined power (1−β) for the given sample size; when the power is very poor, what is the use of testing the normality condition? Checking the normality condition helps us decide whether to use a parametric or a non-parametric test. If your sample size does not provide adequate power, why should one think about testing for normality? When nothing is known about the parent population from which the samples come and the sample size is very small (< 10), non-parametric tests are always the justifiable choice.
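As an illustration, here is a minimal R sketch of a non-parametric comparison of the two versions (the Wilcoxon rank-sum test, a common alternative to the t-test); the timing values below are purely hypothetical placeholders:

# Hypothetical coding times for 6 tasks in each software version (made-up numbers)
time_A <- c(4.2, 5.1, 3.8, 6.0, 4.9, 5.5)
time_B <- c(5.0, 6.3, 4.4, 7.1, 5.8, 6.6)

# Wilcoxon rank-sum (Mann-Whitney) test: compares the two groups without assuming normality
wilcox.test(time_A, time_B)

# If the same tasks were measured in both versions (a paired design), use:
# wilcox.test(time_A, time_B, paired = TRUE)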

  • (-1) This is very unclear. Please read this page on how to answer questions: https://stats.stackexchange.com/help/how-to-answer – mkt Mar 31 '18 at 07:46