Inconsistent normality tests: Kolmogorov-Smirnov vs Shapiro-Wilk

Question

I'm currently looking into some data that was produced by an MC simulation I wrote - I expect the values to be normally distributed. Naturally I plotted a histogram and it looks reasonable (I guess?):

[Top left: histogram with dist.pdf(), top right: cumulative histogram with dist.cdf(), bottom: QQ-plot, data vs dist]

Then I decided to take a deeper look into this with some statistical tests. (Note that dist = stats.norm(loc=np.mean(data), scale=np.std(data)).) What I did and the output I got was the following:

Kolmogorov-Smirnov test:

scipy.stats.kstest(data, 'norm', args=(data_avg, data_sig))
KstestResult(statistic=0.050096921447209564, pvalue=0.20206939857573536)

Shapiro-Wilk test:

scipy.stats.shapiro(data)
(0.9810476899147034, 1.3054057490080595e-05)
# The first value is the test statistic and the second one is the p-value.

QQ-plot:
```
stats.probplot(data, dist=dist)
```

My conclusions from this would be:

- by looking at the histogram and the cumulative histogram, I would definitely assume a normal distribution
- same holds after looking at the QQ plot (does it ever get much better?)
- the KS test says: 'yes this is a normal distribution'

My confusion is: the SW test says it is not normally distributed (p-value much smaller than significance alpha=0.05, and the initial hypothesis was a normal distribution). I don't understand this, does anyone have a better interpretation? Did I screw up at some point?

QQplots for normality can be better than that: try plotting some random normals of the same sample size to get a benchmark. You have slight non-normality, as indicated by systematic curvature on the QQplot. Histograms and cumulative distribution plots are less useful for precise work. I wouldn't privilege K-S here; it tends to be more sensitive in the middle of a distribution than in the tails, which is the reverse of what you need. S-W is a test, and doesn't (can't!) measure how problematic non-normality is. — Nick Cox, Aug 21 '17 at 12:46
I can't comment on your use of Python. Asking about particular software is usually off-topic here unless the query is essentially statistical. — Nick Cox, Aug 21 '17 at 12:47
We can't comment on your grounds for expecting a normal distribution; what difference would slight non-normality make to your project? — Nick Cox, Aug 21 '17 at 12:48
First of all, thanks for the quick comment (and sorry for the software-specific question; I put it in here on the off-chance of meeting somebody with statistics AND Python knowledge ;). Normality is expected after this type of MC simulation, and slight non-normality is typically not a problem. If the values are not at all normally distributed, that is an indicator of a fishy algorithm (e.g. too much auto-correlation between the samples, unaccounted-for systematic effects...). — rammelmueller, Aug 21 '17 at 12:52
As I said, I typically only take a quick look. Since we changed the algorithm quite a bit, I decided to check this a little more closely and found this behavior. The 'amount' of normality is certainly enough (given the sample size of about 450), but now I am interested in why the tests give different results and whether my interpretation is any good here. — rammelmueller, Aug 21 '17 at 12:54
My main comment is that K-S is oversold here. The criterion of maximum difference between cumulatives doesn't match what's needed. You don't have to trust me: you have the evidence that here it doesn't find what is visually obvious. — Nick Cox, Aug 21 '17 at 13:06
There are plenty of Python users here, supposedly 319 following that tag. Not me, however. — Nick Cox, Aug 21 '17 at 13:07
Thanks a lot in any way! As is clearly visible, I'm not too much of a statistician - any input is more than welcome here! — rammelmueller, Aug 21 '17 at 13:09
I outflank you. I am not a statistician, but a geographer. But I use statistics a lot. — Nick Cox, Aug 21 '17 at 13:12
@Nick This application of K-S is invalid, because it compares the data to a Normal distribution *with parameters determined by the data*: it needs the Lilliefors version. (I know you know that, but you seem to have overlooked this error.) Consequently its p-value is grossly too high. — whuber, Aug 21 '17 at 14:10
@whuber Thanks for the compliment. I was and am certainly aware of the issue but did not know whether the Python function (or whatever it's called) did the right thing in applying Lilliefors. I haven't used it and haven't looked at its documentation, as my affections lie elsewhere. — Nick Cox, Aug 21 '17 at 14:13
@Nick I presumed the application was erroneous, based on two pieces of evidence: (1) the function name refers to K-S and (2) there is no way in the `args` argument to reveal whether the parameters were derived from the data or not. The documentation [is not clear](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.kstest.html), but its lack of any mention of these distinctions strongly suggests it is not performing the Lilliefors test. That test is described, with a code example, at https://stackoverflow.com/a/22135929/844723. — whuber, Aug 21 '17 at 14:17
Ah! This is something I found fishy but I wasn't aware of that method - I will change that right away. Thanks for pointing that out @whuber! — rammelmueller, Aug 21 '17 at 14:18
@whuber Impressive detective work. Yet another reason not to use the test. (Personal prejudice: yet another reason to worry whether Python for statistics is oversold.) — Nick Cox, Aug 21 '17 at 14:21
Oversold seems to be a strong word - I think it is fine for some quick tests (also with caution there, as we clearly see) but if one would want to do some more sophisticated analysis other tools are certainly worth a look. — rammelmueller, Aug 21 '17 at 14:24
@Nick I *love* the K-S test for several reasons: its simplicity, its direct connection to the Q-Q plot, its flexibility, and its power. I maintain that every statistical test can be visualized and (almost) every visualization suggests a corresponding test--and this is one of the best examples of that thesis (especially if one plots the *residuals* in a Q-Q plot, which is visually more powerful). Although I have implemented many other GoF tests like S-W and S-F and A-D, K-S has always been my go-to test for those (relatively rare) occasions when a formal test of distribution was needed. — whuber, Aug 21 '17 at 14:43
Oversold? Do correct me when wrong, but I tend to choose words very carefully and I maintain my stance. Note that I said "oversold here" and the context is clearly testing for -- more generally assessing -- non-normality. Oversold means that many authors urge this on others even though the test is inferior to others and even invalid if it uses parameters estimated from the data, points made by others too in this thread. I like the interpretation that K-S is linked to the graph of empirical cumulatives very much, but the link with QQplots seems indirect rather than direct. cc:@whuber — Nick Cox, Aug 21 '17 at 15:15
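
The Lilliefors point raised by whuber in the comments above can be illustrated with a small simulation. What follows is a minimal sketch, not the procedure used by `scipy.stats.kstest` or by any particular library: it computes the KS distance against a normal whose mean and standard deviation are fitted to the same sample, and gets a p-value by repeating that fit on simulated normal samples. The function names, the 450-draw exponential sample, and the simulation counts are illustrative assumptions; only the Python standard library is used.

```python
import math
import random
from statistics import NormalDist


def ks_distance_to_fitted_normal(sample):
    """Largest gap between the sample's ECDF and a normal fitted to that sample."""
    n = len(sample)
    mean = math.fsum(sample) / n
    # Sample standard deviation (n - 1 in the denominator).
    sd = math.sqrt(math.fsum((x - mean) ** 2 for x in sample) / (n - 1))
    fitted = NormalDist(mean, sd)
    gap = 0.0
    for i, x in enumerate(sorted(sample), start=1):
        cdf = fitted.cdf(x)
        # The ECDF jumps from (i - 1)/n to i/n at x, so check both sides.
        gap = max(gap, i / n - cdf, cdf - (i - 1) / n)
    return gap


def simulated_null_distances(n, n_sim=500, seed=0):
    """Distances for normal samples of size n, re-fitting the normal every time."""
    rng = random.Random(seed)
    return [
        ks_distance_to_fitted_normal([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_sim)
    ]


def monte_carlo_pvalue(observed, null_distances):
    """Share of simulated distances at least as large as the observed one."""
    exceed = sum(d >= observed for d in null_distances)
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (len(null_distances) + 1)


# Hypothetical usage: a clearly skewed sample of 450 exponential draws.
rng = random.Random(1)
skewed = [rng.expovariate(1.0) for _ in range(450)]
null = simulated_null_distances(n=450, n_sim=200)
print(monte_carlo_pvalue(ks_distance_to_fitted_normal(skewed), null))
```

Because the distance does not change when the data are shifted or rescaled, simulating from a standard normal is enough. The statistic reported in the question could be passed as `observed` together with the actual sample size, keeping in mind that this sketch divides by n - 1 for the standard deviation whereas `np.std` divides by n by default.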

score 6 · Accepted Answer · answered Aug 21 '17 at 14:51

There are innumerable ways a distribution can differ from a normal distribution, and no test can capture all of them. As a result, each test checks a different aspect of how well your distribution matches the normal. For example, the KS test looks at the quantile where your empirical cumulative distribution function differs maximally from the normal's theoretical cumulative distribution function. This is often somewhere in the middle of the distribution, which isn't where we typically care about mismatches. The SW test focuses on the tails, which is where we typically do care whether the distributions are similar. As a result, the SW test is usually preferred. In addition, the KS test is not valid if you are using distribution parameters that were estimated from your sample (see: What is the difference between the Shapiro-Wilk test of normality and the Kolmogorov-Smirnov test of normality?). You should use the SW test here.

But plots are generally recommended and tests are not (see: Is normality testing 'essentially useless'?). You can see from all your plots that you have a heavy right tail and a light left tail relative to a true normal. That is, you have a little bit of right skew.
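
To make the "heavy right tail and light left tail" reading concrete, here is a small sketch that compares exact quantiles instead of random samples. It assumes the QQ plot puts theoretical normal quantiles on the x-axis and the ordered data on the y-axis, as `scipy.stats.probplot` does. The two example distributions (a standardized exponential for right skew, a standardized Laplace for symmetric heavy tails) and all function names are illustrative choices, not anything taken from the question's data.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def exponential_quantile_standardized(p):
    """Quantile of an Exponential(1) variable rescaled to mean 0, sd 1 (right-skewed)."""
    return -math.log(1.0 - p) - 1.0


def laplace_quantile_standardized(p):
    """Quantile of a Laplace variable rescaled to mean 0, sd 1 (symmetric, heavy tails)."""
    scale = 1.0 / math.sqrt(2.0)  # the Laplace sd equals scale * sqrt(2)
    if p < 0.5:
        return scale * math.log(2.0 * p)
    return -scale * math.log(2.0 * (1.0 - p))


def tail_offsets(quantile, p=0.01):
    """Vertical distance from the reference line y = x at both ends of a QQ plot."""
    left = quantile(p) - STD_NORMAL.inv_cdf(p)
    right = quantile(1.0 - p) - STD_NORMAL.inv_cdf(1.0 - p)
    return left, right


for name, q in [("right-skewed", exponential_quantile_standardized),
                ("heavy-tailed", laplace_quantile_standardized)]:
    left, right = tail_offsets(q)
    print(f"{name}: left end {left:+.2f}, right end {right:+.2f}")
```

Under that axis convention, a positive offset means the point sits above the reference line. The right-skewed case is above the line at both ends, while the symmetric heavy-tailed case is below on the left and above on the right.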

Aksakal · Answer 2 · score 2 · 2017-08-21T13:56:10.380

You can't cherry-pick normality tests based on their results. In this case, you either accept the rejection from any test you conducted or don't use the tests at all. The KS test is not very powerful; it's not a "specialized" normality test. If anything, SW is probably more trustworthy in this case.

To me, your QQ plot shows signs of either a fat right tail or skew to the left, or both. I would suggest using Tukey's tool to study the fatness of the tails. It'll give you an indication of how much a distribution resembles a normal or a Cauchy.

edited Aug 21 '17 at 13:56

answered Aug 21 '17 at 13:40

Aksakal


How do you conclude from QQ-plots to the fatness of the tails? And: which distribution would you suggest? – rammelmueller Aug 21 '17 at 13:45
@rammelmueller, fatter tails would show an S-shaped curve where the left end bends down and the right end bends up. In your case the left end bends up too, which could be a sign of left skew. – Aksakal Aug 21 '17 at 13:56
Thanks for pointing out the tool, I'll look into it. Just for the sake of completeness: I have some other datasets and the results sometimes differ slightly. The upper tail of the QQ plot varies, but the lower tail is consistently a little too high. Is that a sign of skewness? – rammelmueller Aug 21 '17 at 14:03
I think you need to ask yourself how important normality testing is for you, as @NickCox suggested. Why are you testing in the first place? Short tail up and long term down could be a sign of short tails. Most importantly, this may all be inconsequential to you. – Aksakal Aug 21 '17 at 14:15
I am aware that I might get decapitated after this statement, but here I go: I need my data to be "reasonably Gaussian". If there was something very fishy, i.e. extremely fat tails or extreme skewness, then I would have to hunt for some fundamental issues. This doesn't seem to be the case and the project is fine. The reason for the question here was more to check whether I am entirely wrong in what I'm doing (i.e. interpreting results and such). – rammelmueller Aug 21 '17 at 14:22
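
One hedged way to put a number on "reasonably Gaussian" is the benchmark idea from the comments on the question: simulate normal samples of the same size and see how much their QQ plots wobble. The sketch below builds a pointwise band from simulated samples and reports what share of a sample's ordered, standardized values fall outside it. The function names, the 95% coverage, the 200 simulations, and the two test samples are all assumptions made for illustration; it is a rough visual-style benchmark, not a formal test.

```python
import math
import random


def standardized_order_statistics(sample):
    """Sort the sample after rescaling it to mean 0 and standard deviation 1."""
    n = len(sample)
    mean = math.fsum(sample) / n
    sd = math.sqrt(math.fsum((x - mean) ** 2 for x in sample) / (n - 1))
    return sorted((x - mean) / sd for x in sample)


def normal_envelope(n, n_sim=200, coverage=0.95, seed=0):
    """Pointwise band that each order statistic of a normal sample of size n
    stays inside in roughly `coverage` of the simulations."""
    rng = random.Random(seed)
    runs = [
        standardized_order_statistics([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_sim)
    ]
    lo_idx = int((1.0 - coverage) / 2.0 * (n_sim - 1))
    hi_idx = (n_sim - 1) - lo_idx
    lower, upper = [], []
    for i in range(n):
        column = sorted(run[i] for run in runs)
        lower.append(column[lo_idx])
        upper.append(column[hi_idx])
    return lower, upper


def fraction_outside(sample, lower, upper):
    """Share of the sample's standardized order statistics outside the band."""
    z = standardized_order_statistics(sample)
    outside = sum(not (lo <= v <= hi) for v, lo, hi in zip(z, lower, upper))
    return outside / len(z)


# Hypothetical usage with two synthetic samples of 450 values each.
lower, upper = normal_envelope(n=450)
rng = random.Random(2)
normal_like = [rng.gauss(5.0, 2.0) for _ in range(450)]
skewed = [rng.expovariate(1.0) for _ in range(450)]
print(fraction_outside(normal_like, lower, upper))
print(fraction_outside(skewed, lower, upper))
```

The band is pointwise, so a few points outside it are expected even for truly normal data; what matters is a systematic run of points leaving the band in one tail.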
