I have a 1D data set with 83163 data points, and I would like to know whether the data follows a normal distribution.

I tried using shapiro.test and ks.test in R:

d is a vector containing the data

shapiro.test(sample(d, 5000))

    Shapiro-Wilk normality test

data:  sample(d, 5000) 
W = 0.9694, p-value < 2.2e-16

(Repeated several times. Note subsampling.)

ks.test(d, dnorm, mean=mean(d), sd=sd(d))

    One-sample Kolmogorov-Smirnov test

data:  d 
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided 

Warning message:
In ks.test(d, dnorm, mean = mean(d), sd = sd(d)) :
  cannot compute correct p-values with ties

Both tests indicate that the data distribution is not normal.
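
One caveat about the `ks.test` call above: its second argument is meant to be a cumulative distribution function such as `pnorm`, whereas `dnorm` is a density. Passing a density is a likely explanation for the reported D = 1, independent of whether the data are normal. A minimal sketch of the difference, using a simulated heavy-tailed sample `x` as a stand-in for `d` (an assumption for illustration, not the original data):

```r
# Simulated stand-in for d (assumption: t distribution with 9 df); not the original data.
set.seed(1)
x <- rt(5000, df = 9)

# Passing the density instead of the CDF: the statistic is pushed to (almost) 1.
wrong <- ks.test(x, dnorm, mean = mean(x), sd = sd(x))

# Passing the CDF gives a meaningful distance between the ECDF and the fitted normal.
right <- ks.test(x, pnorm, mean = mean(x), sd = sd(x))

print(unname(wrong$statistic))
print(unname(right$statistic))
```

Even with `pnorm`, the p-value is only approximate when the mean and sd are estimated from the same data that is being tested.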

So I tried plotting the data (black), and it appears to be "taller" than a normal distribution with mean and sd estimated from the data (blue).

http://dl.dropbox.com/u/38050492/data-distrib.png

I wondered if the variance is over-estimated due to outliers, so I tried calculating Winsorized variance. I heuristically matched the peak to the data distribution, but I cannot get a good fit (red).
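
For reference, a Winsorized variance can be sketched as follows. The helper name `winsorized_var` and the cutoff `p = 0.05` are assumptions for illustration; the post does not say which cutoff was used.

```r
# Hypothetical helper: clamp values outside the [p, 1 - p] quantiles,
# then take the ordinary sample variance of the clamped values.
winsorized_var <- function(x, p = 0.05) {
  lim <- quantile(x, probs = c(p, 1 - p), names = FALSE)
  var(pmin(pmax(x, lim[1]), lim[2]))
}

# Simulated data with two gross outliers (not the original data).
set.seed(1)
x <- c(rnorm(1000), 50, -50)
print(var(x))             # inflated by the outliers
print(winsorized_var(x))  # much less affected by them
```

Note that clamping the tails also shrinks the variance of clean data, so a Winsorized variance is not directly comparable to the sd of a fitted normal without rescaling.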

Edit:

A Q-Q plot also suggests non-normality.

http://dl.dropbox.com/u/38050492/qqplot.png
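
A plot of this kind can be produced with `qqnorm` and `qqline`. The sketch below uses a simulated heavy-tailed sample in place of `d` (an assumption), so it shows the general shape only, not the plot linked above.

```r
# Simulated heavy-tailed stand-in for d (assumption); not the original data.
set.seed(1)
x <- rt(5000, df = 9)

qq <- qqnorm(x, main = "Normal Q-Q plot")  # draws the plot, returns the coordinates invisibly
qqline(x, col = "blue")                    # reference line through the first and third quartiles
```

With heavier-than-normal tails, the points typically fall below the line on the left and above it on the right.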

Is there a distribution that may better model the data?

I wanted to check normality because others have done two-sample z-tests and modeled the data using Gaussian distributions.

To make a long story short, the math works out much more nicely if the data distribution is assumed to be normal. As the normality assumption goes beyond simple application of parametric tests, I don't know how robust the results are when the data is not normally distributed...

There does appear to be a considerable deviation from normality in terms of the kurtosis of the distribution, and this deviation is consistent from dataset to dataset...
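
The kurtosis deviation can be quantified with a moment-based estimate of excess kurtosis, which is 0 for a normal distribution. A minimal sketch; the helper name `excess_kurtosis` is hypothetical and the samples are simulated, not the original data:

```r
# Hypothetical helper: fourth central moment over squared second central moment, minus 3.
excess_kurtosis <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2 - 3
}

set.seed(1)
print(excess_kurtosis(rnorm(1e5)))        # normal sample: theoretical value is 0
print(excess_kurtosis(rt(1e5, df = 10)))  # t with 10 df: theoretical value is 6 / (10 - 4) = 1
```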

Jeromy Anglim
David Shih
  • It turned out that the datasets reproducibly have excess kurtosis of 1.1. I fitted a specialized Pearson Type VII distribution (parameterized by mean, sd, and excess kurtosis), and got a much better fit. I suspect that the increased kurtosis is due to pre-processing normalization. See http://www.qualityamerica.com/knowledgecente/articles/PYZDEKnonnormal.html on non-normality. – David Shih Oct 20 '11 at 16:33
  • See also here: http://stats.stackexchange.com/questions/16611/why-would-all-the-tests-for-normality-reject-the-null-hypothesis – xmjx Oct 19 '11 at 22:35
  • But I cannot fit a normal distribution to the data using the estimated mean and variance, or even a Winsorized variance. Is the data really truly normal? Why does it appear 'taller' given the estimated variance? In the Shapiro test, I am only using a sub-sample of the data. If the problem is truly due to large sample size, should sub-sampling not solve the problem? – David Shih Oct 19 '11 at 22:37
  • Some of the answers in the linked question have some good advice. Personally, I'd like to see a [qq-plot](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/qqnorm.html). What you're going to do with the data afterwards also determines how much you really care about the 'significantly non-normal' result. –  Oct 19 '11 at 22:51
  • qqplot seems to suggest non-normality (see updated post)... – David Shih Oct 19 '11 at 23:13
  • Karl beat me to it. That qq-plot is nasty. –  Oct 19 '11 at 23:49
  • By the way, the normality assumption in z-tests should be fine, as the normality assumption concerns the distribution of the sample mean, and with sample sizes on the order of 80k, the population distribution would have to be pretty crazy for the sample mean to not be normally distributed. – Karl Oct 20 '11 at 00:14
  • Just wanted to say that you guys are really awesome. I learn lots of stuff from you even if my own answers are wrong and display a huge level of being underinformed. ;-) – xmjx Oct 20 '11 at 06:34

1 Answer

It's definitely not normal, and it's not just the large sample size. That qq-plot is really clear. The Wikipedia page about kurtosis may be useful to you; it mentions the Pearson type VII family of distributions, with an image of densities quite similar to yours.
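
As a rough illustration of why such a density looks "taller": the symmetric Pearson type VII family is equivalent to a location-scale Student t, whose excess kurtosis is 6 / (nu - 4) for nu > 4. The sketch below parameterizes it by mean, sd, and excess kurtosis. The helper name `dt_moments` is hypothetical, and mean 0 and sd 1 are placeholders; only the excess kurtosis of 1.1 comes from the comments.

```r
# Hypothetical helper: density of a location-scale Student t with a given
# mean, standard deviation, and excess kurtosis (exkurt must be > 0).
dt_moments <- function(x, mean, sd, exkurt) {
  nu <- 4 + 6 / exkurt            # solve exkurt = 6 / (nu - 4) for nu
  s  <- sd * sqrt((nu - 2) / nu)  # scale chosen so that the variance equals sd^2
  dt((x - mean) / s, df = nu) / s
}

# Peak height at equal mean and sd: the heavy-tailed density is taller than the normal.
print(dt_moments(0, mean = 0, sd = 1, exkurt = 1.1))
print(dnorm(0))
```

Matching the mean and sd while allowing extra kurtosis moves probability mass into both the peak and the tails, which is the pattern described in the question.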

Karl