
I'm trying to check my data `normal_data` for normality in R.

I have always used `shapiro.test` or `ks.test`, but now I have more than 5000 values to check and `shapiro.test` is capped at 5000. Is there another function or approach I can use?
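For concreteness, a minimal sketch of the problem on simulated stand-in data (`normal_data` here is just `rnorm` output, not my real data):

```r
set.seed(1)
normal_data <- rnorm(6000)   # stand-in for my real data; n > 5000

shapiro.test(normal_data)
# fails with: sample size must be between 3 and 5000

# ks.test runs, but the parameters are estimated from the same data,
# which makes the test conservative (the Lilliefors issue):
ks.test(normal_data, "pnorm", mean(normal_data), sd(normal_data))
```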

  • Why are you doing this? Why do you need to check this? – Roland Oct 29 '21 at 17:47
  • ... and why is 5000 a critical value? – Limey Oct 29 '21 at 18:00
  • @Limey Most software does not bother to compute the sampling distribution (or its critical values) for distributional tests of large sample sizes, because (a) it can be a lot of work and (b) is pointless. – whuber Oct 29 '21 at 18:06
  • @Tar it is incorrect to conclude that "$n\gt 30$" is "large enough." It is also incorrect to conclude one can therefore "ignore the distribution and use parametric tests." The CLT does not assert what you claim it does. Exactly whether and how to check one's data depends on what kinds of tests one plans on making. – whuber Oct 29 '21 at 18:20
  • One visual way is with a [QQ plot](http://www.sthda.com/english/wiki/qq-plots-quantile-quantile-plots-r-base-graphs); see the sketch after these comments. – DanY Oct 29 '21 at 18:59
  • 1) [It looks like there might have been a common misinterpretation of the central limit theorem posted (and then deleted).](https://stats.stackexchange.com/questions/473455/debunking-wrong-clt-statement) // 2) [Normality testing is less helpful than one might hope.](https://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless) – Dave Oct 29 '21 at 19:02
  • @Dave The possible misunderstanding, now deleted, was subtler than that. It read "According to the central limit theorem, no matter what kind of distribution we have, the sampling distribution tends to be normal if the sample is large enough (n $\gt$ 30)." Missing from this was any sense of *what statistic* is involved. For instance, the sampling distribution of the maximum is not going to be Normal (and rarely even close to it). – whuber Oct 29 '21 at 19:27
  • If you really must test a sample `x` of size greater than 5000 for normality, then you can use `shapiro.test(sample(x, 5000))`. In practice, huge samples often have 'harmless' quirks that lead to non-informative rejection. Example: `pv = replicate(10^4, shapiro.test(rt(5000, 70))$p.val); mean(pv <= .05)` returns $0.146 > 0.05$. How often is the distinction between $\mathsf{T}(70)$ and standard normal of practical importance? – BruceET Oct 29 '21 at 21:52
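Picking up DanY's suggestion above, a minimal base-R sketch of the QQ-plot check (simulated data standing in for `normal_data`):

```r
set.seed(1)
normal_data <- rnorm(10000)       # stand-in for the real data

qqnorm(normal_data)               # sample quantiles vs. theoretical normal quantiles
qqline(normal_data, col = "red")  # reference line; points hugging it suggest normality
```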

2 Answers


For the question in the title, see How to perform a test using R to see if data follows normal distribution, which gives many possibilities. As for the limitations of the Shapiro-Wilk test, see Can a sample larger than 5,000 data points be tested for normality using shapiro.test by applying the test to a subsample?

Then (as many have asked in the comments), why are you doing this? See Testing large dataset for normality - how and is it reliable? and Is normality testing 'essentially useless'?

kjetil b halvorsen

For example, you could use the Jarque-Bera test (see there). But in almost all cases you will end up rejecting the null hypothesis even when your data look bell-shaped, because the usual tests are very sensitive at large sample sizes. A better approach is a QQ plot together with histograms with well-chosen bins; a sketch follows. Also, read this useful discussion.
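A minimal sketch of both suggestions; I am assuming `jarque.bera.test` from the `tseries` package (`moments::jarque.test` is an alternative):

```r
# install.packages("tseries")  # assumed dependency for jarque.bera.test
library(tseries)

set.seed(1)
dat <- rnorm(10000)    # stand-in for your data

jarque.bera.test(dat)  # joint test of sample skewness and kurtosis

# With large n, complement any formal test with graphics:
hist(dat, breaks = 50, freq = FALSE)                          # density-scale histogram, chosen bins
curve(dnorm(x, mean(dat), sd(dat)), add = TRUE, col = "red")  # fitted normal density overlay
qqnorm(dat); qqline(dat, col = "red")                         # QQ plot with reference line
```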