
I am having difficulty understanding exactly how several statistical tests, such as the t-test and ANOVA, work. These tests require that the data we use be normally distributed.

However, to share a bit of my experience with analyzing data: I have analyzed several data sets from numerous online sources (web scraping, open-access data repositories, etc.), with considerably large sample sizes (hundreds to thousands of observations). An example of the data in question is the amount donated to certain campaigns at fixed points in time (day 1 at 1 pm, day 2 at 1 pm, etc.).

When I tested the normality of the data, using visual aids (histograms, Q-Q plots) and the Shapiro-Wilk test, they all showed me that the data are not normal. For example, the Shapiro-Wilk test gave a p-value so small (less than 2.2e-16) that the null hypothesis has to be rejected, i.e. the data are not normally distributed.
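
A minimal Python sketch of these checks, for reference (the `donations` array below is a made-up lognormal stand-in, not the actual campaign data):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
donations = rng.lognormal(mean=3.0, sigma=1.0, size=2000)  # made-up stand-in for the real amounts

# Visual checks: histogram and Q-Q plot against the normal distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(donations, bins=50)
ax1.set_title("Histogram of donation amounts")
stats.probplot(donations, dist="norm", plot=ax2)  # draws the Q-Q plot on ax2
plt.show()

# Formal test: Shapiro-Wilk (H0: the sample was drawn from a normal distribution)
stat, p_value = stats.shapiro(donations)
print(f"Shapiro-Wilk W = {stat:.4f}, p-value = {p_value:.2e}")
```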

Then I read in articles like the one at this link, which says:

However, even if the distribution of the individual observations is not normal, the distribution of the sample means will be normally distributed if your sample size is about 30 or larger

So naturally, I am confused: are my data normally distributed or not? How often do you encounter normal and non-normal distributions in real-life data?

EDIT

Many posts and forums also agree that normality in real data is quite rare. But if that is the case, then are parametric tests such as the chi-square test, ANOVA, t-tests, etc., by nature rarely applicable, and therefore useless? An example of a discussion that supports this is here.

user2552108
  • Related: https://stats.stackexchange.com/q/2492/27276 – hplieninger Aug 21 '18 at 09:09
  • Two further notes: t-test and ANOVA require the residuals to be normally distributed, not necessarily the data. Furthermore, amounts (of donations) may be better described using distributions for [count variables](https://en.wikipedia.org/wiki/Count_data). – hplieninger Aug 21 '18 at 09:12
  • Large samples will lead to rejection for almost any "point" null. When is any model including assumptions really exactly true? ("*All models are wrong*", the mantra goes. The real question is how wrong do they have to be to not be useful?) I think this large-sample rejection issue is discussed in many posts on site. The (nonsensical) n=30 claim is debunked on site more than once as well. – Glen_b Aug 21 '18 at 09:12
  • @hplieninger The residuals are not iid. The assumptions actually relate to the errors, not the residuals. Note that within each group (for one-way ANOVA or t-tests), normality of the data and normality of the errors are the same assumption. – Glen_b Aug 21 '18 at 09:13
  • @Glen_b Totally agree – hplieninger Aug 21 '18 at 09:14
  • Possible duplicate of [Is normality testing 'essentially useless'?](https://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless) – Nick Cox Aug 21 '18 at 09:27
  • The test is Shapiro-Wilk (after Martin B. Wilk). – Nick Cox Aug 21 '18 at 09:28
  • Relevant: [Is there an explanation for why there are so many natural phenomena that follow normal distribution?](https://stats.stackexchange.com/questions/204471/is-there-an-explanation-for-why-there-are-so-many-natural-phenomena-that-follow) (particularly the denial of the premise by both amoeba and myself). On n=30 see (for example) Greg Snow's answer here: [Under what circumstances is an N < 30 acceptable?](https://stats.stackexchange.com/a/48999/805) and the answers at [Role of Central Limit Theorem in one-way ANOVA](https://stats.stackexchange.com/q/195452/805) – Glen_b Aug 21 '18 at 09:28
  • An example of one of the many answers on site that address your title question is [here](https://stats.stackexchange.com/a/300452/805): *Data are (almost) never normal. Whether that's an issue depends what forms of deviation from normality the procedure you want to use is sensitive to (and how much), how non-normal it is and in what way it's non-normal (strictly we're talking about the distribution the sample was drawn from rather than the sample itself).*... Also see [why does the distribution of height follow Normal Distribution?](https://stats.stackexchange.com/q/360254/805) – Glen_b Aug 21 '18 at 09:51

2 Answers


How often do you encounter normal and non-normal distributions in real-life data?

Honestly, you almost never encounter truly normal data in real-life cases. There are several normality tests, like Shapiro-Wilk, and yes, with real data you are very likely to reject the null hypothesis, especially with big samples, because the tests become sensitive to even tiny departures from normality (with time-series data, for example, you will reject almost always).

Often it is better to be a little less strict, for example by looking at the Q-Q plot (and not at the p-value). Are the points close to what is expected in the normal case? If yes (and you decide how close is close enough), then you can assume that the data are approximately normal (i.e. unimodal, no heavy tails, etc.).
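
To illustrate with made-up data (nothing here comes from the question): a sample that is only slightly non-normal, such as a t distribution with fairly many degrees of freedom, will typically get a tiny Shapiro-Wilk p-value at large n, while its Q-Q plot still looks close to a straight line.

```python
# Sketch: a barely non-normal sample (t distribution, 15 df) of size 4900.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
x = rng.standard_t(df=15, size=4900)  # close to normal, but not exactly

stat, p = stats.shapiro(x)
print(f"Shapiro-Wilk p-value: {p:.3g}")  # usually very small at this sample size

# ...yet the Q-Q plot is nearly straight except in the far tails, so for many
# purposes the sample could be treated as approximately normal.
stats.probplot(x, dist="norm", plot=plt)
plt.show()
```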

However, even if the distribution of the individual observations is not normal, the distribution of the sample means will be normally distributed if your sample size is about 30 or larger

This doesn't mean that if your sample is big, the data are normally distributed. The quote refers to the distribution of the sample mean, which, by the Central Limit Theorem, becomes approximately normal as the sample size grows, even when the individual observations are not normal.
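
A small simulation (with made-up exponential data, purely for illustration) shows what the quote is really about: the raw observations stay skewed, but the means of repeated samples are much closer to normal.

```python
# Sketch: the parent distribution is clearly non-normal (exponential),
# yet the distribution of sample means (n = 50 per sample) is close to normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
observations = rng.exponential(scale=10.0, size=100_000)          # skewed raw data
sample_means = rng.exponential(scale=10.0, size=(10_000, 50)).mean(axis=1)

print("skewness of raw observations:", stats.skew(observations))  # clearly positive (about 2)
print("skewness of sample means:    ", stats.skew(sample_means))  # much smaller, near 0
```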

RLave
  • Regarding your very last point: the bit you quote is about the distribution of sample means, not the distribution of your data. – Alexis Aug 23 '18 at 05:37
  • Yes, I meant to point that out to avoid confusion; I probably should have been clearer. – RLave Aug 23 '18 at 06:32

There are, as the comments show, various questions that are similar to this one, but none that exactly match it, so I think it is valuable to answer this question.

First, OLS regression (of which the t-test and ANOVA are special cases) makes assumptions about the shape of the errors, not the data. The errors are estimated by the residuals. It is entirely possible to have very non-normal data that yield nearly normal residuals.
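
Here is a hedged sketch of that point with simulated data (the variable names are made up): the response is strongly skewed because the predictor is, but the errors are normal, and the residuals reflect the errors rather than the marginal distribution of the response.

```python
# Sketch: y is strongly non-normal (it inherits the skewness of x),
# yet the OLS residuals are essentially normal because the errors are.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=2000)               # skewed predictor
y = 1.0 + 3.0 * x + rng.normal(scale=1.0, size=2000)    # normal errors

slope, intercept, *_ = stats.linregress(x, y)
residuals = y - (intercept + slope * x)

print("Shapiro-Wilk p, raw y:    ", stats.shapiro(y).pvalue)          # tiny: y looks non-normal
print("Shapiro-Wilk p, residuals:", stats.shapiro(residuals).pvalue)  # usually well above 0.05
```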

Second, almost no data is perfectly normal. Sometimes the residuals will be close enough to normal for the parametric tests to be fine. There's literature on which types of violations are problematic vs. not.

Third, there are several reasons to use parametric tests:

  • When the assumptions are met, they are somewhat more powerful than non-parametric tests (see the small simulation after this list). This mostly matters with small data sets.
  • They are faster to run than a lot of nonparametric tests. With modern computers, this only matters for large data sets and complex models (where "large" and "complex" depend in part on your computer).
  • They are familiar. A journal editor, dissertation committee member, or boss may object to unfamiliar statistical methods.
  • Ronald Fisher was born before Alan Turing; a lot of statistical methods were developed when calculations had to be done by hand.
  • At least some nonparametric tests are harder to interpret.
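
As a rough, simulated illustration of the first bullet (the numbers below are made up, not from any real study): when two small samples really are normal, the t-test rejects a bit more often than the Mann-Whitney U test for the same shift.

```python
# Sketch: power of the t-test vs. the Mann-Whitney U test on small normal
# samples, where the parametric assumptions actually hold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, shift, reps, alpha = 15, 1.0, 4000, 0.05
t_rejects = mw_rejects = 0

for _ in range(reps):
    a = rng.normal(size=n)
    b = rng.normal(loc=shift, size=n)
    t_rejects += stats.ttest_ind(a, b).pvalue < alpha
    mw_rejects += stats.mannwhitneyu(a, b, alternative="two-sided").pvalue < alpha

print("t-test power:      ", t_rejects / reps)
print("Mann-Whitney power:", mw_rejects / reps)
# The two are usually close; the t-test tends to be slightly ahead
# when the data really are normal and the samples are small.
```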

Fourth, I am assuming "donations to campaigns" is a monetary amount. Such amounts are very rarely even close to normal. It often makes sense to take logs of such variables.
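
A quick sketch of that last suggestion, again with made-up lognormal amounts (real donation data would need care with zeros before taking logs):

```python
# Sketch: a log transform often makes a right-skewed monetary variable
# far more symmetric (this assumes strictly positive amounts).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
donations = rng.lognormal(mean=3.0, sigma=1.2, size=1000)  # made-up amounts

print("skewness before log:", stats.skew(donations))          # strongly right-skewed
print("skewness after log: ", stats.skew(np.log(donations)))  # roughly symmetric
```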

Finally, despite all the above, I think that other methods are often useful. For instance, quantile regression makes no assumptions about the distribution of errors. It also lets you look at more aspects of the relationship between the variables.
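
For completeness, here is a minimal quantile-regression sketch using statsmodels' QuantReg on simulated data (the data frame and variable names are hypothetical):

```python
# Sketch: median (q = 0.5) and upper-quantile (q = 0.9) regression fits.
# Quantile regression makes no distributional assumption about the errors.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({"x": rng.exponential(scale=2.0, size=500)})
df["y"] = 5 + 2 * df["x"] + rng.exponential(scale=3.0, size=500)  # skewed errors

median_fit = smf.quantreg("y ~ x", df).fit(q=0.5)
upper_fit = smf.quantreg("y ~ x", df).fit(q=0.9)
print(median_fit.params)
print(upper_fit.params)
```

Comparing the fits at several quantiles shows how the relationship changes across the distribution of y, which is the extra information quantile regression gives you.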

Peter Flom