6

I'm in a debate with a coworker and I'm starting to wonder if I'm wrong but the internet is confusing me more.

We have continuous data $[0, \infty)$ that is retrospectively selected on individuals. The selection is non random. Our sample sizes are $\approx 1000$. Our data is heavily skewed towards the left with some strong bumps towards the tail.

My strategy is to look at the distribution of the data before statistical tests between two groups via histograms, q-q plots, and Shapiro Wilk test. If the data is approximately normal I use an appropriate test (t-test, ANOVA, Linear Regression etc). If not I use an appropriate non-parametric method (Mann-Whitney Test, Kruskal-Wallis, Bootstrap regression model).

My coworker doesn't look at the distribution: if the sample size is >30 or >50, he automatically assumes it is normal and cites the central limit theorem to justify using the t-test or ANOVA.

They cite this paper: t-tests, non-parametric tests, and large studies—a paradox of statistical practice? and say that I'm over-using non-parametric tests. My understanding is that my method tells me whether it's appropriate to assume a normal distribution, because I thought that for heavily skewed data the n needed to reach an ~normal sampling distribution was higher. I know that given a large enough sample size it would eventually get there, but especially for the smaller sample sizes isn't it better to check? To me it makes sense that, since multiple tests show the data aren't normal, it's inappropriate to use the normal distribution.

Also, if a sample size of 30 were all you needed to assume normality, why is so much work done on other distributions in statistical software? Everything would be normal distribution or non-parametric then. Why bother with binomial distributions or gamma distributions?

However, they keep sending me papers about the central limit theorem, and now I'm not so sure. Maybe I am wrong and I shouldn't bother checking these assumptions.
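To make concrete what I mean by "the n needed for skewed data is higher", here is a quick simulation sketch (my own, not from the paper; the lognormal parameters are an arbitrary heavily-skewed choice) of a one-sample t-test's false-positive rate:

```python
import numpy as np
from scipy import stats

# Sketch: false-positive rate of a one-sample t-test when the data are
# heavily right-skewed (LogNormal(0, 1.5); its true mean is exp(1.5**2/2),
# so H0 is true in every replication).
rng = np.random.default_rng(3)
true_mean = np.exp(1.5**2 / 2)
reps = 1000
rates = {}

for n in (30, 100, 500):
    rej = sum(
        stats.ttest_1samp(rng.lognormal(0.0, 1.5, n), true_mean).pvalue < 0.05
        for _ in range(reps)
    )
    rates[n] = rej / reps
    print(f"n={n:3d}: rejection rate under H0 ≈ {rates[n]:.3f}")
```

In runs like this the rate at n = 30 sits well above the nominal 5% and only creeps back toward 0.05 as n grows, which is what I thought the CLT argument glosses over.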

Who is right and why?

kjetil b halvorsen
Jacob Ian
    Show your coworker some examples, such as the (extreme) one discussed at https://stats.stackexchange.com/questions/69898. It clearly controverts the paper's overly general conclusion that "For studies with a large sample size, t-tests and their corresponding confidence intervals can and should be used even for heavily skewed data." This conclusion is based on (extremely) limited simulations and one case study and then justified with "the t-test is robust even to severely skewed data and should be used almost exclusively." That is, at best, a statistically naive statement. – whuber Oct 23 '20 at 16:15
  • 4
    You seem to be making the mistake of thinking that the central limit theorem says that your data will converge to a normal distribution. That is false: https://stats.stackexchange.com/questions/473455/debunking-wrong-clt-statement. – Dave Oct 23 '20 at 16:25
  • I think your colleague is essentially correct, and you are doing the wrong tests. You should be using a bootstrap to estimate the distribution of the sample average and then doing a normality test (or eyeballing) on that. – seanv507 Oct 23 '20 at 18:52
  • 3
@sean Could you explain why "automatically assuming [a sample] is normal" when its size exceeds 50 is in *any* sense "essentially correct"?? Even supposing the colleague is really referring to the sampling distribution of a statistic, very strong assumptions (which are not in evidence here) are needed to support any such conclusion. – whuber Oct 23 '20 at 19:05
@sean My colleague doesn't want to use bootstrap to estimate the distribution. He just wants to use the t-test without checking any assumptions, under the assumption that n > 30 means ~normal distribution. – Jacob Ian Oct 23 '20 at 19:08
  • 4
    More like NEVER. The CLT may lead to normally distributed $\bar X,$ but if data are not normal then $\bar X$ and $S$ cannot be independent, and the "t-statistic" cannot have a Student t dist'n. [Note: $\bar X$ and $S$ are independent _only_ for normal data.] // If your co-worker is your boss, maybe best to tread softly; otherwise, maybe let him know his information is from uninformed sources. – BruceET Oct 24 '20 at 06:02
  • 1
@whuber, I take the report as *hearsay*. As implied by my comment, I assume that the colleague is referring to a sample average, and similarly I don't believe the colleague is literally saying that the CLT says that above 50 samples the sample average suddenly becomes normal. In daily practice, people are not exposed to a wide range of distributions. So you don't have to run a normality test every time you perform a statistical test: you do it once for your particular data (revenue per user etc.) and sample size, and then forget about it. – seanv507 Oct 24 '20 at 10:03
  • 2
    @BruceET, you cannot have a t-distribution, but maybe you get something that is close to it. – Sextus Empiricus Oct 24 '20 at 13:59
  • 2
    @SextusEmpiricus. But certainly not dependably for $n = 31.$ – BruceET Oct 24 '20 at 18:20
  • @BruceET that $S$ and $\bar{x}$ are dependent may not be such a problem. The point of computing their ratio $T=\bar{x}/S$ is that you eliminate the uncertainty about the variance (which still stands). They may be somewhat correlated but that does not matter if the distribution is close to the T distribution. – Sextus Empiricus Oct 24 '20 at 18:59
  • 2
    I used to believe advice in mainly elementary texts about the accuracy of t tests with nonnormal data of moderate sample sizes as long as $\bar X$ seemed roughly normal. May even have said that in some answers on this site. (And t tests _are_ remarkable robust against mild departure from normality for moderately large samples.) But then I tried illustrating robustness via simulation for samples of sizes 30-100. Surely, not an exhaustive program of all possibilities, but I quickly learned that the '>30 rule' is generally much too expansively claimed. Roughly, cautions for <30 need extending. – BruceET Oct 24 '20 at 21:10
  • 1
@BruceET In this Q&A, https://stats.stackexchange.com/questions/470488/ , I made a simulation that showed that it did not matter so much. But indeed, you may always get discrepancies, and it will depend on how different the distribution is from a normal distribution. It is all relative. – Sextus Empiricus Oct 26 '20 at 21:49

2 Answers

2

My strategy is to look at the distribution of the data before statistical tests between two groups via histograms, q-q plots, and Shapiro Wilk test. If the data is approximately normal I use an appropriate test (t-test, ANOVA, Linear Regression etc). If not I use an appropriate non-parametric method (Mann-Whitney Test, Kruskal-Wallis, Bootstrap regression model).

What is 'approximately normal'? Do you need to pass a hypothesis test to count as sufficiently approximately normal?

A problem is that tests for normality become more powerful (more likely to reject normality) as the sample size increases, and can reject even for very small deviations. Ironically, for larger sample sizes, deviations from normality matter less.
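As an illustration (a rough sketch; the t distribution with 10 degrees of freedom is an arbitrary choice of a mild, practically harmless deviation from normality):

```python
import numpy as np
from scipy import stats

# Sketch: how often Shapiro-Wilk rejects normality (alpha = 0.05) when the
# data come from a t distribution with 10 df -- slightly heavier tails than
# normal, but a deviation that rarely matters in practice.
rng = np.random.default_rng(0)
reps = 200
rates = {}

for n in (50, 500, 5000):
    rej = sum(stats.shapiro(rng.standard_t(10, n)).pvalue < 0.05
              for _ in range(reps))
    rates[n] = rej / reps
    print(f"n={n:4d}: rejection rate ≈ {rates[n]:.2f}")
```

The same mild deviation that is rarely flagged at n = 50 gets flagged most of the time at n = 5000, which is exactly when it matters least.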

My coworker doesn't look at the distribution: if the sample size is >30 or >50, he automatically assumes it is normal and cites the central limit theorem to justify using the t-test or ANOVA.

Can we ALWAYS assume normal distribution if n >30?

It is a bit strong to say 'always'. Also, it is not correct to say that normality can be assumed; instead, we can say that the impact of the deviation from normality can be negligible.

The problem that the article by Morten W. Fagerland addresses is not whether the t-test works for n > 30 (it does not work so well at n = 30, as can also be seen in their graph; it requires large numbers, like the sample size of 1000 used in their table). The problem is that a non-parametric test like the Wilcoxon-Mann-Whitney (WMW) test is not the right solution, because WMW answers a different question: it is not a test for equality of means or medians.

The article does not say to 'never' use the WMW test, or to always use a t-test.

Is the WMW test a bad test? No, but it is not always an appropriate alternative to the t-test. The WMW test is most useful for the analysis of ordinal data and may also be used in smaller studies, under certain conditions, to compare means or medians.
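To see concretely that the WMW test answers a different question, here is a sketch (my own example, not from the article): two groups with the same population mean but different shapes, where WMW nevertheless rejects decisively.

```python
import numpy as np
from scipy import stats

# Sketch: both groups have population mean exp(0.5) ≈ 1.65, but different
# shapes. WMW compares P(X < Y) with 1/2, not the means, so it rejects.
rng = np.random.default_rng(1)
n = 1000

x = rng.lognormal(mean=0.0, sigma=1.0, size=n)       # mean e^0.5, median 1
y = rng.normal(loc=np.exp(0.5), scale=0.25, size=n)  # mean e^0.5, median e^0.5

res = stats.mannwhitneyu(x, y, alternative="two-sided")
print(f"WMW p-value: {res.pvalue:.3g}")
```

The means really are equal here; WMW is detecting the difference in shapes (a stochastic-ordering difference), which is not the hypothesis most people think they are testing.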

Depending on the situation, a person might always use a t-test without analysing normality, because of experience with the distributions that tend to occur. Sure, one can think of examples/situations where t-tests on samples of 30 or 50 are a lot less powerful (too-high p-values), but if you never deal with those examples then you can always use a t-test.


Something else.

If you have a sample size of 1000, then you might consider that the mean is not the only thing that matters, and you could look at more than just differences in means. In that case a WMW test is actually not a bad idea.

Sextus Empiricus
2

The data does NOT get closer to being normally distributed as the sample size grows.

Rather, the thing that gets closer to being normally distributed is the sample mean or the sample sum.

And if the population distribution is very skewed, then you may need far more than $30,$ and if it isn't, then maybe $10$ would be enough.
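A small simulation sketch (the exponential distribution is just a convenient skewed example) makes the distinction visible:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 40

# Raw draws from a right-skewed population (exponential, skewness = 2)...
data_skew = stats.skew(rng.exponential(1.0, size=10_000))

# ...versus the distribution of the mean of n = 40 such draws, whose
# skewness shrinks like 2/sqrt(n) ≈ 0.32 by the CLT.
means = rng.exponential(1.0, size=(10_000, n)).mean(axis=1)
mean_skew = stats.skew(means)

print(f"skewness of raw data:     {data_skew:.2f}")
print(f"skewness of sample means: {mean_skew:.2f}")
```

The raw data stay just as skewed no matter how many observations you collect; only the sampling distribution of the mean straightens out.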

Michael Hardy
  • 2
It also matters how far out in the tail you want to go. As an extreme case, genome-wide association studies work with nominal per-test thresholds on the order of $10^{-8}$ and need substantially larger sample sizes for $t$ statistics to be close enough to normal (i.e., the tails of a $t$ distribution get relatively heavier the further out you go). – Thomas Lumley Nov 17 '21 at 23:51