Is it safe to use one sample-t test with a dataset of size 5000 while the population distribution is not normal?

Question

I have a random variable $X$, the number of hours that kids aged 10-12 spend on video games in one month in a specific country, and I want to test the hypothesis that the population mean $\mu_{x}$ is higher than 50 hours. I have a sample of size 5000, where each data point is the number of hours that a different kid (10-12 years old) spent on video games. I set up my hypotheses as

$H_{0}: \mu_{x} = 50$

$H_{a}: \mu_{x} > 50$

To test this hypothesis, I want to use a one-sample t-test because I don’t know the population variance. I know that one of the assumptions of t-test is that the population should have a normal distribution. When I plot the distribution of my sample, I get the plot below.

When I run the Shapiro-Wilk test for normality on my sample, I reject the null hypothesis, so I conclude that my sample is not drawn from a population with a normal distribution. Under these circumstances, should I avoid the one-sample t-test and use a non-parametric test instead? If so, which test (maybe the Wilcoxon Rank Sum test)? Or, since I have a sample of size 5000, can I simply ignore the normality assumption and continue with the one-sample t-test? (From the "T-test for non normal when N>50?" discussion, I feel like continuing with the t-test is still safe.)

Thank you!

score 1 · Answer 1 · answered Apr 18 '21 at 16:28


> I know that one of the assumptions of t-test is that the population should have a normal distribution

Not true. See here for my arguments as to why this is not necessary. I also link here to another blog post that provides empirical evidence that it is not necessary. With this much data, the CLT will more than likely take care of the normality condition. A bigger problem will be any sampling considerations.

You're fine to use the t test in this case.
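The large-sample claim can be checked with a small simulation. The following is only a minimal sketch under stated assumptions: the population is a hypothetical exponential distribution with mean 50 (clearly non-normal and right-skewed, not the asker's actual data), the null hypothesis is true by construction, and the function names `t_statistic` and `rejection_rate` are made up for illustration. If the t-test behaves well at n = 5000, the share of false rejections should land close to the nominal 5%.

```python
import math
import random
from statistics import NormalDist


def t_statistic(sample, mu0):
    """One-sample t statistic for H0: population mean == mu0."""
    n = len(sample)
    mean = sum(sample) / n
    # Unbiased sample variance (divides by n - 1).
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (mean - mu0) / math.sqrt(var / n)


def rejection_rate(n=5000, mu0=50.0, reps=500, alpha=0.05, seed=0):
    """Share of simulated samples in which a true H0 is rejected (one-sided test)."""
    rng = random.Random(seed)
    # Assumption: with n - 1 = 4999 degrees of freedom the t critical value is
    # practically the standard normal quantile, so the normal quantile is used.
    critical = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(reps):
        # Hypothetical skewed population: exponential with mean mu0.
        sample = [rng.expovariate(1 / mu0) for _ in range(n)]
        if t_statistic(sample, mu0) > critical:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    print(rejection_rate())
```

This only looks at the type 1 error rate for one particular skewed population; it says nothing about power or about how the sample was collected.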


Demetri Pananos



I think I disagree slightly, even if mostly just in phrasing. The test statistic has the claimed distribution only under the assumption of a normal population. Lucky for us, the test is fairly robust to deviations from the normality assumption. (At least my colleagues and customers have seemed to care about this distinction.) – Dave Apr 18 '21 at 16:35

The duplicate shows why this answer "you're fine" *could* be correct but requires more justification. – whuber Apr 18 '21 at 16:43

While I agree with the recommendation, the cited text is undoubtedly true: The *derivation* of the $t$-test assumes a normal population. Only then does the $t$-statistic have a $t$-distribution. Now that's not a contradiction to the claim that in many cases, the application of the $t$-test is valid, even if the population is not normal (though people tend to concentrate more on type 1 errors in these discussions and tend to ignore power, which can be abysmal). – COOLSerdash Apr 18 '21 at 16:46
@Dave Here is where I am coming from. A t distribution is the ratio of a standard normal random variable to the square root of an independent chi-square random variable divided by its dof. Justification for the sample mean is provided by the CLT (as the sampling dist of the sample mean is asymptotically normal). Furthermore, the CI for a t test uses the standard error, which is the estimated standard deviation of the sampling dist. Is the robustness of the t test not a property of using normality from the CLT? – Demetri Pananos Apr 18 '21 at 18:31
But that would be normality of the sampling distribution rather than of the data, no? – Dave Apr 18 '21 at 19:31
@Dave That's my point. If all that is needed for the t statistic is a normal random variable, the CLT takes care of that (especially in large samples) regardless of the population level dist. – Demetri Pananos Apr 18 '21 at 20:01
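
The two positions in these comments can be put side by side in one short derivation. This is a sketch, not something stated by any commenter: the symbols $\mu_0$ (the hypothesized mean), $\sigma^2$ (the population variance, assumed finite), $\bar{X}$, and $S^2$ (sample mean and sample variance) are notation introduced here. The t statistic can be written as

$$
T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{\sqrt{n}\,(\bar{X} - \mu_0)/\sigma}{\sqrt{S^2/\sigma^2}} .
$$

If the population is normal and $H_0$ holds, the numerator is exactly $N(0,1)$, $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$, and the two are independent, so $T \sim t_{n-1}$ exactly. If the population is not normal, the numerator is only approximately $N(0,1)$ for large $n$ (CLT), while $S^2/\sigma^2 \to 1$ in probability, so by Slutsky's theorem

$$
T \xrightarrow{d} N(0,1) \quad \text{as } n \to \infty ,
$$

and since $t_{n-1}$ also converges to $N(0,1)$, the t-test's critical values are approximately right in large samples. The exact $t$ distribution is a normal-population result; the large-sample validity is an approximation whose quality depends on $n$ and on how heavy-tailed or skewed the population is.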
