Why a T-statistic needs the data to follow a normal distribution

Question

I was looking at this notebook, and I am puzzled by this statement:

When we talk about normality what we mean is that the data should look like a normal distribution. This is important because several statistic tests rely on this (e.g. t-statistics).

I don't understand why a T-statistic needs the data to follow a normal distribution.

Indeed, Wikipedia says the same thing:

Student's t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population

However, I don't understand why this assumption is necessary.

Nothing from its formula indicates to me that the data has to follow a normal distribution:

I looked a bit at its definition, but I don't understand why the condition is necessary.

Greenparker · Accepted Answer · 2017-12-20T18:12:31.060

The information you require is in the "Characterization" section of the Wikipedia page. A $t$-distribution with $\nu$ degrees of freedom may be defined as the distribution of the random variable $T$ such that $$T = \dfrac{Z}{\sqrt{V/\nu}} \,,$$ where $Z$ is a standard normal random variable and $V$ is a $\chi^2$ random variable with $\nu$ degrees of freedom. In addition, $Z$ and $V$ must be independent. So given any $Z$ and $V$ that satisfy the above definition, you arrive at a random variable that has a $t$-distribution.
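As a rough illustration of this characterization, the following sketch draws independent $Z$ and $V$, forms $T = Z/\sqrt{V/\nu}$, and compares a few empirical quantiles with those of `scipy.stats.t`. The choice $\nu = 5$, the sample size, the seed, and the helper name `simulate_t_from_definition` are assumptions made only for this example.

```python
import numpy as np
from scipy import stats


def simulate_t_from_definition(nu, size, seed=0):
    """Draw T = Z / sqrt(V / nu) with independent Z ~ N(0, 1) and V ~ chi2(nu)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(size)   # numerator: standard normal
    v = rng.chisquare(nu, size)     # denominator: chi-square, drawn independently of z
    return z / np.sqrt(v / nu)


nu = 5  # assumed degrees of freedom for the illustration
t_draws = simulate_t_from_definition(nu, size=200_000)

# Compare empirical quantiles of the constructed variable with exact t quantiles.
probs = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
empirical = np.quantile(t_draws, probs)
exact = stats.t.ppf(probs, df=nu)
for p, e, x in zip(probs, empirical, exact):
    print(f"p={p:.2f}  empirical={e:+.3f}  exact={x:+.3f}")
```

If the construction is right, the two columns should agree up to simulation noise.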

Now, suppose $X_1, X_2, \dots, X_n$ are independent and identically distributed according to a distribution $F$. Let $F$ have mean $\mu$ and variance $\sigma^2$. Let $\bar{X}$ be the sample mean and $S^2$ be the sample variance. We will then look at the identity:

$$\dfrac{\bar{X} - \mu}{S/\sqrt{n}} = \dfrac{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\frac{(n-1)S^2}{(n-1)\sigma^2}}} \,.$$

If $F$ denotes the normal distribution, then $\bar{X} \sim N(\mu, \sigma^2/n)$, and thus $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$. In addition, $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$ by Cochran's theorem. Finally, by an application of Basu's theorem, $\bar{X}$ and $S^2$ are independent. This then implies that the resulting statistic has a $t$-distribution with $n-1$ degrees of freedom.

If the original data distribution $F$ is not normal, then the exact distributions of the numerator and denominator will not be standard normal and $\chi^2$, respectively, and thus the resulting statistic will not have a $t$-distribution.
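The contrast between normal and non-normal data can be checked by simulation. The sketch below is only an illustration under assumed settings (sample size $n = 5$, an exponential distribution as the non-normal example, a fixed seed, and a hypothetical helper `t_statistics`). It estimates how often the statistic falls below the 5% quantile of $t_{n-1}$; if the statistic really were $t_{n-1}$ distributed, that frequency should be close to 0.05.

```python
import numpy as np
from scipy import stats


def t_statistics(sampler, mu, n, reps, seed=0):
    """Return `reps` values of (xbar - mu) / (s / sqrt(n)) from samples of size n."""
    rng = np.random.default_rng(seed)
    x = sampler(rng, (reps, n))      # one simulated sample per row
    xbar = x.mean(axis=1)
    s = x.std(axis=1, ddof=1)        # sample standard deviation (n - 1 divisor)
    return (xbar - mu) / (s / np.sqrt(n))


n, reps = 5, 200_000  # assumed settings for the illustration

# Normal data with true mean 10, and exponential data with true mean 1.
t_normal = t_statistics(lambda rng, shape: rng.normal(10.0, 2.0, shape), 10.0, n, reps)
t_expon = t_statistics(lambda rng, shape: rng.exponential(1.0, shape), 1.0, n, reps)

# Frequency of falling below the nominal 5% quantile of t_{n-1}.
crit = stats.t.ppf(0.05, df=n - 1)
lower_normal = np.mean(t_normal < crit)
lower_expon = np.mean(t_expon < crit)
print(f"normal data:      P(T < t_0.05) ~ {lower_normal:.3f}")
print(f"exponential data: P(T < t_0.05) ~ {lower_expon:.3f}")
```

The skewed case is chosen deliberately: with a small sample from a right-skewed distribution, the lower tail of the statistic tends to be much heavier than the $t_{n-1}$ reference suggests.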

I've always found it quite interesting how much mathematical technology goes into these foundational results in mathematical statistics. – Matthew Drury, Dec 20 '17 at 16:04
Good post. However, we don't need to invoke those big theorems to prove the independence between $\bar{X}$ and $S$, as well as the $\chi^2$ distribution. See [the first answer](https://stats.stackexchange.com/questions/312337/easy-proof-of-sum-i-1n-leftz-i-barz-right2-sim-chi2-n-1) of this post. – Zhanxiong, Dec 20 '17 at 21:40

score 4 · Answer 2 · answered Aug 26 '21 at 22:30

Just to add to the earlier responses something I think is relevant to the question, albeit possibly only indirectly: as pointed out in the answers, the normality of the data is both necessary and sufficient for the t-statistic to have a t-distribution (hence, a characterization of it as a t-distributed random variable). This is because the normality of the data also characterizes the independence of the sample mean and sample variance (see, e.g., Lukacs (1942), "A characterization of the normal distribution," Annals of Mathematical Statistics, 13(1), 91-93), which is crucial to the t-statistic having a t-distribution. An investigation of the necessity and sufficiency of the normality of the data for the t-distribution in this case is provided in Chen and Adatia (1997), "Independence and t distribution," The American Statistician, 51(2), 176-177.
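A quick numerical check of the independence property is sketched below. It estimates the correlation between the sample mean and the sample variance across many simulated samples, once for normal data and once for exponential data. The sample size, the number of replications, the seed, and the helper name `mean_variance_correlation` are assumptions for the example. Note that a correlation near zero does not by itself prove independence, whereas a clearly nonzero correlation does rule it out.

```python
import numpy as np


def mean_variance_correlation(sampler, n, reps, seed=0):
    """Correlation between sample mean and sample variance over `reps` samples."""
    rng = np.random.default_rng(seed)
    x = sampler(rng, (reps, n))          # one simulated sample per row
    means = x.mean(axis=1)
    variances = x.var(axis=1, ddof=1)    # unbiased sample variance
    return np.corrcoef(means, variances)[0, 1]


corr_normal = mean_variance_correlation(
    lambda rng, shape: rng.normal(0.0, 1.0, shape), n=10, reps=100_000
)
corr_expon = mean_variance_correlation(
    lambda rng, shape: rng.exponential(1.0, shape), n=10, reps=100_000
)
print(f"normal data:      corr(mean, variance) ~ {corr_normal:+.3f}")
print(f"exponential data: corr(mean, variance) ~ {corr_expon:+.3f}")
```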

score 2 · Answer 3 · answered Dec 20 '17 at 20:15

I think there may be some confusion between the statistic and its formula, versus the distribution and its formula. You can apply the t-statistic formula to any dataset and get a "t-statistic", but this statistic will not be distributed according to Student's t-distribution unless the data came from a normal distribution (or at least, it is not guaranteed to be; my guess is that non-normal distributions won't produce a Student's t-distribution when the t-statistic formula is applied, but I'm not certain of that). The reason for this is simply that the distribution of the t-statistic is derived from the distribution of the data that generated it, so if you have a different underlying distribution, then you're not guaranteed to have the same distribution for derived statistics.
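To make the distinction concrete, here is a minimal sketch that applies the one-sample t-statistic formula to an arbitrary, visibly skewed dataset and checks it against `scipy.stats.ttest_1samp`. The data values, the hypothesized mean `mu0 = 2.0`, and the helper name `t_statistic` are made up for illustration. The formula evaluates without complaint; whether the reported p-value is trustworthy is a separate question that depends on the distribution of the data.

```python
import numpy as np
from scipy import stats


def t_statistic(x, mu0):
    """One-sample t-statistic: (xbar - mu0) / (s / sqrt(n))."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))


# Hypothetical, clearly non-normal data (one large outlier, all values positive).
data = np.array([0.2, 3.1, 0.7, 1.5, 9.8, 0.4, 2.2])

t_manual = t_statistic(data, mu0=2.0)
result = stats.ttest_1samp(data, popmean=2.0)

print(f"manual t-statistic: {t_manual:.4f}")
print(f"scipy t-statistic:  {result.statistic:.4f}")
# The p-value below is computed from the t_{n-1} distribution, which is only
# exact if the data are normal.
print(f"scipy p-value:      {result.pvalue:.4f}")
```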

Geoffrey Johnson · Answer 4 · 2021-08-28T02:02:04.820

score -4

All that is needed is that $\bar{X}$ is normally distributed. If $\bar{X}$ is exactly normally distributed (not approximately normal) then the $X_i$ are normally distributed, $(n-1)S^2/\sigma^2$ is chi-square distributed and independent of $\bar{X}$, and $\frac{\sqrt{n}(\bar{X}-\mu)}{S}\sim T_{n-1}$. If $\bar{X}$ is only normally distributed asymptotically there is no guarantee that $\bar{X}$ and $S$ are independent nor that $(n-1)S^2/\sigma^2$ is chi-square distributed, but $\frac{\sqrt{n}(\bar{X}-\mu)}{S}\overset{asymp}{\sim}N(0,1)$ and of course a $T_{n-1}$ distribution and a $N(0,1)$ distribution are indistinguishable asymptotically.

Below is a histogram of $X_1,...,X_{100}\sim Gamma(2,3)$ with mean $\mu=2\times 3=6$, and below that is the sampling distribution of $\bar{X}$.

Of course the sample standard deviation is not independent of the sample mean as evidenced by the scatter plot below.

Nevertheless, the sampling distribution of $\sqrt{n}(\bar{X}-\mu)/S$ is well approximated by a $T_{n-1}$ distribution, i.e. $\sqrt{n}(\bar{X}-\mu)/S\overset{asymp}{\sim} T_{n-1}$.

For $\sqrt{n}(\bar{X}-\mu)/S$ to be exactly $T_{n-1}$ distributed for any sample size, the $X_i$ must come from a normal distribution.
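The plots referred to above are not reproduced here, but a simulation along the same lines can be sketched as follows. It assumes that $Gamma(2,3)$ means shape 2 and scale 3 (consistent with the stated mean $2\times 3=6$); the number of replications and the seed are arbitrary choices for the example. The code prints the correlation between the sample mean and the sample standard deviation, and the frequency with which the statistic falls in each 2.5% tail of $T_{n-1}$, so the quality of the approximation can be judged one tail at a time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed, an arbitrary choice
n, reps = 100, 50_000            # n matches the answer; reps is an assumption
shape, scale = 2.0, 3.0          # assumed parameterization of Gamma(2, 3)
mu = shape * scale               # true mean, 6

x = rng.gamma(shape, scale, size=(reps, n))   # one sample of size n per row
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)
t_stat = np.sqrt(n) * (xbar - mu) / s

# Dependence between the sample mean and the sample standard deviation.
corr_mean_sd = np.corrcoef(xbar, s)[0, 1]

# Tail frequencies relative to the nominal 2.5% quantiles of T_{n-1}.
lo, hi = stats.t.ppf([0.025, 0.975], df=n - 1)
lower_tail = np.mean(t_stat < lo)
upper_tail = np.mean(t_stat > hi)

print(f"corr(xbar, s)        ~ {corr_mean_sd:+.3f}")
print(f"P(T < t_0.025, n-1)  ~ {lower_tail:.4f}  (nominal 0.025)")
print(f"P(T > t_0.975, n-1)  ~ {upper_tail:.4f}  (nominal 0.025)")
```

Reporting the two tails separately is a deliberate choice: for skewed data the errors in the two tails can partly cancel, so the two-sided rate may look closer to nominal than either one-sided rate.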

edited Aug 28 '21 at 02:02

answered Aug 26 '21 at 23:45


What about the distribution of the denominator? and the independence of the numerator and denominator? – user551504 Aug 26 '21 at 23:58
Again, this can be achieved asymptotically. Of course having subject-level observations that are indeed normally distributed provides the required distribution of the denominator as well as the independence of the numerator and denominator, even in small sample sizes. – Geoffrey Johnson Aug 27 '21 at 14:43
Sorry, it's just not true, certainly not only due to the numerator being asymptotically normal. See https://arxiv.org/pdf/2012.14530.pdf for some recent details. No one is saying that the t test is not at all robust to violations of assumptions--it's just much less robust than you realize and your arguments are flawed – user551504 Aug 27 '21 at 15:47
What I have written is not at odds with your arguments. If $\bar{X}$ is exactly normally distributed, then so too are the $X_i$ and the t-statistic follows a $T_{n-1}$ distribution. If $\bar{X}$ is normally distributed asymptotically, then the t-statistic is asymptotically $T_{n-1}$ distributed. – Geoffrey Johnson Aug 27 '21 at 22:58
Most people would agree that if $\bar{X}$ is asymptotically normally distributed then the t-statistic is asymptotically standard normal. Most people would also agree that a t-distribution is indistinguishable from a standard normal distribution asymptotically. Is it a minor technicality that precludes me from saying the t-statistic is asymptotically t-distributed if $\bar{X}$ is asymptotically normally distributed? – Geoffrey Johnson Aug 27 '21 at 23:32
Geoffrey, I think you're missing the point. You're not wrong due to some minor technicality. Perhaps you could write a question on when it is appropriate to use a t test if you'd like to know more? – user551504 Aug 28 '21 at 15:38
I stand by my first statement - all that is needed is that $\bar{X}$ is normally distributed (not approximately). *This implies that* $X_i$ is normally distributed, $(n-1)S^2/\sigma^2$ follows a chi-square distribution and is independent of $\bar{X}$, and the $t$ statistic follows a $T_{n-1}$ distribution. I have added these details to my answer. Please tell me how any of this is wrong and what point I am missing. If you have downvoted my answer, please undo it. – Geoffrey Johnson Aug 28 '21 at 20:25
I hope your last few days have gone well. Geoffrey, I highly encourage you to ask a question on this site. Perhaps the title could be "To what extent is the $t$ test robust to nonnormality?". – user551504 Aug 31 '21 at 02:05
