12

I was looking at this notebook, and I am puzzled by this statement:

> When we talk about normality what we mean is that the data should look like a normal distribution. This is important because several statistic tests rely on this (e.g. t-statistics).

I don't understand why a t-statistic needs the data to follow a normal distribution.

Indeed, Wikipedia says the same thing:

> Student's t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population

However, I don't understand why this assumption is necessary.

Nothing in its formula indicates to me that the data has to follow a normal distribution:

[image: formula for the t-statistic]

I looked a bit at its definition, but I still don't understand why the condition is necessary.

kjetil b halvorsen
octavian

4 Answers

19

The information you require is in the "Characterization" section of the Wikipedia page. A $t$-distribution with $\nu$ degrees of freedom may be defined as the distribution of the random variable $T$ such that $$T = \dfrac{Z}{\sqrt{V/\nu}} \,,$$ where $Z$ is a standard normal random variable and $V$ is a $\chi^2$ random variable with $\nu$ degrees of freedom. In addition, $Z$ and $V$ must be independent. Any $Z$ and $V$ that satisfy these conditions yield a random variable $T$ that has a $t$-distribution.
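As a rough illustration of this definition, the sketch below builds $T$ directly from an independent standard normal $Z$ and a $\chi^2_\nu$ variable $V$, using only the Python standard library. The function names and the choice $\nu = 4$ are assumptions made for the example; 2.776 is the tabulated two-sided 5% critical value for 4 degrees of freedom, so the printed fraction should land near 0.05.

```python
import math
import random


def draw_t(nu, rng):
    """Draw one value of T = Z / sqrt(V / nu), with Z ~ N(0, 1) and V ~ chi-square(nu)."""
    z = rng.gauss(0.0, 1.0)
    # A chi-square(nu) variable is a sum of nu squared independent standard normals.
    # These draws are separate from z, so Z and V are independent by construction.
    v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return z / math.sqrt(v / nu)


def tail_fraction(nu, crit, reps=100_000, seed=0):
    """Fraction of simulated |T| values that exceed the critical value `crit`."""
    rng = random.Random(seed)
    hits = sum(abs(draw_t(nu, rng)) > crit for _ in range(reps))
    return hits / reps


# 2.776: tabulated two-sided 5% critical value of the t-distribution with 4 df.
print(tail_fraction(nu=4, crit=2.776))
```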

Now, suppose $X_1, X_2, \dots, X_n$ are independent draws from a distribution $F$ with mean $\mu$ and variance $\sigma^2$. Let $\bar{X}$ be the sample mean and $S^2$ be the sample variance. Consider the identity:

$$\dfrac{\bar{X} - \mu}{S/\sqrt{n}} = \dfrac{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\frac{(n-1)S^2/\sigma^2}{n-1}}} \,.$$

If $F$ is a normal distribution, then $\bar{X} \sim N(\mu, \sigma^2/n)$, and thus $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$. In addition, $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$ by Cochran's theorem. Finally, by an application of Basu's theorem, $\bar{X}$ and $S^2$ are independent. The statistic therefore has the form $Z/\sqrt{V/\nu}$ with $\nu = n-1$, which implies that it has a $t$-distribution with $n-1$ degrees of freedom.

If the original data distribution $F$ is not normal, then the numerator and the scaled sample variance in the denominator are no longer exactly standard normal and $\chi^2$, respectively, and thus the resulting statistic will not have a $t$-distribution.
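To make the contrast concrete, here is a minimal simulation sketch, assuming samples of size $n = 5$ and an exponential distribution as the non-normal example (both are choices made for illustration, as are the function names). It estimates how often the statistic falls below $-2.776$ or above $2.776$, the tabulated two-sided 5% critical value for 4 degrees of freedom. Under normal data the two tails should each be close to 0.025; under skewed data they need not be.

```python
import math
import random

T_CRIT_4DF = 2.776  # tabulated two-sided 5% critical value, 4 degrees of freedom


def t_stat(sample, mu):
    """One-sample t-statistic (mean - mu) / (S / sqrt(n))."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (mean - mu) / math.sqrt(var / n)


def tail_rates(draw, mu, n=5, reps=50_000, seed=0):
    """Estimate P(T < -crit) and P(T > crit) when the data come from `draw`."""
    rng = random.Random(seed)
    lower = upper = 0
    for _ in range(reps):
        t = t_stat([draw(rng) for _ in range(n)], mu)
        lower += t < -T_CRIT_4DF
        upper += t > T_CRIT_4DF
    return lower / reps, upper / reps


# Normal data with true mean 0, then exponential data with true mean 1.
print("normal data:     ", tail_rates(lambda rng: rng.gauss(0.0, 1.0), mu=0.0))
print("exponential data:", tail_rates(lambda rng: rng.expovariate(1.0), mu=1.0))
```

The critical value is hard-coded for 4 degrees of freedom, so changing `n` would require a different constant.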

Greenparker
  • I've always found it quite interesting how much mathematical technology goes into these foundational results in mathematical statistics. – Matthew Drury Dec 20 '17 at 16:04
  • Good post. However, we don't need to invoke those big theorems to prove the independence of $\bar{X}$ and $S$, or the $\chi^2$ distribution. See [the first answer](https://stats.stackexchange.com/questions/312337/easy-proof-of-sum-i-1n-leftz-i-barz-right2-sim-chi2-n-1) of this post. – Zhanxiong Dec 20 '17 at 21:40
4

Just to add to the earlier responses something I think is relevant to the question, albeit possibly only indirectly: as pointed out in the other answers, normality of the data is both necessary and sufficient for the t-statistic to have a t-distribution (hence, it characterizes the t-statistic as a t-distributed random variable). The reason is that normality of the data also characterizes the independence of the sample mean and sample variance (see, e.g., Lukacs (1942), "A characterization of the normal distribution," Annals of Mathematical Statistics, 13(1), 91-93), and this independence is crucial to the t-statistic having a t-distribution. An investigation of the necessity and sufficiency of the normality of the data for the t-distribution in this case is provided in Chen and Adatia (1997), "Independence and t distribution," The American Statistician, 51(2), 176-177.
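The dependence between the sample mean and sample variance can be seen in a small simulation. The sketch below, which assumes samples of size 5 and an exponential distribution as the non-normal case (illustrative choices, with hypothetical function names), estimates the correlation between $\bar{X}$ and $S^2$ across many samples. A correlation near zero does not by itself prove independence, but a clearly nonzero correlation does rule it out.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def mean_variance_correlation(draw, n=5, reps=20_000, seed=0):
    """Correlation between sample mean and sample variance over repeated samples."""
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        m = sum(sample) / n
        means.append(m)
        variances.append(sum((x - m) ** 2 for x in sample) / (n - 1))
    return pearson(means, variances)


print("normal:     ", mean_variance_correlation(lambda rng: rng.gauss(0.0, 1.0)))
print("exponential:", mean_variance_correlation(lambda rng: rng.expovariate(1.0)))
```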

2

I think there may be some confusion between the statistic and its formula on the one hand, and the distribution and its formula on the other. You can apply the t-statistic formula to any dataset and get a "t-statistic", but this statistic is not guaranteed to follow Student's t-distribution unless the data came from a normal distribution (my guess is that non-normal distributions won't produce a Student's t-distribution when the t-statistic formula is applied, but I'm not certain of that). The reason is simply that the distribution of the t-statistic is derived from the distribution of the data that generated it, so if the underlying distribution is different, you're not guaranteed to have the same distribution for derived statistics.
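To illustrate the distinction, here is a minimal sketch of the one-sample t-statistic as a plain function (the name `t_statistic` and the example data are made up for illustration). It returns a number for any dataset with at least two distinct values; nothing in the computation checks or uses normality. Normality only matters when that number is compared against a t-distribution.

```python
import math


def t_statistic(sample, mu0):
    """One-sample t-statistic: (sample mean - mu0) / (S / sqrt(n))."""
    n = len(sample)
    mean = sum(sample) / n
    # Sample standard deviation with the usual n - 1 denominator.
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return (mean - mu0) / (s / math.sqrt(n))


# The formula runs happily on clearly non-normal data (one extreme value).
print(t_statistic([0.1, 0.2, 0.2, 0.3, 9.0], mu0=1.0))
```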

Acccumulation
-4

All that is needed is that $\bar{X}$ is normally distributed. If $\bar{X}$ is exactly normally distributed (not approximately normal), then the $X_i$ are normally distributed, $(n-1)S^2/\sigma^2$ is chi-square distributed and independent of $\bar{X}$, and $\frac{\sqrt{n}(\bar{X}-\mu)}{S}\sim T_{n-1}$. If $\bar{X}$ is only asymptotically normally distributed, there is no guarantee that $\bar{X}$ and $S$ are independent or that $(n-1)S^2/\sigma^2$ is chi-square distributed, but $\frac{\sqrt{n}(\bar{X}-\mu)}{S}\overset{asymp}{\sim}N(0,1)$, and of course a $T_{n-1}$ distribution and a $N(0,1)$ distribution are indistinguishable asymptotically.

Below is a histogram of $X_1,\dots,X_{100}\sim \mathrm{Gamma}(2,3)$ with mean $\mu=2\times 3=6$, and below that is the sampling distribution of $\bar{X}$.

[image: histogram of the sample]

[image: sampling distribution of $\bar{X}$]

Of course the sample standard deviation is not independent of the sample mean as evidenced by the scatter plot below.

[image: scatter plot of the sample standard deviation against the sample mean]

Nevertheless, the sampling distribution of $\sqrt{n}(\bar{X}-\mu)/S$ is well approximated by a $T_{n-1}$ distribution, i.e. $\sqrt{n}(\bar{X}-\mu)/S\overset{asymp}{\sim} T_{n-1}$.

[image: sampling distribution of $\sqrt{n}(\bar{X}-\mu)/S$]

For $\sqrt{n}(\bar{X}-\mu)/S$ to be exactly $T_{n-1}$ distributed for any sample size, the $X_i$ must come from a normal distribution.
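A simulation along these lines can be sketched as follows, assuming $\mathrm{Gamma}(2,3)$ means shape 2 and scale 3 (consistent with the stated mean of 6) and using Python's `random.gammavariate`, whose second argument is the scale. The function names and the number of replications are illustrative choices. The sketch reports how often $|\sqrt{n}(\bar{X}-\mu)/S|$ exceeds 1.984, the tabulated two-sided 5% critical value for 99 degrees of freedom, together with the correlation between the sample mean and sample standard deviation.

```python
import math
import random

T_CRIT_99DF = 1.984  # tabulated two-sided 5% critical value, 99 df (valid for n = 100 only)


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_gamma(reps=5_000, n=100, shape=2.0, scale=3.0, seed=0):
    """Return (rejection rate at the nominal 5% level, corr(sample mean, sample sd))."""
    rng = random.Random(seed)
    mu = shape * scale  # true mean of the gamma distribution
    means, sds, rejections = [], [], 0
    for _ in range(reps):
        sample = [rng.gammavariate(shape, scale) for _ in range(n)]
        mean = sum(sample) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
        t = math.sqrt(n) * (mean - mu) / sd
        rejections += abs(t) > T_CRIT_99DF
        means.append(mean)
        sds.append(sd)
    return rejections / reps, pearson(means, sds)


rate, corr = simulate_gamma()
print("rejection rate:", rate)
print("corr(mean, sd):", corr)
```

A rejection rate near 0.05 alongside a clearly positive correlation would match the picture described above: the approximation is good at $n = 100$ even though the mean and standard deviation are not independent.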

Geoffrey Johnson
  • What about the distribution of the denominator, and the independence of the numerator and denominator? – user551504 Aug 26 '21 at 23:58
  • Again, this can be achieved asymptotically. Of course having subject-level observations that are indeed normally distributed provides the required distribution of the denominator as well as the independence of the numerator and denominator, even in small sample sizes. – Geoffrey Johnson Aug 27 '21 at 14:43
  • Sorry, it's just not true, certainly not only due to the numerator being asymptotically normal. See https://arxiv.org/pdf/2012.14530.pdf for some recent details. No one is saying that the t test is not at all robust to violations of assumptions; it's just much less robust than you realize, and your arguments are flawed. – user551504 Aug 27 '21 at 15:47
  • What I have written is not at odds with your arguments. If $\bar{X}$ is exactly normally distributed, then so too are the $X_i$ and the t-statistic follows a $T_{n-1}$ distribution. If $\bar{X}$ is normally distributed asymptotically, then the t-statistic is asymptotically $T_{n-1}$ distributed. – Geoffrey Johnson Aug 27 '21 at 22:58
  • Most people would agree that if $\bar{X}$ is asymptotically normally distributed then the t-statistic is asymptotically standard normal. Most people would also agree that a t-distribution is indistinguishable from a standard normal distribution asymptotically. Is it a minor technicality that precludes me from saying the t-statistic is asymptotically t-distributed if $\bar{X}$ is asymptotically normally distributed? – Geoffrey Johnson Aug 27 '21 at 23:32
  • Geoffrey, I think you're missing the point. You're not wrong due to some minor technicality. Perhaps you could write a question on when it is appropriate to use a t test if you'd like to know more? – user551504 Aug 28 '21 at 15:38
  • I stand by my first statement - all that is needed is that $\bar{X}$ is normally distributed (not approximately). *This implies that* $X_i$ is normally distributed, $(n-1)S^2/\sigma^2$ follows a chi-square distribution and is independent of $\bar{X}$, and the $t$ statistic follows a $T_{n-1}$ distribution. I have added these details to my answer. Please tell me how any of this is wrong and what point I am missing. If you have downvoted my answer, please undo it. – Geoffrey Johnson Aug 28 '21 at 20:25
  • I hope your last few days have gone well. Geoffrey, I highly encourage you to ask a question on this site. Perhaps the title could be "To what extent is the $t$ test robust to nonnormality?". – user551504 Aug 31 '21 at 02:05