
I am looking to simulate the t-distribution from first principles.

In particular, I want to understand how the distribution arises by comparing the mean of sample A of size $n_A$ (taken from population A) with the mean of sample B of size $n_B$ (from a distinct population B). Note that $n_A$ and $n_B$ are not necessarily equal.

The null hypothesis ($H_0$) of a t-test (independent samples) states that both samples come from the same population. I'm interpreting this as population A essentially being the same as population B.

To simulate the t-distribution, this is my plan (Python):

  1. create a large (N=10000) array of normally distributed values with mean $10$ and standard deviation $2$. This is a single population, because the null hypothesis assumes that both samples come from the same underlying population.

  2. iterate 1000 times as follows:

    a. take a random sample of 20 elements (sample A) and another 10 elements (sample B) from the underlying population
    b. calculate the t-statistic for this realisation of the samples
    c. record the t-statistic from the $i^\mathrm{th}$ iteration
  3. plot a histogram of all 1000 t-statistic scores
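
The plan above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: it uses the equal-variance (pooled) form of the t-statistic with $\nu = n_A + n_B - 2$, and the names `pooled_t` and `simulate_t` are hypothetical helpers introduced for illustration. Samples A and B are drawn together without replacement, so they never share an element.

```python
import numpy as np


def pooled_t(a, b):
    """Equal-variance two-sample t-statistic (assumes nu = n_A + n_B - 2)."""
    n_a, n_b = len(a), len(b)
    nu = n_a + n_b - 2
    # ddof=1 gives the unbiased sample variances s_A^2 and s_B^2
    pooled_var = ((n_a - 1) * np.var(a, ddof=1)
                  + (n_b - 1) * np.var(b, ddof=1)) / nu
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))


def simulate_t(n_iter=1000, n_a=20, n_b=10, pop_size=10_000, seed=0):
    """Repeat the sampling experiment and return all t-statistics."""
    rng = np.random.default_rng(seed)
    # Step 1: one finite "population" of normal values (mean 10, sd 2)
    population = rng.normal(loc=10.0, scale=2.0, size=pop_size)
    t_values = np.empty(n_iter)
    for i in range(n_iter):
        # Step 2a: draw n_a + n_b distinct elements and split them into A and B
        draw = rng.choice(population, size=n_a + n_b, replace=False)
        # Steps 2b and 2c: compute and record the t-statistic
        t_values[i] = pooled_t(draw[:n_a], draw[n_a:])
    return t_values


t_values = simulate_t()
```

Step 3 would then be a histogram of `t_values`, for example with matplotlib's `plt.hist(t_values, bins=40, density=True)`. Under these assumptions the histogram should resemble a t-distribution with 28 degrees of freedom.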

However, step (2b) is where I have difficulty: what is the equation for calculating the t-statistic? I have found various resources online (the formulas below are rearranged slightly), but they don't appear to be entirely consistent.

Shoffma5 (slide 16) $$t = \frac{\mu_A - \mu_B}{\sqrt{ \frac{1/n_A+1/n_B}{\nu} }}\frac{1}{\sqrt{ s_A^2\big(n_A-1\big) + s_B^2\big(n_B-1\big) }}$$

ucdavis and statisticshowto $$t = \frac{\mu_A - \mu_B}{\sqrt{ \frac{1/n_A+1/n_B}{\nu}}}\frac{1}{\sqrt{ \Big(\sum A^2 - \frac{(\sum A)^2}{n_A}\Big)^2 + \Big(\sum B^2 - \frac{(\sum B)^2}{n_B}\Big)^2 }}$$

where $\mu_A$ and $\mu_B$ are the respective means of samples A and B, $s_A^2$ and $s_B^2$ are the respective sample variances, and $\nu$ is the degrees of freedom (taken here to be $n_A + n_B - 2$).
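
As a rough check, here is how the first expression relates to the usual pooled-variance form. This assumes equal population variances and $\nu = n_A + n_B - 2$; both are my assumptions rather than something quoted from the sources. With the pooled variance

$$s_p^2 = \frac{s_A^2\big(n_A-1\big) + s_B^2\big(n_B-1\big)}{\nu}, \qquad t = \frac{\mu_A - \mu_B}{\sqrt{ s_p^2\Big(\frac{1}{n_A}+\frac{1}{n_B}\Big) }}$$

the denominator factorises as

$$\sqrt{ s_p^2\Big(\frac{1}{n_A}+\frac{1}{n_B}\Big) } = \sqrt{ \frac{1/n_A+1/n_B}{\nu} }\,\sqrt{ s_A^2\big(n_A-1\big) + s_B^2\big(n_B-1\big) }$$

which is the denominator of the first expression. Since $\sum A^2 - \frac{(\sum A)^2}{n_A} = s_A^2\big(n_A-1\big)$, and likewise for B, the second expression would agree with the first only if each bracketed sum of squares entered unsquared.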

What is the correct equation to use to calculate the t-statistic (independent samples)?

Ben
  • 1. While the two formulas should be algebraically equivalent, *don't* use the second formula, since it's numerically unstable. 2. Your simulation is somewhat flawed. The only way you can have an actual normally-distributed parent population is if that population is infinite. 3. When simulating populations with equal variances, there's little point doing simulations having $\sigma\neq 1$, since any other choice is equivalent to scaling both $\sigma$ and the difference in means (e.g. $\sigma=2$ and $\mu_1-\mu_2=4$ is the same as $\sigma=1$ and $\mu_1-\mu_2=2$ for given sample size) – Glen_b Jun 24 '17 at 12:28
  • Thanks for the response Glen. 1 - can you expand on 'numerically unstable'? 2 - Sure, I get that the normal distribution is theoretical in nature since it is based on an infinite population; I was merely hoping to approximate infinity by taking a 'large' population which, for the illustrative purposes of the simulation, is good enough. 3 - fair point. – Ben Jun 24 '17 at 15:13
  • (in reverse order) 2. but it's *much* easier to do an exact infinite-population simulation quite directly. 1. See discussion in comments here: https://stats.stackexchange.com/questions/210483/whats-up-with-this-variance-computation See the answer here, which explains what the problem is: https://stats.stackexchange.com/questions/235004/is-it-possible-to-have-pearson-correlation-coefficient-values-1-or-values-1/235054 See also some of the comments under this answer: https://stats.stackexchange.com/a/235143/805 ...ctd – Glen_b Jun 25 '17 at 00:51
  • ctd... and these Wikipedia links: http://en.wikipedia.org/wiki/Loss_of_significance and https://en.wikipedia.org/wiki/Variance#Formulae_for_the_variance ... you should not implement those "mean of squares minus square of means" forms on a computer. If you need a fast one-pass algorithm, you can find mentions of them at some of the above links (though a search for *one pass variance* should hit something) – Glen_b Jun 25 '17 at 00:51
  • Just going back to the question "what is the correct equation for the t-statistic?", can I point you in the direction of another of my questions? If you feel you could provide some input there, that would help me very much. See https://stats.stackexchange.com/questions/294725/so-many-ways-to-calculate-the-t-statistic-is-this-the-super-formula-i-need-t – Ben Jul 29 '17 at 09:47
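
To illustrate the numerical-stability point from the comments, here is a minimal sketch comparing the "sum of squares minus squared sum" shortcut with a two-pass sum of squared deviations. The function names `ss_naive` and `ss_two_pass` and the shift of $10^9$ are illustrative assumptions, chosen only to make the cancellation visible in double precision.

```python
import numpy as np


def ss_naive(x):
    """Sum of squared deviations via sum(x^2) - (sum x)^2 / n (unstable)."""
    x = np.asarray(x, dtype=float)
    return np.sum(x ** 2) - np.sum(x) ** 2 / x.size


def ss_two_pass(x):
    """Sum of squared deviations via an explicit mean subtraction (stable)."""
    x = np.asarray(x, dtype=float)
    return np.sum((x - x.mean()) ** 2)


rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=20)
# Adding a constant leaves the true sum of squared deviations unchanged,
# but it makes the two large terms in the naive formula nearly cancel.
shifted = x + 1e9

print("two-pass, original data:", ss_two_pass(x))
print("two-pass, shifted data: ", ss_two_pass(shifted))
print("naive,    shifted data: ", ss_naive(shifted))
```

The two-pass results should agree closely for both datasets, whereas the naive result on the shifted data can be far off, or even negative, because two numbers of order $10^{19}$ are subtracted to obtain a value of order $10^2$.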
