
Consider the following simple Python code:

import numpy as np
import scipy.stats as stats
np.random.seed(1000)#(seed)
for i in range(10):
    X = np.random.normal(0,1,2000000)
    Y = np.random.normal(0,1,2000000)
    print(stats.ttest_ind(X,Y))

Since both $X$ and $Y$ are sampled from the same normal distribution (same mean and same variance), I expected the returned p-values to be rather high. However, on my machine I get the following output:

(-0.63504239247902361, 0.52540080331521977)
(0.95766169154615455, 0.33823343778490922)
(0.77495702809122613, 0.43836509803793244)
(1.0889385249318819, 0.27618106157897093)
(0.97293351367380154, 0.33058640601636757)
(1.3556340433654583, 0.17521571107906755)
(0.58921373947134548, 0.55571793482831677)
(0.91391361607063426, 0.36076229350915234)
(0.62066063893057222, 0.5348229679159715)
(2.3510055029223147, 0.018722805219522423)

Notice that for $\alpha=0.05$ the last pair of samples would suggest that I should reject the null hypothesis, namely that $\mu_X = \mu_Y$.

As a sanity check, I ran

stats.ttest_ind(X,X)

which yields (0.0, 1.0), as expected. Furthermore, I also ran the following:

import numpy as np
import scipy.stats as stats
np.random.seed(1000)#(seed)
for i in range(10):
    X = np.random.normal(0,1,2000000)
    Y = np.random.normal(10,1,2000000) # Different mean for Y
    print(stats.ttest_ind(X,Y))

which always yielded p-values of zero, as I expected.

The question: what is happening here? Why do I see this behavior, and what am I missing?

Dror Atariah
    Although it's hard to determine exactly what your question is, these are matters that are thoroughly discussed at http://stats.stackexchange.com/questions/31. Perhaps you will find the answers there to be helpful. – whuber Oct 21 '15 at 15:44

1 Answer


Good question! I suspect your real question is: why do I get type-I errors even when I use 2 million samples? If you think about it, your observation really depends only on $\alpha$ (the size of the test). For $\alpha = 0.05$, if you repeated the test 100 times under the null hypothesis, you would expect a type-I error about 5 times on average. In your case, you observed one type-I error in 10 tests, which is entirely plausible: the probability of exactly one rejection is $\binom{10}{1} \cdot 0.05 \cdot 0.95^9 \approx 0.315$ (see the sketch after the experiment below). A more relevant experiment might be:

import numpy as np
import scipy.stats as stats

np.random.seed(1000)  # seed the RNG for reproducibility
alpha = 0.05
n = 1000
count = 0
for i in range(n):
    X = np.random.normal(0, 1, 2000)
    Y = np.random.normal(0, 1, 2000)
    # Count a rejection whenever the p-value falls at or below alpha
    count += 1 if stats.ttest_ind(X, Y)[1] <= alpha else 0
print(count / float(n))

Output: 0.055 (in line with the chosen $\alpha$)
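As a quick cross-check of the binomial arithmetic above (a minimal sketch, not part of the original answer, using scipy.stats.binom): the chance of exactly one rejection in 10 independent tests at $\alpha = 0.05$ is about 0.315, and the chance of at least one rejection is about 0.40.

import scipy.stats as stats

alpha = 0.05
n_tests = 10
# Probability of exactly one rejection among 10 independent tests under the null
print(stats.binom.pmf(1, n_tests, alpha))      # ~0.315
# Probability of at least one rejection
print(1 - stats.binom.pmf(0, n_tests, alpha))  # ~0.401

Note also that with $n = 1000$ repetitions the empirical rejection rate has a standard error of roughly $\sqrt{0.05 \cdot 0.95 / 1000} \approx 0.007$, so an observed rate of 0.055 is well within the range expected when the true rate is 0.05.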

vdesai