
For a better understanding of the $\chi^2$-test (and for teaching), I want to build a Monte Carlo (MC) approximation of the $\chi^2$-distribution for a simple example (a 2x2 contingency table). This is a typical setting for an independence test:

Assume we have two binary variables X and Y, and we measure an (X, Y) pair n times. The number of observations with X=0 is fixed, e.g. 400, and the number with X=1 is also fixed, e.g. 600. So we have n = 1000 observations.

We want to test for independence. The data can be summarized in a table, e.g. we observe:

|     | Y=0 | Y=1 |  sum |
|-----|-----|-----|------|
| X=0 | 350 |  50 |  400 |
| X=1 | 250 | 350 |  600 |
| sum | 600 | 400 | 1000 |

To my knowledge, we can use Pearson's $\chi^2$ test for independence with (2-1)(2-1) = 1 degree of freedom.
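
For reference (standard definitions, not spelled out in the original post): the Pearson statistic compares the observed counts $O_{ij}$ with the counts expected under independence,

$$\chi^2 \;=\; \sum_{i,j}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad E_{ij} \;=\; \frac{n_{i\cdot}\,n_{\cdot j}}{n}, \qquad \text{df} = (r-1)(c-1) = 1.$$

As a quick sanity check (a sketch, not part of the original code), `scipy.stats.chi2_contingency` computes this statistic for the observed table; `correction=False` disables the Yates continuity correction:

import numpy as np
from scipy.stats import chi2_contingency

obs = np.array([[350, 50], [250, 350]])
stat, pval, dof, expected_tab = chi2_contingency(obs, correction=False)
print(dof)           # 1 for a 2x2 table
print(expected_tab)  # [[240. 160.] [360. 240.]]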


Now we simulate the situation:

import numpy as np
import scipy.stats
import matplotlib.pyplot as plt

nb_simulations = 100000

t = 10  # scale factor to make the cell entries larger
data = np.array([[35, 5], [25, 35]]) * t  # observed table (not used directly in the simulation)

# p(Y) estimated from a table d (column sums / total):
p_ = lambda d: d.sum(axis=0) / d.sum()


#### simulation

def simulate_data(p, n1, n2):
    # draws the two cell counts of each row independently,
    # so a row need not sum to n1 (or n2) -- see the P.S. below
    a = np.random.binomial(n=n1, p=p).reshape(1, 2)
    b = np.random.binomial(n=n2, p=p).reshape(1, 2)
    return np.concatenate((a, b), axis=0)

def get_expectation(p, n1, n2):
    # expected cell counts under independence, given row totals n1 and n2
    a = (p * n1).reshape(1, 2)
    b = (p * n2).reshape(1, 2)
    return np.concatenate((a, b), axis=0)

sample_distribution_fix = []
sample_distribution_estimated = []


p = np.array([0.6, 0.4])  # assume this is the true p(Y) = p(Y|X), i.e. Y is independent of X

sim_expected_fix = get_expectation(p, 40*t, 60*t)

for i in range(nb_simulations): 
    sim_data = simulate_data(p, 40*t, 60*t)

    # fixed (true) p
    sample_distribution_fix.append(((sim_data - sim_expected_fix)**2 / sim_expected_fix).sum())

    # p estimated from the data (this reduces the degrees of freedom of the test?)
    sim_expected = get_expectation(p_(sim_data), 40*t, 60*t)
    sample_distribution_estimated.append(((sim_data - sim_expected)**2 / sim_expected).sum())

# plot the sample distributions

df=1
x = np.linspace(scipy.stats.chi2.ppf(0.3, df), scipy.stats.chi2.ppf(0.99, df), 100)
plt.plot(x, scipy.stats.chi2.pdf(x, df), 'r-', lw=2, alpha=0.6, label=r'$\chi^2$ pdf')
plt.hist(sample_distribution_fix, bins=100, density=True, label='sampling fixed p', alpha=0.6)
plt.hist(sample_distribution_estimated, bins=100, density=True, label='sampling estimated p', alpha=0.6)
plt.xlim(0,7)
plt.title(r"$\chi^2$ distribution")
plt.legend();

Note: there are two simulations. One uses the fixed (true) p(Y); the other estimates p(Y) from the corresponding sample. The latter is the situation we have in practice (and it reduces the degrees of freedom by 1).
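
To quantify the mismatch (a small check on the lists filled above, not in the original post): under a correct $\chi^2(1)$ reference, the empirical rejection rate at the nominal 5% level should be close to 0.05.

q95 = scipy.stats.chi2.ppf(0.95, df=1)
print(np.mean(np.array(sample_distribution_fix) > q95))        # rejection rate with fixed p
print(np.mean(np.array(sample_distribution_estimated) > q95))  # rejection rate with estimated p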

[Figure: MC sampling distributions of the two statistics vs. the $\chi^2(1)$ pdf]


If the row sums and the column sums are fixed (simulated with a hypergeometric distribution), I do get the expected $\chi^2$-distribution by MC. However, this is not the data-generating process. Note that here the constraint enforces that the estimated $p(Y)$ equals the true $p(Y)$, so the simulation becomes much easier.
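
Concretely (a standard conditioning argument, not spelled out in the original post): with both margins fixed, the table has a single free cell $s_{00}$, and

$$P(s_{00}=k) \;=\; \frac{\binom{400}{k}\binom{600}{600-k}}{\binom{1000}{600}},$$

i.e. the $Y=0$ column of size 600 is drawn without replacement from a population of 400 items with $X=0$ and 600 items with $X=1$.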

[Figure: hypergeometric MC sampling distribution vs. the $\chi^2(1)$ pdf]

# fixed row AND column sums: the table has one free cell, which is hypergeometric
# https://en.wikipedia.org/wiki/Hypergeometric_distribution#Working_example
nb_simulations = 100000

s00 = np.random.hypergeometric(ngood=40*t, nbad=60*t, nsample=60*t, size=nb_simulations)
s = np.empty((nb_simulations, 2, 2))
s[:, 0, 0] = s00
s[:, 0, 1] = 40*t - s00         # row X=0 sums to 40*t
s[:, 1, 0] = 60*t - s00         # column Y=0 sums to 60*t
s[:, 1, 1] = 60*t - s[:, 1, 0]  # row X=1 sums to 60*t
expected = get_expectation(p, 40*t, 60*t)  # expected counts under independence (same as sim_expected_fix above)
sample_distribution = ((s - expected)**2 / expected).sum(axis=(1, 2))
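
For completeness, the corresponding plot (a sketch mirroring the plotting code above):

df = 1
x = np.linspace(scipy.stats.chi2.ppf(0.3, df), scipy.stats.chi2.ppf(0.99, df), 100)
plt.plot(x, scipy.stats.chi2.pdf(x, df), 'r-', lw=2, alpha=0.6, label=r'$\chi^2$ pdf')
plt.hist(sample_distribution, bins=100, density=True, alpha=0.6, label='sampling fixed margins')
plt.xlim(0, 7)
plt.legend();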

So I have the following questions:

  • Is the assumption wrong that we can use the $\chi^2$ test for the example above?
  • Is there an issue with the sampling (using two binomial distributions)? Or why can I not reproduce the $\chi^2$-distribution by MC?

Thanks.


P.S.: (5.12.2019)

Thanks, it was a bug in the binomial sampler (simulate_data_binominal). The correct way to simulate is:

def simulate_data_binominal(p, n1, n2):
    # draw only the Y=1 count per row; the Y=0 count is the complement,
    # so the row totals n1 and n2 stay fixed
    a = np.random.binomial(n=n1, p=p[1])
    b = np.random.binomial(n=n2, p=p[1])
    return np.array([[n1 - a, a], [n2 - b, b]])
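
Re-running the MC loop from above with the corrected sampler (a minimal sketch reusing p, p_, get_expectation, and t as defined earlier):

sample_distribution_corrected = []
for i in range(nb_simulations):
    sim_data = simulate_data_binominal(p, 40*t, 60*t)
    sim_expected = get_expectation(p_(sim_data), 40*t, 60*t)
    sample_distribution_corrected.append(((sim_data - sim_expected)**2 / sim_expected).sum())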

Now the simulation looks good:

[Figure: MC sampling distribution with the corrected sampler, matching the $\chi^2(1)$ pdf]

Comments:

  • At https://stats.stackexchange.com/a/17148/919 I discuss the conditions required for the chi-squared test to apply and how to determine the degrees of freedom. – whuber Dec 04 '19 at 17:14
  • Thanks for the link. To my limited understanding, that means there is a fractional degree of freedom, which fits the simulation well. However, textbook knowledge says to use 1 in this case. – chris elgoog Dec 05 '19 at 09:28
