Simulations of Chi-Square Tests on 2 x 2 table without using the chi-square distribution

Question

I'd like to simulate the chi-square test without using the chi-square distribution on the following 2×2 table.

I wrote a function, "chiq_2by2", in R (see "#main function" in Box 1 below). The function itself seems to calculate the correct chi-square value from the given TA, TB, FA, and FB.

I tried to obtain the distribution of the chi-square values by varying TA and TB with two different methods (see Box 1 below):

a method using random values (see "#The method using random value" in Box 1), and
a round-robin method that enumerates every combination of TA and TB (see "#The method based on round robin" in Box 1).

However, the resulting distributions are far from the chi-square distribution with 1 degree of freedom, although the two methods seem to agree with each other.

I made further modifications to the Box 1 code to fix the marginal totals, but the results still seem to be far from the chi-square distribution with 1 degree of freedom (see Box 2 and Fig. 3).

My Question

How should TA and TB (or something else) be varied to obtain a chi-square distribution with 1 degree of freedom?

The distribution obtained by the method using random value is shown in Fig.1.
Fig.1

The distribution obtained by the method based on round robin is shown in Fig.2.
Fig.2

In both figures, the red line represents the chi-square distribution with 1 degree of freedom.

Box 1:

#main function
chiq_2by2<-function(TA,TB,FA,FB){
  nA=TA+FA;nB=TB+FB; ntot=nA+nB
  nF=FA+FB;nT=TA+TB
  ETA=(nT*nA)/ntot;EFA=(nF*nA)/ntot
  ETB=(nT*nB)/ntot;  EFB=(nF*nB)/ntot
  
  ch=((TA-ETA)^2)/(ETA);ch=ch+((TB-ETB)^2)/(ETB)
  ch=ch+((FA-EFA)^2)/(EFA);ch=ch+((FB-EFB)^2)/(EFB)
  return(ch)
}


#The method using random value
A_tot=100;B_tot=50

numb=10000
sc1<-numeric(numb)
for(i in 1:numb){
  TA=floor(runif(1, min=0, max=A_tot));  FA=A_tot-TA
  TB=floor(runif(1, min=0, max=B_tot));  FB=B_tot-TB
  sc1[i]=chiq_2by2(TA,TB,FA,FB)
}

#The method based on round robin.
A_tot=100; B_tot=50
#both loops start at 0, so there are (A_tot+1)*(B_tot+1) tables
sc2<-numeric((A_tot+1)*(B_tot+1));cnt=0
for(i in 0:A_tot){
  for(j in 0:B_tot){
    TA=i;  FA=A_tot-TA
    TB=j;    FB=B_tot-TB
    cnt=cnt+1
    sc2[cnt]=chiq_2by2(TA,TB,FA,FB)   
  }
}

#Drawing Histograms and Distributions
par(mfrow=c(1,2))

hist(sc1 ,freq=F);curve(dchisq(x,1),col="red",add=T)
hist(sc2 ,freq=F,col="#edae00");curve(dchisq(x,1),col="red",add=T)

Fig.3

Box 2

#main function
chiq_2by2<-function(TA,TB,FA,FB){
  nA=TA+FA;nB=TB+FB; ntot=nA+nB
  nF=FA+FB;nT=TA+TB
  ETA=(nT*nA)/ntot;EFA=(nF*nA)/ntot
  ETB=(nT*nB)/ntot;  EFB=(nF*nB)/ntot
  
  ch=((TA-ETA)^2)/(ETA);ch=ch+((TB-ETB)^2)/(ETB)
  ch=ch+((FA-EFA)^2)/(EFA);ch=ch+((FB-EFB)^2)/(EFB)
  return(ch)
}

#The method using random value(2)
n_A=140
n_B=60
n_T=130
n_F=n_A+n_B-n_T

numb=10000
sc3<-numeric(0);cnt=0  #cnt is incremented in the loop below, so initialize it here

A_tot=n_A;B_tot=n_B
for(i in 1:numb){
  TA=floor(runif(1, min=0, max=A_tot));  FA=A_tot-TA
  TB=floor(runif(1, min=0, max=B_tot));  FB=B_tot-TB

  br1<-(TA+TB==n_T);br2<-(FA+FB==n_F)
  br3<-(TA+FA==n_A);br4<-(TB+FB==n_B)
  br=br1*br2*br3*br4
  
  if (br==1){
    cnt=cnt+1
    sc3=c(sc3,chiq_2by2(TA,TB,FA,FB))  
  }
}

#Round robin (2)
n_A=140
n_B=60
n_T=130
n_F=n_A+n_B-n_T


sc4<-numeric(0);cnt=0
A_tot=n_A; B_tot=n_B
for(i in 0:A_tot){
  for(j in 0:B_tot){
    TA=i;  FA=A_tot-TA
    TB=j;    FB=B_tot-TB
    
    br1<-(TA+TB==n_T);br2<-(FA+FB==n_F)
    br3<-(TA+FA==n_A);br4<-(TB+FB==n_B)
    br=br1*br2*br3*br4
    
    if (br==1){
    cnt=cnt+1
    sc4=c(sc4,chiq_2by2(TA,TB,FA,FB))  
    }
  }
}


#Round robin (3)
n_A=140
n_B=60
n_T=130
n_F=n_A+n_B-n_T

TAmax=min(n_T,n_A)


for(TA in 0: TAmax){
FA=n_A-TA;TB=n_T-TA;FB=n_B-TB
br1<-(FA>=0);br2<-(TB>=0);br3<-(FB>=0)
br=br1*br2*br3
if (br==0){TA_min=TA}
}
TA_min=TA_min+1


TA_max=TA_min
for(TA in TA_min: TAmax){
  FA=n_A-TA;TB=n_T-TA;FB=n_B-TB
  br1<-(FA>=0);br2<-(TB>=0);br3<-(FB>=0)
  br=br1*br2*br3
  if (br==1){TA_max=TA}
}
TA_max-TA_min

cnt=0
sc5<-numeric(TA_max-TA_min+1)
for(TA in TA_min: TA_max){
  FA=n_A-TA;TB=n_T-TA;FB=n_B-TB
  cnt=cnt+1
  sc5[cnt]=chiq_2by2(TA,TB,FA,FB)
}


#Drawing Histograms and Distributions
par(mfrow=c(2,2))
hist(sc3 ,freq=F);curve(dchisq(x,1),col="red",add=T)
hist(sc4 ,freq=F);curve(dchisq(x,1),col="red",add=T)
hist(sc5 ,freq=F);curve(dchisq(x,1),col="red",add=T)

You have violated some of the basic requirements needed for the chi-squared distribution to approximate the sampling distribution of the chi-squared statistic. You will see some familiar plots in my account of this problem at https://stats.stackexchange.com/a/17148/919. — whuber, Oct 27 '20 at 20:33
@whuber Thank you for your comment. Which specific basic requirements for approximating the sampling distribution am I violating? I thought it was going to be 2 degrees of freedom since it fluctuates between the two factors, but that doesn't seem to be the case. — Blue Various, Oct 28 '20 at 01:30
You aren't using ML (based on the counts) to estimate a parameter of a distribution and the estimates are not suitable for your data generation processes. Your plots show the chi-squared test is doing an excellent job of detecting those deviations from its null hypothesis. — whuber, Oct 28 '20 at 14:22
Thank you for your comment. What do you mean by "ML (based on the counts)"? — Blue Various, Oct 29 '20 at 00:21

score 4 · Accepted Answer · answered Nov 01 '20 at 19:27

You have a contingency table. Under the null hypothesis that there is no relationship between the row and column variables, the expected count of each cell is its row total times its column total divided by the grand total (that is, the grand total times the row and column proportions), as you have in the code.
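For reference, here is what chiq_2by2 computes, written with the same names as the code (n_A, n_B for the row totals, n_T, n_F for the column totals, n for the grand total). The closed form on the last line is the standard shortcut for a 2×2 table. It is not used in the question's code, but it gives the same value.

```latex
\begin{aligned}
E_{TA} &= \frac{n_T\, n_A}{n}, \quad
E_{FA} = \frac{n_F\, n_A}{n}, \quad
E_{TB} = \frac{n_T\, n_B}{n}, \quad
E_{FB} = \frac{n_F\, n_B}{n}, \qquad n = n_A + n_B,\\[4pt]
\chi^2 &= \sum_{c \in \{TA,\,FA,\,TB,\,FB\}} \frac{(c - E_c)^2}{E_c}
        = \frac{n\,(TA \cdot FB - TB \cdot FA)^2}{n_A\, n_B\, n_T\, n_F}.
\end{aligned}
```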

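When you simulated the data with a random uniform distribution, you drew TA and TB independently, without regard to a common row or column frequency. The two rows then generally have different proportions of T, so the data are not generated under the null hypothesis and the statistic does not follow the chi-square distribution, as your plots show and @whuber pointed out.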
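A rough way to see how far the question's scheme is from the null is to count how often the simulated statistic exceeds the 5% critical value of the chi-square distribution with 1 degree of freedom. The sketch below is only illustrative: the helper chisq_2x2 (the 2×2 shortcut formula), the seed, and the name reject_rate are my own choices, not part of the original code.

```r
# Illustrative helper (not from the original post): Pearson chi-square for a
# 2x2 table via the shortcut n*(ad-bc)^2 / (product of the four margins).
chisq_2x2 <- function(TA, TB, FA, FB) {
  nA <- TA + FA; nB <- TB + FB
  nT <- TA + TB; nF <- FA + FB
  (nA + nB) * (TA * FB - TB * FA)^2 / (nA * nB * nT * nF)
}

set.seed(1)  # arbitrary seed, assumed only for reproducibility
A_tot <- 100; B_tot <- 50; numb <- 10000

# The question's scheme: TA and TB drawn independently and uniformly.
TA <- floor(runif(numb, min = 0, max = A_tot))
TB <- floor(runif(numb, min = 0, max = B_tot))
sc_unif <- chisq_2x2(TA, TB, A_tot - TA, B_tot - TB)
sc_unif <- sc_unif[is.finite(sc_unif)]  # drop 0/0 cases (TA = TB = 0)

# Share of simulated statistics above the 5% critical value of chi-square(1).
reject_rate <- mean(sc_unif > qchisq(0.95, df = 1))
reject_rate
```

If the data were generated under the null, this share should be close to 0.05. A much larger value indicates that the test is detecting the difference between the two row proportions.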

One way to do it is to simulate the frequency of T (pT in the code below):

set.seed(111)
A_tot=100
B_tot=50
pT = runif(1)
[1] 0.5929813

We cut uniform random samples of length A_tot and B_tot at this probability and tabulate the counts:

Arow = table(cut(runif(A_tot),breaks=c(0,pT,1)))
Brow = table(cut(runif(B_tot),breaks=c(0,pT,1)))

M = rbind(Arow,Brow)
dimnames(M)=list(c("A","B"), c("T","F"))

   T  F
A 64 36
B 23 27

Then apply the chi function you have:

chiq_2by2(M["A","T"],M["B","T"],M["A","F"],M["B","F"])
[1] 4.433498
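Counting how many of n uniform draws fall below pT is the same, in distribution, as a single binomial draw with size n and probability pT, so the table can also be built with rbinom. The sketch below is a minimal variant under that assumption. It consumes the random number stream differently, so the resulting table will not match the one printed above. The names stat_builtin and stat_manual are mine.

```r
set.seed(111)
A_tot <- 100; B_tot <- 50
pT <- runif(1)  # common probability of T for both rows (the null hypothesis)

# One binomial draw per row replaces cut() + table().
TA <- rbinom(1, size = A_tot, prob = pT)
TB <- rbinom(1, size = B_tot, prob = pT)

# matrix() fills column by column: first column is T, second is F.
M <- matrix(c(TA, TB, A_tot - TA, B_tot - TB), nrow = 2,
            dimnames = list(c("A", "B"), c("T", "F")))

# Pearson statistic without Yates' continuity correction, from base R.
stat_builtin <- unname(chisq.test(M, correct = FALSE)$statistic)

# The same statistic by hand: sum of (observed - expected)^2 / expected.
E <- outer(rowSums(M), colSums(M)) / sum(M)
stat_manual <- sum((M - E)^2 / E)

c(builtin = stat_builtin, manual = stat_manual)
```

The two values should agree, which is also a quick check that chiq_2by2 matches chisq.test when the continuity correction is switched off.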

If we wrap the above and iterate:

set.seed(222)
numb = 1000
sc1<-numeric(numb)

for(i in 1:numb){
    pT = runif(1)
    Arow = table(cut(runif(A_tot),breaks=c(0,pT,1)))
    Brow = table(cut(runif(B_tot),breaks=c(0,pT,1)))
    
    M = rbind(Arow,Brow)
    dimnames(M)=list(c("A","B"), c("T","F"))
    
    sc1[i] = chiq_2by2(M["A","T"],M["B","T"],M["A","F"],M["B","F"])
}

hist(sc1,freq=FALSE,breaks=50)
curve(dchisq(x,1),col="red",add=T)
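The loop above can also be written without an explicit loop, and the agreement with the chi-square distribution can be checked numerically instead of by eye. The following is a sketch, not the only way to do it. It assumes binomial draws in place of cut() and table(), uses the 2×2 shortcut formula, and drops the rare tables with an empty column, where the statistic is 0/0. The names sc_null and reject_rate are mine.

```r
set.seed(222)
A_tot <- 100; B_tot <- 50; numb <- 10000

# A fresh common probability pT for every simulated table, as in the loop above.
pT <- runif(numb)
TA <- rbinom(numb, size = A_tot, prob = pT)
TB <- rbinom(numb, size = B_tot, prob = pT)
FA <- A_tot - TA; FB <- B_tot - TB
nT <- TA + TB;    nF <- FA + FB

# Shortcut formula for the 2x2 Pearson statistic, vectorized over all tables.
sc_null <- (A_tot + B_tot) * (TA * FB - TB * FA)^2 / (A_tot * B_tot * nT * nF)
sc_null <- sc_null[is.finite(sc_null)]  # drop tables with an empty column

# Under the null, roughly 5% of statistics should exceed the 5% critical value.
reject_rate <- mean(sc_null > qchisq(0.95, df = 1))
reject_rate

hist(sc_null, freq = FALSE, breaks = 50)
curve(dchisq(x, 1), col = "red", add = TRUE)
```

The rejection rate will not be exactly 0.05, because the counts are discrete and some simulated tables have small expected counts, but it should be of that order.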

Thanks for the answer. To be honest, I don't quite understand why a good distribution is generated by using three kinds of random numbers: the threshold that determines T/F and the samples for group A and group B. However, I think I am one step closer to understanding the nature of the chi-square test. — Blue Various, Nov 02 '20 at 05:55
Any uniform random number produces a histogram of good shape, but set.seed(222) seems to give the best one. Also, I think nA=nB=10 draws a better histogram for x near 0. The fact that the "marginal frequency is fixed" probably isn't essential either; this also surprised me. — Blue Various, Nov 02 '20 at 06:02
