
I really need some advice about using the chi-squared test of independence. I want to use the bootstrap chi-squared method for conditional independence testing. The problem is that the degrees of freedom (DOF) are really small and sometimes even 0. How should I interpret conditional independence if the p-value is 0 and the bootstrapped p-value is 1? Does it mean that I cannot reject the null hypothesis (independence)? I compute the p-value as follows:

    import scipy.stats
    pval = 1 - scipy.stats.chi2.cdf(chisq, dof)

Any suggestions?
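
For reference, chisq and dof are not defined in the snippet above. Here is a minimal sketch of one common way to obtain them, assuming the plain asymptotic test on an illustrative (made-up) 2x2 table rather than the bootstrap procedure itself:

    import numpy as np
    from scipy.stats import chi2, chi2_contingency

    # Illustrative 2x2 table of observed counts (made-up numbers).
    table = np.array([[3, 1],
                      [2, 4]])

    # chi2_contingency returns the statistic, the asymptotic p-value, the
    # degrees of freedom = (rows - 1) * (cols - 1), and the expected table.
    chisq, pval_asymptotic, dof, expected = chi2_contingency(table, correction=False)

    # Same quantity as in the question; chi2.sf is the numerically stable
    # equivalent of 1 - chi2.cdf.
    pval = chi2.sf(chisq, dof)

(A DOF of 0 corresponds to a table that collapses to a single row or a single column, since dof = (rows - 1) * (cols - 1).)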

  • I don't see the relevance of the causality tag, but you could use a chi-squared tag. Using an empirical distribution only makes sense if your sample is large enough to support reasonable estimates of it. There's an exact small-sample independence test that conditions on the margins, or you could use a chi-square statistic by sampling the set of tables with those margins (R has this built in, but it's not all that hard to do from scratch if you're using Python). – Glen_b Dec 30 '21 at 02:37
  • Could you explain what you mean by "DOF" and how you compute it? For many bootstrapped or simulated versions of chi-squared tests, degrees of freedom are irrelevant. Moreover, you certainly wouldn't find the bootstrap p-value with a chi-squared formula! – whuber Dec 30 '21 at 16:07
  • @whuber In the bootstrap method, we resample the contingency table of the "expected" values and find the $\chi^2$ value for the updated table, so the DOF (degrees of freedom) is still used to find the p-value from the distribution table. – user3441553 Dec 30 '21 at 17:48
  • @Glen_b I have a big dataset, but since I would like to prove "conditional independence", I have to run the independence test for each fixed condition, and in that case only a few samples are available per condition. So is it wise to use a small-sample independence test for each fixed condition? – user3441553 Dec 30 '21 at 18:03
  • @user3441553 Aren't the degrees of freedom in a contingency table test purely a function of the size of the table (i.e. not of the counts in each cell)? So a 2×2 contingency table always has 1 degree of freedom—specifically $\text{df} = (\text{No. rows} - 1) \times (\text{No. columns} - 1)$. – Alexis Dec 30 '21 at 18:08
  • That doesn't sound like a justifiable form of the bootstrap. What's the point of computing a p-value with the chi-squared distribution if you know at the outset the chi-squared distribution does not give reliable p-values? (After all, if it *did* give reliable p-values with your table, then why are you bootstrapping?) As far as degrees of freedom go, the usual formula is not always correct. See https://stats.stackexchange.com/a/17148/919 for details and an example. – whuber Dec 30 '21 at 18:32
  • @whuber According to this paper (https://pubmed.ncbi.nlm.nih.gov/24905809/), we should be able to use the bootstrap version of the $\chi^2$ method when the cell counts in the "expected" table are less than 5, but I agree with you that maybe we should not use the $\chi^2$ method, since the data are small for each condition. So I think another option would be the Fisher exact test. – user3441553 Dec 30 '21 at 18:47
  • Although I don't have access to that paper, I'm pretty sure it would not use the kind of bootstrap you seem to be describing. A well-known example of a contingency table bootstrap is that incorporated in the `chisq.test` function in `R`: that might be worth studying. – whuber Dec 30 '21 at 18:58
  • Indeed, this is the same thing I mention at the end of my initial comment. It's pretty straightforward to implement sampling 2x2 tables with given margins under independence (but given it's right there, it's easy to just do it in R). I like to use larger simulation sizes than R's default, though, to get more precise estimates of p-values. With a 2x2 table, lots of simulations is pretty fast; I don't see the point in avoiding waiting a few seconds. – Glen_b Dec 31 '21 at 00:27
  • I was able to take a look at the paper; in fact they do exactly this (though their algorithm for simulating from tables with the same margins under independence is ... *highly inefficient* if the marginal counts are large. This is an easily solved problem; I am astonished they were able neither to think of nor to find anything better. Don't use their algorithm; with a 2x2 table you're just sampling a hypergeometric in one cell and the rest are determined; see the sketch after these comments.) – Glen_b Dec 31 '21 at 00:57
  • @Glen_b Thanks for your responses. Do you mean that instead of bootstrapping I can use Monte Carlo simulation or the Fisher exact test? – user3441553 Jan 03 '22 at 17:28
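
To make the sampling approach Glen_b describes concrete, below is a minimal Python sketch of a Monte Carlo test of independence for a 2x2 table with both margins fixed: the (0, 0) cell is drawn from a hypergeometric distribution and the remaining three cells follow from the margins. The function name, the example counts, and the simulation size are illustrative choices, not anything prescribed in the thread.

    import numpy as np
    from scipy.stats import chi2_contingency

    def monte_carlo_chi2_pvalue(table, n_sim=100_000, seed=None):
        """Monte Carlo p-value for independence in a 2x2 table, conditioning on both margins."""
        rng = np.random.default_rng(seed)
        table = np.asarray(table)
        row = table.sum(axis=1)   # row margins
        col = table.sum(axis=0)   # column margins
        n = table.sum()

        # Observed chi-squared statistic (no continuity correction).
        chi2_obs, _, _, _ = chi2_contingency(table, correction=False)

        # Under independence with fixed margins, the (0, 0) cell is hypergeometric;
        # the other three cells are then determined by the margins.
        a = rng.hypergeometric(ngood=col[0], nbad=col[1], nsample=row[0], size=n_sim)
        sim = np.stack([a, row[0] - a, col[0] - a, n - row[0] - col[0] + a], axis=1)

        # Chi-squared statistic for every simulated table.
        expected = np.outer(row, col).ravel() / n
        chi2_sim = ((sim - expected) ** 2 / expected).sum(axis=1)

        # Proportion of simulated statistics at least as large as the observed one;
        # the +1 keeps the estimate away from exactly 0.
        return (1 + np.sum(chi2_sim >= chi2_obs)) / (1 + n_sim)

    # Example with made-up counts for one fixed condition:
    print(monte_carlo_chi2_pvalue([[3, 1], [2, 4]], seed=0))

For the exact conditional test mentioned in the comments, scipy.stats.fisher_exact can be applied directly to a 2x2 table, and R's chisq.test(x, simulate.p.value = TRUE, B = ...) performs the same kind of margin-conditional simulation for general r x c tables.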

0 Answers