Correlation or independence on contingency table for large N

Question

I have a dataset with about 35,000 individuals described by around 15 categorical variables.

I'm trying to study the independence / correlation between these 15 categorical variables. My first idea was to, for each pair of variables, create a contingency table and calculate the $\chi^2$. Then, study the overall difference in the statistic. However, because the population is so large, $\chi^2$ is always significant. I'm having difficulty interpreting and comparing the results for each pair of variables.

So, I can summarize my question as follows:

For large datasets, when I know $\chi^2$ will almost always be significant, is there an alternative test that will give more reasonable results?

I also have two ideas:

1. I was thinking of taking many bootstrap samples of, say, 1,000 individuals. On each sample, calculate the correlation, then average over all the bootstrap samples. The average should be a good representation of the overall sample, but I feel like I'm somehow cheating.
2. Can I simply compare the magnitudes of the $\chi^2$ statistic between the different pairs of variables? The degrees of freedom are different (the variables have different numbers of categories), which leads me to think this won't make sense.

I have some thoughts that I hope will add context about why you're making some of the decisions in your research. Keep in mind that your $\chi^2$ results will also be sensitive to how you transformed the continuous variables into categories. Rebinning could change the kinds of results that you obtain. Generally it's best practice to keep the data in the original units, so that the texture of the data is retained. Also, it's unclear to me why you want to do a $\chi^2$ test when you have binary outcomes and continuous predictors -- isn't that what logistic and similar regressions are for? — Sycorax, Aug 22 '13 at 15:01
None of those reasons for binning are particularly persuasive. Binning lowers power *and* increases type I error. Missing data can be dealt with. In my view, binned models are harder to interpret, not easier. And data entry errors shouldn't just be shoved into a bin, they should be dealt with. — Peter Flom, Aug 22 '13 at 15:17
Well, since you have so many data points, it seems justifiable to reduce Type I and Type II error rates. At conventional rates, you've basically wasted resources by over-collecting data. — Sycorax, Aug 22 '13 at 15:20
And a follow-up question for Mr Flom: are you saying that with regression on categorical variables, correlation analysis is unnecessary? — Drew75, Aug 23 '13 at 06:14

score 0 · Accepted Answer · edited Jun 11 '20 at 14:32

Answering my own question (because no one gave an answer) based on another post.

Unless your observations have a cost-benefit tradeoff of some kind (e.g. paying subjects), there's not really any such thing as too many. More observations give better parameter estimates.

The test you used handled 10,000,000 observations just fine.

Your "problem" isn't a problem at all. The estimate of a parameter becomes very good when N is very large to the point that any measurable deviation from no difference becomes a statistically significant difference. That doesn't mean the difference is meaningful or practically significant. That's a judgment call you'll have to make.
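To make the dependence on N explicit, the Pearson statistic can be rewritten in terms of cell proportions. The notation here is an assumption added for illustration, not taken from the original answer: $O_{ij}$ are the observed counts, $E_{ij}$ the expected counts under independence, $\hat p_{ij} = O_{ij}/N$ the observed cell proportions, and $\hat p_{i\cdot}$, $\hat p_{\cdot j}$ the row and column marginal proportions.

$$
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
       = N \sum_{i,j} \frac{(\hat p_{ij} - \hat p_{i\cdot}\,\hat p_{\cdot j})^2}{\hat p_{i\cdot}\,\hat p_{\cdot j}},
\qquad E_{ij} = N\,\hat p_{i\cdot}\,\hat p_{\cdot j}.
$$

The sum on the right depends only on the proportions, so for a fixed pattern of deviation from independence the statistic grows linearly in N, while the critical value for a given number of degrees of freedom stays fixed.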

One way to help is to calculate an effect size. Cramér's V ($\phi_c$) is typically used: $V = \sqrt{\chi^2 / (N \cdot \min(r-1, c-1))}$ for a table with $r$ rows and $c$ columns. Note that for a goodness-of-fit test you use the number of rows (categories) instead of the smallest dimension, and V is then interpreted as the tendency toward a single outcome. Cramér's V for your experiment is going to be an extraordinarily tiny number, suggesting the effect is very small.
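As a minimal sketch of that effect size for a two-way contingency table, the following pure-Python code computes the Pearson $\chi^2$ statistic and Cramér's V. The function names and the example counts are hypothetical, chosen only for illustration, and the code assumes no row or column of the table is entirely zero.

```python
from math import sqrt

def chi_square(table):
    """Return (Pearson chi-squared statistic, N) for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n

def cramers_v(table):
    """Cramér's V: chi-squared normalized by N and by min(rows, cols) - 1."""
    stat, n = chi_square(table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(stat / (n * k))

# Hypothetical counts: the same cell proportions at N = 100 and N = 10,000
small = [[30, 20], [20, 30]]
large = [[3000, 2000], [2000, 3000]]

for name, table in (("small", small), ("large", large)):
    stat, n = chi_square(table)
    print(name, "N =", n, "chi2 =", round(stat, 3), "V =", round(cramers_v(table), 3))
```

With these made-up counts the statistic grows 100-fold (from 4 to 400) while V stays at about 0.2. Because V is normalized by both N and the table dimensions, it is more comparable across pairs of variables with different numbers of categories than the raw statistic is.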

But in this case, the numbers so obviously tell a tale of a very small effect I think that just showing the numbers is sufficient. What you would say is that the expected probabilities are nearly identical to the observed and leave it at that.

In summary: the chi-squared test will show significant differences because N is large. In this case it is best to look at the size of the effect (for example via Cramér's V, which normalizes the test statistic by N and the table dimensions) rather than the p-value. Random sampling can reduce N, but it is an unsatisfactory solution.

Alternatively, calculate the confidence intervals.
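One possible reading of this suggestion is an interval for the effect size itself. The sketch below uses a percentile bootstrap for Cramér's V, resampling individuals at the full sample size; this is one assumed approach, not something the original answer specifies. The function names, the number of resamples, and the example data are all hypothetical.

```python
import random
from collections import Counter
from math import sqrt

def cramers_v_from_pairs(pairs):
    """Cramér's V computed from a list of (x, y) category pairs, one per individual."""
    n = len(pairs)
    cell = Counter(pairs)
    row = Counter(x for x, _ in pairs)
    col = Counter(y for _, y in pairs)
    k = min(len(row), len(col)) - 1
    if k < 1:
        return 0.0  # a resample with a single observed level carries no association
    stat = 0.0
    for x in row:
        for y in col:
            expected = row[x] * col[y] / n  # never zero: only observed levels are used
            stat += (cell[(x, y)] - expected) ** 2 / expected
    return sqrt(stat / (n * k))

def bootstrap_ci(pairs, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Cramér's V (resamples of the same size as the data)."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        cramers_v_from_pairs(rng.choices(pairs, k=n)) for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical data: 1,000 individuals, two binary variables, sample V of about 0.2
pairs = (
    [("a", "x")] * 300 + [("a", "y")] * 200
    + [("b", "x")] * 200 + [("b", "y")] * 300
)
lo, hi = bootstrap_ci(pairs)
print("V =", round(cramers_v_from_pairs(pairs), 3), "CI =", (round(lo, 3), round(hi, 3)))
```

Unlike averaging over small subsamples, this keeps the full N, so the interval narrows as the data grow and reports how precisely the effect is estimated. One caveat: V cannot be negative, so a percentile interval can be misleading when the true association is close to zero.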

Uh, this looks like pretty much a direct copy of the other answer, to the point that it includes information that's relevant not to your question but to the other one. Can you edit so that this answer at least relates to the specifics of your question? — Glen_b, Nov 13 '13 at 06:26
Correct. I feel bad answering my own question, but no one gave it a try. How about I keep that in quotes, then add my own bit of explaining? — Drew75, Nov 13 '13 at 18:01
You needn't feel bad for answering your own question; that's completely okay here, and it's made explicit in the help. My concern was simply that your answer could have been replaced by a link to the other one. I'm sorry you didn't get a good answer when it came up originally. — Glen_b, Nov 13 '13 at 21:33
