Comparing N-way joint frequency distributions of different subsets?

Question

Say I have binary variables:

tall
basketball player
football player
vegan
programmer
student

We are also given subsets of a population.

My objective is to find out which combination of variables best describes how a subset differs significantly from the rest of the population.

For example, say that in the entire population of 1000 people there are only 5 people who are vegan, NOT tall, and basketball players, and all 5 of them belong to subset A, which contains 20 people. Intuitively, I know that {vegan, NOT tall, basketball player} is an interesting combination of variables that distinguishes subset A from the rest of the population.
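To put a rough number on that intuition for this single combination, here is a minimal sketch. It assumes the null hypothesis that subset A is a simple random draw of 20 people from the 1000, so the count of people with the combination in A follows a hypergeometric distribution. The function name `hypergeom_upper_tail` and its parameters are my own, chosen for illustration.

```python
from math import comb

def hypergeom_upper_tail(pop_size, n_with_combo, subset_size, n_observed):
    """P(at least n_observed people with the combination land in a random
    subset of subset_size drawn without replacement from pop_size)."""
    total = comb(pop_size, subset_size)
    upper = min(n_with_combo, subset_size)
    favourable = sum(
        comb(n_with_combo, i) * comb(pop_size - n_with_combo, subset_size - i)
        for i in range(n_observed, upper + 1)
    )
    return favourable / total

# Numbers from the example: 1000 people, 5 with the combination,
# subset A has 20 people and contains all 5 of them.
p_value = hypergeom_upper_tail(1000, 5, 20, 5)
print(f"{p_value:.3g}")
```

This only checks one hand-picked combination. With 6 binary variables there are many candidate combinations, so any such check repeated over all of them would need a multiple-comparison correction.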

What types of statistical analysis should I look to for doing this in a systematic way?

score 1 · Answer 1 · answered Jun 10 '17 at 02:13

This is not my area, but something like logistic regression may work well, with subset membership as the response and the binary variables as predictors, since it provides interpretable coefficients. (For more than 2 subsets, there is multi-class logistic regression.)

I believe R's formula syntax has shortcuts for specifying the various interaction terms. However, including all of them is probably ill-advised, because of over-fitting and loss of interpretability. You could start with no interactions, then incrementally explore adding 2-way, ..., k-way interactions.
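To illustrate the incremental approach, here is a rough sketch in plain Python (not R) that enumerates the terms up to a given interaction order and builds the corresponding 0/1 design row for one person. The names `VARIABLES`, `interaction_terms`, and `design_row` are hypothetical, and the rows would still have to be passed to whatever logistic regression routine you use; this only shows how quickly the number of terms grows.

```python
from itertools import combinations

# The six binary variables from the question (identifiers are my own spelling).
VARIABLES = ["tall", "basketball_player", "football_player",
             "vegan", "programmer", "student"]

def interaction_terms(names, max_order):
    """All main effects (order 1) and interactions up to max_order."""
    terms = []
    for order in range(1, max_order + 1):
        terms.extend(combinations(names, order))
    return terms

def design_row(person, terms):
    """One 0/1 column per term: 1 only if every variable in the term is 1,
    which is the product of the corresponding binary indicators."""
    return [int(all(person[name] for name in term)) for term in terms]

# Number of columns as the maximum interaction order increases.
for k in (1, 2, 3):
    print(k, len(interaction_terms(VARIABLES, k)))

# Example person: a vegan basketball player who is not tall.
person = {"tall": 0, "basketball_player": 1, "football_player": 0,
          "vegan": 1, "programmer": 0, "student": 0}
two_way = interaction_terms(VARIABLES, 2)
row = design_row(person, two_way)
```

With 6 variables this gives 6 columns with no interactions, 21 with 2-way terms, and 41 with 3-way terms, which is why adding orders one at a time and checking the fit at each step seems safer than including everything at once.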
