Say I have binary variables:

  • tall
  • basketball player
  • football player
  • vegan
  • programmer
  • student

We are also given subsets of a population.

My objective is to find out which combination of variables best describes how a subset differs significantly from the rest of the population.

For example, say that in the entire population of 1000 people there are only 5 people who are vegan, NOT tall, and basketball players, and all 5 of them belong to subset A, which contains 20 people. Intuitively, I know that {vegan, NOT tall, basketball player} is an interesting combination of variables that distinguishes subset A from the rest of the population.
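To make the example concrete, here is a minimal sketch of how unusual those counts would be if subset A were just 20 people drawn at random from the 1000. The hypergeometric tail probability is one possible yardstick, chosen here for illustration rather than taken from anything above; the variable names and the helper `hypergeom_tail` are made up for this sketch.

```python
from math import comb

# Counts taken from the example above.
population = 1000      # everyone
subset_size = 20       # people in subset A
combo_total = 5        # people who are vegan, NOT tall, and basketball players
combo_in_subset = 5    # how many of those fall in subset A

# Share of people with the combination, inside and outside subset A.
rate_in_subset = combo_in_subset / subset_size
rate_outside = (combo_total - combo_in_subset) / (population - subset_size)

def hypergeom_tail(k, N, K, n):
    """P(X >= k) when n people are drawn at random from N, of whom K have the combination."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Chance of seeing at least this many combination-holders in A by luck alone.
p_value = hypergeom_tail(combo_in_subset, population, combo_total, subset_size)
print(rate_in_subset, rate_outside, p_value)
```

A quarter of subset A has the combination while nobody outside it does, and the tail probability is tiny, which matches the intuition. Note that scanning many combinations this way would require some correction for multiple comparisons.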

What types of statistical analysis should I look to for doing this in a systematic way?

b_dev

1 Answer

This is not my area, but something like logistic regression may work well, since it provides interpretable coefficients: the response is membership in the subset, and the binary variables are the predictors. (For more than 2 subsets, there is multi-class logistic regression.)

I believe there are tricks in R's formula syntax for specifying the various interaction terms. However, including all of them is probably ill-advised, both for model over-fitting and for interpretability. You could start with no interactions, then explore incrementally adding 2-way, 3-way, ..., k-way interactions.
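As a rough illustration of what "incrementally adding interactions" means for the design matrix, here is a minimal Python sketch (standard library only) that expands binary rows into product columns up to a chosen order. The function name `interaction_columns` and the two sample rows are hypothetical; the model fitting itself is left to whatever logistic regression routine you prefer.

```python
from itertools import combinations

def interaction_columns(rows, names, max_order):
    """Return (column_names, expanded_rows) with all product terms up to max_order."""
    # Every index tuple of size 1..max_order: size 1 gives the main effects.
    terms = [term
             for order in range(1, max_order + 1)
             for term in combinations(range(len(names)), order)]
    col_names = [":".join(names[i] for i in term) for term in terms]
    # For 0/1 data, the product of the variables is 1 only if all of them are 1.
    expanded = [[int(all(row[i] for i in term)) for term in terms] for row in rows]
    return col_names, expanded

names = ["tall", "basketball", "football", "vegan", "programmer", "student"]
rows = [  # two made-up people, one value per variable in `names`
    [0, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0],
]

# Number of predictor columns as the allowed interaction order grows.
for k in (1, 2, 3):
    cols, X = interaction_columns(rows, names, k)
    print(k, len(cols))
```

With the six variables from the question, the column count grows from 6 (no interactions) to 21 (up to 2-way) to 41 (up to 3-way), which is why adding every interaction at once invites over-fitting. A negated condition such as NOT tall is not a separate product column here; it would need its own indicator (1 - tall) if you want it to appear explicitly.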

GeoMatt22