Which method can I use to pinpoint features that separate a sub-group from a group

Question

Let's say I have a group (G) of 1000 individuals. I have complete knowledge of 200 demographic properties of each of them, from categorical (favorite drink) to numerical (age).

Let's say I invite them all for a tea party. A subgroup (g) of 20 people turns up.

Most probably, the 20 that turn up aren't a completely random group: perhaps they like tea, or perhaps they don't like staying home and watching TV. I have a lot of data about them, so I should be able to find out what differentiates this group g from the big group G.

Questions:

Is there a method in statistics (or a related discipline) that could tell me which (if any) of the demographic properties differentiate g from G? I would know how to check the demographic properties one by one. But efficiently checking them all, every time I invite people to tea, is my challenge.
Even if group g is actually completely random, I would probably find a number of demographic properties that significantly (e.g. at alpha = 0.05) differentiate group g from group G, simply because the number of demographic properties is very high. So, if I find (e.g.) that group g is more into tea than group G, how can I trust my finding? Do I need a very low alpha value?
Group G is actually just a (finely tuned stratified) sample of a population (P) of 1 million individuals. And I'm not only interested in whether the 20 tea drinkers are different from G, but also in whether they are different from P. Does this affect the methods I use?

Btw, my preferred tool for handling these questions is R.

What you want is to differentiate g from not-g. — gung - Reinstate Monica, Dec 23 '15 at 22:19

score 0 · Accepted Answer · edited Apr 13 '17 at 12:44

@gung is correct here. I believe you want to find differences in the 200 demographic variables between the g group and the not-g group. I would use a logistic regression model. Your response would be 0 for "not-g" and 1 for "g". To figure out which of the 200 demographic variables differ between g and not-g, you could do variable selection with LASSO, without having to adjust your alpha value or test every demographic variable individually. Here is a good thread about LASSO with logistic regression.
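As a rough illustration of the LASSO idea, here is a minimal sketch using the glmnet package. The data are simulated and every column name (`age`, `drink`, `noise1` and so on) is made up for the example; attendance is made to depend only on liking tea. Factors are expanded to dummy columns with `model.matrix`, because `glmnet` expects a numeric matrix.

```r
library(glmnet)

set.seed(1)
n <- 1000

# Hypothetical demographics: one numeric, one categorical, ten pure-noise columns
demo <- data.frame(
  age   = rnorm(n, mean = 45, sd = 12),
  drink = factor(sample(c("coffee", "tea", "water"), n, replace = TRUE))
)
noise <- as.data.frame(matrix(rnorm(n * 10), nrow = n))
names(noise) <- paste0("noise", 1:10)
demo <- cbind(demo, noise)

# Simulated response: 1 = in g (turned up), 0 = not-g; tea lovers turn up more often
attended <- rbinom(n, 1, ifelse(demo$drink == "tea", 0.06, 0.005))

# Expand factors into dummy variables and drop the intercept column
x <- model.matrix(~ ., data = demo)[, -1]

# Cross-validated LASSO (alpha = 1) logistic regression
cvfit <- cv.glmnet(x, attended, family = "binomial", alpha = 1, nfolds = 5)

# Variables with a non-zero coefficient at the more conservative lambda
coefs <- as.matrix(coef(cvfit, s = "lambda.1se"))
selected <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")
print(selected)
```

With only about 20 attendees, the selected set can be unstable (and may even be empty at `lambda.1se`), so it is worth rerunning with different seeds or folds before trusting any single variable.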

Another option is to use a random forest to find the "importance" of your variables in differentiating between your two groups: g and not-g. Your responses would still be 0 and 1. The random forest will give you an "importance value" which describes how useful that variable is in differentiating your groups. Here is the randomForest package.
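A comparable sketch for the random forest route, again on simulated data with made-up column names, could look like the following. It reports the mean decrease in Gini impurity as the importance value, which is one of the measures the randomForest package provides.

```r
library(randomForest)

set.seed(1)
n <- 1000

# Hypothetical demographics: one numeric, one categorical, ten pure-noise columns
demo <- data.frame(
  age   = rnorm(n, mean = 45, sd = 12),
  drink = factor(sample(c("coffee", "tea", "water"), n, replace = TRUE))
)
noise <- as.data.frame(matrix(rnorm(n * 10), nrow = n))
names(noise) <- paste0("noise", 1:10)
demo <- cbind(demo, noise)

# Simulated response as a factor so that randomForest performs classification
attended <- rbinom(n, 1, ifelse(demo$drink == "tea", 0.06, 0.005))
demo$group <- factor(attended, levels = c(0, 1), labels = c("not_g", "g"))

# Fit the forest with all other columns as predictors
rf <- randomForest(group ~ ., data = demo, ntree = 500)

# Importance as mean decrease in Gini impurity, largest first
imp <- importance(rf, type = 2)
imp_sorted <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
print(head(imp_sorted, 5))
```

Importance values only rank the variables; they do not say whether a variable matters at all, and with so few members in g the ranking can be noisy.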

Thanks guys. Logistic regression seems a good candidate, now that you mention it. Can't believe I didn't see it myself. I'll have to read up on the LASSO method that you suggest. For example, I must find out how well it deals with correlated variables, since I expect that is going to be an issue. If it doesn't deal with them well, I'll try the random forest approach. Thanks a lot! — Tor, Dec 26 '15 at 19:59
