Let's say I have a group (G) of 1000 individuals. I have complete knowledge of 200 demographic properties of each of them, from categorical (favorite drink) to numerical (age).

Let's say I invite them all for a tea party. A subgroup (g) of 20 people turns up.

Most probably, the 20 that turn up aren't a completely random group: perhaps they like tea, or perhaps they don't like staying home and watching TV. I have a lot of data about them, so I should be able to find out what differentiates this group g from the big group G.

Questions:

  1. Is there a method in statistics (or related disciplines) that could tell me which (if any) of the demographic properties differentiate g from G? I would know how to check the demographic properties one by one. But efficiently checking them all, every time I invite people to tea, is my challenge.

  2. Even if group g is actually completely random, I would probably find a number of demographic properties that significantly (e.g. at alpha=0.05) differentiate group g from group G, just because the number of demographic properties is very high. So, if I find (e.g.) that group g is more into tea than group G, how can I trust my finding? Do I have to use a very low alpha value?

  3. Group G is actually just a (finely tuned stratified) sample of a population (P) of 1 million individuals. I'm not only interested in knowing whether the 20 tea drinkers are different from G, but also whether they are different from P. Does this affect the methods I use?
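To make the worry in question 2 concrete, here is a minimal sketch in Python (standard library only). It assumes the simplest null situation: none of the 200 properties matters and every test's p-value is uniform on (0, 1). The names `simulate_null_p_values` and `count_significant` are made up for the illustration, and the Bonferroni threshold stands in for "a very low alpha value"; it is only one possible correction.

```python
import random

N_PROPERTIES = 200   # demographic properties tested
ALPHA = 0.05         # per-test significance level


def simulate_null_p_values(n_tests, rng):
    # Assumption: under the null hypothesis each p-value is Uniform(0, 1).
    return [rng.random() for _ in range(n_tests)]


def count_significant(p_values, threshold):
    # Number of tests that would be declared significant at this threshold.
    return sum(p < threshold for p in p_values)


rng = random.Random(2015)  # fixed seed so the run is repeatable
p_values = simulate_null_p_values(N_PROPERTIES, rng)

# On average alpha * (number of tests) properties look "significant" by chance.
expected_false_positives = N_PROPERTIES * ALPHA

# Bonferroni: divide alpha by the number of tests.
bonferroni_alpha = ALPHA / N_PROPERTIES

n_naive = count_significant(p_values, ALPHA)
n_bonferroni = count_significant(p_values, bonferroni_alpha)

print("expected false positives at alpha:", expected_false_positives)
print("observed at alpha:", n_naive)
print("observed at Bonferroni threshold:", n_bonferroni)
```

In R, the same kind of adjustment is available through `p.adjust(p, method = "bonferroni")`.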

Btw, my preferred tool for handling these questions is R.

gung - Reinstate Monica
Tor

1 Answer

@gung is correct here. I believe you want to find differences in the 200 demographic variables between the g group and the not-g group. I would use a logistic regression model, with the response coded 0 for "not-g" and 1 for "g". To figure out which of the 200 demographic variables differ between g and not-g, you could do variable selection with LASSO, which avoids both testing every demographic variable individually and adjusting your alpha value. Here is a good thread about LASSO with logistic regression.
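As a rough illustration of how the L1 penalty performs the selection, here is a small self-contained sketch in Python (standard library only). It fits a logistic regression with a LASSO penalty by proximal gradient descent, which is one simple way to solve the problem and not necessarily what any particular package does. The toy data set (40 people, one informative column `likes_tea` and two balanced noise columns), the function names, and the value `lam=0.01` are all assumptions made up for the example.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    # Proximal step for the L1 penalty: shrink toward zero, snap small values to 0.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.5, n_iter=2000):
    # Minimizes mean log-loss + lam * sum(|weights|); the intercept is not penalized.
    n, p = len(X), len(X[0])
    intercept, weights = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad_b, grad_w = 0.0, [0.0] * p
        for row, label in zip(X, y):
            score = intercept + sum(w * x for w, x in zip(weights, row))
            err = sigmoid(score) - label
            grad_b += err
            for j in range(p):
                grad_w[j] += err * row[j]
        intercept -= step * grad_b / n
        weights = [soft_threshold(w - step * g / n, step * lam)
                   for w, g in zip(weights, grad_w)]
    return intercept, weights


# Hypothetical toy data: columns are [likes_tea, noise_a, noise_b]; y = 1 means "in g".
X, y = [], []
for likes_tea, attended, size in [(1, 1, 12), (1, 0, 8), (0, 1, 4), (0, 0, 16)]:
    for i in range(size):
        # The noise columns are balanced within every (likes_tea, attended) cell.
        X.append([likes_tea, (1, -1, 1, -1)[i % 4], (1, 1, -1, -1)[i % 4]])
        y.append(attended)

intercept, weights = fit_l1_logistic(X, y, lam=0.01)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("weights:", weights)
print("selected columns:", selected)
```

Variables whose coefficient is shrunk exactly to zero are dropped; the rest are the candidates that differentiate g from not-g. With the real 1000 x 200 data, categorical variables would need dummy coding and the penalty strength would normally be chosen by cross-validation; in R, `cv.glmnet(x, y, family = "binomial")` from the glmnet package handles both the fit and the choice of penalty.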

Another option is to use a random forest to find the "importance" of your variables in differentiating between your two groups: g and not-g. Your responses would still be 0 and 1. The random forest will give you an "importance value" which describes how useful that variable is in differentiating your groups. Here is the randomForest package.

LindsayL
    Thanks guys. Logistic regression seems a good candidate, now that you mention it. Can't believe I didn't see it myself. I'll have to read up on the LASSO method that you suggest. For example, I must find out how well it deals with correlated variables, since I expect that is going to be an issue. If it doesn't deal with them well, I'll try the random forest approach. Thanks a lot! – Tor Dec 26 '15 at 19:59