Say we have samples from two populations, $A$ and $B$. Let's assume these populations are made of individuals, and that we describe each individual in terms of features. Some of these features are categorical (e.g. do they drive to work?) and some are numerical (e.g. their height). Let's call these features $X_1, \ldots, X_n$. We collect hundreds of them (e.g. $n = 200$) and, for simplicity, assume they are measured without error or noise across all individuals.
We hypothesize that the two populations are different. Our goal is to answer the following two questions:

1. Are they actually significantly different?
2. What, specifically, is significantly different between them?
Tree-based methods (e.g. random forests) and linear regression can help. For example, one could look at feature importances in a random forest, or at the fitted coefficients in a linear regression, to understand what may distinguish the groups and to explore relationships between features and populations.
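To make the exploratory route concrete, here is a minimal sketch using scikit-learn on synthetic stand-in data (the data, feature count, and the idea that feature 0 is the truly different one are all illustrative assumptions, not part of the problem above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Illustrative stand-in: 200 individuals per group, 10 features;
# only feature 0 genuinely differs between A (y=0) and B (y=1).
n_per_group, n_features = 200, 10
X = rng.normal(size=(2 * n_per_group, n_features))
y = np.repeat([0, 1], n_per_group)
X[y == 1, 0] += 1.5  # shift feature 0 in population B

# Random forest: feature_importances_ ranks features by how much
# they reduce impurity across the ensemble's splits.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = int(np.argmax(rf.feature_importances_))

# Logistic regression: fitted coefficients give signed, per-feature
# associations (roughly comparable here since all features share a scale).
lr = LogisticRegression(max_iter=1000).fit(X, y)
top_coef = int(np.argmax(np.abs(lr.coef_[0])))

print("most important RF feature:", top)
print("largest |coefficient| feature:", top_coef)
```

Both views flag the shifted feature on this toy data, but note that neither importance scores nor raw coefficients come with a significance test attached, which is exactly the gap the questions below are about.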
Before I go down this route, I want to get a sense of my options: what counts as good, modern practice versus bad practice. Note that my goal isn't prediction per se, but testing for and locating any significant differences between the groups.
What are some principled approaches to address this problem?
Here are some concerns I have:
Methods like linear regression may not fully answer (2), right? A single fit can surface some differences, but not necessarily all significant ones. For example, multicollinearity may prevent us from seeing how every feature varies across groups (at least in a single fit): when two features are highly correlated, the model can attribute the group difference to either one. For the same reason, I would expect ANOVA cannot provide a full answer to (2) either.
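One standard way to sidestep the joint-fit problem is to test each feature marginally and then correct for multiple testing. A minimal sketch on synthetic data, using a per-feature Welch's t-test plus a hand-rolled Benjamini-Hochberg step (group sizes, feature count, and which features are shifted are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative data: 100 individuals per group, 20 numerical features,
# with features 0 and 1 genuinely shifted in group B.
nA = nB = 100
n_features = 20
A = rng.normal(size=(nA, n_features))
B = rng.normal(size=(nB, n_features))
B[:, :2] += 1.0

# One Welch's t-test per feature (does not assume equal variances).
pvals = np.array([
    stats.ttest_ind(A[:, j], B[:, j], equal_var=False).pvalue
    for j in range(n_features)
])

def benjamini_hochberg(pvals, q=0.05):
    """Flag features while controlling the false discovery rate at q."""
    m = len(pvals)
    order = np.argsort(pvals)
    thresh = q * np.arange(1, m + 1) / m
    below = pvals[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

sig = benjamini_hochberg(pvals)
print("features flagged as different:", np.nonzero(sig)[0])
```

Marginal tests like this answer "which individual features differ" even when features are collinear, at the cost of ignoring joint structure (two features that only differ in combination will be missed).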
It's not entirely clear how a predictive approach would answer (1). For example, which classification/prediction loss should we minimize? And once we have a fit, how do we test whether the groups are significantly different? Finally, I worry that the answer to (1) may depend on the particular set of classification models I try.
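For what it's worth, one model-agnostic recipe for (1) is a permutation test on out-of-sample accuracy: if the labels carry no information, the observed cross-validated accuracy should look like the accuracy obtained on shuffled labels. A minimal numpy-only sketch with a hand-rolled nearest-centroid classifier (the classifier choice, data, and permutation count are illustrative assumptions; any classifier could be plugged in):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: group B (y=1) is shifted in the first feature.
n = 100
X = rng.normal(size=(2 * n, 5))
y = np.repeat([0, 1], n)
X[y == 1, 0] += 1.0

def cv_accuracy(X, y, n_splits=5):
    """Cross-validated accuracy of a nearest-centroid classifier."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_splits)
    correct = 0
    for k in range(n_splits):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        # Class centroids from the training folds only.
        c0 = X[train][y[train] == 0].mean(axis=0)
        c1 = X[train][y[train] == 1].mean(axis=0)
        # Predict the class of the nearer centroid.
        d0 = np.linalg.norm(X[test] - c0, axis=1)
        d1 = np.linalg.norm(X[test] - c1, axis=1)
        pred = (d1 < d0).astype(int)
        correct += (pred == y[test]).sum()
    return correct / len(y)

obs = cv_accuracy(X, y)

# Null distribution: the same pipeline on randomly shuffled labels.
null = np.array([cv_accuracy(X, rng.permutation(y)) for _ in range(200)])
pval = (1 + (null >= obs).sum()) / (1 + len(null))  # add-one permutation p
print(f"observed accuracy {obs:.2f}, permutation p = {pval:.3f}")
```

Because the whole fitting pipeline is rerun on each shuffled dataset, this test is valid for essentially any classifier and loss, which partly addresses the worry about the choice of model (though power, i.e. which differences get detected, still depends on that choice).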