Say we have samples from two populations, $A$ and $B$. Let's assume these populations are made of individuals, and that we describe each individual in terms of features. Some of these features are categorical (e.g. do they drive to work?) and some are numerical (e.g. their height). Let's call these features $X_1, \ldots, X_n$. We collect hundreds of them (e.g. $n = 200$) and, for simplicity, assume they are measured without error or noise across all individuals.
We hypothesize that the two populations are different. Our goal is to answer the following two questions:

1. Are they actually significantly different?
2. What, specifically, is significantly different between them?
Tree-based methods (e.g. random forests) and linear regression can help. For example, one could look at feature importances in a random forest, or at the fitted coefficients in a linear regression, to understand what may distinguish the groups and to explore relationships between features and populations.
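To make the exploratory route concrete, here is a minimal sketch using scikit-learn on synthetic stand-in data (the data, feature count, and the idea that feature 0 is the truly different one are all illustrative assumptions, not part of the problem above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Illustrative stand-in: 200 individuals per group, 10 features;
# only feature 0 genuinely differs between A (y=0) and B (y=1).
n_per_group, n_features = 200, 10
X = rng.normal(size=(2 * n_per_group, n_features))
y = np.repeat([0, 1], n_per_group)
X[y == 1, 0] += 1.5  # shift feature 0 in population B

# Random forest: feature_importances_ ranks features by how much
# they reduce impurity across the ensemble's splits.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = int(np.argmax(rf.feature_importances_))

# Logistic regression: fitted coefficients give signed, per-feature
# associations (roughly comparable here since all features share a scale).
lr = LogisticRegression(max_iter=1000).fit(X, y)
top_coef = int(np.argmax(np.abs(lr.coef_[0])))

print("most important RF feature:", top)
print("largest |coefficient| feature:", top_coef)
```

Both views flag the shifted feature on this toy data, but note that neither importance scores nor raw coefficients come with a significance test attached, which is exactly the gap the questions below are about.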
Before I go down this route, I want to get a sense of my options: what counts as good, modern practice versus bad practice. Note that my goal isn't prediction per se, but testing for and locating any significant differences between the groups.
What are some principled approaches to address this problem?
Here are some concerns I have:
Methods like linear regression may not fully answer (2), right? A single fit can surface some differences, but not necessarily all significant ones. For example, multicollinearity may prevent us from seeing how every feature varies across groups (at least in a single fit): when two features are highly correlated, the model can attribute the group difference to either one. For the same reason, I would expect ANOVA cannot provide a full answer to (2) either.
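One standard way to sidestep the joint-fit problem is to test each feature marginally and then correct for multiple testing. A minimal sketch on synthetic data, using a per-feature Welch's t-test plus a hand-rolled Benjamini-Hochberg step (group sizes, feature count, and which features are shifted are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative data: 100 individuals per group, 20 numerical features,
# with features 0 and 1 genuinely shifted in group B.
nA = nB = 100
n_features = 20
A = rng.normal(size=(nA, n_features))
B = rng.normal(size=(nB, n_features))
B[:, :2] += 1.0

# One Welch's t-test per feature (does not assume equal variances).
pvals = np.array([
    stats.ttest_ind(A[:, j], B[:, j], equal_var=False).pvalue
    for j in range(n_features)
])

def benjamini_hochberg(pvals, q=0.05):
    """Flag features while controlling the false discovery rate at q."""
    m = len(pvals)
    order = np.argsort(pvals)
    thresh = q * np.arange(1, m + 1) / m
    below = pvals[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

sig = benjamini_hochberg(pvals)
print("features flagged as different:", np.nonzero(sig)[0])
```

Marginal tests like this answer "which individual features differ" even when features are collinear, at the cost of ignoring joint structure (two features that only differ in combination will be missed).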
It's not entirely clear how a predictive approach would answer (1). For example, which classification/prediction loss should we minimize? And once we have a fit, how do we test whether the groups are significantly different? Finally, I worry that the answer to (1) may depend on the particular set of classification models I try.
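For what it's worth, one model-agnostic recipe for (1) is a permutation test on out-of-sample accuracy: if the labels carry no information, the observed cross-validated accuracy should look like the accuracy obtained on shuffled labels. A minimal numpy-only sketch with a hand-rolled nearest-centroid classifier (the classifier choice, data, and permutation count are illustrative assumptions; any classifier could be plugged in):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: group B (y=1) is shifted in the first feature.
n = 100
X = rng.normal(size=(2 * n, 5))
y = np.repeat([0, 1], n)
X[y == 1, 0] += 1.0

def cv_accuracy(X, y, n_splits=5):
    """Cross-validated accuracy of a nearest-centroid classifier."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_splits)
    correct = 0
    for k in range(n_splits):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        # Class centroids from the training folds only.
        c0 = X[train][y[train] == 0].mean(axis=0)
        c1 = X[train][y[train] == 1].mean(axis=0)
        # Predict the class of the nearer centroid.
        d0 = np.linalg.norm(X[test] - c0, axis=1)
        d1 = np.linalg.norm(X[test] - c1, axis=1)
        pred = (d1 < d0).astype(int)
        correct += (pred == y[test]).sum()
    return correct / len(y)

obs = cv_accuracy(X, y)

# Null distribution: the same pipeline on randomly shuffled labels.
null = np.array([cv_accuracy(X, rng.permutation(y)) for _ in range(200)])
pval = (1 + (null >= obs).sum()) / (1 + len(null))  # add-one permutation p
print(f"observed accuracy {obs:.2f}, permutation p = {pval:.3f}")
```

Because the whole fitting pipeline is rerun on each shuffled dataset, this test is valid for essentially any classifier and loss, which partly addresses the worry about the choice of model (though power, i.e. which differences get detected, still depends on that choice).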