
Consider two datasets, a study dataset with $n$ points and a control dataset with $n_c$ points, with $n < n_c$. Each point in each dataset consists of measurements of four independent variables and one dependent variable: $X_1$, $X_2$, $X_3$, $X_4$, and $Y$, respectively. I note that these variables are correlated with one another.

I would like to test the hypothesis that the study dataset has a different $Y$ (in mean or in distribution) from the control dataset, after controlling for all the independent variables $X_1$, $X_2$, $X_3$, $X_4$ simultaneously.

Following a previous discussion, I applied multiple linear regression to each of the two datasets. Unsurprisingly, the regression coefficients differ. Since the control dataset is larger than the study one, I wanted to make sure that the difference was not simply the result of small(er)-number statistics. So from the $n_c$ control observations I randomly selected a subset of size $n$ and repeated the regression analysis, 10,000 times. For one of the coefficients, the one with the largest value, the difference is quite significant, at $2.7\sigma$ when assuming a Gaussian distribution.
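For reference, here is a minimal sketch of that subsampling procedure; the synthetic data and the variable names are illustrative stand-ins (my real measurements are not shown), and only the structure — refit on 10,000 random size-$n$ control subsets, then compare coefficients — mirrors what I describe above.

```python
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)
n, n_c = 100, 1000

# Synthetic stand-ins for the two datasets (4 predictors, 1 response).
X_study = rng.normal(size=(n, 4))
y_study = X_study @ [1.0, 0.5, -0.3, 0.2] + rng.normal(size=n)
X_ctrl = rng.normal(size=(n_c, 4))
y_ctrl = X_ctrl @ [1.0, 0.5, -0.3, 0.2] + rng.normal(size=n_c)

def coefs(X, y):
    """OLS coefficients (intercept first)."""
    A = np.column_stack([np.ones(len(y)), X])
    return lstsq(A, y, rcond=None)[0]

beta_study = coefs(X_study, y_study)

# Refit on 10,000 random size-n subsets of the control data.
boot = np.array([coefs(X_ctrl[idx], y_ctrl[idx])
                 for idx in (rng.choice(n_c, size=n, replace=False)
                             for _ in range(10_000))])

# z-score of each study coefficient against the subsampled control fits.
z = (beta_study - boot.mean(axis=0)) / boot.std(axis=0)
print(z)
```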

Is this test conclusive, in the sense that it proves that the datasets differ with respect to $Y$? How would you suggest doing such a test? I played around with PCA but could not formulate the question in a concise fashion, and I am quite unhappy with the current dependence on the (linear) model assumption.

pedrofigueira

2 Answers


I would just stack the two datasets into one dataset, create an indicator variable telling you which observations are controls and which are not, and fit one model that includes your $X_1$ through $X_4$, the indicator variable, and the interaction terms between the indicator variable and your $X$s. The main effect of the indicator variable tells you whether the expected value of $Y$ differs between controls and non-controls after adjusting for the $X$s, and the interaction terms tell you whether or not the effects of the $X$s differ between controls and non-controls. The tests that most software displays next to these coefficients are the tests you are looking for.
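For concreteness, a minimal sketch of this stacked model using Python and statsmodels' formula interface; the synthetic data, the column names `X1`–`X4` and `Y`, and the indicator name `G` are assumptions made purely for illustration, not anything from the original posts.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

def fake_data(n, offset=0.0):
    """Synthetic stand-in for one dataset (purely illustrative)."""
    X = rng.normal(size=(n, 4))
    y = offset + X @ [1.0, 0.5, -0.3, 0.2] + rng.normal(size=n)
    return pd.DataFrame(np.column_stack([X, y]),
                        columns=["X1", "X2", "X3", "X4", "Y"])

df_ctrl, df_study = fake_data(1000), fake_data(100, offset=0.4)
df_ctrl["G"], df_study["G"] = 0, 1          # indicator: 0 = control, 1 = study
df = pd.concat([df_ctrl, df_study], ignore_index=True)

# (X1 + X2 + X3 + X4) * G expands to the main effects of the Xs and G
# plus every X:G interaction; the t-test on G tests the Y offset, and
# the tests on the X:G terms test whether the X effects differ.
fit = smf.ols("Y ~ (X1 + X2 + X3 + X4) * G", data=df).fit()
print(fit.summary())
```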

Maarten Buis
  • You mean do the regression, control for $X_1$ through $X_4$ in both datasets by removing the obtained dependence, and see if something stands out when comparing the $Y$ of the two groups? I am sorry, but I do not understand what you mean by interaction terms; if these are the terms that relate the independent variables to the dependent one, this is the comparison that delivers the 2.7-sigma significance... – pedrofigueira May 06 '14 at 19:17
  • The first step is to create _one_ dataset by stacking the two datasets you have. Interaction terms are just the product of the indicator variable with $X_1$, the indicator variable with $X_2$, etc. You can read more on interaction variables on [this site](http://www.ats.ucla.edu/stat/stata/faq/catcon.htm) – Maarten Buis May 06 '14 at 19:23
  • Thank you very much. I find the concept very interesting, but it took me a little while to get the idea, and I am still not fully comfortable with it. So what you suggest is that I stack the two datasets together and code an indicator variable that works like a moderator, if I understood your answer correctly. So as coefficients for the regression I would use $ind_0 \times X_i + ind_1 \times X_i'$, with $X_i$ and $X_i'$ from control and study, respectively. – pedrofigueira May 07 '14 at 13:41

Following the suggestion made by Maarten, I stacked the two datasets together and performed a linear regression to obtain a function of the form:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_{off} M$

In which $\beta_0$–$\beta_4$ are the intercept and the $X_i$ coefficients of the linear regression, $M$ is a categorical moderator variable that is 0 for members of the control group and 1 for members of the study group, and $\beta_{off}$ is its coefficient. The $\beta_{off}$ coefficient is what I want to study: it measures whether there is indeed an offset in the study group when the dependence of $Y$ on the parameters $X_i$ is assumed to be the same for both groups.
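As a concrete illustration, here is a sketch of fitting this offset model with Python/statsmodels; the synthetic stacked dataset and the injected 0.4 offset are assumptions made purely for the example, not my real data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_c, n = 1000, 100

# Synthetic stacked dataset: M = 0 for controls, 1 for the study group,
# with a true offset of 0.4 injected for illustration.
X = rng.normal(size=(n_c + n, 4))
M = np.r_[np.zeros(n_c), np.ones(n)]
Y = 2.0 + X @ [1.0, 0.5, -0.3, 0.2] + 0.4 * M + rng.normal(size=n_c + n)
df = pd.DataFrame(np.column_stack([X, M, Y]),
                  columns=["X1", "X2", "X3", "X4", "M", "Y"])

# No interaction terms here: the X-dependence is assumed shared, and
# the coefficient on M is the offset beta_off.
fit = smf.ols("Y ~ X1 + X2 + X3 + X4 + M", data=df).fit()
print(fit.params["M"], fit.bse["M"])  # beta_off and its standard error
```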

The $\beta_{off}$ delivered by the multiple linear regression is indeed different from zero, and bootstrapping on $Y$ using the associated error bars shows that it is significantly so.
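A sketch of that bootstrap, continuing from the synthetic `df` built in the sketch above; the per-point error bar `sigma_Y` is an assumed placeholder, since the actual error bars are not listed here.

```python
import numpy as np
import statsmodels.formula.api as smf

# `df` is the stacked DataFrame from the sketch above; give every point
# an assumed error bar (placeholder value for illustration).
df["sigma_Y"] = 0.5
rng = np.random.default_rng(3)

# Perturb each Y within its error bar, refit, and collect beta_off.
beta_off = np.array([
    smf.ols("Y ~ X1 + X2 + X3 + X4 + M",
            data=df.assign(Y=df["Y"] + rng.normal(scale=df["sigma_Y"])))
       .fit().params["M"]
    for _ in range(1_000)  # fewer iterations than 10,000, to keep it quick
])

print(beta_off.mean() / beta_off.std())  # offset significance in sigma
```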

pedrofigueira