
I'm working with some real-world data, and the regression models are yielding some counter-intuitive results. I know that in my data, the predictors X1 and X2 are highly correlated with each other, and as they increase, the response Y increases as well.

However, when I build a regression model that includes other predictors in the data, the regression coefficients of X1 and X2 change sign. I don't understand this. Is this due to multicollinearity? If so, how can I force them to have the same sign? Can I not build a regression model in which both X1 and X2 have coefficients of the same sign?

Jeba
  • @DevonOliver, this is a perfectly reasonable question. Your comment, perhaps unintentionally, comes across as rather harsh. – gung - Reinstate Monica Jul 26 '18 at 19:33
  • Most likely your X1 & X2 are correlated w/ some of the control variables such that the apparent relationship flips when the other variables are accounted for. It may help you to read my answer here: [Is there a difference between 'controlling for' and 'ignoring' other variables in multiple regression?](https://stats.stackexchange.com/a/78830/7290) – gung - Reinstate Monica Jul 26 '18 at 19:37

1 Answer


Any time you add a regressor that is correlated both with a variable of interest and with the residual from the previous regression, the coefficient you estimate will change in expectation.

To give an example that might illustrate how this works, consider a regression with a woman's yearly income as the outcome. Two regressors that are correlated and move in the same direction might be mother's education and father's income. Typically, the higher the mother's education, the greater someone's earnings; the same holds for father's income. Additionally, since people tend to select into similar socioeconomic classes, someone with a higher income is likely to marry someone with higher education and vice versa. These variables are a special case of what you describe above.

Let's say that when we regress income on these two variables, the coefficients are small and positive.

Now let's consider another variable: mother's yearly income.

I am making this up as an example, but trying to be plausible. Let's say that the higher the mother's income, the greater the daughter's income, even after conditioning on (taking into account) father's income and mother's education. We could imagine this is because of role modeling: seeing her mother make money might make a daughter more likely to pursue career advancement or to become the breadwinner of her household.

For argument's sake, we could also imagine that mother's income and father's income are negatively correlated. Perhaps in households where one parent has a high income, the other reduces their work hours and takes on relatively more domestic responsibilities.

Taking all of this as given, if we add mother's income to the regression, the estimated coefficient on mother's education will go down and the estimated coefficient on father's income will go up (asymptotically). Consider the math below.

Missing variable, simple case:

Let the ground truth be

$Y = \alpha + X_1\beta_1 + X_2\beta_2 +\mu$

For simplicity, $\mu \sim N(0,1)$

If we estimate a regression with missing variables

$Y = \alpha + X_1\beta_1 + \upsilon$

$\operatorname{plim} \hat{\beta}_1 = \beta_1 + \beta_2\frac{\operatorname{Cov}(X_1,X_2)}{\operatorname{Var}(X_1)}$

So the missing-variable bias depends on the sign of the correlation between $X_1$ and $X_2$ as well as on the true coefficient of $X_2$ on $Y$. This is a simple case where we originally estimate only one variable, but the principle applies when estimating more than one variable.
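As a quick check of this formula, here is a minimal simulation sketch (my own, not part of the question); the coefficients and the 0.6 covariance are invented purely for illustration.

```python
# Minimal sketch: verify the omitted-variable-bias formula by simulation.
# All numbers below (beta1, beta2, the 0.6 covariance) are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta1, beta2 = 0.5, 1.0

# Correlated regressors X1, X2 with known covariance 0.6.
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
x1, x2 = X[:, 0], X[:, 1]
y = 1.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)

# "Short" regression that omits X2.
A = np.column_stack([np.ones(n), x1])
b1_short = np.linalg.lstsq(A, y, rcond=None)[0][1]

# Formula: plim beta1_hat = beta1 + beta2 * Cov(X1, X2) / Var(X1).
predicted = beta1 + beta2 * np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)
print(b1_short, predicted)  # both should land near 0.5 + 1.0 * 0.6 = 1.1
```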

In our example, we said that the impact of the missing variable, mother's income, on $Y$ (daughter's income), conditional on mother's education and father's income, was positive ($\beta_3 > 0$). We then said that the correlation (which has the same sign as the covariance) between mother's income and mother's education is positive, while the correlation between mother's income and father's income is negative. The missing-variable biases are thus of opposite sign. If the original estimates are sufficiently close to zero, one of the signs could flip while the other does not.
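To see the sign flip itself, here is a hedged simulation sketch of my own for the three-variable setting above; the correlation structure and coefficient values are invented so that the flip is visible, with X1, X2, X3 standing in for mother's education, father's income, and mother's income.

```python
# Sketch of the sign flip: X3 is positively correlated with X1, negatively
# correlated with X2, and has a strong positive effect on Y. All values invented.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

cov = [[1.0,  0.2,  0.6],   # Corr(X1, X3) > 0
       [0.2,  1.0, -0.5],   # Corr(X2, X3) < 0
       [0.6, -0.5,  1.0]]
X = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=n)
x1, x2, x3 = X.T

# True model: X1's direct effect is slightly negative, X2's is positive.
y = -0.2 * x1 + 0.7 * x2 + 1.0 * x3 + rng.normal(size=n)

def ols_slopes(y, *cols):
    A = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(A, y, rcond=None)[0][1:]

print(ols_slopes(y, x1, x2))      # without X3: both slopes come out positive
print(ols_slopes(y, x1, x2, x3))  # with X3: X1's slope flips to about -0.2
```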

To your point about making the signs the same, there is an easy way: add the variables together (estimate $Y = \alpha + (X_1 + X_2)\beta_{12} + X_3\beta_3 + \upsilon_1$), as in the sketch below. This forces them to share the same coefficient, which is then obviously also of the same sign. I wouldn't recommend doing this unless you have a good theoretical reason to, but it is available as an option. It is a better idea if, for example, the point is to test the hypothesis that $\beta_1 = \beta_2$.
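For completeness, a small sketch (again my own, with made-up numbers) of that constrained fit: entering $X_1 + X_2$ as a single column forces the two predictors to share one coefficient, and therefore one sign.

```python
# Sketch of the "sum the variables" constraint: X1 and X2 share one slope.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)   # correlated with X1
x3 = rng.normal(size=n)
y = 0.4 * x1 + 0.1 * x2 + 0.5 * x3 + rng.normal(size=n)

# Unconstrained fit: separate slopes for X1 and X2.
A_free = np.column_stack([np.ones(n), x1, x2, x3])
print(np.linalg.lstsq(A_free, y, rcond=None)[0][1:])

# Constrained fit: X1 and X2 enter only through their sum.
A_sum = np.column_stack([np.ones(n), x1 + x2, x3])
print(np.linalg.lstsq(A_sum, y, rcond=None)[0][1:])  # one shared slope, one sign
```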

Tyrel Stokes