Multiple regression standardized coefficient changed sign even with low VIF

Question

I have an outcome with predictors a, b, c

The correlations of these variables are

              outcome           a           b           c
outcome     1.0000000  -0.3330094  -0.5882250  -0.2778692
a          -0.3330094   1.0000000   0.4222888   0.7404057
b          -0.5882250   0.4222888   1.0000000   0.7030850
c          -0.2778692   0.7404057   0.7030850   1.0000000

I would have assumed that the coefficients of the standardized variables would have all been negative in multiple regression, but that is not what happened as shown below:

term          estimate  std.error
intercept       0           0.08  
a             -.47          0.13
b             -.86          0.12
c              .61          0.16

To my surprise c was positive. My first thought was to check collinearity with car::vif(). My results were as follows

term    vif
a       2.3
b       2.0
c       3.7

I usually use the rule-of-thumb that a vif of 5 indicates collinearity. In this case I don't see collinearity present.

How do I explain the change in the sign of term c?

Whenever you use a rule of thumb, you really shouldn't be surprised when it fails. Rules of thumb are just guidance that works ok some of the time (hopefully more often than not, though that is often debatable). They are no substitute for real experimentation and understanding. — Matthew Drury, Apr 05 '18 at 19:33
@MatthewDrury thanks. Are you saying that a vif of 3.7 is meaningful? How would you explain the sign change? Do I trust the regression or the correlation? — Alex, Apr 05 '18 at 19:44
One reason we do multiple regression is precisely because the bivariate regressions between the response and the individual regressors tell us nothing whatsoever about how the other regressors might be related to (or mediate or influence) those regressions. If you aren't going to believe the multiple regression results because they differ from the bivariate regressions, then what's the point of doing them? — whuber, Apr 05 '18 at 20:11
@whuber Thanks for the comment. I agree with everything you said. It just makes it difficult to understand. If, for example, you believe that the outcome is negatively related to `c` and a bivariate regression also confirms this, then it is difficult to change a previous paradigm when the multiple regression shows something directionally different. May I ask how you would explain this change in sign? — Alex, Apr 05 '18 at 20:38

score 1 · Answer 1 · answered Apr 05 '18 at 21:39

The situation you are likely encountering is sometimes known as suppression, although different fields use different labels. Paulhus, Robins, Trzesniewski, and Tracy (2004; Two replicable suppressor situations in personality research, Multivariate Behavioral Research, 39, 301-326) provide a good explanation of suppression and illustrate it with real data. Here’s a link.

https://www.researchgate.net/publication/228079393_Two_Replicable_Suppressor_Situations_in_Personality_Research

Friedman and Wall (2005; Graphical views of suppression and multicollinearity in multiple linear regression, The American Statistician, 59, 127-136) offer a more mathematical treatment, and graphical display, of suppression.

https://www.researchgate.net/publication/4741124_Graphical_Views_of_Suppression_and_Multicollinearity_in_Multiple_Linear_Regression

Briefly explained, the relationship between X1 and Y can change (be enhanced, weakened, or even change sign) once one or more control variables are partialed out through regression analysis or partial correlations. In short, the bivariate correlation fails to take into account important confounding variables that must be controlled before the truer nature of the relation between X1 and Y can be seen.
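For the simplest case, two standardized predictors X1 and X2, the standard formula for the standardized slope of X1 makes the mechanism visible. The notation is introduced here only for illustration: r denotes a zero-order correlation and the subscripts name the variables involved.

```latex
\beta_{1} = \frac{r_{Y1} - r_{Y2}\, r_{12}}{1 - r_{12}^{2}}
```

The denominator is positive, so the sign of the coefficient is the sign of the numerator. It therefore differs from the sign of the bivariate correlation r_{Y1} whenever the product r_{Y2} r_{12} has the same sign as r_{Y1} and is larger in magnitude.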

Below I present an example I discovered when creating data for instructional purposes. Many with college experience will be familiar with the Scholastic Aptitude Test or SAT, a test some colleges and universities in the USA require for admission. The SAT has several subsections, and one is the mathematics SAT, or math SAT for short.

Most in education would argue that in secondary schools (e.g., high school) smaller student-to-faculty ratios (i.e., smaller class sizes) should result in better achievement, so there should be a negative correlation between student-to-faculty ratios and math SAT scores. In addition, some argue that higher teacher salaries attract better teachers, and that if better teachers are in the classroom student achievement should be higher, so there should be a positive correlation between teacher salary and math SAT scores.

To test these hypotheses, mean SAT scores, student-to-faculty ratios, and teacher salaries were collected from various online sources (all easy to find) for each of the 50 states. The correlations among these variables are presented below, and are exactly the opposite of what was predicted above: math SAT is positively correlated with student-to-faculty ratio and negatively correlated with teacher salary.

             |    sat_m    ratio   salary
-------------+---------------------------
       sat_m |   1.0000
       ratio |   0.0954   1.0000
      salary |  -0.4013  -0.0011   1.0000

Why are the correlations counter to expectations? The culprit is failure to control for the percentage of students in each state who take the SAT. There is great variation in the proportion of students who take the SAT across states. For example, only 4% of Mississippi students sit for the SAT (i.e., those seeking admission to colleges outside of Mississippi), but 81% of students in Connecticut take the SAT. The SAT is not required by all colleges, and it appears to be both a regional and state preference where some prefer the SAT and others prefer a competitor, the ACT.

Below are the zero-order correlations for Math SAT, ratio, salary, and percent of students who took the SAT (sat_percent).


             |    sat_m    ratio   salary sat_pe~t
-------------+------------------------------------
       sat_m |   1.0000
       ratio |   0.0954   1.0000
      salary |  -0.4013  -0.0011   1.0000
 sat_percent |  -0.8694  -0.2131   0.6168   1.0000

Below are regression results predicting math SAT without sat_percent, and then a second analysis with sat_percent included.

Without percent of SAT takers per state:
------------------------------------------------------------------------------
       sat_m |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ratio |   1.684839   2.357346     0.71   0.478    -3.057529    6.427206
      salary |  -2.714475    .899092    -3.02   0.004    -4.523215    -.905735
       _cons |   574.9211   50.89845    11.30   0.000     472.5266    677.3155
------------------------------------------------------------------------------

With percent of SAT takers per state:
------------------------------------------------------------------------------
       sat_m |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ratio |  -2.261135   1.213925    -1.86   0.069    -4.704639    .1823688
      salary |   1.654988    .574706     2.88   0.006     .4981648    2.811811
 sat_percent |  -1.573511   .1306034   -12.05   0.000    -1.836402    -1.31062
       _cons |   544.7062   25.36266    21.48   0.000     493.6538    595.7586
------------------------------------------------------------------------------

When percent of SAT takers is controlled, the regression coefficients for both ratio and salary are now consistent with expectations: the greater the ratio of students to faculty (i.e., the larger the class size), the lower the mean math SAT score; and the higher the teacher salary, the higher the mean math SAT score.
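The sign flips can be checked from the zero-order correlations alone, because standardized coefficients solve the linear system formed by the predictor correlation matrix and the predictor-outcome correlations. The following is a minimal sketch in plain Python, not the Stata analysis shown above: the helper `solve` and the variable names are illustrative assumptions, and the results are standardized slopes, so only their signs are comparable to the unstandardized coefficients in the tables.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Model without sat_percent: predictors are ratio and salary.
R_without = [[1.0, -0.0011],
             [-0.0011, 1.0]]
r_without = [0.0954, -0.4013]          # correlations with sat_m

# Model with sat_percent: predictors are ratio, salary, sat_percent.
R_with = [[1.0, -0.0011, -0.2131],
          [-0.0011, 1.0, 0.6168],
          [-0.2131, 0.6168, 1.0]]
r_with = [0.0954, -0.4013, -0.8694]    # correlations with sat_m

# Standardized slopes solve R @ beta = r.
beta_without = solve(R_without, r_without)
beta_with = solve(R_with, r_with)

for name, b_j in zip(["ratio", "salary"], beta_without):
    print(f"without sat_percent  {name:12s} {b_j:+.3f}")
for name, b_j in zip(["ratio", "salary", "sat_percent"], beta_with):
    print(f"with sat_percent     {name:12s} {b_j:+.3f}")
```

The standardized slopes for ratio and salary should change sign between the two models, matching the direction of the change in the regression output above.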

score 0 · Answer 2 · answered Apr 05 '18 at 21:52

There is not enough information to give a definitive answer, but from the information given, this seems like partial correlation sign reversal and suppression.

In the case presented, the multiple regression gives c a positive coefficient, whereas the bivariate regression gives c a negative one (as noted in a comment). This is a sign (pardon the pun) of partial correlation sign reversal and suppression.

In addition, a, b and c are each negatively correlated with the outcome but positively correlated with each other. That is another lead. Combined with the sign reversal between the multiple and bivariate regressions, the evidence becomes strong.

The explanation the OP is asking for, as @whuber pointed out, is that in the multiple regression c has a positive association with the outcome once the effects of a and b are taken into account. a, b and c work together to explain the outcome. You may still want to examine why this happens, understand it, and perhaps control for it.
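As a rough check that needs only the posted correlations, the standardized coefficients and the VIFs can be recomputed from the correlation matrix in the question. This is a sketch under the assumption that the regression was fit to the same standardized variables that produced those correlations. The helper `solve` and the variable names are illustrative, and the sign pattern, rather than the exact decimals, is the point.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

names = ["a", "b", "c"]

# Correlations among the predictors a, b, c (from the question).
R_xx = [
    [1.0000000, 0.4222888, 0.7404057],
    [0.4222888, 1.0000000, 0.7030850],
    [0.7404057, 0.7030850, 1.0000000],
]
# Correlation of each predictor with the outcome (from the question).
r_xy = [-0.3330094, -0.5882250, -0.2778692]

# Standardized slopes solve R_xx @ beta = r_xy.
beta = solve(R_xx, r_xy)

# VIF_j is the j-th diagonal element of the inverse of R_xx,
# obtained here by solving against the j-th unit vector.
vif = [
    solve(R_xx, [1.0 if i == j else 0.0 for i in range(3)])[j]
    for j in range(3)
]

for name, b_j, v_j in zip(names, beta, vif):
    print(f"{name}: beta = {b_j:+.2f}, VIF = {v_j:.1f}")
```

All three bivariate correlations with the outcome are negative, yet the solved coefficient for c should come out positive while a and b stay negative, with VIFs well below 5. A moderate VIF does not rule out a sign change.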

I hate to put in links but without the data, I cannot prove my points.

Here is one page on this site that discusses this issue more in depth.

The wikipedia page gives some tests that may help.

Andrew Gelman has a post.
