
How can I explain that a value, first positive correlated with the target, has a negative influence on the target in my model?

I have built a linear model in R, with 30 variables and 60 observations.

First step: feature selection (performed with 10-fold CV, searching for the minimum MSE). Second step: build the linear model with the best number of features.

library(leaps)
reg.best <- regsubsets(target ~ ., data = data, nvmax = 30)
coef(reg.best, 17)  # 17 was the number of features with the lowest CV error
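The CV step that picked 17 could look something like this (a sketch, not the exact code used; it assumes the outcome column in `data` is called `target` and uses the usual `predict.regsubsets` helper, since `leaps` has no built-in `predict()` method):

# Refit regsubsets on each training fold and keep the subset size with the
# smallest average held-out MSE.
predict.regsubsets <- function(object, newdata, id, ...) {
  form  <- as.formula(object$call[[2]])   # formula used in the regsubsets call
  mat   <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  drop(mat[, names(coefi)] %*% coefi)
}

set.seed(1)
k      <- 10
folds  <- sample(rep(1:k, length.out = nrow(data)))
cv.mse <- matrix(NA, k, 30)

for (j in 1:k) {
  fit <- regsubsets(target ~ ., data = data[folds != j, ], nvmax = 30)
  for (i in 1:30) {
    pred <- predict(fit, data[folds == j, ], id = i)
    cv.mse[j, i] <- mean((data$target[folds == j] - pred)^2)
  }
}
which.min(colMeans(cv.mse))   # number of features with the smallest CV MSE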

model <- lm(target ~ X1 + X2 + ... + X17, data = data)   # the 17 selected features

If I put the 'best' features in X, the result looks good (small p-values, all variables significant, and an adjusted R² of 0.95).

The thing that I can't explain to the business:

cor(data) 
gives a correlation between X1 and Y: 0.82(!)

summary(model) 
       Estimate       Std. Error t value       Pr(>|t|)    
X1 -0.9100893836144  0.1611999070127  -5.646 0.000001044953 ***

How can I explain that a value, first positive correlated with the target, has a negative influence on the target in my model?

R overflow
    In a world where you *only* have the covariates in your original model (the 17 X's), when your value of X1 goes up by 1, your value of the target goes down by 0.91. Though the values are positively related in the real world, which contains an infinite number of variables, within the limited scope of your model the program only identified the best-fit beta value, which turned out to be negative. This could mean your model does not have high predictive value (ability to predict future values). – ERT Jul 06 '18 at 17:03
    Please see https://stats.stackexchange.com/questions/31841/coefficients-change-signs/32237#32237, *inter alia,* and then consult any other likely hits after searching our site: https://stats.stackexchange.com/search?q=+significant+regression+change+sign – whuber Jul 06 '18 at 20:31

2 Answers


Because in a linear model, the coefficient for any X_i is the conditional effect of X_i on Y, keeping everything else constant.

As an exaggerated example:

y - bad outcome of some disease

x1 - seeing the doctor

x2 - some blood test proportional to severity of disease


If you do cor(y, x1) you'd get a positive result: seeing the doctor correlates with bad outcomes (because only people who are doing badly go to the doctor). But if you do y ~ x1 + x2, then the coefficient of x1 should be negative: conditional on the severity of the disease (x2), seeing the doctor is actually helpful and should reduce the probability of the bad outcome.
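A quick simulation (with made-up numbers, just to show the mechanism) reproduces this sign flip:

set.seed(42)
n        <- 1000
severity <- rnorm(n)                              # x2: blood test / disease severity
doctor   <- rbinom(n, 1, plogis(2 * severity))    # x1: sicker people see the doctor more often
outcome  <- 2 * severity - 1 * doctor + rnorm(n)  # y: severity hurts, the visit helps

cor(outcome, doctor)                     # positive: visits "correlate" with bad outcomes
coef(lm(outcome ~ doctor + severity))    # the coefficient on doctor is negative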

f.g.

One possible explanation for what you are observing is confounding. Generally, confounding occurs when a variable that is associated with both the target (outcome) and the predictor of interest is not included in the analysis. Confounding may result in biased coefficient estimates or even reverse the direction of association. In your case, when you omitted the other variables from the regression, there was a positive association between Y and X1 (your predictor of interest). However, once you accounted for potential confounding variables, you obtained an inverse relationship between Y and X1, adjusting for the other confounders. Please find two basic explanations of confounding here:

  1. https://www.psychologyinaction.org/psychology-in-action-1/2011/10/30/what-is-a-confounding-variable
  2. https://www.r-bloggers.com/example-9-20-visualizing-simpsons-paradox/

The R example above shows a switch in the direction of association when additional variables are added. To get further insight into your specific problem, here are some ad-hoc approaches I have implemented before to identify potential confounding variables:

  1. Examine the scatter plots of each of the variables in your final model against X1. Try to see if there are variables that are strongly positively correlated with X1 but negatively correlated with the outcome (target variable).
  2. If your final model is very simple (i.e. not too many predictors), start with a model of Y and X1 and then build up to your final model by sequentially adding one variable at a time (see the sketch below). Try to observe whether the relationship between Y and X1 suddenly flips or changes when a specific variable is added. This may be one way to think about which of the variables in your final model is strongly confounding the relationship between Y and X1.
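A rough sketch of approach 2 (it assumes the outcome is called `target` and that the names of your selected predictors, with `X1` first, are stored in a character vector `vars`):

vars <- c("X1", "X2", "X3")   # replace with the predictors from your final model, X1 first

for (i in seq_along(vars)) {
  fit <- lm(reformulate(vars[1:i], response = "target"), data = data)
  cat("first", i, "predictors: coef(X1) =", round(coef(fit)["X1"], 3), "\n")
}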
user3487564