
I am fairly new to this, sorry if I am not clear.

I have two models with log(Y) as the dependent variable. The first model has a slightly larger sample than the second and accounts for 4 main drivers (plus some control variables). In the first model:

X1 = significant
X2 = significant
X3 = significant
X4 = insignificant

The second model has 2 more variables (same control variables) and a slightly smaller sample:

X1 = significant
X2 = significant
X3 = INSIGNIFICANT
X4 = insignificant
X5 = insignificant
X6 = significant

Thus after including the two variables, X3 becomes insignificant in the second model.

First, I checked whether it had something to do with the sample, so I ran the first model's regression on the same sample as the second model. X3 didn't become insignificant. I then included X5 and X6 one by one and found that when I include X6, X3 becomes insignificant. Therefore I assumed there might be a correlation problem and checked for it, but the correlation between X6 and X3 is "only" -0.5. If it is not a correlation problem, then what can it be?

All my X variables are 0/1 dummies and I am using Stata (new to it).

Haitao Du
Jennifer4
    I think this question is within the span of https://stats.stackexchange.com/questions/27257/significant-predictors-become-non-significant-in-multiple-logistic-regression and https://stats.stackexchange.com/questions/3549/why-is-it-possible-to-get-significant-f-statistic-p-001-but-non-significant-r – jld May 01 '17 at 14:33
  • In those cases, correct me if I'm wrong, there are multicollinearity problems. I don't seem to have this problem – Jennifer4 May 01 '17 at 15:09
  • There are plenty of other questions before. Sometimes this may just be chance variation - remember a p-value is a random number and there is no particular magic meaning imparted when a value just below 0.05 changes to just above 0.05. – Björn May 01 '17 at 18:20
  • yes but unfortunately the p value went from .006 to .388, that is a very big increase. – Jennifer4 May 01 '17 at 18:48
  • also I checked the degrees of freedom: F(17, 330) = 60.09 vs. F(19, 304) = 45.04 – Jennifer4 May 01 '17 at 19:09

1 Answer


Remember that the $t$ tests for linear regression test hypotheses of the form $$H_{0,j} : \beta_j = 0 \big\vert \beta_{-j} \hspace{1cm} \textrm{ vs. } \hspace{1cm} H_{1,j} : \textrm{ not } H_{0, j}.$$

Let's say we add a predictor $X_{p+1}$ and all of a sudden $X_1$ goes from significant to insignificant. The most likely explanation is that $X_{p+1}$ explains $X_1$ quite well, so after accounting for the effect of $X_{p+1}$, $X_1$ contributes nothing of value. This is a direct result of correlation between predictors, and you've seen that you do have correlation between yours. The standard diagnostics for investigating this are variance inflation factors (VIFs) and the eigenvalues of $X^T X$; these are much more informative than pairwise correlations.

Note that if your $X$ matrix is orthogonal then you can add and delete predictors (assuming orthogonality is preserved) without affecting any of the other $\beta_j$s.
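To see why the orthogonal case is special, here is a minimal R sketch (the simulated data and variable names are made up for this illustration). With an orthonormal design and no intercept, each $\hat\beta_j = x_j^T y / (x_j^T x_j)$ depends only on its own column, so dropping a predictor leaves the others unchanged:

```r
# Sketch: with orthonormal predictors (and no intercept), deleting a
# column does not change the remaining coefficient estimates.
set.seed(1)
n <- 50
X <- qr.Q(qr(matrix(rnorm(n * 3), n, 3)))  # orthonormal columns
y <- X %*% c(1, 2, 3) + rnorm(n)

full    <- coef(lm(y ~ 0 + X[, 1] + X[, 2] + X[, 3]))
reduced <- coef(lm(y ~ 0 + X[, 1] + X[, 2]))

print(full[1:2] - reduced)  # essentially zero
```

Contrast this with correlated predictors, where every coefficient can shift when a column is added or removed.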

Update

Consider the following example:

set.seed(123)
n <- 100
x0 <- rnorm(n)
x1 <- rnorm(n)
x3 <- rnorm(n)
eps <- rnorm(n)

x2 <- x3 + rnorm(n, 0, 2)

print(cor(x2,x3))
# 0.4144069  # less than your example of 0.5

y <- 1 + x0 + x1 + x3 + eps

dat <- data.frame(y = y, x0 = x0, x1 = x1, x2 = x2, x3 = x3)

mod_no3 <- lm(y ~ x0 + x1 + x2, data = dat)
mod_3 <- lm(y ~ x0 + x1 + x2 + x3, data = dat)

print(summary(mod_no3))  # p-val for x2 is 0.01, definitely significant
print(summary(mod_3))    # p-val for x2 is .8, not even a little significant

car::vif(mod_3)  # none of these are even 1.5
#       x0       x1       x2       x3 
# 1.064094 1.023699 1.282272 1.216728 
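Continuing in the same session, the eigenvalue diagnostic mentioned above can be applied to this design (a sketch; `mod_3` is the model fitted in the example):

```r
# Eigenvalues of X^T X for the fitted design (intercept included);
# a large ratio of largest to smallest eigenvalue signals collinearity.
X <- model.matrix(mod_3)
ev <- eigen(crossprod(X), symmetric = TRUE)$values
print(ev)
print(sqrt(max(ev) / min(ev)))  # condition number of X
```

A rule of thumb sometimes used is that condition numbers well above ~30 indicate collinearity worth investigating; here, as with the VIFs, the diagnostic looks fairly tame even though the collinearity is real.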
jld
  • Thanks for your reply. I checked for VIF in the regression, but they were all 2 or less so no problems. Not familiar with eigenvalues so it is a bit unclear what you mean – Jennifer4 May 01 '17 at 15:03
  • @Jennifer4 what's your sample size? Also, what are the actual p-values? Is the variable going from extremely significant to completely insignificant, or is it more like .045 to .055? – jld May 01 '17 at 15:28
  • The sample size of the first model (where X3 is significant) is 348, the sample size of the second model is 324. I took the same sample as the second model for the first model and X3 stayed significant. The P value in the original first model was .006, in the second model .388 – Jennifer4 May 01 '17 at 17:51
  • The p value in the first model with the reduced sample (i.e. same as the second model) is .055 – Jennifer4 May 01 '17 at 17:59
  • I mean .026 not .055 – Jennifer4 May 01 '17 at 19:08
  • @Jennifer4 i've added an example. My point with this is that multicollinearity can have pretty pronounced effects even without being super obvious. My example has obvious collinearity but the collinearity diagnostics are all less convincing than in your example. Basically, my point is: don't be too quick to rule out the effect of correlation. You know you do have correlation in your data and it's very possible that's what this is, without resorting to more subtle things like covariates correlated with residuals and things like that. – jld May 02 '17 at 02:21
  • Also, I'd really appreciate clarification on the downvote... – jld May 02 '17 at 02:24
  • Thank you, it has become clearer now. I was taught that collinearity only shows problems around 0.8, but that may not be the case then! Also, about the downvote... did I do that? That wasn't my intention; I am new to the site and perhaps I clicked without knowing (I didn't know what the button was for until now) – Jennifer4 May 02 '17 at 08:54
  • @Jennifer4 glad this was helpful! It's surprising how easy it is to get your results affected by correlated predictors, and it can be the case that no pairs are highly correlated but some linear combinations of them are in which case pairwise correlations wouldn't reveal anything at all. And you didn't downvote, you need 125 rep to be able to do that. – jld May 02 '17 at 16:13
  • Yes, I wasn't expecting it either. The two variables are moderately correlated with each other. When I drop X3, X6 remains significant; when I drop X6, X3 becomes significant. Is it possible/recommended to keep both, while acknowledging the "problem"? I want to continue the analysis by splitting the sample and seeing what happens – Jennifer4 May 03 '17 at 14:21
  • @Jennifer4 if you're using the model only for predictions then you'll want them both. But for inference, you'll want to appeal to the science of the problem. If it makes sense to keep both in the model then do so, and it's ok to have a model with scientifically interesting but not significant terms. Also, in your analysis you don't need to have only one model. Be wary of hunting for an optimal feature set: this can lead to overfitting and will bias your p-values, and in general you're best off just fitting the models that scientifically make the most sense and working with that – jld May 05 '17 at 20:17