
I have three factors a, b, and c. In the three univariate models

y ~ a
y ~ b
y ~ c

a and b are insignificant and c is significant. But in the multivariate model

y ~ a + b + c 

a and b are significant and c is insignificant.

I understand how adding a factor can make another factor significant (say, by controlling for a bunch of variance that was obscuring its relationship). And I understand how adding a factor can make another factor insignificant (say, by capturing all of its variance). But I don't understand how a factor that makes the others significant can itself be made insignificant by them. What's a way of thinking about or visualizing this that makes it obvious?

  • I believe that over the years I have covered this topic extensively, emphasizing visual and intuitive explanations. See https://stats.stackexchange.com/a/37715/919, https://stats.stackexchange.com/a/32237/919, https://stats.stackexchange.com/a/28493/919, https://stats.stackexchange.com/a/24529/919, and https://stats.stackexchange.com/a/34813/919, among others. – whuber Sep 01 '17 at 14:33

2 Answers


Are you sure you are using marginal sums of squares, and not sequential sums of squares? Because if you use sequential sums of squares, and c is a linear combination of a and b, then of course entering it third would make it insignificant.
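To see the difference concretely, here is a small sketch (my variable names and coefficients are purely illustrative): in base R, anova() reports sequential (Type I) sums of squares, so each term's test depends on the order in which terms enter the model, while drop1(..., test = "F") reports marginal tests that do not.

```r
# Sequential (Type I) vs. marginal sums of squares in base R.
set.seed(1)
n  <- 200
a  <- rnorm(n)
b  <- rnorm(n)
cc <- a + b + rnorm(n)    # cc is a noisy linear combination of a and b
y  <- a + b + rnorm(n)

m1 <- lm(y ~ a + b + cc)  # cc entered last
m2 <- lm(y ~ cc + a + b)  # cc entered first

anova(m1)                 # sequential: cc tested after a and b -- small SS
anova(m2)                 # sequential: cc tested first -- very different SS
drop1(m1, test = "F")     # marginal: each term tested given all the others
```

With sequential sums of squares, cc looks important when it enters first and useless when it enters last; the marginal tests from drop1() are the same whichever order you write the formula in.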

If you use marginal sums of squares, it is also possible. Say $y = k_1 a + k_2 b$ and $c = k_3 a + k_4 b + \text{noise}$. If the level of this noise is high enough, you may still improve the fit by adding $a$ as a factor on top of $b$ and $c$, or $b$ as a new factor on top of $a$ and $c$. But adding $c$ on top of $a$ and $b$ would not improve anything, as it contains no new useful information beyond $a$ and $b$.
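A minimal R sketch of this construction (my addition; the $k$ values and the noise scale are arbitrary choices):

```r
# y = k1*a + k2*b (plus a little noise); c = k3*a + k4*b + strong noise.
set.seed(123)
n  <- 500
a  <- rnorm(n)
b  <- rnorm(n)
y  <- 2 * a + 3 * b + rnorm(n)   # k1 = 2, k2 = 3
cc <- a + b + rnorm(n, sd = 5)   # k3 = k4 = 1, high-variance noise

summary(lm(y ~ cc))              # cc alone does correlate with y
summary(lm(y ~ a + b + cc))      # but cc adds nothing once a and b are in
```

Here cc is marginally related to y (it is built from a and b), yet in the full model its coefficient is estimating pure noise, while a and b remain overwhelmingly significant.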

EDIT: user ahstat below provided a better response. I will only improve on it a bit: you don't seem to need fancy non-linear functions like min() and max() to make the linear combination of $a$ and $b$ much more powerful than either of them alone. Here's R code:

    n = 1000
    x1 = rnorm(n) 
    x2 = rnorm(n)
    x3 = rnorm(n)*100
    a = x1+x3
    b = x1-x3
    c = x1 + x2
    y = x1

    summary(aov(y~a))
    summary(aov(y~b))
    summary(aov(y~c))
    summary(aov(y~a+b+c))

The scenario that I was trying to solve is a bit stronger: I wanted to find a combination where y~a+b would also not be significant, but y~a+b+c would make $a$ and $b$ significant while leaving $c$ insignificant (in the case of sequential sums of squares). I'm still not sure whether this is only possible when $c$ tips $a$ just over the significance threshold by reducing the residual, or whether one could build some sort of noisy 4-dimensional saddle with negative correlations where this pattern would always hold.

ampanmdagaba
  • Let's say it was sequential sum of squares: then of course being third would make c insignificant, but how could c added third make a and b significant? As for marginal SS, your example is great and concrete. I can see here how c added to a and b would become insignificant but, again, how did it simultaneously make a and b significant? – enfascination Sep 01 '17 at 04:11
  • The only idea I have is to make c a linear combination of a and b (so that it would be insignificant after a+b), but also make it carry a little of explanatory value, so that the residual after a+b+c would be smaller than for a alone. Then if a alone gives you something like p=0.06, reducing the residual may increase F-value just enough to make a significant once c is added. But overall I'm inclined to believe that your n is rather low and so the wild dance of numbers just happened to create this strange situation. – ampanmdagaba Sep 01 '17 at 05:32

You can make a linear combination of $a$ and $b$ significant for fitting $y$ if some specific, non-independent noise enters $a$ and $b$. For example, for some noise $\varepsilon$, if:

$a = \max(y, 0) + \varepsilon,$

$b = \min(y, 0) - \varepsilon,$

then $a + b = y$ exactly.

Here is an example in R.

For the individual regressions, you observe $\text{Pr}(>|t|) = 0.639$ for $a$; $\text{Pr}(>|t|) = 0.617$ for $b$; and $\text{Pr}(>|t|) < 2 \times 10^{-16}$ for $c$ (because $c$ is constructed as $y + \text{noise}$).

But together, you have both $a$ and $b$ significant ($\text{Pr}(>|t|) < 2 \times 10^{-16}$), and there is no variance remaining to make $c$ useful ($\text{Pr}(>|t|) = 0.319$ for the coefficient of $c$).

    set.seed(1111)
    N = 1000
    y = rnorm(N)
    eps = N * rnorm(N)
    a = sapply(y, function(x){max(x, 0)}) + eps
    b = sapply(y, function(x){min(x, 0)}) - eps
    c = y + rnorm(N, 0, 4)

    reg_a = lm(y ~ a)
    summary(reg_a)
    plot(a, y)
    abline(reg_a)

    reg_b = lm(y ~ b)
    summary(reg_b)
    plot(b, y)
    abline(reg_b)

    reg_c = lm(y ~ c)
    summary(reg_c)
    plot(c, y)
    abline(reg_c)

    reg = lm(y ~ a + b + c)
    summary(reg)
ahstat