What you have described is a classic example of the phenomenon known as "confounding." For the sake of argument, suppose you want to know what factors affect the price of a car, and the original model you fitted was:
$Price_i = \beta_0 + \beta_1 MPG_i + \beta_2 Weight_i + \beta_3 Length_i + \beta_4 GearRatio_i + \varepsilon_i$
where $MPG_i$ is the number of miles per gallon car $i$ gets.
The regression results are as follows:
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 4, 69) = 10.93
Model | 246385405 4 61596351.2 Prob > F = 0.0000
Residual | 388679991 69 5633043.35 R-squared = 0.3880
-------------+------------------------------ Adj R-squared = 0.3525
Total | 635065396 73 8699525.97 Root MSE = 2373.4
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg | -90.8697 82.54167 -1.10 0.275 -255.5358 73.79643
weight | 5.330082 1.259779 4.23 0.000 2.816892 7.843272
length | -112.6501 39.26864 -2.87 0.005 -190.9889 -34.31134
gear_ratio | 1747.338 940.8806 1.86 0.068 -129.6674 3624.343
_cons | 7909.196 6803.245 1.16 0.249 -5662.907 21481.3
------------------------------------------------------------------------------
$Weight$ and $Length$ are significantly associated with price at the 5% level, whereas $GearRatio$ is significant only at the 10% level. In this example I will use the 10% significance level often used in econometrics rather than the customary 5% used in statistics/biostatistics.
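For reference, a minimal Stata sketch of how a model of this form can be fit on the built-in auto data (shown only to make the setup concrete, not as a claim that it reproduces the exact table above):
sysuse auto, clear                          // load the built-in 1978 Automobile Data
regress price mpg weight length gear_ratio  // original model: no country of origin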
Now suppose you realize that the country of origin of the car might have something to do with the price, so you add "Country of origin" ($Country$), a variable with four categories (1. USA, 2. Japan, 3. Germany, and 4. France/Italy), to your model as a set of dummy variables with "USA" as the reference (omitted) category. The resulting model is as follows:
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 7, 66) = 7.05
Model | 271664993 7 38809284.6 Prob > F = 0.0000
Residual | 363400404 66 5506066.72 R-squared = 0.4278
-------------+------------------------------ Adj R-squared = 0.3671
Total | 635065396 73 8699525.97 Root MSE = 2346.5
---------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
mpg | -43.63664 88.87729 -0.49 0.625 -221.0859 133.8126
weight | 5.627906 1.277128 4.41 0.000 3.078037 8.177775
length | -108.6306 40.96925 -2.65 0.010 -190.4283 -26.83285
gear_ratio | 1036.988 1011.416 1.03 0.309 -982.369 3056.344
|
country |
Germany | 1474.478 786.7092 1.87 0.065 -96.23774 3045.193
Japan | 1508.771 931.8605 1.62 0.110 -351.7485 3369.291
France/Italy | 1513.169 1660.423 0.91 0.365 -1801.972 4828.311
|
_cons | 6825.621 6936.845 0.98 0.329 -7024.236 20675.48
---------------------------------------------------------------------------------
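(A hedged sketch of how this second model can be fit in Stata: the shipped auto.dta only carries a two-category foreign indicator, so the four-category $Country$ variable used here is assumed to have been constructed separately, coded 1 = USA, 2 = Japan, 3 = Germany, 4 = France/Italy. With factor-variable notation, the lowest-numbered category, USA, serves as the base by default.)
regress price mpg weight length gear_ratio i.country  // i.country expands into country dummies with USA as the reference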
When we added $Country$ to the model, $GearRatio$ was no longer significant at the 10% level, and $MPG$ moved even further from significance (p was 0.28 in the original model and 0.63 after adding $Country$). We also note that the only significant category of $Country$ (at the 10% level) was $Germany$.
How do we interpret these results?
- Recall that dummy variables are entered into the model as a set of $(N-1)$ dummy variables, where $N$ is the number of categories in the original variable. Recall also that each dummy is interpreted relative to the excluded (reference) category. It is therefore normal for some dummy variables not to be significant if the difference between that category and the reference category is not significant. In our example, German cars are on average USD 1,474.48 more expensive than American cars, holding the other regressors constant, whereas Japanese and French/Italian cars are not significantly different from American cars in terms of $Price$. If you want to know whether the construct you entered as dummy variables is significant as a whole, you need an F-test of the joint significance of the dummies: the p-value reported for each dummy only tells you whether that category differs from the reference, not whether $Country$ as a whole is significantly associated with $Price$:
test Germany Japan FranceItaly
( 1) Germany = 0
( 2) Japan = 0
( 3) FranceItaly = 0
F( 3, 66) = 1.53
Prob > F = 0.2148
It turns out $Country$ as a whole is not a significant predictor of price (p=0.21), although German cars are significantly more expensive than American cars in this model.
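As an aside, if the dummies are generated with factor-variable notation (i.country, as in the sketch above) rather than created by hand, the same joint test can be obtained with testparm after the regression:
testparm i.country   // joint F-test that all country dummies are zero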
- We also noted that a variable that was significant ($GearRatio$) became non-significant after adding $Country$. This means that in the model omitting $Country$, the parameter estimate for $GearRatio$ "absorbed" the effect of $Country$: $Country$ is associated with both $GearRatio$ and $Price$, and failing to control for $Country$ biased the parameter estimate of $GearRatio$, making it seem more significant than it really is. In other words, the "significant" effect of $GearRatio$ on $Price$ in the original model was actually reflecting the effect of $Country$ on $Price$. Once $Country$ is controlled for, $GearRatio$ appears to have no independent association with the $Price$ of a car. (A quick way to check the $Country$-$GearRatio$ link is sketched after this discussion.)
Of course, the reverse can be true too: you CAN have something that was not significant become significant after adding variables to the model. The logic is the same. The originally non-significant variable was associated with the omitted variable, so its estimate reflected the effect of the omitted variable in addition to its own effect (plus some other unobservables, which we will ignore for the sake of argument); in that case the absorbed effect works against the variable's own effect, pushing the estimate toward zero. When you add the omitted variable (the dummies) to the model, the originally non-significant variable no longer captures the partial effect of the omitted variable and now reflects the "true" effect of that variable...which, it turns out, is significantly associated with the outcome.
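To see the mechanism more formally, recall the textbook omitted-variable-bias result in a simplified two-regressor setting: if the true model is $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$ and you regress $y$ on $x_1$ alone, the estimated slope satisfies $E[\tilde{\beta}_1] = \beta_1 + \beta_2 \delta_1$, where $\delta_1$ is the slope from regressing $x_2$ on $x_1$. That bias term, $\beta_2 \delta_1$, is exactly what the short model "absorbs," and it vanishes only when the omitted variable has no effect ($\beta_2 = 0$) or is unrelated to the included regressor ($\delta_1 = 0$). A hedged Stata sketch of how the $Country$-$GearRatio$ association mentioned above can be checked, again assuming the constructed country variable:
regress gear_ratio i.country   // does country of origin predict gear ratio?
testparm i.country             // joint F-test of the country dummies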
(Data: Stata built-in dataset "1978 Automobile Data" from http://www.stata-press.com/data/r13/auto.dta)