
I'm fitting a fixed effects model with plm and know that I'm dealing with multicollinearity between two of the independent variables. I'm working on identifying multicollinearity in models as practice: I identified the offending variable with alias(), then verified it with vif(). I was also able to use kappa() to show a very large condition number, confirming the multicollinearity.

My question is: why does plm() omit this collinear variable from the coefficients? There is no output explaining why, and I couldn't find anything in the documentation. Stata automatically omits such a variable, and I'm curious whether plm() runs a similar check and then omits it.

Collinear variable: dfmfd98
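For reference, the diagnostics mentioned above can be run like this (a sketch; it assumes the `data` object from the reproducible example below and that the car package is installed for vif()):

```r
library(car)  # for vif()

## Fit a plain lm() with the same formula to inspect the design matrix
fit <- lm(lexptot ~ year + dfmfdyr + dfmfd98 + nh, data = data)

alias(fit)                              # lists linearly dependent terms, if any
vif(fit)                                # variance inflation factors; values >> 10 suggest collinearity
kappa(model.matrix(fit), exact = TRUE)  # condition number of the design matrix
```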

Reproducible example:

dput output:

data <- 
structure(list(lexptot = c(8.28377505197124, 9.1595012302023, 
8.14707583238833, 9.86330744180814, 8.21391453619232, 8.92372556833205, 
7.77219149815994, 8.58202430280175, 8.34096828565733, 10.1133857229336, 
8.56482997492403, 8.09468633074053, 8.27040804817704, 8.69834992618814, 
8.03086333985764, 8.89644392254136, 8.20990433577082, 8.82621293136669, 
7.79379981225575, 8.16139809188569, 8.25549748271241, 8.57464947213076, 
8.2714431846277, 8.72374048671495, 7.98522888221012, 8.56460042433047, 
8.22778847721461, 9.15431416391622, 8.25261818916933, 8.88033778695326
), year = c(0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 
1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 
1L), dfmfdyr = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 
0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0), dfmfd98 = c(1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 0, 0, 0, 0), nh = c(11054L, 11054L, 11061L, 11061L, 
11081L, 11081L, 11101L, 11101L, 12021L, 12021L, 12035L, 12035L, 
12051L, 12051L, 12054L, 12054L, 12081L, 12081L, 12121L, 12121L, 
13014L, 13014L, 13015L, 13015L, 13021L, 13021L, 13025L, 13025L, 
13035L, 13035L)), .Names = c("lexptot", "year", "dfmfdyr", "dfmfd98", 
"nh"), class = c("tbl_df", "data.frame"), row.names = c(NA, -30L
))

Regression code:

library(plm)
mod <- plm(lexptot ~ year + dfmfdyr + dfmfd98 + nh, data = data, model = "within", index = "nh")
summary(mod)

Output:

Oneway (individual) effect Within Model

Call:
plm(formula = lexptot ~ year + dfmfdyr + dfmfd98 + nh, data = data, 
    model = "within", index = "nh")

Balanced Panel: n=15, T=2, N=30

Residuals :
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-4.75e-01 -1.69e-01  4.44e-16  1.69e-01  4.75e-01 

Coefficients :
        Estimate Std. Error t-value Pr(>|t|)  
year     0.47552    0.23830  1.9955  0.06738 .
dfmfdyr  0.34635    0.29185  1.1867  0.25657  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    5.7882
Residual Sum of Squares: 1.8455
R-Squared      :  0.68116 
      Adj. R-Squared :  0.29517 
F-statistic: 13.8864 on 2 and 13 DF, p-value: 0.00059322
Amstell
    Yes, plm silently drops perfectly collinear variables. Unfortunately, perfect collinearity is not always easy to see, especially for the FE and RE models, as they involve a data transformation. The recent development version of `plm` has a function `detect_lin_dep` to detect perfectly collinear variables even after the data transformation. Its documentation includes examples where linear dependence turns up only after the transformation: https://r-forge.r-project.org/R/?group_id=406 – Helix123 Mar 30 '16 at 08:28
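Applying `detect_lin_dep` to this example might look like the following sketch (the function is named `detect_lin_dep` in the development version referenced above; later plm releases also provide it as `detect.lindep`):

```r
library(plm)

## Refit the within model from the question
mod <- plm(lexptot ~ year + dfmfdyr + dfmfd98 + nh,
           data = data, model = "within", index = "nh")

## Check the transformed model matrix for linearly dependent columns
detect_lin_dep(model.matrix(mod))
```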

1 Answer


I was a bit curious about your question and did some brief research. I think the problem you're experiencing might be due to one of the following (these are just some ideas):

Your variable dfmfd98 is indeed highly collinear and, thus, the plm code simply drops it from the output (I've tried to find that segment of code, but couldn't so far; the code is IMHO not trivial. You can try yourself: https://github.com/rforge/plm/tree/master/pkg/R).

I doubt it has anything to do with collinearity, but you might want to consider specifying that particular variable as a dummy via factor(). I tried that as well, but the variable was still missing from the plm output. This is unlike the same approach with lm(), which provides the expected output (for simplicity, I didn't create a separate dummy variable):

lm2 <- lm(lexptot ~ year + dfmfdyr + factor(dfmfd98) + nh, data = data)
summary(lm2)

Call:
lm(formula = lexptot ~ year + dfmfdyr + factor(dfmfd98) + nh, 
    data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77581 -0.23704 -0.01301  0.17039  1.16883 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       9.183e+00  1.347e+00   6.818 3.81e-07 ***
year              4.755e-01  2.660e-01   1.788    0.086 .  
dfmfdyr           3.463e-01  3.258e-01   1.063    0.298    
factor(dfmfd98)1 -1.774e-01  2.361e-01  -0.751    0.459    
nh               -7.343e-05  1.072e-04  -0.685    0.500    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4206 on 25 degrees of freedom
Multiple R-squared:  0.4769,    Adjusted R-squared:  0.3933 
F-statistic: 5.699 on 4 and 25 DF,  p-value: 0.002112

Using dummy variables for categorical predictors in multiple linear regression models is widely written about, but you might find this nice blog post interesting, as it contains examples with plm in an econometrics context (in fact, the whole series on econometrics with R might be of interest to you - for example, this initial blog post on panel data methods). Also, check the following hopefully relevant discussions on Cross Validated: this and this.

P.S. Perhaps, you're familiar with fixed/random effects terminology issues in econometrics, but it was new to me. In case you're curious, you can read about it in Section 7.2 of this JSS paper.
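As a side note on a likely mechanism (my assumption, not something I verified in the plm sources): dfmfd98 is constant within each nh household, so the within (demeaning) transformation turns it into a column of zeros, which the estimator then has to drop. A minimal check with the question's `data`:

```r
## dfmfd98 does not vary within a household (nh), so subtracting the
## group mean (the within transformation) leaves only zeros:
demeaned <- with(data, dfmfd98 - ave(dfmfd98, nh))
all(demeaned == 0)  # TRUE for the data above: no within variation left
```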

Aleksandr Blekh
    Thanks for your response. I have looked through the `plm()` code as well and wasn't able to figure it out. The website you posted for "surviving graduate econometrics" is a great resource and I've combed through it many times. Regarding the cross post, I believe "A possible reason might be that your dummies do not vary over time" may be the problem. But I can't confirm this through `plm()`, and given such a small example above, I'm not sure this is the issue either. – Amstell Mar 14 '15 at 06:31
  • @Amstell: You're welcome. I'm glad you like my answer. I think the code in question is not necessarily located within the `plm()` function itself, but potentially in some other functions and modules. I've looked around, but couldn't find it. – Aleksandr Blekh Mar 14 '15 at 06:55