Why, oh why, does R automatically eliminate variables from my dataset when running lm?

Question

First of all, forgive my moronosity (new word I made up).

I am running R on data from trees in burn areas. Simple, straightforward dataset. And yet, R for some reason hates chestnut oak. My species are NRO, WO, and CHO. It will not report on CHO, no matter what I try. What gives? Examples of what I see in the console:

data.frame':    72 obs. of  4 variables:
 $ site  : Factor w/ 24 levels "Arbutus Ridge 1 Low/Moderate",..: 1 1 1 2 2 2 3 3 3 4 ...
 $ burn  : Factor w/ 4 levels "Control","High",..: 2 2 2 4 4 4 3 3 3 4 ...
 $ sp    : Factor w/ 3 levels "CHO","NRO","WO": 1 2 3 1 2 3 1 2 3 1 ...
 $ seedba: num  NA NA NA 1.85 NA ...

Call:
lm(formula = seedba ~ burn + sp + burn * sp, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1287 -1.0779 -0.3294  0.1242 12.4723 

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)          0.7720     1.8144   0.426    0.675
burnHigh             0.1979     2.7715   0.071    0.944
burnLow              1.1491     2.7715   0.415    0.683
burnModerate         2.1360     3.1426   0.680    0.505
spNRO                1.1394     2.5659   0.444    0.662
spWO                -0.2964     2.5659  -0.116    0.909
burnHigh:spNRO      -1.4967     4.1901  -0.357    0.725
burnLow:spNRO        4.4422     3.6887   1.204    0.243
burnModerate:spNRO  -2.5252     5.1318  -0.492    0.628
burnHigh:spWO            NA         NA      NA       NA
burnLow:spWO        -1.3814     4.9133  -0.281    0.782
burnModerate:spWO   -2.6117     5.1318  -0.509    0.617

Residual standard error: 3.629 on 19 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.4196,    Adjusted R-squared:  0.1141 
F-statistic: 1.373 on 10 and 19 DF,  p-value: 0.2644

It still won't show CHO:

Call:
lm(formula = seedba ~ -1 + burn + sp + burn * sp, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1287 -1.0779 -0.3294  0.1242 12.4723 

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)
burnControl          0.7720     1.8144   0.426    0.675
burnHigh             0.9700     2.0951   0.463    0.649
burnLow              1.9212     2.0951   0.917    0.371
burnModerate         2.9080     2.5659   1.133    0.271
spNRO                1.1394     2.5659   0.444    0.662
spWO                -0.2964     2.5659  -0.116    0.909
burnHigh:spNRO      -1.4967     4.1901  -0.357    0.725
burnLow:spNRO        4.4422     3.6887   1.204    0.243
burnModerate:spNRO  -2.5252     5.1318  -0.492    0.628
burnHigh:spWO            NA         NA      NA       NA
burnLow:spWO        -1.3814     4.9133  -0.281    0.782
burnModerate:spWO   -2.6117     5.1318  -0.509    0.617

Residual standard error: 3.629 on 19 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.5712,    Adjusted R-squared:  0.323 
F-statistic: 2.301 on 11 and 19 DF,  p-value: 0.05338

It's because you can't have both an intercept and ALL of the tree types; they are perfectly multicollinear in combination and the linear regression will have no unique solution. Try modifying the call to `lm` as follows: `lm(seedba ~ -1 + burn + sp + burn*sp, data=data)`. The `-1` tells `lm` not to use an intercept. — jbowman, Sep 25 '18 at 16:28
This is a FAQ. Please see our threads on [dummary variable coding in regression.](https://stats.stackexchange.com/search?q=dummy+variable+coding+regression) — whuber, Sep 25 '18 at 17:08
Depending on your objectives, you may want to use *car::Anova* and *emmeans::emmeans*. — Sal Mangiafico, Sep 25 '18 at 18:09

score 5 · Answer 1 · answered Sep 25 '18 at 16:47

5

R automatically includes an intercept in regression. If you include an intercept and all categories of a dummy variable, then you have succumbed to the dummy variable trap. R is smart enough to notice when you have fallen to the dummy variable trap, so it will omit a category for you.

If you want to include all levels of a categorical predictor, then you will have to omit the intercept using the -1 or the +0 argument: lm(seedba ~ -1 + burn + sp + burn*sp, data=data). See ?formula for more information.

answered Sep 25 '18 at 16:47

Sycorax

76,417
20
189
313

Thanks very much. Unfortunately, when I run it with the above, it still won't show CHO. I edited my original question to include the output. – Matt Aldrovandi Sep 25 '18 at 20:27
2

If excluding the intercept still results in this behavior, it's possibly because there is still perfect multi-collinearity in your data. On the other hand, R handles `NA` values by deleting the corresponding rows from the analysis, and your printout says that some observations were removed due to missingness. Do instances of CHO have `NA` values? Also, the blob of text you added is hard to read. Please use code formatting. – Sycorax Sep 25 '18 at 20:36
Sorry about that, will do. Yes, I still have NA values. Should I switch them to something? Worried about messing up the data if I make them zeroes. – Matt Aldrovandi Sep 25 '18 at 21:27
Working with missing data is its own topic in statistics; we have a number of posts about [tag:missing-data] and [tag:data-imputation]. – Sycorax Sep 25 '18 at 21:35

Why, oh why, does R automatically eliminate variables from my dataset when running lm?

1 Answers1