
Is the order of variables in an R model formula supposed to matter? For some reason, the two models below produce different coefficients for fm and yr (which are meant to capture fixed effects for fm and yr, respectively):

set.seed(0)

fm = c(rep("A", 5), rep("B", 5), rep("C", 5), rep("D", 5))
yr = rep(c(1985,1986,1987,1988,1989), 4)
d = data.frame(yr, fm, y=rnorm(length(yr)), x=rnorm(length(yr)))

out_1 = lm(y ~ x + factor(yr) + factor(fm) - 1, data = d)
out_2 = lm(y ~ x + factor(fm) + factor(yr) - 1, data = d)
lebedov

2 Answers


Updated:

There is only one intercept in the equation, and it absorbs the base levels: firm A and year 1985 (which is the case in model out_3). In your first model you omit firm A (so it acts as the base), whereas in the second model year 1985 is the base. The coefficients must differ because they are measured against different bases. If you want to fit the fixed effects with an intercept, there should be only 7 dummies (4 for the years and 3 for fm, as in out_3).
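The different bases change the printed coefficients but not the fit itself. A quick check, re-running the question's code and comparing the three parameterizations:

```r
# Rebuild the question's data (same seed) and fit all three parameterizations.
set.seed(0)
fm <- c(rep("A", 5), rep("B", 5), rep("C", 5), rep("D", 5))
yr <- rep(c(1985, 1986, 1987, 1988, 1989), 4)
d  <- data.frame(yr, fm, y = rnorm(length(yr)), x = rnorm(length(yr)))

out_1 <- lm(y ~ x + factor(yr) + factor(fm) - 1, data = d)
out_2 <- lm(y ~ x + factor(fm) + factor(yr) - 1, data = d)
out_3 <- lm(y ~ x + factor(fm) + factor(yr), data = d)

# All three design matrices span the same column space, so the fits agree:
all.equal(fitted(out_1), fitted(out_2))          # TRUE
all.equal(fitted(out_1), fitted(out_3))          # TRUE
all.equal(coef(out_1)[["x"]], coef(out_2)[["x"]])  # TRUE: x is unaffected by the base choice
```

Only the labeling of the dummies differs; predictions, residuals, and the slope on x are identical.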

summary(out_1)

Call:
lm(formula = y ~ x + factor(yr) + factor(fm) - 1, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5538 -0.5543 -0.1142  0.3762  2.1349 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
x                0.1723     0.4289   0.402    0.696
factor(yr)1985   0.7434     0.7129   1.043    0.319
factor(yr)1986   0.2433     0.7349   0.331    0.747
factor(yr)1987   0.5862     0.7015   0.836    0.421
factor(yr)1988   1.1247     0.6945   1.619    0.134
factor(yr)1989   1.0779     0.6981   1.544    0.151
factor(fm)B     -0.8163     0.7025  -1.162    0.270
factor(fm)C     -1.0703     0.7171  -1.493    0.164
factor(fm)D     -1.2179     0.7067  -1.723    0.113

Residual standard error: 1.095 on 11 degrees of freedom
Multiple R-squared: 0.3347, Adjusted R-squared: -0.2096 
F-statistic: 0.615 on 9 and 11 DF,  p-value: 0.7628 

summary(out_2)

Call:
lm(formula = y ~ x + factor(fm) + factor(yr) - 1, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5538 -0.5543 -0.1142  0.3762  2.1349 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
x               0.17232    0.42894   0.402    0.696
factor(fm)A     0.74338    0.71286   1.043    0.319
factor(fm)B    -0.07289    0.69443  -0.105    0.918
factor(fm)C    -0.32696    0.69275  -0.472    0.646
factor(fm)D    -0.47447    0.75862  -0.625    0.544
factor(yr)1986 -0.50005    0.77809  -0.643    0.534
factor(yr)1987 -0.15720    0.82354  -0.191    0.852
factor(yr)1988  0.38128    0.78301   0.487    0.636
factor(yr)1989  0.33454    0.77855   0.430    0.676

Residual standard error: 1.095 on 11 degrees of freedom
Multiple R-squared: 0.3347, Adjusted R-squared: -0.2096 
F-statistic: 0.615 on 9 and 11 DF,  p-value: 0.7628 



out_3 = lm(y ~ x + factor(fm) + factor(yr), data = d)
summary(out_3)

Call:
lm(formula = y ~ x + factor(fm) + factor(yr), data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2632 -0.2938 -0.1132  0.2488  1.2838 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)    -1.04018    0.49474  -2.102  0.05935 . 
x               0.03699    0.22976   0.161  0.87501   
factor(fm)B     0.06037    0.48535   0.124  0.90325   
factor(fm)C    -0.27198    0.49027  -0.555  0.59017   
factor(fm)D     0.34259    0.50111   0.684  0.50833   
factor(yr)1986  1.14448    0.62217   1.839  0.09296 . 
factor(yr)1987  1.39348    0.54603   2.552  0.02690 * 
factor(yr)1988  1.95007    0.54562   3.574  0.00436 **
factor(yr)1989  1.34118    0.57869   2.318  0.04075 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.7643 on 11 degrees of freedom
Multiple R-squared: 0.5898, Adjusted R-squared: 0.2915 
F-statistic: 1.977 on 8 and 11 DF,  p-value: 0.146 
Metrics
  • `out_3` and `out_2` are identical in your output. – mpiktas Jul 03 '13 at 05:19
  • 1
    There is something weird with the output from out_3. It has different figures for RSS etc, and also it should have the same coefficients with half the coefficients of the other models (maybe it is computed for a different data 'd'). – Sextus Empiricus Dec 17 '20 at 14:11
# install.packages("tidyverse")
library(dplyr)
library(tibble)

set.seed(0)

d <- tibble(
  fm = c(rep("A", 5), rep("B", 5), rep("C", 5), rep("D", 5)),
  yr = rep(c(1985, 1986, 1987, 1988, 1989), 4),
  y = rnorm(length(yr)), 
  x = rnorm(length(yr))
  ) %>%
  mutate(shock = ifelse(yr == 1988, 1, 0))

In light of more recent posts on this topic, I thought I would highlight that R uses variable ordering to break collinearity. For example, writing factor(fm) - 1 in the first run and factor(yr) - 1 in the second may give the impression that we are explicitly deciding which factor level to drop; this is not the case. R resolves the collinearity based on the order of the variables in the formula. Here is your first model:

summary(lm(y ~ -1 + x + factor(fm) + factor(yr), data = d))

where I moved the -1 to the right of the tilde to avoid any confusion. Here, the -1 removes the intercept (i.e., column of 1's). The next variable, proceeding from left to right, is the firm fixed effects denoted by factor(fm); all levels will now be estimated. The third variable is the year fixed effects denoted by factor(yr). One level (i.e., 1985) must be dropped. Remember that all columns representing your firm dummies sum to unity, so R is going to drop a year effect to break the collinearity. See the abridged output below:

Call:
lm(formula = y ~ -1 + x + factor(fm) + factor(yr), data = d)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
x               0.17232    0.42894   0.402    0.696
factor(fm)A     0.74338    0.71286   1.043    0.319
factor(fm)B    -0.07289    0.69443  -0.105    0.918
factor(fm)C    -0.32696    0.69275  -0.472    0.646
factor(fm)D    -0.47447    0.75862  -0.625    0.544
factor(yr)1986 -0.50005    0.77809  -0.643    0.534
factor(yr)1987 -0.15720    0.82354  -0.191    0.852
factor(yr)1988  0.38128    0.78301   0.487    0.636
factor(yr)1989  0.33454    0.77855   0.430    0.676

Now suppose we swap the firm and year effects. The model has the same structure as before:

summary(lm(y ~ -1 + x + factor(yr) + factor(fm), data = d))

where we still introduce -1 inside of the lm() function but now the firm fixed effects are third in precedence. As a consequence, software will exclude a firm. As before, your series of year dummies sum to unity, so now a firm effect is dropped from estimation. See the abridged output below:

Call:
lm(formula = y ~ -1 + x + factor(yr) + factor(fm), data = d)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
x                0.1723     0.4289   0.402    0.696
factor(yr)1985   0.7434     0.7129   1.043    0.319
factor(yr)1986   0.2433     0.7349   0.331    0.747
factor(yr)1987   0.5862     0.7015   0.836    0.421
factor(yr)1988   1.1247     0.6945   1.619    0.134
factor(yr)1989   1.0779     0.6981   1.544    0.151
factor(fm)B     -0.8163     0.7025  -1.162    0.270
factor(fm)C     -1.0703     0.7171  -1.493    0.164
factor(fm)D     -1.2179     0.7067  -1.723    0.113
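The dropping happens when the model matrix is built, not during the fit: with the intercept removed, R gives the first factor in the formula full dummy coding and applies treatment contrasts (dropping the first level) to every later factor. You can see this directly by inspecting the column names of the model matrix, using the same simulated data:

```r
# Same data as in the question.
set.seed(0)
fm <- c(rep("A", 5), rep("B", 5), rep("C", 5), rep("D", 5))
yr <- rep(c(1985, 1986, 1987, 1988, 1989), 4)
d  <- data.frame(yr, fm, y = rnorm(length(yr)), x = rnorm(length(yr)))

# fm first: all four fm dummies appear, but factor(yr)1985 is dropped.
colnames(model.matrix(y ~ -1 + x + factor(fm) + factor(yr), data = d))

# yr first: all five yr dummies appear, but factor(fm)A is dropped.
colnames(model.matrix(y ~ -1 + x + factor(yr) + factor(fm), data = d))
```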

Ordering wouldn't matter in this context if the intercept were included. It should be clear that one of the firm dummies and one of the year dummies will be collinear with the column of 1's appended to the model matrix. Run a third model, but on this trial do not drop the intercept; it should result in 3 separate firm effects and 4 separate year effects. I am not sure why the estimates of x in the previous answer aren't all equivalent (well noted by @SextusEmpiricus); they should be identical.
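To confirm that keeping the intercept restores order invariance, refit both orderings with the intercept in place; the coefficient vectors contain the same entries, merely listed in a different order (re-using the question's simulated data):

```r
# Same data as in the question.
set.seed(0)
fm <- c(rep("A", 5), rep("B", 5), rep("C", 5), rep("D", 5))
yr <- rep(c(1985, 1986, 1987, 1988, 1989), 4)
d  <- data.frame(yr, fm, y = rnorm(length(yr)), x = rnorm(length(yr)))

m_fm_first <- lm(y ~ x + factor(fm) + factor(yr), data = d)
m_yr_first <- lm(y ~ x + factor(yr) + factor(fm), data = d)

# Identical estimates once aligned by coefficient name:
all.equal(coef(m_fm_first), coef(m_yr_first)[names(coef(m_fm_first))])  # TRUE
```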

An even better demonstration of how R breaks collinearity is when your model incorporates a redundant predictor at the level of $i$ or $t$. Suppose you suspect a macro-level shock hit all firms in 1988. You model the shock by including a time dummy common to all entities in your panel. I will run one model with the shock preceding the time effects, then a second run with the shock after the time effects. Here is the first run:

Call:
lm(formula = y ~ x + factor(fm) + shock + factor(yr), data = d)

Coefficients: (1 not defined because of singularities)
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      0.7434     0.7129   1.043    0.319
x                0.1723     0.4289   0.402    0.696
factor(fm)B     -0.8163     0.7025  -1.162    0.270
factor(fm)C     -1.0703     0.7171  -1.493    0.164
factor(fm)D     -1.2179     0.7067  -1.723    0.113
shock            0.3813     0.7830   0.487    0.636
factor(yr)1986  -0.5001     0.7781  -0.643    0.534
factor(yr)1987  -0.1572     0.8235  -0.191    0.852
factor(yr)1988       NA         NA      NA       NA
factor(yr)1989   0.3345     0.7785   0.430    0.676

Note the 1988 time shock precedes the year fixed effects in variable ordering. One year effect must be dropped to break the collinearity; R goes to the time effects. Now I will swap the ordering and re-estimate the model:

Call:
lm(formula = y ~ x + factor(fm) + factor(yr) + shock, data = d)

Coefficients: (1 not defined because of singularities)
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      0.7434     0.7129   1.043    0.319
x                0.1723     0.4289   0.402    0.696
factor(fm)B     -0.8163     0.7025  -1.162    0.270
factor(fm)C     -1.0703     0.7171  -1.493    0.164
factor(fm)D     -1.2179     0.7067  -1.723    0.113
factor(yr)1986  -0.5001     0.7781  -0.643    0.534
factor(yr)1987  -0.1572     0.8235  -0.191    0.852
factor(yr)1988   0.3813     0.7830   0.487    0.636
factor(yr)1989   0.3345     0.7785   0.430    0.676
shock                NA         NA      NA       NA

Here, R drops shock as it is specified after the time effects. Remember, the year fixed effects represent the common shocks in each year affecting all firms, and so the time dummy modeling the shock in 1988 is redundant with the year fixed effects.
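Both fits can be reproduced from the question's data; whichever of the two identical columns (shock or the 1988 dummy) appears later in the formula is the one lm() flags as NA:

```r
# Same data as in the question, plus the redundant shock dummy.
set.seed(0)
fm <- c(rep("A", 5), rep("B", 5), rep("C", 5), rep("D", 5))
yr <- rep(c(1985, 1986, 1987, 1988, 1989), 4)
d  <- data.frame(yr, fm, y = rnorm(length(yr)), x = rnorm(length(yr)))
d$shock <- as.numeric(d$yr == 1988)   # duplicates the 1988 dummy exactly

fit_a <- lm(y ~ x + factor(fm) + shock + factor(yr), data = d)
fit_b <- lm(y ~ x + factor(fm) + factor(yr) + shock, data = d)

names(coef(fit_a))[is.na(coef(fit_a))]   # "factor(yr)1988"
names(coef(fit_b))[is.na(coef(fit_b))]   # "shock"
```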

This is yet another example of when ordering matters.

Thomas Bilach