
I have a data set:

dat = data.frame(y = c(.3, .5, .3, .6, .8, .9, .3, .6),
                 group1 = c(1, 1, 2, 2, 1, 1, 2, 2),
                 group2 = c("a", "a", "b", "b", "c", "c", "e", "e"))

l = lm(y ~ factor(group1) + factor(group2), data = dat)
l
model.matrix(l)

The coefficient for factor(group2)e is NA:

Coefficients:
    (Intercept)  factor(group1)2  factor(group2)b  factor(group2)c  factor(group2)e  
      4.000e-01        5.000e-02        3.786e-18        4.500e-01               NA  

here is model.matrix(l)

  (Intercept) factor(group1)2 factor(group2)b factor(group2)c factor(group2)e
1           1               0               0               0               0
2           1               0               0               0               0
3           1               1               1               0               0
4           1               1               1               0               0
5           1               0               0               1               0
6           1               0               0               1               0
7           1               1               0               0               1
8           1               1               0               0               1

Is that because factor(group1)2 = factor(group2)b + factor(group2)e?

To fix this, I would remove either group1 or group2 from the model, correct?
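The suspected dependence can be checked directly from the design matrix (a quick sketch, reusing `dat` and `l` from above):

```r
dat <- data.frame(y = c(.3, .5, .3, .6, .8, .9, .3, .6),
                  group1 = c(1, 1, 2, 2, 1, 1, 2, 2),
                  group2 = c("a", "a", "b", "b", "c", "c", "e", "e"))
l <- lm(y ~ factor(group1) + factor(group2), data = dat)
X <- model.matrix(l)

# Is the group1 dummy an exact sum of two group2 dummies?
all(X[, "factor(group1)2"] == X[, "factor(group2)b"] + X[, "factor(group2)e"])
# [1] TRUE

# The design matrix has 5 columns but only rank 4, so lm() drops one term
qr(X)$rank
# [1] 4
```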

user3022875

1 Answer


The NA stems from perfect multicollinearity in your data, yes.

I think you could still retain both groups by including only interaction dummies, i.e. a dummy for every combination of (group1 value, group2 value). Based on your data, the observed combinations are (1, a), (1, c), (2, b) and (2, e).

The drawback of this approach is that every cell is treated independently: 'switching' group1 from 1 to 2 has a different marginal effect depending on the value of group2. In your particular case, this approach leads to a simple within-cell average for the cells where you have data (the other cells are not identified). Interpretation of the output is easier when we remove the intercept, which also avoids another version of the perfect multicollinearity problem (the dummy variable trap):

dat$cells <- paste0(dat$group1, dat$group2)
lm(formula = y ~ factor(cells) - 1, data = dat)

# Call:
# lm(formula = y ~ factor(cells) - 1, data = dat)
#
# Coefficients:
# factor(cells)1a  factor(cells)1c  factor(cells)2b  factor(cells)2e  
#            0.40             0.85             0.45             0.45  
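These coefficients are just the within-cell means of y, which you can verify with `tapply` (a quick check, rebuilding the fit from the data above):

```r
dat <- data.frame(y = c(.3, .5, .3, .6, .8, .9, .3, .6),
                  group1 = c(1, 1, 2, 2, 1, 1, 2, 2),
                  group2 = c("a", "a", "b", "b", "c", "c", "e", "e"))
dat$cells <- paste0(dat$group1, dat$group2)
fit <- lm(y ~ factor(cells) - 1, data = dat)

# mean of y per cell; tapply orders results by the sorted cell labels,
# which matches the order of the coefficients
cell_means <- tapply(dat$y, dat$cells, mean)
all.equal(unname(coef(fit)), unname(cell_means))
# [1] TRUE
```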
KenHBS
  • Since the second group predicts the first one, why retain the first group as a variable at all? – whuber Aug 15 '17 at 21:39
  • In this dataset, yes. I'm not sure whether the perfect multicollinearity is a result of `group2` predicting `group1` or just a lack of data and bad luck – KenHBS Aug 15 '17 at 21:42
  • This is just a toy dataset. My real data is the case where they are perfect colinear so I will remove one. Thank you. – user3022875 Aug 15 '17 at 21:43