
Let's say that you're fitting a model to predict some score as a linear function of two categorical inputs. Something like the following data set:

score   x   y
---------------------
46.12   A   B
42.61   A   F
39.14   B   A
57.72   B   C
48.65   B   D
56.29   B   E
45.45   B   F
42.43   C   B
52.68   C   E
43.03   D   B
52.54   D   C
44.32   E   B
54.43   E   C

For interpretability, I would like to treat each input category as an indicator variable. That is, each data point has an indicator (with value 1) for the intercept, for its x category, and for its y category. However, the default coding behavior for lm() is to fold the first level of each factor into the intercept, which throws away the xA and yA predictors in my example (the design-matrix check below makes this explicit):

Call:
lm(formula = score ~ x + y, data = d)

Residuals:
         1          2          3          4          5          6          7          8          9         10         11         12         13 
 4.281e-01 -4.281e-01  8.882e-16  1.171e-01 -4.441e-16 -5.452e-01  4.281e-01 -5.452e-01  5.452e-01  2.086e-01 -2.086e-01 -9.143e-02  9.143e-02 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.1562     1.0316  36.019 4.71e-05 ***
xB            1.9838     0.7379   2.689 0.074504 .  
xC           -2.7167     0.8324  -3.264 0.047005 *  
xD           -2.8705     0.8212  -3.495 0.039610 *  
xE           -1.2805     0.8212  -1.559 0.216828    
yB            8.5357     0.9824   8.688 0.003209 ** 
yC           18.4629     0.9439  19.561 0.000292 ***
yD            9.5100     1.0195   9.328 0.002609 ** 
yE           17.6952     0.9569  18.492 0.000345 ***
yF            5.8819     0.9569   6.147 0.008662 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7209 on 3 degrees of freedom
Multiple R-squared:  0.9964,    Adjusted R-squared:  0.9856 
F-statistic:  92.5 on 9 and 3 DF,  p-value: 0.001657
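
For reference, the design matrix shows the dropped columns directly; a quick check, assuming the data above live in a data frame d:

## Columns are (Intercept), xB..xE and yB..yF -- no xA or yA column:
colnames(model.matrix(score ~ x + y, data = d))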

Of course I can achieve what I want with other machine-learning libraries like scikit-learn, but they don't provide standard errors or p-values out of the box. Is there a way to force R to explicitly code every level of every categorical variable in the model (in addition to the intercept)?

EDIT: To be clear, I have multiple categorical inputs, and I want to keep the intercept. I did a pretty thorough search of this site (and others) and haven't found a discussion of how to do this yet. Also, the categories are not ordinal.

burr
  • A good list of very closely related threads can be found with [a suitable search of this site](http://stats.stackexchange.com/search?q=dummy+categorical+intercept+reference). – whuber Jun 03 '14 at 20:42
  • See edit. I have found many threads on this site (and others), but none that address this particular situation (yet). – burr Jun 04 '14 at 16:37

2 Answers


If you have a single categorical regressor, you can estimate coefficients for all of its levels by dropping the intercept. Otherwise the model is not identifiable unless you add restrictions.

One such restriction is that the coefficients of the levels of a categorical variable sum to zero. You can impose it in R by telling lm() to use contrasts of type sum; see the help page for lm and compare:

> lm(weight ~ group)

Call:
lm(formula = weight ~ group)

Coefficients:
(Intercept)     groupTrt  
      5.032       -0.371  

with:

> lm(weight ~ group - 1, contrasts = list(group = contr.sum))

Call:
lm(formula = weight ~ group - 1, contrasts = list(group = contr.sum))

Coefficients:
groupCtl  groupTrt  
   5.032     4.661  
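
Note that in the second call, with a single factor and no intercept, R expands the factor into all of its levels regardless of the contrasts argument, so the coefficients above are simply the per-level means. To see the sum-to-zero restriction at work with the intercept kept, and with the question's two factors, a sketch along the following lines should do (assuming the data frame is called d; fit.sum is just an illustrative name):

## Sum-to-zero ("deviation") coding for both factors, intercept kept:
fit.sum <- lm(score ~ x + y, data = d,
              contrasts = list(x = contr.sum, y = contr.sum))
summary(fit.sum)  # coefficients are labelled x1..x4 and y1..y5

## The last level of each factor is not displayed; its effect is minus
## the sum of the displayed effects for that factor:
cf <- coef(fit.sum)
-sum(cf[grep("^x", names(cf))])  # effect of level E of x
-sum(cf[grep("^y", names(cf))])  # effect of level F of y
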
F. Tusell

You can fit your model by:

fit.0 <- lm(score ~ 0 + x + y)  # 0: intercept = 0

or equivalently

fit.1 <- lm(score ~ x + y - 1)  # -1: get rid of the intercept

For example:

> fit.0 <- lm(score ~ 0 + x + y)
> summary(fit.0)

Call:
lm(formula = score ~ 0 + x + y)

Residuals:
         1          2          3          4          5          6          7 
 4.281e-01 -4.281e-01  6.661e-16  1.171e-01 -1.665e-16 -5.452e-01  4.281e-01 
         8          9         10         11         12         13 
-5.452e-01  5.452e-01  2.086e-01 -2.086e-01 -9.143e-02  9.143e-02 

Coefficients:
   Estimate Std. Error t value Pr(>|t|)    
xA  37.1562     1.0316  36.019 4.71e-05 ***
xB  39.1400     0.7209  54.293 1.38e-05 ***
xC  34.4395     1.0316  33.385 5.91e-05 ***
xD  34.2857     1.0465  32.764 6.25e-05 ***
xE  35.8757     1.0465  34.283 5.46e-05 ***
yB   8.5357     0.9824   8.688 0.003209 ** 
yC  18.4629     0.9439  19.561 0.000292 ***
yD   9.5100     1.0195   9.328 0.002609 ** 
yE  17.6952     0.9569  18.492 0.000345 ***
yF   5.8819     0.9569   6.147 0.008662 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7209 on 3 degrees of freedom
Multiple R-squared:  0.9999,    Adjusted R-squared:  0.9998 
F-statistic:  5873 on 10 and 3 DF,  p-value: 3.294e-06

Note that one level is still absorbed: yA does not appear, because with two categorical inputs the model would not be identifiable otherwise (see the other answer). Also be aware that the R-squared is not the same as the one you got (0.9964), because R uses a different formula when you fit a model with no intercept: the total sum of squares is computed around zero instead of around the mean of the response.
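
The two formulas, spelled out (assuming the question's data frame d; fit.wi is just an illustrative name for the with-intercept fit):

fit.0  <- lm(score ~ 0 + x + y, data = d)  # the no-intercept fit from above
fit.wi <- lm(score ~ x + y, data = d)      # the original with-intercept fit

## With an intercept, the total sum of squares is taken around the mean:
1 - sum(resid(fit.wi)^2) / sum((d$score - mean(d$score))^2)  # 0.9964

## Without one, it is taken around zero, which inflates R-squared:
1 - sum(resid(fit.0)^2) / sum(d$score^2)                     # 0.9999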

However, you can get the same R-squared by

> R_squared <- cor(score, predict(fit.0))^2
> R_squared
[1] 0.9964092


Sergio