Don't we still have the issue of perfect collinearity because $\text{male} + \text{female} = 1$?
You would, since the levels sum to unity. But software will invariably force a referent by dropping one level. In R, for example, the first factor level (alphabetical by default) is the referent and gets absorbed into the intercept.
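A quick way to see which level R will treat as the referent is to inspect the model matrix directly. Here is a minimal sketch (the toy factor g is only for illustration):
g <- factor(c("male", "female", "female", "male"))
levels(g)                        # "female" "male": female comes first, so it is the referent
head(model.matrix(~ g))          # an intercept column plus a single dummy for "male"
g2 <- relevel(g, ref = "male")   # relevel() switches the referent explicitly
head(model.matrix(~ g2))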
How has removing the intercept helped us?
It is almost never a good idea to manually remove the intercept. You're forcing the linear relationship (approximation) to go through the origin. Here is your model:
$$
y = \beta_0 + \beta_1 \text{Gender} + \epsilon,
$$
where setting the condition that $\beta_0 = 0$ presupposes that the expected value of your outcome $y$ when $\text{Gender} = 0$ is naught. You may also find your $R^{2}$ goes through the roof, which has nothing to do with a better model fit. I suppose we should try a few models to see this in action. Here is some fake data:
library(dplyr)

# simulate 20 observations: a two-level gender factor and an outcome
# that does not actually depend on gender
set.seed(13)
n <- 20

fake_df <- data.frame(
  gender = sample(c("m", "f"), size = n, replace = TRUE),
  y = rnorm(n, 100, 20)
) %>%
  mutate(male = ifelse(gender == "m", 1, 0))  # dummy code: 1 = male, 0 = female
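As a quick sanity check, we can tabulate the simulated data to confirm the dummy coding lines up with the gender labels:
count(fake_df, gender, male)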
First, try regressing y on gender. The level denoting females is absorbed into the intercept. Note that the value of each category is appended to the variable name in the model summary. See the output below:
summary(lm(y ~ gender, data = fake_df))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 97.028 5.885 16.486 2.63e-12 ***
genderm -7.190 9.948 -0.723 0.479
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.22 on 18 degrees of freedom
Multiple R-squared: 0.0282, Adjusted R-squared: -0.02579
F-statistic: 0.5224 on 1 and 18 DF, p-value: 0.4791
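As a quick check on the interpretation (not part of the original output), the intercept should match the mean of y among females, and the intercept plus the genderm coefficient should match the male mean:
fake_df %>%
  group_by(gender) %>%
  summarise(group_mean = mean(y))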
Dropping the intercept, R now reports a separate coefficient for each level (the group means) rather than a contrast against the referent. Note we can achieve the same result by replacing the 0 with -1; I suppose specifying 0 after the tilde is more explicit for this demonstration. See below:
summary(lm(y ~ 0 + gender, data = fake_df))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
genderf 97.028 5.885 16.49 2.63e-12 ***
genderm 89.838 8.020 11.20 1.52e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.22 on 18 degrees of freedom
Multiple R-squared: 0.9567, Adjusted R-squared: 0.9518
F-statistic: 198.6 on 2 and 18 DF, p-value: 5.403e-13
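This is also where the inflated $R^{2}$ comes from. When the model has no intercept, summary.lm() reports an uncentered $R^{2}$, comparing the residual sum of squares against sum(y^2) rather than against the variation around mean(y). A minimal sketch of the difference:
fit0 <- lm(y ~ 0 + gender, data = fake_df)
rss <- sum(resid(fit0)^2)
1 - rss / sum(fake_df$y^2)                      # uncentered: what summary() reports here
1 - rss / sum((fake_df$y - mean(fake_df$y))^2)  # centered: comparable to the first model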
Incorporating separate indicator variables for each level without dropping the intercept results in redundancy; one level must be discarded. Manually dropping the intercept removes that redundancy, but it forces the regression through the origin, which changes what the coefficients mean.
In practice, you should dummy code your variables. It will clearly show what each variable denotes. The following output is equivalent to the first model:
summary(lm(y ~ male, data = fake_df))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 97.028 5.885 16.486 2.63e-12 ***
male -7.190 9.948 -0.723 0.479
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.22 on 18 degrees of freedom
Multiple R-squared: 0.0282, Adjusted R-squared: -0.02579
F-statistic: 0.5224 on 1 and 18 DF, p-value: 0.4791
Here, male is a dummy variable: males equal 1, 0 otherwise. Once we drop the intercept, we impose a restriction that we cannot know for certain is true: we assume the expected value of $y$ given a person is female (i.e., male == 0) is zero. We cannot know this in practice. See below:
summary(lm(y ~ 0 + male, data = fake_df))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
male 89.84 31.32 2.868 0.00984 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 82.87 on 19 degrees of freedom
Multiple R-squared: 0.3021, Adjusted R-squared: 0.2654
F-statistic: 8.226 on 1 and 19 DF, p-value: 0.009845
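To see the residual problem mentioned below concretely, compare the mean residual with and without the intercept:
mean(resid(lm(y ~ male, data = fake_df)))      # essentially zero by construction
mean(resid(lm(y ~ 0 + male, data = fake_df)))  # not zero: female fitted values are forced to 0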
In sum, while we removed the collinearity problem by dropping the intercept, we introduced a whole new set of problems. Again, we usually don't know that the intercept is equal to 0. If it isn't, the coefficient estimates will be biased and the residuals will no longer average to zero. By leaving in the intercept, we ensure the mean of the residuals is zero.