Don't we still have the issue of perfect collinearity because $\text{male} + \text{female} = 1$?
You would, since the levels sum to unity. But software will invariably force a referent by dropping one level. In R, for example, the first factor level (alphabetical by default) is the referent and gets absorbed into the intercept.
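A quick way to see which level R will treat as the referent is to inspect the model matrix directly. Here is a minimal sketch (the toy factor g is only for illustration):
g <- factor(c("male", "female", "female", "male"))
levels(g)                        # "female" "male": female comes first, so it is the referent
head(model.matrix(~ g))          # an intercept column plus a single dummy for "male"
g2 <- relevel(g, ref = "male")   # relevel() switches the referent explicitly
head(model.matrix(~ g2))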
How has removing the intercept helped us?
It is almost never a good idea to manually remove the intercept. You're forcing the linear relationship (approximation) to go through the origin. Here is your model:
$$
y = \beta_0 + \beta_1 \text{Gender} + \epsilon,
$$
where setting the condition that $\beta_0 = 0$ presupposes that the expected value of your outcome $y$ when $\text{Gender} = 0$ is naught. You may also find your $R^{2}$ goes through the roof, which has nothing to do with a better model fit. I suppose we should try a few models to see this in action. Here is some fake data:
library(dplyr)

# simulate 20 observations: a two-level gender factor and an outcome
# that does not actually depend on gender
set.seed(13)
n <- 20

fake_df <- data.frame(
  gender = sample(c("m", "f"), size = n, replace = TRUE),
  y = rnorm(n, 100, 20)
) %>%
  mutate(male = ifelse(gender == "m", 1, 0))  # dummy code: 1 = male, 0 = female
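As a quick sanity check, we can tabulate the simulated data to confirm the dummy coding lines up with the gender labels:
count(fake_df, gender, male)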
First, try regressing y on gender. The level denoting females is absorbed into the intercept. Note that the value of each category is appended to the variable name in the model summary. See the output below:
summary(lm(y ~ gender, data = fake_df))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 97.028 5.885 16.486 2.63e-12 ***
genderm -7.190 9.948 -0.723 0.479
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.22 on 18 degrees of freedom
Multiple R-squared: 0.0282, Adjusted R-squared: -0.02579
F-statistic: 0.5224 on 1 and 18 DF, p-value: 0.4791
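As a quick check on the interpretation (not part of the original output), the intercept should match the mean of y among females, and the intercept plus the genderm coefficient should match the male mean:
fake_df %>%
  group_by(gender) %>%
  summarise(group_mean = mean(y))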
Dropping the intercept, R now reports a separate coefficient for each level (the group means) rather than a contrast against the referent. Note we can achieve the same result by replacing the 0 with -1; I suppose specifying 0 after the tilde is more explicit for this demonstration. See below:
summary(lm(y ~ 0 + gender, data = fake_df))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
genderf 97.028 5.885 16.49 2.63e-12 ***
genderm 89.838 8.020 11.20 1.52e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.22 on 18 degrees of freedom
Multiple R-squared: 0.9567, Adjusted R-squared: 0.9518
F-statistic: 198.6 on 2 and 18 DF, p-value: 5.403e-13
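This is also where the inflated $R^{2}$ comes from. When the model has no intercept, summary.lm() reports an uncentered $R^{2}$, comparing the residual sum of squares against sum(y^2) rather than against the variation around mean(y). A minimal sketch of the difference:
fit0 <- lm(y ~ 0 + gender, data = fake_df)
rss <- sum(resid(fit0)^2)
1 - rss / sum(fake_df$y^2)                      # uncentered: what summary() reports here
1 - rss / sum((fake_df$y - mean(fake_df$y))^2)  # centered: comparable to the first model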
Incorporating separate indicator variables for each level without dropping the intercept results in redundancy; one level must be discarded. Manually dropping the intercept removes that redundancy, but it forces the regression through the origin, which changes what the coefficients mean.
In practice, you should dummy code your variables. It will clearly show what each variable denotes. The following output is equivalent to the first model:
summary(lm(y ~ male, data = fake_df))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 97.028 5.885 16.486 2.63e-12 ***
male -7.190 9.948 -0.723 0.479
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.22 on 18 degrees of freedom
Multiple R-squared: 0.0282, Adjusted R-squared: -0.02579
F-statistic: 0.5224 on 1 and 18 DF, p-value: 0.4791
Here, male is a dummy variable: males equal 1, 0 otherwise. Once we drop the intercept, we impose a restriction that we cannot know for certain is true: we assume the expected value of $y$ given a person is female (i.e., male == 0) is zero. We cannot know this in practice. See below:
summary(lm(y ~ 0 + male, data = fake_df))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
male 89.84 31.32 2.868 0.00984 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 82.87 on 19 degrees of freedom
Multiple R-squared: 0.3021, Adjusted R-squared: 0.2654
F-statistic: 8.226 on 1 and 19 DF, p-value: 0.009845
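To see the residual problem mentioned below concretely, compare the mean residual with and without the intercept:
mean(resid(lm(y ~ male, data = fake_df)))      # essentially zero by construction
mean(resid(lm(y ~ 0 + male, data = fake_df)))  # not zero: female fitted values are forced to 0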
In sum, while we removed the collinearity problem by dropping the intercept, we introduced a whole new set of problems. Again, we usually don't know that the intercept is equal to 0. If it isn't, the coefficient estimates will be biased and the residuals will no longer average to zero. By leaving in the intercept, we ensure the mean of the residuals is zero.