Fitting a Logistic Regression Without an Intercept

Question

Based on the answer to "Significance of categorical predictor in logistic regression", I tried adding a "-1" to my model to fit it without an intercept and see the correlations directly.

It looks like adding the "-1" only helps for the first of the variables, and doesn't help if there is more than one categorical variable. I tried running it on "overweight ~ race + diet - 1" and then reversing the order of race and diet.

If race is 1st in the formula, then all 4 races show up as significant.

glm(formula = overweight ~ race + diet - 1, family = "binomial",
    data = data)

Coefficients:
           Estimate Std. Error z value Pr(>|z|)
race1   -1.17569    0.07916 -14.851  < 2e-16 ***
race2   -1.77863    0.08446 -21.058  < 2e-16 ***
race3   -1.85692    0.06967 -26.651  < 2e-16 ***
race4   -1.21037    0.07175 -16.869  < 2e-16 ***
diet2   -1.15341    0.09676 -11.921  < 2e-16 ***
diet3   -14.21256  315.57607  -0.045 0.964078
diet4   -1.36219    0.08796 -15.486  < 2e-16 ***
diet5   -2.03216    0.58765  -3.458 0.000544 ***
diet6   -14.09964  186.44637  -0.076 0.939719

When diet is first, race1 is not included in the model, and race4's z value is not significant.

glm(formula = overweight ~ diet + race - 1, family = "binomial",
    data = data)

Coefficients:
           Estimate Std. Error z value Pr(>|z|)
diet1   -1.17569    0.07916 -14.851  < 2e-16 ***
diet2   -2.32910    0.10598 -21.978  < 2e-16 ***
diet3   -15.38825  315.57607  -0.049    0.961
diet4   -2.53788    0.09839 -25.794  < 2e-16 ***
diet5   -3.20785    0.59015  -5.436 5.46e-08 ***
diet6   -15.27533  186.44638  -0.082    0.935
race2   -0.60294    0.10888  -5.538 3.06e-08 ***
race3   -0.68123    0.09790  -6.959 3.44e-12 ***
race4   -0.03468    0.09804  -0.354    0.724

I also tried subtracting 1 after each of the categorical variables, but that didn't add diet1 to the model.

glm(formula = overweight ~ race -1 + diet - 1, family = "binomial",
    data = data) 

Coefficients:
           Estimate Std. Error z value Pr(>|z|)
race1   -1.17330    0.07915 -14.823  < 2e-16 ***
race2   -1.77969    0.08445 -21.073  < 2e-16 ***
race3   -1.85552    0.06968 -26.628  < 2e-16 ***
race4   -1.21214    0.07176 -16.892  < 2e-16 ***
diet2   -1.15544    0.09675 -11.943  < 2e-16 ***
diet3   -14.21292  315.57904  -0.045 0.964077
diet4   -1.36182    0.08796 -15.482  < 2e-16 ***
diet5   -2.01937    0.58772  -3.436 0.000591 ***
diet6   -14.09991  186.44215  -0.076 0.939717

Is there a way to fit multiple categorical variables while keeping all the "categories" in the model? Is there a reason why this shouldn't be done?

In this case, I expect race4 to be statistically significant, but when race1 is used as the reference, race4 is not statistically significant. Is there a way to avoid this?

Adrian · Accepted Answer · 2015-05-28T07:45:27.973

To answer your question "Is there a reason why this shouldn't be done?":

Are you familiar with the concept of linear dependence? The columns of your $X$ matrix must be linearly independent, otherwise there will be multiple coefficient vectors that produce the same fit.

An example:

set.seed(123987)
link <- function(x) exp(x) / (1 + exp(x))
curve(link(x), -5, 5)  # Maps R to (0, 1)
n <- 100
df <- data.frame(x=runif(n, -0.5, 0.5))  # A continuous predictor, x
df$f_1 <- factor(sample(letters[1:3], size=n, replace=TRUE), levels=letters[1:3])  # Factor
colors <- c("green", "purple", "blue")
df$f_2 <- factor(sample(colors, size=n, replace=TRUE), levels=colors)  # A second factor
df$y <- 1 * (runif(n) < link(rnorm(n) + df$x +
                             ifelse(df$f_1=="a", -1, ifelse(df$f_1=="b", 1, 2)) +
                             ifelse(df$f_2=="green", -0.5, ifelse(df$f_2=="purple", 0, 5))))
stopifnot(setequal(unique(df$y), c(0, 1)))

fit <- glm(y ~ x + f_1 + f_2, data=df, family=binomial("logit"))
coefficients(fit)  # Constant, x, f_1b, f_1c, f_2purple, f_2blue

X <- matrix(1, nrow=n, ncol=length(fit$coefficients))  # Manually create X matrix
X[, 2] <- df$x
## No column for "a"
X[, 3] <- 1*(df$f_1 == "b")
X[, 4] <- 1*(df$f_1 == "c")
## No column for "green"
X[, 5] <- 1*(df$f_2 == "purple")
X[, 6] <- 1*(df$f_2 == "blue")
colnames(X) <- c("constant", "x", "f_1b", "f_1c", "f_2purple", "f_2blue")
Y <- matrix(df$y, ncol=1)
colnames(Y) <- "y"
fit2 <- glm(Y ~ 0 + X, family=binomial("logit"))  # X already includes const

## Compare values only (the coefficient names differ), up to numerical tolerance
all.equal(unname(coefficients(fit)), unname(coefficients(fit2)))  # TRUE

# What happens if we drop the constant and put all levels of f_1 and f_2 in our matrix X?
X <- matrix(NA, nrow=n, ncol=length(fit$coefficients) + 1)
X[, 1] <- df$x
X[, 2] <- 1*(df$f_1 == "a")
X[, 3] <- 1*(df$f_1 == "b")
X[, 4] <- 1*(df$f_1 == "c")
X[, 5] <- 1*(df$f_2 == "green")
X[, 6] <- 1*(df$f_2 == "purple")
X[, 7] <- 1*(df$f_2 == "blue")
colnames(X) <- c("x", "f_1a", "f_1b", "f_1c", "f_2green", "f_2purple", "f_2blue")

## The problem with this matrix is that the columns are linearly dependent
X[, 2] + X[, 3] + X[, 4]  # Gives a vector of all 1s -- do you understand why?
X[, 5] + X[, 6] + X[, 7]  # Gives a vector of all 1s, for the same reason
zero_vector <- X[, 2] + X[, 3] + X[, 4] - (X[, 5] + X[, 6] + X[, 7])
all(zero_vector == 0)  # TRUE

In the example above, I first generate some simple example data. I use glm to fit a logistic regression with a constant (and one omitted level for each factor). I then show you how to manually generate the X matrix for that model. Then I generate a new X, which includes all factor levels, and explicitly show you that its columns are linearly dependent.

If you have one factor, you can drop the constant in your model and estimate coefficients for all factor levels. This produces exactly the same fit either way; only the interpretation of the coefficients changes. Without a constant, each coefficient is the log-odds for that factor level; with a constant, each coefficient is the difference in log-odds relative to the excluded baseline level.
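
A minimal sketch of that equivalence, using simulated data rather than the data from the question (the variable names `g`, `eta`, `fit_int`, and `fit_noint` are made up for illustration):

```r
set.seed(1)
n <- 300
g <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
# Hypothetical true log-odds for each level
eta <- c(a = -1, b = 0, c = 1)[as.character(g)]
y <- rbinom(n, 1, plogis(eta))

fit_int   <- glm(y ~ g,     family = binomial)  # constant + contrasts vs. level "a"
fit_noint <- glm(y ~ g - 1, family = binomial)  # one coefficient per level

# Same fit: identical fitted probabilities and deviance
all.equal(unname(fitted(fit_int)), unname(fitted(fit_noint)), tolerance = 1e-6)
all.equal(deviance(fit_int), deviance(fit_noint), tolerance = 1e-6)

# Without a constant, each coefficient is the log-odds of that level's observed proportion
cbind(coef = coef(fit_noint), logit_of_proportion = qlogis(tapply(y, g, mean)))

# With a constant, level "b" is recovered as constant + contrast
unname(coef(fit_int)["(Intercept)"] + coef(fit_int)["gb"])
unname(coef(fit_noint)["gb"])
```

The two parameterisations span the same column space, so nothing about the predictions changes; only the quantity each row of the summary table refers to does.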

But when you have two factors, it doesn't make sense to try to estimate coefficients for all levels of both factors: the dummy columns of each factor sum to a column of ones, so including both full sets creates linearly dependent columns in your X. You always have to drop one level from one factor (or two levels, one from each factor, if you include a constant).
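
As a rough illustration of why the second factor loses a level, the sketch below compares the design matrix R builds for a no-constant formula with a hand-built matrix containing every level of both factors. The small balanced data frame `d` is invented for the example.

```r
# Hypothetical balanced design: every combination of two 3-level factors
d <- expand.grid(f1 = c("a", "b", "c"),
                 f2 = c("green", "purple", "blue"),
                 stringsAsFactors = TRUE)

# What R builds for a formula without a constant:
# all levels of the first factor, one level of the second dropped
M <- model.matrix(~ f1 + f2 - 1, data = d)
colnames(M)
c(columns = ncol(M), rank = qr(M)$rank)            # 5 columns, rank 5

# Forcing every level of both factors into the matrix
M_full <- cbind(model.matrix(~ f1 - 1, data = d),
                model.matrix(~ f2 - 1, data = d))
c(columns = ncol(M_full), rank = qr(M_full)$rank)  # 6 columns, rank 5
```

The sixth column adds no new information, which is why `glm` cannot report a separate estimate for it.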

There is another aspect of your question which is about statistical significance. I think you slightly misunderstand the meaning of the coefficients in your model, and how the interpretation changes depending on whether or not you've included a constant.
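
One way to see the difference is with simulated data in which the factor has no effect at all. This is only a sketch: the variable `race` and the probability 0.25 are invented here and are not taken from the data in the question.

```r
set.seed(3)
n <- 2000
race <- factor(sample(1:4, n, replace = TRUE))
# Hypothetical truth: P(y = 1) is 0.25 in every group, so race has no effect
y <- rbinom(n, 1, 0.25)

fit_noint <- glm(y ~ race - 1, family = binomial)
fit_ref   <- glm(y ~ race,     family = binomial)

# Each row tests "log-odds of this level == 0", i.e. P(y = 1) == 0.5.
# All four rows come out highly significant even though race has no effect.
summary(fit_noint)$coefficients

# The (Intercept) row tests race1 against log-odds 0;
# the race2..race4 rows test a difference from race1.
summary(fit_ref)$coefficients

# A single test for the factor as a whole, which does not depend on
# which level is chosen as the reference
anova(fit_ref, test = "Chisq")
```

Read this way, a "significant" level coefficient in the no-constant model only says that the group's probability differs from 0.5; it says nothing about whether that group differs from the others.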

An example would be helpful. Do you mean that 'race' and 'diet' need to be linearly independent? — Elks, May 28 '15 at 07:23
