Suddenly, all covariates and factors become very significant

Question

I am running a multinomial logistic regression model in R using the command multinom. My response variable, $y$, is a factor with three levels of length 40, 25 and 5 each level. I have several factors and covariates to model it, say $x_1,\ldots,x_n$.

At a first step, I wrote $y\sim x_1+x_2$ in the multinom function. As a result, I obtained some "usual" p-values for the levels of $x_1$ and $x_2$: 0.05, 0.61, 0.92... Then I added a third factor/covariate, $x_3$: $y\sim x_1+x_2+x_3$. Suddenly, nearly all p-values were exactly 0, even for levels of $x_1$ and $x_2$ that had not been significant before adding $x_3$.

What is the reason for that? If I keep adding more factors in the model, then all p-values become $0$.

I joined the second and third levels of $y$ (the ones of length 25 and 5) to have a dichotomic variable, so that I could use the glm function with family binomial. In this glm model, I started writting $y\sim x_1+x_2$, $y\sim x_1+x_2+x_3$, etc, and the problem I mentioned above did not appear.

So, is this a problem of the multinom function? Is this a problem of the third level of $y$, which has a very small size?

Edit: Here is an example:

y <- c(1, 1, 1, 1, 1 ,1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 2 ,2 ,2 ,2 ,2 ,1 ,1 ,2 ,3 ,2 ,1 ,2 ,1 ,1 ,1 ,2 ,1 ,1 ,3 ,2 ,1 ,1 ,1, 2, 1, 1, 2, 2, 2, 1, 3, 2, 1,1, 2, 1 ,1 ,1 ,1 ,2 ,2 ,2 ,2, 2, 2, 2, 1, 2, 3, 1, 1, 1, 1, 1, 2)
x1 <- c(1 ,2, 1, 1, 4 ,1, 1, 3 ,1 ,1 ,3 ,1, 1 ,1, 1, 2, 2, 4, 2, 2 ,1 ,2 ,1, 3, 2, 2 ,2 ,4 ,4, 1, 4, 1, 1, 3, 2, 1 ,1, 1, 2, 1, 1, 2, 2, 1, 1, 3, 2 ,1,1, 2, 4, 1, 1, 1, 2, 2, 4, 2, 1, 2, 2, 1, 1, 3, 1, 1, 1, 1, 1, 2)
x2 <- c(1 ,1 ,1 ,1 ,1 ,1 ,1 ,2 ,1 ,1 ,2 ,1 ,1 ,1 ,1 ,1 ,2 ,1 ,1 ,1 ,1 ,1 ,1 ,3, 2 ,1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 2, 1,1 ,2 ,2 ,1 ,1 ,1 ,3 ,1 ,1 ,2 ,1 ,1 ,1 ,1 ,1 ,2 ,1 ,1 ,2 ,1 ,1 ,1 )
x3 <- c(1530, 15120,   891,  1452 ,15000,  6498  , 400 ,47880,  4608,  8976 ,34040,  7980  ,1152,   637, 17820 ,25839,70875  ,1260 ,28210 ,18200,  2184,   936,  1056 ,90804 ,28290  ,6480 ,14040,   550,  8712,  2028 ,14000,  2640,2560 ,41184, 41984,  8448,  1200,  3136,  6090  ,7581,  4788, 10710,   396,  7260,  2816 ,29799 ,21793, 24960,6072, 23064,  4680 ,25296,  6160,   672, 16875, 35360, 10450, 43200, 17204 ,21576 ,17820,  6650,  5130,  4104,4104, 12768,  9504,  1188 ,23460 ,16500)
y <- as.factor(y)
x1 <- as.factor(x1)
x2 <- as.factor(x2)

library(nnet)

# Two explanatory factors:
model <- multinom(y ~ x1+x2)
summary(model)
stad <- summary(model)$coefficients / summary(model)$standard.errors
p_values <- 2 * (1 - pnorm(abs(stad),0,1))
p_values
> p_values
   (Intercept)          x12       x13        x14       x22       x23
2 8.350025e-05 1.315633e-05 0.9798046 0.06366562 0.6952859 0.9892148
3 9.169932e-01 9.957820e-01 0.9753906 0.99891637 0.9992514 0.9889991

# Three explanatory factors/covariates:
model <- multinom(y ~ x1+x2+x3)
summary(model)
stad <- summary(model)$coefficients / summary(model)$standard.errors
p_values <- 2 * (1 - pnorm(abs(stad),0,1))
p_values
> p_values
  (Intercept) x12 x13 x14 x22 x23          x3
2           0   0   0   0   0   0 0.004544893
3           0   0   0   0   0   0 0.700799167

How many observations are in your dataset? I suspect your design matrix is not full rank. — StatsStudent, Jun 20 '17 at 15:39
Sorry I missed that. Can you also state how many levels there are of each x value? Some output would be helpful. — StatsStudent, Jun 20 '17 at 15:42
Could you maybe share your code with us? Just to make sure you have no typos? — Vasilis Vasileiou, Jun 20 '17 at 15:54
Note that your coefficients are tending to infinity I suspect separation or the Hauck-Donner effect. Try searching this site for those terms. — mdewey, Jun 20 '17 at 16:19
@mdewey Is there a way to solve this problem? For example, is it correct to just compare the models $y\sim x_1+x_2$ and $y\sim x_1+x_3$ and not take into account $y\sim x_1+x_2+x_3$? And why if I join levels 2 and 3 of y and perform a glm in R instead of a multinom the "perfect separation" problem does not appear? — user39756, Jun 20 '17 at 16:39
That's more of an answer than I have time to write, but I'll point you to the questions tagged for the Hauck-Donner effect: https://stats.stackexchange.com/questions/tagged/hauck-donner-effect — Aaron left Stack Overflow, Jun 20 '17 at 18:40

score 1 · Answer 1 · answered Jun 20 '17 at 16:29

1

Looks likely to be perfect separation. There's just not a lot of data in some of those blocks. And while you are fitting an additive model, which this plot doesn't quite capture, it does seem quite likely that adding that third variable is causing this. Note also how huge your estimates are. For more on perfect separation, see for example, Will the p value become useless in such case: logistic regression with perfect separation?

library(ggplot2)
d <- data.frame(y=y, x1=x1, x2=x2, x3=x3)
ggplot(d) + aes(x3, y) + geom_point() + facet_grid(x1~x2)

answered Jun 20 '17 at 16:29

Aaron left Stack Overflow

6,808
29
51

Thank you for your answer. Is there a way to solve this problem? For example, is it correct to just compare the models $y\sim x_1+x_2$ and $y\sim x_1+x_3$ and not take into account $y\sim x_1+x_2+x_3$? And why if I join levels 2 and 3 of y and perform a glm in R instead of a multinom the "perfect separation" problem does not appear? – user39756 Jun 20 '17 at 16:36

Suddenly, all covariates and factors become very significant

1 Answers1