(All of the following is done in R; code to reproduce the dataset is given at the end of this post.)
I have a simulated data set, generated in the following way:
- Make 10 categories and label them 1-10.
- Assign a probability value to each of the categories, so that the first two have probability 0, and the rest have probability drawn uniformly from the interval $[0, 1]$.
- For each category, draw 50 samples from the Bernoulli distribution with success probability equal to the category's assigned probability.
Thus, the dataset has 500 observations; a sample is shown below:
> dt
     names     probs response
  1:     1 0.0000000        0
  2:     1 0.0000000        0
  3:     1 0.0000000        0
  4:     1 0.0000000        0
  5:     1 0.0000000        0
 ---
496:    10 0.9446753        0
497:    10 0.9446753        1
498:    10 0.9446753        1
499:    10 0.9446753        1
500:    10 0.9446753        1
Now, I fit a logistic regression with the formula response ~ names.
My understanding is that the predicted probability assigned to each category is just the mean of the response within that category, and this does hold for the fitted model.
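As a quick check (a sketch, using the dt and lm1 objects created by the code at the end of this post), the fitted probabilities can be compared with the per-category means directly:

# Attach the fitted probabilities, then compare them with the
# per-category mean response; the two columns match exactly.
dt[, fit := predict(lm1, type = 'response')]
dt[, list(mean_response = mean(response), fitted = fit[1]), by = names]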
However, the regression coefficients are all insignificant:
Call:
glm(formula = response ~ names, family = "binomial", data = dt)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.53727  -0.45904  -0.00013   0.54922   2.14597

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.857e+01  9.224e+02  -0.020    0.984
names2       8.728e-11  1.305e+03   0.000    1.000
names3       2.174e+01  9.224e+02   0.024    0.981
names4       1.762e+01  9.224e+02   0.019    0.985
names5       1.762e+01  9.224e+02   0.019    0.985
names6       1.881e+01  9.224e+02   0.020    0.984
names7       2.022e+01  9.224e+02   0.022    0.983
names8       1.637e+01  9.224e+02   0.018    0.986
names9       2.038e+01  9.224e+02   0.022    0.982
names10      2.132e+01  9.224e+02   0.023    0.982

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 692.50  on 499  degrees of freedom
Residual deviance: 343.65  on 490  degrees of freedom
AIC: 363.65
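For reference, cross-tabulating the responses by category (again using dt from the code below) confirms that categories 1 and 2 contain only zeros, exactly as constructed:

# Counts of 0 and 1 responses within each category.
table(dt$names, dt$response)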
Why does this happen, and what does it mean? I have seen other questions dealing with a few insignificant categories, where the advice is to test the significance of adding the categorical variable as a whole using a likelihood-ratio (chi-square) test. In this case the chi-square test shows that adding names to the null model does improve the fit.
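The exact comparison is the anova() call in the code at the end of this post; as a sketch of an equivalent check, drop1() runs the same single-term likelihood-ratio test directly on the full model:

# With names as the only predictor, dropping it reduces lm1 to the null
# model, so this is equivalent to anova(lm0, lm1, test = 'Chisq').
drop1(lm1, test = 'Chisq')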
Code to generate data and models in R:
library(data.table)

# One row per category: categories 1 and 2 get probability 0,
# the remaining eight get probabilities drawn uniformly from [0, 1].
df <- data.frame(names = factor(1:10))
set.seed(0)
df$probs <- c(0, 0, runif(8, 0, 1))

# Draw 50 Bernoulli samples per category (stored as a list column).
df$response <- lapply(df$probs, function(p) rbinom(50, 1, p))

# Expand to one row per observation: 10 categories x 50 draws = 500 rows.
dt <- data.table(df)
dt <- dt[, list(response = unlist(response)), by = c('names', 'probs')]

# Null model and the model with the categorical predictor.
lm0 <- glm(data = dt, formula = response ~ 1, family = 'binomial')
summary(lm0)
lm1 <- glm(data = dt, formula = response ~ names, family = 'binomial')
summary(lm1)

# Likelihood-ratio (chi-square) comparison of the two models.
anova(lm0, lm1, test = 'Chisq')