(All of the following is done in R; code to reproduce the dataset is given at the end of this post.)
I have a simulated data set, generated in the following way:
- Make 10 categories and label them 1-10.
- Assign a probability value to each of the categories, so that the first two have probability 0, and the rest have probability drawn uniformly from the interval $[0, 1]$.
- For each category, draw 50 samples from the Bernoulli distribution with success probability equal to the category's assigned probability.
Thus, the dataset has 500 observations; a sample is shown below:
> dt
     names     probs response
  1:     1 0.0000000        0
  2:     1 0.0000000        0
  3:     1 0.0000000        0
  4:     1 0.0000000        0
  5:     1 0.0000000        0
 ---
496:    10 0.9446753        0
497:    10 0.9446753        1
498:    10 0.9446753        1
499:    10 0.9446753        1
500:    10 0.9446753        1
Now, I fit a logistic regression with the formula response ~ names.
My understanding is that the predicted probability assigned to each category is just the mean of the response within that category, and this does hold for the fitted model.
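As a quick check (a sketch, using the dt and lm1 objects created by the code at the end of this post), the fitted probabilities can be compared with the per-category means directly:

# Attach the fitted probabilities, then compare them with the
# per-category mean response; the two columns match exactly.
dt[, fit := predict(lm1, type = 'response')]
dt[, list(mean_response = mean(response), fitted = fit[1]), by = names]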
However, the regression coefficients are all insignificant:
Call:
glm(formula = response ~ names, family = "binomial", data = dt)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.53727  -0.45904  -0.00013   0.54922   2.14597

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.857e+01  9.224e+02  -0.020    0.984
names2       8.728e-11  1.305e+03   0.000    1.000
names3       2.174e+01  9.224e+02   0.024    0.981
names4       1.762e+01  9.224e+02   0.019    0.985
names5       1.762e+01  9.224e+02   0.019    0.985
names6       1.881e+01  9.224e+02   0.020    0.984
names7       2.022e+01  9.224e+02   0.022    0.983
names8       1.637e+01  9.224e+02   0.018    0.986
names9       2.038e+01  9.224e+02   0.022    0.982
names10      2.132e+01  9.224e+02   0.023    0.982

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 692.50  on 499  degrees of freedom
Residual deviance: 343.65  on 490  degrees of freedom
AIC: 363.65
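For reference, cross-tabulating the responses by category (again using dt from the code below) confirms that categories 1 and 2 contain only zeros, exactly as constructed:

# Counts of 0 and 1 responses within each category.
table(dt$names, dt$response)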
Why does this happen, and what does it mean? I have seen other questions dealing with a few insignificant categories, where the advice is to test the significance of adding the categorical variable as a whole using a likelihood-ratio (chi-square) test. In this case the chi-square test shows that adding names to the null model does improve the fit.
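The exact comparison is the anova() call in the code at the end of this post; as a sketch of an equivalent check, drop1() runs the same single-term likelihood-ratio test directly on the full model:

# With names as the only predictor, dropping it reduces lm1 to the null
# model, so this is equivalent to anova(lm0, lm1, test = 'Chisq').
drop1(lm1, test = 'Chisq')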
Code to generate data and models in R:
library(data.table)

# One row per category: categories 1 and 2 get probability 0,
# the remaining eight get probabilities drawn uniformly from [0, 1].
df <- data.frame(names = factor(1:10))
set.seed(0)
df$probs <- c(0, 0, runif(8, 0, 1))

# Draw 50 Bernoulli samples per category (stored as a list column).
df$response <- lapply(df$probs, function(p) rbinom(50, 1, p))

# Expand to one row per observation: 10 categories x 50 draws = 500 rows.
dt <- data.table(df)
dt <- dt[, list(response = unlist(response)), by = c('names', 'probs')]

# Null model and the model with the categorical predictor.
lm0 <- glm(data = dt, formula = response ~ 1, family = 'binomial')
summary(lm0)
lm1 <- glm(data = dt, formula = response ~ names, family = 'binomial')
summary(lm1)

# Likelihood-ratio (chi-square) comparison of the two models.
anova(lm0, lm1, test = 'Chisq')