
Let's say that I generate observations from logistic regression:

n <- 1000  # number of observations
p <- 5     # number of covariates
u <- dnorm(runif(n * p, min = 0, max = 1))
x <- matrix(u, nrow = n, byrow = TRUE)
b <- c(2, 1, 2, 3, 6, -3)
prob <- 1 / (1 + exp(-b %*% t(cbind(1, x))))
y <- rbinom(n, 1, prob)
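One thing worth noticing: `dnorm(runif(...))` evaluates the standard normal *density* at uniform draws, so every covariate lands in roughly [0.24, 0.40]. Combined with the mostly large, positive coefficients, this pushes almost every fitted probability close to 1. A quick sanity check (a sketch that reruns the simulation above; `set.seed` is added here for reproducibility and is not in the original):

```r
# Quick check of outcome prevalence (sketch; reruns the simulation above)
set.seed(1)                 # not in the original; added for reproducibility
n <- 1000; p <- 5
u <- dnorm(runif(n * p))    # density values lie in roughly [0.242, 0.399]
x <- matrix(u, nrow = n, byrow = TRUE)
b <- c(2, 1, 2, 3, 6, -3)
prob <- 1 / (1 + exp(-b %*% t(cbind(1, x))))
y <- rbinom(n, 1, prob)

range(u)    # covariates are all positive and tightly bunched
table(y)    # nearly every response is 1, so the data carry little information
```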

But now if I fit a model:

df <- data.frame(y, x)
mod <- glm(y ~ ., data = df, family = "binomial")
coefficients(mod)

The coefficients I get are:

(Intercept)          X1          X2          X3          X4          X5 
 -7.0004768  16.5132458  -0.6912072  11.7006775   1.5552072   7.0992499 

It may well be a very basic question showing a lack of knowledge, but why are the coefficients returned by glm different from those in b? A shot in the dark: while generating y I rely on a probability fed to a binomial draw, so there is room for uncertainty. Am I right? In the sense that drawing from Binomial(4, 0.5) I could get e.g. both (1,1,0,0) and (0,1,0,0).

amal
  • Running the code a couple of times reveals that the sampling variability of the coefficients is rather large. Additionally, nearly all the y values are 1, meaning there isn't much to learn here (there are at times only 3 zero observations). Increasing `n` by a couple of orders of magnitude rectifies the problem. I show how to simulate some responses [here](https://stats.stackexchange.com/questions/519198/unable-to-get-correct-coefficients-for-logistic-regression-in-simulated-dataset/519215#519215). – Demetri Pananos Apr 25 '21 at 12:55
  • @Demetri Pananos so am I at least partly right? That it is due to the fact that we draw y from Binomial distribution so y is not deterministic and may vary? – amal Apr 25 '21 at 13:05
  • Yes, you're right. It's just that the simulated data are not ideal because the prevalence of the outcome is so high. – Demetri Pananos Apr 25 '21 at 13:54
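Following the comment above, here is a sketch of the same simulation with `n` increased by two orders of magnitude. With many more observations (and therefore many more of the rare y = 0 cases), the estimates should sit much closer to b, though the high prevalence of the outcome still limits precision:

```r
# Same simulation with a much larger sample (sketch, per the comment above)
set.seed(1)                 # added for reproducibility; not in the original
n <- 1e5; p <- 5
u <- dnorm(runif(n * p))
x <- matrix(u, nrow = n, byrow = TRUE)
b <- c(2, 1, 2, 3, 6, -3)
prob <- 1 / (1 + exp(-b %*% t(cbind(1, x))))
y <- rbinom(n, 1, prob)

# glm expands the matrix x into one column per covariate
mod <- glm(y ~ x, family = binomial)
coefficients(mod)   # estimates should be much closer to b than with n = 1000
```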

0 Answers