
I am trying to run a logistic regression simulation but got stuck. Why do I get only ~71% accuracy even when I use the ground-truth coefficients for prediction?

set.seed(0)
n <- 1e5
p <- 5
X <- matrix(rnorm(n*p), ncol=p)
beta <- runif(p)
y <- rbinom(n,1,prob = plogis(X %*% beta))

Note that we can estimate beta using glm. The estimates are very close to the true values when the sample size is large.

> glm(y~X-1,family="binomial")$coefficients
        X1         X2         X3         X4         X5 
0.68415400 0.59206451 0.29157944 0.84165069 0.08466564 

> beta
[1] 0.68309592 0.60590097 0.30353578 0.83300563 0.07931528

But suppose we use the ground-truth beta directly. Here is the confusion matrix for predictions made with the ground truth:

table(y,plogis(X %*% beta)>0.5)
y   FALSE  TRUE
  0 35499 14425
  1 14456 35620
Haitao Du

1 Answer


Well, because a probability of 0.7 for one class still implies a probability of 0.3 for the other class. Or, put differently, because your $y$ are still sampled from a binomial distribution:

y <- rbinom(n,1,prob = plogis(X %*% beta))
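To make this concrete, here is a short derivation in my own notation (not part of the original simulation): write $p_i$ for the true success probability of observation $i$, i.e. the $i$-th entry of `plogis(X %*% beta)`, and $\hat y_i$ for the prediction obtained by thresholding $p_i$ at 0.5. Then

$$
P(\hat y_i = y_i) = \max(p_i,\, 1 - p_i),
\qquad
\mathbb{E}[\text{accuracy}] = \frac{1}{n} \sum_{i=1}^{n} \max(p_i,\, 1 - p_i).
$$

This expectation equals 1 only if every $p_i$ is exactly 0 or 1. An observation with $p_i = 0.7$ is misclassified 30% of the time even when the true coefficients are used.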

If you don't sample your $y$, but deterministically set them depending on whether your probability exceeds 0.5,

y.new <- plogis(X %*% beta)>0.5

then you get

table(y.new,plogis(X %*% beta)>0.5)

y.new   FALSE  TRUE
  FALSE 49955     0
  TRUE      0 50045

(Which is really not overly surprising, since it's equivalent to table(plogis(X %*% beta)>0.5,plogis(X %*% beta)>0.5).)
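As a rough check of where the ~71% comes from, the sketch below re-creates the simulation from the question and compares the observed accuracy with the accuracy one would expect on average when thresholding the true probabilities. The names `prob`, `pred`, `observed_acc` and `expected_acc` are my own; the key assumption is that each prediction is correct with probability `max(prob, 1 - prob)`.

```r
# Re-create the simulated data from the question.
set.seed(0)
n <- 1e5
p <- 5
X <- matrix(rnorm(n * p), ncol = p)
beta <- runif(p)
prob <- as.vector(plogis(X %*% beta))  # true P(y = 1 | x) for each row
y <- rbinom(n, 1, prob = prob)

# Accuracy actually observed when thresholding the true probabilities at 0.5.
pred <- prob > 0.5
observed_acc <- mean(pred == (y == 1))

# Accuracy to expect on average: each prediction is right with
# probability max(prob, 1 - prob).
expected_acc <- mean(pmax(prob, 1 - prob))

c(observed = observed_acc, expected = expected_acc)
```

The two numbers should land close to each other. In other words, the accuracy is capped by how far the true probabilities are from 0.5, not by how well the coefficients are known.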

Stephan Kolassa
  • Thank you very much! Your answer triggered another strange [question](https://stats.stackexchange.com/questions/282804/can-i-simulate-logistic-regression-without-randomness) from me. Could you help me with that one too? – Haitao Du May 31 '17 at 17:10