Why am I getting low performance using "ground truth of coefficients" for prediction?

Question

I am trying to run a simulation with logistic regression but got stuck. Why do I get only ~71% accuracy even when I use the ground-truth coefficients for prediction?

set.seed(0)
n <- 1e5
p <- 5
X <- matrix(rnorm(n*p), ncol=p)
beta <- runif(p)
y <- rbinom(n,1,prob = plogis(X %*% beta))

Note that we can estimate beta using glm. The estimate is very close to the true value when the sample size is large.

> glm(y~X-1,family="binomial")$coefficients
        X1         X2         X3         X4         X5 
0.68415400 0.59206451 0.29157944 0.84165069 0.08466564 

> beta
[1] 0.68309592 0.60590097 0.30353578 0.83300563 0.07931528

But suppose we use the ground-truth beta directly. Here is the confusion matrix for the resulting predictions:

> table(y,plogis(X %*% beta)>0.5)
y   FALSE  TRUE
  0 35499 14425
  1 14456 35620

score 2 · Accepted Answer · answered May 31 '17 at 06:48

Well, because a probability of 0.7 for one class still implies a probability of 0.3 for the other class, so even the true model will misclassify such observations about 30% of the time. Or, put differently, because your $y$ are still sampled from a binomial distribution:

y <- rbinom(n,1,prob = plogis(X %*% beta))
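
To make the first point concrete, here is a short sketch of the accuracy ceiling. The notation $p(x)$ for the true success probability at covariate value $x$, and $\hat{y}$ for the predicted class, is introduced here for illustration and does not appear in the question.

$$
p(x) = \operatorname{logit}^{-1}(x^\top \beta), \qquad
P(\hat{y} = y \mid x) =
\begin{cases}
p(x) & \text{if } \hat{y} = 1 \\
1 - p(x) & \text{if } \hat{y} = 0
\end{cases}
\;\le\; \max\{p(x),\, 1 - p(x)\}
$$

Thresholding at 0.5 attains this bound at every $x$, so the best achievable accuracy is $E[\max\{p(X),\, 1 - p(X)\}]$, which is below 1 whenever $p(X)$ is not always exactly 0 or 1.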

If you don't sample your $y$, but deterministically set them depending on whether your probability exceeds 0.5,

y.new <- plogis(X %*% beta)>0.5

then you get

table(y.new,plogis(X %*% beta)>0.5)

y.new   FALSE  TRUE
  FALSE 49955     0
  TRUE      0 50045

(Which is really not overly surprising, since it's equivalent to table(plogis(X %*% beta)>0.5,plogis(X %*% beta)>0.5).)
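
As a rough check that ~71% is about what the true coefficients can deliver on this simulation, the sketch below reruns the question's setup and compares the observed accuracy with the average of max(p, 1 - p), that is, the probability that the more likely class is the one actually drawn. The names `prob`, `observed_acc`, and `expected_acc` are introduced here for illustration only.

```r
# Same simulation as in the question
set.seed(0)
n <- 1e5
p <- 5
X <- matrix(rnorm(n * p), ncol = p)
beta <- runif(p)

# True success probabilities under the ground-truth coefficients
prob <- as.vector(plogis(X %*% beta))
y <- rbinom(n, 1, prob = prob)

# Accuracy actually achieved by thresholding the true probabilities at 0.5
observed_acc <- mean((prob > 0.5) == y)

# Accuracy expected from that rule: at each observation the prediction is
# correct with probability max(prob, 1 - prob)
expected_acc <- mean(pmax(prob, 1 - prob))

print(c(observed = observed_acc, expected = expected_acc))
```

The two numbers should agree up to sampling noise. Larger coefficients push the probabilities toward 0 or 1 and raise both.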

Thank you very much! Your answer triggered another strange [question](https://stats.stackexchange.com/questions/282804/can-i-simulate-logistic-regression-without-randomness) from me; could you help me with it as well? — Haitao Du, May 31 '17 at 17:10
