Simulating a logistic regression in R

Question

I'm trying to simulate data for a logistic regression experiment that predicts the pass/fail outcomes of $50$ students in a math course from their GRE quant. scores.

GRE quant. scores are known to be normally distributed with $\mu = 153$ and $\sigma = 9$.

However, my logistic function results in Inf. How can I fix this problem?

 n = 50               # number of students
 x = rnorm(n, 153, 9) # GRE Quant. scores of students
B0 = 150              # Average GRE Quant. score of test takers
B1 = 5                # Capable of increasing prob. of passing math course
 p = exp(B0 + B1*x)/(1+exp(B0 + B1*x)) # logistic function ????? The problem is HERE
 y = rbinom(n, 1, p)  # pass/fail outcome

Where did your values of the parameters `B0` and `B1` come from? You're getting infinities because you're passing very large values into `exp`, which comes from your very large intercept `B0` added to the large values of `x * B1` (on average, `5 * 153`). — Matthew Drury, Jul 01 '20 at 17:43

score 1 · Answer 1 · answered Jul 01 '20 at 17:47

Well, the problem seems to me to be related to your "latent" index. If you create a variable `ystar = B0 + B1*x` and look at its distribution, you will see that it takes very large values, so that mathematically p is always 1. (Technically, the values are so large that `exp(ystar)` overflows and is stored as `Inf`, and `Inf/(1 + Inf)` then evaluates to `NaN` in R.)

Try restarting your problem with values of `B0` and `B1` chosen so that the distribution of ystar stays within "sensible values" (I would say between -5 and +5).
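To see this concretely, here is a minimal sketch that inspects the latent index under the original coefficients and then under smaller ones. The replacement values `B0 = -15.3` and `B1 = 0.1` are assumptions picked only so that ystar is centered near 0 for a score of 153; they are not estimates of anything. The seed is arbitrary.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 50
x <- rnorm(n, 153, 9)            # GRE Quant. scores, as in the question

# Original coefficients: the latent index is in the hundreds
ystar_bad <- 150 + 5 * x
range(ystar_bad)
exp(max(ystar_bad))              # overflows to Inf (doubles overflow above roughly exp(709.78))

# Assumed, illustrative coefficients that keep ystar within a few units of 0
B0 <- -15.3                      # chosen so that B0 + B1 * 153 = 0
B1 <- 0.1
ystar <- B0 + B1 * x
range(ystar)

p <- plogis(ystar)               # equals exp(ystar) / (1 + exp(ystar)), computed stably
y <- rbinom(n, 1, p)             # pass/fail outcome
table(y)
```

Using `plogis()` instead of writing out the ratio avoids the `Inf/Inf` problem, but with the original coefficients it would still return 1 for every student, so the coefficients have to change either way.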

HTH

score 1 · Accepted Answer · answered Jul 01 '20 at 17:59

Your justification for beta-naught, `B0`, is that 150 is the "Average GRE Quant. score of test takers", but that's not what an intercept is. The intercept is the log of the odds of passing the math course for a student whose quantitative GRE score is 0 (note that it isn't actually possible to score 0; the minimum score is 130, so this is a score 130 points worse than missing everything). In other words, your posited data generating process isn't remotely sensible.

One thing that can be done with regression models to make the intercept more relevant is to center your X data. That is, subtract off the mean from every data point. This won't really have any effect on the model or the information it provides, but does change the meaning of the estimated intercept. Namely, the intercept will become the log of the odds of passing the math course for those whose quantitative GRE score is typical. That will probably benefit you here. From that point, I would think about the probability of passing that you believe (or want to simulate) for average test takers. Then convert that probability to a log odds and use that value for your simulation.
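A minimal sketch of that approach follows. The passing probability for an average test taker (`p_avg = 0.7`) and the slope (`B1 = 0.1` log odds per GRE point) are assumptions made up for illustration; substitute whatever values you actually believe or want to simulate.

```r
set.seed(2)                      # arbitrary seed, only for reproducibility
n  <- 50
x  <- rnorm(n, 153, 9)           # GRE Quant. scores, as in the question
xc <- x - mean(x)                # centered scores: 0 now means "typical score"

p_avg <- 0.7                     # assumed probability of passing at a typical score
B0 <- qlogis(p_avg)              # convert to log odds: log(p_avg / (1 - p_avg))
B1 <- 0.1                        # assumed change in log odds per GRE point

p <- plogis(B0 + B1 * xc)        # inverse logit of the linear predictor
y <- rbinom(n, 1, p)             # pass/fail outcome

# Fit the model to the simulated data and compare with B0 and B1
fit <- glm(y ~ xc, family = binomial)
coef(fit)
```

With only 50 students the estimates can land fairly far from the values used to simulate, so increasing `n` is a quick way to check that the simulation behaves as intended.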


score 0 · Answer 3 · answered Jul 01 '20 at 22:11

More generally, the Monte Carlo inversion technique can be employed here: set a randomly generated uniform deviate $U_i$ equal to the logistic cumulative distribution function (CDF), which is a function of $x_i$ (with known $\mu, s$), and solve for $x_i$.

So, upon setting ${U_i}$ to the Logistic CDF:

$U_i = F(x_i; \mu, s) = \frac{1}{1 + \exp(-(x_i - \mu)/s)}$

And, on solving for $ {x_i}$:

${ x_i = \mu + s \ln (U_i/(1 - U_i)) }$

which is the quantile function of the logistic distribution.
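As a rough check of the inversion step, the sketch below draws uniform deviates, applies the quantile formula by hand, and compares the result with R's built-in `qlogis()`. The values `mu = 0` and `s = 1` are placeholders chosen for illustration, not values taken from the question.

```r
set.seed(3)                      # arbitrary seed, only for reproducibility
mu <- 0                          # assumed location parameter
s  <- 1                          # assumed scale parameter

u <- runif(10000)                # uniform deviates U_i on (0, 1)
x <- mu + s * log(u / (1 - u))   # inverse of the logistic CDF, applied by hand

# The hand-written quantile function should agree with the built-in one
all.equal(x, qlogis(u, location = mu, scale = s))

# Applying the CDF to the draws should recover the uniform deviates
all.equal(plogis(x, location = mu, scale = s), u)
```

Note that `s` is the scale of the logistic distribution, not its standard deviation; the standard deviation is `s * pi / sqrt(3)`.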
