
I'm trying to simulate a logistic regression. My goal is to show that if Y=1 is rare, then the intercept estimate is biased. In my R script I define the logistic regression model through the latent-variable approach (see for example p. 140 of King and Zeng's paper, http://gking.harvard.edu/files/abs/0s-abs.shtml):

x   <- rnorm(10000)           # fixed covariate

b0h <- numeric(1000)          # storage for the estimated intercepts
b1h <- numeric(1000)          # storage for the estimated slopes

for(i in 1:1000){
  eps <- rlogis(10000)        # logistic errors for the latent variable
  eta <- 1 + 2*x + eps        # latent variable: true intercept 1, true slope 2
  y   <- ifelse(eta > 0, 1, 0)

  m      <- glm(y ~ x, family=binomial)
  b0h[i] <- coef(m)[1]
  b1h[i] <- coef(m)[2]
}

mean(b0h)
mean(b1h)
hist(b0h)
hist(b1h)

The problem here is that I don't know how to force the observations y to be first balanced (50:50) and then unbalanced (90:10). As the function table() shows, in my script the proportion of ones is random:

table(y)

How to solve this problem?

Luca Dibo
  • If you are _truly_ simulating from a logistic regression model then the proportion _has_ to be random. – Xi'an Dec 09 '14 at 10:57
  • Thanks for your answer @Xi'an. I know that the proportion has to be random, but I just want to show that if the class is skewed (one class is rare), then if I use logistic regression the intercept is biased (this is what I think I have understood from http://gking.harvard.edu/files/abs/0s-abs.shtml). How do I force the class to be skewed in my R script? – Luca Dibo Dec 09 '14 at 11:15
  • Then you have to sample from another model. – Xi'an Dec 09 '14 at 11:46
  • Related: [How to simulate artificial data for logistic regression?](http://stats.stackexchange.com/q/46523/) – gung - Reinstate Monica Dec 09 '14 at 19:23
  • Isn't the MLE of the intercept always biased, regardless of the coefficients (except in very special circumstances)? It doesn't matter whether $Y=1$ is rare or not. – whuber Dec 09 '14 at 19:30
  • @whuber yes, I think the whole coefficient vector is biased in small samples when you estimate it with MLE, but it is asymptotically unbiased. However, when the class is skewed, the intercept is biased even in large samples, while the other coefficients are unbiased. This is what I think I have understood from King and Zeng's paper, and this is what I'm trying to show in R – Luca Dibo Dec 09 '14 at 22:59
  • You're never going to show this with a simulation! All you could possibly discover is that a certain amount of bias persists in the largest simulations your computer can handle. I am having a hard time understanding why the intercept should be biased asymptotically--you actually contradict yourself with this statement (since "asymptotically unbiased" implies the bias becomes unmeasurable with sufficiently large samples)--so I would guess it might have something to do with the nature of the asymptotics. – whuber Dec 09 '14 at 23:02
  • @whuber, could you please explain why I cannot show with a simulation that in large samples the intercept coefficient of the logistic regression is biased? My idea would be: first, set a true value of the intercept; then simulate the logistic regression a thousand times; finally, take the mean of the thousand intercept estimates and check whether or not it equals the true intercept that I had previously set. I know that in linear regression this is possible because I wrote the script to do it; why, in your view, is it not possible for the logit model? – Luca Dibo Dec 09 '14 at 23:19

1 Answer


Logistic regression doesn't really have an error term. Alternatively, you can think of the response distribution (the binomial) as having its random component intrinsically 'built-in' (for more, it may help to read my answer here: Difference between logit and probit models). As a result, I think it is conceptually clearer to generate data for simulations directly from a binomial parameterized as the logistic transformation of the structural component of the model, rather than use the logistic as a sort of error term.

From there, if you want the long-run probability that $Y = 1$ to be $.5$ or $.1$, you just need your structural component to be balanced around $0$ (for $.5$) or $-2.197225$ (for $.1$). I got those values by converting the response probability to the log odds:
$$ \log(\text{odds}(Y=1)) = \ln\!\left(\frac{\Pr(Y = 1)}{1-\Pr(Y = 1)}\right) $$ The most convenient way to do this will be to use those values for your intercept ($\beta_0$) and have your slope be $0$. (Alternatively, you can use any two parameter values, $\beta_0$ and $\beta_1$, that you like such that, given your $X$ values, the log mean odds equals, e.g., $-2.197225$.) Here is an example with R code:

lo2p = function(lo){      # this function will perform the logistic transformation
  odds = exp(lo)          #   of the structural component of the data generating
  p    = odds / (1+odds)  #   process
  return(p)
}

N     = 1000              # sample size per dataset; these are the true values of the DGP
beta0 = -2.197225
beta1 = 0
nsim  = 1000              # number of simulated datasets

set.seed(8361)            # this makes the simulation exactly reproducible
x     = rnorm(N)
lo    = beta0 + beta1*x
p     = lo2p(lo)          # these will be the parameters of the binomial response

b0h   = vector(length=nsim)  # these will store the results
b1h   = vector(length=nsim)
y1prp = vector(length=nsim)  # (for the proportion of y=1)

for(i in 1:nsim){         # here is the simulation
  y        = rbinom(n=N, size=1, prob=p)
  m        = glm(y~x, family=binomial)
  b0h[i]   = coef(m)[1]
  b1h[i]   = coef(m)[2]
  y1prp[i] = mean(y)
}

mean(b0h)                 # these are the results
# [1] -2.205844
mean(b1h)
# [1] -0.0003422177
mean(y1prp)
# [1] 0.100036
hist(b0h)
hist(b1h)

[histogram of the simulated intercept estimates, b0h]

[histogram of the simulated slope estimates, b1h]
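
As a quick check of the intercept values used above (a minimal sketch using base R, where qlogis() and plogis() are the built-in logit and inverse-logit functions):

qlogis(0.5)         # 0: the log odds when Pr(Y=1) = .5
qlogis(0.1)         # -2.197225: the log odds when Pr(Y=1) = .1
plogis(-2.197225)   # 0.1: back-transforming recovers the target probability

Note that the lo2p() function defined above computes the same quantity as plogis().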

gung - Reinstate Monica
  • You don't want to control the mean log odds to achieve a given proportion of successes; you want to control the (weighted) mean *probability*. The two means are not quite equivalent. – whuber Dec 09 '14 at 19:32
  • @whuber, hmmm I think I see your point: because of the intervening non-linear transformation, the mean of p(Y=1) for a large sample of simulated datasets could diverge from lo(Y=1) because of the asymmetry in the transformations of the datasets in the tails. Is that what you're getting at? (It does seem to work awfully well w/ lo(Y=1)=.1, but maybe that's not extreme enough.) – gung - Reinstate Monica Dec 09 '14 at 19:44
  • Yes, that's what I mean. Because the logistic transformation is nearly linear for values close to zero (around $-1$ to $1$), the difference really shows up only for more extreme values--precisely the case you are discussing in your answer! For instance, let $p_1=0.01$ and $p_2=0.19$. Although their mean is $0.1$, their mean log odds corresponds to a probability of only $0.046$. Rare events have more "leverage" in the averaging when it is done in terms of log odds. – whuber Dec 09 '14 at 19:55
  • @whuber, do you think I should delete this answer? – gung - Reinstate Monica Dec 09 '14 at 19:59
  • No; I think it would be fine just to modify the (incidental) remarks you make about mean log odds. In fact, if you were to change "mean log odds" to "log mean odds" everything would look fine. Incidentally, using `replicate` would streamline your code a little (because you wouldn't have to preallocate vectors to hold the results). – whuber Dec 09 '14 at 20:21
  • @whuber, I changed that phrasing, let me know if you think it needs more. BTW, I am aware of `replicate`, but I don't tend to use it here; my goal is to make my R code as simple & transparent as possible, even if clunky, in the hopes that even those who don't know R could get the gist. – gung - Reinstate Monica Dec 09 '14 at 21:24
  • Sure, there are matters of taste involved here. But in the spirit of being transparent, I would encourage you to provide obviously meaningful names to variables or at least include comments describing what they represent. You might also want to eliminate the inconsistency of sometimes using "N" and sometimes using "1000" for the number of replications, which will bite anybody who tries to modify the code by increasing `N`. (This problem would not affect the `replicate` version of the code. The moral is that syntactic elegance frequently leads to higher reliability.) – whuber Dec 09 '14 at 22:46
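
Two footnotes to the comment thread above, offered as rough sketches rather than as part of the original answer. First, whuber's point about mean log odds versus mean probability is easy to check numerically with base R's qlogis() and plogis():

p1 = 0.01; p2 = 0.19
mean(c(p1, p2))                  # 0.1: the mean probability
plogis(mean(qlogis(c(p1, p2))))  # ~0.046: rare events pull harder on the log-odds scale

Second, a minimal sketch of the replicate() variant whuber mentions, reusing N, x, p, and nsim from the answer's code (this is one way it might be written, not the answer's own code):

sim_once = function(){                 # draw one dataset and refit the model
  y = rbinom(n=N, size=1, prob=p)
  coef(glm(y ~ x, family=binomial))
}
res = replicate(nsim, sim_once())      # 2 x nsim matrix: row 1 = intercepts, row 2 = slopes
rowMeans(res)                          # average estimates across the simulations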