Generate logistic regression error for binomial response for data simulation

Question

I want to simulate data to test a logistic regression. The main problem I have is the error term when I generate the binomial response. Although, I know (see other posts here and here) we are modelling the mean so there is no error in the link function, I'm getting a warning that my GLM fits perfectly the data:

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred

But I want to generate data that is more realistic (e.g. when looking at survival of individuals in biology, we do not expect a perfect separation 000011111 but more something like 001011101). Here is what I have so far:

set.seed(1345)
n.1 =100
mean.1 = 10
sd.1 = 5
sd.erro = 0.5

x.1 = rnorm(n.1, mean.1, sd.1)
x = c(x.1)
n = length(x)
mn = mean(x)
sg = sd(x)

z = 1 - 1*x + 1*(x-mean(x))^2 + rnorm(length(x),
                                      mean = 0,
                                      sd = sd.erro)
pr = 1/(1+exp(-z))
y = rbinom(length(x),1,pr) 
df = data.frame(y = y,
                x = x,
                x.2 = x^2)

  hist(x)
  hist(residuals(lm(z~x)), breaks = 30)
  par(mfrow = c(2,1))
  plot(z~x)
abline(v = c(mn,mn+sg,mn-sg), lty = 2)

# As you can see there is a perfect separation in the reponse: 
  plot(y~x)

# Generate the warning 
  model = glm(y~ x + x.2, 
              data = df,
              family="binomial")

  newx = data.frame(x = seq(min(x),
                            max(x),
                            length.out = n))
  newx$x.2 = newx$x^2
  newx$y = predict(model,
                   newx, 
                   type="response") 
  logi = function(x) {exp(x)/(1+exp(x))}
  newx$y.logi = logi(newx$y)
  lines(x = newx$x,
        y = newx$y, 
        col = "red",
        lwd=2, 
        ylim = c(0,1))

So the idea here is where should I add an error term that will generate a more variable response. Is it only the sd.errothat I need to modify?

score 1 · Accepted Answer · answered Aug 14 '18 at 22:00

Yes if you increase the error in the normal distribution that will add more noise into the simulated system and reduce the perfect separation.

basically what is happening now if given $x, x^2$ features you can create a curve that separates all 0s on one side and all 1s on the other.

Increasing the standard deviation in the normal will create some that cross the boundary line but you may need to increase sd.erro to something greater than 1. I would try 5 and see if it works. If you must have a smaller error you'll need to do something else, like manually adding a point on the "wrong" side of the decision boundary so that you have some error.

Generate logistic regression error for binomial response for data simulation

1 Answers1