
I'm trying to build a logistic regression using glm() to match a given equation using a randomly generated dataset, and for some reason, the result is way off.

Here's the code:

rm( list = ls() )
set.seed( 8675309 )

x <- 10:50
y <- exp( -3.428013 + 0.053240 * x )

plot( x, y, type='l' )

n_per_x <- 10000
p       <- rep( y, n_per_x )
x_new   <- rep( x, n_per_x )
y_new   <- ifelse( runif( length( p ) ) <= p, 1, 0 )
my_data <- data.frame( X = x_new, Y = y_new )

for ( i in x ) {
  yp = sum( my_data$Y[ my_data$X  == i ] )
  points( i, yp / n_per_x, pch = 16, col = 'red' )
}

my_model <- glm( Y ~ X, family = 'binomial', data = my_data )
print( summary( my_model ) )
y2 = predict( my_model, data.frame( X = x ) )

lines( x, exp( y2 ), col = 'red' )

Here is the result:

Output plot from R showing original equation, sample distribution, and fit line.

The plot shows the original equation as a black line, the fraction of random samples for each value of X that are 1 as red dots, and the logistic fit line for the random sample in red.

Why is the fit so poor? Unfortunately, I don't feel like I know enough to know whether I've just made a dumb mistake in R or I've fundamentally misunderstood logistic regression.

What's confusing me is that the fit curve doesn't seem to move at all above very small values of n. If it was just randomness, I would expect it to converge as n gets larger. But it doesn't. The standard error gets smaller and smaller as the number of samples increases (expected), but the coefficients barely move at all (not expected), rapidly leaving the actual values way outside the confidence interval.

Thanks for any insight into where I've gone wrong!

jdavid
  • Because it is a bad model to fit to the underlying data. If you try `family=poisson`, it becomes a much better fit. – Dave2e May 07 '20 at 19:58
  • @Dave2e By trying that, I can see that you're right, the fit is much more in line with what I expect. But why? The data is binomial (modeling success outcomes in independent trials). I (mostly) understand the relationship between the binomial and poisson distributions generally, but I haven't grasped the decision criteria for choosing one or the other here. –  May 07 '20 at 20:11
  • Thanks @Dave2e, that is certainly the correct answer. I'll give it some more thought to make sure I fully understand why that is. –  May 07 '20 at 20:31
  • @Dave2e, I'm not sure I agree. There may be a log vs logit-link issue going on here ... – Ben Bolker May 07 '20 at 20:34
  • 1
    @BenBolker Can confirm that predict( my_model, data.frame( X = x ), type = 'response' ) and exp( predict( my_model, data.frame( X = x ) ) ) give markedly different results for family = 'binomial' and the same results for family = 'poisson'. But I think you're correct that I should have been using type = 'response' the whole time. –  May 07 '20 at 20:51
  • The magic incantation here appears to be: family( my_model )$linkinv to get the correct inverse link function regardless of what family is used. Using exp() was a gross oversimplification on my part. –  May 07 '20 at 21:00
  • 1
    I haven't looked at this carefully, but it was my impression that you were using an exponential for the dependence of the mean on x, thus `glm(..., family=binomial(link="log"))` would be more appropriate for matching your simulation – Ben Bolker May 07 '20 at 22:22
  • @BenBolker It's actually based on the output of a previous logit regression, I just clipped that part out and substituted a formula based on the coefficients of the prior regression for this sample since I can't post the data. So it may be that log would fit the substitute, but logit is better for the original. That said, I'm working at the limits of my current knowledge of both R and stats here, so I'm very wobbly on all of this. Really appreciate the feedback from both you and Dave2e. – jdavid May 09 '20 at 00:32
  • See this related question for some additional background: https://stats.stackexchange.com/questions/466212/what-is-the-objective-function-to-optimize-in-glm-with-gaussian-and-poisson-fami/466262#466262 – Dave2e May 13 '20 at 20:13
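The `family( my_model )$linkinv` trick from the comments can be sketched with a minimal standalone example (toy data, not the original; the fit exists only so there is a model object to query):

```r
# Toy binomial fit: the family object carries the inverse link, so link-scale
# predictions can be mapped back to probabilities without hard-coding exp().
fit <- glm( y ~ x, family = binomial,
            data = data.frame( x = 1:20, y = rep( 0:1, 10 ) ) )
eta <- predict( fit, type = 'link' )       # linear predictor (log-odds)
p1  <- family( fit )$linkinv( eta )        # generic inverse link (plogis here)
p2  <- predict( fit, type = 'response' )   # built-in equivalent
all.equal( p1, p2 )                        # TRUE
```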

1 Answer


Based on the very helpful feedback in the comments, here is the corrected code:

rm( list = ls() )

set.seed( 8675309 )

x <- 10:50
y <- exp( -3.428013 + 0.053240 * x )

plot( x, y, type='l' )

n_per_x <- 10000
p       <- rep( y, n_per_x )
x_new   <- rep( x, n_per_x )
y_new   <- ifelse( runif( length( p ) ) <= p, 1, 0 )
my_data <- data.frame( X = x_new, Y = y_new )
y_p     <- 0

for ( i in 1 : length( x ) ) {
  y_p[ i ] = sum( my_data$Y[ my_data$X  == x[ i ] ] ) / n_per_x
}
points( x, y_p, pch = 16, col = 'red' )

my_model <- glm( Y ~ X, family = 'binomial', data = my_data )
print( summary( my_model ) )
y2 <- predict( my_model, data.frame( X = x ), type = 'response' )

lines( x, y2, col = 'red' )

The main issue here was that exp() is not the correct inverse link function for a logit regression; the inverse of the logit link is the logistic function, plogis(). Passing type = 'response' to predict() applies the correct inverse link automatically, and the resulting values fit very nicely.
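To see concretely why exp() overshoots for a logit link (a quick standalone check): the linear predictor is the log-odds, so exponentiating it gives odds rather than probabilities.

```r
eta <- c( -2, 0, 2 )   # example log-odds values
exp( eta )             # odds p/(1-p): 0.135, 1.000, 7.389
plogis( eta )          # probabilities 1/(1+exp(-eta)): 0.119, 0.500, 0.881
```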

(Using family = 'poisson' in the glm() function works around this issue because it defaults to a log link function, so exp() was the correct inverse, but in this case I did specifically want the best possible binomial logit fit for my real data.)
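For comparison, here is a standalone sketch (simulated Poisson data, not the original) showing that with a log link, exp() and type = 'response' agree:

```r
set.seed( 42 )
d   <- data.frame( X = 1:50 )
d$Y <- rpois( 50, lambda = exp( 0.02 * d$X ) )
m   <- glm( Y ~ X, family = 'poisson', data = d )
a   <- exp( predict( m, d ) )                  # manual inverse of the log link
b   <- predict( m, d, type = 'response' )      # built-in inverse link
all.equal( a, b )                              # TRUE
```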
