
So I'm playing around with logistic regression in R, using the mtcars dataset, and I decide to fit a logistic regression model for the 'am' variable (that is, manual vs. automatic transmission, for those of you familiar with the mtcars dataset).
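
For reference, a minimal reconstruction of the fit (the name logit_fit matches the prediction code further down):

logit_fit <- glm(am ~ mpg + qsec + wt, family = binomial, data = mtcars) # Fit a binomial GLM for transmission type
summary(logit_fit) # Prints the summary shown below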

Call:
glm(formula = am ~ mpg + qsec + wt, family = binomial, data = mtcars)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-4.484e-05  -2.100e-08  -2.100e-08   2.100e-08   5.163e-05  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    924.89  883764.07   0.001    0.999
mpg             20.65   18004.32   0.001    0.999
qsec           -55.75   32172.52  -0.002    0.999
wt            -111.33  103183.48  -0.001    0.999

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4.3230e+01  on 31  degrees of freedom
Residual deviance: 6.2903e-09  on 28  degrees of freedom
AIC: 8

Number of Fisher Scoring iterations: 25

Now, at first sight this looks like a terrible regression, right? The standard errors are HUGE, the z-values are all close to zero, and the corresponding p-values are all close to one. HOWEVER, the residual deviance is extremely small!
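
One way to see why the residual deviance is so tiny is to look at the fitted probabilities directly (a quick check, using base R's fitted()):

range(fitted(logit_fit)) # With a residual deviance of ~6e-09, these are numerically 0 and 1

Every fitted probability is pinned at (numerically) 0 or 1, which is also why R typically warns that 'fitted probabilities numerically 0 or 1 occurred' during such a fit.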

I decide to check how well the model does as a classifier by running:

pred <- predict(logit_fit, data.frame(qsec = mtcars$qsec, wt = mtcars$wt, mpg = mtcars$mpg), type = "response") # Predict probabilities on the training data
mtcars$pred_r <- round(pred, 0) # Round each probability to the nearest of 0 or 1
table(mtcars$am, mtcars$pred_r) # Check whether the classification results are any good

Indeed, the model classifies the training data perfectly:

     0  1
  0 19  0
  1  0 13

Have I completely misunderstood how to interpret the model output? Am I overfitting massively, or is something else going on here?

  • The phenomenon is called *separation*. Wald standard error estimates will be badly wrong, because the likelihood is plateauing rather than peaking. – Scortchi - Reinstate Monica Mar 13 '15 at 13:10
  • This is also called the Hauck-Donner effect. – Frank Harrell Mar 13 '15 at 13:20
  • This has been the subject of quite a few questions on site. For example, see [here](http://stats.stackexchange.com/questions/102695/high-p-values-for-logistic-regression-variable-that-perfectly-separates/). Try [this search](http://stats.stackexchange.com/search?q=perfect+separation); you could also search for Hauck-Donner. – Glen_b Mar 13 '15 at 13:56
  • Welcome to the site. You should find the answer to your question at the linked thread. Please read it. If you still have a question afterwards, come back here & edit your Q to state what you've learned & what you still need to know. Then we can provide the information you need without duplicating material elsewhere that already didn't help you. – gung - Reinstate Monica Mar 13 '15 at 18:47
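
Following up on the separation pointers in the comments above, a quick way to see it directly in this fit (a sketch, using base R only):

eta <- predict(logit_fit) # Linear predictor on the link (log-odds) scale
table(mtcars$am, eta > 0) # Perfect split: every am = 0 car has eta < 0, every am = 1 car has eta > 0

Some linear combination of mpg, qsec and wt separates the manual and automatic cars perfectly, so the maximum likelihood estimates diverge towards infinity; the huge coefficients and standard errors in the summary are simply where the Fisher scoring iterations stopped. (For a formal diagnosis there is also the CRAN package detectseparation.)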
