I understand how outliers affect linear regression with squared loss. @gung has a beautiful answer in this post that explains the concepts of leverage and residuals.

My question is: how do outliers affect logistic regression? Does the same concept apply, i.e. should we take a closer look at points with high leverage or large residuals?

For example, in R, plot(glm(am~wt,mtcars,family="binomial")) tells me that the Toyota Corona has high leverage and a large residual. Should I take a closer look?

[Figure: diagnostic plots produced by the plot() call above]
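The quantities behind those plots can also be pulled out numerically. The following is a minimal sketch, assuming the same am ~ wt model; the 2p/n leverage cutoff and the |residual| > 2 cutoff are common rules of thumb chosen here for illustration, not thresholds taken from this thread, and the name infl is just a label for the table.

```r
# Fit the same model as in the plot() call
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Collect per-observation diagnostics in one table
infl <- data.frame(
  leverage  = hatvalues(fit),        # diagonal of the (weighted) hat matrix
  std_resid = rstandard(fit),        # standardized deviance residuals
  cooks     = cooks.distance(fit),   # Cook's distance
  row.names = rownames(mtcars)
)

# Rule-of-thumb cutoffs (assumptions, not part of the original question)
p <- length(coef(fit))
n <- nrow(mtcars)
flagged <- infl[infl$leverage > 2 * p / n | abs(infl$std_resid) > 2, ]

# Show flagged cars, most influential first
print(flagged[order(-flagged$cooks), ])
```

Sorting by Cook's distance puts the observations that combine leverage and residual most strongly at the top, which is roughly what the last panel of the plot() output conveys graphically.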


I found this post, which says logistic regression is robust to outliers but does not discuss leverage or residuals. Is that claim correct?

[Figure: illustration from that post]

Haitao Du
    The second illustration is extremely confusing--in some instructive ways. First, it does not exhibit any outlying responses. We might understand the rightmost point to be a (somewhat) high-leverage one, but that's all. Second, the fit is obviously wrong: this is a case of *complete separation.* As such (a) there is no unique fit and (b) no matter what solution is chosen (among the correct ones), the regressor value for that rightmost point has *no influence on the solution whatsoever*! – whuber May 16 '17 at 13:43

1 Answer

Outliers can have essentially the same impact on a logistic regression as on a linear regression. In the deletion-diagnostic model, fit by deleting the outlying observation, a coefficient may change by more than the magnitude of the full-model estimate (that is, the DFBETA exceeds the coefficient in absolute value); this means the slope of the fitted sigmoid may point in the opposite direction. Separately, inference may disagree between the two models, suggesting that either one commits a type II error or the other commits a type I error.

This point underscores the problem of suggesting that, when outliers are encountered, they should summarily be deleted.

The implication for logistic regression data analysis is the same as well: if a single observation (or a small cluster of observations) entirely drives the estimates and inference, it should be identified and discussed in the data analysis. DFBETA diagnostics are an effective numerical and graphical tool for either type of model, and they are easy to interpret for statisticians and non-statisticians alike.
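As a rough illustration of that kind of deletion diagnostic, here is a sketch using the mtcars example from the question. It assumes the am ~ wt model and picks the Toyota Corona only because the question flags it; dfbeta() gives a one-step approximation for a glm, so the exact refit is shown alongside it. Magnitudes are compared so that nothing depends on the sign convention.

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Approximate change in each coefficient when each car is deleted
db <- dfbeta(fit)

# Exact deletion: refit without one observation (car chosen for illustration)
car <- "Toyota Corona"
fit_drop <- glm(am ~ wt, data = mtcars[rownames(mtcars) != car, ],
                family = binomial)
exact_change <- coef(fit_drop) - coef(fit)

# Compare magnitudes of the approximate and exact changes
comparison <- rbind(approx = abs(db[car, ]), exact = abs(exact_change))
print(comparison)

# Does deleting this one observation flip the sign of any coefficient?
flips <- sign(coef(fit_drop)) != sign(coef(fit))
print(flips)
```

If flips were TRUE for the wt coefficient, the direction of the association would rest on a single car, which is the situation that deserves discussion in the write-up rather than silent deletion.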

There are some differences to discuss. In linear regression, outliers are easy to visualize with a scatter plot: the scaled vertical displacement from the line of best fit and the scaled horizontal distance from the centroid of the predictors X together determine an observation's influence and leverage (its outlier-ness). In a logistic model, the mean-variance relationship means that the scaling factor for vertical displacement is a continuous function of the fitted probability p, since the variance of the response is p(1 - p). Farther out in the tails, the fitted mean is closer to 0 or 1 and the variance is smaller, so seemingly small perturbations can have a more substantial impact on estimates and inference.

However, whereas a Y value in linear regression may be arbitrarily large, the distance between an observed and a fitted value in logistic regression is bounded: the response is 0 or 1 and the fitted probability lies between them. Does that mean that logistic regression is robust to outliers? Absolutely not.
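A small numeric sketch may make the scaling point concrete. It uses made-up fitted probabilities (p_hat is hypothetical, not from any fitted model) and an observed response of 0 that contradicts them, then compares the raw residual with the Pearson residual, which divides by the binomial standard deviation sqrt(p(1 - p)).

```r
# Hypothetical fitted probabilities, moving into the upper tail
p_hat <- c(0.5, 0.9, 0.99, 0.999)
y <- 0  # observed response that contradicts the fit

# Raw residual: never larger than 1 in absolute value
raw <- y - p_hat

# Pearson residual: raw residual scaled by the binomial standard deviation
pearson <- raw / sqrt(p_hat * (1 - p_hat))

print(data.frame(p_hat, raw, pearson))
```

The raw residual stays within [-1, 1], but the scaled residual keeps growing as the fitted probability approaches 1, so a bounded response does not bound the standardized discrepancy.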

AdamO