I want to know how to find and remove outliers from my logistic regression. I have tried using a formula from Faraway, but I don't know whether it is applicable to logistic regression. For example, my code is:

library(vcd)
library(ggplot2)
library(dplyr)
library(MLmetrics)
library(pROC)
d=read.delim("http://dnett.github.io/S510/Disease.txt")
d$disease=factor(d$disease)
d$ses=factor(d$ses)
d$sector=factor(d$sector)

finalmodel=glm(disease~age+sector, family=binomial(link=logit), data=d)

For finding the outliers I am using this code from Faraway

library(faraway)
i_n = influence(finalmodel)$hat # leverages (hat values) of the data points
i_n
which.max(i_n) # index of the observation with the largest leverage
halfnorm(i_n) # half-normal plot of the leverages

halfnorm(rstudent(finalmodel)) # half-normal plot of the jackknife residuals

Please let me know whether this is right or not. And how do I remove the outliers from my data? Thanks!

  • Are you talking about outliers in the marginal distribution of predictors? – Demetri Pananos Dec 20 '21 at 06:24
  • In my opinion, trying to remove outliers from binary regression is very rarely a sensible thing to do since the response variable can only be 0 or 1. – Gordon Smyth Dec 20 '21 at 06:47
  • @GordonSmyth Given that OP mentions influence points and leverage I assume OP is referring to outliers in X, not in Y. – user2974951 Dec 20 '21 at 06:56
  • @user2974951 I don't think the term "outlier" can properly be applied to predictors/X variable. In any case, the `rstudent` function computes residuals in Y rather than in X. Happy to leave this question to you but I don't think that what OP is doing is sensible. – Gordon Smyth Dec 20 '21 at 07:06
  • It could be argued that the correct way to remove outliers is to 1) consider doing so and then 2) not remove outliers, so why do you want to remove points from your data? – Dave Dec 20 '21 at 11:29
  • @user2974951 Influential points are not the same as "outliers" in the X space. The term "outlier" refers to an observation that comes from a process other than the one proposed by the mathematical model. Binary regression does not make any assumptions about the X variables so they cannot be outliers. In the OP's dataset, the predictor variables in this case are categorical factors, so one cannot compute numerical distances between X observations and hence an X value cannot be "extreme". – Gordon Smyth Dec 24 '21 at 06:45

1 Answer

I don't know which package the `influence` function is from; however, if its intended use is linear regression, then it is not suitable for logistic regression.

Given that you mention influence points and leverage, I assume your goal is to find "outliers" in your predictors (X) and not necessarily in your Y. "Outliers" in your Y can be identified by looking at a residual plot; identifying "outliers" in X is a little more complicated, especially if you have high-dimensional data.
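As a rough illustration of the residual check for Y, here is a minimal sketch in base R. It uses simulated data as a stand-in (the data frame `sim` and its coefficients are assumptions made up for this example, not the Disease data from the question), fits the same kind of model, and looks at Pearson and deviance residuals.

```r
# Simulated stand-in data (hypothetical values, NOT the question's dataset)
set.seed(1)
n <- 200
sim <- data.frame(
  age    = round(runif(n, 1, 80)),
  sector = factor(sample(1:2, n, replace = TRUE))
)
eta <- -3 + 0.03 * sim$age + 1.2 * (sim$sector == "2")
sim$disease <- rbinom(n, 1, plogis(eta))

# Same model form as in the question
fit <- glm(disease ~ age + sector, family = binomial(link = "logit"), data = sim)

# Two common residual types for a logistic regression
r_pearson  <- residuals(fit, type = "pearson")
r_deviance <- residuals(fit, type = "deviance")

# Residuals against fitted probabilities; with a 0/1 response the points
# fall on two curves, so look for isolated large values rather than a pattern
plot(fitted(fit), r_deviance,
     xlab = "Fitted probability", ylab = "Deviance residual")
abline(h = 0, lty = 2)

# Observations with the largest absolute deviance residuals
worst <- head(order(abs(r_deviance), decreasing = TRUE), 5)
sim[worst, ]
```

With the question's own fit, replacing `fit` by `finalmodel` should work the same way. A large residual here only says the observed 0/1 outcome was unlikely under the model; it is not by itself a reason to drop the observation.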

Note: I use the term "outlier" to refer to unusual (or, as I've recently begun to call them, extreme) values in either your Y variable or your X variables, as I see no distinction between the two. For example, Wikipedia defines an outlier as a data point that differs significantly from other observations.

Anyway, for logistic regression there exists Pregibon leverage, which can be used to detect outliers in your predictors (in a similar fashion to linear regression), while you can use Pearson and/or deviance residuals to check for Y outliers. See also:

Using the Hat Matrix to detect influential observations in logistic regression
Information out of the hat matrix for logistic regression
How to calculate the hat matrix for logistic regression in R?
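To make the leverage idea concrete, here is a small sketch in base R, again on simulated stand-in data (the data frame `sim` and its coefficients are assumptions for illustration only). It computes the diagonal of the weighted hat matrix W^(1/2) X (X'WX)^(-1) X' W^(1/2) by hand, using the working weights from the fit, and compares it with what `hatvalues()` returns for the `glm` object.

```r
# Simulated stand-in data (hypothetical values, NOT the question's dataset)
set.seed(1)
n <- 200
sim <- data.frame(
  age    = round(runif(n, 1, 80)),
  sector = factor(sample(1:2, n, replace = TRUE))
)
eta <- -3 + 0.03 * sim$age + 1.2 * (sim$sector == "2")
sim$disease <- rbinom(n, 1, plogis(eta))

fit <- glm(disease ~ age + sector, family = binomial(link = "logit"), data = sim)

# Leverage by hand: diagonal of W^(1/2) X (X'WX)^(-1) X' W^(1/2)
X  <- model.matrix(fit)
w  <- weights(fit, type = "working")  # for the logit link, roughly p * (1 - p)
Xw <- X * sqrt(w)                     # scales row i of X by sqrt(w_i)
h_manual <- rowSums((Xw %*% solve(crossprod(Xw))) * Xw)

# Leverage as returned by base R for the fitted glm
h_builtin <- hatvalues(fit)
all.equal(unname(h_manual), unname(h_builtin))

# The leverages sum to the number of estimated coefficients
sum(h_builtin)

# A common heuristic (an assumption here, not a rule): flag h > 2 * p / n
flagged <- which(h_builtin > 2 * length(coef(fit)) / n)
sim[flagged, ]
```

Because the weights depend on the fitted probabilities, a point that is extreme in X can still get a modest leverage when its fitted probability is close to 0 or 1. That is one way these values behave differently from linear-regression leverages.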

Edit: I've tried to answer your question on how to detect "outliers", and not how to remove them, as you should probably not do that.

user2974951
  • Thank you for your answer! Do you have any references if I want to check for Y outliers? Thanks! – Jasmine Helen Dec 20 '21 at 12:20
  • @JasmineHelen There are some nice answers on this site, see [Diagnostics for logistic regression?](https://stats.stackexchange.com/questions/45050/diagnostics-for-logistic-regression?noredirect=1&lq=1) and [What do the residuals in a logistic regression mean?](https://stats.stackexchange.com/questions/1432/what-do-the-residuals-in-a-logistic-regression-mean). – user2974951 Dec 20 '21 at 12:29