18

I have this R code for linear regression:

fit <- lm(target ~ age+sales+income, data = new)

How to identify influential observations based upon cook's distance and removing the same from data in R ?

mdewey
  • 16,541
  • 22
  • 30
  • 57
user3459010
  • 331
  • 2
  • 3
  • 5
  • Here is a nice example, which also gives an introduction how to use robust regression to deal with data that contains influential points: http://www.ats.ucla.edu/stat/r/dae/rreg.htm In the future, you should try to do a bit more research before asking a question. – Roland Jul 31 '15 at 08:16
  • 4
    Despite the focus on `R`, I think there is a meaningful statistical question here, since various criteria have been proposed to identify "influential" observations using Cook's distance--and some of them differ greatly from each other. (In my experience, the `rlm` function referenced by @Roland--with whose code I am intimately familiar--neither identifies nor assesses problems associated with highly influential observations *that have small residuals*, so I would not presume to conclude you haven't done your research.) – whuber Jul 31 '15 at 12:29
  • 3
    @Roland- i dont know what makes you feel that i haven't done my research before posting this!! i came across this link which you shared but it was of no use to me! In future you should better response with solution in terms of giving me proper code rather than giving links to such useless articles! – user3459010 Aug 02 '15 at 09:48
  • Some discussion in [How to read Cook's distance plots?](http://stats.stackexchange.com/q/22161/17230) & [Cook's distance cut-off value](http://stats.stackexchange.com/q/87962/17230). – Scortchi - Reinstate Monica Aug 03 '15 at 08:44

1 Answers1

28

This post has around 6000 views in 2 years so I guess an answer is much needed. Although I borrowed a lot of ideas from the reference, I made some modifications. We will be using the cars data in base r.

library(tidyverse)

# Inject outliers into data.
cars1 <- cars[1:30, ]  # original data
cars_outliers <- data.frame(speed=c(1,19), dist=c(198,199))  # introduce outliers.
cars2 <- rbind(cars1, cars_outliers)  # data with outliers.

Let's plot the data with outliers to see how extreme they are.

# Plot of data with outliers.

plot1 <- ggplot(data = cars1, aes(x = speed, y = dist)) +
        geom_point() + 
        geom_smooth(method = lm) +
        xlim(0, 20) + ylim(0, 220) + 
        ggtitle("No Outliers")
plot2 <- ggplot(data = cars2, aes(x = speed, y = dist)) +
        geom_point() + 
        geom_smooth(method = lm) +
        xlim(0, 20) + ylim(0, 220) + 
        ggtitle("With Outliers")

gridExtra::grid.arrange(plot1, plot2, ncol=2)

Comparison 1

We can see that the regression line has a poor fit after introducing the outliers. Therefore, let's us Cook's Distance to identity them. I am using the traditional cut-off of $\frac{4}{n}$. Notice that cut-off value just helps you to think about what's wrong with the data.

mod <- lm(dist ~ speed, data = cars2)
cooksd <- cooks.distance(mod)

# Plot the Cook's Distance using the traditional 4/n criterion
sample_size <- nrow(cars2)
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance")  # plot cook's distance
abline(h = 4/sample_size, col="red")  # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>4/sample_size, names(cooksd),""), col="red")  # add labels

Cook's Distance Plot

There are many ways to deal with outliers as noted in the Reference. Now, I just want to simply remove them.

# Removing Outliers
# influential row numbers
influential <- as.numeric(names(cooksd)[(cooksd > (4/sample_size))])

# Alternatively, you can try to remove the top x outliers to have a look
# top_x_outlier <- 2
# influential <- as.numeric(names(sort(cooksd, decreasing = TRUE)[1:top_x_outlier]))

cars2_screen <- cars2[-influential, ]

plot3 <- ggplot(data = cars2, aes(x = speed, y = dist)) +
        geom_point() + 
        geom_smooth(method = lm) +
        xlim(0, 20) + ylim(0, 220) + 
        ggtitle("Before")
plot4 <- ggplot(data = cars2_screen, aes(x = speed, y = dist)) +
        geom_point() + 
        geom_smooth(method = lm) +
        xlim(0, 20) + ylim(0, 220) + 
        ggtitle("After")

gridExtra::grid.arrange(plot3, plot4, ncol=2)

Before and After Comparison

Hooray, we have successfully removed the outliers~

Excellent Reference:Outlier Treatment

JetLag
  • 1,105
  • 2
  • 9
  • 22
  • 4
    `# Detecting outliers in cars dataset; fit – Mohit Sharma Nov 18 '18 at 15:37
  • Cook's distance is built into `plot.lm`, so an easier way to obtain the first plot is just `plot(mod, which=4)`. (Another form of the plot is also available, see `?plot.lm`.) – nth Oct 15 '21 at 15:00