
I am currently running a linear regression and calculating its $R^2$.

After that, I compute the Cook's distance of every point and drop from the analysis all points with $d_i > \frac{4}{n}$, where $n$ is the number of observations.
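In code, the procedure is roughly the following (a minimal sketch in Python with statsmodels; the data here are synthetic placeholders for my own):

    import numpy as np
    import statsmodels.api as sm

    # Placeholder data standing in for the real data set
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 50)
    y = 2 * x + rng.normal(0, 1, 50)

    X = sm.add_constant(x)              # design matrix with intercept
    fit = sm.OLS(y, X).fit()
    print("R^2 before:", fit.rsquared)

    # Cook's distance of every observation
    cooks_d = fit.get_influence().cooks_distance[0]

    # Keep only the points with d_i <= 4/n, then refit
    keep = cooks_d <= 4 / len(y)
    fit2 = sm.OLS(y[keep], X[keep]).fit()
    print("R^2 after:", fit2.rsquared)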

To my surprise, the $R^2$ is worse. How is this possible?

Wilmer E. Henao
  • What are you trying to achieve with this procedure? – user603 Dec 26 '13 at 23:09
  • For any least-squares model and any positive $\epsilon$, you can add a single extreme observation to reach an R-squared of $1-\epsilon$. So there is no surprise. But @user603 is right: why would one want to blindly delete influential observations? – Michael M Dec 26 '13 at 23:34
  • I don't think the OP deletes these observations so much as sets them aside (as not well described by a linear fit). What puzzles me, however, is that if the OP is trying to flag some observations as influential, then the method chosen is [pathologically ineffective](http://stats.stackexchange.com/questions/15426/whether-to-delete-cases-that-are-flagged-as-outliers-by-statistical-software-whe/50780#50780) for this purpose. – user603 Dec 27 '13 at 00:21
  • I'm seeing that Cook's distance is not the way to go to identify and remove outliers. Do you think Bonferroni is better? Ideally I want to build a machine that automatically detects and removes outliers, if that is possible at all. – Wilmer E. Henao Dec 27 '13 at 14:44

1 Answer


One shouldn't necessarily expect to find that $R^2$ improves by deleting an influential outlier; $R^2$ has a numerator and a denominator, and both are impacted by points with high Cook's distance.

It's easy to pick up a somewhat mistaken conception of $R^2$, and that can lead you to expect behaviour from it that doesn't actually hold.

As I mentioned, $R^2$ has a numerator and a denominator; adding an influential outlier will greatly increase the variation in the data (increasing the denominator). You might expect that to reduce $R^2$, but at the same time, if the point is sufficiently influential, almost all of that additional variation will be explained by a line passing through, or nearly through, the outlier.
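Explicitly, writing $\bar y$ for the mean response and $\hat y_i$ for the fitted values,

$$R^2 = \frac{\sum_i (\hat y_i - \bar y)^2}{\sum_i (y_i - \bar y)^2},$$

so a point that inflates the denominator while lying almost exactly on the fitted line inflates the numerator by nearly as much.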

This may be easiest to see with an example.

Consider the following data:

    x       y
    1    0.56
    2    0.63
    3    3.28
    4    3.01
    5    5.42
    6    6.88
    7    7.69
    8    6.65
    9    7.49
   10    9.76

[scatterplot with fitted line: no influential outlier]

This has an $R^2$ of 91.6%.

Now add a highly influential outlier to the above data:

    x       y
  100 -100.00

[scatterplot with fitted line: influential outlier at (100, -100) included]

This has an $R^2$ of 96.4%.

While the denominator of $R^2$ (the total sum of squares) increased from 88.07 to 10137, the numerator (the regression sum of squares) increased from 80.68 to 9769: most of the variation in the data (over 90% of it!) is contributed by that one observation, and that observation is fitted quite well; this is what drives $R^2$.
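If you want to check these numbers yourself, here's a quick sketch in Python with numpy; it reproduces both sums of squares and both $R^2$ values above to rounding:

    import numpy as np

    x = np.arange(1, 11)
    y = np.array([0.56, 0.63, 3.28, 3.01, 5.42, 6.88, 7.69, 6.65, 7.49, 9.76])

    def r2_parts(x, y):
        # Least-squares line, then the two sums of squares behind R^2
        slope, intercept = np.polyfit(x, y, 1)
        yhat = intercept + slope * x
        ss_num = np.sum((yhat - y.mean()) ** 2)   # numerator: regression SS
        ss_den = np.sum((y - y.mean()) ** 2)      # denominator: total SS
        return ss_num, ss_den, ss_num / ss_den

    print(r2_parts(x, y))      # ~ (80.68, 88.07, 0.916)

    # Append the influential outlier at (100, -100)
    x2, y2 = np.append(x, 100), np.append(y, -100.0)
    print(r2_parts(x2, y2))    # ~ (9769, 10137, 0.964)

    # Residual sum of squares of the original ten points under each line:
    # the fit to them is far worse with the outlier, yet R^2 is higher
    for xs, ys in ((x, y), (x2, y2)):
        slope, intercept = np.polyfit(xs, ys, 1)
        print(np.sum((y - (intercept + slope * x)) ** 2))   # ~7.4, then ~365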

To see that the fit to the rest of the data is actually much worse, compare the residuals of those ten points under the two lines (the last lines of the sketch above do exactly this); that lack of fit does very little to pull $R^2$ down.

This example demonstrates not only that $R^2$ can increase when an influential outlier is added, but also how that happens. (Conversely, if we start with the second data set and delete the influential outlier, $R^2$ goes down.)

It should serve as a cautionary tale: beware of interpreting $R^2$ as fit in any intuitive sense. It does measure a kind of fit, but a very particular kind, and its behaviour may not match your intuition.

Glen_b