
I am currently running a linear regression and calculating its $R^2$.

After that, I compute the Cook's distance of every point and drop from the analysis all points with $d_i > \frac{4}{n}$, where $n$ is the number of observations.
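In code, the procedure is roughly the following (a minimal sketch in Python with statsmodels; the data here are synthetic placeholders for my own):

    import numpy as np
    import statsmodels.api as sm

    # Placeholder data standing in for the real data set
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 50)
    y = 2 * x + rng.normal(0, 1, 50)

    X = sm.add_constant(x)              # design matrix with intercept
    fit = sm.OLS(y, X).fit()
    print("R^2 before:", fit.rsquared)

    # Cook's distance of every observation
    cooks_d = fit.get_influence().cooks_distance[0]

    # Keep only the points with d_i <= 4/n, then refit
    keep = cooks_d <= 4 / len(y)
    fit2 = sm.OLS(y[keep], X[keep]).fit()
    print("R^2 after:", fit2.rsquared)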

To my surprise, the $R^2$ is worse. How is this possible?

Wilmer E. Henao
  • What are you trying to achieve with this procedure? – user603 Dec 26 '13 at 23:09
  • For any least-squares model and any positive $\epsilon$, you can add a single extreme observation to reach an R-squared of $1-\epsilon$. So there is no surprise. But @user603 is right: why would one want to blindly delete influential observations? – Michael M Dec 26 '13 at 23:34
  • I don't think the OP deletes these observations so much as sets them aside (as not well described by a linear fit). What puzzles me, however, is that if the OP is trying to flag some observations as influential, then the method chosen is [pathologically ineffective](http://stats.stackexchange.com/questions/15426/whether-to-delete-cases-that-are-flagged-as-outliers-by-statistical-software-whe/50780#50780) for this purpose. – user603 Dec 27 '13 at 00:21
  • I'm seeing that Cook's distance is not the way to go to identify and remove outliers. Do you think Bonferroni is better? Ideally I want to build a machine that automatically detects and removes outliers, if that is possible at all. – Wilmer E. Henao Dec 27 '13 at 14:44

1 Answer


One shouldn't necessarily expect to find that $R^2$ improves by deleting an influential outlier; $R^2$ has a numerator and a denominator, and both are impacted by points with high Cook's distance.

It's easy to pick up a somewhat mistaken conception of $R^2$, and that can lead you to expect behaviour from it that doesn't actually hold.

As I mentioned, $R^2$ has a numerator and a denominator; adding an influential outlier will greatly increase the variation in the data (increasing the denominator). You might expect that to reduce $R^2$, but at the same time, if the point is sufficiently influential, almost all of that additional variation will be explained by a line passing through, or nearly through, the outlier.
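Explicitly, writing $\bar y$ for the mean response and $\hat y_i$ for the fitted values,

$$R^2 = \frac{\sum_i (\hat y_i - \bar y)^2}{\sum_i (y_i - \bar y)^2},$$

so a point that inflates the denominator while lying almost exactly on the fitted line inflates the numerator by nearly as much.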

This may be easiest to see with an example.

Consider the following data:

    x       y
    1    0.56
    2    0.63
    3    3.28
    4    3.01
    5    5.42
    6    6.88
    7    7.69
    8    6.65
    9    7.49
   10    9.76

[scatterplot with fitted line: no influential outlier]

This has an $R^2$ of 91.6%.

Now add a highly influential outlier to the above data:

    x       y
  100 -100.00

[scatterplot with fitted line: influential outlier at (100, -100) included]

This has an $R^2$ of 96.4%.

While the denominator of $R^2$ (the total sum of squares) increased from 88.07 to 10137, the numerator (the regression sum of squares) increased from 80.68 to 9769: most of the variation in the data (over 90% of it!) is contributed by that one observation, and that observation is fitted quite well; this is what drives $R^2$.
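If you want to check these numbers yourself, here's a quick sketch in Python with numpy; it reproduces both sums of squares and both $R^2$ values above to rounding:

    import numpy as np

    x = np.arange(1, 11)
    y = np.array([0.56, 0.63, 3.28, 3.01, 5.42, 6.88, 7.69, 6.65, 7.49, 9.76])

    def r2_parts(x, y):
        # Least-squares line, then the two sums of squares behind R^2
        slope, intercept = np.polyfit(x, y, 1)
        yhat = intercept + slope * x
        ss_num = np.sum((yhat - y.mean()) ** 2)   # numerator: regression SS
        ss_den = np.sum((y - y.mean()) ** 2)      # denominator: total SS
        return ss_num, ss_den, ss_num / ss_den

    print(r2_parts(x, y))      # ~ (80.68, 88.07, 0.916)

    # Append the influential outlier at (100, -100)
    x2, y2 = np.append(x, 100), np.append(y, -100.0)
    print(r2_parts(x2, y2))    # ~ (9769, 10137, 0.964)

    # Residual sum of squares of the original ten points under each line:
    # the fit to them is far worse with the outlier, yet R^2 is higher
    for xs, ys in ((x, y), (x2, y2)):
        slope, intercept = np.polyfit(xs, ys, 1)
        print(np.sum((y - (intercept + slope * x)) ** 2))   # ~7.4, then ~365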

To see that the fit to the rest of the data is actually much worse, compare the residuals of those ten points under the two lines (the last lines of the sketch above do exactly this); that lack of fit does very little to pull $R^2$ down.

This example demonstrates not only that $R^2$ can increase when an influential outlier is added, but also how that happens. (Conversely, if we start with the second data set and delete the influential outlier, $R^2$ goes down.)

It should serve as a cautionary tale: beware of interpreting $R^2$ as fit in any intuitive sense. It does measure a kind of fit, but a very particular kind, and its behaviour may not match your intuition.

Glen_b