According to my understanding, Cook's distance measures the influence of each observation by refitting the model with that observation excluded. So I assume it could be a reasonable approach for outlier detection?

My question: assuming the data are categorized into groups, is it possible to use Cook's distance to detect an "outlier" group instead of an outlier point? Is Cook's distance a good choice for measuring group influence?

mdewey
Roy C
  • Can you make a factor variable for the grouping and then do plots? – conv3d Apr 08 '16 at 23:20
  • Thank you, I just noticed there's a group option in `influence()`. I have another question, about the threshold: the usual 4/N cutoff is "too sensitive" for detecting outliers, and I only care about extremely influential groups/points. @jchaykow – Roy C Apr 09 '16 at 00:18
  • For smaller datasets, the Cook's D cutoff can be 1. – conv3d Apr 09 '16 at 00:22
  • @jchaykow It works well on some of my datasets, not really small datasets though. I'll try it on others later. Is it some kind of rule of thumb, and how should I interpret this cutoff? Thank you. – Roy C Apr 09 '16 at 02:19

2 Answers

As you said, Cook's distance measures the change in the regression when each individual point is removed. If things change quite a bit with the omission of a single point, then that point was having a lot of influence on your model. Define $\hat{Y}_{j(i)}$ to be the fitted value for the $j$th observation when the $i$th observation is deleted from the data set. Cook's distance measures how much deleting observation $i$ changes all the predictions.

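$$D_i = \frac{\sum_{j=1}^{n}\left(\hat{Y}_j - \hat{Y}_{j(i)}\right)^2}{p \, MSE} = \frac{e_i^2}{p \, MSE}\left[\frac{h_{ii}}{(1-h_{ii})^2}\right]$$

Here $p$ is the number of regression coefficients, $MSE$ is the mean squared error of the full fit, $e_i$ is the $i$th residual, and $h_{ii}$ is its leverage.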

A value of $D_i \geq 1$ is usually considered extreme (for small to medium datasets).

Cook's distance shows the effect of the $i$th case on all the fitted values. Note that the $i$th case can be influential because of

  1. big $e_i$ and moderate $h_{ii}$

  2. moderate $e_i$ and big $h_{ii}$

  3. big $e_i$ and big $h_{ii}$

In R, use `cooks.distance(model)`. It lives in the base `stats` package alongside `influence.measures()`, which is a function rather than a package.
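To make the two forms of the formula concrete, here is a minimal sketch in base R. The data are simulated and purely hypothetical (one case is given both high leverage and a large residual). It computes $D_i$ from the closed form, from the leave-one-out definition, and with `cooks.distance()`, so the three can be compared.

```r
# Simulated data (hypothetical): the last case has high leverage and a large residual.
set.seed(1)
n <- 30
x <- c(rnorm(n - 1), 6)
y <- 1 + 2 * x + rnorm(n)
y[n] <- y[n] - 15
dat <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = dat)
p   <- length(coef(fit))                     # number of estimated coefficients
mse <- sum(resid(fit)^2) / df.residual(fit)  # residual mean square of the full fit
h   <- hatvalues(fit)                        # leverages h_ii
e   <- resid(fit)                            # ordinary residuals e_i

# Closed form: D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2
d_closed <- e^2 / (p * mse) * h / (1 - h)^2

# Definition: refit without case i, then compare all n fitted values.
d_refit <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, data = dat[-i, ])
  sum((fitted(fit) - predict(fit_i, newdata = dat))^2) / (p * mse)
})

d_builtin <- cooks.distance(fit)

all.equal(unname(d_closed), unname(d_builtin))  # closed form vs. built-in
all.equal(d_refit, unname(d_builtin))           # refit definition vs. built-in
which(d_builtin >= 1)                           # cases flagged by the D_i >= 1 rule
which(d_builtin >= 4 / n)                       # cases flagged by the stricter 4/N rule
```

Comparing the last two lines on your own data gives a feel for how much more sensitive the 4/N rule is than the cutoff of 1.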

conv3d
  • Thanks for clarifying the definition. But my question is more about whether the – Roy C Apr 09 '16 at 03:18
  • @DaisyLee your comment got cut off – conv3d Apr 09 '16 at 03:25
  • Lol, just noticed it's you. Thanks for clarifying the definition. I want to ask whether the idea of extending Cook's distance to detect an outlier group, instead of individual points, is erroneous or reasonable. And what do you think of using a boxplot/IQR rule to cut off extremely influential Cook's distances? – Roy C Apr 09 '16 at 03:31
  • Using Cook's distance won't work here because of the nature of the method (i.e. it removes each point individually). If you simply want to check for outliers of a variable within your groups, using sd or a similar method as you state above, this is no problem... `df1 = df %>% group_by(grouping) %>% filter(!(abs(value - median(pred1)) > 2*sd(pred1))) %>% summarise_each(funs(mean), pred1)` – conv3d Apr 09 '16 at 03:43
  • @DaisyLee beyond this I'm out of ideas unfortunately. Maybe someone else can assist more. – conv3d Apr 09 '16 at 03:52
Cook's D is ineffective at detecting a cluster of outliers, because removing one of them will not change the model much (the other outliers are still there).

You could use the residual as a measure, which is sensitive to clusters. A simple implementation of k-means is also effective.
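One way to explore the masking problem described above is to delete a whole group at a time instead of a single point. The sketch below is only an illustration under stated assumptions: the data are simulated, the grouping factor `g` and the helper name `group_cooks` are hypothetical, and the $p \cdot MSE$ scaling is simply carried over from the single-case formula, so the usual cutoffs (1 or 4/N) are not necessarily appropriate for these values.

```r
# Simulated data (hypothetical): 5 groups of 8 cases; group "E" is shifted as a whole.
set.seed(2)
g <- factor(rep(LETTERS[1:5], each = 8))
x <- rnorm(length(g))
y <- 1 + 2 * x + rnorm(length(g), sd = 0.5)
shift <- g == "E"
x[shift] <- x[shift] + 4
y[shift] <- y[shift] - 6
dat <- data.frame(g = g, x = x, y = y)

fit <- lm(y ~ x, data = dat)
p   <- length(coef(fit))
mse <- sum(resid(fit)^2) / df.residual(fit)

# Group-deletion analogue of Cook's distance: drop every row of group k,
# refit, and measure the change in all fitted values on the same scale as D_i.
group_cooks <- sapply(levels(dat$g), function(k) {
  fit_k <- lm(y ~ x, data = dat[dat$g != k, ])
  sum((fitted(fit) - predict(fit_k, newdata = dat))^2) / (p * mse)
})

round(group_cooks, 3)                              # one influence value per group
round(tapply(cooks.distance(fit), dat$g, max), 3)  # largest single-case D_i per group
```

Comparing the two printed vectors shows how differently a group can look when it is removed as a unit rather than one case at a time.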

SmallChess