
I want to use linear regression to pre-process the data (e.g., to find outliers) so that I can then analyze it with techniques such as ANOVA. The goal is not to fit a regression model.

I saw two posts that are related to this topic, here and here.

In the first post, @Glen_b did a good job of illustrating that an outlier (especially a leverage point) can completely dominate the regression fit, which makes the outlier undetectable.

The second post agrees with the first in the sense that even robust techniques such as M-estimation are still vulnerable to leverage points (outliers in the design space).

My proposal:

  1. Fit a multiple linear regression and calculate the leverage statistics $h_i$ for $1 \leq i \leq n$. That is, the $h_i$ are the diagonal entries of the hat matrix $H = X (X^TX)^{-1} X^T$, where $X$ is the design matrix. Use a half-normal plot (or something else) to identify and remove the leverage points.

  2. Then use M-estimation to detect outliers (in the y-direction).
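To make the two steps concrete, here is a minimal sketch on simulated data. Several things in it are my own assumptions rather than part of the proposal: the helper names `leverages` and `huber_fit` are hypothetical, the cutoff $h_i > 2p/n$ is a common rule of thumb standing in for the half-normal plot, the M-estimator is a Huber fit by iteratively reweighted least squares with tuning constant 1.345 and a MAD scale estimate, and the threshold of 3 on the scaled residuals is arbitrary.

```python
import numpy as np

def leverages(X):
    """Return h_i, the diagonal of H = X (X^T X)^{-1} X^T."""
    Q, _ = np.linalg.qr(X)            # thin QR: X = QR, hence H = Q Q^T
    return np.sum(Q ** 2, axis=1)

def mad_scale(r):
    """Robust scale estimate: normalized median absolute deviation."""
    return max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-12)

def huber_fit(X, y, c=1.345, n_iter=50):
    """Huber M-estimate via IRLS; returns coefficients and scaled residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # start from OLS
    for _ in range(n_iter):
        r = y - X @ beta
        u = np.abs(r) / mad_scale(r)
        w = np.minimum(1.0, c / np.maximum(u, 1e-12))    # Huber weights
        sw = np.sqrt(w)
        # Weighted least squares, done by scaling rows with sqrt(weights)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    r = y - X @ beta
    return beta, r / mad_scale(r)

# Simulated data (assumption): a clean line, one y-outlier, one leverage point
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)
y[10] += 8.0                          # outlier in the y-direction
x = np.append(x, 30.0)                # leverage point, far out in x ...
y = np.append(y, 0.0)                 # ... and far from the true line
X = np.column_stack([np.ones_like(x), x])

# Step 1: flag high-leverage points (rule of thumb, not a half-normal plot)
n, p = X.shape
h = leverages(X)
high_lev_idx = np.flatnonzero(h > 2 * p / n)

# Step 2: M-estimation on the remaining points, then flag large residuals
keep = np.ones(n, dtype=bool)
keep[high_lev_idx] = False
beta, z = huber_fit(X[keep], y[keep])
y_out_idx = np.flatnonzero(keep)[np.abs(z) > 3.0]

print("high-leverage indices:", high_lev_idx)
print("y-outlier indices:", y_out_idx)
```

Note that `leverages` never looks at `y`, so step 1 is driven by the design matrix alone. The sketch contains only a single leverage point and says nothing about how the procedure behaves when there are several of them.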

Does anyone have any comments or thoughts on this?

Edit: @Glen_b Is step 1 above still valid even if there is a leverage point that is very influential (like in the second plot in your previous post)? In other words, would we be able to detect the leverage point by using the leverage statistics?

Jack Shi
  • I don't fully understand the question because ANOVA *is* a regression model. I think this issue is important because how you deal with leverage points might depend on the nature of the model you eventually do wish to employ, as well as what your analytical objectives might be. Removing all leverage points just because they have high leverage seems risky at best and dangerous at worst. – whuber Mar 25 '16 at 20:34
  • @whuber can you elaborate a little more? As far as pre-processing outliers is concerned, what is the difference between ending with a regression model and ending with classification methods (e.g., linear discriminant analysis) or tree methods (e.g., random forests)? Furthermore, do you think my proposal is reasonable? My concern is that step 1 will have an impact on step 2 (they are "correlated"), which would make the process inaccurate. – Jack Shi Mar 25 '16 at 21:16
  • have you *tried* this against some simulated outlier*s*? – user603 Apr 24 '16 at 13:44

0 Answers