I want to use linear regression to pre-process the data (e.g., to find outliers) so that I can then analyze the data with techniques like ANOVA. The goal is not to fit a regression model.
I saw two posts that are related to this topic, here and here.
In the first post, @Glen_b did a good job of illustrating that an outlier (especially a leverage point) can completely dominate the regression fit and thereby become undetectable.
The second post agrees with the first in the sense that even robust techniques such as M-estimation are still vulnerable to leverage points (outliers in the design space).
My proposal:
1. Fit a multiple linear regression and calculate the leverage statistics $h_i$ for $1 \leq i \leq n$. That is, the $h_i$ are the diagonal entries of the hat matrix $H = X (X^TX)^{-1} X^T$, where $X$ is the design matrix. Use a half-normal plot (or something else) to identify and remove the leverage points.
2. Then use M-estimation to detect outliers (in the $y$ direction).
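To make the two steps concrete, here is a minimal sketch in Python using only NumPy. Several choices are assumptions made for illustration and are not part of the proposal itself: the toy data, the $2p/n$ rule of thumb used in place of a half-normal plot, the Huber loss with tuning constant 1.345 as the M-estimator, the MAD as the residual scale, and the cutoff of 3 for standardized residuals. The helper names `leverage` and `huber_irls` are hypothetical.

```python
import numpy as np

def leverage(X):
    """Diagonal of H = X (X^T X)^{-1} X^T, computed from a thin QR factorization."""
    Q, _ = np.linalg.qr(X)          # X = QR, so H = Q Q^T when X has full column rank
    return np.sum(Q**2, axis=1)

def huber_irls(X, y, c=1.345, n_iter=50, tol=1e-8):
    """Huber M-estimate via iteratively reweighted least squares (illustrative)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # start from ordinary least squares
    for _ in range(n_iter):
        r = y - X @ beta
        scale = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-12)  # MAD scale
        u = np.abs(r) / scale
        w = c / np.maximum(u, c)                       # Huber weights: 1 if u <= c, else c/u
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    r = y - X @ beta
    scale = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-12)
    return beta, r / scale                             # coefficients, standardized residuals

# Toy data (assumed): a straight line with one leverage point and one y-outlier.
rng = np.random.default_rng(0)
n = 50
x = np.linspace(-2.0, 2.0, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 10.0, 0.0      # leverage point: far out in x and off the line
y[1] += 8.0                 # outlier in y only
X = np.column_stack([np.ones(n), x])
p = X.shape[1]

# Step 1: flag high-leverage points (2p/n rule of thumb, assumed here).
h = leverage(X)
high = h > 2 * p / n

# Step 2: M-estimation on the remaining points, then flag large residuals.
keep = np.flatnonzero(~high)
beta, z = huber_irls(X[keep], y[keep])
outliers = keep[np.abs(z) > 3]   # indices into the original data

print("high-leverage indices:", np.flatnonzero(high))
print("y-outlier indices:", outliers)
```

Because the leverages are computed from $X$ alone, step 1 in this sketch never looks at $y$; only step 2 does.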
Does anyone have any comments or thoughts on this?
Edit: @Glen_b Is step 1 above still valid even if there is a leverage point that is very influential (like in the second plot in your previous post)? In other words, would we be able to detect the leverage point by using the leverage statistics?
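For reference, in the special case of simple linear regression with an intercept, the leverage has a standard closed form (stated here as a reminder, not as an answer to the question):

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$$

This depends only on the $x$ values and not on $y$, so in that case $h_i$ measures how far $x_i$ lies from the mean of the design points.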