I am looking to clean up a large data set (~300 observations) with a large number of features (~140), and I would like to explore outliers in the data. My first thought was to use PCA to reduce these features to a few components that explain most of the variance, and then exclude observations that are outliers on these n new components.

However, there are a few issues with this approach. The first is that PCA is itself affected by outliers. Would robust PCA be a good alternative here?
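To illustrate the first concern, here is a minimal sketch on made-up 2-D data (not my real data; the numbers and the helper name `first_pc` are purely illustrative). It compares the first principal direction before and after a single gross outlier is added.

```python
import numpy as np

def first_pc(X):
    """Return the first principal direction of X (classical PCA via SVD)."""
    Xc = X - X.mean(axis=0)                      # center each column
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]                                 # unit vector of largest variance

rng = np.random.default_rng(0)

# 300 synthetic points that vary mostly along the first axis
X = rng.normal(size=(300, 2)) * np.array([3.0, 1.0])
pc_clean = first_pc(X)

# Same data, but one observation replaced by a gross outlier on the second axis
X_out = X.copy()
X_out[0] = [0.0, 200.0]
pc_out = first_pc(X_out)

# Angle between the two directions (sign of a principal direction is arbitrary)
cos = abs(float(pc_clean @ pc_out))
angle = np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))
print(f"first PC rotated by {angle:.1f} degrees because of one point")
```

In this toy setup a single point is enough to pull the leading component toward itself, which is why outliers found on classical PCA scores may partly be an artifact of the outliers themselves.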

The second issue is that I would like to partial out some variables, such as age and sex, without removing data based on them. I would like to control for them first and then apply outlier detection to the result. Does it make sense to partial these out first and then apply a robust PCA outlier analysis to the residuals?
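Roughly, the pipeline I have in mind could be sketched as follows. This is only a sketch under several assumptions: the function names `residualize` and `flag_outliers` are hypothetical, covariates are numeric (e.g. sex coded 0/1), the partialling out is done with ordinary least squares (which is itself not robust), and instead of a true robust PCA it uses classical PCA on median/MAD-scaled residuals followed by robust Mahalanobis distances from scikit-learn's `MinCovDet`. The number of components and the chi-squared cutoff are arbitrary choices.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.decomposition import PCA

def residualize(X, covariates):
    """Regress every column of X on the covariates (plus intercept) by OLS
    and return the residuals. Note: OLS itself can be distorted by outliers."""
    Z = np.column_stack([np.ones(len(covariates)), covariates])
    beta, *_ = np.linalg.lstsq(Z, X, rcond=None)
    return X - Z @ beta

def flag_outliers(X, covariates, n_components=5, alpha=0.001, seed=0):
    """Flag rows whose robust distance in PCA-score space is unusually large.

    n_components and alpha are illustrative defaults, not recommendations.
    """
    R = residualize(X, covariates)

    # Robust per-feature scaling: median and MAD instead of mean and SD
    med = np.median(R, axis=0)
    mad = np.median(np.abs(R - med), axis=0)
    mad[mad == 0] = 1.0                          # guard against constant columns
    Rs = (R - med) / (1.4826 * mad)

    # Classical PCA on the robustly scaled residuals (NOT a robust PCA)
    scores = PCA(n_components=n_components).fit_transform(Rs)

    # Robust location/scatter of the scores, then squared Mahalanobis distances
    mcd = MinCovDet(random_state=seed).fit(scores)
    d2 = mcd.mahalanobis(scores)

    # Approximate cutoff, assuming roughly Gaussian scores for the clean data
    cutoff = chi2.ppf(1 - alpha, df=n_components)
    return d2 > cutoff, d2
```

With `X` as the observations-by-features array and `covariates` as an array with one column per nuisance variable, `flags, d2 = flag_outliers(X, covariates)` would return a boolean mask and the distances, so flagged rows can be inspected rather than dropped automatically.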

neuroguy123
  • 300 is not a large data set; it's rather small, and almost falls into the finite-sample-size domain – Aksakal Mar 20 '18 at 20:34
  • Fair enough. It's relatively large for the field I'm in. I know this is flagged as a duplicate and that's fine. I believe my plan of regressing out age for each variable and running some form of robust PCA on the residuals is a good start. – neuroguy123 Mar 20 '18 at 21:04