
I am used to working with a manageable number of features. I usually print some descriptive statistics and visualise a histogram of each feature using Python and Pandas or R. I check for outliers and whether the data points follow a normal distribution or need a transformation.

Now I am dealing with around 200 features, and it is not feasible to check each feature manually.

Is there a method to automatically check for a power-law distribution or to spot outliers in the data?

amrakm
  • If you want to detect interesting observations (what you call outliers) in a multivariate sense you could use PCA and then plot in the space of the first few components. If you want to automate univariate detection of interesting points then I think that is off-topic here as a request for code. – mdewey Dec 19 '16 at 16:03
  • Could you please explain why plotting in the space of the PCA components would show any interesting observations? – amrakm Dec 19 '16 at 16:54
  • You can filter out outliers by discarding data points that lie more than three standard deviations from the mean, as in the following example: https://github.com/vsmolyakov/pyspark/blob/master/outliers.py – Vadim Smolyakov Jun 02 '17 at 21:10
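
A minimal sketch of the two ideas from the comments above, assuming the data sits in a pandas DataFrame with numeric columns. It is not the code from the linked repository. The function names `screen_features` and `pca_scores`, the threshold parameter `z_thresh`, and the synthetic demo data are all hypothetical and only there for illustration. The first function ranks features by how many points lie more than three standard deviations from the mean, and also reports skewness and excess kurtosis as rough hints that a feature may need a transformation. The second projects standardised data onto the first few principal components so unusual rows can be looked for in one scatter plot.

```python
import numpy as np
import pandas as pd


def screen_features(df: pd.DataFrame, z_thresh: float = 3.0) -> pd.DataFrame:
    """Per-feature screen: skewness, excess kurtosis, count of |z| > z_thresh."""
    num = df.select_dtypes(include="number")
    # Constant columns have zero spread; mark it NaN so they yield no outliers.
    std = num.std(ddof=0).replace(0, np.nan)
    z = (num - num.mean()) / std
    report = pd.DataFrame({
        "skew": num.skew(),
        "excess_kurtosis": num.kurt(),  # pandas reports excess (Fisher) kurtosis
        "n_outliers": (z.abs() > z_thresh).sum(),
    })
    # Features with the most flagged points come first.
    return report.sort_values("n_outliers", ascending=False)


def pca_scores(df: pd.DataFrame, n_components: int = 2) -> pd.DataFrame:
    """Project standardised numeric columns onto the leading principal components."""
    num = df.select_dtypes(include="number").dropna()
    num = num.loc[:, num.std(ddof=0) > 0]  # drop constant columns
    x = ((num - num.mean()) / num.std(ddof=0)).to_numpy()
    # SVD of the standardised matrix: U * S gives the component scores.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    cols = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(scores, index=num.index, columns=cols)


# Synthetic demo data (an assumption, not data from the question).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "normal": rng.normal(size=1000),
    "heavy": rng.lognormal(size=1000),  # right-skewed feature
})
df.loc[0, "normal"] = 8.0  # planted outlier

report = screen_features(df)
scores = pca_scores(df, n_components=2)
print(report)
print(scores.head())
```

With around 200 features, sorting `report` by `n_outliers` or by absolute skewness narrows the manual histogram checks to a handful of columns. One caveat: the mean and standard deviation are themselves pulled by extreme values, so the three-standard-deviation rule can miss outliers in heavily contaminated features.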

0 Answers