
I'm playing around with a dataset and would like to run some clustering on it, but I'm hitting some issues with scaling and the effect it has on my principal component analysis (PCA). I am not quite sure what approach to take here. Unfortunately I'm unable to provide example data since it's quite a large dataset, but I don't believe it is necessary as this is more of a conceptual question.

Essentially, I compared the results of standardization, normalization, and robust standardization (like standardization, only using the IQR and the median instead of the STD and mean), all implemented in scikit-learn in Python. I then ran sklearn's PCA on the data, and plotted the variance explained against each PC and got the following outputs (y-axis is variance explained, labels refer to cumulative variance explained):

[Image: variance explained per principal component for each scaling method, labelled with cumulative variance explained]
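Since I can't share the data, here is a minimal sketch of the kind of comparison described above. It is not my exact code: the log-normal data is a synthetic stand-in for my skewed dataset, and I'm assuming "normalization" means `MinMaxScaler` here (the names `X`, `scalers` and `explained` are just for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic stand-in for the real data (assumption): positively skewed
# log-normal features whose raw scales differ from column to column.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 6)) * np.array([1, 5, 10, 50, 100, 500])

scalers = {
    "standard": StandardScaler(),  # subtract mean, divide by standard deviation
    "minmax": MinMaxScaler(),      # rescale each feature to [0, 1]
    "robust": RobustScaler(),      # subtract median, divide by IQR
}

explained = {}
for name, scaler in scalers.items():
    X_scaled = scaler.fit_transform(X)
    pca = PCA().fit(X_scaled)
    # Fraction of total variance carried by each principal component
    explained[name] = pca.explained_variance_ratio_
    print(name, np.round(np.cumsum(explained[name]), 3))
```

Each entry of `explained` holds the per-component variance ratios that would go on the y-axis of a plot like the one above, and the printed cumulative sums correspond to the labels.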

My data are highly positively skewed for the most part and seem to contain a few outliers, though it's hard to tell given this skew. I'm clustering on the first few components that explain most of the variance, and eyeballing some of the resulting plots shows that the clustering is quite close to what I want. However, the number of clusters and especially the silhouette scores differ considerably depending on the scaling I'm using.
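To make the silhouette comparison concrete, here is a rough sketch of the pipeline. Several things in it are assumptions for illustration only: the two-group synthetic data, the use of `KMeans` with two clusters, and keeping two components. My real setup differs in the details.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic stand-in (assumption): two skewed groups with different locations
rng = np.random.default_rng(0)
X = np.vstack([
    rng.lognormal(mean=0.0, sigma=0.5, size=(200, 5)),
    rng.lognormal(mean=1.5, sigma=0.5, size=(200, 5)),
])

scores = {}
for name, scaler in [
    ("standard", StandardScaler()),
    ("minmax", MinMaxScaler()),
    ("robust", RobustScaler()),
]:
    # Scale, then keep the first two principal components
    Z = make_pipeline(scaler, PCA(n_components=2)).fit_transform(X)
    # Cluster in the reduced space (KMeans is an illustrative choice)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
    # The silhouette is computed in the same reduced space that was clustered
    scores[name] = silhouette_score(Z, labels)

for name, score in scores.items():
    print(f"{name}: silhouette = {score:.3f}")
```

Note that each score is computed in its own scaled and projected space, so the three numbers describe different geometries rather than the same clustering measured three ways.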

My questions:

  1. Given that my data are not Gaussian distributed, is it best to use robust scaling over the others?
  2. Why is the PCA result so different depending on the scaling?
  3. Is there anything else that I could do instead/additionally to improve this?

I'd appreciate any feedback on this.

fffrost
  • The need for scaling before PCA is discussed here: https://stats.stackexchange.com/questions/69157/why-do-we-need-to-normalize-data-before-principal-component-analysis-pca – spdrnl Jul 09 '20 at 20:46
  • Thanks for the suggestion but I don't see how this relates to my question, unless I overlooked something. – fffrost Jul 09 '20 at 21:40
  • PCA depends on the scaling method. Several works have shown what the optimal scaling may be in various contexts. Robust scaling will likely help if you have outliers. If the distribution is right-skewed, you can try a log transform first. –  Jul 09 '20 at 23:48
  • Thanks, do you have any links to these works? I have not come across anything specific about this – fffrost Jul 10 '20 at 06:38
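The log-transform suggestion in the comments can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the data is synthetic log-normal, `np.log1p` is used instead of a plain log so that zeros are handled, and the pipeline name `log_then_scale` is made up for the example.

```python
import numpy as np
from scipy.stats import skew
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic right-skewed, strictly positive data (assumption)
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 4))

# log1p(x) = log(1 + x); it requires x > -1, which holds for non-negative data
log_then_scale = make_pipeline(FunctionTransformer(np.log1p), StandardScaler())
X_t = log_then_scale.fit_transform(X)

# Sample skewness per feature, before and after the transform
skew_before = skew(X, axis=0)
skew_after = skew(X_t, axis=0)
print("skew before:", np.round(skew_before, 2))
print("skew after: ", np.round(skew_after, 2))
```

Standardization alone is a linear rescaling and leaves the skewness of each feature unchanged, so any reduction seen here comes from the log step.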
