
I am struggling a little bit with PCA. I understand that standardization is an important part of the algorithm, but I do not understand which elements should be standardized. Let's say I have a 10x100 matrix X where the 10 rows are the samples and the 100 columns are the features. Each sample is an RGB image flattened into an array (my real dataset has 1087 samples, each one with 154587 features).
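For concreteness, this is roughly how I build the matrix (a toy sketch with made-up shapes; the real images are much larger, 154587 features would correspond to something like 227x227x3):

```python
import numpy as np

# Toy sizes instead of the real 1087 samples x 227 x 227 x 3 images
n_samples, height, width = 10, 5, 5
images = np.random.randint(0, 256, size=(n_samples, height, width, 3), dtype=np.uint8)

# Flatten each image: rows = samples, columns = features
X = images.reshape(n_samples, -1).astype(float)
print(X.shape)   # (10, 75)
```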

Should I standardize each feature or each sample? What if I do not take the rows and columns into account at all and simply do this:

X_std = (X - X.mean()) / X.std()
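To make the three options concrete, this is roughly what I mean (a sketch assuming X is a NumPy array with samples in rows and features in columns):

```python
import numpy as np

X = np.random.rand(10, 100)   # toy data: 10 samples, 100 features

# 1) feature-wise: each column gets zero mean and unit variance
X_feat = (X - X.mean(axis=0)) / X.std(axis=0)

# 2) sample-wise: each row gets zero mean and unit variance
X_samp = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# 3) global: a single mean and a single standard deviation for the whole matrix
X_glob = (X - X.mean()) / X.std()
```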

I can't figure out why I should standardize with respect to the features, the samples, or the entire dataset. What I do know is that sklearn's StandardScaler by default standardizes feature-wise, giving each feature zero mean and unit variance.
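For reference, this is the sklearn behaviour I am describing (a minimal check with default parameters):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.rand(10, 100)

# StandardScaler works column by column: subtract each feature's mean, divide by its std
X_scaled = StandardScaler().fit_transform(X)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))   # True
```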

Thank you for your help

matteof93
  • You can do neither, one, both, or either (in either order--the order matters): it depends on what your data mean and what your objectives are. – whuber Dec 28 '18 at 23:34
  • My data are simply RGB images treated as arrays, and I do not understand the difference between standardizing the entire dataset directly with the code above and feature-wise or sample-wise standardization. – matteof93 Dec 28 '18 at 23:40
  • Would it be better to standardize sample-wise? Might be duplicate to this [question](https://stats.stackexchange.com/q/200070/103153)? – Lerner Zhang Jan 01 '19 at 08:51

0 Answers