Suppose we have a data set in which several features are measured in the same units for each subject. For example, the numbers of different cell types (features) in a tumour (subject), with n tumours and m features.

If we want to see which cell types (features) explain most of the variation across tumours (subjects), is it correct to z-score the feature values within each subject (i.e. so that each subject's distribution of values is centred around 0)?

Thanks.

alejandro
Yes, if you need to remove level and scale differences between the profiles (subjects), you may do that. You may then perform PCA of the features (the usual way, or R-way), or PCA of the transposed data, i.e. of the subjects ([Q-way](https://stats.stackexchange.com/a/20103/3277)). – ttnphns Nov 14 '18 at 10:19

1 Answer

It's a reasonable choice, but z-scoring is not the only option.

PCA requires each column (feature) to have zero mean. Without centering, the first principal component tends to point toward the mean of the data rather than along the direction of greatest variance. See this answer for a nice visualization and more details:
https://stats.stackexchange.com/a/22331/226852
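To make the centering point concrete, here is a minimal NumPy sketch. The data are simulated and entirely hypothetical (200 subjects, 2 features, a large offset from the origin, most variance along the direction (1, -1)); the helper `first_right_singular_vector` is an illustrative name, not part of any library. It compares the leading SVD direction with and without column centering.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 subjects x 2 features, far from the origin.
# Most of the variance lies along the unit vector `direction`.
direction = np.array([1.0, -1.0]) / np.sqrt(2.0)
scores = rng.normal(scale=3.0, size=(200, 1))
noise = rng.normal(scale=0.3, size=(200, 2))
X = 50.0 + scores * direction + noise


def first_right_singular_vector(A):
    """Return the leading right singular vector of A (a unit vector)."""
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    return vt[0]


# Leading direction without centering vs. with column centering.
pc1_raw = first_right_singular_vector(X)
pc1_centered = first_right_singular_vector(X - X.mean(axis=0))

# |cosine| between each estimate and the true high-variance direction.
align_raw = abs(pc1_raw @ direction)
align_centered = abs(pc1_centered @ direction)

print(f"alignment without centering: {align_raw:.3f}")
print(f"alignment with centering:    {align_centered:.3f}")
```

In this simulated setting the uncentered direction is dominated by the offset of the data from the origin, while the centered one recovers the high-variance direction. Library PCA routines such as scikit-learn's `PCA` center the columns internally, so the issue mainly arises when running SVD by hand.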

Other scaling methods can also give your data zero mean. See this comparison of scalers:
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
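As a small illustration of how these choices differ, the sketch below uses a made-up count matrix (6 tumours by 4 cell types; the numbers are invented for the example) and plain NumPy. It contrasts within-subject (row-wise) z-scoring, as asked in the question, with column centering and column z-scoring. One thing it suggests: row-wise z-scoring on its own does not make the column means zero, so a PCA of features would still need column centering afterwards.

```python
import numpy as np

# Hypothetical counts: rows are tumours (subjects), columns are cell types (features).
X = np.array(
    [
        [210, 48, 22, 4],
        [190, 55, 18, 7],
        [205, 51, 25, 3],
        [198, 44, 19, 6],
        [220, 60, 21, 5],
        [185, 47, 16, 8],
    ],
    dtype=float,
)

# Within-subject z-score: each row ends up with mean 0 and standard deviation 1.
row_z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Column centering only: each feature has mean 0, original scale is kept.
col_centered = X - X.mean(axis=0)

# Column z-score: each feature has mean 0 and standard deviation 1.
col_z = col_centered / X.std(axis=0)

print("column means after row-wise z-scoring:", row_z.mean(axis=0).round(3))
print("column means after column centering:  ", col_centered.mean(axis=0).round(3))
print("column means after column z-scoring:  ", col_z.mean(axis=0).round(3))
```

Whether to stop at centering or also rescale the columns depends on whether differences in feature variance are meaningful for the analysis. When all features share the same units, as here, centering alone is often a reasonable default.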

S.J.