I am aware this has been discussed here before, but I would like to ask a bit more about the topic under a few different scenarios.
1. Standardizing the whole dataset before splitting. In general this is considered wrong because of information leakage from the training set to the test set. But what if I have a sufficiently large dataset and all data points are well mixed, so that the training and test sets are drawn from the same stable distribution? How realistic would this assumption be?
2. Standardizing the training and test sets separately. Would the assumption in 1. render this way of standardizing sensible?
3. Using the statistics of the training set to standardize the test set. This is recommended everywhere. The potential drawback is that if the entire dataset is small, the training-set statistics might produce abnormally large or small standardized values in the test data.
4. Using population statistics to standardize the data. For example, if I am classifying human proteins, the population could be all proteins across living systems, or all mammalian proteins. The big problem here is that even all natural proteins combined are still a tiny fraction of the combinatorial space of sequences over the 20 amino acids. So my question is how reliable it is to use such a population in this case. (All four options are sketched in code below.)
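To make the four scenarios concrete, here is a minimal sketch, assuming scikit-learn's `StandardScaler` and a toy feature matrix `X` with labels `y` (both placeholders, not my real protein data). Options 1–3 differ only in which data the scaler statistics are computed from; option 4 uses externally supplied means and standard deviations (`pop_mean` and `pop_std` below are hypothetical stand-ins for whatever population estimates one trusts):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # toy feature matrix (placeholder)
y = rng.integers(0, 2, size=1000)       # toy labels (placeholder)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Option 1: standardize the whole dataset before splitting
# (test-set statistics influence the scaling of the training data).
X_all_scaled = StandardScaler().fit_transform(X)

# Option 2: standardize training and test sets separately
# (each set is centered/scaled by its own statistics, so the two
# sets end up on slightly different scales).
X_train_sep = StandardScaler().fit_transform(X_train)
X_test_sep = StandardScaler().fit_transform(X_test)

# Option 3: fit the scaler on the training set only and reuse those
# statistics for the test set (the usual recommendation).
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Option 4: standardize with externally supplied population statistics
# (hypothetical values, e.g. estimates over all mammalian proteins).
pop_mean = np.zeros(X.shape[1])   # placeholder population means
pop_std = np.ones(X.shape[1])     # placeholder population std devs
X_train_pop = (X_train - pop_mean) / pop_std
X_test_pop = (X_test - pop_mean) / pop_std
```

In all four cases the only thing that changes is which data the per-feature means and standard deviations are estimated from.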
Many thanks in advance!