
For the past week, I have been repeatedly asking people on this sub how to avoid data leakage during preprocessing steps like feature selection and/or scaling, here and here.


I understand most of the ideas to a large extent, but I can't help but wonder how one should then approach EDA processes that make use of these same preprocessing techniques.

As an example:

I want to perform some PCA during EDA to check the explained variance of my features. For the EDA, I would first standardize my whole dataset and then run PCA on it, gaining some intuition about which features account for most of the variance. So... what next? To avoid introducing leakage, I would only keep this EDA's conclusions in mind and then move on to cross-validation with PCA as part of my pipeline, because in this EDA I have technically done two "data leakage" preprocessing steps: one is standardization, the other is PCA. Of course, Professor Frank Harrell, whom I look up to tremendously, did mention in the feature selection post that it is sometimes OK to do a one-shot PCA, if and only if you do not re-estimate the PCA loadings on the test set. He also suggested a more iterative approach in which you train the model with and without PCA to really compare (he suggested simulating my dataset, but unfortunately I do not know how to simulate it, as I am unsure what distributions my features follow).
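For concreteness, here is roughly what I have in mind for the modelling step (a minimal sketch assuming scikit-learn and placeholder data; my real features are different). Standardization and PCA sit inside the pipeline, so their parameters are re-estimated on each training fold only and never see the held-out fold:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for my actual dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),   # number of components informed by the EDA
    ("clf", LogisticRegression()),
])

# Scaler and PCA are refit inside each training fold, so no leakage into the test fold
scores = cross_val_score(pipe, X, y, cv=10)
print(scores.mean())
```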

My main question is: if data leakage is such a big issue, is it also an issue when I do my EDA on the whole dataset? I mean... surely you wouldn't split your dataset into train and test and then do EDA only on the train set? And if that were the case, wouldn't it be even more troublesome if you decided to do 10-fold cross-validation, since you surely wouldn't split the dataset into 10 folds and then do EDA on each fold separately?
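The alternative I can imagine (again a sketch with scikit-learn and placeholder data, not my actual workflow) would be to hold out a test set first and run the exploratory standardize-then-PCA step on the training portion only, so the test rows are never touched during EDA:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Placeholder data standing in for my actual dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Exploratory PCA fitted on the training portion only
X_train_std = StandardScaler().fit_transform(X_train)
pca = PCA().fit(X_train_std)
print(pca.explained_variance_ratio_)
```

But that seems awkward once cross-validation enters the picture, which is exactly what I am unsure about.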
