Questions tagged [data-leakage]
47 questions
11 votes
4 answers
Combining PCA, feature scaling, and cross-validation without training-test data leakage
The scikit-learn documentation for cross-validation says the following about using feature scaling and cross-validation:
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature…

woblers
- 293
- 2
- 9
5 votes
1 answer
Does using a random train-test split lead to data leakage?
I am trying to understand data leakage in modeling practice.
If we had a dataset of patient instances from 2000-2018 (with all patient visits included), and used a randomly selected train-test split (say 80%/20%), would that lead to data leakage?
To…

AmeySMahajan
- 123
- 6
5 votes
1 answer
What does it mean that there is leakage of information when one uses a test set?
I have read about the term "leakage of information" that occurs when one tries to estimate the generalization error by using a test set in Machine Learning models. However, I was not able to find any formalism (maths or statistics) that could help…

Layla
- 569
- 1
- 5
- 16
5 votes
1 answer
Using Random Forest variable importance for feature selection
I'm currently trying to convince my colleague that his method of doing feature selection is causing data leakage and I need help doing so.
The method they are using is as follows:
They first run a random forest on all variables and get the feature…

astel
- 1,388
- 5
- 17
4 votes
1 answer
When imputing missing values in a test set, should the new values come from the training set or be recalculated from the test set?
Both answers to this question on imputing missing values note that, when imputing missing values in a test set for model evaluation, the replacement values should be the ones calculated and used in the training process (not calculated anew on the…

danpelota
- 257
- 2
- 8
4 votes
0 answers
Can SVM leak training data?
Is it possible, given access to a trained model (e.g. through some API), to reverse-engineer the model by requesting predictions for arbitrary data, thereby recovering the support vectors of the model and thus the original training data (medical…

rep_ho
- 6,036
- 1
- 22
- 44
3 votes
0 answers
Is it okay to include the dependent variable as an input variable to the higher-level regression model, in a hierarchical / multi-level setup?
Let's say I have a hierarchical dataset with student scores (for each student) nested within schools. While modelling for a varying intercept, would it be okay to include the average of student scores within a school as an input variable while…

infinitesimal
- 110
- 6
2 votes
1 answer
Data Leakage Concerns
I've come across the concept of data leakage in which optimistically biased generalisation errors occur due to test data in some sense 'seeing' the training data. For instance, normalisation on an entire dataset before train/test splitting occurs…

N Blake
- 539
- 3
- 8
2 votes
1 answer
What is the difference between standardizing time series data and non-time series data?
From reading some answers on this site (1, 2, 3 and 4) I found that, on time series data, standardization must be applied separately on the train and test sets to avoid data leakage.
So the train data would be standardized using a different mean…

Marcus
- 255
- 1
- 4
2 votes
1 answer
How to split multiple measurements of the same sample between folds
I'm working on a spectroscopy problem. Based on reflectivity values for wavelengths from the spectrum, I build a regression to find a target for the sample.
I have 30 samples. For each sample I take measurements of its spectrum 3 times. In total, I…

Mishin V.
- 23
- 4
2 votes
1 answer
Data leakage when scaling time series
Suppose I want to forecast future values of $y$ from past values of features $x$.
In this example I am using:
the training set goes from $t_0$ to $t_{15}$
values from $x_{t_0}$ to $x_{t_{10}}$ to forecast $y_{t_{11}}$
values from $x_{t_1}$ to…

gioxc88
- 1,010
- 7
- 20
2 votes
1 answer
Using target variable in training process
I am dealing with a task where I train a classification model to predict whether an item is going to be returned in a web shop.
Can I use features which contain information from the target variable?
For example can I create a feature with the…

end9
- 23
- 3
2 votes
1 answer
Data leakage if I add prediction as feature?
I have a training set and a test set.
Let's assume the following:
I train a random forest on the training set
I make predictions on the training set and the test set
Then I add those predictions as features back into the training set and the test set
Now I…

Viðar Ingason
- 407
- 2
- 10
2 votes
1 answer
One-hot encoding vs. applying the average of the label to each category
I have a reasonably sized dataset (rows > 50k), and I'm looking for the best way to utilize some of the categorical columns. For the purpose of this question, let's say that one of the categorical columns is zipcode. The premise is, after feature…

Rocky Li
- 145
- 5
2 votes
0 answers
Data leakage concern in a binary classification problem
I have a binary classification problem (where 1 = broken and 0 = not broken) for machine engines under study. There are 25 continuous features that I use to predict 1 or 0 with a random forest (RF). These 25 features in addition to…

Jane Wayne
- 1,268
- 2
- 14
- 24