Wondering what shuffle does if I set it to True when splitting a dataset into train and test splits. Can I use shuffle on a dataset which is ordered by dates?
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=42, shuffle=True)
With time-series data, where you can expect autocorrelation, you should not split the data randomly into train and test sets. Instead, split on time, so that you train on past values to predict future ones. Scikit-learn has the TimeSeriesSplit functionality for this.
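To make the time-based split concrete, here is a minimal sketch of TimeSeriesSplit on toy data. The array X and the choice of n_splits=3 are assumptions for illustration only; the rows are assumed to be sorted from oldest to newest already.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve toy observations, assumed to be sorted from oldest to newest.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for train_idx, test_idx in folds:
    # In every fold, all training indices come before all test indices,
    # so the model never sees observations from the "future".
    print("train:", train_idx, "test:", test_idx)
```

Each fold trains on a growing prefix of the data and tests on the block that immediately follows it.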
The shuffle parameter is needed to prevent non-random assignment to the train and test sets. With shuffle=True you split the data randomly. For example, say that you have balanced binary classification data and it is ordered by labels. If you split it in 80:20 proportions into train and test without shuffling, your test data would contain labels from only one class. Random shuffling prevents this.
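The ordered-labels problem is easy to reproduce. The sketch below uses made-up data (50 zeros followed by 50 ones) and compares the test labels you get with and without shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy balanced binary labels, sorted by class: 50 zeros followed by 50 ones.
y = np.array([0] * 50 + [1] * 50)
X = np.arange(100).reshape(-1, 1)

# Without shuffling, the test set is simply the last 20% of the rows.
_, _, _, y_test_ordered = train_test_split(X, y, test_size=0.2, shuffle=False)

# With shuffling, rows are assigned at random (reproducibly, given random_state).
_, _, _, y_test_shuffled = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

print("unshuffled test labels:", set(y_test_ordered.tolist()))
print("shuffled test labels:  ", set(y_test_shuffled.tolist()))
```

The unshuffled test set holds only the class that happens to sit at the end of the data, while the shuffled one draws from the whole dataset.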
If random shuffling would break the structure of your data, that is a good argument for not splitting randomly into train and test sets. In such cases, you would split on time, or use clustered splits (say you have data on education: you would assign whole schools to train or test, rather than individual students).
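For the clustered case, one possible approach is scikit-learn's GroupShuffleSplit, which assigns whole groups to one side of the split. This is a sketch on hypothetical data: the school labels and sizes are invented, and GroupShuffleSplit is only one way of doing a clustered split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 12 students across 4 schools (school IDs are made up).
schools = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"])
X = np.arange(len(schools)).reshape(-1, 1)

# Here test_size=0.25 is the fraction of groups (schools), not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, groups=schools))

train_schools = set(schools[train_idx])
test_schools = set(schools[test_idx])
print("train schools:", sorted(train_schools))
print("test schools: ", sorted(test_schools))
```

No school appears on both sides, so students from the same school never leak between train and test.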
When should you use shuffle=False? TL;DR never. Use the TimeSeriesSplit functionality instead, or write the code by hand to have greater control over what you are doing. For example, if you want to split by years, you probably don't want a few days of one year to land by accident in a different set than the rest of that year, so you would rather do the split manually.
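A manual split by year could look roughly like the sketch below. The records, the date range and the cutoff year are all assumptions for illustration; plain Python tuples stand in for whatever structure your data actually has.

```python
from datetime import date, timedelta

# Hypothetical daily records spanning 2021-2023, as (date, value) pairs.
start = date(2021, 1, 1)
records = [(start + timedelta(days=i), i) for i in range(3 * 365)]

# Assumed cutoff: everything up to and including 2022 is training data.
last_train_year = 2022

# Split on the calendar year, so a year is never divided between the sets.
train = [r for r in records if r[0].year <= last_train_year]
test = [r for r in records if r[0].year > last_train_year]

print("train:", train[0][0], "to", train[-1][0])
print("test: ", test[0][0], "to", test[-1][0])
```

Because the condition is on the year itself rather than on a row count or a proportion, the boundary always falls exactly between two years.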