3

Wondering what shuffle does if I set it to True when splitting a dataset into train and test splits. Can I use shuffle on a dataset which is ordered by dates?

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=42, shuffle=True)

Example dataframe: [image]

Tim
Abhiram

2 Answers

3

With time-series data, where you can expect autocorrelation, you should not split the data randomly into train and test sets. Instead, split on time, so that you train on past values to predict future ones. Scikit-learn has the TimeSeriesSplit functionality for this.
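
As a minimal sketch of what TimeSeriesSplit does, assuming a toy array of 10 time-ordered observations (the data and the choice of `n_splits=3` are illustrative, not taken from the question):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 10 observations already sorted by time (row 0 is the oldest).
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_idx, test_idx in splits:
    # Every training index precedes every test index, so the model
    # is always fitted on the past and evaluated on the future.
    print("train:", train_idx, "test:", test_idx)
```

Each successive fold extends the training window forward in time and tests on the block that follows it.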

The shuffle parameter is needed to prevent non-random assignment to the train and test sets. With shuffle=True, you split the data randomly. For example, say that you have balanced binary classification data and it is ordered by labels. If you split it 80:20 into train and test without shuffling, your test data would contain only the labels from one class. Random shuffling prevents this.
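
The label-ordered case can be reproduced with a small sketch, assuming made-up balanced labels (50 zeros followed by 50 ones); the arrays here are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced labels, sorted by class: 50 zeros followed by 50 ones.
y = np.array([0] * 50 + [1] * 50)
X = np.arange(100).reshape(-1, 1)

# Without shuffling, the test set is simply the last 20% of the rows.
_, _, _, y_test_ordered = train_test_split(X, y, test_size=0.2, shuffle=False)

# With shuffling, rows are assigned to the test set at random.
_, _, _, y_test_shuffled = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

print("classes in unshuffled test set:", np.unique(y_test_ordered))
print("classes in shuffled test set:  ", np.unique(y_test_shuffled))
```

The unshuffled test set consists of the last 20 rows, which all belong to class 1.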

If random shuffling would break your data, this is a good argument for not splitting randomly into train and test. In such cases, you would use splits on time, or clustered splits (say you have data on education, so you assign whole schools to train or test, rather than individual students).
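
One way to sketch a clustered split is scikit-learn's GroupShuffleSplit, which keeps all rows of a group on the same side of the split. The school labels and sizes below are hypothetical, and this is only one possible approach:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 12 students spread across 4 schools.
school = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"])
X = np.arange(len(school)).reshape(-1, 1)

# test_size here is a fraction of the groups (schools), not of the rows.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=school))

train_schools = set(school[train_idx].tolist())
test_schools = set(school[test_idx].tolist())

# No school contributes students to both sets.
print("train schools:", sorted(train_schools))
print("test schools: ", sorted(test_schools))
```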

When should you use shuffle=False? TL;DR: never. The cases where you might consider it:

  • Your data was randomly sampled or was already shuffled. But shuffling one more time wouldn't hurt you. I remember seeing multiple datasets that were supposed to be randomly shuffled but weren't.
  • Your dataset is huge, so shuffling makes the whole pipeline a little bit slower. If that is the case, you probably don't want to use scikit-learn pipelines for preprocessing either. If you instead use something that scales better, you still need to make sure that it shuffles the data.
  • You don't want to split randomly and your data is already arranged in the way you want to split it. For example, you have data collected during the 2010-2020 period and you want to split it 80:20, with years 2010-2018 in the train set and 2019-2020 in the test set. Here it makes sense, but you would probably prefer to use the TimeSeriesSplit functionality instead, or write the code by hand to have greater control over what you are doing. For example, if you want to split by years, you probably don't want a few days of one year to land by accident in a different set than the rest of that year, so you would rather do the split manually.
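
A manual split by year, as described in the last point, might look like the following sketch. The column names `date` and `value` and the daily frequency are assumptions for illustration:

```python
import pandas as pd

# Hypothetical frame: one row per day over 2010-2020, with a placeholder value column.
dates = pd.date_range("2010-01-01", "2020-12-31", freq="D")
df = pd.DataFrame({"date": dates, "value": range(len(dates))})

# Split on the calendar year rather than on a row fraction,
# so no year is divided between the two sets.
is_train = df["date"].dt.year <= 2018
train, test = df[is_train], df[~is_train]

print(train["date"].min(), train["date"].max())
print(test["date"].min(), test["date"].max())
```

Splitting on an explicit year boundary avoids relying on the row order or on an 80:20 fraction happening to fall exactly at the end of a year.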
Tim
  • Here's [another way](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) to split on time. – J.G. Oct 31 '21 at 16:59
  • Thank you so much. Your answer made sense. What is the default value of the shuffle parameter if we don't specify it when splitting the dataset? – Abhiram Nov 01 '21 at 02:06
  • @Abhiram see edit. – Tim Nov 01 '21 at 07:42
  • @Tim - Great! From your edit, I think my dataframe falls into the 3rd category. Do you think it is OK to apply TimeSeriesSplit to a small dataframe of 841 rows, with each row representing the observations from one date? – Abhiram Nov 03 '21 at 06:02
  • @Abhiram why not? – Tim Nov 03 '21 at 06:14
1

Splitting a time-sorted DataFrame without shuffling (shuffle=False) would lead to a time-based bias where similar periods are in the same dataset. By setting shuffle=True you make sure that the data is mixed prior to the creation of the train/test datasets.

tdy
Kosmos