I would like to use k-fold cross-validation on my small dataset (118 observations) with a random forest model.

However, it is a time series of monthly data, running from October 2010 to July 2020.

What is the best way of cross-validating my data in this case?

Here's the head of my data where the Date is the index column:

          Per.Change Domestic.Production.from.UKCS Import Per.GDP.Growth Average.Temperature Price.Electricity Price.Gas
2010-10-01       2.08                          3.54   5.40            0.2               10.44             43.50     46.00
2010-11-01      -3.04                          3.46   6.74           -0.1                5.52             46.40     49.66
2010-12-01       0.31                          3.54   9.00           -0.9                0.63             58.03     62.26
2011-01-01       2.65                          3.59   7.58            0.6                4.05             48.43     55.98
2011-02-01       1.52                          3.20   5.68            0.4                6.29             46.47     53.74
2011-03-01      -1.38                          3.40   5.93            0.5                6.59             51.41     60.39
gunes
Joehat

1 Answer

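Use time series cross-validation, which respects the time ordering of the observations instead of shuffling them as ordinary k-fold does. This question has very good answers with visualisations. Basically, you train on an expanding window of past data and test on the period that immediately follows it, something like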

Fold 1: Training: [2010, 2011, 2012], Test: [2013]

Fold 2: Training: [2010, 2011, 2012, 2013], Test: [2014]

...

This way, every test period comes strictly after the data used for training, so the validation respects the time ordering and no information from the future leaks into the model.
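As a rough illustration of the fold scheme above, here is a minimal sketch in plain Python. The helper names `monthly_index` and `expanding_window_splits`, and the choice of 5 folds with 12-month test blocks, are assumptions made for the example; they do not come from the question. The sketch only produces the index splits. Fitting the random forest on each training block and scoring it on the matching test block is left out.

```python
from datetime import date


def monthly_index(start_year, start_month, n_months):
    """Return a list of first-of-month dates, n_months long (hypothetical helper)."""
    dates = []
    year, month = start_year, start_month
    for _ in range(n_months):
        dates.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return dates


def expanding_window_splits(n_samples, n_folds, test_size):
    """Yield (train_idx, test_idx) pairs where training always precedes testing.

    The test blocks are the last n_folds * test_size observations, taken in
    order; each training set is everything before its test block.
    """
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# 118 monthly observations, October 2010 to July 2020, as in the question.
dates = monthly_index(2010, 10, 118)

# Assumed settings for illustration: 5 folds, each tested on 12 months.
folds = list(expanding_window_splits(len(dates), n_folds=5, test_size=12))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(
        f"Fold {i}: train {dates[train_idx[0]]} .. {dates[train_idx[-1]]}, "
        f"test {dates[test_idx[0]]} .. {dates[test_idx[-1]]}"
    )
```

If you work with scikit-learn, `sklearn.model_selection.TimeSeriesSplit` provides the same kind of expanding-window splitter and can be passed as the `cv` argument to `cross_val_score`, so you would not normally need to write the splitter yourself.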

gunes