2

Say you have a single time series of a financial instrument. You want to classify when the market is in a bull vs. bear regime (this is just an example).

When splitting your data into a train and test set, if you split based on time period - i.e. everything before the year 2000 as the training set and everything after 2000 as the test set - won't there be significant leakage between train and test, resulting in biased accuracy estimates?

How would you properly account for this when working with a single time series?

  • 1
    check out [this](http://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection) question – user3494047 Oct 05 '16 at 14:32

2 Answers

2

No, there won't be. Leakage is when your test data is, in some form, also part of your train data. In your example this doesn't happen: you use data up to 2000 to train (and perhaps validate), and then test on data from 2000 onwards. Your learning scheme has never seen the test data, i.e. it could not have learned from the test data itself.
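For concreteness, a minimal sketch of such a time-based split in Python (the toy series, column name, and cut date are assumptions for illustration, not anything from the question):

```python
import numpy as np
import pandas as pd

# Toy single time series standing in for the instrument's data.
dates = pd.date_range("1990-01-01", "2010-12-31", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.normal(size=len(dates)).cumsum()}, index=dates)

# Time-based split: everything before 2000 trains, everything from 2000 on tests.
train = df.loc[:"1999-12-31"]
test = df.loc["2000-01-01":]

# A model fitted on `train` never sees `test`, so the split itself leaks nothing.
```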

With time series, leakage is trickier because you have to resample while respecting the time stamp of each observation, but the scheme you presented is mostly free from it. If that were not the case, every resampling technique based on time stamps (e.g. rolling-window resampling) would be invalid.
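As a sketch of what time-stamp-respecting resampling looks like (the toy data and the number of splits are assumptions), scikit-learn's rolling-origin splitter keeps every training window strictly before its test window:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))   # stand-in features for a single series

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index, so later observations
    # never inform earlier fits.
    print(train_idx[-1], "<", test_idx[0])
```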

A word of caution

Leakage might still appear under your framework (and any other, for that matter) if you pre-process the train and test data together. You can pre-process the data, but fit the pre-processing on the training set and then apply it to the test set; don't learn the pre-processing steps using test data.
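A minimal sketch of leak-free pre-processing, assuming a scikit-learn pipeline and a made-up single-feature series with a hypothetical bull/bear label:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy series: one feature ("ret") and a contrived bull/bear label.
rng = np.random.default_rng(0)
dates = pd.date_range("1995-01-01", periods=4000, freq="D")
df = pd.DataFrame({"ret": rng.normal(size=len(dates))}, index=dates)
df["label"] = (df["ret"].rolling(60, min_periods=1).mean() > 0).astype(int)

train, test = df.loc[:"1999-12-31"], df.loc["2000-01-01":]

# The scaler's mean/std are learned from the training rows only and then applied
# unchanged to the test rows, so no test statistics flow back into training.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(train[["ret"]], train["label"])
print(model.score(test[["ret"]], test["label"]))
```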

Firebug
1

If you are concerned about the first test cases coming just after the last training cases, you can leave a gap between them (a sketch of such a gapped split follows the two ideas below).

I'm not at all into time series classification, but here are two ideas:

  • Predict at time (training end) + $t$. The prediction accuracy should decay to the level of guessing over some period of time; leave a gap of at least that length between the training and test time series. (Do this for many "cut" times.)

  • Or: check whether earlier test times are consistently better predicted than later test times (again, do this for many "cut" times when splitting into train and test).
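A sketch of the gapped split itself; the cut date and the 250-day gap are placeholders, and in practice the gap length would come from the accuracy-decay check described above:

```python
import numpy as np
import pandas as pd

# Toy series standing in for the real data.
dates = pd.date_range("1990-01-01", "2010-12-31", freq="D")
df = pd.DataFrame({"x": np.random.default_rng(1).normal(size=len(dates))}, index=dates)

cut = pd.Timestamp("2000-01-01")
gap = pd.Timedelta(days=250)   # assumed horizon over which accuracy decays to guessing

train = df.loc[: cut - pd.Timedelta(days=1)]
test = df.loc[cut + gap:]      # drop the first `gap` days after the cut

# scikit-learn's rolling-origin CV expresses the same idea via its `gap` argument:
# TimeSeriesSplit(n_splits=5, gap=250)
```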

cbeleites unhappy with SX