How to implement cross-validation on my time series (leave one out and k-fold)?

Question

I am a software developer without formal training in time series. I have started reading Chatfield and Brockwell. I would like to reach out to professional statisticians in this field for insightful commentary so that I can avoid doing something wrong.

Problem

How can I apply leave one out and k-fold cross validation on my time series?

Details

Technically, I have 10 independent time series, one for each of 10 participants. For each series, we have the participant id, timestamp (data taken at one-second intervals), heart rate, GIS location, GIS zone (a GIS polygon of special interest for fatigue), and a binary variable indicating whether the user is fatigued. My goal is to use cross-validation while building a model to detect fatigue.

My data looks something like this:

participant id, timestamp, heartrate, lat, long, zone, fatigue
1, 10:30, 130, 70, 38, 39, 1, 0
1, 10:30, 130, 72, 38, 39, 1, 0
...
10, 10:30, 138, 72, 38, 39, 1, 0
...

where I can tell which time series I am in based on the participant.

Attempts

Let me divide my time series by participant id. I have [1,2,3,4,5,6,7,8,9,10], where 1 represents all the data I have for participant 1, i.e. my time series 1. We can consider the series independent of each other, so I can do something like:

Leave one out

1 Train: [2,3,4,5,6,7,8,9,10] Test: [1]
2 Train: [1,3,4,5,6,7,8,9,10] Test: [2]
3 Train: [1,2,4,5,6,7,8,9,10] Test: [3]
4 Train: [1,2,3,5,6,7,8,9,10] Test: [4]
5 Train: [1,2,3,4,6,7,8,9,10] Test: [5]
6 Train: [1,2,3,4,5,7,8,9,10] Test: [6]
7 Train: [1,2,3,4,5,6,8,9,10] Test: [7]
8 Train: [1,2,3,4,5,6,7,9,10] Test: [8]
9 Train: [1,2,3,4,5,6,7,8,10] Test: [9]
10 Train: [1,2,3,4,5,6,7,8,9] Test: [10]
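
The table above can also be generated rather than typed by hand, which avoids copy-and-paste slips. This is only a minimal sketch in plain Python: the function name `leave_one_participant_out` is made up for illustration, and it assumes the participants are labelled 1 to 10.

```python
def leave_one_participant_out(participant_ids):
    """Yield (train_ids, test_ids), holding out each participant exactly once."""
    ids = sorted(set(participant_ids))
    for held_out in ids:
        # Every other participant's full series goes into training.
        train = [p for p in ids if p != held_out]
        yield train, [held_out]


# Assumption: participants are labelled 1..10, as in the question.
participants = range(1, 11)
for fold, (train, test) in enumerate(leave_one_participant_out(participants), start=1):
    print(fold, "Train:", train, "Test:", test)
```

Each id stands for that participant's whole series, so no participant ever appears on both sides of a split.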

2-fold validation

I have really confused myself. I was thinking about this approach, but a colleague told me I had it all wrong because I was doing a "within time series" approach and needed to do an "across time series" approach.

I also checked out this, which I think is again for the "within" time series approach, because it takes one time series and divides it into m parts. I have 10 time series that supposedly observe the same or similar effect and are independent of each other. I am trying to detect

score 1 · Accepted Answer · edited Apr 13 '17 at 12:44

The biggest concern with this kind of thing would be having a data point from participant a at time x in the test set, and another point from participant a at time y in the training set, where y > x. In other words, predicting the past based on the future. In this case, even just having participant a's data split between the training and test sets could be problematic, since the model might overfit to some peculiarities of that participant.

Your leave-one-out scheme avoids both these problems though.

I'm assuming that your goal is to predict the presence of fatigue at a particular instant in time, without any context about what came before, in which case what you've described is great.

If the problem you're trying to solve involves seeing a sequence of observations and identifying the onset of fatigue, then the "forward chaining" procedure described in this answer is appropriate. You would still want to do the leave-one-out thing you described. But when evaluating performance on the held-out participant, you would feed your predictor each data point in order, and record its ability to predict the next one, given what it has seen so far.
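
To make the forward-chaining evaluation concrete, here is a rough sketch in plain Python. Everything named in it is an assumption for illustration: `forward_chaining_accuracy` and `persistence` are hypothetical helpers, the `(heart_rate, fatigued)` tuples are made-up toy values rather than real data, and the persistence rule merely stands in for a model that would really be trained on the other nine participants.

```python
def forward_chaining_accuracy(series, predict_next, min_history=1):
    """Score one held-out participant by walking forward through time.

    series: time-ordered list of (heart_rate, fatigued) observations.
    predict_next: callable that receives only the observations seen so far
        and returns a predicted fatigue label for the next time step.
    """
    correct = 0
    total = 0
    for t in range(min_history, len(series)):
        history = series[:t]      # the predictor never sees the future
        _, actual = series[t]
        if predict_next(history) == actual:
            correct += 1
        total += 1
    return correct / total if total else float("nan")


def persistence(history):
    """Hypothetical stand-in predictor: repeat the most recent label."""
    return history[-1][1]


# Toy, made-up series for a single held-out participant.
held_out = [(120, 0), (125, 0), (131, 0), (140, 1), (142, 1), (139, 1)]
print(forward_chaining_accuracy(held_out, persistence))  # 0.8
```

The outer loop over participants stays the same as in leave-one-out; only the scoring of the held-out series changes.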

answered Sep 29 '16 at 19:48

Coquelicot


How about k-fold cross-validation as well? For example, I could have train: [1,2,3,4,5,6,7,8] and test: [9,10], then train: [1,2,3,4,5,6,9,10] and test: [7,8], and so on. – hlyates Sep 30 '16 at 00:21
I think the forward chaining procedure is interesting, but I don't know what it buys me over k-fold like I sorta outlined above? Excited to hear your reply. I will be sure to mark it as the answer once we hash out the last few points. Thanks! – hlyates Sep 30 '16 at 00:22
There's very little difference between k-fold and leave-one-out in this case. The k-fold scheme you describe just results in fewer evaluations at the cost of having slightly less data to train on each time. Forward-chaining is only useful if the scenario you're trying to model involves prediction over sequences, rather than one-shot predictions. It's something to be used *with* leave-one-out/k-fold, not instead of. – Coquelicot Oct 03 '16 at 17:18
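
A possible sketch of the participant-level k-fold scheme discussed in these comments, in plain Python. The name `participant_k_fold` is made up for illustration, and splitting the sorted ids into contiguous blocks is an assumption; shuffling the participant ids first would be equally valid.

```python
def participant_k_fold(participant_ids, k):
    """Yield (train_ids, test_ids) with whole participants assigned to folds."""
    ids = sorted(set(participant_ids))
    if not 2 <= k <= len(ids):
        raise ValueError("k must be between 2 and the number of participants")
    base, extra = divmod(len(ids), k)
    start = 0
    for i in range(k):
        # The first `extra` folds take one additional participant.
        size = base + (1 if i < extra else 0)
        test = ids[start:start + size]
        train = ids[:start] + ids[start + size:]
        start += size
        yield train, test


for train, test in participant_k_fold(range(1, 11), k=5):
    print("Train:", train, "Test:", test)
```

With k equal to the number of participants this reduces to leave-one-out. If scikit-learn is available, its `GroupKFold` and `LeaveOneGroupOut` splitters implement the same idea when the participant id is passed as `groups`.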
