
I am trying to implement leave-one-out cross-validation (LOOCV) for my time-series LSTM model, but I am not sure how to go about it given the structure of my dataset.

My dataset consists of flights with IDs 1-279, each flown on one of five routes labelled R1-R5. The data for each flight is recorded sequentially, and each new flight ID marks the start of a new flight. The table below shows the layout.

flight  time  ...  route
1       0     ...  R1
1       0.2   ...  R1
1       ...   ...  R1
1       100   ...  R1
2       0     ...  R5
2       0.2   ...  R5
2       ...   ...  R5
2       120   ...  R5

Different flights share the same route; for example, flights 8, 10, 12, etc. all use R5.

What would be the best way to implement LOOCV? Should I train the LSTM on all flights, leaving out one flight ID at a time, or should the flights be grouped by the route they take?
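To make the two options concrete, here is a minimal sketch of leave-one-group-out splitting, where the group is either the flight ID or the route. Holding out whole flights (rather than individual rows) keeps each recorded sequence intact, so no time steps from the held-out flight end up in training. The helper name `leave_one_group_out`, the toy rows, and flight 3 are assumptions made for illustration and are not taken from the actual dataset. With 279 flights, the flight-level scheme implies 279 model fits, while the route-level scheme implies 5.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_label, train_idx, test_idx) once per distinct group label.

    `groups` holds one label per row (e.g. a flight ID or a route name).
    All rows sharing the held-out label form the test set; the rest train.
    """
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)

    for held_out in sorted(by_group):
        test_idx = by_group[held_out]
        train_idx = sorted(
            i
            for label, idxs in by_group.items()
            if label != held_out
            for i in idxs
        )
        yield held_out, train_idx, test_idx


# Hypothetical toy rows in (flight, time, route) form; flight 3 is invented
# here so that route R1 contains more than one flight.
rows = [
    (1, 0.0, "R1"), (1, 0.2, "R1"), (1, 100.0, "R1"),
    (2, 0.0, "R5"), (2, 0.2, "R5"), (2, 120.0, "R5"),
    (3, 0.0, "R1"), (3, 0.2, "R1"),
]
flight_ids = [r[0] for r in rows]
routes = [r[2] for r in rows]

# Option 1: hold out one flight at a time (one fold per flight ID).
flight_folds = list(leave_one_group_out(flight_ids))

# Option 2: hold out one route at a time (one fold per route).
route_folds = list(leave_one_group_out(routes))

for label, train_idx, test_idx in route_folds:
    # In a real run, the LSTM would be fit on the sequences in train_idx
    # and scored on the sequences in test_idx.
    print(label, "train rows:", train_idx, "test rows:", test_idx)
```

If scikit-learn is available, `LeaveOneGroupOut` and `GroupKFold` in `sklearn.model_selection` produce the same kind of group-aware splits; `GroupKFold` with a small number of folds is one way to cut the number of fits when leaving out every flight individually is too slow.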

  • How many samples do you have in your data? If you have a lot, LOOCV would take a long time; if you have very few, a neural network like an LSTM is not a great idea. – Tim Jul 16 '21 at 14:49
  • My dataset is large (255,000), hence the LSTM. I did read that LOOCV would take a long time, which is why I was wondering whether it would be more intuitive to separate the data by flight path (R1-R5). – Carlos Muli Jul 16 '21 at 16:32
  • You could probably split it by flight ID, time, or route; which of those makes more sense depends on the problem. To get a better answer you'd probably need to tell us more about your data and the problem you're trying to solve. – Tim Jul 16 '21 at 17:56
  • Sure thing. The model is trying to predict the energy consumption of a flight. I am trying to find the accuracy of my model per route (for now), but I am not sure whether I should use the entire dataset or limit it to the flight IDs on the same route. I am also trying to find the general accuracy of the model as a whole, which would mean predicting every flight ID against the entire dataset, and that would take a long time, as you said. – Carlos Muli Jul 16 '21 at 19:47
  • See https://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection and https://stats.stackexchange.com/questions/403574/cross-validation-for-time-series-classification – kjetil b halvorsen Jul 17 '21 at 15:10
  • Thanks @kjetilbhalvorsen, I appreciate it! – Carlos Muli Jul 21 '21 at 15:00
