I have been thinking about how to split the data into training, validation, and test sets for a stateless LSTM. To me, the intuitive way is to arrange the original data into the 3D form (batch_size, time_step, variable_number) and then randomly split it into 90%, 5%, and 5%. However, both the validation set and the test set then contain many data points that overlap with those in the training set, even though they belong to different sequences. I guess this is acceptable for the validation set, but not for the test set, because many of its data points have already been used to build the model. If so, should I randomly separate 5% of consecutive data points from the original data as the test set before arranging the data into the 3D form?

Note: @Tim♦ The question you pointed to is more about stateful LSTMs and cross-validation. I am asking about validation for a stateless LSTM.
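To make the alternative in the question concrete, here is a minimal sketch (NumPy only; the array shapes, the 5% ratio, and the helper `make_windows` are all hypothetical) of holding out a consecutive block as the test set *before* building the overlapping windows:

```python
import numpy as np

def make_windows(series, time_step):
    """Slide a window over a 2D series (n_points, n_vars) to get
    an array of shape (n_windows, time_step, n_vars)."""
    return np.stack([series[i:i + time_step]
                     for i in range(len(series) - time_step + 1)])

data = np.random.randn(10_000, 3)   # hypothetical (n_points, variable_number)
time_step = 50

# Cut out a consecutive 5% block at a random position as the test set.
n_test = int(0.05 * len(data))
start = np.random.randint(0, len(data) - n_test)
test_block = data[start:start + n_test]
X_test = make_windows(test_block, time_step)

# Window the remaining segments separately, so no training window spans the
# gap left by the removed test block and no test point appears in training.
remaining = [seg for seg in (data[:start], data[start + n_test:])
             if len(seg) >= time_step]
X_train_val = np.concatenate([make_windows(seg, time_step) for seg in remaining])
```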
- The linked question seems to answer yours. An LSTM models dependence in time, so obviously you cannot build the validation set randomly, because that would leak information from the future. This is bad for both the training and test sets. For validating time-series models we usually split the data on the time dimension, e.g. we take the 80% in the "past" as the training set and the 20% in the "future" as the test set. If the linked thread does not answer your question, please edit it to make clear what makes it distinct. – Tim Dec 17 '18 at 13:09
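For illustration, a minimal sketch of the time-ordered split described in the comment above (NumPy only; the array name, shape, and 80/20 ratio are just placeholders):

```python
import numpy as np

# Hypothetical multivariate series ordered in time: (n_points, variable_number)
data = np.random.randn(10_000, 3)

# Split on the time dimension: the earlier 80% is the "past" (training),
# the later 20% is the "future" (test).
cut = int(0.8 * len(data))
train_series = data[:cut]
test_series = data[cut:]
```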
- Please edit your question to explain what your model and your data are. Otherwise it is unclear what your setting is, and we cannot suggest an appropriate cross-validation split. – Tim Dec 17 '18 at 13:29
- @Tim♦ As far as I know, for a stateless LSTM the sequences are independent of each other. This means they can be shuffled and selected randomly. Keras has a 'stateful' argument for the LSTM layer; only when stateful=True are the sequences correlated in a sequential way. – Dong Dec 17 '18 at 15:20
- So if your sequences are *independent* of each other (are they?), what exactly is the problem? If they are not related in any way, you can just sample them randomly, as in the sketch below. – Tim Dec 17 '18 at 15:23
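For illustration, randomly sampling whole sequences in a 90/5/5 split (only sensible if the sequences really are independent; all names and shapes below are hypothetical):

```python
import numpy as np

# Hypothetical windowed data: (batch_size, time_step, variable_number)
sequences = np.random.randn(2_000, 50, 3)

rng = np.random.default_rng(0)
idx = rng.permutation(len(sequences))     # shuffle sequence indices
n_train = int(0.90 * len(sequences))
n_val = int(0.05 * len(sequences))

train_seqs = sequences[idx[:n_train]]
val_seqs = sequences[idx[n_train:n_train + n_val]]
test_seqs = sequences[idx[n_train + n_val:]]
```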
- @Tim♦ Actually, they are not independent of each other. Because I want to focus on the sequential relation within each sequence rather than the relation among sequences, I leave stateful=False and assume there is no relation among sequences. As a result, many data points appear in all the data sets (training, validation, testing), which I suspect makes the validation and test sets unreliable (especially the test set). On the other hand, the same points belong to different sequences in different sets. (continued below) – Dong Dec 17 '18 at 17:22
- Since it is the sequence, rather than the isolated data points, that matters in LSTM models, this approach may also be acceptable. I am just not sure about this. – Dong Dec 17 '18 at 17:26
- Unless you edit your question and give us more details on what your data and your model are, this is too ambiguous to be answerable. – Tim Dec 17 '18 at 17:46
- @Tim♦ That's fine; I think I have more or less figured out what I should do. The question is a little general, but I think I have specified every detail it needs. People who are familiar with stateless and stateful LSTMs should be able to give an answer, even if you mark it as a duplicate later. – Dong Dec 17 '18 at 18:18
- Stateless or not, this says nothing about your data. Without knowing more about your data this is unanswerable. Cross-validating time series is complicated and can easily be done wrong, leading to over-optimistic scores. – Tim Dec 17 '18 at 18:24