Cross Validation for longitudinal/panel data in scikit-learn

Question

I have some longitudinal/panel data that takes the form below (code for data entry is below the question). Observations of X and y are indexed by time and country (eg USA at time 1, USA at time 2, CAN at time 1).

     time  x   y
USA     1  5  10
USA     2  5  12
USA     3  6  13
CAN     1  2   2
CAN     2  2   3
CAN     3  4   5

I'm trying to predict y using sklearn. For a reproducible example, we could use, say, linear regression.

To perform CV, I can't use train_test_split, because the split might, for example, put data from time = 3 into X_train and data from time = 2 into X_test. That would be unhelpful: at time = 2, when we would be trying to predict y, we would not yet have the time = 3 data to train on.

I'm trying to use TimeSeriesSplit in order to achieve CV as shown in this image:

(source: Using k-fold cross-validation for time-series model selection)

from sklearn.model_selection import TimeSeriesSplit

y = df['y']
X = df.drop(columns=['y'])  # keyword form; newer pandas versions reject the positional axis argument
print(y)
print(X)

# Convert both to NumPy so the positional indices returned by split() can be used directly
X = X.to_numpy()
y = y.to_numpy()

tscv = TimeSeriesSplit(n_splits=2, max_train_size=3)
print(tscv)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

which gives close to what I need, but not quite:

TRAIN: [0 1] TEST: [2 3]
TRAIN: [1 2 3] TEST: [4 5]

How can I now use TimeSeriesSplit indices to cross-validate a model?

I believe a complication may be that my data isn't strictly time-series: it's not only indexed by time, but also by country, hence the longitudinal/panel nature of the data.
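
To make that complication concrete, here is a minimal sketch that maps the row positions returned by TimeSeriesSplit back to (country, time) labels. It assumes the rows are ordered by country and then by time, as in the table above; the `country` column and the `folds` list are names introduced only for this illustration.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Same toy panel as in the question, rows ordered by country and then time.
df = pd.DataFrame(
    {"country": ["USA", "USA", "USA", "CAN", "CAN", "CAN"],
     "time": [1, 2, 3, 1, 2, 3],
     "x": [5, 5, 6, 2, 2, 4],
     "y": [10, 12, 13, 2, 3, 5]}
)

def labels(positions):
    """Translate row positions into (country, time) pairs."""
    rows = df.iloc[positions]
    return list(zip(rows["country"].tolist(), rows["time"].tolist()))

tscv = TimeSeriesSplit(n_splits=2, max_train_size=3)
folds = []  # hypothetical container, kept so the folds can be inspected afterwards
for train_index, test_index in tscv.split(df):
    folds.append((labels(train_index), labels(test_index)))
    print("TRAIN:", folds[-1][0], "TEST:", folds[-1][1])
```

TimeSeriesSplit only sees row order, so with this ordering the second fold trains on USA at times 2 and 3 plus CAN at time 1 and tests on CAN at times 2 and 3. The folds follow row position, not the value of time.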

My desired output is:

A series of test and train indices that allow me to perform "walk forward" CV

eg

TRAIN: [1] TEST: [2]
TRAIN: [1 2] TEST: [3]

X_train, X_test, y_train and y_test split using the indices above, based on the value of time, or clarity as to whether I need to do that.
An accuracy score of any model (eg. linear regression) cross-validated using the "walk forward" CV method.
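
One way the three items above could fit together is sketched below. It is not a confirmed solution: `walk_forward_splits` is a hypothetical helper written for this example, the feature set (time and x) mirrors the question's X, and mean absolute error is assumed as the score because accuracy is a classification metric. The only scikit-learn feature relied on is that the `cv` argument of cross_val_score accepts an iterable of (train, test) index arrays.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.DataFrame(
    {"time": [1, 2, 3, 1, 2, 3],
     "x": [5, 5, 6, 2, 2, 4],
     "y": [10, 12, 13, 2, 3, 5]},
    index=["USA", "USA", "USA", "CAN", "CAN", "CAN"],
)

def walk_forward_splits(times):
    """Yield (train, test) row positions: train on all earlier periods, test on the next one."""
    times = np.asarray(times)
    periods = np.unique(times)  # sorted unique time values
    for period in periods[1:]:
        yield np.flatnonzero(times < period), np.flatnonzero(times == period)

X = df[["time", "x"]].to_numpy()
y = df["y"].to_numpy()
times = df["time"].to_numpy()

splits = list(walk_forward_splits(times))
for train_idx, test_idx in splits:
    # Report the folds by time value rather than by row position.
    print("TRAIN:", np.unique(times[train_idx]), "TEST:", np.unique(times[test_idx]))

# The scorer returns negated errors, so flip the sign to read them as MAE.
scores = cross_val_score(LinearRegression(), X, y, cv=splits,
                         scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)
```

Because the folds are built from the value of time, every country observed in a period lands on the same side of the split, regardless of how the rows are ordered.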

Edit: thank you to @sabacherli for answering the first part of my question, and fixing the errors that were being thrown up.

Code for Data Entry

import numpy as np
import pandas as pd

data = np.array([['country', 'time', 'x', 'y'],
                 ['USA', 1, 5, 10],
                 ['USA', 2, 5, 12],
                 ['USA', 3, 6, 13],
                 ['CAN', 1, 2, 2],
                 ['CAN', 2, 2, 3],
                 ['CAN', 3, 4, 5]])

# np.array stores every cell as a string here, so cast the value columns back to integers
df = pd.DataFrame(data=data[1:, 1:],
                  index=data[1:, 0],
                  columns=data[0, 1:]).astype(int)

df

Instead of using sklearn's split, use your own loop using time, and subset your data at each loop (can be optimised later). — gunes, Aug 20 '20 at 11:03
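
A rough sketch of the loop this comment describes, under a few assumptions: the frame has numeric time, x and y columns, the features are time and x, and mean squared error stands in for the score. The names `features` and `fold_mse` are made up for the example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.DataFrame(
    {"time": [1, 2, 3, 1, 2, 3],
     "x": [5, 5, 6, 2, 2, 4],
     "y": [10, 12, 13, 2, 3, 5]},
    index=["USA", "USA", "USA", "CAN", "CAN", "CAN"],
)

features = ["time", "x"]
fold_mse = {}  # test period -> out-of-sample MSE

# Skip the first period: there is nothing earlier to train on.
for t in sorted(df["time"].unique())[1:]:
    train = df[df["time"] < t]   # everything observed before period t
    test = df[df["time"] == t]   # the period being predicted
    model = LinearRegression().fit(train[features], train["y"])
    pred = model.predict(test[features])
    fold_mse[int(t)] = mean_squared_error(test["y"], pred)

print(fold_mse)
```

Subsetting with boolean masks keeps the country labels on the rows, which makes it easy to inspect the predictions for each country in every fold.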
