I am trying to do time series forecasting with machine learning. I want to engineer lag features, but I'm wondering about the best way to generate these features for the test set (or validation folds). I'm fairly sure I can't just use the test data to engineer these features - should I instead be using the predictions, and generating the lags iteratively?
For example, if I'm only using a 1-period lag, I would use the last entry in my training set as the lag for the first entry in my test set. I make my prediction, then use that as the 1-period lag for the second value in my test set, and so on. Is this the correct way to use lag features with machine learning? And is there a function in sklearn or another library that automates this process?
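To make the iterative approach concrete, here is a rough sketch of what I mean with a 1-period lag, using plain pandas and a basic sklearn model. All of the variable names and the toy data are just mine for illustration; I'm not assuming any particular library helper for building the lags.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy series: a trend plus noise, split chronologically into train/test.
y = pd.Series(np.arange(100, dtype=float) + np.random.normal(scale=2.0, size=100))
y_train, y_test = y[:80], y[80:]

# Train on (lag_1 -> y) pairs built only from the training data.
X_train = y_train.shift(1).dropna().to_frame(name="lag_1")
model = LinearRegression().fit(X_train, y_train[1:])

# Forecast the test period recursively: seed with the last training value,
# then feed each prediction back in as the lag for the next step.
preds = []
last_value = y_train.iloc[-1]
for _ in range(len(y_test)):
    next_pred = model.predict(pd.DataFrame({"lag_1": [last_value]}))[0]
    preds.append(next_pred)
    last_value = next_pred  # use the prediction, not the true test value
```

Is this loop the correct way to do it, or is there a standard function in sklearn or another library that handles this recursive forecasting for me?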