In much of the machine learning literature, the systems being modelled are instantaneous: inputs -> outputs, with no notion of impact from past values.
In some systems, inputs from previous time-steps are relevant, e.g. because the system has internal states/storage. For example, in a hydrological model, you have inputs (rain, sun, wind) and outputs (streamflow), but you also have surface and soil storage at various depths. In a physically-based model, you might represent those states as discrete buckets, with inflow, outflow, evaporation, leakage, etc., all governed by physical laws.
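For concreteness, here is a minimal sketch of what one such bucket update might look like; the parameter names and values are entirely hypothetical, not from any particular model:

```python
def step_bucket(storage, rain, pet, k_out=0.05, k_leak=0.01):
    """One time-step of a toy linear-reservoir water balance.
    `pet` is potential evapotranspiration; `k_out`/`k_leak` are
    hypothetical rate constants per time-step."""
    evaporation = min(storage, pet)          # can't evaporate more than is stored
    storage = storage + rain - evaporation   # add rainfall, remove evaporation
    outflow = k_out * storage                # outflow proportional to storage
    leakage = k_leak * storage               # leakage to a deeper store
    storage -= outflow + leakage
    return storage, outflow
```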
If you want to model streamflow in a purely empirical sense, e.g. with a neural network, you could just create an instantaneous model, and you'd get OK first-approximation results (and actually, in land surface modelling, you could easily do better than a physically-based model this way...). But you would be missing a lot of relevant information: streamflow is inherently lagged relative to rainfall, for instance.
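By "instantaneous model" I mean something like the following sketch, where inputs and outputs come from the same time-step (synthetic data stands in for the real forcing/streamflow series):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 5000
X = rng.random((n, 3))                        # stand-in (rain, sun, wind) at time t
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)  # toy streamflow response at the same t

# No memory, no storage: the model only ever sees the current time-step.
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000).fit(X, y)
```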
One way to get around this would be to include lagged variants of the input features: e.g. if your data is hourly, include rain over the last 2 days, rain over the last month, and so on. These inputs do improve model results in my experience, but it's basically a matter of experience and trial-and-error how you choose the appropriate lags. There is a huge array of possible lagged variables to include (straight lagged data, lagged averages, exponential moving windows, etc.; multiple variables, with interactions, and often with high covariances). I guess theoretically a grid search for the best model is possible, but this would be prohibitively expensive.
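To make the lagged-features idea concrete, here is a rough sketch with pandas; the window lengths are illustrative and the rainfall series is synthetic:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2020-01-01', periods=2000, freq='h')
rain = pd.Series(np.random.default_rng(0).random(2000), index=idx)

features = pd.DataFrame({
    'rain_now':      rain,                             # instantaneous input
    'rain_2d_sum':   rain.rolling(window=48).sum(),    # rain over the last 2 days
    'rain_30d_mean': rain.rolling(window=720).mean(),  # rain over the last month
    'rain_ewm':      rain.ewm(halflife=168).mean(),    # exponential window, 1-week half-life
})
```

Every extra window, variable, and interaction multiplies the candidate feature set, which is exactly the combinatorial problem I'm describing.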
I'm wondering a) if there is a reasonable, cheapish, and relatively objective way to select the best lags to include from the almost infinite choices, or b) if there is a better way of representing storage pools in a purely empirical machine-learning model.