I am studying the effect of weather on agricultural outputs. I have yield data from one farm over 5 years and a number of weather inputs (rainfall, temperature, soil moisture, etc.) for the entire period. Both yield and inputs are available at the daily level. There are clear seasonal trends that I can account for pretty well using factor variables for months and days of the week.
Since I plan to apply this model to new farms for which no historical yield data exist, I am loath to use a traditional ARIMA time series model. For a new farm I will have the past values of the weather inputs, but not the past values of the output on which an ARIMA model would need to be based.
I am operating under the assumption that some function of past weather inputs is predictive of current yields. For example, the correlation between yield and the rolling average of rainfall over the last n days increases with n up to roughly two months before it starts to decline.
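Here is roughly how I run that sweep; this is a minimal sketch assuming a daily DataFrame `df` with columns `yield` and `rainfall` (hypothetical names; substitute your own):

```python
import pandas as pd

def correlation_by_window(df, feature, target="yield", max_window=120):
    """Correlate the target with trailing moving averages of `feature`
    for window lengths of 1..max_window days."""
    results = {}
    for n in range(1, max_window + 1):
        # Trailing n-day moving average; require a full window so early
        # rows with insufficient history stay NaN and are ignored by corr().
        rolled = df[feature].rolling(window=n, min_periods=n).mean()
        results[n] = df[target].corr(rolled)
    return pd.Series(results, name=f"corr_{feature}")

corr = correlation_by_window(df, "rainfall")
best_n = corr.idxmax()  # window length with the highest correlation
```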
I am testing a number of different machine learning algorithms, namely standard OLS, Random Forest, and the Elastic Net, to make predictions. My main question is: what is the best way to determine the appropriate lag (or window length) for the input variables? Currently, for each feature I use the trailing moving average at the window length that has the highest correlation with the output.
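Concretely, my current pipeline looks something like the following (building on the sketch above; the feature names are placeholders, and I have omitted the month/day-of-week factor variables for brevity):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNetCV, LinearRegression

features = ["rainfall", "temperature", "soil_moisture"]  # hypothetical names

# For each weather input, keep the trailing moving average at the window
# that maximizes its correlation with yield (correlation_by_window is
# defined in the earlier sketch).
X = pd.DataFrame({
    f: df[f].rolling(correlation_by_window(df, f).idxmax()).mean()
    for f in features
})

# Drop the early rows that lack a full window of history.
data = pd.concat([X, df["yield"]], axis=1).dropna()
X, y = data[features], data["yield"]

models = {
    "ols": LinearRegression(),
    "enet": ElasticNetCV(cv=5),
    "rf": RandomForestRegressor(n_estimators=500, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
```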
Would you use a different feature structure for the Random Forest, since it doesn't have built-in linearity assumptions?
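To make that concrete, the alternative structure I have in mind for the forest is to hand it the raw daily lags and let the trees find the relevant horizon themselves, rather than pre-averaging. Something like this (same hypothetical column names; the 90-day horizon is an assumption based on the ~2-month pattern above):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

max_lag = 90  # assumed horizon; adjust to the correlation sweep above

# One column per daily lag of rainfall instead of a single moving average.
lagged = pd.DataFrame({
    f"rainfall_lag{k}": df["rainfall"].shift(k)
    for k in range(1, max_lag + 1)
})

data = pd.concat([lagged, df["yield"]], axis=1).dropna()
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(data.drop(columns="yield"), data["yield"])
```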