I am studying the effect of weather on agricultural outputs. I have yield data from one farm over 5 years and a number of weather inputs (rainfall, temperature, soil moisture, etc.) for the entire period. Both yield and inputs are available at the daily level. There are clear seasonal trends that I can account for pretty well using factor variables for months and days of the week.
Since I plan to apply this model to new farms for which no historical yield data exist, I am loath to use a traditional ARIMA time series model. For a new farm I will have the past values of the weather inputs, but not the past values of the output on which an ARIMA model would need to be based.
I am operating under the assumption that some function of past weather inputs is predictive of current yields. For example, the correlation between yield and the rolling average of rainfall over the last n days increases with n up to roughly two months before it starts to decline.
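Here is roughly how I run that sweep; this is a minimal sketch assuming a daily DataFrame `df` with columns `yield` and `rainfall` (hypothetical names; substitute your own):

```python
import pandas as pd

def correlation_by_window(df, feature, target="yield", max_window=120):
    """Correlate the target with trailing moving averages of `feature`
    for window lengths of 1..max_window days."""
    results = {}
    for n in range(1, max_window + 1):
        # Trailing n-day moving average; require a full window so early
        # rows with insufficient history stay NaN and are ignored by corr().
        rolled = df[feature].rolling(window=n, min_periods=n).mean()
        results[n] = df[target].corr(rolled)
    return pd.Series(results, name=f"corr_{feature}")

corr = correlation_by_window(df, "rainfall")
best_n = corr.idxmax()  # window length with the highest correlation
```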
I am testing a number of different machine learning algorithms, namely standard OLS, Random Forest, and the Elastic Net, to make predictions. My main question is: what is the best way to determine the appropriate lag (or window length) for the input variables? Currently, for each feature I use the trailing moving average at the window length that has the highest correlation with the output.
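Concretely, my current pipeline looks something like the following (building on the sketch above; the feature names are placeholders, and I have omitted the month/day-of-week factor variables for brevity):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNetCV, LinearRegression

features = ["rainfall", "temperature", "soil_moisture"]  # hypothetical names

# For each weather input, keep the trailing moving average at the window
# that maximizes its correlation with yield (correlation_by_window is
# defined in the earlier sketch).
X = pd.DataFrame({
    f: df[f].rolling(correlation_by_window(df, f).idxmax()).mean()
    for f in features
})

# Drop the early rows that lack a full window of history.
data = pd.concat([X, df["yield"]], axis=1).dropna()
X, y = data[features], data["yield"]

models = {
    "ols": LinearRegression(),
    "enet": ElasticNetCV(cv=5),
    "rf": RandomForestRegressor(n_estimators=500, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
```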
Would you use a different feature structure for the Random Forest, since it doesn't have built-in linearity assumptions?
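To make that concrete, the alternative structure I have in mind for the forest is to hand it the raw daily lags and let the trees find the relevant horizon themselves, rather than pre-averaging. Something like this (same hypothetical column names; the 90-day horizon is an assumption based on the ~2-month pattern above):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

max_lag = 90  # assumed horizon; adjust to the correlation sweep above

# One column per daily lag of rainfall instead of a single moving average.
lagged = pd.DataFrame({
    f"rainfall_lag{k}": df["rainfall"].shift(k)
    for k in range(1, max_lag + 1)
})

data = pd.concat([lagged, df["yield"]], axis=1).dropna()
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(data.drop(columns="yield"), data["yield"])
```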