How to perform autoregression analysis with higher-resolution explanatory data

Question

I have two time series, shown in the plot below. My response data is an annual total (brown, plotted in the middle of the year totalled); my explanatory data is a monthly summary index (black). Both show clear positive temporal autocorrelation, and there appears to be a clear correlation between the two variables.

In particular, we often see lower responses immediately after high index values, but high index values are often preceded by low index values. So there are two obvious hypotheses:

Low response values are an immediate effect of high index values
Low response values are a lagged effect of low index values

(There are obviously lots of other interesting questions that could be asked of this data too...)

I'm new to time series analysis, but my reading suggests that autoregressive models are likely the best way to approach this data. However, as far as I can see, these all assume that the explanatory and response data have the same temporal resolution.

Are there any modelling approaches that allow the explanatory data to have a higher temporal resolution than the response?

My two ideas so far are to either:

Compute annual summaries of the index data, e.g. annual max, annual min, annual mean
Use 12 different explanatory variables, one for each month of the index values

Neither of these seems really appropriate, but I'm a bit out of my depth here, so I would appreciate any knowledgeable advice. Methods that can be implemented in R would be ideal. Thanks!

Edit: Note the data shown above is actually an average of 50 replicates. I also have another dataset that has 70 years of data rather than 20 (but for fewer replicates) - I showed this just because it's easier to illustrate!

score 1 · Accepted Answer · answered Apr 21 '21 at 09:50

You only have 20 data points for your response. I would be very careful both about seeing "clear correlations" with the index (our mind has a way of seeing patterns where none exist) and about using complex methods here. Even a simple AR(1) model with a mean estimates two parameters from just 20 data points, which is borderline overfitting.

If you do insist on using an explanatory variable, you likely can't do much better than aggregating the index and using it either directly or as a lagged (or leading) indicator. If you test multiple possible models, be aware that you will very likely overfit in the model selection step, especially because holding out data to assess predictive power leaves you with even less data to fit on.
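As a minimal sketch of that aggregation step in base R: the monthly series below is simulated, and names such as index_monthly and annual are hypothetical stand-ins for the real data, assuming 20 complete calendar years of monthly values.

```r
# Simulated stand-in for the monthly index (assumption: 20 full years)
set.seed(1)
index_monthly <- ts(rnorm(20 * 12), start = c(2001, 1), frequency = 12)

# Collapse the monthly index to one value per calendar year
index_mean <- aggregate(index_monthly, nfrequency = 1, FUN = mean)
index_max  <- aggregate(index_monthly, nfrequency = 1, FUN = max)

# Previous year's mean as a lagged predictor;
# the first year has no predecessor, so it is NA
index_mean_lag1 <- c(NA, head(as.numeric(index_mean), -1))

# One row per year, ready to line up against the annual response
annual <- data.frame(
  year            = as.integer(time(index_mean)),
  index_mean      = as.numeric(index_mean),
  index_max       = as.numeric(index_max),
  index_mean_lag1 = index_mean_lag1
)
head(annual)
```

Each extra summary or lag is another candidate model, so it is probably best to pick one or two on subject-matter grounds and not to screen many of them against 20 observations.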

In R, take a look at forecast::auto.arima(), which fits a regression with ARIMA errors if you feed a predictor into the xreg parameter.
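A rough sketch of what that call could look like, assuming the forecast package is installed. The data are simulated purely for illustration (the effect size of -0.8 and the AR coefficient of 0.5 are made up), and index_annual stands in for an already aggregated annual index.

```r
library(forecast)

# Simulated annual data (assumption: 20 years, made-up effect sizes)
set.seed(42)
n_years <- 20
index_annual <- rnorm(n_years)
noise <- as.numeric(arima.sim(list(ar = 0.5), n = n_years))
response <- ts(10 - 0.8 * index_annual + noise, start = 2001)

# xreg must be a numeric matrix; the column name labels the coefficient
xreg <- cbind(index = index_annual)

# Regression on the index, with ARIMA errors chosen automatically
fit <- auto.arima(response, xreg = xreg)
summary(fit)
```

The coefficient labelled index is the estimated regression effect of the aggregated index, and the ARIMA part models the autocorrelation left in the errors. To use a lagged index instead, the year without a lagged value would need to be dropped from both response and xreg first.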

I would suggest that trying to get higher granularity data for your response would be time better spent than trying to build a more complex model on, again, just 20 data points. See here. Related: Best method for short time-series.

Thanks for your response. I actually oversimplified for the question, but didn't think about the implications. I actually have way more data - in this case the response I'm plotting there is the mean of ~50 different responses for different units we can probably consider to be replicates. I also have another dataset with 5 units going back 70 years. So hopefully lots of power, although not sure how to bring in replicates, it doesn't seem as straightforward as with regression. I guess that's another question... Thank you! Seems like aggregation of the explanatory variable is the way to go. — TJC, Apr 21 '21 at 10:08
