I fit a simple AR(1) model by OLS in my analysis:

ar.ols(df$y, order.max = 1)

However, I work with generations as my unit of analysis. Therefore, the first lag of y would be the observation of y at time t-30. How can I specify this in the AR(1) model in R?

Richard Hardy
R-User

1 Answer

If $y_t$ and $y_{t-1}$ are actually 30 observations apart, for AR(1) you can do the following:

lm( tail(df$y,-30) ~ head(df$y,-30) )

This assumes the observations are ordered from oldest to newest. If the newest observation comes first, swap head and tail. Note that this regression uses every time point, so it implies overlapping observations.
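To see that the two vectors line up as intended, here is a minimal sketch on toy data. The vectors `y` and `z`, the lag length `h`, and the coefficient `phi` are assumptions made up for illustration; they are not part of the original data.

```r
# Toy series (hypothetical): each value equals its own time index,
# so the alignment of the two vectors is easy to read off.
y <- 1:100
h <- 30                      # lag length: one generation = 30 observations

y_now <- tail(y, -h)         # drops the first 30 values: y[31], ..., y[100]
y_lag <- head(y, -h)         # drops the last 30 values:  y[1],  ..., y[70]

# Every pair is exactly 30 positions apart.
stopifnot(length(y_now) == length(y_lag), all(y_now - y_lag == h))

# The same regression on a simulated series with a known lag-30 coefficient.
set.seed(1)
n   <- 5000
phi <- 0.6                   # assumed true coefficient, for illustration only
e   <- rnorm(n)
z   <- numeric(n)
z[1:h] <- e[1:h]
for (t in (h + 1):n) z[t] <- phi * z[t - h] + e[t]

fit <- lm(tail(z, -h) ~ head(z, -h))
coef(fit)                    # the slope should be close to phi
```

The first check only confirms the indexing; the second shows that the regression picks up the lag-30 dependence when it is really there.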

For AR(2) you would do

lm( tail(df$y,-60) ~ tail(head(df$y,-30),-30) + head(df$y,-60) )
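For higher orders the nested head/tail calls get hard to read. The sketch below builds the lagged columns by explicit indexing instead. The helper name `fit_ar_step` and the toy series `y` are hypothetical, and the function assumes the data are ordered from oldest to newest.

```r
# Hypothetical helper: regress y_t on y_{t-h}, y_{t-2h}, ..., y_{t-p*h}.
# Assumes y is ordered from oldest to newest.
fit_ar_step <- function(y, p, h = 30) {
  n <- length(y)
  stopifnot(n > p * h)
  t_idx <- (p * h + 1):n     # time points for which all p lags are available
  lags <- matrix(NA_real_, nrow = length(t_idx), ncol = p,
                 dimnames = list(NULL, paste0("lag", (1:p) * h)))
  for (k in 1:p) lags[, k] <- y[t_idx - k * h]
  lm(y[t_idx] ~ lags)
}

# Toy data (made up for illustration).
set.seed(42)
y <- rnorm(500)

fit2 <- fit_ar_step(y, p = 2)

# Same coefficients as the head/tail version of the AR(2) regression.
ref <- lm(tail(y, -60) ~ tail(head(y, -30), -30) + head(y, -60))
all.equal(unname(coef(fit2)), unname(coef(ref)))
```

With `p = 1` the helper reproduces the AR(1) regression, so one function covers both cases.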

If you prefer to give up the extra estimation efficiency from overlapping observations in exchange for faster computation, you can use every 30th data point instead:

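n <- length(df$y)
index <- rev(seq(from = n, to = 1, by = -30)) # every 30th position, oldest first, ending at the newest
g <- df$y[index] # g contains every 30th observation of y, dropping the oldest few
ar.ols(g, order.max = 1, aic = FALSE) # for AR(1); aic = FALSE fixes the order instead of selecting it by AIC
ar.ols(g, order.max = 2, aic = FALSE) # for AR(2)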
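As a rough check of the subsampling idea, the sketch below picks every 30th position of a toy vector and then fits an AR(1) to a subsampled simulated series. The names `y`, `z`, `idx`, `h`, and `phi` are assumptions for illustration. The index is kept in chronological order here, and `aic = FALSE` is passed so that `ar.ols` fits exactly the requested order rather than choosing one by AIC.

```r
h <- 30

# Toy vector whose values equal their positions, to inspect the index.
y   <- 1:100
idx <- rev(seq(from = length(y), to = 1, by = -h))  # oldest first, newest last
g_toy <- y[idx]
g_toy                                  # 10 40 70 100

# Simulated series with a known lag-30 coefficient (assumed for illustration).
set.seed(2)
n   <- 6000
phi <- 0.6
e   <- rnorm(n)
z   <- numeric(n)
z[1:h] <- e[1:h]
for (t in (h + 1):n) z[t] <- phi * z[t - h] + e[t]

# Keep every 30th observation: 200 points that follow an ordinary AR(1).
g <- z[rev(seq(from = n, to = 1, by = -h))]
fit_sub <- ar.ols(g, order.max = 1, aic = FALSE)
phi_hat <- drop(fit_sub$ar)
phi_hat                                # should be in the neighbourhood of phi
```

Only 200 of the 6000 points enter the fit, which is the loss of precision mentioned above.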
Richard Hardy
  • If I misunderstood your setup, just let me know. Will update. – Richard Hardy Oct 21 '19 at 10:21
  • Thank you for your advice. Unfortunately, this does not work, because my dataset has about 100,000 observations, so I cannot regress the last 30 on the first 30 observations of y. I would like to set up an AR(1) and, in a second step, an AR(2) process by comparing the correlation between an indicator in one generation and that in the previous generation(s), where the generation length is assumed to be 30 years. – R-User Oct 21 '19 at 12:07
  • @R-User, note that my code does not regress the last 30 on the first 30. There is a minus sign in front. `tail(df$y,-30)` drops the first 30, while `head(df$y,-30)` drops the last 30 observations. I still do not understand the structure of your data: do you have yearly observations but are interested in generations (30 years)? Would looking at every 30th data point be what you are interested in? If so, my proposed code also works and will be slightly more efficient, as it uses overlapping observations rather than deleting 29 out of every 30 observations. But you could do that, too. – Richard Hardy Oct 21 '19 at 12:53
  • Thanks for the explanation. My data structure is as follows: I have yearly data consisting of moving averages over 30 years (e.g. the y of 1915 contains the average y for the generation 1900-1930, etc.). The aim is, for example, to regress the y of 1915 on that of 1885 (the latter being the average for the generation 1870-1900) in order to find the correlation between the two generations. – R-User Oct 21 '19 at 13:13
  • @R-User, then I think my code is just what you need. Alternatively, if you want to avoid overlapping observations and trade off a little bit of precision for computational efficiency, I will include code for that. – Richard Hardy Oct 21 '19 at 13:17
  • Thanks, I will then go with the overlapping observations. Two last related questions: 1) How can I include the second lag? 2) How can I make sure that the AR(1) process is calculated between every two generations? – R-User Oct 21 '19 at 13:43
  • @R-User, note that with overlapping observations, your standard errors will need to be adjusted as mentioned in [this post](https://stats.stackexchange.com/questions/432173/) (see $\text{LRVar}$). The point estimates do not need to be adjusted. See also the update to my answer. – Richard Hardy Oct 21 '19 at 13:45
  • I discovered that my approach using moving averages has a drawback: it creates time dependence in the data even where there is none. I therefore propose using sharp (non-overlapping) generation averages for this AR(1) calculation. In my opinion, there should be no problem if I reduce the dataset so that I only keep the data point 1915 (the average of the generation 1900-1930), the data point 1945 (the average of the generation 1930-1960), and so on. Is that correct? Can I then calculate an AR(1) process for group means (or generation means) with this reduced dataset? – R-User Oct 29 '19 at 11:58
  • @R-User, could you please post this on a separate thread, probably including a link to this thread? Such is the accepted practice on Cross Validated. (This way your question will have better visibility and you will have the opportunity to earn more votes from users who find the question interesting.) – Richard Hardy Oct 29 '19 at 12:07
  • I posted it on a separate thread and included the link to this thread: https://stats.stackexchange.com/q/433624/260678?sem=2 – R-User Oct 29 '19 at 12:27
  • @R-User, great, thank you! – Richard Hardy Oct 29 '19 at 12:29
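
The comment above about adjusting standard errors for overlapping observations can be illustrated with a hand-written Newey-West (Bartlett kernel) covariance estimator in base R. This is only a sketch: the helper name `nw_vcov`, the simulated data, and the bandwidth `L = h - 1 = 29` are assumptions, and the code applies neither prewhitening nor a small-sample correction. For real work, a tested implementation such as `NeweyWest()` from the sandwich package is probably the safer choice.

```r
# Hypothetical helper: Newey-West (Bartlett kernel) covariance matrix for an
# lm fit, with truncation lag L. No prewhitening, no small-sample correction.
nw_vcov <- function(fit, L) {
  X  <- model.matrix(fit)
  u  <- residuals(fit)
  n  <- nrow(X)
  Xu <- X * u                          # row t holds x_t * u_t
  S  <- crossprod(Xu)                  # lag-0 term
  for (j in seq_len(L)) {
    w <- 1 - j / (L + 1)               # Bartlett weight
    G <- crossprod(Xu[(j + 1):n, , drop = FALSE],
                   Xu[1:(n - j), , drop = FALSE])
    S <- S + w * (G + t(G))
  }
  bread <- solve(crossprod(X))
  bread %*% S %*% bread
}

# Toy data: 30-year moving averages of white noise, so there is no true
# dependence between one generation and the next.
set.seed(3)
h  <- 30
n  <- 3000
e  <- rnorm(n)
cs <- cumsum(e)
y  <- (cs[h:n] - c(0, cs[1:(n - h)])) / h   # trailing 30-period means

y_now <- tail(y, -h)
y_lag <- head(y, -h)
fit   <- lm(y_now ~ y_lag)

se_ols <- sqrt(diag(vcov(fit)))              # ignores serial correlation
se_hac <- sqrt(diag(nw_vcov(fit, L = h - 1)))
rbind(se_ols, se_hac)
```

In this setup the residuals are strongly autocorrelated by construction, so the HAC standard errors come out larger than the usual OLS ones, while the point estimates are unchanged.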