
I have some trouble understanding the forecasting/inference process of ARMA models.

From Hamilton (which I am reading now), we can obtain a forecast of $Y$ from any linear process, given the observed random variables $X$, using expectations as follows:

$\mathbf{a} = E(XX^T)^{-1}E(XY^T)$

This assumes that the forecast is the optimal linear forecast, i.e. a linear projection. The matrix $E(XX^T)$ is a matrix of autocovariances (since an ARMA process is stationary), and $E(XY^T)$ also consists of autocovariances.
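
To make the projection concrete, here is a rough sketch in R of what I have in mind: build $E(XX^T)$ and $E(XY)$ from sample autocovariances and forecast one step ahead. The simulated ARMA(1,1) series, the lag depth `m`, and all the variable names are just illustrative choices, not anything from Hamilton.

```r
# One-step-ahead forecast by linear projection on the last m observations,
# using only sample autocovariances (no ARMA coefficients anywhere).
set.seed(1)
y <- arima.sim(n = 500, model = list(ar = 0.6, ma = 0.3))   # example ARMA(1,1) series

m <- 10                                                     # how many lagged values form X
g <- acf(y, lag.max = m, type = "covariance", plot = FALSE)$acf[, 1, 1]   # gamma(0), ..., gamma(m)

Omega <- toeplitz(g[1:m])     # E(X X^T): autocovariances at lags 0, ..., m-1
c_xy  <- g[2:(m + 1)]         # E(X Y):   autocovariances at lags 1, ..., m
a     <- solve(Omega, c_xy)   # projection coefficients  a = E(X X^T)^{-1} E(X Y)

x_last <- rev(tail(as.numeric(y), m))   # the last m observations, most recent first
sum(a * x_last)                         # one-step-ahead forecast  a^T X
```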

1) We can estimate all of these autocovariances from the sample (the time series) and forecast future values using the above formula, simply substituting sample estimates for the second moments and inverting the matrix, and we are done. Why do we need to find the ARMA coefficients in order to forecast, as is done in all packages like `stats::arima`?
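
For comparison, the package route I am asking about looks roughly like this; the order `c(1, 0, 1)` and the forecast horizon are arbitrary choices for the example.

```r
# The usual route: estimate the ARMA coefficients by maximum likelihood first,
# then forecast from the fitted model.
set.seed(1)
y <- arima.sim(n = 500, model = list(ar = 0.6, ma = 0.3))

fit <- arima(y, order = c(1, 0, 1))   # stats::arima fits phi, theta (and an intercept) by ML
predict(fit, n.ahead = 5)$pred        # forecasts implied by the fitted coefficients
```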

2) I still cannot figure out the whole process. `stats::arima` uses the Kalman filter to compute the fundamental innovations and somehow computes the likelihood (I cannot decipher Hamilton's chapter 13 yet, since I have not read the prerequisite chapters). From the `stats::arima` documentation:

The exact likelihood is computed via a state-space representation of the ARIMA process, and the innovations and their variance found by a Kalman filter.

So, these innovations plus the observations should be enough to forecast the future values. Why do we even need the likelihood at all? From Gardner et al. (https://www.jstor.org/stable/2346910):

The prediction and updating are carried out by means of a set of recursive equations known as the "Kalman filter". The parameter $\sigma^2$ does not appear in the recursion.
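
To fix ideas about what "the innovations and their variance found by a Kalman filter" means, here is a bare-bones sketch for a zero-mean ARMA(1,1) written in one common state-space form. This is only my illustration, not the actual `stats::arima` internals (the function name, the particular state-space form, and the initialization are my assumptions); it just returns the one-step prediction errors $v_t$ and their variances $f_t$.

```r
# Kalman filter sketch for a zero-mean, stationary/invertible ARMA(1,1):
#   y_t = phi * y_{t-1} + e_t + theta * e_{t-1}
# state-space form:  alpha_t = Tt alpha_{t-1} + R e_t,   y_t = Z alpha_t
arma11_filter <- function(y, phi, theta) {
  Tt <- matrix(c(phi, 0, 1, 0), 2, 2)    # transition matrix [[phi, 1], [0, 0]]
  R  <- matrix(c(1, theta), 2, 1)        # how the shock e_t enters the state
  Z  <- matrix(c(1, 0), 1, 2)            # y_t is the first element of the state

  a <- matrix(0, 2, 1)                   # initial state mean (zero-mean process)
  # initial state covariance = stationary covariance, i.e. the solution of P = Tt P Tt' + R R'
  P <- matrix(solve(diag(4) - Tt %x% Tt, as.vector(R %*% t(R))), 2, 2)

  n <- length(y)
  v <- f <- numeric(n)                   # innovations and their variances (up to sigma^2)
  for (i in seq_len(n)) {
    a <- Tt %*% a                        # predict the state
    P <- Tt %*% P %*% t(Tt) + R %*% t(R)
    v[i] <- y[i] - as.numeric(Z %*% a)   # innovation = one-step prediction error
    f[i] <- as.numeric(Z %*% P %*% t(Z)) # its variance, up to the common factor sigma^2
    K <- P %*% t(Z) / f[i]               # Kalman gain
    a <- a + K * v[i]                    # update the state with the new observation
    P <- P - K %*% Z %*% P
  }
  list(innovations = v, variances = f)
}

set.seed(1)
y   <- as.numeric(arima.sim(n = 200, model = list(ar = 0.5, ma = 0.4)))
out <- arma11_filter(y, phi = 0.5, theta = 0.4)
head(out$innovations); head(out$variances)
```

Iterating only the prediction step past the end of the sample would give the forecasts, which is why the innovations plus the filter seem sufficient for forecasting once $\phi$ and $\theta$ are known.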

3) When the Kalman filter finds the innovations, it also finds their variances and computes the likelihood value. **But how are the ARMA parameters for maximum likelihood found?** The Kalman filter can compute the likelihood value, but how are the maximum-likelihood parameters $\theta$ and $\phi$ selected? As far as I know, some optimization method has to be used to find these parameters, such as gradient descent or Newton's method. From Gardner again:

The log-likelihood function may then be maximized with respect to $(\theta, \phi)$ by minimizing:

$\mathcal{L}(\theta, \phi) = n \log S(\theta, \phi) + \sum_{i = 1}^n \log f_i, \qquad f_i \propto \widehat{\mathrm{MSE}}_i$

So, the procedure described in the paper outputs the values of the errors, the likelihood, etc. But where does it get the optimal $\theta, \phi$?
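
If my understanding is right, the missing piece is an outer optimization loop: a generic optimizer repeatedly calls the Kalman filter to evaluate the criterion above for trial values of $(\phi, \theta)$ and keeps moving them until convergence. Below is a sketch of that loop using `optim` with Nelder-Mead, reusing the `arma11_filter` sketch from question 2; the data, the starting values `c(0, 0)`, and the crude stationarity guard are illustrative assumptions, not what `stats::arima` actually does internally.

```r
set.seed(1)
y <- as.numeric(arima.sim(n = 500, model = list(ar = 0.6, ma = 0.3)))

# Gardner-style concentrated criterion  n * log S(theta, phi) + sum(log f_i),
# evaluated by the Kalman filter sketch (arma11_filter) defined above.
criterion <- function(par, y) {
  if (max(abs(par)) >= 1) return(1e10)       # crude guard: stay in the stationary/invertible region
  kf <- arma11_filter(y, phi = par[1], theta = par[2])
  S  <- sum(kf$innovations^2 / kf$variances) # weighted sum of squares S(theta, phi)
  length(y) * log(S) + sum(log(kf$variances))
}

opt <- optim(c(0, 0), criterion, y = y, method = "Nelder-Mead")
opt$par                                                  # approximate ML estimates of (phi, theta)
arima(y, order = c(1, 0, 1), include.mean = FALSE)$coef  # stats::arima estimates, for comparison
```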

  • *Why do we even need likelihood at all?* Innovations and observations are not enough; the parameters of the model ($\phi, \theta$) need to be estimated. ARMA models with the same orders, e.g. ARMA(2,2), but with different values of the AR and MA coefficients do not exhibit the same features (period of cycles, persistence, ...). This is where the likelihood function comes into play; it measures the likelihood or plausibility of a set of parameter values given the observed data. So the goal is to choose those parameter values that maximise the likelihood function. – javlacalle Jan 04 '19 at 20:32
  • *But how are parameters of ARMA for maximum likelihood found?* The likelihood function is maximised with respect to the parameters of the model (the AR and MA coefficients). To do so, an optimisation algorithm and the Kalman filter can be combined as follows: the Kalman filter is used to compute the value of the likelihood function for a given set of parameter values; this value is passed as an argument to an optimisation algorithm, e.g. Newton's method. As a result, a set of parameter values that maximises the likelihood function is obtained and used to compute forecasts from the given model. – javlacalle Jan 04 '19 at 20:36
  • 1) I mean second moments that can be estimated from the data directly, and these are the quantities needed to forecast. Hamilton (chapter 4) computes them from the params $\theta, \phi$ of the ARMA, but why not estimate them from the data? 2) Thank you. That is what bugged me about it, and that is what I wanted to verify: it seems like internally it uses the Nelder-Mead optimization algorithm. It is just written so clumsily in a different section of the package that it is not clear at all. So, I guess it really uses the Kalman filter for the innovations, and then calls Nelder-Mead to choose the next set of parameters to try. – SWIM S. Jan 04 '19 at 20:55
  • So, it iterates KF and NM for a long time, and then forecasts on the last iteration: Nelder-Mead -> Kalman filter -> final predictions. This is what I have got so far. – SWIM S. Jan 04 '19 at 20:58
  • *But why not estimate [second moments] from data?* Calculating the sample autocovariances from the data is a natural alternative. In fact, the theory of ARMA models establishes a mapping between the theoretical autocovariances and the parameters of the model. This approach is known as the **method of moments**. For AR models it amounts to the [Yule-Walker system of equations](https://en.wikipedia.org/wiki/Autoregressive_model#Yule%E2%80%93Walker_equations). When MA is present, the system of equations is less simple to solve. – javlacalle Jan 04 '19 at 22:41
  • Maximum likelihood in general has better properties than the method of moments ([related post](https://stats.stackexchange.com/questions/252936/)). The method of moments is nonetheless a helpful procedure to get initial parameter estimates from which to start the search for the ML estimates. – javlacalle Jan 04 '19 at 22:42
  • Once the algorithm that combines the Kalman filter and the Newton method converges to a set of parameter values, the data can be extrapolated by means of the Kalman filter (maybe this is what you saw, related to your last comment). In practice, it may be better to use the Kalman smoother to obtain forecasts (it employs information from the whole data set). – javlacalle Jan 04 '19 at 22:44
  • I do not think it is the method of moments I mean, but it is related. Hamilton suggests: the best linear forecast is the one where $E[(Y_t - \mathbf{a}X)X^T] = 0$, where $\mathbf{a}$ can be found as I wrote in the question. Then we can take the matrix $\Omega = E(XX^T)$ and decompose it into $ADA'$ (triangular factorization), from which we can get the perpendiculars to the projections, and then the projections (forecasts) themselves. Thus this requires only an estimate of the matrix $\Omega$. Hamilton gives an example (p. 95) where he calculates $\Omega$ from the parameter $\theta$ of an MA(1), but I don't get why, when he could calculate $E(XX^T)$ and $E[YX^T]$ with sample covariances. – SWIM S. Jan 05 '19 at 12:32
  • Thus, we can forecast without even knowing whether it was MA, AR, or ARMA, and our forecast will still optimally correspond to the underlying model. We don't even need to know the number of terms in the ARMA; we just need to make a sufficiently large matrix $\Omega$. I'm just guessing here, following my words in the comment above ^. Is it also a method of moments? I got from your explanation that MM estimates the params, but I wonder whether forecasting without any params is actually possible there ^ – SWIM S. Jan 05 '19 at 12:36
  • I haven't seen this approach to obtaining forecasts before, so I cannot say much about this part of your question. Just some thoughts: 1) forecasting without parameters seems interesting in this approach, but isn't $\mathbf{a}$ a set of parameters anyway? 2) If I am correct, $Y$ is the observed time series and $X$ are explanatory variables (e.g., lags of $Y$ and lags of the disturbance term $\varepsilon$). If this is correct, it may not be straightforward to calculate $E(XX^T)$ because the $\varepsilon_t$ are not observed (unlike the lags of $Y$). – javlacalle Jan 05 '19 at 14:07
  • Yep, $\mathbf{a}$ is a set of parameters. The most interesting part: $X$ is also observed $Y$s. The forecast does not use the errors; if we want a forecast of $Y$, we find the covariances between the lagged $Y$s (which are called $X$) and the forecasted $Y$, and form the matrix $\Omega$. This is possible because even though we do not know the future $Y$, we know the sample covariances (under the stationarity assumption), so we can compute the forecast. I am confused because Hamilton does not use sample covariances in his example. Instead, he computes the params of the ARMA first, then forms the covariance matrix from those params, and only then uses the formula. – SWIM S. Jan 05 '19 at 15:07
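
Regarding the Yule-Walker / method-of-moments mapping mentioned in the comments above, here is a minimal sketch for an AR(2); the order, the coefficients, and the sample size are arbitrary choices for illustration, and the built-in `ar` call is shown only for comparison.

```r
# Method of moments for an AR(2): map sample autocovariances directly to AR coefficients.
set.seed(1)
y <- as.numeric(arima.sim(n = 1000, model = list(ar = c(0.5, 0.3))))

g <- acf(y, lag.max = 2, type = "covariance", plot = FALSE)$acf[, 1, 1]   # gamma(0), gamma(1), gamma(2)
solve(toeplitz(g[1:2]), g[2:3])                                # Yule-Walker estimates of (phi_1, phi_2)

ar(y, order.max = 2, aic = FALSE, method = "yule-walker")$ar   # built-in Yule-Walker, for comparison
```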
