When initializing an ARMA simulation, say ARMA(2,2), the first data point $X_1$ can be created by drawing from the unconditional distribution $f(\mu,\gamma(0))$, where $\mu=\mathbb{E}[X]$ and $\gamma(0) =\mathbb{V}[X]$. However, the subsequent value $X_2$ depends on this first value $X_1$, so we can't use the same unconditional distribution.
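(For concreteness, a minimal sketch of that first draw in Python, assuming Gaussian innovations and illustrative parameter values; `arma_acovf` from statsmodels supplies $\gamma(0)$.)

```python
import numpy as np
from statsmodels.tsa.arima_process import arma_acovf

# Illustrative ARMA(2,2) parameters (not from any particular dataset)
phi = np.array([0.5, 0.3])      # AR coefficients phi_1, phi_2
theta = np.array([0.4, 0.2])    # MA coefficients theta_1, theta_2
mu, sigma2 = 0.0, 1.0

# gamma(0) = unconditional variance; statsmodels uses lag-polynomial
# conventions: ar = [1, -phi_1, -phi_2], ma = [1, theta_1, theta_2]
gamma0 = arma_acovf(np.r_[1, -phi], np.r_[1, theta], nobs=1, sigma2=sigma2)[0]

rng = np.random.default_rng(0)
x1 = rng.normal(mu, np.sqrt(gamma0))  # X_1 ~ f(mu, gamma(0)), taken Gaussian
```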
So, I am trying to derive the conditional mean and variance for the first $k<p$ points, but I seem to be going round in circles. I began with the one-step conditional mean, i.e.
$$\mathbb{E}[X_t|X_{t-1}=x_{t-1}] = \phi_1 x_{t-1} + \phi_2 \mathbb{E}[X_{t-2}|X_{t-1}] + \mathbb{E}[\epsilon_t|X_{t-1}] +\theta_1\mathbb{E}[\epsilon_{t-1}|X_{t-1}] +\theta_2\mathbb{E}[\epsilon_{t-2}|X_{t-1}]$$
I think we can safely say that $\mathbb{E}[\epsilon_t|X_{t-1}] = 0$, i.e. future innovations are independent of past observations. However, I don't think we can say the same about the remaining conditional expectations. So, I tried to get a conditional expectation of the contemporaneous error given the observation by re-arranging the ARMA equation for $X_{t-1}$, i.e.
$$\mathbb{E}[\epsilon_{t-1}|X_{t-1}] = \mathbb{E}[X_{t-1} - \phi_1 X_{t-2} - \phi_2 X_{t-3} - \theta_1 \epsilon_{t-2} - \theta_2 \epsilon_{t-3}|X_{t-1}] = x_{t-1} - \phi_1\mathbb{E}[X_{t-2}|X_{t-1}] - \phi_2\mathbb{E}[ X_{t-3}|X_{t-1}] - \theta_1\mathbb{E}[\epsilon_{t-2}|X_{t-1}] - \theta_2\mathbb{E}[\epsilon_{t-3}|X_{t-1}],$$ but this just seems to introduce more terms going back in time.
I guess one way of starting the simulation off is to generate the first $p$ values from the joint unconditional multivariate distribution, with the same mean $\mu$ for all $p$ components and with covariance matrix $(\mathbf{\Gamma})_{ij} = \gamma(|i-j|)$. But it seems to me that once one of the initial values has been generated from the unconditional univariate distribution, the distribution of the other $p-1$ values changes to a conditional one, and I can't figure out how.
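(A sketch of that joint start, again with illustrative parameters; `arma_acovf` gives the needed autocovariances $\gamma(0),\dots,\gamma(p-1)$.)

```python
import numpy as np
from statsmodels.tsa.arima_process import arma_acovf

phi = np.array([0.5, 0.3])    # illustrative AR(2) part
theta = np.array([0.4, 0.2])  # illustrative MA(2) part
mu, sigma2, p = 0.0, 1.0, 2

# Autocovariances gamma(0), ..., gamma(p-1)
gamma = arma_acovf(np.r_[1, -phi], np.r_[1, theta], nobs=p, sigma2=sigma2)

# Covariance matrix (Gamma)_{ij} = gamma(|i-j|)
idx = np.arange(p)
Gamma = gamma[np.abs(idx[:, None] - idx[None, :])]

rng = np.random.default_rng(1)
x_init = rng.multivariate_normal(np.full(p, mu), Gamma)  # joint draw of X_1..X_p
```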
Also, when estimating model parameters from data using a full MLE, what distribution is used for the first $p$ observations? Is it the multivariate unconditional one? Or a univariate unconditional for the first observation and then the "partial conditionals" for the rest? The latter are the ones I am looking for, I guess. Using the unconditional multivariate seems to me to lose some information, but maybe this is not so.
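(For comparison, a minimal sketch of an exact-MLE fit with statsmodels; the data here are a stand-in. As far as I know, its `ARIMA` evaluates the exact Gaussian likelihood via a state-space representation with stationary initialization, i.e. the unconditional multivariate distribution handles the start of the sample.)

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Stand-in data; in practice this would be the observed series
rng = np.random.default_rng(2)
y = rng.normal(size=500)

# ARMA(2,2) as ARIMA(2,0,2); the exact likelihood is computed through the
# Kalman filter, so no observations are conditioned away
res = ARIMA(y, order=(2, 0, 2)).fit()
print(res.params)
```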
Addendum 1
Maybe we can just use the unconditional mean and variance for the "unobserved values" at times $t<1$, e.g.
$$\mathbb{E}[X_0|X_1=x_1] = \mathbb{E}[X_0] = \mu$$
and
$$\mathbb{V}[X_0|X_1=x_1] = \mathbb{V}[X_0]= \gamma(0),$$
and then, for $X_2$,
$$ \mathbb{E}[X_2|X_1=x_1] = \phi_1 x_1 + \mu\sum_{i=2}^{p}\phi_i $$
and
$$ \mathbb{V}[X_2|X_1=x_1] = \mathbb{V}\left[\phi_1 x_1 + \sum_{i=2}^{p} \phi_{i} X_{2-i} + \epsilon_{2} + \sum_{i=1}^{q} \theta_{i} \epsilon_{2-i}\right] \\ = \sum_{i=2}^{p} {\sum_{j=2}^{p} \phi_{i} \phi_{j} \gamma(|i-j|)} + \sigma^2 + \sigma^2 \sum_{i=1}^{q} \theta_i^2 + 2\sum_{i=2}^{p}{\sum_{j=i}^{q} \phi_{i} \theta_{j} \delta_{j-i}} $$
where $\delta_{j-i} = \mathbb{E}[\epsilon_{t-j} X_{t-i}] = \mathbb{Cov}[\epsilon_{t-j}, X_{t-i}]$, which is equal to 0 for $j < i$ (an innovation is uncorrelated with observations that precede it); for $j \ge i$ we have $\delta_{j-i} = \sigma^2 \psi_{j-i}$, where $\psi_k$ are the weights of the causal MA($\infty$) representation.
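(A sketch of these Addendum 1 moments in code, with illustrative parameters; it uses $\delta_k = \sigma^2\psi_k$, taking the $\psi_k$ from statsmodels' `arma2ma`.)

```python
import numpy as np
from statsmodels.tsa.arima_process import arma_acovf, arma2ma

# Illustrative ARMA(2,2) parameters
phi = np.array([0.5, 0.3])     # phi_1..phi_p
theta = np.array([0.4, 0.2])   # theta_1..theta_q
mu, sigma2 = 0.0, 1.0
p, q = len(phi), len(theta)

ar, ma = np.r_[1, -phi], np.r_[1, theta]
gamma = arma_acovf(ar, ma, nobs=p, sigma2=sigma2)  # gamma(0)..gamma(p-1)
psi = arma2ma(ar, ma, lags=q + 1)                  # psi_0..psi_q
delta = sigma2 * psi                               # delta_k = sigma^2 * psi_k

def cond_moments_x2(x1):
    """Mean and variance of X_2 | X_1 = x1 under the Addendum 1 approximation."""
    mean = phi[0] * x1 + mu * phi[1:].sum()
    var = sigma2 * (1.0 + (theta ** 2).sum())      # epsilon_2 plus MA terms
    for i in range(2, p + 1):
        for j in range(2, p + 1):                  # AR-AR covariances
            var += phi[i - 1] * phi[j - 1] * gamma[abs(i - j)]
        for j in range(i, q + 1):                  # AR-MA cross terms (j >= i)
            var += 2.0 * phi[i - 1] * theta[j - 1] * delta[j - i]
    return mean, var

print(cond_moments_x2(1.0))
```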
This seems to give the correct answer when all $p$ previous observations are available, but it still doesn't sit right with me. For example, if the process is very persistent and $x_1$ is far away from the long-term mean $\mu$, then the expression for the conditional expectation of $X_2$ above seems to give far too much weight to the long-term mean, whereas for a persistent process we would expect neighbouring observations to be close to each other (assuming the innovation variance is comparatively small).
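(To make the concern concrete with assumed numbers: a persistent AR(2).)

```python
import numpy as np

# Persistent AR(2), illustrative values (phi_1 + phi_2 = 0.95, stationary)
phi = np.array([0.1, 0.85])
mu, x1 = 0.0, 5.0          # x1 far from the long-term mean

cond_mean = phi[0] * x1 + mu * phi[1:].sum()
print(cond_mean)           # 0.5 -- dragged almost all the way back to mu,
                           # even though neighbouring values should be near 5
```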
Addendum 2
If we work with the simplest AR(1) process, then $$ X_{t-1} = \frac{1}{\phi_1}(X_t - \epsilon_t) $$ and, assuming $\epsilon_t \sim \mathcal{N}(0,\sigma^2)$, $$ X_{t-1} \mid X_t = x_t \ \sim \ \mathcal{N} \left(\frac{1}{\phi_1} x_t,\ \frac{\sigma^2}{\phi_1^2} \right). $$ But we also have, for a bivariate normal, $$ X_1\mid X_2=a \ \sim \ \mathcal{N}\left(\mu_1+\frac{\sigma_1}{\sigma_2}\rho( a - \mu_2),\, (1-\rho^2)\sigma_1^2\right), $$ which for our AR(1), with $\mu_1 = \mu_2 = 0$, $\sigma_1^2 = \sigma_2^2 = \sigma^2/(1 - \phi_1^2)$ and $\rho = \phi_1$, gives $$ X_{t-1} \mid X_t = x_t \ \sim \ \mathcal{N}\left(\phi_1 x_t,\ \sigma^2 \right). $$ This doesn't agree with the first result above, but it would be the correct distribution for $X_{t} \mid X_{t-1}$. What is going on here?
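(A quick Monte Carlo sanity check, with illustrative $\phi_1 = 0.8$ and $\sigma = 1$, to see which of the two distributions matches a simulation.)

```python
import numpy as np

phi1, sigma = 0.8, 1.0
rng = np.random.default_rng(3)

# Simulate a long stationary Gaussian AR(1)
n = 1_000_000
eps = rng.normal(0.0, sigma, n)
x = np.empty(n)
x[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - phi1 ** 2))  # stationary start
for t in range(1, n):
    x[t] = phi1 * x[t - 1] + eps[t]

# Condition on X_t falling near a and inspect the preceding value X_{t-1}
a = 1.0
mask = np.abs(x[1:] - a) < 0.01
prev = x[:-1][mask]
print(prev.mean(), prev.var())
# Bivariate-normal answer:   mean phi1 * a = 0.8,  variance sigma^2 = 1.0
# First derivation's answer: mean a / phi1 = 1.25, variance sigma^2/phi1^2 = 1.5625
```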