
We have the following state-space model (or linear dynamical model):

\begin{align} x_t&\sim N(Ax_{t-1},Q)\\ y_t&\sim N(Bx_{t},\Sigma) \end{align}

I want to obtain a sample from $p(y_{T+1}\mid y_{1:T})$.

$$p(y_{T+1}\mid y_{1:T})=\int p(y_{T+1}\mid y_{1:T},\theta,x_{T+1},x_{1:T})\cdot p(\theta,x_{T+1},x_{1:T}\mid y_{1:T}) \ dx_{T+1} \ dx_{1:T} \ d\theta$$

I think we have $$p(\theta,x_{T+1},x_{1:T}\mid y_{1:T})=p(x_{T+1}\mid x_{1:T}, \theta, y_{1:T})\cdot p(x_{1:T}, \theta \mid y_{1:T})$$ and $$p(y_{T+1}\mid y_{1:T},\theta,x_{T+1},x_{1:T})=N(y_{T+1}\mid Bx_{T+1},\Sigma)$$

Let us assume that we can draw from $p(x_{1:T}, \theta \mid y_{1:T})$.

I think we can draw from $p(x_{T+1}\mid x_{1:T}, \theta, y_{1:T})$ by $$x_{T+1}\sim N(A[i]x_{T}[i],Q[i])$$ with the quantities indexed by $i$ drawn from $p(x_{1:T}, \theta \mid y_{1:T})$.

So, I'm thinking of doing

\begin{align} p(y_{T+1}\mid y_{1:T})&\approx \frac{1}{N}\sum_{i=1}^N N(y_{T+1}\mid B[i]x_{T+1}[i],\Sigma[i])\\ \end{align}

and sample from this mixture of normals.
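For concreteness, here is a minimal sketch (Python, univariate case) of that last sampling step; `means` and `sigmas` are placeholders for the $B[i]x_{T+1}[i]$ and $\sqrt{\Sigma[i]}$ I would compute from my posterior draws:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000

# Placeholders: means[i] = B[i] * x_{T+1}[i] and sigmas[i] = sqrt(Sigma[i]),
# computed from draws of p(x_{1:T}, theta | y_{1:T}) propagated one step ahead.
means = rng.normal(0.0, 1.0, N)
sigmas = rng.uniform(0.5, 1.5, N)

# Sample from the equally weighted mixture (1/N) sum_i N(y | means[i], sigmas[i]^2):
# pick a component uniformly at random, then draw from that component.
i = rng.integers(N)
y_next = rng.normal(means[i], sigmas[i])
```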

Is this correct?

An old man in the sea.

2 Answers


Yes, provided you draw one $x_{T+1}$ for every joint posterior sample of $x_{1:T},\theta$.

To be more precise, you are correct if you carry out the following steps for $i=1,2,\ldots,N$:

1. Draw $\theta[i], x_{1:T}[i] \sim p(\theta, x_{1:T} \mid y_{1:T})$ (probably with a Gibbs sampler), then

2. Draw $x_{T+1}[i] \sim N(A[i]x_{T}[i],Q[i])$ (from the model's Markov transition), then

3. (a) Be ready to evaluate the function (in $y_{T+1}$) $\frac{1}{N}\sum_{i=1}^N N(y_{T+1}\mid B[i]x_{T+1}[i],\Sigma[i])$, OR

3. (b) Draw $y_{T+1}[i] \sim N(B[i]x_{T+1}[i],\Sigma[i])$.

Using 3b instead of 3a and ignoring everything except the collection of $y_{T+1}[i]$s gives you a sample from the predictive distribution you want. You can make a histogram, take an average, whatever you like. If that is your goal, read below for a better way to do it using Rao-Blackwellization/marginalization.
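Here is a minimal sketch of steps (1)-(3) in Python for the univariate case, with placeholder arrays standing in for the posterior draws you would actually obtain in step (1); the variable names are mine, not anything standard:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N = 1000

# Placeholder posterior draws of (A, B, Q, Sigma, x_T); in practice these come
# from step (1), e.g. a Gibbs sampler targeting p(theta, x_{1:T} | y_{1:T}).
A_s, B_s = rng.normal(0.9, 0.05, N), rng.normal(1.0, 0.05, N)
Q_s, Sig_s = rng.uniform(0.5, 1.5, N), rng.uniform(0.5, 1.5, N)
xT_s = rng.normal(0.0, 1.0, N)

# Step (2): propagate each draw through the Markov transition,
# x_{T+1}[i] ~ N(A[i] x_T[i], Q[i]).
xT1_s = rng.normal(A_s * xT_s, np.sqrt(Q_s))

# Step 3a: evaluate the mixture density at a chosen point y (pointwise estimate).
y = 0.5
density_at_y = np.mean(norm.pdf(y, loc=B_s * xT1_s, scale=np.sqrt(Sig_s)))

# Step 3b: draw one y_{T+1}[i] per posterior draw; the collection yT1_s is a
# sample from p(y_{T+1} | y_{1:T}) -- histogram it, average it, etc.
yT1_s = rng.normal(B_s * xT1_s, np.sqrt(Sig_s))
```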

However, the last equation in your question appears to be going with 3a instead of 3b. You aren't sampling future $y_{T+1}$s; rather, you are picking a specific $y_{T+1}$ and estimating the height of the density at that point with Monte Carlo.

If you want to approximate moments of this distribution using this same Monte Carlo approach, you invoke the law of total expectation or law of total variance, and use Monte Carlo on each raw/uncentered moment. Right now you're just approximating the density pointwise, albeit with common random numbers.

Also notice that this is difficult to do in real time, because obtaining samples from the joint posterior in step (1) is slow, and gets slower as $T$ grows.

Edit:

Seems like I should also mention Rao-Blackwellization/marginalization here, too. Say you were interested in the mean of $p(y_{T+1} \mid y_{1:T})$, and not points on the density. Then you could use the fact that $$ E[y_{T+1} \mid y_{1:T}] = E\left[ E\left( y_{T+1} \mid X_T,\theta \right) \mid y_{1:T} \right]. $$ This would go a long way to reducing variance, which translates into faster computation because you wouldn't need to sample for as many iterations.

In your particular case, the inner expectation is $$ E\left( y_{T+1} \mid X_T,\theta \right) = E\left( B x_{T+1} \mid X_T, \theta \right) = B A X_T. $$ This is useful once you plug it back into the iterated expectation. You no longer have to simulate $x_{T+1}$ or $y_{T+1}$ at all, and that means less variance: $$ E[y_{T+1} \mid y_{1:T}] = E\left[ B A X_T \mid y_{1:T} \right] \approx \frac{1}{N}\sum_{i=1}^N B[i] A[i] X_T[i], $$ where $(\theta[i], X_T[i]) \sim p(\theta, X_{T} \mid y_{1:T})$. These come from sampling the full joint posterior $p(\theta, X_{1:T} \mid y_{1:T})$ and then ignoring the $x_{1:T-1}$ portion of each of your $N$ samples.
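A minimal sketch of this estimator (again with placeholder posterior draws standing in for real sampler output):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000

# Placeholder posterior draws of (A, B, x_T) from p(theta, x_{1:T} | y_{1:T});
# only these three quantities are needed for the Rao-Blackwellized mean.
A_s = rng.normal(0.9, 0.05, N)
B_s = rng.normal(1.0, 0.05, N)
xT_s = rng.normal(0.0, 1.0, N)

# E[y_{T+1} | y_{1:T}] ~ (1/N) * sum_i B[i] A[i] x_T[i]; no x_{T+1} or y_{T+1}
# draws are simulated, so this has lower variance than averaging sampled y_{T+1}s.
rb_mean = np.mean(B_s * A_s * xT_s)
```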

Taylor
  • Taylor, what would happen if I just chose $x_{T+1}=\frac{1}{N}\sum_{i=1}^N A[i]x_T[i]$? – An old man in the sea. Jul 24 '19 at 19:10
  • By the way, many thanks for the present information. I'll accept your answer in 2 days, if nothing better shows up (with a +1 ). ;) – An old man in the sea. Jul 24 '19 at 19:22
  • Why should I marginalize out $x_{T+1}$ and not just define it as the average? – An old man in the sea. Jul 24 '19 at 19:23
  • @Anoldmaninthesea. thanks I appreciate that. Not sure I follow the other parts of your comment 100%, but I just threw in an edit. Does it help? – Taylor Jul 24 '19 at 19:53
  • Ok. I think I get it now. I'm going to think on what you wrote. thanks ;) – An old man in the sea. Jul 24 '19 at 19:58
  • By the way Taylor, do you have some bibliography/references for this ? thanks ;) – An old man in the sea. Jul 28 '19 at 10:03
  • @Anoldmaninthesea. the general idea of Rao-Blackwellization is probably in most textbooks, especially ones that deal a lot with simulation. I also had an early chapter of my dissertation that sort of outlined this general roadmap, and that's available on my website, but it was mostly motivated by particle filtering. If you have linear-Gaussian state space models, you don't really need to sample the future observations, because that distribution is closed-form. – Taylor Jul 28 '19 at 18:30
  • Taylor, many thanks. In reality I'm dealing with a non-linear Gaussian SSM. I've just simplified my problem here to have a higher chance of getting an answer. – An old man in the sea. Jul 28 '19 at 19:02

Everything you said is correct. All I would add is that the mixture you introduce at the end is perhaps unnecessary for the purposes of sampling from $p(y_{T+1}\,|\,y_{1:T})$. But as the other answer pointed out, it's a useful perspective for getting good estimates of the predictive mean and variance.

You have a joint distribution

$$p(y_{T+1},\,x_{T + 1},\,x_{1:T},\,\theta\,|\,y_{1:T})$$

which you can factor as

\begin{align} x_{1:T},\,\theta &\sim p(x_{1:T},\,\theta\,|\,y_{1:T}) &&(1)\\ x_{T+1}\,|\,x_{1:T},\,\theta &\sim N(Ax_T,\,Q)&&(2)\\ y_{T + 1}\,|\,x_{T + 1},\,x_{1:T},\,\theta &\sim N(Bx_{T+1},\,\Sigma).&&(3)\\ \end{align}

You want the $y_{T+1}$ marginal of this joint. As with any joint distribution, to generate iid draws from the marginal, all you have to do is simulate the entire joint distribution and retain the variate that you care about. In this case, that means drawing $x_{1:T}$ and $\theta$, conditionally drawing an $x_{T + 1}$, conditionally drawing a $y_{T+1}$, and then retaining the $y_{T+1}$ and "ignoring" the $\theta,\, x_{1:T},\,x_{T+1}$. Do this as many times as you want, and the retained subsample of $y_{T+1}$ draws is a sample from the marginal.

That is to say, you don't have to first generate many draws from $p(x_{T+1},\,x_{1:T},\,\theta\,|\,y_{1:T})$, then write down a mixture and sample from the mixture (which presumably involves simulating an auxiliary indicator for the mixture components) to get draws from $p(y_{T+1}\,|\,y_{1:T})$. You can just cycle through (1) - (3) and naively retain the subsample of $y_{T+1}$ draws.

jcz