6

For an ergodic Markov chain $$ \frac{1}{N}\sum_{i=1}^N f(X_i) \rightarrow E_\pi[f] $$ where $\pi$ is the invariant distribution. I am also dealing with a Markovian process (a state space model, to be specific) and I have a quantity like the following: $$ \frac{1}{T} \sum_{t=1}^T \log p(x_t \mid x_{t-1},\theta) $$ where the state space model that generated the data has initial distribution $x_0 \sim p(x_0)$ and transition model $x_t \sim p(x_t \mid x_{t-1},\theta)$. Can I apply the ergodic theorem in this setting? If so, what would the above sum converge to?

In general, instead of $\frac{1}{T}\sum_{t=1}^T f(X_t)$ what happens if I have $\frac{1}{T}\sum_{t=1}^T f(X_{t-L},\dots,X_t)$?

jkt

1 Answer

3

Say your state space is $\Omega$ and your process is $X_t$. Consider a new state space, $\Omega \times \Omega$. Then $Y_t := (X_{t-1}, X_t)$ is a Markov process on $\Omega \times \Omega$. Now you can use the ergodic theorem, provided you know the invariant distribution of $Y_t$. This is a distribution over pairs $(X_{t-1}, X_t)$, and we may write it as the joint distribution $\pi(x_{t-1}, x_t)$. By the laws of probability, $$ \pi( x_{t-1}, x_t ) = \pi( x_{t-1} ) \, p( x_t \mid x_{t-1} ). $$
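As a quick numerical sketch of the pair-process construction (the 2-state transition matrix `P` below is made up for illustration): the invariant distribution of $Y_t = (X_{t-1}, X_t)$ is exactly $\pi(x_{t-1})\,p(x_t \mid x_{t-1})$, and both of its marginals recover $\pi$.

```python
import numpy as np

# Hypothetical 2-state chain; rows of P sum to 1.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stationary distribution pi of X_t: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Invariant distribution of the pair process Y_t = (X_{t-1}, X_t):
# pi2[i, j] = pi[i] * P[i, j]
pi2 = pi[:, None] * P

# Sanity checks: pi2 is a distribution and both marginals equal pi.
assert np.isclose(pi2.sum(), 1.0)
assert np.allclose(pi2.sum(axis=1), pi)  # marginalize over x_t
assert np.allclose(pi2.sum(axis=0), pi)  # marginalize over x_{t-1}, by stationarity
```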

Thus:

\begin{align} \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{t-1}) &= \mathbb{E}_{\pi(x,y)} [ \log p( y \mid x ) ] \\ &= \sum_{(x,y) \in \Omega \times \Omega} \pi(x,y) \log p( y \mid x ) \\ &= \sum_{(x,y) \in \Omega \times \Omega} \pi(x,y) \log \frac{\pi(x,y)}{\pi(x)} \\ &= \sum_{(x,y) \in \Omega \times \Omega} \pi(x,y) \log \pi(x,y) - \sum_{(x,y) \in \Omega \times \Omega} \pi(x,y) \log \pi(x) \\ &= \sum_{(x,y) \in \Omega \times \Omega} \pi(x,y) \log \pi(x,y) - \sum_{x \in \Omega} \pi(x) \log \pi(x) \quad \text{(marginalizing over } y\text{)} \\ &= H(X_{t-1}) - H(X_{t-1}, X_t) \\ &= -H(X_t \mid X_{t-1}). \end{align}

$H$ is the entropy function(al) and $H(Y \mid X)$ is the conditional entropy. According to Wikipedia, the conditional entropy (or equivocation) quantifies the amount of information needed to describe the outcome of a random variable $Y$ given that the value of another random variable $X$ is known.

So the sum converges to minus a conditional entropy, and you may want to consider the negative of the above quantity. Regarding your last question: you can apply the same trick from above to the process $(X_{t-L}, \dots, X_t)$.
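A small simulation illustrates the limit (a sketch, again with a hypothetical 2-state chain; `P` is arbitrary and the run length `T` is chosen just to make the ergodic average settle):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state chain; rows of P sum to 1.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stationary distribution pi of X_t.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Conditional entropy H(X_t | X_{t-1}) = -sum_{i,j} pi_i P_ij log P_ij.
H_cond = -np.sum(pi[:, None] * P * np.log(P))

# Simulate the chain and form the ergodic average of log p(x_t | x_{t-1}).
T = 200_000
x = np.empty(T + 1, dtype=int)
x[0] = rng.choice(2, p=pi)
for t in range(1, T + 1):
    x[t] = rng.choice(2, p=P[x[t - 1]])
avg = np.mean(np.log(P[x[:-1], x[1:]]))

# avg should be close to -H(X_t | X_{t-1}).
print(avg, -H_cond)
```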

Yair Daon
  • Thanks, I actually figured out the expectation wrt $\pi(x_t,x_{t-1})$ on my own later on but I did not express the whole thing in terms of entropy terms. – jkt May 20 '16 at 18:42
  • I am also interested in the second order derivative wrt the parameter $\theta$, i.e. $\frac{1}{T}\sum_t \partial_i \partial_j \log p(x_t \mid x_{t-1},\theta)$. This is related to the question [here](http://stats.stackexchange.com/questions/211392/markov-model-parameter-concentration-and-fisher-information-matrix) which I asked with a bounty of 50, though no one answered. I expect this would be related to the Fisher matrix in the standard iid data case. I am mainly interested in the generalization of the classical statistical setting to the Markovian setting. – jkt May 20 '16 at 19:09