
Suppose $ \{ X_i \}_{i=1}^{T}$ are states of a Markov chain and let $P_{\theta}(X_1, ..., X_T)$ be the probability of observing the path when $\theta$ is the true parameter value (i.e., the likelihood function for $\theta$). Using the definition of conditional probability, we know

$$ P_{\theta}(X_1, ..., X_T) = P_{\theta}(X_T | X_{T-1}, ..., X_1) \cdot P_{\theta}(X_1, ..., X_{T-1})$$

Since this is a Markov chain, we know that $P_{\theta}(X_T | X_{T-1}, ..., X_1) = P_{\theta}(X_T | X_{T-1} )$, so this simplifies to

$$ P_{\theta}(X_1, ..., X_T) = P_{\theta}(X_T | X_{T-1}) \cdot P_{\theta}(X_1, ..., X_{T-1})$$

Iterating this argument, the likelihood function is:

$$ P_{\theta}(X_1, ..., X_T) = \prod_{i=1}^{T} P_{\theta}(X_i | X_{i-1} ) $$

where the factor $P_{\theta}(X_1 | X_0)$ is to be interpreted as the initial distribution of the process, $P_{\theta}(X_1)$.
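To make the product formula concrete, here is a minimal sketch in Python, assuming a hypothetical 2-state chain with a fixed transition matrix `T` and initial distribution `pi0` (both made up for illustration); the log-likelihood of a path is the log of the initial probability plus the sum of log transition probabilities:

```python
import math

import numpy as np

# Hypothetical 2-state chain: T[i, j] = P(X_t = j | X_{t-1} = i),
# pi0[i] = P(X_1 = i). These numbers are illustrative only.
T = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi0 = np.array([0.5, 0.5])

def path_log_likelihood(path, T, pi0):
    """log P(X_1, ..., X_T) = log pi0[X_1] + sum_t log T[X_{t-1}, X_t]."""
    ll = math.log(pi0[path[0]])
    for prev, cur in zip(path[:-1], path[1:]):
        ll += math.log(T[prev, cur])
    return ll

path = [0, 0, 1, 1, 0]
print(path_log_likelihood(path, T, pi0))
```

Working in log space avoids the numerical underflow that the raw product of many small probabilities would cause for long paths.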

My question is how the above analysis works if we use the different notation of $$P(X_1, ..., X_T|\theta)$$

being the likelihood function for $\theta$ instead. Then, won't we end up with:

$$ P(X_1, ..., X_T|\theta) = P(X_T | X_{T-1}, ..., X_1, \theta) \cdot P(X_1, ..., X_{T-1}|\theta)$$ ?

Then, we will have:

$$ P(X_1, ..., X_T|\theta) = P(X_T | X_{T-1}, ..., X_1, \theta) \cdots P(X_1|X_0,\theta) $$

But I am not sure how to find $P(X_1|X_0,\theta)$, since it now has $\theta$ in the conditioning part?

In other words, if we were to treat the likelihood as a conditional probability distribution, do we now have a different problem of trying to find the joint distribution of the states and $\theta$? What is the correct notation for using the conditional distribution notation for the likelihood?

user321627

1 Answer


In a Frequentist approach, $\theta$ is a static value to estimate, so $P(X|\theta)=P_\theta(X)$ is an unconditional distribution of $X$ parameterized by $\theta$. If $\theta^*$ is the true parameter of the underlying Markov process, then $P(X|\theta^*)=P(X)$.

Whether $P(X|\theta)$ is a conditional distribution depends on how you use it. As a function of the observed data $x$, $P(x|\theta)$ is a distribution. However, for maximum likelihood estimation we are interested in finding a static value for $\theta$ that maximizes $P(x|\theta)$ and thus treat $P(x|\theta)$ as a function of $\theta$ with fixed $x$.
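This view of the likelihood as a function of $\theta$ with fixed data can be sketched numerically. In this hypothetical example (the path and parameterization are made up for illustration), a 2-state chain is parameterized by a single $\theta$ = probability of staying in the current state, and we maximize the log-likelihood over a grid of $\theta$ values:

```python
import math

import numpy as np

# Hypothetical parameterization: theta = P(stay in the same state).
# The observed path x is fixed; the likelihood is a function of theta only.
path = [0, 0, 0, 1, 1, 0, 0, 1]

def log_likelihood(theta, path):
    ll = 0.0
    for prev, cur in zip(path[:-1], path[1:]):
        p = theta if cur == prev else 1 - theta
        ll += math.log(p)
    return ll

# Grid search over theta with the data held fixed.
thetas = np.linspace(0.01, 0.99, 99)
mle = thetas[np.argmax([log_likelihood(t, path) for t in thetas])]
print(mle)  # close to the observed fraction of self-transitions, 4/7
```

For this parameterization the maximizer is the empirical fraction of self-transitions, which is why the grid search lands near $4/7$.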

The result is not a distribution since integrating $P(x|\theta)$ over $\theta$ doesn't give 1. The notation $\mathcal{L}(\theta|x)$ is used to make this distinction clear. Note that here the $|$ means "given" in a non-probabilistic sense, and a less common but clearer notation would be $\mathcal{L}_x(\theta)$.
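A quick numeric check of this point, using a hypothetical example (a binomial model rather than a Markov chain, purely for simplicity): for $n$ coin flips with $k$ heads, $P(x|\theta)=\theta^k(1-\theta)^{n-k}$ integrates over $\theta$ to the Beta function $B(k+1, n-k+1)$, not to 1:

```python
import numpy as np

# Hypothetical data: 10 flips, 7 heads; theta = P(heads).
n, k = 10, 7

def likelihood(theta):
    return theta**k * (1 - theta)**(n - k)

# Integrate the likelihood over theta with the trapezoid rule.
thetas = np.linspace(0.0, 1.0, 100001)
vals = likelihood(thetas)
dx = thetas[1] - thetas[0]
integral = ((vals[:-1] + vals[1:]) / 2).sum() * dx
print(integral)  # ~ 1/1320 = B(8, 4), not 1
```

So $\mathcal{L}(\theta|x)$ is generally not a probability density over $\theta$, which is exactly why the dedicated notation exists.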

In a Bayesian setting, we treat $\theta$ as a random variable and $P(\theta)$ is the prior belief over possible parameters $\theta$. In this case, it makes sense to write $P(x|\theta)=P(x,\theta)/P(\theta)$. For more explanation you can take a look at my recent question: Why do people use $\mathcal{L}(\theta|x)$ for likelihood instead of $P(x|\theta)$?.

I noticed a small mistake in your question: the path probability under the true parameter is not itself "the likelihood function for $\theta$". The likelihood function $\mathcal{L}(\theta|x)=P(x|\theta)$ is evaluated at observed data $x$. In contrast, the true parameter $\theta^*$ concerns the random variable $X$ that follows the underlying distribution, so the optimal parameter $\theta^*=\arg\max_{\theta'}{P(X|\theta')}$ cannot be found exactly, only estimated from observations.

danijar
  • In this way, is the conditional $P(X|\theta)$ really a conditional distribution? Meaning, can I factor it as: $P(X|\theta) = \frac{P(X,\theta)}{P(\theta)}$? If $\theta$ is fixed then what is the interpretation of $P(\theta)$ in the denominator? – user321627 Jun 13 '17 at 05:01
  • @user321627 Please see my updated answer. – danijar Jun 13 '17 at 06:04