This question is related to my attempt to cluster sequences using mixture Markov models.
I have trouble understanding Dirichlet priors in the context of MAP estimation for mixture Markov models. Namely, the values of my priors end up being (much) larger than one.
I have non-informative priors defined as follows:
$$ p(\theta_n^{j}|a_n^{j})=\frac{\Gamma\left(\sum_{m=1}^{M}(a_{nm}^{j}+1)\right)}{\prod_{m=1}^{M}\Gamma(a_{nm}^{j}+1)}\prod_{m=1}^{M}(\theta_{nm}^{j})^{a_{nm}^{j}}, $$
where each $a_n^{j}$ is an $M$-vector with components $a_{nm}^{j}>0$, $j$ indexes the component of the mixture, and $n$ is the row index of the transition probability matrix (TPM). I then sum the log Dirichlet priors over all rows and all components.
The most confusing aspects are:
1) Everywhere else in the literature, the Dirichlet prior is given as:
$$ p(\theta_n^{j}|a_n^{j})=\frac{\Gamma\left(\sum_{m=1}^{M}a_{nm}^{j}\right)}{\prod_{m=1}^{M}\Gamma(a_{nm}^{j})}\prod_{m=1}^{M}(\theta_{nm}^{j})^{a_{nm}^{j}-1}, $$
(note the $-1$ in the exponent). Is this possibly a typo in the article, or can the $-1$ term be dropped?
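As far as I can tell, the two formulas coincide if the parameters are shifted by one. Writing $\alpha_{nm}^{j}=a_{nm}^{j}+1$ (my own notation, not the article's) in the standard form gives:

$$ \frac{\Gamma\left(\sum_{m=1}^{M}\alpha_{nm}^{j}\right)}{\prod_{m=1}^{M}\Gamma(\alpha_{nm}^{j})}\prod_{m=1}^{M}(\theta_{nm}^{j})^{\alpha_{nm}^{j}-1} \;\overset{\alpha_{nm}^{j}=a_{nm}^{j}+1}{=}\; \frac{\Gamma\left(\sum_{m=1}^{M}(a_{nm}^{j}+1)\right)}{\prod_{m=1}^{M}\Gamma(a_{nm}^{j}+1)}\prod_{m=1}^{M}(\theta_{nm}^{j})^{a_{nm}^{j}}. $$

So the article's version may simply be a Dirichlet with parameters $a_{nm}^{j}+1$ rather than a typo, but I am not sure this is the intended reading.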
2) The article sets $a_n^{j}$ equal to 10% of the corresponding relative frequencies in the TPM estimated from the original counts across all sequences (as if the model had just one component). Consider the following example:
there are 3 possible states in the Markov chain, and the transition probabilities in the $n$-th row are 0.1, 0.8, 0.1. Let $a_n^j$ be equal to (0.01, 0.03, 0.06). Then, following the formula, my prior (calculated in R) is:
according to the first version of the formula: 1.961152
according to the second version of the formula: 245.144
This was computed with the following R function:
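    dirichletPrior <- function(matrix_row, alpha_row) {
      # Normalizing constant: Gamma(sum(a + 1)) / prod(Gamma(a + 1))
      norm_const <- gamma(sum(alpha_row + 1)) / prod(gamma(alpha_row + 1))
      # Kernel: prod(theta^a), i.e. the first version of the formula (no -1 in the exponent)
      norm_const * prod(matrix_row^alpha_row)
    }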
The results look like nonsense to me, and I do not understand where my logic is faulty.
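Since the prior enters my objective as a sum of logs, the same computation can also be sketched on the log scale with `lgamma`, which avoids overflow in the Gamma functions. The names `logDirichletShifted` and `logDirichletStandard` are hypothetical helpers of mine, not from the article, and the sketch assumes all transition probabilities are strictly positive:

```r
# Hypothetical helper names; neither function comes from the article.

# Log-density under the "+1" form: normalizer uses a + 1, exponent is a.
logDirichletShifted <- function(theta_row, a_row) {
  lgamma(sum(a_row + 1)) - sum(lgamma(a_row + 1)) + sum(a_row * log(theta_row))
}

# Log-density under the textbook form: normalizer uses alpha, exponent is alpha - 1.
logDirichletStandard <- function(theta_row, alpha_row) {
  lgamma(sum(alpha_row)) - sum(lgamma(alpha_row)) +
    sum((alpha_row - 1) * log(theta_row))
}

theta <- c(0.1, 0.8, 0.1)     # example row of the TPM
a     <- c(0.01, 0.03, 0.06)  # example hyperparameters

# Same quantity as dirichletPrior(theta, a), computed via logs
exp(logDirichletShifted(theta, a))

# The two forms agree once the parameters are shifted by one
all.equal(logDirichletShifted(theta, a), logDirichletStandard(theta, a + 1))
```

Summing `logDirichletShifted` over all rows and components would then give the log-prior term directly, without ever forming the (possibly large) density values.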
3) Is the "Dirichlet prior" a probability density function or a likelihood function? What does it mean if the result for a row is larger than 1? Are alpha parameters like the ones I gave in the example meaningful at all? How do they relate to the concentration parameter?
The article I am referring to is available at the following link: http://www.cs.uoi.gr/~kblekas/papers/C19.pdf