
This question is related to my attempt to cluster sequences using mixture Markov modeling.

I have trouble understanding Dirichlet priors in the context of MAP estimation for mixture Markov models. Namely, my prior densities end up being (much) larger than one.
I have non-informative priors defined as follows:

$$ p(\theta_n^{j}|a_n^{j})=\frac{\Gamma(\sum_{m=1}^{M}(a_{nm}^{j}+1))}{\prod_{m=1}^{M}\Gamma(a_{nm}^{j}+1)}*\prod_{m=1}^{M}(\theta_{nm}^{j})^{a_{nm}^{j}}, $$

where each $a_n^{j}$ is an $M$-vector with components $a_{nm}^{j}>0$, $j$ denotes the $j$th component of the mixture, and $n$ indexes the rows of the TPM (transition probability matrix). I then use the sum of the log Dirichlet priors across all rows and components.

The most confusing aspects are:

1) In all literature the Dirichlet prior formula is given as:

$$ p(\theta_n^{j}|a_n^{j})=\frac{\Gamma(\sum_{m=1}^{M}(a_{nm}^{j}))}{\prod_{m=1}^{M}\Gamma(a_{nm}^{j})}*\prod_{m=1}^{M}(\theta_{nm}^{j})^{a_{nm}^{j}-1}, $$

(note the $-1$ in the exponent). Is this possibly a typo in the article, or can the $-1$ be dropped?
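For reference, the two versions coincide after the substitution $\alpha = a + 1$, i.e. the article's formula is just the standard density with every parameter shifted up by 1 (so all parameters exceed 1). A quick numerical check, sketched in Python with scipy rather than R (my choice of tooling, not the article's), using the numbers from my example below:

```python
import math

import numpy as np
from scipy.stats import dirichlet

theta = np.array([0.1, 0.8, 0.1])   # one row of the TPM
a = np.array([0.01, 0.03, 0.06])    # hyperparameters as in my example below

# The article's formula: Gamma arguments a + 1, exponent a
article = (math.gamma(np.sum(a + 1))
           / np.prod([math.gamma(x) for x in a + 1])
           * np.prod(theta ** a))

# The standard (textbook) Dirichlet density with parameters alpha = a + 1
standard = dirichlet.pdf(theta, a + 1)

print(article, standard)  # both ~1.961152, matching the R result
```

So the article's form is not an arithmetic error, just a reparameterisation; the practical difference is that the exponents stay positive.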

2) The article sets $a_n^{j}$ equal to 10% of the corresponding relative frequencies of the TPM of the original counts across all sequences (if the model has just one component). Now consider the following example:

there are 3 possible states in the Markov chain, and the $n$th row of transition probabilities is $(0.1, 0.8, 0.1)$. Let $a_n^j = (0.01, 0.03, 0.06)$. Then, following the formula, my prior (calculated in R) is:

according to the first version of the formula: 1.961152

according to the second version of the formula: 245.144

This has been computed by defining a function in R:

    dirichletPrior <- function(matrix_row, alpha_row) {
      # First (article) version: Gamma arguments alpha + 1, exponent alpha
      (gamma(sum(alpha_row + 1)) / prod(gamma(alpha_row + 1))) * prod(matrix_row^alpha_row)
    }
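As a cross-check (sketched in Python with scipy; the exact value is not from the article), evaluating the standard Wikipedia-style density at the same row gives a value far smaller than 245.144, which points to a code issue rather than a property of the distribution:

```python
import numpy as np
from scipy.stats import dirichlet

theta = np.array([0.1, 0.8, 0.1])   # one row of the TPM
a = np.array([0.01, 0.03, 0.06])    # hyperparameters from my example

# Standard (Wikipedia) parameterisation: exponent a - 1
val = dirichlet.pdf(theta, a)
print(val)  # on the order of 0.02, nowhere near 245.144
```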

These results seem nonsensical to me, and I do not understand where my logic is faulty.

3) Is the "Dirichlet prior" a probability density function or a likelihood function? What would it mean if the result for each row is larger than 1? Are alpha parameters like the ones I gave in the example meaningful at all? What do they have in common with the concentration parameter?
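Regarding the "larger than 1" part, a sketch (again in Python with scipy, my own illustration) showing that a continuous density can exceed 1 pointwise while still integrating to 1; Beta(10, 10) is the two-state Dirichlet:

```python
from scipy.integrate import quad
from scipy.stats import beta

# Beta(10, 10) is a 2-state Dirichlet; its density at 0.5 is about 3.5,
# comfortably above 1, yet it still integrates to 1 over [0, 1].
d = beta.pdf(0.5, 10, 10)
area, _ = quad(lambda x: beta.pdf(x, 10, 10), 0, 1)
print(d, area)
```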

The article I have been referring to is located under the following link: http://www.cs.uoi.gr/~kblekas/papers/C19.pdf

  • Your formulae for Dirichlet distribution are wrong, at least not for the range of 0 < a. For the first formula, $a > -1$ is assumed, and the second, $a + 1$ in the Gamma functions should be $a$. You might consider computing log-Gamma function instead of product of Gamma, since the numbers are very large. – Memming Aug 26 '13 at 14:41
  • @Memming, now I am really confused. The article provides formula (1) for Dirichlet priors and each term $a_{nm}^{j}>0$. I would be very thankful if you could explain to me in very simple terms what these terms are then supposed to represent. I think I might be confusing the "concentration parameter alpha", pseudocounts and multinomial coefficients when I consider them. The article seems to state they are a very small percentage of corresponding transition probabilities. – zima Aug 26 '13 at 15:19
  • I recommend http://en.wikipedia.org/wiki/Dirichlet_distribution. The referred article doesn't use standard formula, and only part of the parameter space which I think might be a mistake. In any case, the sum of prob should be 1. See http://en.wikipedia.org/wiki/Beta_function to verify that it is indeed the correct normalizer. – Memming Aug 26 '13 at 15:47
  • @Memming, I have gone through the Wikipedia article for the nth time, and that helped me realize my huge values are due to a missing parenthesis in the code. However, I am still at a loss as to why the prior was defined the way it was in the paper (i.e. it is pretty much larger by 1 than it needs to be). I will use the formula from Wikipedia and hope for the best. Is there a chance you could answer one more question: after each iteration of the EM algorithm, do I need to update the alpha parameters? Or do I keep them fixed? – zima Aug 26 '13 at 16:52
  • alpha < 1 tends to make things super sparse, so maybe the paper was trying to avoid that. – Memming Aug 26 '13 at 18:38
  • @Memming, thank you for making me understand what the alpha parameters actually mean (e.g. <1 makes the draws sparse, >1 more even, like in the visualization given on Wikipedia). I have also edited to reflect that the second formula should have alpha and not alpha+1. I have a question regarding the statement "sum of probabilities should be 1": do you mean the sum of Dirichlet prior probabilities for each row of the transition matrix should be 1? Or the sum of priors across all model components should be 1, where each prior is over a row of the transition matrix? – zima Sep 03 '13 at 13:31
  • Also see: http://stats.stackexchange.com/questions/4220/can-a-probability-distribution-value-exceeding-1-be-ok – kjetil b halvorsen May 29 '16 at 19:37
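Memming's log-Gamma suggestion from the comments above can be sketched as follows (Python/scipy assumed; working in log space avoids overflow when some $a_{nm}$ are tiny and the Gamma values become huge):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

theta = np.array([0.1, 0.8, 0.1])   # one row of the TPM
a = np.array([0.01, 0.03, 0.06])    # hyperparameters from the example

# Log-density of the standard Dirichlet via gammaln: numerically stable,
# since log Gamma is computed directly instead of multiplying Gamma values
log_pdf = gammaln(a.sum()) - gammaln(a).sum() + np.sum((a - 1) * np.log(theta))

print(np.exp(log_pdf))  # agrees with dirichlet.pdf(theta, a)
```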
