What is a Dirichlet prior

Question

I am doing some bioinformatics research but my background is Applied Math and I usually have trouble with the statistics part of it.

Basically, I've created a Position Weight Matrix using an R function from the package Biostrings (part of Bioconductor, the R bioinformatics suite).

The function signature looks like this:

PWM(x, type = c("log2probratio", "prob"), prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25))

For the type parameter, the Details section says this:

The PWM function uses a multinomial model with a Dirichlet conjugate prior to calculate the estimated probability of base b at position i. As mentioned in the Arguments section, prior.params supplies the parameters for the DNA bases A, C, G, and T in the Dirichlet prior. These values result in a position independent initial estimate of the probabilities for the bases to be priorProbs = prior.params/sum(prior.params) and the posterior (data infused) estimate for the probabilities for the bases in each of the positions to be postProbs = (consensusMatrix(x) + prior.params)/(length(x) + sum(prior.params)). When type = "log2probratio", the PWM = unitScale(log2(postProbs/priorProbs)). When type = "prob", the PWM = unitScale(postProbs).

Could someone help me understand the two different types here? Usually a PWM has pseudocounts added if the position frequency matrix (PFM) has a lot of zero entries (or comes from a small dataset). As I understand it, the columns of the PFM are multiplied by a Dirichlet distribution. I guess I really just need help with what a Dirichlet distribution is and how it relates to a multinomial model.

Are you familiar with Bayesian inference? Do you already understand how the idea of "pseudocounts" is related to prior information incorporated into the posterior? I think this is a big question, and it'd be helpful to know how deep your knowledge is with these types of methods. — Blue Marker, Dec 24 '14 at 20:56
I understand why they are using pseudocounts. The PFM can contain zeros. So going from PFM -> PWM which are log values, the entries can go to infinity. So I understand the intuition, but not the statistical mechanics behind it. — masfenix, Dec 24 '14 at 23:47
In short: The [Dirichlet](http://en.wikipedia.org/wiki/Dirichlet_distribution#Probability_density_function) is the multivariate extension of the beta corresponding to the multinomial extension of the binomial. It's a continuous distribution all of whose components lie in [0,1] and which components sum to 1. That is Dirichlet is to beta as multinomial is to binomial, and Dirichlet is to multinomial as beta is to binomial. — Glen_b, Dec 25 '14 at 05:43
Check https://stats.stackexchange.com/questions/244917/what-exactly-is-the-alpha-in-the-dirichlet-distribution/244946#244946 — Tim, Dec 16 '18 at 20:08

score 5 · Answer 1 · edited Jun 11 '20 at 14:32

Let me try to answer your very last question, about understanding the Dirichlet distribution and its relation to the Multinomial. I suspect what you would really like to know is how this can be explained in an applied context, such as your genomics problem.

Now I am going to explain all this using my vague recollection of haplogroup SNPs, which might be somewhat similar to your data:

So let's say I have this dataset from Homo sapiens sapiens, inspired by this random NIH paper, and I need to identify all the SNPs associated with a novel sub-sub-subclade of haplogroup N that I believe exists. I don't know how many SNPs there are in each of those 4 sub-sub-clades (another population genetics framing would be finding the linkage disequilibrium), or what those SNPs are in this Y-DNA sample.

15121 caacagcctt cataggctat gtcctcccgt gaggccaaat atcattctga ggggccacag

15181 taattacaaa cttactatcc gccatcccat acattgggac agacctagtt caatgaatct

15241 gaggaggcta ctcagtagac agtcccaccc tcacacgatt ctttaccttt cacttcatct

15301 tgcccttcat tattgcagcc ctagcagcac tccacctcct attcttgcac gaaacgggat

....etc...etc...

Given that we do not know the number of linked SNPs that would fall into this sub-sub-subclade, or the probability of occurrence of a SNP in this novel, obscure genomic region I'm about to discover, I will treat those unknown parameters as random under the Bayesian paradigm:
I will estimate **the number of linked SNPs** associated with that sub-sub-subclade, since I know that if more than 1% of a population does not carry the same nucleotide at a specific position in the DNA sequence, then this variation can be classified as a SNP.
By "linked SNPs" I mean some unknown number of groups of SNPs. For example, one possible group associated with this sub-sub-subclade genomic region of haplogroup N would be the SNPs for dopaminergic receptors x, y, and z; another would be the SNPs for the serotonin 5-HT2A receptor, rs6311 and rs6313; and so on.
My other parameter will be the vector of proportions with which each of the $k$ possible outcomes ($k\ge2$) is observed over $N$ sampled nucleotides, where:

$$x_1,\dots,x_k, \quad x_i \in (0,1), \quad \sum_{i=1}^k x_i = 1$$

is a vector of random category proportions, parametrized by a pseudocount parameter $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_k)$ with every $\alpha_i > 0$.
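
For reference, here is the standard Dirichlet density for such a vector, written with the same symbols $x_i$ and $\alpha_i$ (all $\alpha_i > 0$), together with its mean:

$$f(x_1,\dots,x_k;\boldsymbol{\alpha}) = \frac{\Gamma\left(\sum_{i=1}^k \alpha_i\right)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k x_i^{\alpha_i - 1}, \qquad \mathrm{E}[x_i] = \frac{\alpha_i}{\sum_{j=1}^k \alpha_j}$$

As an illustration, with $k = 4$ and $\boldsymbol{\alpha} = (0.25, 0.25, 0.25, 0.25)$, the default prior.params in the signature you quoted, every mean equals $0.25$. That is exactly the priorProbs = prior.params/sum(prior.params) expression from the documentation.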

Now, a minute of some math:

The commonly known probability distributions are related, as clearly illustrated on the map of the Relationships Among Common Distributions, adapted from Leemis (1986), in what I call "the Bible of Statistics" that every statistician has vivid dreams about before math stats exams in grad school, a.k.a. the 2nd edition of "Statistical Inference" by G. Casella & R. Berger, 2001, Cengage Learning. Although even the extended version of the map from The American Statistician does not depict the Dirichlet and Multinomial, here is my audacious take on where they would be placed in the classical map #1:

Also, somewhere along those arrows would be another relevant distribution, the Categorical Distribution.

In a nutshell, you can derive or easily transform one of them into another if they sit close together on the map and come from the same family of distributions. Some of these are generalizations of other distributions; the Dirichlet, for instance, generalizes the Beta distribution to multiple dimensions.
For this reason and many others, the Dirichlet distribution is the conjugate prior for the Multinomial distribution.
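
A short sketch of why the conjugacy holds. The notation here is mine for this sketch: $p = (p_1,\dots,p_k)$ are the category probabilities and $n_i$ is the observed count in category $i$. Dropping every factor that does not depend on $p$:

$$f(p \mid n_1,\dots,n_k) \;\propto\; \underbrace{\prod_{i=1}^k p_i^{n_i}}_{\text{Multinomial likelihood}} \times \underbrace{\prod_{i=1}^k p_i^{\alpha_i - 1}}_{\text{Dirichlet prior}} \;=\; \prod_{i=1}^k p_i^{\alpha_i + n_i - 1}$$

The right-hand side has the same functional form as a Dirichlet density, so the posterior is again a Dirichlet, now with parameters $\alpha_i + n_i$. The prior parameters act like counts observed before seeing any data, which is where the word "pseudocounts" comes from.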
Now back to our SNPs problem:

The setup we are going to use for our problem is based on the Pólya urn model, where we sample with replacement $N$ strings of the 4-letter nucleotide bases that show up in the $k$ SNPs linked to my sub-sub-subclade, and where the observed SNPs can be sampled with probabilities $p_1,\dots,p_k$.

Since we don't know how many and which SNPs fall into the "linked SNPs for this sub-sub-subclade", we represent those unknown SNPs, which may or may not exist for this sub-sub-subclade, by a parameter $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_k)$; those might actually be the pseudocounts you are referencing. Here is a nice response and a reference to some problems with this approach to the Dirichlet-Multinomial. I would highly recommend checking the Bioconductor or package documentation, because those pseudocounts can easily be just a simple method to convert between different matrices while integrating very different distributions.
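
To make the sampling setup concrete, here is a minimal Python sketch of the generative side of a Dirichlet-multinomial model: first draw a probability vector from the Dirichlet, then sample categories with replacement using those probabilities. The function names, the seed, the sample size of 20, and the choice of four bases with $\alpha_i = 0.25$ are all assumptions made for illustration, not something taken from any package.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) by normalizing Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_counts(probs, n, rng):
    """Sample n categories with replacement and tally them (one multinomial draw)."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[idx] += 1
    return counts

rng = random.Random(0)              # fixed seed, only for reproducibility
bases = ["A", "C", "G", "T"]
alpha = [0.25, 0.25, 0.25, 0.25]    # illustrative pseudocounts (assumed)

p = sample_dirichlet(alpha, rng)    # unknown category probabilities
counts = sample_counts(p, 20, rng)  # observed counts given those probabilities

print(dict(zip(bases, p)))
print(dict(zip(bases, counts)))
```

Small $\alpha_i$ values such as 0.25 tend to produce probability vectors concentrated on one or two categories, while larger values pull the draws toward the uniform vector.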

Now we estimate the parameters in the Dirichlet-multinomial model and update them to $\alpha_1+n_1,\dots,\alpha_k+n_k$, where $n_i$ is the number of times category $i$ was observed. This eventually gives the posterior distribution, from which we estimate the number of types of linked SNPs that fall into my novel sub-sub-clade, along with their probabilities, to conclude how likely we are to see those groups in the sub-sub-clade.
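
Here is a rough Python sketch of that update for a single position, mirroring the priorProbs and postProbs formulas from the documentation you quoted. It is not the Biostrings code: the counts are made up, the helper name is mine, and the final unitScale step is left out.

```python
import math

def posterior_probs(counts, prior_params):
    """Posterior mean per base: (count + pseudocount) / (N + sum of pseudocounts)."""
    total = sum(counts.values()) + sum(prior_params.values())
    return {b: (counts[b] + prior_params[b]) / total for b in prior_params}

prior_params = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # defaults from the quoted signature
column_counts = {"A": 7, "C": 0, "G": 2, "T": 1}             # hypothetical counts, 10 sequences

# Position-independent prior estimate: prior.params / sum(prior.params)
prior_total = sum(prior_params.values())
prior_probs = {b: a / prior_total for b, a in prior_params.items()}

# Data-infused estimate: the Dirichlet parameters become alpha_i + n_i
post_probs = posterior_probs(column_counts, prior_params)

# The "log2probratio" quantity before any rescaling
log2_ratio = {b: math.log2(post_probs[b] / prior_probs[b]) for b in prior_params}

for b in prior_params:
    print(b, round(post_probs[b], 4), round(log2_ratio[b], 3))
```

Note that C was never observed, yet its posterior probability is 0.25/11 rather than 0, so the log ratio stays finite. That is the practical effect of the pseudocounts.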

P.S. I suspect the reason the Dirichlet and Multinomial are so applicable to genomics, and are used in some bioinformatics packages in R, is the amount of discretization in these types of datasets. Traditional models, such as the ones based on the Hardy-Weinberg principle, are also mathematically a perfect candidate for a Binomial, Beta, Multinomial, etc. type of setup, because you are essentially estimating the frequencies of counts in discrete categories (although the Dirichlet does not require the parameters $\alpha$ to be integers).

P.P.S. In the very reduced setup above, I have omitted other important things to consider, which can be found in this tutorial, such as the Dirichlet Process and, specifically, the partition step of some probability space $\Theta$ to find

$$\theta_k \mid H \sim H,$$

where $H$ is the prior distribution over the component parameters $\theta_k$.

score 1 · Answer 2 · edited Mar 03 '20 at 11:00

There is a good explanation in this presentation.

https://www.slideshare.net/g33ktalk/machine-learning-meetup-12182013

You can watch the whole presentation if you want (it is a good explanation of the Dirichlet distribution) but I think the slides will get the concept across pretty quickly.

Slides 32-35 explain the mathematical process of the Dirichlet prior.

Slides 50-60 show what is going on when the distribution updates, and show the prior. (It is easier to see it visually than to explain it.) This gets the general idea across.

Slides 94-102 show what happens to the whole system as updating occurs. This is the same concept as slides 50-60, but it tracks what happens at each iteration.
