The posterior mode of $\pi_i$ in the Dirichlet-Multinomial model is

$$ \mathrm{Mode}(\pi_i) = \frac{\alpha_i + x_i - 1}{\sum_{j=1}^k (\alpha_j + x_j -1)} $$

  1. Could you show how it is calculated?
  2. What is the significance of the -1 in "αi+xi−1"? I know that the formula for the mean has no "-1", so could you explain what effect this "-1" has relative to the mean?
Mosab Shaheen
  • See https://stats.stackexchange.com/questions/262956/why-is-there-1-in-beta-distribution-density-function/263842#263842. – whuber Feb 22 '19 at 12:25

1 Answer

Let $f(x_1,\dotsc,x_n)$ be the PDF of $\operatorname{Dir}(\alpha_1+1,\dotsc,\alpha_n+1)$. Then let

$$A=\log f(x_1,\dotsc,x_n)=\log(x_1^{\alpha_1}\dots x_n^{\alpha_n}) + C= \log(x_1^{\alpha_1})+\dots +\log(x_n^{\alpha_n}) + C$$ where $C$ is some constant.

We have this constraint on the variables $(x_1,\dotsc,x_n)$:

$$g(x_1,x_2,...,x_n) = \sum_i x_i =1 $$

Maximizing $A$ is the same as maximizing $f$, because the logarithm is an increasing function. We introduce a Lagrange multiplier $\lambda$. Let the Lagrangian function be:

$$L(x_1,\dotsc,x_n,\lambda)= A+\lambda (g-1) = \log(x_1^{\alpha_1})+\dots +\log(x_n^{\alpha_n}) +C+ \lambda(x_1+\dots +x_n-1)$$

Taking the differential gives:

$$d L(x_1,\dotsc,x_n,\lambda) = \left({\alpha_1\over x_1} + \lambda\right)dx_1 + \dots + \left({\alpha_n\over x_n} + \lambda\right)dx_n + (x_1+\dots +x_n-1)d\lambda$$

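Setting $dL=0$ gives

$$\tag{1}x_i=-{\alpha_i \over \lambda} $$

and

$$\sum_i x_i =1 \tag{2}$$

Summing $(1)$ over $i$ and using $(2)$ gives $1=-\frac{1}{\lambda}\sum_i \alpha_i$, that is

$$\lambda = -\sum_i \alpha_i $$

which finally means that

$$x_i=\frac{\alpha_i}{\sum_j \alpha_j} $$

In terms of the actual Dirichlet parameters $a_i=\alpha_i+1$, this reads $x_i=\frac{a_i-1}{\sum_j (a_j-1)}$, which is where the $-1$ in the formula for the mode comes from.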
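The stationary point found above can be sanity-checked numerically. The following is only a sketch, not part of the derivation: the exponents in `alpha` are hypothetical, and the helper names `log_density_kernel` and `random_simplex_point` are made up for illustration. It compares $\sum_i \alpha_i \log x_i$ at the candidate point with its value at many random points on the simplex.

```python
import math
import random

def log_density_kernel(x, alpha):
    """Sum of alpha_i * log(x_i): the log-PDF of Dir(alpha_1+1, ..., alpha_n+1) up to a constant."""
    return sum(a * math.log(xi) for a, xi in zip(alpha, x))

def random_simplex_point(n, rng):
    """Draw a point with positive coordinates summing to 1 (normalised exponential draws)."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]

alpha = [5.0, 10.0, 7.0]  # hypothetical exponents, chosen only for illustration

# Candidate maximiser from the Lagrange-multiplier argument: x_i = alpha_i / sum(alpha).
candidate = [a / sum(alpha) for a in alpha]
best = log_density_kernel(candidate, alpha)

# No random point on the simplex should give a larger value than the candidate.
rng = random.Random(0)
never_beaten = all(
    log_density_kernel(random_simplex_point(len(alpha), rng), alpha) <= best
    for _ in range(10_000)
)

print(candidate)
print(never_beaten)
```

A random search like this cannot prove that the point is the maximum; it only fails to find a counterexample, which is consistent with the closed-form result.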


Intuitive answer

The Dirichlet distribution represents an estimate of what categorical distribution produced some set of observations.

For example: If there's a scenario where there are three types of events:

  • event $1$ has been observed $5$ times

  • event $2$ has been observed $10$ times

  • event $3$ has been observed $7$ times,

then our "knowledge" about the probability distribution that produced those events is represented by the Dirichlet distribution $\operatorname{Dir}(5+1,10+1,7+1)$. It should almost be common sense that the most likely probabilities for events $1$, $2$ and $3$ should be $\frac{5}{5+10+7}$, $\frac{10}{5+10+7}$ and $\frac{7}{5+10+7}$ respectively. Grouping them together as $(\frac5{22}, \frac{10}{22}, \frac{7}{22})$ then gives the mode of the Dirichlet distribution.
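As a quick check of the numbers in this example, here is a minimal sketch that evaluates the closed-form mode $(a_i-1)/\sum_j (a_j-1)$ of $\operatorname{Dir}(a_1,\dotsc,a_n)$ with exact fractions. The function name `dirichlet_mode` is made up for illustration, and the formula is assumed to apply only when every parameter exceeds $1$.

```python
from fractions import Fraction

def dirichlet_mode(params):
    """Mode of Dir(a_1, ..., a_n) for integer a_i > 1: (a_i - 1) / (sum(a) - n)."""
    if any(a <= 1 for a in params):
        raise ValueError("closed-form mode requires every parameter to exceed 1")
    denom = sum(params) - len(params)
    return [Fraction(a - 1, denom) for a in params]

counts = [5, 10, 7]                 # observations of events 1, 2 and 3
params = [c + 1 for c in counts]    # Dir(5+1, 10+1, 7+1)

mode = dirichlet_mode(params)
print(mode)  # the fractions 5/22, 5/11 (= 10/22) and 7/22
```

The mode is just the vector of observed frequencies, because the $+1$ in each parameter cancels against the $-1$ in the mode formula.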

The above claim about the mode can be verified by solving a constrained optimisation problem. The optimisation needs to be constrained because the Dirichlet PDF is set to be zero outside those $(x_1, \dotsc, x_n)$ values for which $x_1+\dotsc+x_n=1$. This optimisation is easily done using Lagrange multipliers.

I've been trying to think about why the expression for the expected value is the same as the one for the mode, except that the number of observations of each kind of event is increased by $1$. Proving it seems to be an exercise in integration. But I'd like to see an intuitive argument for that as well.
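The relationship between the two formulas can at least be confirmed arithmetically. The sketch below uses the standard mean $a_i/\sum_j a_j$ and mode $(a_i-1)/\sum_j (a_j-1)$ of $\operatorname{Dir}(a_1,\dotsc,a_n)$. The function names are made up for illustration, and it offers no intuition, only a check that the mean for the observed counts equals the mode for counts that are each one larger.

```python
from fractions import Fraction

def dirichlet_mean(params):
    """Mean of Dir(a_1, ..., a_n): a_i / sum(a)."""
    total = sum(params)
    return [Fraction(a, total) for a in params]

def dirichlet_mode(params):
    """Mode of Dir(a_1, ..., a_n), assuming every a_i > 1: (a_i - 1) / (sum(a) - n)."""
    denom = sum(params) - len(params)
    return [Fraction(a - 1, denom) for a in params]

counts = [5, 10, 7]
posterior = [c + 1 for c in counts]   # Dir(6, 11, 8), from the observed counts
shifted = [c + 2 for c in counts]     # Dir(7, 12, 9), every count increased by 1

mean = dirichlet_mean(posterior)         # 6/25, 11/25, 8/25
mode_shifted = dirichlet_mode(shifted)   # also 6/25, 11/25, 8/25

print(mean == mode_shifted)
```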

wlad
  • Thanks for the answer. I still have a doubt, though. Is there a standard way of calculating the mode, i.e. do we always have to use a Lagrange multiplier? If not, please let me know what other ways there are to calculate it. – Mosab Shaheen Feb 23 '19 at 14:39
  • Lagrange multipliers are used because the PDF of the Dirichlet distribution is only non-zero when $x_1+\dots+x_n=1$. Therefore the optimisation needs to be constrained to those values. You wouldn't do it for a normal distribution for instance – wlad Feb 23 '19 at 15:05
  • @MosabShaheen I've posted a more intuitive answer – wlad Feb 23 '19 at 15:10
  • Thanks @manonlaptop. I saw that for the normal distribution we can find it by setting the derivative to zero (to find the global maximum). Maybe it is not relevant to my question, but what is the standard name for that approach (setting the derivative to zero)? – Mosab Shaheen Feb 23 '19 at 16:01
  • @MosabShaheen Wikipedia says that it's called [Fermat's Theorem](https://en.wikipedia.org/wiki/Fermat%27s_theorem_(stationary_points)) or the Interior Extremum Theorem. It's not a term many people use. It's also likely to confuse people, as Fermat himself proved lots of theorems – wlad Feb 23 '19 at 16:05
  • @manonlaptop Thanks. I have edited the answer to be more clear for people who do not have experience in statistics. Marked as answered. – Mosab Shaheen Feb 23 '19 at 16:34
  • @Tim thanks for sharing the link. If you have some information, could you point out why the expectation and the mode differ by the "-1" in "αi+xi−1" (what meaning/effect does it carry)? – Mosab Shaheen Feb 23 '19 at 16:39
  • @MosabShaheen the above answer derives the formula for mode. – Tim Feb 23 '19 at 16:45
  • @Tim right, but I mean: why does the formula for the mode differ from the mean by "-1"? I know that this is the result of the calculation, but I am asking about its meaning (why is it always "-1" and not "-2" or "-10")? – Mosab Shaheen Feb 23 '19 at 18:03
  • @manonlaptop I am trying to find the formula for the mode of the Dirichlet using Fermat's Theorem (by setting the derivative to zero), i.e. following the same approach used to find the mode of the normal distribution. However, I got a weird result. Please check here https://drive.google.com/file/d/1zKI4ojh24NKRDXlaAVpGdZr02FyTHfFr/view?usp=sharing . Please let me know what is wrong, and whether it is possible to find the mode using Fermat's Theorem. – Mosab Shaheen Feb 23 '19 at 18:47
  • @Tim could you please help regarding my last comment? – Mosab Shaheen Feb 23 '19 at 18:48
  • @MosabShaheen The optimisation should be done subject to the constraint that $x_1 + \dots + x_n = 1$, because the formula for the Dirichlet PDF is different for values of $(x_1,\dotsc,x_n)$ where that isn't true (it's zero). That's why I used Lagrange multipliers. You're not incorporating that constraint in your calculation. – wlad Feb 23 '19 at 19:33
  • @manonlaptop Thanks, that is right. By the way in Dirichlet PDF if α1 is 0.5 and x1 is zero then PDF will be undefined (division by zero), right? I think this constraint should also be mentioned with the Dirichlet PDF. – Mosab Shaheen Feb 23 '19 at 20:44
  • This answer seems light on the intuition for *why* the mean and mode differ by this -1. Can anyone suggest why? – Rylan Schaeffer Sep 12 '20 at 17:36