
I'm looking for the limiting distribution of the multinomial distribution over $d$ outcomes, i.e., the distribution of the following:

$$\lim_{n\to \infty} n^{-\frac{1}{2}} \mathbf{X_n}$$

where $\mathbf{X_n}$ is a vector-valued random variable with density $f_n(\mathbf{x})$ for $\mathbf{x}$ such that $\sum_i x_i=n$, $x_i\in \mathbb{Z}$, $x_i\ge 0$, and 0 for all other $\mathbf{x}$, where

$$f_{n}(\mathbf{x})=n!\prod_{i=1}^d\frac{p_i^{x_i}}{x_i!}$$
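
For concreteness, here is a quick numerical check (a sketch of mine, using SciPy's `multinomial`) that this pmf matches a library implementation:

```python
from math import factorial, prod
from scipy.stats import multinomial

n, p = 10, [0.2, 0.3, 0.5]
x = [2, 3, 5]   # a valid outcome: non-negative integers summing to n

# f_n(x) = n! * prod_i p_i^{x_i} / x_i!
direct = factorial(n) * prod(pi**xi / factorial(xi) for pi, xi in zip(p, x))
print(direct, multinomial.pmf(x, n=n, p=p))   # the two values agree
```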

I found one form in Larry Wasserman's "All of Statistics" (Theorem 14.6, page 237), but the limiting distribution it gives is Normal with a singular covariance matrix, so I'm not sure how to normalize it. You could project the random vector into $(d-1)$-dimensional space to make the covariance matrix full-rank, but which projection should one use?

Update 11/5

Ray Koopman has a nice summary of the problem of the singular Gaussian. Basically, a singular covariance matrix represents perfect correlation between variables, which cannot be represented by a density over all of $\mathbb{R}^d$. However, one can get a Gaussian distribution for the conditional density, conditioned on the value of the random vector being valid (the components adding up to $n$ in the case above).

The difference for the conditional Gaussian is that the inverse is replaced with the pseudo-inverse, and the normalization factor uses the product of non-zero eigenvalues instead of the product of all eigenvalues. Ian Fiske gives a link with some details.
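
To make that recipe concrete, here is a minimal NumPy sketch (the function name and the eigenvalue tolerance are my own choices):

```python
import numpy as np

def singular_gaussian_logpdf(x, mean, cov, tol=1e-10):
    """Log-density of a possibly singular Gaussian, evaluated on its support,
    with the pseudo-inverse replacing the inverse and the product of
    non-zero eigenvalues replacing the determinant."""
    w, U = np.linalg.eigh(cov)
    keep = w > tol                        # non-zero eigenvalues
    rank = int(keep.sum())
    log_pdet = np.sum(np.log(w[keep]))    # log "pseudo-determinant"
    pinv = (U[:, keep] / w[keep]) @ U[:, keep].T
    dev = np.asarray(x, dtype=float) - mean
    return -0.5 * (rank * np.log(2 * np.pi) + log_pdet + dev @ pinv @ dev)

# Multinomial example: mean n*p, covariance n*(diag(p) - p p^T), which is singular.
n, p = 1_000, np.array([0.2, 0.3, 0.5])
cov = n * (np.diag(p) - np.outer(p, p))
x = np.array([190, 310, 500])             # a valid count vector (sums to n)
print(singular_gaussian_logpdf(x, n * p, cov))
```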

There's also a way to express the normalization factor of the conditional Gaussian without referring to eigenvalues; here's a derivation.

Yaroslav Bulatov

5 Answers


The covariance matrix is still non-negative definite (so it is still a valid multivariate normal distribution), but not positive definite: this means that (at least) one element of the random vector is a linear combination of the others.

As a result, any draw from this distribution will always lie on a lower-dimensional affine subspace of $\mathbb{R}^d$. This means it is not possible to define a density function (the distribution is concentrated on the subspace: think of the way a univariate normal concentrates at its mean when the variance is zero).

However, as suggested by Robby McKilliam, in this case you can drop the last element of the random vector. The covariance matrix of this reduced vector is the original matrix with the last row and column dropped, which is now positive definite, and the reduced vector has a density (this trick will work in other cases, but you have to be careful which element you drop, and you may need to drop more than one).
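
As a numerical illustration of this reduction (my sketch, assuming SciPy's `multivariate_normal`; the covariance used is the standard multinomial one):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, p = 10_000, np.array([0.2, 0.3, 0.5])
d = len(p)

# Multinomial covariance n*(diag(p) - p p^T); each row sums to zero, so it's singular.
Sigma = n * (np.diag(p) - np.outer(p, p))
assert np.allclose(Sigma @ np.ones(d), 0.0)

# Drop the last coordinate: the reduced covariance is positive definite
# and gives a proper density for the first d-1 counts.
Sigma_red = Sigma[:-1, :-1]
x = rng.multinomial(n, p)
print(multivariate_normal(mean=n * p[:-1], cov=Sigma_red).pdf(x[:-1]))
```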

Simon Byrne
  • What's a bit unsatisfactory is the freedom of choice: to get a valid density I need to ask for the distribution of $Ax$, where $A$ is some $(d-1)\times d$ matrix of rank $d-1$. Will the error of the CLT approximation for finite $n$ be equivalent for all choices of $A$? That's not clear to me – Yaroslav Bulatov Sep 06 '10 at 18:39
  • Yes, the error should always be the same. Keep in mind that the last element of the vector is functionally dependent on the other $(d-1)$ elements (in both the finite-sample and asymptotic cases). – Simon Byrne Sep 06 '10 at 18:46
  • It's not that the "last" element is dependent; Yaroslav's problem is that he doesn't like the idea of getting to choose which element to drop. I agree with the answer you have given, but I also think that a bit more thought and care is required here. – Robby McKilliam Sep 06 '10 at 21:30
  • @Yaroslav: Perhaps it would be good to have an idea of what application you have in mind here, because at this stage there are potentially lots of answers to your question. – Robby McKilliam Sep 06 '10 at 21:33
  • Robby -- the application I had in mind is here: http://mathoverflow.net/questions/37582/how-big-is-the-sum-of-smallest-multinomial-coefficients Basically, integrals of the Gaussian suggested by the CLT give an extremely good approximation to sums of binomial coefficients (for small $n$, even better than integrating the Gamma representation directly!), so I was seeing if I could do something similar to approximate sums of multinomial coefficients, which I need for non-asymptotic error bounds for various fitters (like maximum likelihood). – Yaroslav Bulatov Sep 06 '10 at 23:23
  • FWIW I started from some random [blog post](https://aloneinthefart.blogspot.com/2012/09/normal-approximation-of-multinomial.html) and came to the same result; drop the last row + column from the covariance matrix (and the last element of the random variable), see [my answer to What is the normal approximation of the multinomial distribution](https://stats.stackexchange.com/a/454431). I wasn't able to check the derivation of the theorem it relies on. – stephematician Mar 19 '20 at 23:54

There is no inherent problem with the singular covariance here. Your asymptotic distribution is the singular normal. See http://fedc.wiwi.hu-berlin.de/xplore/tutorials/mvahtmlnode34.html which gives the density of the singular normal.

Ian Fiske
  • Technically, the problem is that a singular covariance matrix means that some subset of variables is perfectly correlated, so the probability density should be exactly 0 in some regions, which is not possible with an ordinary Gaussian. One solution is to instead look at the conditional density, conditioned on the random variable lying in the feasible region. This looks like what they are doing in the link. I'd never heard the term "G-inverse"; I'm guessing it's the Moore-Penrose pseudo-inverse? – Yaroslav Bulatov Nov 06 '10 at 01:42
  • While it's true that a conventional $d$-dimensional Gaussian has support on all of $\mathbb{R}^d$, the singular Gaussian does not. G-inverse is the generalized inverse, and yes, I believe that the Moore-Penrose definition works here. I think there's a CLT for singular covariances, stating, as expected, convergence in distribution to the singular normal, although I can't find a reference right now. – Ian Fiske Nov 06 '10 at 02:30

It looks to me like Wasserman's covariance matrix is singular; to see this, multiply it by a vector of $d$ ones, i.e. $[1,1,\dots,1]^\prime$ of length $d$: the result is the zero vector.

Wikipedia gives the same covariance matrix anyway. If we restrict ourselves to just a binomial distribution, then the standard central limit theorem tells us that the binomial distribution (after appropriate scaling) converges to the normal as $n$ gets big (see Wikipedia again). Applying similar ideas, you should be able to show that an appropriately scaled multinomial will converge in distribution to the multivariate normal: each marginal distribution is just a binomial and converges to the normal distribution, and the covariance between the components is known.

So, I am very confident you will find that the distribution of $$\frac{X_n - np}{\sqrt{n}}$$ converges to the multivariate normal with zero mean and covariance $$\frac{C}{n}$$ where $C$ is the covariance matrix of the multinomial in question and $p$ is the vector of probabilities $[p_1,\dots,p_d]$.
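
A quick simulation along these lines (my sketch; NumPy only) checks that the empirical covariance of the scaled counts approaches $C/n = \operatorname{diag}(p) - pp^T$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5_000, np.array([0.2, 0.3, 0.5])

# Many multinomial draws, scaled as (X - n p) / sqrt(n).
X = rng.multinomial(n, p, size=100_000)
Z = (X - n * p) / np.sqrt(n)

# Empirical covariance should approach C/n = diag(p) - p p^T.
print(np.cov(Z, rowvar=False))
print(np.diag(p) - np.outer(p, p))
```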

Robby McKilliam
  • but the covariance matrix of the multinomial in question is singular, as you showed yourself... – Yaroslav Bulatov Sep 05 '10 at 22:26
  • Oh, I see your problem! One of the elements, say the $d$th, is completely dependent on the others. Probably if you chop off the last row and column of $C$ you will find that the first $d-1$ components are normally distributed, but I'll have to have a think about it. Surely this is solved somewhere already! – Robby McKilliam Sep 05 '10 at 22:41
  • One suggestion I found is to still use a Gaussian, but with the pseudo-inverse instead of the inverse and the "product of non-zero eigenvalues" in place of the determinant. For $d=2$ this seems to give the correct density form, but the normalization factor is off – Yaroslav Bulatov Sep 05 '10 at 23:26

Is it not the case that $|S_{-i}|=|S_{-j}|$ for all $i,j$, where $S_{-i}$ is the multinomial covariance matrix with the $i$-th row and column removed? Since this is the case, I don't understand what you mean by "freedom of choice", as any "choice" is equivalent.
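
A quick numerical check of this claim (my sketch; for the per-trial covariance $S = \operatorname{diag}(p) - pp^T$, every $|S_{-i}|$ comes out equal to $\prod_i p_i$):

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])
S = np.diag(p) - np.outer(p, p)   # per-trial multinomial covariance

# |S_{-i}|: determinant with the i-th row and column removed.
dets = [np.linalg.det(np.delete(np.delete(S, i, axis=0), i, axis=1))
        for i in range(len(p))]
print(dets)           # all entries agree
print(np.prod(p))     # ... and equal the product of the p_i
```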

jvdillon
  • Those matrices are not equal; here's the covariance matrix: http://yaroslavvb.com/upload/multinomial-covariance-matrix.png – Yaroslav Bulatov Nov 05 '10 at 19:31
  • Yes, this is indeed the covariance matrix. My point was that dropping any $i$th row and column results in the same normalization term for the Gaussian. Perhaps I'm missing something obvious? – jvdillon Nov 05 '10 at 19:41
  • Ah... I didn't notice the determinant sign. Hm... they do seem to be equal in some examples I tried; is there a simple proof of this? The eigenvalues are not equal, however. The motivation for the question was to find out whether the central limit theorem gives the same approximation error for finite $n$ regardless of which multinomial component you drop – Yaroslav Bulatov Nov 05 '10 at 20:06
  • Probably the easiest way to convince yourself is to note that $p_i=1-\sum_{j\ne i}p_j$ and plug that in for $p_i$ in $S$. – jvdillon Nov 05 '10 at 20:33
  • BTW, I like your application of this idea--hence my interest in responding. – jvdillon Nov 05 '10 at 20:34
  • @Ian Interesting, I hadn't heard of the singular Normal. Still though, in this example I don't think this distribution is necessary, because as you said (and I implied), one of the events is degenerate. Hence, you can lop off any row/col. – jvdillon Nov 08 '10 at 13:17

Another way to think of this is that your multinomial parameters are embedded in the simplex, so your asymptotic distribution should also be embedded in the simplex. You can get at this by instead characterizing the distribution of $y$ given by the transform

$$ y = \operatorname{ilr}(p) = V^T\log p $$

where $\operatorname{ilr}(p)$ is the isometric log-ratio transform and $V$ is an orthonormal basis of dimension $D\times(D-1)$. It doesn't actually matter which basis you use, but you could use the top $D-1$ eigenvectors from the eigendecomposition of $\log p$ for starters.

It can then be shown that the asymptotic distribution is given by

$$ \hat{y} \sim \mathcal{N}_{D-1}\Big(y,\ \frac{1}{n} V^T D^{-1} V\Big), $$ where $D=\operatorname{diag}(p)$.

See Ortego et al. 2015 for more details; the paper can be found at https://www.researchgate.net/publication/327646436_The_asymptotic_distribution_of_log-ratio_transformed_proportions_of_multinomial_count_data
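
A small sketch of this transform (mine, not from the paper), assuming SciPy's `helmert` for the orthonormal basis; `helmert(D)` has shape $(D-1)\times D$ and plays the role of $V^T$ above:

```python
import numpy as np
from scipy.linalg import helmert

n, p = 10_000, np.array([0.2, 0.3, 0.5])
D = len(p)

# Rows are orthonormal and orthogonal to the all-ones vector: one
# conventional choice of basis for the ilr transform.
Vt = helmert(D)                # shape (D-1, D)

y = Vt @ np.log(p)             # ilr-style transform of the proportions

# Asymptotic covariance (1/n) V^T diag(p)^{-1} V from the answer above.
cov = (Vt / p) @ Vt.T / n
print(y)
print(cov)
```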

mortonjt