
I need a formula for the probability of an event in an $n$-variate Bernoulli distribution $X\in\{0,1\}^n$ with given probabilities $P(X_i=1)=p_i$ for single elements and $P(X_i=1 \wedge X_j=1)=p_{ij}$ for pairs of elements. Equivalently, I could give the mean and covariance of $X$.

I already learned that there exist many $\{0,1\}^n$ distributions having these properties, just as there are many distributions having a given mean and covariance. I am looking for a canonical one on $\{0,1\}^n$, just as the Gaussian is a canonical distribution for $\mathbb{R}^n$ with a given mean and covariance.
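For reference, the equivalence with the mean and covariance is just the usual moment identities for indicator variables (a one-line check, not part of the original question): $$E[X_i] = P(X_i=1) = p_i, \qquad \operatorname{Cov}(X_i, X_j) = E[X_i X_j] - E[X_i]E[X_j] = p_{ij} - p_i p_j,$$ since $X_i X_j = 1$ exactly when $X_i = 1 \wedge X_j = 1$.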

mpiktas

4 Answers


See the following paper:

J. L. Teugels, "Some representations of the multivariate Bernoulli and binomial distributions", Journal of Multivariate Analysis, vol. 32, no. 2, Feb. 1990, pp. 256–268.

Here is the abstract:

Multivariate but vectorized versions of the Bernoulli and binomial distributions are established using the concept of the Kronecker product from matrix calculus. The multivariate Bernoulli distribution entails a parameterized model that provides an alternative to the traditional log-linear model for binary variables.

cardinal
Hamed

The random variable taking values in $\{0,1\}^n$ is a discrete random variable. Its distribution is fully described by the probabilities $p_{\mathbf{i}}=P(X=\mathbf{i})$ with $\mathbf{i}\in\{0,1\}^n$. The probabilities $p_{i}$ and $p_{ij}$ you give are sums of $p_{\mathbf{i}}$ over certain sets of indices $\mathbf{i}$.

Now it seems that you want to describe $p_{\mathbf{i}}$ using only $p_i$ and $p_{ij}$. This is not possible without assuming additional properties of $p_{\mathbf{i}}$. To see this, try to derive the characteristic function of $X$. For $n=3$ we get

\begin{align} Ee^{i(t_1X_1+t_2X_2+t_3X_3)}&=p_{000}+p_{100}e^{it_1}+p_{010}e^{it_2}+p_{001}e^{it_3}\\ &+p_{110}e^{i(t_1+t_2)}+p_{101}e^{i(t_1+t_3)}+p_{011}e^{i(t_2+t_3)}+p_{111}e^{i(t_1+t_2+t_3)}. \end{align} It is not possible to rearrange this expression so that the $p_{\mathbf{i}}$ disappear. For a Gaussian random variable, the characteristic function depends only on the mean and covariance parameters. Characteristic functions uniquely determine distributions, which is why the Gaussian can be described uniquely by its mean and covariance alone. As we can see, for the random variable $X$ this is not the case.
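A minimal numerical sketch of this point (not part of the original answer; the specific numbers are purely illustrative): two distributions on $\{0,1\}^3$ with identical $p_i$ and $p_{ij}$ but different joint cells, and hence different characteristic functions.

```python
import itertools
import numpy as np

def char_fn(p, t):
    """Characteristic function E[exp(i t.X)] of a distribution p on {0,1}^3."""
    return sum(prob * np.exp(1j * np.dot(t, x)) for x, prob in p.items())

states = list(itertools.product([0, 1], repeat=3))

# Base distribution: three independent Bernoulli(1/2) coordinates.
p_a = {x: 1 / 8 for x in states}

# Shift mass between even- and odd-parity states; every p_i and p_ij is
# unchanged, but the joint cells (e.g. p_111) change.
eps = 1 / 16
p_b = {x: 1 / 8 + eps * (-1) ** sum(x) for x in states}

# Same singleton and pairwise probabilities ...
for i in range(3):
    assert np.isclose(sum(q for x, q in p_a.items() if x[i] == 1),
                      sum(q for x, q in p_b.items() if x[i] == 1))
    for j in range(i):
        assert np.isclose(sum(q for x, q in p_a.items() if x[i] == 1 and x[j] == 1),
                          sum(q for x, q in p_b.items() if x[i] == 1 and x[j] == 1))

# ... yet the characteristic functions differ.
t = np.array([1.0, 2.0, 3.0])
print(char_fn(p_a, t), char_fn(p_b, t))
```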


panofsteel
mpiktas
  • And if the RVs take values in $\{a_i, b_i\}$ - what is the CF then? Do the terms become, e.g., $p_{110} e^{i(t_1 + t_2)} \to p_{b_1,b_2,a_3} e^{i(b_1t_1 + b_2t_2 + a_3t_3)}?$ Thank you. – Confounded Oct 15 '21 at 13:17

I don't know what the resulting distribution is called, or whether it even has a name, but it strikes me that the obvious way to set this up is to think of the model you'd use for a 2×2×2×…×2 contingency table with a log-linear (Poisson regression) model. Since you know only the first-order interactions, it is natural to assume that all higher-order interactions are zero.

Using the questioner's notation, this gives the model: $$P(X_1=x_1, X_2=x_2,\ldots,X_n=x_n) = \prod_i \left[ p_i^{x_i}(1-p_i)^{1-x_i} \prod_{j<i} \left(\frac{p_{ij}}{p_i p_j}\right)^{x_i x_j} \right] $$
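A minimal sketch (not from the answer; the $p_i$ and $p_{ij}$ values below are hypothetical) that simply evaluates the displayed product over all of $\{0,1\}^3$; as the comments below discuss, the result is not guaranteed to be a valid, normalized distribution.

```python
import itertools

def loglinear_product(x, p, p_pair):
    """Evaluate the displayed product at a binary vector x."""
    val = 1.0
    for i in range(len(x)):
        val *= p[i] ** x[i] * (1 - p[i]) ** (1 - x[i])
        for j in range(i):
            val *= (p_pair[i, j] / (p[i] * p[j])) ** (x[i] * x[j])
    return val

# Hypothetical inputs: three weakly positively correlated binary variables.
p = [0.3, 0.5, 0.4]
p_pair = {(i, j): 1.2 * p[i] * p[j] for i in range(3) for j in range(i)}

total = sum(loglinear_product(x, p, p_pair)
            for x in itertools.product([0, 1], repeat=3))
print(total)  # generally not equal to 1, so a normalizing constant would be needed
```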

onestop
  • This formula has notational problems: there are $p$'s on the left and the right. The right side makes no reference at all to the subscript $\mathbf{i}$. Furthermore, still interpreting the $p_i$ as probabilities (as in the original question), the rhs clearly is positive whereas the lhs cannot be positive. – whuber Feb 11 '11 at 15:21
  • @whuber Quite right! I stick by the model I set out in the first para, but my equation was screwed up in several ways... Goes to show I haven't actually used log-linear modelling of contingency tables since my MSc, and I haven't got the notes or books to hand. I believe I've fixed it now though. Let me know if you agree! Apols for the delay. Some days my brain just doesn't do algebra. – onestop Feb 14 '11 at 12:51
  • 1
    I don't think this works. Assume $p_i=1/n$ and $p_{ij}=0 \forall i \ne j$. This is a valid combination of probabilities, realized when $I$ is a uniform random variable $\in\{1,...,n\}$ and $X_I=1$ and all $X_j=0 \forall j\ne I$. Still the formula above would be 0 for all events. Still thanks for helping! –  Feb 22 '11 at 16:20

An $n$-dimensional Bernoulli distribution can be expressed in terms of an $n \times n$ matrix $\Sigma$, which plays a role analogous to the covariance matrix of the Gaussian distribution but is not necessarily symmetric. The diagonal elements of $\Sigma$ give the probabilities for single elements, $p(X_i=1) = \Sigma_{ii} = \mu_i$. Probabilities for pairs of elements are given by the determinant of the corresponding $2 \times 2$ submatrix of $\Sigma$: \begin{align*} p(X_i=1, X_j=1)=\det \begin{bmatrix} \Sigma_{ii} & \Sigma_{ij} \\ \Sigma_{ji} & \Sigma_{jj} \end{bmatrix}. \end{align*}

In other words, the covariance between $X_i$ and $X_j$ is a product of off-diagonal elements: \begin{align*} \mathrm{Cov}[X_i, X_j]=\mathrm{E}[(X_i-\mu_i)(X_j-\mu_j)] = -\Sigma_{ij} \Sigma_{ji}. \end{align*} Hence the covariance alone cannot uniquely determine the off-diagonal elements of $\Sigma$. However, the model parameters of a distribution with a given mean and covariance can be obtained by the principle of maximum entropy.

I think the above distribution is a canonical distribution for multivariate binary random variables in the sense that it shares properties similar to those of the multivariate Gaussian distribution. See the following paper for further details:
T. Arai, "Multivariate binary probability distribution in the Grassmann formalism", Physical Review E 103, 062104, 2021.
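A minimal numerical sketch (the matrix entries below are hypothetical and not taken from the paper) checking the two relations stated above for a single pair of variables:

```python
import numpy as np

# Hypothetical, not-necessarily-symmetric Sigma for n = 2.
Sigma = np.array([[0.4,  0.3],
                  [-0.2, 0.5]])

mu = np.diag(Sigma)                     # marginal probabilities p(X_i = 1)
p11 = np.linalg.det(Sigma)              # p(X_1 = 1, X_2 = 1) via the 2x2 determinant
cov = p11 - mu[0] * mu[1]               # Cov[X_1, X_2] = E[X_1 X_2] - mu_1 mu_2
print(cov, -Sigma[0, 1] * Sigma[1, 0])  # both equal 0.06, i.e. -Sigma_12 * Sigma_21
```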

  • This is intriguing. But given that the generic $n$-dimensional Bernoulli distribution determines (and is determined by) $2^n-n+1$ linearly independent probabilities and any $n\times n$ matrix can parameterize a manifold of at most $n^2$ dimensions, it looks like the matrix approach is not as general as you claim by the time $n \ge 5.$ Am I misunderstanding your post? – whuber Jan 18 '22 at 15:51
  • The joint probability itself is given by the principal minor of the matrix $\Lambda - I$ divided by $\det \Lambda$, where $\Lambda=\Sigma^{-1}$ and $I$ is the identity matrix. Since there are $2^n$ principal minors in an $n$ by $n$ matrix, $\Sigma$ is sufficient to specify the joint probabilities. We do not have to specify all $2^n$ joint probabilities directly by model parameters. – Takashi Arai Jan 19 '22 at 19:33
  • You are thereby claiming that the space of all Bernoulli distributions on $n$ variables really has at most $n^2$ dimensions, but that's just not true. Your approach therefore must implicitly be imposing conditions and thereby limiting the possibilities. It's important to recognize that. – whuber Jan 19 '22 at 21:41