
Let's suppose we have $4$ discrete random variables, say $X_1, X_2, X_3, X_4$, with $3,2,2$ and $3$ states, respectively.

Then the joint probability distribution would require $3 \cdot 2 \cdot 2 \cdot 3 - 1$ parameters (we don't know of any independence relations). Applying the Chain Rule, and using the fact that you need one parameter, $p$, for the distribution of a node with two states and $2$ for a node with three states, we have

$$P(X_4,X_3,X_2,X_1)=P(X_4 | X_3,X_2,X_1)P(X_3 | X_2,X_1)P(X_2 | X_1)P(X_1)$$

so we need $3 \cdot 2 \cdot 2 \cdot 2$ parameters for the first conditional probability distribution (as there are $2 \cdot 2 \cdot 3$ combinations of the first three variables and we need the $2$ parameters of $X_4$ for each one), $3 \cdot 2$ for the second one, $3$ for the third one and $2$ for the last one.

So... do we need $3 \cdot 2 \cdot 2 \cdot 2 +3 \cdot 2 + 3 +2 $ parameters?

Is it actually true that $3 \cdot 2 \cdot 2 \cdot 2 + 3 \cdot 2 + 3 + 2 = 3 \cdot 2 \cdot 2 \cdot 3$?
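For a quick sanity check, here is a minimal Python sketch (using only the state counts given above) that compares the two counts:

```python
from math import prod

states = [3, 2, 2, 3]  # numbers of states of X1, X2, X3, X4

# Chain-rule count: the factor for X_i needs (states[i] - 1) parameters
# for every combination of the conditioning variables X_1, ..., X_{i-1}.
chain_rule = sum((k - 1) * prod(states[:i]) for i, k in enumerate(states))

joint = prod(states)  # size of the raw joint table

print(chain_rule)               # 35, i.e. 24 + 6 + 3 + 2
print(joint)                    # 36
print(chain_rule == joint - 1)  # True
```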

  • All your quantities are equal, except for the right-hand side of your last expression, from which you have to subtract 1 (as in your 2nd sentence). – Mark L. Stone Jul 09 '17 at 18:01

3 Answers


It takes $3\times 2 \times 2 \times 3 = 36$ numbers to write down a probability distribution on all possible values of these variables. They are redundant, because they must sum to $1$. Therefore the number of (functionally independent) parameters is $35$.

If you need more convincing (that was a rather hand-waving argument), read on.


By definition, a sequence of such random variables is a measurable function

$$\mathbf{X}=(X_1,X_2,X_3,X_4):\Omega\to\mathbb{R}^4$$

defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. By limiting the range of $X_1$ to a set of three elements ("states"), etc., you guarantee the range of $\mathbf{X}$ itself is limited to $3\times 2\times 2 \times 3=36$ possible values. Any probability distribution for $\mathbf{X}$ can be written as a set of $36$ probabilities, one for each one of those values. The axioms of probability impose $36+1$ constraints on those probabilities: they must be nonnegative ($36$ inequality constraints) and sum to unity (one equality constraint).
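To make this concrete, here is a minimal Python sketch (the state labels are invented; only the counts $3, 2, 2, 3$ matter) that enumerates the $36$ joint values:

```python
from itertools import product

# Hypothetical state labels for X1, X2, X3, X4.
X1, X2, X3, X4 = ["a", "b", "c"], ["u", "v"], ["y", "z"], ["p", "q", "r"]

joint_values = list(product(X1, X2, X3, X4))
print(len(joint_values))  # 36: a distribution assigns one probability to each,
                          # subject to 36 inequality constraints (nonnegativity)
                          # and 1 equality constraint (summing to 1)
```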

Conversely, any set of $36$ numbers satisfying all $37$ constraints gives a possible probability measure on $\Omega$. It should be obvious how this works, but to be explicit, let's introduce some notation:

  • Let the possible values of $X_i$ be $a_i^{(1)}, a_i^{(2)}, \ldots, a_i^{(k_i)}$ where $X_i$ has $k_i$ possible values.

  • Let the nonnegative numbers, summing to $1$, associated with $\mathbf{a}=(a_1^{(i_1)}, a_2^{(i_2)}, a_3^{(i_3)}, a_4^{(i_4)})$ be written $p_{i_1i_2i_3i_4}$.

  • For any vector of possible values $\mathbf{a}$ for $\mathbf{X}$, we know (because random variables are measurable) that $$\mathbf{X}^{-1}(\mathbf{a}) = \{\omega\in\Omega\mid \mathbf{X}(\omega)=\mathbf{a}\}$$ is a measurable set (in $\mathcal{F}$). Define $$\mathbb{P}\left(\mathbf{X}^{-1}(\mathbf{a})\right) = p_{i_1i_2i_3i_4}.$$

It is trivial to check that $\mathbb{P}$ is a probability measure on $(\Omega, \mathcal{F})$.

The set of all such $p_{i_1i_2i_3i_4}$ ($36$ nonnegative numbers, one for each combination of subscripts, summing to unity) forms the unit simplex in $\mathbb{R}^{36}$.

We have thereby established a natural one-to-one correspondence between the points of this simplex and the set of all possible probability distributions of all such $\mathbf{X}$ (regardless of what $\Omega$ or $\mathcal{F}$ might happen to be). The unit simplex in this case is a $36-1=35$-dimensional submanifold-with-corners: any continuous (or differentiable, or algebraic) coordinate system for this set requires $35$ numbers.
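To see the correspondence in action, one can draw a uniformly random point of this simplex from a flat Dirichlet distribution; each draw is a complete joint distribution for $\mathbf{X}$. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# A uniform draw from the unit simplex in R^36: 36 nonnegative
# coordinates summing to 1, i.e. one joint distribution for X.
p = rng.dirichlet(np.ones(36))

print(p.min() >= 0)              # True: the 36 inequality constraints
print(np.isclose(p.sum(), 1.0))  # True: the single equality constraint
```

Given any $35$ of the coordinates, the last is determined, which is another way of seeing that the simplex is $35$-dimensional.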


This construction is closely related to a basic tool used by Efron, Tibshirani, and others for studying the Bootstrap as well as to the influence function used to study M-estimators. It is called the "sampling representation."

To see the connection, suppose you have a batch of $36$ data points $y_1, y_2, \ldots, y_{36}$. A bootstrap sample consists of $36$ independent realizations from the random variable $\mathbf X$ that has a $p_1=1/36$ chance of equaling $y_1$, a $p_2=1/36$ chance of equaling $y_2$, and so on: it is the empirical distribution.
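In code, a bootstrap sample is just sampling with replacement, which implements exactly those equal $1/36$ probabilities. A minimal numpy sketch (the data are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=36)  # hypothetical batch of 36 data points

# One bootstrap sample: 36 iid draws from the empirical distribution,
# each draw picking one of the y_i with probability 1/36.
boot = rng.choice(y, size=36, replace=True)
```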

To understand the properties of the Bootstrap and other resampling statistics, Efron et al. consider modifying this to some other distribution where the $p_i$ are no longer necessarily equal to one another. For instance, by changing $p_k$ to $1/36 + \epsilon$ and changing all the other $p_j$ ($j\ne k$) by $-\epsilon/35$, you obtain (for sufficiently small $\epsilon$) a distribution that represents overweighting the data value $y_k$ (when $\epsilon$ is positive), underweighting it (when $\epsilon$ is negative), or even deleting it altogether (when $\epsilon=-1/36$), which leads to the "Jackknife".

As such, this representation of all the weighted resampling possibilities by means of a vector $\mathbf{p} = (p_1,p_2, \ldots, p_{36})$ allows us to visualize and reason about different resampling schemes as points on the unit simplex. The influence function of the data value $y_k$ for any (differentiable) functional statistic $t$, for instance, is simply proportional to the partial derivative of $t(\mathbf{p})$ with respect to $p_k$.
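A minimal numerical sketch of that last claim (the data, the statistic $t$, and the step size are all invented for illustration): perturb $p_k$ as described above and take a finite-difference derivative of a weighted-mean statistic.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=36)  # hypothetical batch of data
n = len(y)

def t(p):
    """A differentiable functional statistic: the p-weighted mean."""
    return np.dot(p, y)

def perturbed(k, eps):
    """Set p_k = 1/n + eps and shift every other weight by -eps/(n-1).
    The weights still sum to 1 for any eps."""
    p = np.full(n, 1 / n) - eps / (n - 1)
    p[k] = 1 / n + eps
    return p

k, eps = 0, 1e-6
deriv = (t(perturbed(k, eps)) - t(perturbed(k, -eps))) / (2 * eps)
# Equals (n / (n - 1)) * (y[k] - y.mean()): proportional to the classical
# influence of y_k on the mean.
print(deriv)
```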

Reference

Efron, B. and Tibshirani, R. J. (1993), *An Introduction to the Bootstrap* (Chapters 20 and 21).

whuber

The number of parameters needed to represent a random variable is only defined with reference to a model, that is, a family of cumulative distribution functions equipped with a set of parameters that can be used to index them. For example, a normally distributed random variable with mean 3 and standard deviation 1 could be represented with a 0-parameter model (where the only legal distribution is $N(3, 1)$), a 2-parameter model (e.g., $N(\mu, \sigma)$ where $\mu$ and $\sigma$ are parameters), or a 4-parameter model (e.g., $N(\mu_1, \sigma_1) + N(\mu_2, \sigma_2)$).
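As an illustration of this model-relativity (a minimal sketch using scipy.stats; the parameter values in the 4-parameter case are one arbitrary choice among many), all three models below can produce the same $N(3,1)$ distribution:

```python
import math
from scipy import stats

def model0():
    # 0-parameter model: the only legal distribution is N(3, 1).
    return stats.norm(loc=3, scale=1)

def model2(mu, sigma):
    # 2-parameter model: N(mu, sigma).
    return stats.norm(loc=mu, scale=sigma)

def model4(mu1, s1, mu2, s2):
    # 4-parameter model: the sum of two independent normals is
    # N(mu1 + mu2, sqrt(s1^2 + s2^2)).
    return stats.norm(loc=mu1 + mu2, scale=math.sqrt(s1**2 + s2**2))

for d in (model0(), model2(3, 1), model4(1, 0.6, 2, 0.8)):
    print(d.mean(), d.std())  # 3.0 1.0 in every case
```

The 4-parameter model is deliberately redundant: many different parameter settings yield the same distribution.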

Kodiologist
    The question asks how many parameters are required by the family of *all* probability distributions for the random variable $(X_1,X_2,X_3,X_4)$. Because that family can be identified with the unit simplex in $\mathbb{R}^{3\times 2\times 2\times 3}$, there is a unique correct answer. – whuber Jul 10 '17 at 13:43
  • @whuber "The question asks how many parameters are required by the family of all probability distributions" — News to me. – Kodiologist Jul 10 '17 at 14:53
  • @whuber But more to the point, any model with finitely many parameters can be expressed as a 1-parameter model, because you can encode a finite set of real numbers into a single real number by interleaving the decimal digits. – Kodiologist Jul 10 '17 at 16:10
  • That's not right, because a parameter must establish more than a mere one-to-one relationship: it must be *continuous*. Note that I did not say *all* probability distributions: I said distributions for *this* random variable. The "joint probability distribution" referred to in the question must be one of those. – whuber Jul 10 '17 at 17:35
  • @whuber "a parameter must establish more than a mere one-to-one relationship: it must be *continuous*." — I'm not familiar with such a requirement. – Kodiologist Jul 10 '17 at 17:40
  • It is very hard to find formal definitions--almost all books gloss over the issues by assuming we know what kinds of parameters make sense. There are various ways to conceive of what I meant by "continuous." One was published by Peter McCullagh, *What Is a Statistical Model?* Annals of Statistics (2002) Vol. 30, No. 5, pp 1225-1310. He provides a mathematical formulation intended to supply definitions of the concepts of "a model 'making sense' and a parameter 'having a meaning'" (p. 1237). The parameterization you propose typically is not meaningful. – whuber Jul 10 '17 at 20:21

I would get concrete here. Suppose one has this table:

| t    | w    | p(t,w) |
|------|------|--------|
| hot  | sun  | 0.4    |
| hot  | rain | 0.1    |
| cold | sun  | 0.2    |
| cold | rain | 0.3    |

To specify all of $p(t,w)$, do we need four parameters? Not quite: we can get away with three and let the last one be $1$ minus the sum of the other three.

Now, what if $t$ and $w$ are independent? To generate the whole table, we would only need $2-1=1$ parameter for $t$, say $p(t=\mathrm{hot})$, and $2-1=1$ parameter for $w$, say $p(w=\mathrm{sun})$.
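A minimal Python sketch of both counts (the marginals are computed from the table above; note that this particular table is *not* independent, so the reconstruction in the second part yields a different joint):

```python
# Full joint table: 3 free parameters, the 4th is determined.
p_hot_sun, p_hot_rain, p_cold_sun = 0.4, 0.1, 0.2
p_cold_rain = 1 - (p_hot_sun + p_hot_rain + p_cold_sun)
print(p_cold_rain)  # 0.3 (up to floating-point rounding)

# Under independence: 1 parameter for t plus 1 for w suffice.
p_hot = p_hot_sun + p_hot_rain  # p(t=hot) = 0.5
p_sun = p_hot_sun + p_cold_sun  # p(w=sun) = 0.6
table = {
    (t, w): (p_hot if t == "hot" else 1 - p_hot)
            * (p_sun if w == "sun" else 1 - p_sun)
    for t in ("hot", "cold")
    for w in ("sun", "rain")
}
print(table)  # the whole 2x2 table from just those 2 numbers
```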

Minh Triet