
The Beta distribution appears under two parametrizations:

$$ f(x) \propto x^{\alpha} (1-x)^{\beta} \tag{1} $$

or in the form that seems to be used more commonly:

$$ f(x) \propto x^{\alpha-1} (1-x)^{\beta-1} \tag{2} $$

But why exactly is there "$-1$" in the second formula?

The first formulation intuitively seems to correspond more directly to the binomial distribution

$$ g(k) \propto p^k (1-p)^{n-k} \tag{3} $$

but "seen" from the $p$'s perspective. This is especially clear in beta-binomial model where $\alpha$ can be understood as a prior number of successes and $\beta$ is a prior number of failures.

So why exactly did the second form gain popularity, and what is the rationale behind it? What are the consequences of using either parametrization (e.g., for the connection with the binomial distribution)?

It would be great if someone could additionally point to the origins of this choice and the initial arguments for it, but that is not a necessity for me.

Tim
  • A deep reason is hinted at in [this answer](http://stats.stackexchange.com/a/185709/919): $f$ equals $x^\alpha(1-x)^\beta$ relative to the measure $d\mu=dx/(x(1-x))$. That reduces your question to "why that particular measure"? Recognizing that this measure is $$d\mu=d\left(\log\left(\frac{x}{1-x}\right)\right)$$ suggests the "right" way to understand these distributions is to apply the logistic transformation: the "$-1$" terms will then disappear. – whuber Feb 20 '17 at 15:38
  • @whuber I'm afraid your comment is not totally clear to me... Maybe you could expand it into an answer (especially on "why" and "so what")? – Tim Feb 20 '17 at 21:51
  • I think the actual reason it happened is the historical one -- because it appears that way in the [beta function](https://en.wikipedia.org/wiki/Beta_function) for which the distribution is named. As for why *that* has $-1$ in the power, I expect that would ultimately be connected to the reason whuber mentions (though historically it has nothing to do with measure or even probability). – Glen_b Feb 21 '17 at 02:03
  • @Glen_b even so, why did it get popular then? I would still argue that the first parametrization seems more "intuitive", but I may be wrong. – Tim Feb 21 '17 at 10:28
  • @Glen_b It's more than historical: there are profound reasons. They are due to the intimate connection between Beta and Gamma functions, reducing the question to why the exponent in $\Gamma(s)=\int_0^\infty t^{s-1}e^{-t}dt$ is $s-1$ and not $s$. *That* is because [$\Gamma$ is a Gauss sum](http://mathoverflow.net/a/185594). Equivalently, it is "right" to view $\Gamma$ as an integral of a multiplicative homomorphism $t\to t^s$ times an additive character $t\to e^{-t}$ against the Haar measure $dt/t$ on the multiplicative group $\mathbb{R}^{\times}$. – whuber Feb 21 '17 at 18:51
  • @w.h That's a good reason why the gamma function should be chosen to be that way (and I already suggested such a reason existed above and I accept some form of reasoning akin to that - but necessarily with different formalism - came into Euler's choice); correspondingly compelling reasons occur with the density; but that doesn't establish that this was actually the reason for the choice (why the form was chosen as it was), only that it's a good reason to do so. The form of the gamma function ...ctd – Glen_b Feb 21 '17 at 21:38
  • ctd... alone could easily be enough reason to choose that form for the density and for others to follow suit. [Often choices are made for simpler reasons than the ones we can identify afterward, and then it often takes compelling reasons to do anything else. Do we know that was why it was initially chosen?] -- you explain clearly that there's a reason why we *should* choose the density to be that way, rather than why it *is* that way. That involves a sequence of people making choices (to use it that way, and to follow suit), and their reasons at the time they chose. – Glen_b Feb 21 '17 at 21:38
  • @Glen It's unlikely anyone explicitly stated the "actual reason" historically. It has been noted that Euler changed his parametrization of both Gamma and Beta between 1729 and 1763, finally arriving at the modern one, and that Legendre used the modern parametrization. In scanning over a host of formulae in Whittaker & Watson, I am struck by the simplification effected by this change: where there would otherwise be an abundance of "+1" and "+2" expressions, one tends to see the parameters all by themselves. This would strike any mathematician as sufficient reason to change the parametrization. – whuber Feb 23 '17 at 15:44
  • @whuber thanks for all your comments. I edited my question to be more clear: actual historical reasons are of lesser concern for me, I am rather interested in arguments and consequences of those two parameterizations. It would be great if you could translate your comments into some answer. – Tim Feb 23 '17 at 23:37

3 Answers


This is a story about degrees of freedom and statistical parameters and why it is nice that the two have a direct simple connection.

Historically, the "$-1$" terms appeared in Euler's studies of the Beta function. He was using that parameterization by 1763, and so was Adrien-Marie Legendre: their usage established the subsequent mathematical convention. This work antedates all known statistical applications.

Modern mathematical theory provides ample indications, through the wealth of applications in analysis, number theory, and geometry, that the "$-1$" terms actually have some meaning. I have sketched some of those reasons in comments to the question.

Of more interest is what the "right" statistical parameterization ought to be. That is not quite as clear and it doesn't have to be the same as the mathematical convention. There is a huge web of commonly used, well-known, interrelated families of probability distributions. Thus, the conventions used to name (that is, parameterize) one family typically imply related conventions to name related families. Change one parameterization and you will want to change them all. We might therefore look at these relationships for clues.

Few people would disagree that the most important distribution families derive from the Normal family. Recall that a random variable $X$ is said to be "Normally distributed" when $(X-\mu)/\sigma$ has a probability density $f(x)$ proportional to $\exp(-x^2/2)$. When $\sigma=1$ and $\mu=0$, $X$ is said to have a standard normal distribution.

Many datasets $x_1, x_2, \ldots, x_n$ are studied using relatively simple statistics involving rational combinations of the data and low powers (typically squares). When those data are modeled as random samples from a Normal distribution--so that each $x_i$ is viewed as a realization of a Normal variable $X_i$, all the $X_i$ share a common distribution, and are independent--the distributions of those statistics are determined by that Normal distribution. The ones that arise most often in practice are

  1. $t_\nu$, the Student $t$ distribution with $\nu = n-1$ "degrees of freedom." This is the distribution of the statistic $$t = \frac{\bar X}{\operatorname{se}(X)}$$ where $\bar X = (X_1 + X_2 + \cdots + X_n)/n$ models the mean of the data and $\operatorname{se}(X) = (1/\sqrt{n})\sqrt{\left(X_1^2+X_2^2 + \cdots + X_n^2 - n\bar X^2\right)/(n-1)}$ is the standard error of the mean. The division by $n-1$ shows that $n$ must be $2$ or greater, whence $\nu$ is an integer $1$ or greater. The formula, although apparently a little complicated, is the square root of a rational function of the data of degree two: it is relatively simple.

  2. $\chi^2_\nu$, the $\chi^2$ (chi-squared) distribution with $\nu$ "degrees of freedom" (d.f.). This is the distribution of the sum of squares of $\nu$ independent standard Normal variables. The distribution of the mean of the squares of these variables will therefore be a $\chi^2$ distribution scaled by $1/\nu$: I will refer to this as a "normalized" $\chi^2$ distribution.

  3. $F_{\nu_1, \nu_2}$, the $F$ ratio distribution with parameters $(\nu_1, \nu_2)$, is the ratio of two independent normalized $\chi^2$ distributions with $\nu_1$ and $\nu_2$ degrees of freedom. (A short simulation sketch following this list illustrates all three constructions.)
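
Here is that sketch (assuming `numpy` and `scipy`; the sample size and seed are arbitrary, and the Kolmogorov-Smirnov distance is just one convenient check):

```python
# Simulation sketch checking the three constructions against scipy's
# named distributions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 5, 100_000
x = rng.standard_normal((reps, n))
y = rng.standard_normal((reps, n))

# 1. Student t with nu = n - 1 d.f.: mean over its standard error
t = x.mean(axis=1) / (x.std(axis=1, ddof=1) / np.sqrt(n))
# 2. chi-squared with n d.f.: sum of n squared standard Normals
chisq = (x ** 2).sum(axis=1)
# 3. F with (n, n) d.f.: ratio of two independent normalized chi-squareds
f = (x ** 2).mean(axis=1) / (y ** 2).mean(axis=1)

# Kolmogorov-Smirnov distances to the named distributions: all near zero
print(stats.kstest(t, stats.t(df=n - 1).cdf).statistic)
print(stats.kstest(chisq, stats.chi2(df=n).cdf).statistic)
print(stats.kstest(f, stats.f(dfn=n, dfd=n).cdf).statistic)
```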

Mathematical calculations show that all three of these distributions have densities. Importantly, the density of the $\chi^2_\nu$ distribution is proportional to the integrand in Euler's integral definition of the Gamma ($\Gamma$) function. Let's compare them:

$$f_{\chi^2_\nu}(2x) \propto x^{\nu/2 - 1}e^{-x};\quad f_{\Gamma(\nu)}(x) \propto x^{\nu-1}e^{-x}.$$

This shows that half of a $\chi^2_\nu$ variable has a Gamma distribution with parameter $\nu/2$. The factor of one-half is bothersome enough, but subtracting $1$ would make the relationship much worse. This already supplies a compelling answer to the question: if we want the parameter of a $\chi^2$ distribution to count the number of squared Normal variables that produce it (up to a factor of $1/2$), then the exponent in its density function must be one less than half that count.
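
This relationship is easy to check numerically; here is a minimal sketch with `scipy` (the choice $\nu = 7$ is arbitrary):

```python
# Check that half of a chi-squared(nu) variable is Gamma(nu/2): the
# density of X/2 at x equals 2 * f_{chi-squared, nu}(2 x).
import numpy as np
from scipy import stats

nu = 7
x = np.linspace(0.1, 10, 50)
print(np.allclose(stats.gamma(a=nu / 2).pdf(x),
                  2 * stats.chi2(df=nu).pdf(2 * x)))  # True
```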

Why is the factor of $1/2$ less troublesome than a difference of $1$? The reason is that the factor remains consistent when we add things up. Half the sum of squares of $n$ independent standard Normals has a Gamma distribution with parameter $n/2$; half the sum of squares of $m$ independent standard Normals has a Gamma distribution with parameter $m/2$; whence half the sum of squares of all $n+m$ variables has a Gamma distribution with parameter $(n+m)/2 = n/2 + m/2$. The fact that adding the parameters so closely emulates adding the counts is very helpful.
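
A short simulation sketch illustrates the additivity (the shape values $1.5$ and $2.5$ are arbitrary):

```python
# Simulation sketch: Gamma(a) + Gamma(b) ~ Gamma(a + b) for independent
# unit-scale variables, mirroring the way degrees of freedom add.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a, b, reps = 1.5, 2.5, 100_000
total = rng.gamma(a, size=reps) + rng.gamma(b, size=reps)
print(stats.kstest(total, stats.gamma(a=a + b).cdf).statistic)  # near zero
```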

If, however, we were to remove that pesky-looking "$-1$" from the mathematical formulas, these nice relationships would become more complicated. For example, if we changed the parameterization of Gamma distributions to refer to the actual power of $x$ in the formula, so that a $\chi^2_2$ distribution would be related to a "Gamma$(0)$" distribution (since the power of $x$ in its PDF is $1-1=0$), then the sum of three independent $\chi^2_2$ variables would have to be called a "Gamma$(2)$" distribution (it is a $\chi^2_6$, whose PDF has $x$ raised to the power $6/2-1=2$). In short, the close additive relationship between degrees of freedom and the parameter in Gamma distributions would be lost by removing the $-1$ from the formula and absorbing it in the parameter.

Similarly, the probability function of an $F$ ratio distribution is closely related to Beta distributions. Indeed, when $Y$ has an $F$ ratio distribution, the distribution of $Z=\nu_1 Y/(\nu_1 Y + \nu_2)$ has a Beta$(\nu_1/2, \nu_2/2)$ distribution. Its density function is proportional to

$$f_Z(z) \propto z^{\nu_1/2 - 1}(1-z)^{\nu_2/2-1}.$$
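
This relationship, too, can be verified numerically; a minimal sketch (the degrees of freedom are arbitrary):

```python
# Sketch: if Y ~ F(nu1, nu2), then Z = nu1*Y / (nu1*Y + nu2) is
# Beta(nu1/2, nu2/2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
nu1, nu2 = 4, 9
y = stats.f(dfn=nu1, dfd=nu2).rvs(size=100_000, random_state=rng)
z = nu1 * y / (nu1 * y + nu2)
print(stats.kstest(z, stats.beta(nu1 / 2, nu2 / 2).cdf).statistic)  # near zero
```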

Furthermore--taking these ideas full circle--the square of a Student $t$ distribution with $\nu$ d.f. has an $F$ ratio distribution with parameters $(1,\nu)$. Once more it is apparent that keeping the conventional parameterization maintains a clear relationship with the underlying counts that contribute to the degrees of freedom.

From a statistical point of view, then, it would be most natural and simplest to use a variation of the conventional mathematical parameterizations of $\Gamma$ and Beta distributions: we should prefer calling a $\Gamma(\alpha)$ distribution a "$\Gamma(2\alpha)$ distribution" and the Beta$(\alpha, \beta)$ distribution ought to be called a "Beta$(2\alpha, 2\beta)$ distribution." In fact, we have already done that: this is precisely why we continue to use the names "Chi-squared" and "$F$ Ratio" distribution instead of "Gamma" and "Beta". Regardless, in no case would we want to remove the "$-1$" terms that appear in the mathematical formulas for their densities. If we did that, we would lose the direct connection between the parameters in the densities and the data counts with which they are associated: we would always be off by one.

whuber
  • Thanks for your answer (I +1d already). I have just a small follow-up question: maybe I'm missing something, but aren't we sacrificing the direct relation with the binomial by using the $-1$ parametrization? – Tim Feb 27 '17 at 08:40
  • I'm not sure which "direct relation with binomial" you're referring to, Tim. For instance, when the Beta$(a,b)$ distribution is used as a conjugate prior for a Binomial sample, clearly the parameters are exactly the right ones to use: you add $a$ (not $a-1$) to the number of successes and $b$ (not $b-1$) to the number of failures. – whuber Mar 23 '17 at 21:01

The notation is misleading you. There is a "hidden $-1$" in your formula $(1)$: there, $\alpha$ and $\beta$ must be bigger than $-1$ (sources that present form $(1)$ state this explicitly). The $\alpha$'s and $\beta$'s in the two formulas are not the same parameters; they have different ranges: in $(1)$, $\alpha,\beta>-1$, while in $(2)$, $\alpha,\beta>0$. These ranges are necessary to guarantee that the integral of the density doesn't diverge. To see this, consider in $(1)$ the case $\alpha=-1$ (or less) and $\beta=0$, and try to integrate the (kernel of the) density between $0$ and $1$. Equivalently, try the same in $(2)$ with $\alpha=0$ (or less) and $\beta=1$.
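
A small numerical sketch makes both points: `scipy.stats.beta` uses form $(2)$, so a density written in form $(1)$ with exponents $(\alpha, \beta)$ corresponds to parameters $(\alpha+1, \beta+1)$, and the normalizing constant blows up as $\alpha \to -1$:

```python
# Sketch: with beta = 0 in form (1), the normalizing constant is
# B(alpha + 1, 1) = 1/(alpha + 1), which diverges as alpha -> -1.
from scipy.special import beta as B

for alpha in (-0.9, -0.99, -0.999):
    print(alpha, B(alpha + 1, 1.0))  # prints 10, 100, 1000, ...
```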

Zen
  • The issue of a range of definition for $\alpha$ and $\beta$ seems to go away when the integral is interpreted, as Pochhammer did in 1890, as a specific contour integral. In that case it can be equated to an expression that determines an analytic function for all values of $\alpha$ and $\beta$--including all complex ones. This throws light on the concern in the question: why exactly has this specific parameterization been adopted, given there are many other possible parameterizations that seem like they might serve equally well? – whuber Feb 23 '17 at 18:41
  • To me, the OP's doubt seems to be much more basic. He's kind of confused about the "-1" in (2), but not in (1) (not true, of course). It seems that your comment is answering a different question (much more interesting, by the way). – Zen Feb 23 '17 at 20:32
  • Thanks for your effort and answer, but it still does not answer my main concern: why was $-1$ chosen? Following your logic, basically *any* value could be chosen, changing the arbitrary lower bound to something else. I can't see why $-1$ or $0$ would be a better or worse lower bound for parameter values, besides the fact that $0$ is the "aesthetically" nicer bound. On the other hand, Beta(0, 0) would be a nice "default" for the uniform distribution when using the first form. Yes, those are very subjective comments, but that is my main point: are there any non-arbitrary reasons for such a choice? – Tim Feb 23 '17 at 23:33
  • Zen, I agree there was a question of how to interpret the original post. Thank you, Tim, for your clarifications. – whuber Feb 24 '17 at 00:33
  • Hi, Tim! I don't see any definitive reason, although it makes more direct the connection with the fact that for $\alpha,\beta>0$, if $U\sim\mathrm{Gamma}(\alpha,1)$ and $V\sim\mathrm{Gamma}(\beta,1)$ are independent, then $X=U/(U+V)$ is $\mathrm{Beta}(\alpha,\beta)$, and the density of $X$ is proportional to $x^{\alpha-1}(1-x)^{\beta-1}$. But then you can question the parameterization of the gamma distribution... – Zen Feb 24 '17 at 07:01
  • Historically, it seems that the first parameterization was the one originally used. Check the Type I distribution in the Pearson family. https://en.wikipedia.org/wiki/Pearson_distribution#The_Pearson_type_I_distribution – Zen Feb 24 '17 at 07:10

For me, the "$-1$" in the exponent is related to the development of the Gamma function. The motivation for the Gamma function was to find a smooth curve connecting the points of the factorial $x!$. Since it is not possible to compute $x!$ directly when $x$ is not an integer, the idea was to find a function, defined for any $x \geq 0$, that satisfies the recurrence relation defined by the factorial, namely

$$f(1)=1, \qquad f(x+1)=x \cdot f(x).$$

The solution was found by means of a convergent integral. For the function defined as

$$f(x+1) = \int_{0}^{\infty} t^{x}e^{-t}\, dt,$$

integration by parts provides the following:

$$\begin{align} f(x+1) &= \int_{0}^{\infty} t^{x}e^{-t}\, dt \\ &= \Big[-t^{x}e^{-t} \Big]_{0}^{\infty} + \int_{0}^{\infty} x\, t^{x-1}e^{-t}\, dt \\ &= \lim_{t \to \infty} \left(-t^{x}e^{-t}\right) - \left(-0^{x}e^{-0}\right) + x \int_{0}^{\infty} t^{x-1}e^{-t}\, dt \\ &= 0 + 0 + x \int_{0}^{\infty} t^{x-1}e^{-t}\, dt \\ &= x \cdot f(x). \end{align}$$

So the function above satisfies this property, and the $-1$ in the exponent of the integrand derives from the procedure of integration by parts. See the Wikipedia article https://en.wikipedia.org/wiki/Gamma_function .
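
A numerical sketch (using `scipy`; the value $x = 2.5$ is arbitrary) confirms both the integral identity and the recurrence:

```python
# Numeric sketch: the defining integral equals Gamma(x + 1), and the
# recurrence f(x + 1) = x * f(x) holds.
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

x = 2.5  # arbitrary non-integer test value
integral, _ = quad(lambda t: t ** x * np.exp(-t), 0, np.inf)
print(np.isclose(integral, gamma(x + 1)))  # True
print(np.isclose(integral, x * gamma(x)))  # True
```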

Edit: I apologise if my post is not fully clear; I am just trying to point out that, in my view, the $-1$ in the beta distribution comes from the generalisation of the factorial by means of the Gamma function. There are two conditions: $f(1)=1$ and $f(x+1)=x \cdot f(x)$. We have $\Gamma(x) = (x-1)!$, so it satisfies $\Gamma(x+1) = x \cdot \Gamma(x) = x \cdot (x-1)! = x!$. In addition, we have $\Gamma(1) = (1-1)! = 0! = 1$. As for the beta distribution with parameters $\alpha, \beta$, the generalisation of the binomial coefficient is $\dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \cdot \Gamma(\beta)} = \dfrac{(\alpha + \beta - 1)!}{(\alpha-1)! \cdot (\beta-1)!}$. There we have the $-1$ in the denominator, for both parameters.
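
For integer parameters this ratio can be checked directly (a sketch with arbitrary $\alpha = 3$, $\beta = 5$):

```python
# Sketch: for integer parameters the Gamma ratio reduces to the factorial
# form with the "-1"s in it.
from math import factorial
from scipy.special import gamma

a, b = 3, 5  # arbitrary integer parameters
print(gamma(a + b) / (gamma(a) * gamma(b)))                          # 105.0
print(factorial(a + b - 1) / (factorial(a - 1) * factorial(b - 1)))  # 105.0
```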

aatr
  • This makes no sense because the recurrence relation satisfied by the factorial is not what you state: $(x+1)! \ne x \cdot x!.$ – whuber Sep 23 '19 at 15:31
  • The function $f(x)$ satisfying the recurrence relation is the Gamma: $\Gamma(x+1) = x \cdot \Gamma(x)$. This is how it is defined. – aatr Sep 24 '19 at 17:13
  • Yes: but your stated motivation is based on the *factorial* function, not the Gamma. – whuber Sep 24 '19 at 17:32
  • It is important to recall the relation between Gamma and factorial: $\Gamma(x) = (x-1)!$. – aatr Sep 25 '19 at 20:17
  • Unfortunately, that's circular logic: you start off with the factorial, characterize Gamma as interpolating it, and then conclude that's why there's a -1. In fact, your post exhibits the -1 as if it fell out mistakenly by confusing Gamma with the factorial. Few will find that either illuminating or convincing. – whuber Sep 25 '19 at 20:28
  • Thank you for the comments. I edited the post to try to make it clearer. I recall again that the Gamma is related to the factorial in the form of $\Gamma(x) = (x-1)!$; for we have $\Gamma(x+1) = x \cdot \Gamma(x) = x \cdot (x-1)! = x!$, so the recurrence relation is satisfied. – aatr Sep 27 '19 at 10:56
  • Right: but you still incorrectly claim that this is the recurrence satisfied by the *factorial* function. – whuber Sep 27 '19 at 13:01
  • No, no: the aim was just to find a function that smoothly connects the points of the factorial. It is this function that must satisfy the recurrence relation, not the factorial itself. – aatr Sep 30 '19 at 10:58