
I’d like to share, and submit for discussion and (dis)approval, yet another simple but (to the best of my knowledge) original Bayesian solution to the classical problem of comparing $k$ samples or groups, with the famous Behrens-Fisher problem (in its hypothesis-testing variant) as a special case. Without loss of generality, let $k = 2$.

Hence we have two conditionally i.i.d. samples ${x_1} = \left( {x_1^1, \ldots ,x_1^m} \right)$ and ${x_2} = \left( {x_2^1, \ldots ,x_2^n} \right)$ from two independent experiments (P.S.: see the comments for more details), modelled by a common parametric family of probability distributions $\mathcal{L}\left( {\theta ,\eta } \right)$ with pdf $f\left( {\left. x \right|\theta ,\eta } \right)$, but with two different sets of parameters $\left( {{\theta _1},{\eta _1}} \right)$ and $\left( {{\theta _2},{\eta _2}} \right)$:

${x_1}\mathop \sim \limits^{i.i.d.} \mathcal{L}\left( {{\theta _1},{\eta _1}} \right)$ and ${x_2}\mathop \sim \limits^{i.i.d.} \mathcal{L}\left( {{\theta _2},{\eta _2}} \right)$

$\theta $ is the set/vector of the parameters of interest and $\eta $ is the set/vector of the nuisance parameters.

For instance, in the Behrens-Fisher problem, $\mathcal{L}\left( {\theta ,\eta } \right) = \mathcal{N}\left( {\mu ,\sigma } \right)$ is the family of Gaussian distributions, $\theta = \mu $ is the mathematical expectation and $\eta = \sigma $ is the standard deviation. (Moreover, ${\sigma _1}$ and ${\sigma _2}$ are supposed to be different, but I won’t discuss this secondary issue here because this hypothesis has probability $1$.)

The problem is testing the null hypothesis

${H_0}:\quad {\theta _1} = {\theta _2}$

against the omnibus alternative

${H_1}:\quad {\theta _1} \ne {\theta _2}$

To the best of my knowledge, there is still no universally accepted solution to this problem in any statistical framework, be it frequentist, likelihoodist or Bayesian. This is due to the interplay of several difficulties, including the proper treatment of nuisance parameters; the non-existence of a UMP test (in the frequentist and likelihoodist frameworks); the proper assignment of prior probability distributions to the hypotheses and the parameters, in particular the fact that, in the Bayesian framework, the null hypothesis has prior and posterior probabilities equal to $0$ if $\theta $ is a continuous parameter; the logical dependence of the two experiments; etc.

Nevertheless, a classical, default Bayesian solution based on Bayes factors goes back to Jeffreys (1940); it also corresponds to a classical solution based on a (marginal or supremum) likelihood-ratio test in the likelihoodist framework, since Jeffreys found the same analytical solution as Fisher. It runs as follows. See for instance On the difference in means or A fully objective Bayesian approach for the Behrens-Fisher problem using historical studies for variations on this approach applied to the Behrens-Fisher problem.

Under ${H_1}$ , the likelihood or model $M_1$ is

$ p\left( {\left. {{x_1},{x_2}} \right|{\theta _1},{\theta _2},{\eta _1},{\eta _2},{H_1}} \right) = p\left( {\left. {{x_1}} \right|{\theta _1},{\eta _1},{H_1}} \right)p\left( {\left. {{x_2}} \right|{\theta _2},{\eta _2},{H_1}} \right) = \\ \prod\limits_{i = 1}^m {f\left( {\left. {x_1^i} \right|{\theta _1},{\eta _1}} \right)} \prod\limits_{j = 1}^n {f\left( {\left. {x_2^j} \right|{\theta _2},{\eta _2}} \right)} \\ $

by conditional independence, so that the marginal probability of the data $\left( {{x_1},{x_2}} \right)$ is

$p\left( {\left. {{x_1},{x_2}} \right|{H_1}} \right) = \iint {\iint {\prod\limits_{i = 1}^m {f\left( {\left. {x_1^i} \right|{\theta _1},{\eta _1}} \right)} \prod\limits_{j = 1}^n {f\left( {\left. {x_2^j\,} \right|{\theta _2},{\eta _2}} \right)p\left( {{\theta _1},{\eta _1},{\theta _2},{\eta _2}} \right)} }}{\text{d}}{\theta _1}{\text{d}}{\eta _1}{\text{d}}{\theta _2}{\text{d}}{\eta _2}$

Under ${H_0}$ , by definition there is a common parameter of interest

${\theta _0} = {\theta _1} = {\theta _2}$

so that the likelihood or model $M_0$ is

$ p\left( {\left. {{x_1},{x_2}} \right|{\theta _0},{\eta _1},{\eta _2},{H_0}} \right) = p\left( {\left. {{x_1}} \right|{\theta _0},{\eta _1},{H_0}} \right)p\left( {\left. {{x_2}} \right|{\theta _0},{\eta _2},{H_0}} \right) = \\ \prod\limits_{i = 1}^m {f\left( {\left. {x_1^i} \right|{\theta _0},{\eta _1}} \right)} \prod\limits_{j = 1}^n {f\left( {\left. {x_2^j} \right|{\theta _0},{\eta _2}} \right)} \\ $

and the probability of the data is

$p\left( {\left. {{x_1},{x_2}} \right|{H_0}} \right) = \iiint {\prod\limits_{i = 1}^m {f\left( {\left. {x_1^i} \right|{\theta _0},{\eta _1}} \right)} \prod\limits_{j = 1}^n {f\left( {\left. {x_2^j} \right|{\theta _0},{\eta _2}} \right)} p\left( {{\theta _0},{\eta _1},{\eta _2}} \right){\text{d}}{\theta _0}{\text{d}}{\eta _1}{\text{d}}{\eta _2}}$

Therefore, the classical Bayesian solution relies on the “Bayes factor”

$ {B_{01}} = \frac{{p\left( {\left. {{x_1},{x_2}} \right|{H_0}} \right)}}{{p\left( {\left. {{x_1},{x_2}} \right|{H_1}} \right)}} = \\ \frac{{\iiint {\prod\limits_{i = 1}^m {f\left( {\left. {x_1^i} \right|{\theta _0},{\eta _1}} \right)} \prod\limits_{j = 1}^n {f\left( {\left. {x_2^j} \right|{\theta _0},{\eta _2}} \right)} p\left( {{\theta _0},{\eta _1},{\eta _2}} \right){\text{d}}{\theta _0}{\text{d}}{\eta _1}{\text{d}}{\eta _2}}}}{{\iint {\iint {\prod\limits_{i = 1}^m {f\left( {\left. {x_1^i} \right|{\theta _1},{\eta _1}} \right)} \prod\limits_{j = 1}^n {f\left( {\left. {x_2^j\,} \right|{\theta _2},{\eta _2}} \right)p\left( {{\theta _1},{\eta _1},{\theta _2},{\eta _2}} \right)} }}{\text{d}}{\theta _1}{\text{d}}{\eta _1}{\text{d}}{\theta _2}{\text{d}}{\eta _2}}} \\ $

which is equal to the genuine quantity of interest, the ratio of posterior to prior odds

$\frac{{p\left( {\left. {{H_0}} \right|{x_1},{x_2}} \right)}}{{p\left( {\left. {{H_1}} \right|{x_1},{x_2}} \right)}}/\frac{{p\left( {{H_0}} \right)}}{{p\left( {{H_1}} \right)}}$

only if $p\left( {{H_0}} \right) \ne 0$ and $p\left( {{H_0}} \right) \ne 1$ .
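
To make this classical recipe concrete, here is a minimal numerical sketch for the Gaussian (Behrens-Fisher) case: the two marginal likelihoods $p\left( {\left. {{x_1},{x_2}} \right|{H_0}} \right)$ and $p\left( {\left. {{x_1},{x_2}} \right|{H_1}} \right)$ are approximated by crude Monte Carlo averages of the likelihood over prior draws, and their ratio gives $B_{01}$. The data, the sample sizes and the proper priors ($\mu \sim \mathcal{N}(0, 5)$, $\sigma \sim \operatorname{LogNormal}(0, 1)$) are purely illustrative assumptions of mine, not taken from the papers cited above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative data only: two Gaussian samples with different means and variances
x1 = rng.normal(0.0, 1.0, size=20)
x2 = rng.normal(0.3, 2.0, size=25)

def log_lik(x, mu, sigma):
    """Gaussian log-likelihood of the sample x for each (mu, sigma) prior draw."""
    return stats.norm.logpdf(x[:, None], mu, sigma).sum(axis=0)

def draw_prior(n):
    """Illustrative proper priors: mu ~ N(0, 5), sigma ~ LogNormal(0, 1)."""
    return rng.normal(0.0, 5.0, n), rng.lognormal(0.0, 1.0, n)

def log_mean_exp(a):
    """Numerically stable log of the mean of exp(a)."""
    m = a.max()
    return m + np.log(np.mean(np.exp(a - m)))

n_draws = 100_000

# p(x1, x2 | H1): independent (mu, sigma) pairs in the two groups
mu1, s1 = draw_prior(n_draws)
mu2, s2 = draw_prior(n_draws)
log_m1 = log_mean_exp(log_lik(x1, mu1, s1) + log_lik(x2, mu2, s2))

# p(x1, x2 | H0): a single common mu, but two separate sigmas
mu0, s1b = draw_prior(n_draws)
_, s2b = draw_prior(n_draws)
log_m0 = log_mean_exp(log_lik(x1, mu0, s1b) + log_lik(x2, mu0, s2b))

print("classical B01 ~=", np.exp(log_m0 - log_m1))
```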

I hold that this classical solution is not correct, for a very simple reason (P.S.: see the comments and part 2 for other, even better reasons). Indeed, under ${H_0}$ we cannot say that there is a common, single parameter ${\theta _0}$: the fact that two parameters ${\theta _1}$ and ${\theta _2}$ have the same numerical value does not mean that there is only one parameter!

The classical solution confuses two different concepts: the identity of two parameters (so that there is only one of them) and the equality of their numerical values. But equality is not identity, and does not imply it. This point has been discussed in particular by Henri Poincaré. See for instance this paper (in French): Identity and equality, the criticism of Poincaré

See page 30 of the .pdf:

The impossibility of understanding the application of the mathematical continuum to experimental data (…) as soon as one interprets the equality of two rational or real numbers as the identity of two entities, allows us to conclude…

For instance, this mistake is found in eq. 1 of 2:

${H_0}:{\mu _c} = {\mu _t} = \mu $

and around eq. 4 of 1:

The hypothesis sv assumes the [two] means and the [two] standard deviations are the same, so two parameters (a constant A and a standard deviation sigma1) have been removed by marginalization.

Eq. 4 of 1 (not reproduced here), where $B \to A$ and ${\sigma _1} \to {\sigma _2}$.

Correcting this mistake is almost immediate. Because the prior and posterior probabilities for ${H_0}$ are equal to $0$ if $\theta $ is a continuous parameter, let’s start with a discrete parameter $\theta $ (e.g. Newcombe "optimal" record linkage method).

Hence we have two joint prior probability distributions

$p\left( {{\theta _1},{\eta _1}} \right)$ and $p\left( {{\theta _2},{\eta _2}} \right)$

from which we get the discrete marginal prior probability distributions for ${\theta _1}$ and ${\theta _2}$

$p\left( {{\theta _1}} \right) = \int {p\left( {{\theta _1},{\eta _1}} \right){\text{d}}{\eta _1}} $ and $p\left( {{\theta _2}} \right) = \int {p\left( {{\theta _2},{\eta _2}} \right){\text{d}}{\eta _2}} $

It follows that the prior probability for ${H_0}$ is

$$\begin{aligned} p\left( {{H_0}} \right) = p\left( {{\theta _1} = {\theta _2}} \right) &= \sum\limits_\theta  {p\left( {{\theta _1} = {\theta _2} = \theta } \right)} &&\text{by total probability} \\ &= \sum\limits_\theta  {p\left( {{\theta _1} = \theta ,{\theta _2} = \theta } \right)} \\ &= \sum\limits_\theta  {p\left( {{\theta _1} = \theta } \right)p\left( {{\theta _2} = \theta } \right)} &&\text{by independence} \end{aligned}$$

where the sum runs over $\theta \in {\Theta _1} \cap {\Theta _2}$, with ${\theta _1} \in {\Theta _1}$ and ${\theta _2} \in {\Theta _2}$.

For instance, if ${\theta _1}$ and ${\theta _2}$ have the same discrete prior probability distribution $p\left( \theta \right)$ , then

$ p\left( {{H_0}} \right) = \sum\limits_\theta {p{{\left( \theta \right)}^2}} $

Of course

$p\left( {{H_1}} \right) = 1 - p\left( {{H_0}} \right)$
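
As a tiny worked example of these formulas, here is a sketch with hypothetical discrete priors for ${\theta _1}$ and ${\theta _2}$ on a common four-point support (the support and the weights are made up for illustration):

```python
import numpy as np

# Hypothetical discrete priors for theta_1 and theta_2 on a common support
theta = np.array([-1.0, 0.0, 1.0, 2.0])
p1 = np.array([0.10, 0.40, 0.40, 0.10])   # prior for theta_1
p2 = np.array([0.25, 0.25, 0.25, 0.25])   # prior for theta_2

# p(H0) = sum over Theta_1 ∩ Theta_2 of p(theta_1 = theta) p(theta_2 = theta)
p_H0 = np.sum(p1 * p2)
p_H1 = 1.0 - p_H0
print(p_H0, p_H1)          # 0.25 0.75

# With a common prior p(theta), this reduces to sum_theta p(theta)^2
print(np.sum(p1 ** 2))     # 0.34
```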

Now, the posterior probability distributions for ${\theta _1}$ and ${\theta _2}$ are

$p\left( {\left. {{\theta _i}} \right|{x_i}} \right) = \int {p\left( {\left. {{\theta _i},{\eta _i}} \right|{x_i}} \right){\text{d}}{\eta _i}} ,\quad i = 1,2$

by total probability, with

$p\left( {\left. {{\theta _i},{\eta _i}} \right|{x_i}} \right) = \frac{{p\left( {{\theta _i},{\eta _i}} \right)p\left( {\left. {{x_i}} \right|{\theta _i},{\eta _i}} \right)}}{{\sum\limits_{{\theta _i}} {\int {p\left( {{\theta _i},{\eta _i}} \right)p\left( {\left. {{x_i}} \right|{\theta _i},{\eta _i}} \right){\text{d}}{\eta _i}} } }}\quad i = 1,2$

by Bayes’ rule. Hence, as before, by total probability and by the independence of the two experiments, the posterior probability for ${H_0}$ is

$$\begin{aligned} p\left( {\left. {{H_0}} \right|{x_1},{x_2}} \right) = p\left( {\left. {{\theta _1} = {\theta _2}} \right|{x_1},{x_2}} \right) &= \sum\limits_\theta  {p\left( {\left. {{\theta _1} = \theta ,{\theta _2} = \theta } \right|{x_1},{x_2}} \right)} \\ &= \sum\limits_\theta  {p\left( {\left. {{\theta _1} = \theta } \right|{x_1}} \right)p\left( {\left. {{\theta _2} = \theta } \right|{x_2}} \right)} \end{aligned}$$

with

$ p\left( {\left. {{H_1}} \right|{x_1},{x_2}} \right) = 1 - p\left( {\left. {{H_0}} \right|{x_1},{x_2}} \right)$

At this point the problem is solved, but we can compute the corresponding Bayes factor for comparison with the classical one above:

$ {B_{01}} = \frac{{p\left( {\left. {{H_0}} \right|{x_1},{x_2}} \right)}}{{p\left( {\left. {{H_1}} \right|{x_1},{x_2}} \right)}}/\frac{{p\left( {{H_0}} \right)}}{{p\left( {{H_1}} \right)}} = \frac{{p\left( {\left. {{H_0}} \right|{x_1},{x_2}} \right)}}{{1 - p\left( {\left. {{H_0}} \right|{x_1},{x_2}} \right)}}/\frac{{p\left( {{H_0}} \right)}}{{1 - p\left( {{H_0}} \right)}} = \\ \frac{{\left( {1 - \sum\limits_\theta {p\left( {{\theta _1} = \theta } \right)p\left( {{\theta _2} = \theta } \right)} } \right)\sum\limits_\theta {\frac{{\int {p\left( {{\theta _1} = \theta ,{\eta _1}} \right)p\left( {\left. {{x_1}} \right|{\theta _1} = \theta ,{\eta _1}} \right){\text{d}}{\eta _1}} }}{{\sum\limits_{{\theta _1}} {\int {p\left( {{\theta _1},{\eta _1}} \right)p\left( {\left. {{x_1}} \right|{\theta _1},{\eta _1}} \right){\text{d}}{\eta _1}} } }}\frac{{\int {p\left( {{\theta _2} = \theta ,{\eta _2}} \right)p\left( {\left. {{x_2}} \right|{\theta _2} = \theta ,{\eta _2}} \right){\text{d}}{\eta _2}} }}{{\sum\limits_{{\theta _2}} {\int {p\left( {{\theta _2},{\eta _2}} \right)p\left( {\left. {{x_2}} \right|{\theta _2},{\eta _2}} \right){\text{d}}{\eta _2}} } }}} }}{{\sum\limits_\theta {p\left( {{\theta _1} = \theta } \right)p\left( {{\theta _2} = \theta } \right)} \left( {1 - \sum\limits_\theta {\frac{{\int {p\left( {{\theta _1} = \theta ,{\eta _1}} \right)p\left( {\left. {{x_1}} \right|{\theta _1} = \theta ,{\eta _1}} \right){\text{d}}{\eta _1}} }}{{\sum\limits_{{\theta _1}} {\int {p\left( {{\theta _1},{\eta _1}} \right)p\left( {\left. {{x_1}} \right|{\theta _1},{\eta _1}} \right){\text{d}}{\eta _1}} } }}\frac{{\int {p\left( {{\theta _2} = \theta ,{\eta _2}} \right)p\left( {\left. {{x_2}} \right|{\theta _2} = \theta ,{\eta _2}} \right){\text{d}}{\eta _2}} }}{{\sum\limits_{{\theta _2}} {\int {p\left( {{\theta _2},{\eta _2}} \right)p\left( {\left. {{x_2}} \right|{\theta _2},{\eta _2}} \right){\text{d}}{\eta _2}} } }}} } \right)}}\quad \\ $

It is worth observing (perhaps for the first time?) that this Bayes factor is not a ratio of any (marginal, supremum, …) likelihoods.
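
Here is a minimal, self-contained numerical sketch of this discrete-parameter solution in a Gaussian setting, where (purely for illustration) the parameter of interest $\theta = \mu$ is restricted to a finite grid of values, the nuisance $\eta = \sigma$ is integrated out by a crude quadrature sum on its own grid, and the prior probability, posterior probability and Bayes factor are computed exactly as in the formulas above. The data, grids and priors are all assumptions of the sketch.

```python
import numpy as np
from scipy import stats

def discrete_posterior(x, mu_grid, mu_prior, sigma_grid, sigma_prior):
    """Posterior p(mu = mu_grid[k] | x): Bayes' rule on a (mu, sigma) grid,
    with the nuisance sigma marginalized out by a crude quadrature sum."""
    loglik = stats.norm.logpdf(x[:, None, None],
                               mu_grid[None, :, None],
                               sigma_grid[None, None, :]).sum(axis=0)
    joint = np.exp(loglik - loglik.max()) * mu_prior[:, None] * sigma_prior[None, :]
    post = joint.sum(axis=1)               # marginalize the nuisance parameter
    return post / post.sum()

# Illustrative discrete prior for the parameter of interest (a mean on a grid)
mu_grid = np.linspace(-3.0, 3.0, 13)
mu_prior = np.full(mu_grid.size, 1.0 / mu_grid.size)
# Illustrative quadrature grid and prior weights for the nuisance sigma
sigma_grid = np.linspace(0.2, 4.0, 40)
sigma_prior = np.full(sigma_grid.size, 1.0 / sigma_grid.size)

# Illustrative data: two Gaussian samples
rng = np.random.default_rng(1)
x1 = rng.normal(0.0, 1.0, 20)
x2 = rng.normal(0.5, 2.0, 25)

post1 = discrete_posterior(x1, mu_grid, mu_prior, sigma_grid, sigma_prior)
post2 = discrete_posterior(x2, mu_grid, mu_prior, sigma_grid, sigma_prior)

p_H0_prior = np.sum(mu_prior * mu_prior)   # prior p(H0), same prior for both groups
p_H0_post = np.sum(post1 * post2)          # posterior p(H0 | x1, x2)

B01 = (p_H0_post / (1.0 - p_H0_post)) / (p_H0_prior / (1.0 - p_H0_prior))
print(p_H0_prior, p_H0_post, B01)
```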

Finally, if $\theta $ is a continuous parameter, it suffices to

  • Discretize the random variables ${\theta _1}$ and ${\theta _2}$ over a common Cartesian grid of step $\Delta {\theta}$;
  • Compute the Bayes factor as before for those discrete r.v.s;
  • Take the limit of the Bayes factors when $\Delta {\theta} \to 0$, by replacing the Riemann sums by the corresponding integrals.

While measure theory makes the problem trivial and meaningless for a continuous parameter, because by passing to the limit at the very beginning we immediately arrive at the undefined ratio

$\frac{{p\left( {\left. {{H_0}} \right|{x_1},{x_2}} \right)}}{{p\left( {{H_0}} \right)}} = \frac{0}{0}$

the singularity disappears if we discretize at the beginning and pass to the limit only at the very end.
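
To illustrate this discretize-then-pass-to-the-limit recipe numerically, the self-contained sketch below (same illustrative Gaussian setting as above) refines the grid for $\theta = \mu$: the discretized prior and posterior probabilities of ${H_0}$ both shrink roughly in proportion to the step $\Delta \theta$, while the Bayes factor settles down to a finite limit.

```python
import numpy as np
from scipy import stats

# Illustrative data and nuisance grid, as in the previous sketch
rng = np.random.default_rng(1)
x1 = rng.normal(0.0, 1.0, 20)
x2 = rng.normal(0.5, 2.0, 25)
sigma_grid = np.linspace(0.2, 4.0, 40)
sigma_prior = np.full(sigma_grid.size, 1.0 / sigma_grid.size)

def discrete_posterior(x, mu_grid, mu_prior):
    """Posterior p(mu = mu_grid[k] | x), with sigma marginalized out on its grid."""
    loglik = stats.norm.logpdf(x[:, None, None],
                               mu_grid[None, :, None],
                               sigma_grid[None, None, :]).sum(axis=0)
    joint = np.exp(loglik - loglik.max()) * mu_prior[:, None] * sigma_prior[None, :]
    post = joint.sum(axis=1)
    return post / post.sum()

for n_cells in (20, 80, 320, 1280):
    mu_grid = np.linspace(-5.0, 5.0, n_cells)        # step ~ 10 / n_cells
    # discretization of an (illustrative) continuous prior density N(0, 2)
    mu_prior = stats.norm.pdf(mu_grid, 0.0, 2.0)
    mu_prior /= mu_prior.sum()

    p0_prior = np.sum(mu_prior ** 2)                 # -> 0 as the step -> 0
    p0_post = np.sum(discrete_posterior(x1, mu_grid, mu_prior) *
                     discrete_posterior(x2, mu_grid, mu_prior))
    B01 = (p0_post / (1.0 - p0_post)) / (p0_prior / (1.0 - p0_prior))
    print(f"{n_cells:5d} cells:  p(H0) = {p0_prior:.2e}   "
          f"p(H0|x1,x2) = {p0_post:.2e}   B01 = {B01:.3f}")
```

The last column should stabilize as the grid is refined, which is the finite limit announced above, even though both probabilities of ${H_0}$ tend to $0$.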

At least, this solution is 100% probabilistic and does not rely on any external principle such as the “identification of equality with identity”. In particular, the prior and posterior probabilities of the hypotheses ${H_0}$ and ${H_1}$ are computed from the probability distributions of the parameters, as expected, whereas they are completely disconnected from those distributions in papers 1 and 2: in 2 we have $p\left( {{H_0}} \right) = p\left( {{H_1}} \right) = \frac{1}{2}$ (page 5, section 2.3), whereas $p\left( {{H_0}} \right) = 0$ and $p\left( {{H_1}} \right) = 1$ according to measure theory, and in 1 the prior probabilities of the four hypotheses are $\frac{1}{4}$ (page 4), whereas all hypotheses but $\bar s\,\bar v$ have probability $0$.

Please provide your opinion about this alternative solution to $k$-sample comparison tests and to the Behrens-Fisher problem.

Is it novel or already known from the prior art?

Do you agree that there are still two parameters ${\theta _1}$ and ${\theta _2}$ under ${H_0}$ , not a single one ${\theta _0}$?

Any objections to this solution, please?

P.S.: the principle of the "identification of equality with identity" can be found here as well: Bayes Factor and likelihood for two sample from different distributions?, Finding maximum likelihood estimates of parameters of multiple normal populations, Bayes Factor for two exponential samples, How to do bayesian model comparison for control and treatment ...

P.S. Next part here:

Bayes-Poincaré solution to the Behrens-Fisher problem 2: calculations for Jeffreys’ priors
