Independent replication experiments yielding contrasting results; how to combine them?

Question

Imagine a simple experiment that tries to answer a simple question. For example, is body temperature the same in men and in women?

To answer this question, let's say you sample 10 men, and 10 women, randomly from a given city, and measure their respective body temperature (same protocol of measurement for everybody, of course).

Then imagine you get a significant (alpha=5%) difference between these two samples.

You cannot rule out a statistical fluke, can you? (This may constitute a subsidiary question, and I will be pleased if you can answer it too, but the main question lies below.) You may want to repeat this experiment a few times, for example in independent cities, to become confident that the difference observed in the first experiment is real.

Now imagine that you run this experiment 8 times in total (including the first one), and you observe a significant difference between men and women in 4 of them.

My question is: how confident can I be that the difference is real, if I have only this information: 4 out of 8 independent tests were significant at alpha = 5%? (Or, to paraphrase: how can I calculate the overall p-value when all I have is the p-value from each repetition of the experiment? Maybe I need additional information?)

(This is a simple example, for thinking efficiently about a real problem much more complicated...)

Scortchi - Reinstate Monica · Accepted Answer · 2015-01-02T10:19:53.397

The p-value from each experiment should have a uniform distribution between 0 and 1 under the null hypothesis, so tests of the null hypothesis over all experiments can be based on this. Perhaps the most common test statistic is Fisher's: for p-values $p_j$ from $m$ independent experiments the negative log of each follows an exponential distribution

$$-\log p_j\sim \mathrm{Exp}(1)$$

and twice their sum a chi-squared distribution with $2m$ degrees of freedom.

$$-2\sum_j^m \log p_j \sim \chi^2_{2m}$$

So an overall p-value $p^*$ can be got from the chi-squared distribution function $F_{\chi^2}(\cdot)$:

$$p^* = 1-F_{\chi^2}\left(-2\sum_j^m \log p_j; 2m\right)$$
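As a minimal sketch of the calculation above, the following Python function combines a list of independent p-values with Fisher's statistic. It assumes only the standard library: because the degrees of freedom $2m$ are even, the chi-squared upper tail has the closed form $e^{-x/2}\sum_{k=0}^{m-1}(x/2)^k/k!$, which is used here instead of a statistics package. The function name `fisher_combined_p` and the example p-values are hypothetical, chosen for illustration only.

```python
import math


def fisher_combined_p(p_values):
    """Overall p-value p* from Fisher's method for independent p-values."""
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("each p-value must lie in (0, 1]")

    # Half of Fisher's statistic: x/2 = -sum_j log p_j
    half_stat = -sum(math.log(p) for p in p_values)

    # Upper tail of chi-squared with 2m degrees of freedom (even df):
    # exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    term = 1.0
    total = 1.0
    for k in range(1, m):
        term *= half_stat / k
        total += term
    return math.exp(-half_stat) * total


# Made-up p-values from four hypothetical experiments
example = [0.01, 0.04, 0.20, 0.60]
print(fisher_combined_p(example))
```

A quick sanity check on the sketch: with a single experiment ($m=1$) the combined p-value is just the original p-value, as it should be.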

If you only know whether or not $p_j<\alpha$, the number of "successes" follows a binomial distribution with probability parameter $\alpha$ and sample size $m$:

$$\sum_j^m I(p_j) \sim \mathrm{Bin}(\alpha,m)$$

where the indicator function

$$I(p_j)=\left\{ \begin{array}{ll} 0 & \text{when } p_j\geq\alpha \\ 1 & \text{when } p_j<\alpha \end{array} \right. $$

& so you can use the binomial distribution function $F_\mathrm{Bin}(\cdot)$ to calculate an overall p-value

$$ p^*=1-F_\mathrm{Bin}\left(\sum_j^m I(p_j)-1;\alpha,m\right) $$
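Applied to the numbers in the question (4 significant results out of 8 experiments at $\alpha = 0.05$), the binomial calculation can be sketched as follows. This is an illustration using only the Python standard library; the function name `binomial_combined_p` is hypothetical, and the upper tail is summed directly rather than computed as one minus the distribution function.

```python
import math


def binomial_combined_p(n_significant, m, alpha=0.05):
    """P(at least n_significant of m independent tests reject) under the null.

    Equals 1 - F_Bin(n_significant - 1; alpha, m).
    """
    if not 0 <= n_significant <= m:
        raise ValueError("n_significant must be between 0 and m")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Sum the binomial probabilities for k = n_significant, ..., m
    return sum(
        math.comb(m, k) * alpha**k * (1.0 - alpha) ** (m - k)
        for k in range(n_significant, m + 1)
    )


# The question's scenario: 4 of 8 tests significant at alpha = 0.05
print(binomial_combined_p(4, 8, 0.05))  # roughly 3.7e-4
```

Under the null hypothesis, seeing 4 or more rejections in 8 independent tests would therefore be quite unlikely, although this approach discards the information in the individual p-values that Fisher's method uses.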

Read up on meta-analysis for more complicated situations, & for the (often more useful) estimation of an effect size measured over several studies, & for assessment of heterogeneity (are different studies really measuring the same thing?).

A useful [link](http://en.wikipedia.org/wiki/Fisher%27s_method) — Rodolphe, Nov 17 '14 at 12:47
