I am currently conducting a meta-analysis and have pooled the prevalence of a certain disease. I would like to check for any association between this prevalence and risk factors such as gender, ethnicity, and disease classification, all of which I have entered as proportions. May I ask what is the best way forward?
-
Proceed with extreme caution due to mathematical coupling when dealing with proportions. – Robert Long May 27 '21 at 14:46
-
Robert, can you say more about what you mean by 'mathematical coupling'? – Aaron Springer May 27 '21 at 20:48
-
Answering my own question: "The common problem in each type of mathematic coupling is that one variable either directly or indirectly contains the whole or components of the second variable" from [this paper](https://pubmed.ncbi.nlm.nih.gov/7212790/). Makes sense to me but not a term I was familiar with – Aaron Springer May 27 '21 at 20:51
-
Maybe relevant: https://stats.stackexchange.com/questions/58664/ratios-in-regression-aka-questions-on-kronmal – kjetil b halvorsen Jun 02 '21 at 15:42
1 Answer
Suppose we have a model such as
$$y = x$$
where $y$ and $x$ are some measurements in a number of samples. Now, if we introduce a third variable $z$, such as the number of subjects in each sample or the size of each population, and we wish to form another model so that we are dealing with proportions, we could have the model
$$\frac{y}{z} = \frac{x}{z}$$
It should now be obvious that, since $z$ appears in the denominator on both sides, the two sides are "coupled", hence the term mathematical coupling.
A simple example in R can show this. For simplicity we simulate three variables independently from a standard normal distribution, starting with $x$ and $y$:
> set.seed(1)
> x <- rnorm(100)
> y <- rnorm(100)
> cor(x,y)
[1] -0.0009943199
...so the correlation is close to zero. Or in linear regression:
> summary(lm(y~x))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.8768 -0.6138 -0.1395 0.5394 2.3462
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03769 0.09699 -0.389 0.698
x -0.00106 0.10773 -0.010 0.992
Residual standard error: 0.9628 on 98 degrees of freedom
Multiple R-squared: 9.887e-07, Adjusted R-squared: -0.0102
F-statistic: 9.689e-05 on 1 and 98 DF, p-value: 0.9922
...so the slope estimate is close to zero and so is $R^2$.
Now we introduce a third variable:
> z <- rnorm(100)
> cor(x/z, y/z)
[1] 0.9168795
and suddenly the correlation is above 0.9. Or in regression:
> summary(lm(I(y/z) ~ I(x/z)))
Call:
lm(formula = I(y/z) ~ I(x/z))
Residuals:
Min 1Q Median 3Q Max
-45.996 -4.733 -2.784 -1.524 214.929
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.74090 2.53884 1.08 0.283
I(x/z) 1.44965 0.06375 22.74 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 25.35 on 98 degrees of freedom
Multiple R-squared: 0.8407, Adjusted R-squared: 0.839
F-statistic: 517.1 on 1 and 98 DF, p-value: < 2.2e-16
...and now the estimate for the slope is far from zero with a very small $p$-value, and $R^2$ is 0.8407, which is $0.9168795^2$, the square of the correlation above.
It is worth noting that this example is rather extreme: all the variables are standard normal, which induces the largest possible effect of mathematical coupling. When the variables are on different scales, have different variances, are of different types, or are correlated with each other, the effect of mathematical coupling is less pronounced, but it is nevertheless still present, as the sketch below illustrates.
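As a minimal sketch of that last point (my own addition, not part of the original example; the names z_wide and z_narrow are hypothetical): when the shared denominator has a large mean and small spread relative to it, dividing by it barely perturbs $x$ and $y$, so the induced correlation stays small, whereas a denominator centred near zero produces the severe effect shown above.
# Sketch: coupling is weaker when the denominator varies little relative to its mean
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)
z_wide   <- rnorm(100)                     # same draws as z above, reproducing the 0.9168795 correlation
z_narrow <- rnorm(100, mean = 50, sd = 1)  # large mean, small relative spread
cor(x / z_wide,   y / z_wide)    # large spurious correlation, as in the example above
cor(x / z_narrow, y / z_narrow)  # should stay close to cor(x, y), i.e. near zero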
So extreme caution is advised when dealing with proportions.
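If you do go ahead, one rough way to gauge how much association the shared denominator alone can generate is a permutation check. This is my own sketch under stated assumptions (the numerators are treated as exchangeable; the names obs_cor, null_cors and x_perm are hypothetical), not an established fix: permuting one numerator destroys any real association between $x$ and $y$ while keeping the common denominator $z$ in both ratios, so the resulting correlations show what coupling by itself produces.
# Sketch: a permutation reference distribution for coupling-only correlation,
# using the x, y and z simulated above
set.seed(2)
obs_cor <- cor(x / z, y / z)
null_cors <- replicate(2000, {
  x_perm <- sample(x)          # break any real x-y link, keep z in both ratios
  cor(x_perm / z, y / z)
})
mean(abs(null_cors) >= abs(obs_cor))  # how often coupling alone is at least this strong
hist(null_cors, main = "Correlation induced by the shared denominator alone")
In this simulated example the observed correlation is of course entirely coupling, so it should look typical of the permutation distribution; with real meta-analytic data, a correlation far outside that distribution would hint at something beyond coupling, though this is only a crude diagnostic.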
