
Knowing the coefficient of determination between variables $X$ and $Y$ ($r_{XY}^2$), and $Y$ and $Z$ ($r_{YZ}^2$), what is the expected coefficient of determination between variables $X$ and $Z$?

Initially, I thought $r_{XY}^2 \centerdot r_{YZ}^2$ might be a good approximation of $r_{XZ}^2$. I played around with the formulae a bit (see edit below) to get:

$r_{XZ}^2 = \frac{\bigl(\sum_{i = 1}^n (X_i - \bar X)(Z_i - \bar Z)\bigr)^2}{\sum_{i = 1}^n (X_i - \bar X)^2 \centerdot \sum_{i = 1}^n(Z_i - \bar Z)^2}$

$r_{XZ}^2 = \frac{\bigl(\sum_{i = 1}^n (X_i - \bar X)(Z_i - \bar Z)\bigr)^2 \centerdot\bigl(\sum_{i = 1}^n (Y_i - \bar Y)^2\bigr)^2 \ }{\bigl(\sum_{i = 1}^n (X_i - \bar X)(Y_i - \bar Y)\bigr)^2\centerdot\bigl(\sum_{i = 1}^n (Y_i - \bar Y)(Z_i - \bar Z)\bigr)^2} \centerdot r_{XY}^2 \centerdot r_{YZ}^2$

I ran thousands of simulations to see how $\frac{\bigl(\sum_{i = 1}^n (X_i - \bar X)(Z_i - \bar Z)\bigr)^2 \centerdot\bigl(\sum_{i = 1}^n (Y_i - \bar Y)^2\bigr)^2 \ }{\bigl(\sum_{i = 1}^n (X_i - \bar X)(Y_i - \bar Y)\bigr)^2\centerdot\bigl(\sum_{i = 1}^n (Y_i - \bar Y)(Z_i - \bar Z)\bigr)^2}$ behaved.

For large $n$, this was almost always close to $1$ (and so I thought my hunch was right).

However, I discovered that while the distribution of values had a median close to 1, it had a mean consistently in the thousands!

Is there a nice way to express this value as a function of $n$, and thereby generate better estimates of $r_{XZ}^2$?

EDIT:

Knowing $r_{XY}^2 = \frac{\bigl(\sum_{i = 1}^n (X_i - \bar X)(Y_i - \bar Y)\bigr)^2}{\sum_{i = 1}^n (X_i - \bar X)^2 \centerdot \sum_{i = 1}^n(Y_i - \bar Y)^2}$ and $r_{YZ}^2 = \frac{\bigl(\sum_{i = 1}^n (Y_i - \bar Y)(Z_i - \bar Z)\bigr)^2}{\sum_{i = 1}^n (Y_i - \bar Y)^2 \centerdot \sum_{i = 1}^n(Z_i - \bar Z)^2}$

Implies $\sum_{i = 1}^n (X_i - \bar X)^2 = \frac{\bigl(\sum_{i = 1}^n (X_i - \bar X)(Y_i - \bar Y)\bigr)^2}{r_{XY}^2 \centerdot \sum_{i = 1}^n(Y_i - \bar Y)^2}$ and $\sum_{i = 1}^n(Z_i - \bar Z)^2 = \frac{\bigl(\sum_{i = 1}^n (Y_i - \bar Y)(Z_i - \bar Z)\bigr)^2}{\sum_{i = 1}^n (Y_i - \bar Y)^2 \centerdot r_{YZ}^2}$

Since $r_{XZ}^2 = \frac{\bigl(\sum_{i = 1}^n (X_i - \bar X)(Z_i - \bar Z)\bigr)^2}{\sum_{i = 1}^n (X_i - \bar X)^2 \centerdot \sum_{i = 1}^n(Z_i - \bar Z)^2}$, then $r_{XZ}^2 = \frac{\bigl(\sum_{i = 1}^n (X_i - \bar X)(Z_i - \bar Z)\bigr)^2 \centerdot\bigl(\sum_{i = 1}^n (Y_i - \bar Y)^2\bigr)^2 \ }{\bigl(\sum_{i = 1}^n (X_i - \bar X)(Y_i - \bar Y)\bigr)^2\centerdot\bigl(\sum_{i = 1}^n (Y_i - \bar Y)(Z_i - \bar Z)\bigr)^2} \centerdot r_{XY}^2 \centerdot r_{YZ}^2$
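The algebra above can be checked numerically. The following is a minimal sketch, assuming plain Python with only the standard library; the helper names `centered_sums` and `r2_and_gamma` and the test data are hypothetical and not part of the original simulations. It illustrates that the identity holds exactly for any data set, because the factor of summations (called `gamma` here) is by construction $r_{XZ}^2 / (r_{XY}^2 \centerdot r_{YZ}^2)$.

```python
import random


def centered_sums(x, y, z):
    """Return the centered sums of squares and cross-products of three samples."""
    n = len(x)
    mx, my, mz = sum(x) / n, sum(y) / n, sum(z) / n
    sxx = syy = szz = sxy = syz = sxz = 0.0
    for xi, yi, zi in zip(x, y, z):
        dx, dy, dz = xi - mx, yi - my, zi - mz
        sxx += dx * dx
        syy += dy * dy
        szz += dz * dz
        sxy += dx * dy
        syz += dy * dz
        sxz += dx * dz
    return sxx, syy, szz, sxy, syz, sxz


def r2_and_gamma(x, y, z):
    """Return (r2_xy, r2_yz, r2_xz, gamma), with gamma the factor of summations."""
    sxx, syy, szz, sxy, syz, sxz = centered_sums(x, y, z)
    r2_xy = sxy ** 2 / (sxx * syy)
    r2_yz = syz ** 2 / (syy * szz)
    r2_xz = sxz ** 2 / (sxx * szz)
    gamma = (sxz ** 2 * syy ** 2) / (sxy ** 2 * syz ** 2)
    return r2_xy, r2_yz, r2_xz, gamma


# Arbitrary, unrelated samples (an illustrative choice): the identity is purely
# algebraic, so it does not depend on how the data were generated.
rng = random.Random(0)
n = 50
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [rng.random() for _ in range(n)]
z = [rng.expovariate(1.0) for _ in range(n)]

r2_xy, r2_yz, r2_xz, gamma = r2_and_gamma(x, y, z)
print("r2_xz                 =", r2_xz)
print("gamma * r2_xy * r2_yz =", gamma * r2_xy * r2_yz)
```

Because the identity is exact, it carries no information by itself: all of the uncertainty about $r_{XZ}^2$ sits in the factor `gamma`.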

NBland
  • The correlations between X and Y and between Y and Z don't determine the correlation between X and Z, and the results of simulations depend a lot on how X, Y and Z are chosen. – Pere Oct 29 '16 at 14:55
  • Is this not what is usually called [partial correlation](https://en.wikipedia.org/wiki/Partial_correlation)? – mdewey Oct 29 '16 at 15:50
  • Doesn't that second formula from my question demonstrate that $r_{XZ}^2$ is at least a function of $r_{XY}^2$ and $r_{YZ}^2$? And when that complicated term of summations is one, $r_{XZ}^2$ is perfectly predicted. I generated the strengths of the correlations randomly: $y = rand(n,1)$; $x = rand*y + rand(n,1)$; and $z = rand*y + rand(n,1)$, and used the coefficients of determination between the first and second, and first and third, to predict the coefficient of determination between the second and third. – NBland Oct 30 '16 at 01:30

1 Answer


The general relationship between the population correlations $\rho_{X,Y}$, $\rho_{X,Z}$, and $\rho_{Y,Z}$ of three random variables is discussed extensively in this question and its answers. There, it is shown that the arithmetic average value of the three correlations must be in the range $\left[-\frac 12,1\right]$ and that any average value in this range is achievable. Of course, if the values of two of the correlations are known, then the constraint on the arithmetic average translates into a constraint on the unknown value of the third correlation. (This constraint gives an exact value only in the extreme case where the two known correlations have value $-1$ and so the third correlation necessarily must have value $+1$.) These results can be translated into constraints that apply to the population coefficients of determination, viz., $\rho_{X,Y}^2$, $\rho_{X,Z}^2$, and $\rho_{Y,Z}^2$ but the results are, of course, weaker.
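As a side note that is not taken from the linked answers, one standard way to make the constraint on the third correlation explicit is to require the $3\times 3$ correlation matrix to be positive semidefinite. Given that each correlation lies in $[-1,1]$, this amounts to requiring a non-negative determinant:

$$1 + 2\rho_{X,Y}\,\rho_{Y,Z}\,\rho_{X,Z} - \rho_{X,Y}^2 - \rho_{Y,Z}^2 - \rho_{X,Z}^2 \geq 0.$$

Completing the square in $\rho_{X,Z}$ gives $\left(\rho_{X,Z} - \rho_{X,Y}\,\rho_{Y,Z}\right)^2 \leq (1-\rho_{X,Y}^2)(1-\rho_{Y,Z}^2)$, that is,

$$\rho_{X,Y}\,\rho_{Y,Z} - \sqrt{(1-\rho_{X,Y}^2)(1-\rho_{Y,Z}^2)} \;\leq\; \rho_{X,Z} \;\leq\; \rho_{X,Y}\,\rho_{Y,Z} + \sqrt{(1-\rho_{X,Y}^2)(1-\rho_{Y,Z}^2)}.$$

So the product $\rho_{X,Y}\,\rho_{Y,Z}$ is the midpoint of the feasible interval for $\rho_{X,Z}$, and the interval shrinks to a single point only when $|\rho_{X,Y}| = 1$ or $|\rho_{Y,Z}| = 1$.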


However, based on the additional information provided by the OP in a comment on the question, the problem being discussed is quite different. Let $Y, A, B$ denote independent $U(0,1)$ random variables. Then, the OP is assuming that $$X = A+ \alpha Y, \quad Z = B + \beta Y$$ where $\alpha, \beta \in (0, 1)$ are chosen at random and independently of each other, but are constants as far as the simulations are concerned. At least, that is how I interpret the OP's statement: $$y = rand(n,1);~~ x = rand*y + rand(n,1);~~ z = rand*y + rand(n,1)$$ which produces a collection of $n$ 3-tuples $(X_i,Y_i,Z_i) = (A_i + \alpha Y_i,\ Y_i, \ B_i + \beta Y_i), 1 \leq i \leq n$ (or so I believe). Accordingly, the covariance matrix of $(X,Y,Z)$ is $$\Sigma = \frac{1}{12}\times \left[\begin{matrix} 1+\alpha^2 & \alpha & \alpha\beta\\ \alpha & 1 & \beta\\ \alpha\beta & \beta & 1+\beta^2 \end{matrix}\right]$$ and the squares of the correlation coefficients are $$\rho_{X,Y}^2 = \frac{\alpha^2}{1+\alpha^2};~~ \rho_{Y,Z}^2 = \frac{\beta^2}{1+\beta^2}; ~~ \rho_{X,Z}^2 = \frac{\alpha^2\beta^2}{(1+\alpha^2)(1+\beta^2)} = \rho_{X,Y}^2\cdot\rho_{Y,Z}^2.$$ Thus, it is not too surprising that the OP's simulations show that the sample coefficient of determination $r_{X,Z}^2$ is very nearly equal to the product of the sample coefficients of determination $r_{X,Y}^2$ and $r_{Y,Z}^2$.
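The population formulas above can be checked against one large simulated sample. This is a minimal sketch, assuming plain Python with only the standard library. The values $\alpha = 0.8$ and $\beta = 0.5$, the sample size, the seed, and the helper name `r_squared` are illustrative assumptions, not taken from the OP's simulations.

```python
import random


def r_squared(u, v):
    """Sample coefficient of determination (squared Pearson correlation)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv ** 2 / (suu * svv)


rng = random.Random(1)
alpha, beta = 0.8, 0.5  # hypothetical constants, fixed for the whole sample
n = 200_000

# Y, A, B independent U(0,1); X = A + alpha*Y, Z = B + beta*Y.
y = [rng.random() for _ in range(n)]
x = [rng.random() + alpha * yi for yi in y]
z = [rng.random() + beta * yi for yi in y]

# Population values derived from the covariance matrix.
rho2_xy = alpha ** 2 / (1 + alpha ** 2)
rho2_yz = beta ** 2 / (1 + beta ** 2)
rho2_xz = rho2_xy * rho2_yz

# Sample values computed from the simulated data.
r2_xy = r_squared(x, y)
r2_yz = r_squared(y, z)
r2_xz = r_squared(x, z)

print("XY: sample", r2_xy, "population", rho2_xy)
print("YZ: sample", r2_yz, "population", rho2_yz)
print("XZ: sample", r2_xz, "population", rho2_xz)
print("product of sample r2_xy and r2_yz:", r2_xy * r2_yz)
```

With a sample this large the sample values should sit close to the population values, so `r2_xz` and `r2_xy * r2_yz` should nearly agree. For much smaller `n` or for `alpha`, `beta` near zero the agreement can be considerably worse.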

Note: The result that $\rho_{X,Z}^2 = \rho_{X,Y}^2\cdot\rho_{Y,Z}^2$, and consequently $r_{X,Z}^2 \approx r_{X,Y}^2\cdot r_{Y,Z}^2$, applies to the specific case of samples of the form $(X_i,Y_i,Z_i) = (A_i + \alpha Y_i,\ Y_i, \ B_i + \beta Y_i), 1 \leq i \leq n$. It does not apply to the more general model of arbitrary random variables whose covariance matrix is not necessarily of the form described above. I have no idea how the OP ran "thousands of simulations" to get the result that $r_{X,Z}^2 = \gamma\, r_{X,Y}^2\cdot r_{Y,Z}^2$, where $\gamma$ is a quantity that has a median value of $1$ and a mean value in the thousands. Were these simulations based on data of the form $$(X_i,Y_i,Z_i) = (A_i + \alpha Y_i,\ Y_i, \ B_i + \beta Y_i), 1 \leq i \leq n,$$ or were they based on arbitrary random data for which the above results do not hold?
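One possible reading of the OP's simulation can be sketched as follows. This is only a guess at the setup, assuming plain Python with the standard library, and assuming that $\alpha$ and $\beta$ are redrawn from $U(0,1)$ for every simulated data set. The function names, the seed, and the values of `n` and `trials` are hypothetical. The factor $\gamma$ has the squared sample covariances of $(X,Y)$ and of $(Y,Z)$ in its denominator, so data sets in which either sample covariance happens to be close to zero produce very large values of $\gamma$. That is one plausible source of a mean far above the median.

```python
import random
import statistics


def cross_sums(x, y, z):
    """Return the centered sums S_xy, S_yz, S_xz and S_yy."""
    n = len(x)
    mx, my, mz = sum(x) / n, sum(y) / n, sum(z) / n
    sxy = syz = sxz = syy = 0.0
    for xi, yi, zi in zip(x, y, z):
        dx, dy, dz = xi - mx, yi - my, zi - mz
        sxy += dx * dy
        syz += dy * dz
        sxz += dx * dz
        syy += dy * dy
    return sxy, syz, sxz, syy


def simulate_gamma(n, trials, seed=0):
    """Simulate gamma = r2_xz / (r2_xy * r2_yz) for many data sets."""
    rng = random.Random(seed)
    gammas = []
    for _ in range(trials):
        # Assumption: alpha and beta are redrawn for each data set.
        alpha, beta = rng.random(), rng.random()
        y = [rng.random() for _ in range(n)]
        x = [alpha * yi + rng.random() for yi in y]
        z = [beta * yi + rng.random() for yi in y]
        sxy, syz, sxz, syy = cross_sums(x, y, z)
        gammas.append((sxz * syy) ** 2 / (sxy * syz) ** 2)
    return gammas


gammas = simulate_gamma(n=1000, trials=1000)
median_gamma = statistics.median(gammas)
mean_gamma = statistics.fmean(gammas)
print("median of gamma:", median_gamma)
print("mean of gamma:  ", mean_gamma)
print("largest gamma:  ", max(gammas))
```

If this reading is right, the median is the more informative summary of $\gamma$. The mean is dominated by the few data sets with a nearly vanishing sample covariance, so it need not settle down as more simulations are run.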

Dilip Sarwate
  • Something else. I generated random variables as follows: $y = rand(n,1)$; $x = rand*y + rand(n,1)$; $z = rand*y + rand(n,1)$, so that the strengths of the correlations between x and y and z and y took on a large range. After calculating these coefficients of determination, I wanted to see how well they predicted the coefficient of determination between x and z (and they did a good job overall, but this "factor" that I pulled out was not always one). The larger n, the more accurate my estimations tended to be. Thank you for the link to the related question. – NBland Oct 30 '16 at 01:23
  • @NBland This information needs to be included in an edit to your question so that others can read it too instead of being tucked away in your comment above where it is less noticeable. – Dilip Sarwate Oct 30 '16 at 15:37
  • Thanks for this additional information. I've included an edit explaining where exactly $r_{XZ}^2$ comes from. – NBland Oct 31 '16 at 06:23