
This is a follow-up to the question Regression on population values?.

In the theory of test equating (see, e.g., Test Equating, Scaling, and Linking by Kolen and Brennan (2014)), suppose we have a group of examinees, each of whom is randomly assigned one of two exam forms, say form X and form Y.

Let $X$ be a random variable representing scores of form X, and similarly for $Y$. Letting $\sigma(X)$ denote the standard deviation of a random variable $X$, linear equating assumes that $$\dfrac{X - \mathbb{E}[X]}{\sigma(X)} = \dfrac{Y - \mathbb{E}[Y]}{\sigma(Y)}$$ with probability one. To interpret the above, suppose we have a particular score $X = x$ from form X; we would use the equation above to yield an equivalent score for form Y.

Solving for $Y$ in the above equation yields $$Y = \dfrac{\sigma(Y)}{\sigma(X)}\cdot X + \mathbb{E}[Y] - \dfrac{\sigma(Y)}{\sigma(X)}\cdot \mathbb{E}[X]\text{.}\tag{1}$$ This can be interpreted as a linear equation of $X$, with slope $\dfrac{\sigma(Y)}{\sigma(X)}$ and intercept $\mathbb{E}[Y] - \dfrac{\sigma(Y)}{\sigma(X)}\cdot \mathbb{E}[X]$.
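
For concreteness, here is a minimal numeric sketch of the conversion in $(1)$; the form means and standard deviations are assumed example values, not taken from Kolen and Brennan.

```python
# A minimal sketch of the linear equating conversion in (1).
# The form means and standard deviations below are assumed example values.
mu_x, sigma_x = 25.0, 5.0   # form X mean and SD (assumed)
mu_y, sigma_y = 27.0, 6.0   # form Y mean and SD (assumed)

def linear_equate(x):
    """Convert a form X score x to its form Y equivalent via equation (1)."""
    slope = sigma_y / sigma_x
    intercept = mu_y - slope * mu_x
    return slope * x + intercept

print(linear_equate(25.0))  # the X mean maps to the Y mean: 27.0
print(linear_equate(30.0))  # one SD above the X mean maps one SD above the Y mean: 33.0
```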

As stated in the prior question, Kolen and Brennan mention:

The equation for linear equating... is deceptively like a linear regression equation. The difference is that, for linear regression, the $\sigma(Y)/\sigma(X)$ terms are multiplied by the correlation between $X$ and $Y$.

This point has already been well covered in the linked question above.

Kolen and Brennan then go on to say:

However, a linear regression equation does not qualify as an equating function because the regression of X on Y is different from the regression of Y on X, unless the correlation coefficient is 1. For this reason, regression equations cannot, in general, be used as equating functions.

To be more precise, inserting the correlation coefficient from the regression into $(1)$ as described above, we obtain $$\begin{align} Y &= \dfrac{\sigma(Y)}{\sigma(X)}\rho_{X, Y}\cdot X + \mathbb{E}[Y] - \dfrac{\sigma(Y)}{\sigma(X)}\rho_{X, Y}\cdot \mathbb{E}[X] \\ &= \mathbb{E}[Y] + \dfrac{\sigma(Y)}{\sigma(X)}\rho_{X, Y}\left(X - \mathbb{E}[X]\right)\text{.} \end{align}$$ Denote this function by $r_Y(X)$, i.e., the regression equator of $X$ to $Y$ based on a linear regression of $Y$ on $X$: $$r_Y(X) = \mathbb{E}[Y] + \dfrac{\sigma(Y)}{\sigma(X)}\rho_{X, Y}\left(X - \mathbb{E}[X]\right)\text{.}\tag{2}$$ If we were instead to regress $X$ on $Y$, the regression equator of $Y$ to $X$ would be $$r_X(Y) = \mathbb{E}[X] + \dfrac{\sigma(X)}{\sigma(Y)}\rho_{X, Y}\left(Y - \mathbb{E}[Y]\right)\text{.}\tag{3}$$
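
As a quick sanity check on $(2)$ and $(3)$ (my own sketch, not from the book), the simulation below draws from an assumed bivariate normal population and confirms that ordinary least-squares fits reproduce the two slopes $\rho_{X,Y}\,\sigma(Y)/\sigma(X)$ and $\rho_{X,Y}\,\sigma(X)/\sigma(Y)$.

```python
# Check that OLS fits of Y on X and of X on Y reproduce the slopes in (2) and (3).
# The bivariate normal population (rho = 0.6, sd(X) = 4, sd(Y) = 9) is assumed.
import numpy as np

rng = np.random.default_rng(42)
rho, sd_x, sd_y = 0.6, 4.0, 9.0
cov = [[sd_x**2, rho * sd_x * sd_y],
       [rho * sd_x * sd_y, sd_y**2]]
X, Y = rng.multivariate_normal([50.0, 60.0], cov, size=200_000).T

b_yx = np.polyfit(X, Y, 1)[0]   # slope of regressing Y on X
b_xy = np.polyfit(Y, X, 1)[0]   # slope of regressing X on Y

print(b_yx, rho * sd_y / sd_x)  # both about 1.35, matching (2)
print(b_xy, rho * sd_x / sd_y)  # both about 0.267, matching (3)
# By contrast, the linear equating line (1) has slope sd_y / sd_x = 2.25.
```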

What I am merely trying to show is the following:

  • If $Y$ is the response variable and $X$ is the explanatory variable in a linear regression, then given $X = x$, the fitted value $y = y(x)$ from that regression, when plugged into the regression that instead treats $X$ as the response and $Y$ as the explanatory variable, generally does not return the original value $x$ (and vice versa).

In other words, my aim is to show that $r_Y(r_X(Y)) \neq Y$ and $r_X(r_Y(X)) \neq X$ for $\rho_{X, Y} \neq 1$.
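
Numerically, this is easy to see by treating $(2)$ and $(3)$ as fixed fitted lines and composing them; the sketch below (my own, with an assumed population and $\rho_{X,Y} = 0.6$) shows the composition shrinks a score toward the mean by a factor of roughly $\rho_{X,Y}^2$ instead of returning it.

```python
# Compose the fitted line of Y on X with the fitted line of X on Y and check
# that a starting score x0 is not recovered when rho < 1. Population assumed.
import numpy as np

rng = np.random.default_rng(7)
rho, sd_x, sd_y, mu_x, mu_y = 0.6, 4.0, 9.0, 50.0, 60.0
cov = [[sd_x**2, rho * sd_x * sd_y],
       [rho * sd_x * sd_y, sd_y**2]]
X, Y = rng.multivariate_normal([mu_x, mu_y], cov, size=200_000).T

b_yx, a_yx = np.polyfit(X, Y, 1)   # fitted line for Y on X: y = b_yx * x + a_yx
b_xy, a_xy = np.polyfit(Y, X, 1)   # fitted line for X on Y: x = b_xy * y + a_xy

x0 = 55.0                          # an arbitrary form X score
y0 = b_yx * x0 + a_yx              # its fitted Y value
x1 = b_xy * y0 + a_xy              # plug y0 back into the X-on-Y line

print(x0, x1)                      # x1 is about mu_x + rho**2 * (x0 - mu_x) = 51.8, not 55
```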

However, I suspect something is wrong with my equations $(2)$ or $(3)$. If $X$ is the response variable and $Y$ is the explanatory variable, we obtain $$r_X(r_Y(X)) = \mathbb{E}[X] + \dfrac{\sigma(X)}{\sigma(r_Y(X))}\rho_{r_Y(X), X}\left(r_Y(X) - \mathbb{E}[r_Y(X)]\right)\text{.}$$ One can easily see that $\mathbb{E}[r_Y(X)] = \mathbb{E}[Y]$. Through variance and standard deviation calculations, one can also see that $$\text{Var}(r_Y(X)) = \rho^2_{X, Y}\text{Var}(Y)$$ so that $$\sigma(r_Y(X)) = |\rho_{X, Y}|\sigma(Y)\text{.}$$ Now \begin{align} \text{Cov}(r_Y(X), X) &= \text{Cov}\left( \mathbb{E}[Y] + \dfrac{\sigma(Y)}{\sigma(X)}\rho_{X, Y}\left(X - \mathbb{E}[X]\right), X\right) \\ &= \dfrac{\sigma(Y)}{\sigma(X)}\rho_{X, Y}\text{Var}(X) \\ &= \rho_{X, Y}\sigma(Y)\sigma(X)\text{.} \end{align} Therefore, $$\rho_{r_Y(X), X} = \dfrac{\rho_{X, Y}\sigma(Y)\sigma(X)}{|\rho_{X, Y}|\sigma(Y)\sigma(X)} = \dfrac{\rho_{X, Y}}{|\rho_{X, Y}|}\text{.}$$ Hence, \begin{align} r_X(r_Y(X)) &= \mathbb{E}[X] + \dfrac{\rho_{X, Y}}{|\rho_{X, Y}|} \cdot \dfrac{\sigma(X)}{|\rho_{X, Y}|\sigma(Y)}\left[\mathbb{E}[Y] + \rho_{X, Y}\cdot \dfrac{\sigma(Y)}{\sigma(X)}(X - \mathbb{E}[X]) - \mathbb{E}[Y] \right] \\ &= \mathbb{E}[X] + \dfrac{\rho_{X, Y}^2}{|\rho_{X, Y}|^2}(X - \mathbb{E}[X]) \\ &= X\text{.} \end{align} Thus, I suspect something is wrong here. Could someone point out exactly what I did wrong?

Clarinetist
  • Aren't all these issues addressed at https://stats.stackexchange.com/questions/22718? I'm unsure, because your notation "$r_X(r_Y(X))$" doesn't make sense: you can't regress the mathematical object $r_Y(X)$ against $X,$ but that's how you have tried to define $r_X.$ – whuber Apr 14 '21 at 15:42
  • @whuber I've read that particular link several times, and haven't figured out how to translate that to this problem. Maybe I should use a different notation for this, but the intent of the notation "$r_X(r_Y(X))$" is to say this: take a value $X$, and use regression to get a value $Y$ (hence $r_Y(X)$). But if we take this particular value of $Y$ given by $r_Y(X)$ and try to estimate $X$ using a regression with $Y$ being the explanatory variable and $X$ being the response variable, $r_X(r_Y(X))$ should not be equal to $X$ because these yield two regression lines when $\rho_{X, Y} \neq 1$. – Clarinetist Apr 14 '21 at 15:45
  • @whuber I suspect, then, something is wrong with my equations (2) and (3). – Clarinetist Apr 14 '21 at 15:45
  • Those equations are identical (because $\rho_{X,Y}=\rho_{Y,X},$ they differ merely by changing the roles of $X$ and $Y$ in the notation), so they are either both correct or both wrong! But, as I remarked, your notation makes no sense, so it's likely confusing your algebra. You can simplify your work (greatly) by assuming $E[X]=E[Y]=0.$ You may even assume $SD(X)=SD(Y)=1.$ It's unfortunate that the other thread doesn't have any adequate figures: you might find it helpful to plot both regression lines *within the same plot.* Scale the axes so that each variable's SD is the same length. – whuber Apr 14 '21 at 15:46
  • @whuber Ugh, I've been frustrated trying to learn psychometrics on my own since they aren't precise with their mathematical definitions. Back to the drawing board. – Clarinetist Apr 14 '21 at 15:48
  • @whuber I agree completely that *visually* that this should make sense. Kolen and Brennan explain this visually as well, but I'd like to formally justify this. – Clarinetist Apr 14 '21 at 15:52
  • It might be more helpful for me to just stick with more traditional statistics notation than trying to make sense of the notation Kolen and Brennan use. – Clarinetist Apr 14 '21 at 15:53
  • After you choose units appropriately (that is, by standardizing both variables), the regression of $Y$ against $X$ is $Y = \rho X,$ the regression of $X$ against $Y$ is $X=\rho Y$ (mathematically equivalent to $Y = X/\rho$), and the linear equation is $X=Y.$ That's perfectly rigorous and exposes all the relevant concepts. – whuber Apr 14 '21 at 15:56
  • @whuber Thank you, that's extremely helpful. I'll work out the details for this. – Clarinetist Apr 14 '21 at 15:58

1 Answer


Without loss of generality, assume $\mathbb{E}[X] = \mathbb{E}[Y] = 0$ and $\text{Var}(X) = \text{Var}(Y) = 1$.

Then $$r_{Y}(X) = \rho_{X, Y} \cdot X$$ and $$r_{X}(Y) = \rho_{X, Y} \cdot Y\text{.}$$ Hence, as long as $0 < |\rho_{X, Y}| < 1$, $$r_{Y}(r_{X}(Y)) = r_{Y}(\rho_{X, Y} Y) = \rho_{r_{X}(Y), Y}\rho_{X, Y}Y\text{.} $$ Now $$\rho_{r_X(Y), Y} = \rho_{\rho_{X, Y}Y, Y} = \dfrac{\rho_{X, Y}\cdot \text{Var}(Y)}{|\rho_{X, Y}|\cdot \text{Var}(Y)} = \pm 1\text{,}$$ hence $$r_{Y}(r_{X}(Y)) = \pm \rho_{X, Y}Y = |\rho_{X, Y}|\,Y \neq Y\text{.}$$

Thus, $r_Y \neq r_X^{-1}$.
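
A quick numerical check of the above (my sketch, with an assumed standardized bivariate normal population and $\rho_{X,Y} = 0.7$): form $r_X(Y) = \rho_{X,Y} Y$, re-estimate its correlation with $Y$, and confirm that $r_Y(r_X(Y))$ comes out as $|\rho_{X,Y}|\,Y$ rather than $Y$.

```python
# Numerical sketch of the answer, assuming a standardized bivariate normal (X, Y)
# with correlation rho = 0.7 (an arbitrary assumed value).
import numpy as np

rng = np.random.default_rng(0)
rho = 0.7
X, Y = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=100_000).T

rho_xy = np.corrcoef(X, Y)[0, 1]    # empirical rho_{X,Y}, about 0.7
z = rho_xy * Y                      # r_X(Y) = rho_{X,Y} * Y under standardization
rho_zy = np.corrcoef(z, Y)[0, 1]    # rho_{r_X(Y), Y}, equal to +1 here since rho > 0
pred = rho_zy * z                   # r_Y(r_X(Y)) = rho_{r_X(Y), Y} * r_X(Y)

slope = np.polyfit(Y, pred, 1)[0]   # regression of the composite prediction on Y
print(round(slope, 3))              # about 0.7 = |rho|: r_Y(r_X(Y)) is |rho| * Y, not Y
```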

Clarinetist
  • You lost me at the third line, because if $r_Y(\cdot)$ is supposed to be $\rho_{\cdot,Y}(\cdot),$ as indicated in the first line, where $(\cdot)$ indicates some random variable, then in the expression "$r_Y(r_X(Y))$" we see $(\cdot)=r_X(Y)=\rho_{X,Y}Y$, giving $$r_Y(r_X(Y))=\rho_{r_X(Y), Y}(r_X(Y))=\rho_{\rho_{X,Y}Y,Y}(\rho_{X,Y}Y).$$ Obviously $\rho_{\rho_{X,Y}Y,Y}=\pm 1,$ simplifying the result to $\pm\rho_{X,Y}Y.$ – whuber Apr 14 '21 at 17:37
  • @whuber Ah, you're completely right. Thanks for that. – Clarinetist Apr 14 '21 at 17:38
  • @whuber Thanks, I've made the edits. – Clarinetist Apr 14 '21 at 17:44