
The correlation between two ratios that share the same denominator can be spurious. Similarly, the correlation between two mathematically coupled variables could also be spurious. Are mathematically coupled variables (e.g., age and income are used to derive two formulas, which are then used as Y and X in a linear regression analysis) spurious predictors in linear regressions?
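To make the same-denominator case concrete, here is a minimal simulation sketch (Python/NumPy; the variable names and distributions are my own choices, not taken from any cited source): two ratios built from independent numerators but a shared denominator show a clear correlation even though the numerators are unrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.lognormal(size=n)  # numerator of the first ratio
y = rng.lognormal(size=n)  # numerator of the second ratio
z = rng.lognormal(size=n)  # denominator shared by both ratios

print(np.corrcoef(x, y)[0, 1])          # near 0: the numerators are independent
print(np.corrcoef(x / z, y / z)[0, 1])  # clearly positive: induced by the shared z
```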

KuJ
  • I'm confused by your terminology. Could you give a specific example? – Michelle Jan 26 '12 at 04:32
  • For example, BMI=BW/BH^2 and obesity=BW^2/BH are two mathematically coupled variables. Is obesity then a spurious predictor of BMI that should never be used? – KuJ Jan 26 '12 at 05:27
  • Obesity is defined as above a specific BMI cut-point, and it's categorical rather than continuous, so I am not familiar with your second formula. Given the definitional association, I would be concerned to see either being used in a regression to predict the other. – Michelle Jan 26 '12 at 05:36
  • "Obesity" as defined above is a continuous (instead of a categorical) variable and is only a (in)convenient term used to illustrate my point. You can call it "new-BMI" or any other terms. – KuJ Jan 26 '12 at 07:57
  • Can you give a fuller example? This seems like a pretty silly thing to do, so I would like more information about the particular situation you're thinking of. Can you add a link to your question to illustrate with a specific example? – Michelle Jan 26 '12 at 09:21
  • For example, the dialysis dose (Kt/V) and the protein catabolic rate (PCR) are both calculated from pre-dialysis and post-dialysis blood urea nitrogen levels in hemodialysis and are thus mathematically coupled (the formulae are given in "Mathematical Coupling and the Association Between Kt/V and PCRn," Seminars in Dialysis 1999;12:S20-S28, http://onlinelibrary.wiley.com/doi/10.1046/j.1525-139X.1999.90204.x/abstract). My question is: can PCR be used as a covariate to adjust for the effect of a key predictor variable (such as systolic blood pressure level) on Kt/V? – KuJ Jan 26 '12 at 14:09
  • Jinn-Yuh, this may sound philosophical but I think it gets at part of the issue: what's the difference between "mathematically coupled" and simply not independent? In many cases we use sets of explanatory variables that clearly lack independence. We can remove their correlations with linear transformations (to orthogonalize them), which effectively expresses the original variables as "mathematically coupled" versions of the orthogonal variables. What is different between this situation and (dialysis dose, PCR) or (BMI, obesity)? – whuber Jan 26 '12 at 14:14

1 Answer


When the independent and dependent variables in a regression equation share a common component, you can frequently re-write the equation to show that the independent variable containing the common component is correlated with the error term of the regression. For an example besides ratios, I give one in another answer on this site, regarding including the baseline as a control variable when the dependent variable is the change score from baseline (hence the dependent variable is "mathematically coupled" with an independent variable). That example is much more innocuous, though, because of the exact linear relationship between the variables, whereas ratios of variables can have worse consequences for interpretation (and for other estimated parameters) because they cannot be re-expressed as different linear combinations.
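As a minimal sketch of that re-writing (my own notation, not taken from the linked answer): suppose the dependent variable is the change from baseline, and the baseline, which contains a noise component, is included as a covariate,

$Y_2 - Y_1 = \alpha + \beta Y_1 + \epsilon, \quad Y_1 = T_1 + e_1$

The noise $e_1$ then enters the outcome with a negative sign (through the $-Y_1$ on the left) and the covariate with a positive sign (through the $Y_1$ on the right), so the covariate shares a component with the part of the outcome the model does not explain; the covariate is correlated with the error term, which is exactly the dependency the re-writing exposes.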

So it is best to avoid cases where the independent and dependent variables are "pre-processed" in a way that induces some type of dependency between them (this seems to be what Pearson originally meant when he referred to "spurious" correlation; Aldrich, 1995). For ratios, Kronmal (1993) even suggests that ratio variables should always be avoided (even when only a single independent variable is a ratio), because the functional form imposed by the ratio is more restricted than when the two component variables (and their interaction) are included in the regression equation. This still sometimes leaves room for theoretical considerations to guide whether to use ratio variables, but in many observational studies in the social sciences it is more reasonable to avoid ratio variables than to assume the more specific functional form implied by the ratio of the two variables (Firebaugh, 1985).
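As an illustration of that contrast (a sketch in Python/statsmodels on simulated data; the variable names, the data-generating process, and the use of the reciprocal of the denominator are my own choices rather than Kronmal's exact setup), the ratio-only specification is a restricted version of a model that enters the numerator, the reciprocal of the denominator, and their product separately:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({"x1": rng.lognormal(size=n), "x2": rng.lognormal(size=n)})
df["y"] = 2.0 + 0.5 * df["x1"] - 0.3 * df["x2"] + rng.normal(scale=0.5, size=n)

# Ratio-only model: forces the effect to work through x1/x2 alone
ratio_model = smf.ols("y ~ I(x1 / x2)", data=df).fit()
# Component model: x1, 1/x2, and their product (the product term equals x1/x2)
full_model = smf.ols("y ~ x1 * I(1 / x2)", data=df).fit()

print(ratio_model.rsquared, full_model.rsquared)  # the fuller model fits at least as well
```

Whether the extra flexibility is wanted is the theoretical question referred to above; the sketch only shows that the ratio form is the more constrained of the two specifications.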

I don't see why these arguments wouldn't apply to any type of "mathematically coupled" relationship, and hence I suspect it is much easier to interpret the original components on their own than to interpret coupled versions of them together. A similar line of thinking, with other illustrative examples, appears in another related question on the site, Including the interaction but not the main effects in a model. In that thread, though, whuber and wolfgang both give examples that counter this argument.


Just for further illustration, I will give a recent example from some of my own work. I was working on a project that included a multi-wave panel survey with several Likert scales measured at each wave. One theoretical model my co-author explicated was that a specific outcome at the second wave (wave b) was an effect of both the baseline Likert scale score at wave a and the change in the Likert scale score from wave a to wave b. This can be represented by the model:

$Y = \alpha + \beta_1 L_a + \beta_2 (L_b - L_a) + \epsilon$

where $L_b$ is the Likert score at wave b and $L_a$ is the Likert score at wave a. The above equation is difficult to interpret, however, because the model can be equivalently written as:

$Y = \alpha + (\beta_1 - \beta_2) L_a + \beta_2 L_b + \epsilon$

Again, this is innocuous in that it doesn't affect the estimation of the other parameters in the model, but the use of variables that are simply re-expressions of one another makes interpretation difficult.
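A quick numerical check of that equivalence (a sketch in Python with NumPy/statsmodels on simulated data; the names and coefficient values are made up): both parameterizations give identical fitted values, the coefficient on $(L_b - L_a)$ equals the coefficient on $L_b$, and the coefficients on $L_a$ differ by exactly that amount.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
L_a = rng.normal(size=n)
L_b = 0.6 * L_a + rng.normal(size=n)
Y = 1.0 + 0.4 * L_a + 0.8 * (L_b - L_a) + rng.normal(scale=0.5, size=n)

# Parameterization 1: baseline and change score
m1 = sm.OLS(Y, sm.add_constant(np.column_stack([L_a, L_b - L_a]))).fit()
# Parameterization 2: baseline and follow-up score
m2 = sm.OLS(Y, sm.add_constant(np.column_stack([L_a, L_b]))).fit()

print(m1.params)  # approximately [alpha, beta1, beta2]
print(m2.params)  # approximately [alpha, beta1 - beta2, beta2]
print(np.allclose(m1.fittedvalues, m2.fittedvalues))  # True: identical fit
```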


Citations

Aldrich, J. (1995). Correlations genuine and spurious in Pearson and Yule. Statistical Science, 10(4), 364–376.

Firebaugh, G., & Gibbs, J. P. (1985). User's guide to ratio variables. American Sociological Review, 50(5), 713–722.

Kronmal, R. A. (1993). Spurious correlation and the fallacy of the ratio standard revisited. Journal of the Royal Statistical Society, Series A, 156(3), 379–392.

Andy W