
I'm having some trouble fully understanding partial correlation, and I was wondering if some of you could shed some light on my confusion.

Let's consider the following scenario: it is a known fact that heart disease is related to socioeconomic status. However, I want to understand whether Anger is also a factor. So the obvious next step is to find the correlation between Anger and heart disease while controlling for socioeconomic status.

There are a couple of ways I can do this. The popular way I found online was to use partial correlation (the ppcor package in R). However, when I looked into how they compute partial correlation, it didn't make a lot of sense to me mathematically. The way they do it is: say we have three variables ($X, Y, Z$) and we want to correlate $X$ and $Y$ while taking $Z$ into consideration; they take the residuals from correlating $X$ and $Y$, then $X$ and $Z$, and then they correlate the two residuals to get the result.
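For concreteness, here is a minimal sketch of the residual-based procedure partial correlation actually uses (regress $X$ on $Z$, regress $Y$ on $Z$, then correlate the two residual vectors). This is Python/NumPy rather than R's ppcor, and the simulated data and the helper name `residualize` are my own illustration, not from any package; the result is checked against the standard closed-form partial-correlation formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=n)                 # the control variable Z
x = 0.6 * z + rng.normal(size=n)       # X partly driven by Z
y = 0.4 * z + rng.normal(size=n)       # Y partly driven by Z

def residualize(a, b):
    """Residuals of the OLS regression of a on b (with intercept)."""
    bc = b - b.mean()
    beta = (a - a.mean()) @ bc / (bc @ bc)   # OLS slope
    return a - a.mean() - beta * bc

# Partial correlation: correlate the two sets of residuals
r_partial = np.corrcoef(residualize(x, z), residualize(y, z))[0, 1]

# Cross-check with the closed-form recursion formula
r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]
r_formula = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))
```

The two quantities agree exactly (up to floating point), which is why software can use either route.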

This doesn't make a lot of sense to me. If residuals are the variance that is not explained by the regression, then wouldn't it make more sense to take only the residuals from $X$ and $Z$, and then correlate those residuals with $Y$? That way we could see whether $Y$ can explain the variance of $X$ that is not explained by $Z$, thereby "controlling" for $Z$.

develarist
    `residuals from correlating X and Y` Correlation cannot produce residuals. It is regression - directed correlation - that can. X is regressed by Z and residuals saved (i.e. Z is washed out from X). Y is regressed by Z and residuals saved (i.e. Z is washed out from Y). All three variables must be standardized initially. [Then](http://stats.stackexchange.com/a/76819/3277) the two residuals correlate with each other. – ttnphns Jun 19 '16 at 20:22
    Regarding the statement "So the obvious next step is to find the correlation between Anger and heart disease while controlling for social and economic status" : https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation – shrey Jun 20 '16 at 10:22
  • how is anger measured – develarist Aug 26 '20 at 06:47

1 Answer


You say that, with $X = \alpha_1 + \beta_1 Z + \epsilon_1$ and $Y = \alpha_2 + \beta_2 Z + \epsilon_2$, you would prefer to consider $\text{Corr}(\epsilon_1,Y)$ rather than $\text{Corr}(\epsilon_1,\epsilon_2)$. However, notice that, since $\epsilon_1$ is independent of $Z$, the covariance would not change ($\text{Cov}(\epsilon_1,Y) = \text{Cov}(\epsilon_1,\epsilon_2)$), so you would only be replacing the standard deviation of $\epsilon_2$ with that of $Y$. Since $\text{SD}(Y) \ge \text{SD}(\epsilon_2)$, this would shrink your correlation toward zero for no reason.

In your case (let's call Anger $A$, socioeconomic status $SES$ and heart disease $HD$), you would be trying to see how much "residual" anger (after controlling for $SES$) affects $HD$. The problem is that, after regressing $A$ on $SES$, what is left (the residuals) cannot explain the part of $HD$ that has already been explained by $SES$. So, by using $HD$ instead of its residuals, you would basically be calculating a correlation bounded (both from below and above) by the explanatory power of $SES$ on $HD$. In practice, after controlling for $SES$ in this way, the correlation between $HD$ and any other variable would not range between $-1$ and $+1$, but between $-k$ and $+k$ (with $k<1$, and the higher the explanatory power of $SES$ on $HD$, the lower $k$).
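The shrinkage argument above can be checked numerically. Below is a small simulation (Python/NumPy; the data-generating model and coefficient values are invented purely for illustration) showing that the covariance is identical either way, while the "semipartial" variant $\text{Corr}(\epsilon_1, HD)$ is pulled toward zero relative to the partial correlation $\text{Corr}(\epsilon_1, \epsilon_2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
ses = rng.normal(size=n)                            # Z: socioeconomic status
anger = 0.7 * ses + rng.normal(size=n)              # A: anger, partly driven by SES
hd = 0.8 * ses + 0.3 * anger + rng.normal(size=n)   # HD: heart disease score

def residualize(a, b):
    """Residuals of the OLS regression of a on b (with intercept)."""
    bc = b - b.mean()
    beta = (a - a.mean()) @ bc / (bc @ bc)
    return a - a.mean() - beta * bc

e1 = residualize(anger, ses)   # epsilon_1: anger with SES washed out
e2 = residualize(hd, ses)      # epsilon_2: heart disease with SES washed out

# Same covariance either way, because e1 is orthogonal to SES ...
cov_e1_hd = np.cov(e1, hd)[0, 1]
cov_e1_e2 = np.cov(e1, e2)[0, 1]

# ... but the semipartial correlation is shrunk toward zero,
# because SD(hd) >= SD(e2).
partial = np.corrcoef(e1, e2)[0, 1]
semipartial = np.corrcoef(e1, hd)[0, 1]
```

Under this model the two covariances match to floating-point precision, while `semipartial` has a strictly smaller magnitude than `partial`, matching the bounded-by-$k$ behaviour described above.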

The Pointer