5

I keep hearing the term "regress out the variable" and understand that it roughly means removing the effects of that variable. But how does one do this mathematically?

I wish to learn how to do it in this example: The data set includes the variables brain volume, cortex thickness, age, and gender of 100 subjects. The variables of interest are brain volume and cortex thickness, and the nuisance variables that I wish to "regress out" are age and gender. How do I regress them out mathematically?

P.S.: I am aware of this similar question and that similar question, but after reading through them and their answers, I feel that how to actually DO it is still very vague. I believe a worked example like this one will greatly help future readers, so I am posting it anyway.

Sibbs Gambling

1 Answer

6

It seems to me that the following is the mathematically simplest way to partial-out variables from a correlated set of items.

Consider a correlation matrix $R$ for 5 items, where we want to "partial out" the first two variables. This is the initial correlation matrix: $$ \text{ R =} \small \begin{bmatrix} \begin{array} {rrrrr} 1.00& -0.15& 0.27& 0.53& 0.24\\ -0.15& 1.00& -0.09& -0.50& -0.34\\ 0.27& -0.09& 1.00& 0.22& 0.19\\ 0.53& -0.50& 0.22& 1.00& 0.47\\ 0.24& -0.34& 0.19& 0.47& 1.00 \end{array} \end{bmatrix} $$


Now we want to partial out the first item. We determine the vector of correlations of all variables with it; this gives the vector $f_1$ (which is just the first column of $R$): $$ f_1 = \small \begin{bmatrix} \begin{array} {r} 1.00\\ -0.15\\ 0.27\\ 0.53\\ 0.24 \end{array} \end{bmatrix} $$ Then build the matrix $R_1 = f_1 \cdot f_1^\tau$ $$ \text{ R}_1 =\small \begin{bmatrix} \begin{array} {rrrrr} 1.00& -0.15& 0.27& 0.53& 0.24\\ -0.15& 0.02& -0.04& -0.08& -0.04\\ 0.27& -0.04& 0.07& 0.14& 0.06\\ 0.53& -0.08& 0.14& 0.28& 0.12\\ 0.24& -0.04& 0.06& 0.12& 0.06 \end{array} \end{bmatrix} $$ and subtract this from the original matrix to get $R_{ \; \cdot 1} = R - R_1$:

$$ \text{ R}_{\ \cdot 1} =\small \begin{bmatrix} \begin{array} {rrrrr} 0.00& 0.00& 0.00& 0.00& 0.00\\ 0.00& 0.98& -0.05& -0.42& -0.30\\ 0.00& -0.05& 0.93& 0.07& 0.13\\ 0.00& -0.42& 0.07& 0.72& 0.35\\ 0.00& -0.30& 0.13& 0.35& 0.94 \end{array} \end{bmatrix} $$
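The step above can be reproduced numerically. The following is a minimal NumPy sketch, assuming the two-decimal values printed in $R$ as input, so the result may differ from the matrix shown in the last digit because of rounding. The variable names (`f1`, `R1`, `R_dot1`) are illustrative only.

```python
import numpy as np

# Correlation matrix R as printed above (two-decimal values).
R = np.array([
    [ 1.00, -0.15,  0.27,  0.53,  0.24],
    [-0.15,  1.00, -0.09, -0.50, -0.34],
    [ 0.27, -0.09,  1.00,  0.22,  0.19],
    [ 0.53, -0.50,  0.22,  1.00,  0.47],
    [ 0.24, -0.34,  0.19,  0.47,  1.00],
])

f1 = R[:, 0]            # correlations of all variables with variable 1
R1 = np.outer(f1, f1)   # the part of R "explained" by variable 1
R_dot1 = R - R1         # what remains after partialling out variable 1

print(np.round(R_dot1, 2))
```

The first row and column of `R_dot1` vanish, which is what "variable 1 no longer covaries with anything" means in matrix terms.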


Now we look at the partial vector $f_{2 \cdot 1}$. First, we extract the second column of the remaining covariance matrix $R_{\; \cdot 1}$. For $R_{2 \cdot 1} = f_{2 \cdot 1} \cdot f_{2 \cdot 1}^\tau$ to reproduce the entry in row 2, column 2 of $R_{\; \cdot 1}$, the extracted column must be rescaled by the square root of its second entry, $f_{2 \cdot 1} \leftarrow f_{2 \cdot 1} / \sqrt{ f_{2 \cdot 1}[2]}$. Thus we get: $$ f_{2 \cdot 1}= \small \begin{bmatrix} \begin{array} {r} 0.00\\ 0.99\\ -0.05\\ -0.42\\ -0.31 \end{array} \end{bmatrix} $$ Then $ \text{ R }_{2 \cdot 1} = f_{2 \cdot 1} \cdot f_{2 \cdot 1}^\tau $ and we find $$ \text{ R }_{2 \cdot 1} = \small \begin{bmatrix} \begin{array} {rrrrr} 0.00& 0.00& 0.00& 0.00& 0.00\\ 0.00& 0.98& -0.05& -0.42& -0.30\\ 0.00& -0.05& 0.00& 0.02& 0.01\\ 0.00& -0.42& 0.02& 0.18& 0.13\\ 0.00& -0.30& 0.01& 0.13& 0.09 \end{array} \end{bmatrix} $$ and after removing that covariance as well by $ \text{ R }_{ \cdot 12}= \text{ R }_{ \cdot 1}- \text{ R }_{ 2\cdot 1} $ we get

$$ \text{ R }_{ \cdot 12} =\small \begin{bmatrix} \begin{array} {rrrrr} 0.00& 0.00& 0.00& 0.00& 0.00\\ 0.00& 0.00& 0.00& 0.00& 0.00\\ 0.00& 0.00& 0.93& 0.05& 0.11\\ 0.00& 0.00& 0.05& 0.54& 0.22\\ 0.00& 0.00& 0.11& 0.22& 0.85 \end{array} \end{bmatrix} $$
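The second step can be sketched the same way. This is again a NumPy sketch under the assumption that the two-decimal $R$ is the input, so small last-digit differences from the printed matrices are expected. All variable names are illustrative.

```python
import numpy as np

# Correlation matrix R as printed above (two-decimal values).
R = np.array([
    [ 1.00, -0.15,  0.27,  0.53,  0.24],
    [-0.15,  1.00, -0.09, -0.50, -0.34],
    [ 0.27, -0.09,  1.00,  0.22,  0.19],
    [ 0.53, -0.50,  0.22,  1.00,  0.47],
    [ 0.24, -0.34,  0.19,  0.47,  1.00],
])

# Step 1: partial out variable 1.
f1 = R[:, 0]
R_dot1 = R - np.outer(f1, f1)

# Step 2: take column 2 of the remainder and rescale it by the square root
# of its own diagonal entry, so that the outer product reproduces that entry.
col = R_dot1[:, 1]
f21 = col / np.sqrt(col[1])
R21 = np.outer(f21, f21)
R_dot12 = R_dot1 - R21

print(np.round(f21, 2))
print(np.round(R_dot12, 2))
```

Without the rescaling, the outer product would have the squared diagonal entry in position (2, 2) and the subtraction would not zero out row and column 2.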


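This can be iterated analogously for any further variables to be partialled out. You can then analyze the remaining nonzero part as partial covariances, that is, the covariances that remain when the partialled-out variables are, so to speak, "held constant". Dividing each entry by the square roots of the two corresponding diagonal entries turns them into partial correlations.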
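The iteration can be wrapped in a small loop. The sketch below is one possible way to write it, not a library routine: `partial_out` is a hypothetical helper name. As a sanity check it compares the result with the Schur complement $R_{22} - R_{21} R_{11}^{-1} R_{12}$, which is the standard closed form for the same quantity, and then rescales the remaining block to unit diagonal to obtain partial correlations.

```python
import numpy as np

def partial_out(R, idx):
    """Sequentially remove the variables listed in idx from matrix R.

    Hypothetical helper: repeats the 'extract column, rescale, subtract
    outer product' step once per variable to be partialled out.
    """
    P = np.array(R, dtype=float)
    for k in idx:
        f = P[:, k] / np.sqrt(P[k, k])
        P = P - np.outer(f, f)
    return P

R = np.array([
    [ 1.00, -0.15,  0.27,  0.53,  0.24],
    [-0.15,  1.00, -0.09, -0.50, -0.34],
    [ 0.27, -0.09,  1.00,  0.22,  0.19],
    [ 0.53, -0.50,  0.22,  1.00,  0.47],
    [ 0.24, -0.34,  0.19,  0.47,  1.00],
])

out, rest = [0, 1], [2, 3, 4]
P = partial_out(R, out)
partial_cov = P[np.ix_(rest, rest)]   # remaining nonzero block

# Cross-check against the closed-form Schur complement.
A = R[np.ix_(out, out)]
B = R[np.ix_(out, rest)]
C = R[np.ix_(rest, rest)]
schur = C - B.T @ np.linalg.solve(A, B)

# Rescale to unit diagonal to get partial correlations.
d = np.sqrt(np.diag(partial_cov))
partial_corr = partial_cov / np.outer(d, d)

print(np.round(partial_cov, 2))
print(np.round(partial_corr, 2))
```

The loop and the closed form agree, so the order in which the nuisance variables are removed does not affect the final block.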
Gottfried Helms
  • Thanks a lot for guiding me through these steps. Two follow-up questions: (1) Say I have regressed out the first two columns and obtained the "reduced" matrix. I believe the two variables of interest (brain volume and cortex thickness in my case) will change after I regress out age and gender. How do I further use this covariance matrix to "modulate" my brain volume and cortex thickness data? (2) Indeed the two columns are nullified, meaning the first two variables no longer have covariance with the other variables. But what is the rationale behind this? Thanks a lot! – Sibbs Gambling Oct 05 '14 at 04:12
  • At 1): usually this is done by calculating the "residual" values of, for instance, *brain volume* when regressed on *age* and *gender*: the regression computes, for each data case, the *brain volume* predicted from the regressors and subtracts this from the original value of *brain volume*. At 2), the rationale: let *brain volume* increase with age, and *cortex thickness* too. Then correlate *brain volume* with *cortex thickness* after the part of the increase that is due to the increase of age has been subtracted. To see more, the answer that whuber linked to is really good! – Gottfried Helms Oct 05 '14 at 08:05
  • Thanks a lot. (1) I have regressed my data on age and gender and already obtained the $\beta$ vector and the residuals. Should I just use the residuals as my "new data"? From my understanding, these residuals are `myCurrentData - effectsByAge&Gender`. So I think I could just use them as my "new data"? (2) Assuming what I said so far is true, I notice that many of them are negative values. So if I use them as new data, how do they make physical sense? I mean, my data are brain volume. How could it be negative after the regression? – Sibbs Gambling Oct 06 '14 at 14:00
  • The negative values mean negative in *some abstract dimension* relative to some average. So if, say, age = 45y is the average, then a 20-year-old scores -25 on the *dimension "age"* compared to the average. The brain volume etc. must be understood similarly: it is not the volume itself but its difference from the *predicted volume* (predicted from age and gender). And those differences not only can be negative, some of them must be, because *all* such differences sum to zero when the regression includes an intercept (in the same way as the differences from the mean sum to 0) – Gottfried Helms Oct 06 '14 at 14:27
  • ... and if you have some cases with negative residuals and some with positive residuals, you can proceed and ask what further influence makes the one group have a brain volume below the prediction (by age and gender) and the other group one above it. And so on... – Gottfried Helms Oct 06 '14 at 14:31
  • I now understand why some residuals can/must be negative. But could you please confirm my 1st point? To reiterate: should I directly use these residuals as my "new" brain volumes with age and gender already regressed out? – Sibbs Gambling Oct 06 '14 at 14:31
  • Yes to your 1st point (If you've more questions I can be back only later, just have to go at the moment) – Gottfried Helms Oct 06 '14 at 14:32
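
The residual approach discussed in these comments can be sketched as follows. The data here are synthetic and purely illustrative: the coefficients, noise levels, and seed are assumptions, not values from the question. The sketch regresses both variables of interest on an intercept, age, and gender by ordinary least squares, keeps the residuals as the "age- and gender-adjusted" data, and checks that their correlation matches the partial correlation obtained from the correlation matrix.

```python
import numpy as np

# Synthetic stand-in for the 100 subjects (all numbers are made up).
rng = np.random.default_rng(0)
n = 100
age = rng.uniform(20, 80, n)
gender = rng.integers(0, 2, n).astype(float)   # coded 0/1
noise = rng.normal(size=(n, 2))
brain_volume = 1300 - 2.0 * age + 100 * gender + 40 * noise[:, 0]
cortex_thickness = 3.0 - 0.005 * age + 0.05 * gender + 0.1 * noise[:, 1]

# Design matrix with intercept; regress both outcomes at once.
X = np.column_stack([np.ones(n), age, gender])
Y = np.column_stack([brain_volume, cortex_thickness])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
residuals = Y - X @ beta      # the data with age and gender regressed out

# Residuals sum to zero and are uncorrelated with the regressors.
print(residuals.sum(axis=0))
print(X.T @ residuals)

# Correlation of the residuals = partial correlation given age and gender.
partial_corr = np.corrcoef(residuals, rowvar=False)[0, 1]

# Same quantity from the 4x4 correlation matrix via the Schur complement.
R = np.corrcoef(np.column_stack([age, gender, Y]), rowvar=False)
S = R[2:, 2:] - R[2:, :2] @ np.linalg.solve(R[:2, :2], R[:2, 2:])
partial_corr_from_R = S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])

print(partial_corr, partial_corr_from_R)
```

Because the residuals are centered on zero, roughly half of them are negative. If values on the original scale are wanted, a common convention is to add the mean of the original variable back to the residuals.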