
I’m trying to find a way to measure how much a single variable ‘summarizes’ a full set of continuous variables. For instance, in a PCA the first principal component will explain a certain percentage of the total variability in a multivariate set. So, how can I obtain a similar measurement for a pre-existing (untransformed) variable?

For instance, how much does altitude (a single variable) explain overall climatic variability (i.e. multiple variables: mean precipitation, mean temperature, …)?

I am particularly interested in a measurement that is directly comparable to the variance explained by a PCA.

amoeba
  • Have you checked how this percentage is actually calculated in the PCA case? That could give you a hint. Apart from that: don't forget that PCAs are orthogonal whereas your variables may not be, so the percentages may not sum to 1. – Nick Sabbe Mar 20 '13 at 16:19
  • Dear Nick, you mean calculate the explained deviance? I've thought of that, but as you say my variables are not orthogonal, so I am not sure how to compare the explained deviance with the one calculated by PCA. I wonder if there is any sort of data transformation that will make both measurements compatible? – Edward Correia Mar 21 '13 at 09:28
  • It's perfectly valid to compare them, when looking at 1 variable at a time. Just remember that e.g. for a variable that explains 40% of the variance, only 60% will be explained when leaving that variable out. – Nick Sabbe Mar 21 '13 at 09:43
  • Thank you Nick, just one last question to check whether this makes sense: I want to compare the variance explained by the first principal component ‘summarizing’ climate with the variance explained by Altitude, so: 1. I extract the PC1 of the climate variables | 2. I linearly stretch Altitude to the same range as PC1 | 3. I calculate the explained variance of the original variables by PC1 | 4. I calculate the explained variance of the original variables by the scaled Altitude. Are these now comparable? Does this make any sense? Thank you for your time. – Edward Correia Mar 21 '13 at 11:56
  • No problem Edward. Upon rereading my earlier comment: I meant exactly the opposite of what it says: leaving out the 40 % variable will not reduce the variance explained to 60%. In fact, if the left out variable correlates perfectly with another set of variables, the variance explained may still be 100%. – Nick Sabbe Mar 21 '13 at 12:45
  • Unless I'm mistaken (but I could be wrong and am too lazy to check the math right now), the stretching of the explaining variable is irrelevant or at best arbitrary. Apart from that: yes, I believe this should work. – Nick Sabbe Mar 21 '13 at 12:48
  • Thanks a lot Nick. Hope I can somehow compensate in the future. – Edward Correia Mar 21 '13 at 13:10

1 Answer


There is a more general question here of which this one is a special case:

There are three answers there that reach different results, but I argue that mine is the correct one :) Namely, if the covariance matrix of the data is $\newcommand{\S}{\boldsymbol \Sigma} \S$ and if we consider a unit vector $\newcommand{\w}{\mathbf w} \w$, then the proportion of the total variance explained by the projection on this vector is given by $$R^2=\frac{\|\S \w\|^2}{\w^\top \S \w \cdot \mathrm{tr}(\S)}.$$
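To make the general formula concrete, here is a minimal NumPy sketch. The function name `explained_variance_fraction` and the synthetic data are my own illustration, not part of the linked post. It evaluates the expression for an arbitrary direction and checks that, for the first PCA eigenvector, it reproduces the usual "largest eigenvalue over trace" fraction.

```python
import numpy as np

def explained_variance_fraction(S, w):
    """R^2 = ||S w||^2 / (w' S w * tr(S)) for a direction w (hypothetical helper)."""
    S = np.asarray(S, dtype=float)
    w = np.asarray(w, dtype=float)
    # The formula is stated for a unit vector; the ratio does not change
    # under rescaling of w, so normalising here is only for tidiness.
    w = w / np.linalg.norm(w)
    Sw = S @ w
    return float(Sw @ Sw / ((w @ Sw) * np.trace(S)))

# Synthetic correlated data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
S = np.cov(X, rowvar=False)

# np.linalg.eigh returns eigenvalues in ascending order,
# so the last column of eigvecs is the first principal axis.
eigvals, eigvecs = np.linalg.eigh(S)
pc1 = eigvecs[:, -1]

r2_pc1 = explained_variance_fraction(S, pc1)
pca_fraction = eigvals[-1] / eigvals.sum()
print(r2_pc1, pca_fraction)  # the two numbers should agree
```

Any other direction passed as `w` should give a value no larger than the one for `pc1`, which is what makes the number directly comparable to PCA's explained variance.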

This question asks about a single variable (e.g. the first one), which means that $$\w = (\begin{array}{}1&0&...&0\end{array})^\top.$$

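Plugging this into the general formula, we obtain $$R^2 = \frac{\sum_k \sigma_{1k}^4/\sigma_{11}^2}{\mathrm{tr}(\S)},$$ where $\sigma_{ij}^2$ are the elements of $\S$ (so $\sigma_{1k}^4$ is the squared covariance between the first and the $k$-th variable, and the sum runs over all variables, including $k=1$).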

Note that if the first variable is uncorrelated with all the others (as is the case for PCA components), i.e. $\sigma_{1k}^2=0$ for all $k\ne 1$, then the formula reduces to the well-known PCA expression: $$R^2 = \frac{\sigma_{11}^2}{\mathrm{tr}(\S)}.$$
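As a sanity check on the single-variable case, here is a small NumPy sketch (the helper name `single_variable_r2` and the synthetic data are assumptions of mine). It computes the formula from one row of the covariance matrix, compares it with a brute-force calculation that regresses every variable on the chosen one, and confirms the uncorrelated special case on a diagonal covariance matrix.

```python
import numpy as np

def single_variable_r2(S, j=0):
    """Fraction of total variance explained by variable j (hypothetical helper).

    S[j, k] plays the role of sigma_{jk}^2 in the notation above, so the
    numerator is the sum over k of S[j, k]**2 divided by S[j, j].
    """
    S = np.asarray(S, dtype=float)
    return float(np.sum(S[j, :] ** 2) / (S[j, j] * np.trace(S)))

# Synthetic correlated data, purely for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)
S = np.cov(X, rowvar=False)

r2_formula = single_variable_r2(S, j=0)

# Brute force: regress each (centred) variable on variable 0 and add up
# the variance of the fitted values, then divide by the total variance.
z = Xc[:, 0]
betas = Xc.T @ z / (z @ z)          # one slope per variable
fitted = np.outer(z, betas)
r2_regression = fitted.var(axis=0, ddof=1).sum() / Xc.var(axis=0, ddof=1).sum()
print(r2_formula, r2_regression)    # the two numbers should agree

# Uncorrelated case: only the k = 1 term survives, giving sigma_11^2 / tr(S).
r2_diag = single_variable_r2(np.diag([3.0, 2.0, 1.0]), j=0)
print(r2_diag)  # 3 / (3 + 2 + 1) = 0.5
```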

Alternative derivation

We can obtain the same result via a different route. The proportion of variance of the $k$-th variable explained by the first variable is given by the square of the correlation coefficient $$R_{1k}^2 = \rho_{1k}^2 = \frac{\sigma_{1k}^4}{\sigma^2_{11}\sigma^2_{kk}}.$$ The amount (not the proportion) of explained variance is given by $R_{1k}^2\sigma_{kk}^2$. Taking the sum over all variables and dividing by the total variance, we obtain the same expression as above: $$R^2 = \frac{\sum_k R_{1k}^2\sigma_{kk}^2}{\sum_k \sigma_{kk}^2} = \frac{\sum_k \sigma^4_{1k}/\sigma^2_{11}}{\mathrm{tr}(\S)}.$$
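The correlation-based route is easy to check numerically. This is only a sketch on made-up data, with variable names of my own choosing: it weights each squared correlation with the first variable by that variable's variance, and compares the result with the covariance-based expression.

```python
import numpy as np

# Synthetic correlated data, purely for illustration.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))

S = np.cov(X, rowvar=False)          # S[j, k] corresponds to sigma_{jk}^2
R = np.corrcoef(X, rowvar=False)     # R[0, k] corresponds to rho_{1k}
variances = np.diag(S)

# Proportion of each variable's variance explained by the first variable,
# turned into an amount by multiplying with that variable's variance.
explained_amounts = R[0, :] ** 2 * variances
r2_via_corr = explained_amounts.sum() / variances.sum()

# Direct covariance-based expression for comparison.
r2_direct = np.sum(S[0, :] ** 2) / (S[0, 0] * np.trace(S))
print(r2_via_corr, r2_direct)        # the two numbers should agree
```

The first entry of `explained_amounts` is just the variance of the first variable itself, since it explains itself perfectly.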

amoeba
  • No problem, @Tim. I think that from the formula it is not immediately obvious that it has to be below $1$: the numerator can certainly be much larger than $\sigma_{11}^2$ alone; but if the general formula is correct (and for this, see my linked post), then it *has to be* below $1$. – amoeba Jan 29 '15 at 11:27