
I'm making up my own notation here because I don't know the standard notation (but I would love help with the notation too):

Let $x // z$ denote $x$ residualized for $z$. The fraction of variance in $x$ explained by $z$ is then $R^2 = var_z(x)/var(x)$, where $var_z(x) = var(x) - var(x//z)$. Here, $var_z(x)$ is in some sense the total explained variance of $x$ by $z$ (whereas $R^2$ is the fraction of explained variance). However, since the variance of a variable is just the covariance of the variable with itself, this suggests that we can generalize explained variance to "explained covariance" by defining $cov_z(x, y) = cov(x, y) - cov(x//z, y//z)$, which can further be turned into a sort of "$z$-correlation" via $r_{xy\leftarrow z}=\frac{cov_z(x, y)}{\sqrt{var_z(x)var_z(y)}}$.
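Concretely, here is how I would estimate these quantities (a sketch only; residualization is done by ordinary least squares with an intercept, and all function names are my own):

```python
import numpy as np

def residualize(x, Z):
    """Return x // Z: residuals of x after OLS regression on Z (with intercept)."""
    Zc = np.column_stack([np.ones(len(x)), Z])
    beta, *_ = np.linalg.lstsq(Zc, x, rcond=None)
    return x - Zc @ beta

def z_correlation(x, y, Z):
    """Estimate r_{xy<-z} = cov_z(x, y) / sqrt(var_z(x) * var_z(y))."""
    xr, yr = residualize(x, Z), residualize(y, Z)
    # "explained covariance": total covariance minus residual covariance
    cov_z = np.cov(x, y)[0, 1] - np.cov(xr, yr)[0, 1]
    var_zx = np.var(x, ddof=1) - np.var(xr, ddof=1)
    var_zy = np.var(y, ddof=1) - np.var(yr, ddof=1)
    return cov_z / np.sqrt(var_zx * var_zy)
```

For example, if $x$ and $y$ share a common component driven by $z$ plus independent noise, this estimate should come out near $1$.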

I'm familiar with the concept of genetic correlation. Is the quantity $r_{xy\leftarrow z}$ analogous to genetic correlation? For instance, if $z$ is a person's genes, is the $z$-correlation the genetic correlation? (In the use cases I have in mind, $z$ is not actually genes, but I'm asking because I'm interested in something analogous to genetic correlation for other domains. $z$ should be thought of as multidimensional, as otherwise this all becomes rather trivial.)

If it is analogous to genetic correlation, are there better ways of estimating it than directly computing the expressions I wrote above? Further, are there more standard names for these things, and are there resources anywhere on how to compute error bounds on the estimates?

I've tried computing it in two ways. First, by just using standard least-squares linear regression for the residualization. This appeared to work about the way one would expect, but I'm asking here because it's probably unwise to just go with the statistical tools without understanding them properly. Second, since I was worried about overfitting, I tried residualizing the variables in a leave-one-out manner. However, this frequently yielded $z$-correlations that were much higher than $1$ or much lower than $-1$. Is there a better approach to residualization than these two?
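For concreteness, the leave-one-out residualization I mean can be sketched like this (my own reconstruction of the procedure, not necessarily a sound estimator given the problem above):

```python
import numpy as np

def loo_residualize(x, Z):
    """Leave-one-out residuals: the regression used for point i is fit without point i."""
    n = len(x)
    Zc = np.column_stack([np.ones(n), Z])  # add intercept column
    res = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(Zc[mask], x[mask], rcond=None)
        res[i] = x[i] - Zc[i] @ beta
    return res
```

These residuals then get plugged into the same $cov_z$/$var_z$ formulas as before. Note that leave-one-out residuals are systematically larger in magnitude than in-sample residuals, which may be relevant to the out-of-range values.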

tailcalled
  • The idea of multivariate R or R-square is the theme of canonical correlation analysis and of MANOVA. This domain also goes by the names multivariate general linear model and multivariate linear regression. What corresponds there to the R-sq of univariate linear regression is called Pillai's trace. And the correlation is indeed multidimensional, as you foresee: it is the sum of the squared canonical correlations, which are simple (univariate) correlations between the "canonical variates", the latent features from both sides - the predictors X and the predictands Y. – ttnphns Jul 22 '19 at 08:02
  • (cont.) When you have 2+ variables as Y and only one variable as X, then Pillai's trace is equal to the R-sq of the regression of this X on those Y. – ttnphns Jul 22 '19 at 08:06
  • Pillai's trace is not the only coefficient that can be computed in multivariate settings; others are considered in detail, for example, in [this answer](https://stats.stackexchange.com/a/255444/3277), which focuses on MANOVA (i.e. when the predictor is a categorical factor). – ttnphns Jul 22 '19 at 08:12
  • To understand simply what canonical correlation analysis is: https://stats.stackexchange.com/a/65817/3277 – ttnphns Jul 22 '19 at 08:14
  • What you're mentioning looks useful and interesting, and I'm definitely going to learn about it to use it for the things I'm working on. However, it appears to be somewhat different from the question I'm asking, as it doesn't seem to have something like the $z$-correlation I mentioned. – tailcalled Jul 22 '19 at 10:12
  • Yes, it may be different from what you ask; that is why it is a comment. I am new to genetic correlation, so I can't help with it. – ttnphns Jul 22 '19 at 10:58
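The relationship ttnphns states in the comments (with a single X variable, Pillai's trace equals the R-sq of regressing that X on the Y set) can be checked numerically. The following is an illustrative sketch using a standard QR/SVD computation of canonical correlations; all function names are my own:

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the column sets X and Y (centered internally)."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))  # orthonormal basis of centered X columns
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))  # orthonormal basis of centered Y columns
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def pillai_trace(X, Y):
    """Pillai's trace: the sum of the squared canonical correlations."""
    return float(np.sum(canonical_correlations(X, Y) ** 2))
```

With a one-column X, `pillai_trace(X, Y)` should match the ordinary R-sq from regressing X on Y with an intercept.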

0 Answers