
In a discussion with a colleague, she told me that if a variable X_i in our design matrix (X) is highly correlated with our variable of interest (the target, y, etc.), it will make the regression unsolvable because:

  • In the closed-form solution, the inverse of the matrix would be impossible to find.
  • With an optimizer (lbfgs, saga, etc.), there will also be issues.

I can't seem to find a theoretical justification for this.

At most, what will happen is that we will have leakage and our variable X_i will have a huge weight.

Perhaps I'm wrong.

Leon palafox
  • "In the closed solution form the inverse of the matrix would be impossible to find." the inverse of the matrix does not depend in any way on the variable of interest, which suggests that the answer from @Demetri_Pananos is correct and that it is correlated X variables that are the problem. It isn't actually a big deal, you can just use the Moore-Penrose pseudo-inverse instead. Well written regression software will not find this a substantial problem. – Dikran Marsupial Apr 15 '21 at 14:13
  • 1
    In say natural science a predictor that is highly correlated with the response or target is often highly desirable, although not always a great contribution, e.g. it is not a surprise that yesterday's temperature is a fairly good predictor of today's temperature. In say social science or medicine a high correlation is usually a disturbing sign (e.g. of a banal theory). These comments address a statement in the question, even though answers to date have focused on an interpretation that the question is really about correlations among predictors. – Nick Cox Apr 15 '21 at 14:19
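
To illustrate the pseudo-inverse point from the comment above: even when the design matrix is rank deficient, the Moore-Penrose pseudo-inverse returns the minimum-norm least-squares solution where a plain matrix inverse would fail. A minimal sketch with hypothetical toy data (not taken from the question):

set.seed(1)
x1 <- rnorm(50)
x2 <- x1                          # perfectly collinear copy of x1
y  <- 1 + 2 * x1 + rnorm(50)
X  <- cbind(1, x1, x2)
# solve(t(X) %*% X) fails here because t(X) %*% X is singular;
# MASS::ginv() still yields minimum-norm least-squares coefficients
MASS::ginv(X) %*% y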

2 Answers


I believe your colleague meant if a column is highly correlated with another column of the design matrix, not with the response. If the design matrix is rank deficient, then the regression parameters are underdetermined (there are infinitely many parameter vectors that explain the data equally well).

High correlation is problematic for inference, but often the model can still be fit.

EDIT:

R will still fit a model with perfect collinearity

x = rnorm(100)
y = x                  # y is an exact copy of x: perfect collinearity
z = 2*x + rnorm(100)

lm(z~y+x)

Call:
lm(formula = z ~ y + x)

Coefficients:
(Intercept)            y            x  
    0.05419      1.92536           NA  

You will notice that R returns NA for one of the collinear columns. The method by which R determines which columns are collinear and which coefficients to set to NA is a mystery to me, but it might be found in the documentation.
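
For what it's worth, the behaviour appears to come from the pivoted QR decomposition that lm uses (see ?lm.fit and its tol argument): a column found to be linearly dependent on earlier columns gets an NA coefficient, and alias() reports which terms were dropped. A minimal check, reusing x, y, and z from above:

fit <- lm(z ~ y + x)
alias(fit)                    # reports that x is aliased with y (here x = 1 * y)
qr(model.matrix(fit))$rank    # rank 2 for a 3-column design matrix, hence one NA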

R will also fit a model with high collinearity

sig = matrix(c(1, 0.99, 0.99, 1), nrow = 2)   # correlation of 0.99 between the two predictors
X = MASS::mvrnorm(1000, c(0,0), Sigma = sig)
beta = c(2, 2)
y = X %*% beta + rnorm(1000)
z = X[,1]
w = X[,2]

lm(y~z+w)
Call:
lm(formula = y ~ z + w)

Coefficients:
(Intercept)            z            w  
   -0.04606      2.25132      1.73306  
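
Although the model fits, the cost of that collinearity shows up in the precision of the estimates rather than in whether lm can run at all. A quick check of the variance inflation factor (a sketch continuing the simulation above; the VIF of z here is 1/(1 - R^2) from regressing z on w):

r2 <- summary(lm(z ~ w))$r.squared
1 / (1 - r2)                  # roughly 1/(1 - 0.99^2), i.e. about 50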

So it doesn't matter how adamant your colleague is; the proof is in the pudding, so to speak.

Demetri Pananos

I use the phrase "variable of interest" to refer to the $X$-variable you want to know about, not $Y$. I do this commonly$^1$, and as far as I can tell it's uncontroversial. I can't guarantee that people use the phrase the same way in other fields or in other countries, but if they do, the comment is sensible.

A prototypical study has a variable (a treatment, an intervention, an exposure, etc.) that is theorized to be associated with a response. That variable might be called "$X_1$". In addition, the researchers control for a set of covariates, $X_2, \ldots, X_p$, that (1) enhance the power of the test of $X_1$, or (2) enhance the interpretation of the fitted coefficient for $X_1$. In such a case, we say $X_1$ is the variable of interest. The other variables are called "control variables" (or, more pejoratively, "nuisance variables"). However, both of those goals for covariates can be undermined if one of the covariates correlates too highly with $X_1$. By [arbitrary] convention, you have a problem with multicollinearity if the VIF is $>10$; that means you might$^2$ be wary of including a covariate if $r_{X_1,\,X_j} \ge .95$. I believe that is what your colleague was getting at$^3$.

  1. Here are some examples of answers on CV where I used the phrase in this way: 1, 2, 3.
  2. N.b., whether a potential covariate that is highly correlated with the exposure should be excluded from a planned model is much more complicated than this lets on. However, I would not be at all surprised to hear such a comment from a researcher.
  3. Note that @Demetri's points are still valid, though.
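
For what it's worth, the two cutoffs above do line up: with a single collinear covariate, the VIF of $X_1$ reduces to $1/(1 - r^2)$, so $r \approx .95$ sits right at the VIF $> 10$ convention. A minimal check in R:

r <- 0.95
1 / (1 - r^2)   # about 10.26, i.e. right at the conventional VIF > 10 cutoff
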
gung - Reinstate Monica
  • Interesting, so to speak. FWIW, I was not aware of this usage at all and would have guessed wildly that "variable of interest" implied the response or outcome. Are you aware of this usage across various application fields? (The OP seems to equate variable of interest and target.) – Nick Cox Apr 16 '21 at 13:36
  • @NickCox, I guess I don't really know. *I* use "variable of interest" to mean $X_1$ all the time. As far as I'm aware, that's uncontroversial. My guess at what happened is the OP's colleague said "variable of interest", not *variable of interest (target, y, etc)*, & that there was a miscommunication between what the colleague meant & the OP's interpretation of that comment. The colleague's comment is perfectly sensible if they meant a variable collinear w/ the exposure. Hence my answer. I'll edit this to make it clearer. – gung - Reinstate Monica Apr 16 '21 at 14:48