
In a discussion with a colleague, she told me that if a variable X_i in our design matrix (X) is highly correlated with our variable of interest (the target, y, etc.), it will make the regression unsolvable because:

  • In the closed-form solution, the inverse of the matrix would be impossible to find.
  • With an optimizer (lbfgs, saga, etc.), there will also be issues.

I can't seem to find a theoretical justification for this.

At most, what will happen is that we will have leakage and our variable X_i will have a huge weight.

Perhaps I'm wrong.

Leon palafox
  • "In the closed solution form the inverse of the matrix would be impossible to find." the inverse of the matrix does not depend in any way on the variable of interest, which suggests that the answer from @Demetri_Pananos is correct and that it is correlated X variables that are the problem. It isn't actually a big deal, you can just use the Moore-Penrose pseudo-inverse instead. Well written regression software will not find this a substantial problem. – Dikran Marsupial Apr 15 '21 at 14:13
  • 1
    In say natural science a predictor that is highly correlated with the response or target is often highly desirable, although not always a great contribution, e.g. it is not a surprise that yesterday's temperature is a fairly good predictor of today's temperature. In say social science or medicine a high correlation is usually a disturbing sign (e.g. of a banal theory). These comments address a statement in the question, even though answers to date have focused on an interpretation that the question is really about correlations among predictors. – Nick Cox Apr 15 '21 at 14:19
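
To illustrate the pseudo-inverse point from the comment above: even when the design matrix is rank deficient, the Moore-Penrose pseudo-inverse returns the minimum-norm least-squares solution where a plain matrix inverse would fail. A minimal sketch with hypothetical toy data (not taken from the question):

set.seed(1)
x1 <- rnorm(50)
x2 <- x1                          # perfectly collinear copy of x1
y  <- 1 + 2 * x1 + rnorm(50)
X  <- cbind(1, x1, x2)
# solve(t(X) %*% X) fails here because t(X) %*% X is singular;
# MASS::ginv() still yields minimum-norm least-squares coefficients
MASS::ginv(X) %*% y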

2 Answers


I believe your colleague meant if a column is highly correlated with another column of the design matrix, not with the response. If the design matrix is rank deficient, then the regression parameters are underdetermined (there are infinitely many parameter vectors that explain the data equally well).

High correlation is problematic for inference, but often the model can still be fit.

EDIT:

R will still fit a model with perfect collinearity

x = rnorm(100)
y = x                  # y is an exact copy of x: perfect collinearity
z = 2*x + rnorm(100)

lm(z~y+x)

Call:
lm(formula = z ~ y + x)

Coefficients:
(Intercept)            y            x  
    0.05419      1.92536           NA  

You will notice that R returns NA for one of the collinear columns. The method by which R determines which columns are collinear and which coefficients to set to NA is a mystery to me, but it might be found in the documentation.
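
For what it's worth, the behaviour appears to come from the pivoted QR decomposition that lm uses (see ?lm.fit and its tol argument): a column found to be linearly dependent on earlier columns gets an NA coefficient, and alias() reports which terms were dropped. A minimal check, reusing x, y, and z from above:

fit <- lm(z ~ y + x)
alias(fit)                    # reports that x is aliased with y (here x = 1 * y)
qr(model.matrix(fit))$rank    # rank 2 for a 3-column design matrix, hence one NA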

R will also fit a model with high collinearity

sig = matrix(c(1, 0.99, 0.99, 1), nrow = 2)   # correlation of 0.99 between the two predictors
X = MASS::mvrnorm(1000, c(0,0), Sigma = sig)
beta = c(2, 2)
y = X %*% beta + rnorm(1000)
z = X[,1]
w = X[,2]

lm(y~z+w)
Call:
lm(formula = y ~ z + w)

Coefficients:
(Intercept)            z            w  
   -0.04606      2.25132      1.73306  
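
Although the model fits, the cost of that collinearity shows up in the precision of the estimates rather than in whether lm can run at all. A quick check of the variance inflation factor (a sketch continuing the simulation above; the VIF of z here is 1/(1 - R^2) from regressing z on w):

r2 <- summary(lm(z ~ w))$r.squared
1 / (1 - r2)                  # roughly 1/(1 - 0.99^2), i.e. about 50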

So it doesn't matter how adamant your colleague is; the proof is in the pudding, so to speak.

Demetri Pananos

I use the phrase "variable of interest" to refer to the $X$-variable you want to know about, not $Y$. I do this commonly$^1$, and as far as I can tell it's uncontroversial. I can't guarantee that people use the phrase the same way in other fields or in other countries, but if they do, the comment is sensible.

A prototypical study has a variable (a treatment, an intervention, an exposure, etc.) that is theorized to be associated with a response. That variable might be called "$X_1$". In addition, the researchers control for a set of covariates, $X_2, \ldots, X_p$, that (1) enhance the power of the test of $X_1$, or (2) enhance the interpretation of the fitted coefficient for $X_1$. In such a case, we say $X_1$ is the variable of interest. The other variables are called "control variables" (or, more pejoratively, "nuisance variables"). However, both of those goals for covariates can be undermined if one of the covariates correlates too highly with $X_1$. By [arbitrary] convention, you have a problem with multicollinearity if the VIF is $>10$; that means you might$^2$ be wary of including a covariate if $r_{X_1,\,X_j} \ge .95$. I believe that is what your colleague was getting at$^3$.

  1. Here are some examples of answers on CV where I used the phrase in this way: 1, 2, 3.
  2. N.b., whether a potential covariate that is highly correlated with the exposure should be excluded from a planned model is much more complicated than this lets on. However, I would not be at all surprised to hear such a comment from a researcher.
  3. Note that @Demetri's points are still valid, though.
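
For what it's worth, the two cutoffs above do line up: with a single collinear covariate, the VIF of $X_1$ reduces to $1/(1 - r^2)$, so $r \approx .95$ sits right at the VIF $> 10$ convention. A minimal check in R:

r <- 0.95
1 / (1 - r^2)   # about 10.26, i.e. right at the conventional VIF > 10 cutoff
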
gung - Reinstate Monica
  • Interesting, so to speak. FWIW, I was not aware of this usage at all and would have guessed wildly that "variable of interest" implied the response or outcome. Are you aware of this usage across various application fields? (The OP seems to equate variable of interest and target.) – Nick Cox Apr 16 '21 at 13:36
  • @NickCox, I guess I don't really know. *I* use "variable of interest" to mean $X_1$ all the time. As far as I'm aware, that's uncontroversial. My guess at what happened is the OP's colleague said "variable of interest", not *variable of interest (target, y, etc)*, & that there was a miscommunication between what the colleague meant & the OP's interpretation of that comment. The colleague's comment is perfectly sensible if they meant a variable collinear w/ the exposure. Hence my answer. I'll edit this to make it clearer. – gung - Reinstate Monica Apr 16 '21 at 14:48