Questions tagged [multicollinearity]

The situation in which there is a strong linear relationship among the predictor variables, so that their correlation matrix becomes (almost) singular. This ill-conditioning makes it hard to determine the unique role each predictor plays: estimation problems arise and standard errors are inflated. Predictors that are very highly correlated pairwise are one example of multicollinearity.

Multicollinearity refers to a situation in which predictor variables are (linearly) correlated with each other. Although the term is sometimes reserved for perfect correlation (i.e., $r=1$), it is more often used simply to mean strongly correlated. Multicollinearity need not show up in bivariate correlations: a variable can be close to a linear combination of several other variables even though all of its pairwise correlations with them are low.

Conceptually, the existence of multicollinearity means that it is difficult to determine the role each of the correlated variables plays. Mathematically, it manifests as larger standard errors for the coefficient estimates. Thus, collinearity reduces statistical power.
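
For ordinary least squares this can be made precise with a standard textbook variance decomposition: the sampling variance of the $j$-th slope estimate is

$$\operatorname{Var}(\hat\beta_j)=\frac{\sigma^2}{(n-1)\operatorname{Var}(x_j)}\cdot\frac{1}{1-R_j^2},$$

where $R_j^2$ is the $R^2$ from regressing $x_j$ on the other predictors. The second factor, $1/(1-R_j^2)$, is the variance inflation factor (VIF): as $x_j$ becomes more nearly a linear combination of the other predictors, $R_j^2 \to 1$ and the standard error of $\hat\beta_j$ grows without bound.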

Multicollinearity can produce counter-intuitive phenomena. For example, when a collinear variable is added to or dropped from a model, other variables can switch between significance and non-significance, and/or the sign of their relationship with the response can flip between positive and negative.

Additionally, when there is multicollinearity, small changes in the data can lead to large changes in the parameter estimates, even reversals of sign.

Detecting and addressing multicollinearity is an important topic in multivariable statistical modeling.
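
As a concrete illustration of the inflated standard errors and of the VIF diagnostic, here is a minimal simulation sketch (not drawn from any particular question below; it assumes numpy and statsmodels are available):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200

def fit_and_report(rho):
    # Two predictors with correlation rho; the true coefficients are 1 and 1.
    cov = [[1.0, rho], [rho, 1.0]]
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

    Xc = sm.add_constant(X)                       # add intercept column
    fit = sm.OLS(y, Xc).fit()
    vifs = [variance_inflation_factor(Xc, i) for i in (1, 2)]
    print(f"rho={rho:4.2f}  se={fit.bse[1:].round(3)}  VIF={np.round(vifs, 1)}")

for rho in (0.0, 0.90, 0.99):
    fit_and_report(rho)
# The standard errors and VIFs grow sharply as rho approaches 1,
# even though the data-generating coefficients never change.
```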

1074 questions
100
votes
9 answers

Is there an intuitive explanation why multicollinearity is a problem in linear regression?

The wiki discusses the problems that arise when multicollinearity is an issue in linear regression. The basic problem is that multicollinearity results in unstable parameter estimates, which makes it very difficult to assess the effect of independent…
user28
89
votes
1 answer

What correlation makes a matrix singular and what are implications of singularity or near-singularity?

I am doing some calculations on different matrices (mainly in logistic regression) and I commonly get the error "Matrix is singular", where I have to go back and remove the correlated variables. My question here is what would you consider a "highly"…
Error404
  • 1,261
  • 2
  • 13
  • 18
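
For intuition about the "Matrix is singular" error mentioned above, here is a minimal numpy sketch (illustrative values only) showing how the determinant and condition number of a correlation matrix degrade as a pairwise correlation approaches 1:

```python
import numpy as np

for r in (0.5, 0.9, 0.99, 0.999, 1.0):
    R = np.array([[1.0, r],
                  [r, 1.0]])
    det = np.linalg.det(R)      # tends to 0 as r -> 1
    cond = np.linalg.cond(R)    # tends to infinity as r -> 1
    print(f"r={r:6.3f}  det={det:9.6f}  cond={cond:14.1f}")
# At r = 1 the matrix is exactly singular and cannot be inverted,
# which is what estimation routines that need the inverse complain about.
```
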
84
votes
9 answers

Why is it possible to get significant F statistic (p<.001) but non-significant regressor t-tests?

In a multiple linear regression, why is it possible to have a highly significant F statistic (p<.001) but very high p-values on all the regressors' t-tests? In my model, there are 10 regressors. One has a p-value of 0.1 and the rest are above…
81
votes
0 answers

How can a regression be significant yet all predictors be non-significant?

My multiple regression analysis model has a statistically significant F value; however, all beta values are statistically non-significant. All the regression assumptions are met. No multicollinearity was found. Correlations among all predictors are…
Serene
  • 811
  • 1
  • 7
  • 3
70
votes
6 answers

Why is multicollinearity not checked in modern statistics/machine learning

In traditional statistics, while building a model, we check for multicollinearity using methods such as estimates of the variance inflation factor (VIF), but in machine learning, we instead use regularization for feature selection and don't seem to…
62
votes
3 answers

What is the effect of having correlated predictors in a multiple regression model?

I learned in my linear models class that if two predictors are correlated and both are included in a model, one will be insignificant. For example, assume the size of a house and the number of bedrooms are correlated. When predicting the cost of a…
57
votes
3 answers

Won't highly-correlated variables in random forest distort accuracy and feature-selection?

In my understanding, highly correlated variables won't cause multicollinearity issues in a random forest model (please correct me if I'm wrong). However, on the other hand, if I have too many variables containing similar information, will the model…
Yoki
  • 739
  • 1
  • 7
  • 10
36
votes
3 answers

Which variance inflation factor should I be using: $\text{GVIF}$ or $\text{GVIF}^{1/(2\cdot\text{df})}$?

I'm trying to interpret variance inflation factors using the vif function in the R package car. The function prints both a generalised $\text{VIF}$ and also $\text{GVIF}^{1/(2\cdot\text{df})}$. According to the help file, this latter value To…
jay
  • 1,045
  • 1
  • 12
  • 23
33
votes
3 answers

How to tell the difference between linear and non-linear regression models?

I was reading the following link on non linear regression SAS Non Linear. My understanding from reading the first section "Nonlinear Regression vs. Linear Regression" was that the equation below is actually a linear regression, is that correct? If…
31
votes
5 answers

How to test and avoid multicollinearity in mixed linear model?

I am currently running some mixed effect linear models. I am using the package "lme4" in R. My models take the form: model <- lmer(response ~ predictor1 + predictor2 + (1 | random effect)) Before running my models, I checked for possible…
mjburns
  • 1,077
  • 3
  • 12
  • 16
30
votes
3 answers

How to deal with multicollinearity when performing variable selection?

I have a dataset with 9 continuous independent variables. I'm trying to select amongst these variables to fit a model to a single percentage (dependent) variable, Score. Unfortunately, I know there will be serious collinearity between several of the…
Julie
  • 741
  • 2
  • 9
  • 17
29
votes
2 answers

Is PCA unstable under multicollinearity?

I know that in a regression situation, if you have a set of highly correlated variables this is usually "bad" because of the instability in the estimated coefficients (variance goes toward infinity as determinant goes towards zero). My question is…
probabilityislogic
  • 22,555
  • 4
  • 76
  • 97
28
votes
2 answers

Collinearity diagnostics problematic only when the interaction term is included

I've run a regression on U.S. counties, and am checking for collinearity in my 'independent' variables. Belsley, Kuh, and Welsch's Regression Diagnostics suggests looking at the Condition Index and Variance Decomposition…
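
A minimal sketch of the condition-index part of that Belsley, Kuh, and Welsch diagnostic (the variance-decomposition proportions are omitted; the design matrix X with its intercept column and the rule-of-thumb cutoff of about 30 are the usual conventions, not details from the question):

```python
import numpy as np

def condition_indices(X):
    """Condition indices of a design matrix: scale each column to unit
    length, then compare the largest singular value to each of the others."""
    Xs = X / np.linalg.norm(X, axis=0)        # column-equilibrated design
    s = np.linalg.svd(Xs, compute_uv=False)   # singular values, descending
    return s[0] / s                           # values above ~30 signal trouble

# Hypothetical example: x2 is nearly a copy of x1.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
X = np.column_stack([np.ones(100), x1, x2])
print(condition_indices(X))
```
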
27
votes
3 answers

How to systematically remove collinear variables (pandas columns) in Python?

Thus far, I have removed collinear variables as part of the data preparation process by looking at correlation tables and eliminating variables that are above a certain threshold. Is there a more accepted way of doing this? Additionally, I am aware…
orange1
  • 557
  • 1
  • 4
  • 9
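
A hedged sketch of the correlation-threshold pruning described in that question, assuming a pandas DataFrame df of numeric predictors (the column names and the 0.9 cutoff are illustrative, not from the question); it greedily drops one column from every pair whose absolute correlation exceeds the cutoff:

```python
import numpy as np
import pandas as pd

def drop_collinear(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Greedily drop one column from every pair with |correlation| > threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Illustrative usage with made-up data:
rng = np.random.default_rng(2)
df = pd.DataFrame({"a": rng.normal(size=50)})
df["b"] = df["a"] * 0.95 + rng.normal(scale=0.1, size=50)  # nearly duplicates "a"
df["c"] = rng.normal(size=50)
print(drop_collinear(df).columns.tolist())   # "b" is dropped; "a" and "c" remain
```
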
26
votes
1 answer

Logistic Regression - Multicollinearity Concerns/Pitfalls

In Logistic Regression, is there a need to be as concerned about multicollinearity as you would be in straight up OLS regression? For example, with a logistic regression, where multicollinearity exists, would you need to be cautious (as you would…
Brandon Bertelsen
  • 6,672
  • 9
  • 35
  • 46