This is one of those situations where what is theoretically true and what is true in practice can be quite different. I'll try to give an example.
Let's suppose we have centered and standardized both $X$ and $y$ so that:
- The predictor covariance matrix is $\Sigma = X^{t} X$; since the predictors are standardized, its diagonal entries equal $1$.
- The intercept estimate is $\beta_0 = 0$.
I'll focus on the case of linear regression, and try to say something about general GLMs at the end. I'll also assume we have only two predictors, since that case captures all the essential points of the situation.
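To make this concrete, here is a minimal sketch of that setup in Python/NumPy (the sample size, the correlation level 0.8, and the coefficients 1.5 and -0.5 are arbitrary choices for illustration); the later snippets build on it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two correlated predictors and a response that depends on both.
rho = 0.8
X = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

# Center and standardize both X and y, as assumed above.
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = (y - y.mean()) / y.std()
```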
The solution parameter estimates for a linear model satisfy the equation
$$ X^t X \beta = X^t y $$
which, under our assumptions, can be written as
$$ \Sigma \beta = X^t y $$
On the right hand side, we are simply taking the dot product of the response vector $y$ with each column of $X$, so we can write
$$ X^t y = \left( \begin{array}{c} cov(X_1, y) \\ cov(X_2, y) \end{array} \right) $$
On the left hand side, we get
$$ \Sigma \beta = \left( \begin{array}{c} \beta_1 + cov(X_1, X_2) \beta_2 \\ cov(X_1, X_2) \beta_1 + \beta_2 \end{array} \right) $$
So the system of equations is
$$ \beta_1 + cov(X_1, X_2) \beta_2 = cov(X_1, y) $$
$$ cov(X_1, X_2) \beta_1 + \beta_2 = cov(X_2, y) $$
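Continuing the sketch above, this $2 \times 2$ system can be set up and solved numerically, and the solution agrees with an off-the-shelf least-squares fit:

```python
# Build the system Sigma * beta = X^t y. With standardized data these entries
# are the sample covariances, and the diagonal of Sigma is exactly 1.
Sigma = X.T @ X / n      # [[1, cov(X_1, X_2)], [cov(X_1, X_2), 1]]
xty = X.T @ y / n        # [cov(X_1, y), cov(X_2, y)]

beta = np.linalg.solve(Sigma, xty)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta, beta_lstsq)  # the two should agree up to floating point
```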
As a sanity check, if the predictors are uncorrelated, we get
$$ \beta_1 = cov(X_1, y) $$
$$ \beta_2 = cov(X_2, y) $$
which is intuitive.
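As a quick numerical check of this special case (a fresh pair of predictors drawn with zero population correlation; the sample correlation is only approximately zero, so the match is approximate):

```python
# Uncorrelated predictors: each coefficient is (approximately) just cov(X_j, y).
Z = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=n)
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
v = 1.5 * Z[:, 0] - 0.5 * Z[:, 1] + rng.normal(size=n)
v = (v - v.mean()) / v.std()

beta_uncorr, *_ = np.linalg.lstsq(Z, v, rcond=None)
print(beta_uncorr)   # close to ...
print(Z.T @ v / n)   # ... the plain covariances cov(Z_1, v) and cov(Z_2, v)
```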
So now, what if the predictors are correlated? Then we can solve the system by multiplying the bottom equation through by $cov(X_1, X_2)$ to get
$$cov(X_1, X_2)^2 \beta_1 + cov(X_1, X_2) \beta_2 = cov(X_1, X_2) cov(X_2, y) $$
Then subtracting this from the top equation cancels out the $\beta_2$ terms
$$(1 - cov(X_1, X_2)^2) \beta_1 = cov(X_1, y) - cov(X_1, X_2) cov(X_2, y) $$
which can be solved for $\beta_1$:
$$\beta_1 = \frac{cov(X_1, y) - cov(X_1, X_2) cov(X_2, y)}{(1 - cov(X_1, X_2)^2) } $$
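Continuing the sketch, this closed-form value matches the first coefficient from the least-squares solve above:

```python
# Plug the sample covariances into the closed-form expression for beta_1.
c1y = X[:, 0] @ y / n          # cov(X_1, y)
c2y = X[:, 1] @ y / n          # cov(X_2, y)
c12 = X[:, 0] @ X[:, 1] / n    # cov(X_1, X_2)

beta1_closed = (c1y - c12 * c2y) / (1.0 - c12 ** 2)
print(beta1_closed, beta[0])   # should agree up to floating point
```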
So what do we see here?
- The numerator is what you would get naively if you first regressed $y$ on $X_2$, and then regressed the residuals on $X_1$.
- The denominator is the "correction" to the above procedure. If you stop after the step-by-step procedure in the first bullet point, you under-explain the variance in $y$ due to $X_1$ and $X_2$. This makes sense, because that procedure ignores the additional variance arising from the fact that $X_1$ and $X_2$ tend to move together.
- If $X_1$ and $X_2$ are tightly correlated, then the denominator is close to zero, so any errors in estimating $cov(X_1, X_2)$ get magnified in the final coefficient estimates. This explains why parameter estimates can be so unstable in high-correlation situations; the simulation sketch below illustrates this.
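A small simulation makes the last point vivid. This sketch (sample size, replication count, and true coefficients are again arbitrary choices) compares the spread of the fitted coefficients across repeated samples at low versus high predictor correlation:

```python
def coef_spread(rho, n=200, reps=2000, seed=1):
    """Standard deviation of the fitted coefficients across repeated samples."""
    rng = np.random.default_rng(seed)
    betas = np.empty((reps, 2))
    for i in range(reps):
        X = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
        y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)
        betas[i], *_ = np.linalg.lstsq(X, y, rcond=None)
    return betas.std(axis=0)

print(coef_spread(0.0))   # modest sampling variability
print(coef_spread(0.99))  # much larger variability in both coefficients
```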
An analysis of a general GLM is much harder to work through, but I'll mention one thing. The GLM fitting algorithm reduces to a repeated application of a weighted linear fitting step (using Newton's method; this is usually called iteratively reweighted least squares). The same considerations hold at each step of that procedure, so you can see how the same general phenomena carry over to the final estimates as well.
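To make that a bit more concrete, here is a minimal sketch of IRLS for logistic regression (assuming a binary 0/1 response, no intercept, a fixed number of iterations, and no convergence or separation checks). Each iteration solves a weighted version of the same normal equations, so a nearly singular $X^t W X$ causes the same instability at every step:

```python
def irls_logistic(X, y, n_iter=25):
    """Fit logistic regression coefficients by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))   # fitted probabilities
        w = p * (1.0 - p)                # IRLS weights
        z = eta + (y - p) / w            # working response
        # Weighted normal equations: (X^t W X) beta = X^t W z.
        # With highly correlated predictors, X^t W X is close to singular,
        # just as X^t X was in the linear case.
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta
```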