
I am running a negative binomial model and one of my predictor variables is a count variable. Since this variable was heavily skewed, I decided to log-transform it.

However, the effect of this variable is hypothesized to be non-linear, so I also include a squared term. As soon as I do, I obtain VIFs above 20 for these two variables, while all other predictors remain stable at VIFs between 1 and 5.

To my current understanding, the relationship between a variable and its square is not linear, and hence multicollinearity should not arise.

Can anyone explain the cause of the multicollinearity and give possible solutions to this problem?
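A minimal reproduction sketch in Python (the count range 1–650 is taken from the comments below; everything else, including the use of natural log, is illustrative — the correlation is unchanged by the log base):

```python
import numpy as np

# Illustrative counts covering 1..650 (the range mentioned in the comments).
x = np.arange(1, 651, dtype=float)
u = np.log(x)      # log-transformed predictor
u2 = u ** 2        # squared term added to the model

# With just these two regressors, VIF = 1 / (1 - r^2),
# where r is their pairwise correlation.
r = np.corrcoef(u, u2)[0, 1]
vif = 1.0 / (1.0 - r ** 2)
print(f"corr = {r:.3f}, VIF = {vif:.1f}")  # large VIF, as in the question
```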

statsnewby
  • Well, f(x) = x^2, which is where the collinearity arises. If you want to reduce the collinearity between x and x^2, I suggest centering x and then squaring the centered covariate. See this post: http://www.theanalysisfactor.com/centering-for-multicollinearity-between-main-effects-and-interaction-terms/ – Brash Equilibrium Nov 08 '17 at 16:01
  • What is the domain of $x$? For very small values $x$ could be considered approximately $x^2$. – Dan Nov 08 '17 at 16:03
  • x is between 1 and 650, but after the log transformation the values are obviously much smaller (between 0 and 2.8) – statsnewby Nov 08 '17 at 16:06
  • It seems that you assume the relationship between a variable and its square (it happens to be the log of $x$ and the square of that log, but that is not so important here) is not a linear one and so they are uncorrelated. Others have explained the error already, but you may be interested in this related thread: [Pearson correlation between a variable and its square](https://stats.stackexchange.com/q/297685/22228). – Silverfish Nov 08 '17 at 22:34

2 Answers


Except for very small counts, $\log(x)^2$ is essentially a linear function of $\log(x)$:

Figure showing plots and linear fits

The colored lines are least squares fits to $\log(x)^2$ vs $\log(x)$ for various ranges of counts $x$. They are extremely good once $x$ exceeds $10$ (and still awfully good even when $x\gt 4$ or so).
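This is easy to check numerically; here is a small sketch using numpy, with lower cutoffs chosen to mirror the ranges in the figure:

```python
import numpy as np

# Correlation between log(x) and log(x)^2 over several count ranges.
# (Natural log is used; the correlation does not depend on the base.)
for lo in (1, 4, 10, 100):
    x = np.arange(lo, 651, dtype=float)
    u = np.log(x)
    r = np.corrcoef(u, u ** 2)[0, 1]
    print(f"x in [{lo:3d}, 650]: corr(log x, (log x)^2) = {r:.4f}")
```

The correlation climbs toward 1 as the lower end of the range moves away from $x = 1$.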

Introducing the square of a variable is sometimes used to test goodness of fit, but (in my experience) it is rarely a good choice as an explanatory variable. To account for a nonlinear response, consider these options:

  • Study the nature of the nonlinearity. Select appropriate variables and/or transformation to capture it.

  • Keep the count itself in the model. There will still be collinearity for larger counts, so consider creating a pair of orthogonal variables from $x$ and $\log(x)$ in order to achieve a numerically stable fit.

  • Use splines of $x$ (and/or $\log(x)$) to model the nonlinearity.

  • Ignore the problem altogether. If you have enough data, a large VIF may be inconsequential. Unless your purpose is to obtain precise coefficient estimates (which your willingness to transform suggests is not the case), collinearity scarcely matters anyway.
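As an illustration of the second bullet, here is one sketch of building a numerically stable pair from $x$ and $\log(x)$; residualizing $\log(x)$ against $x$ is just one of several possible orthogonalization schemes:

```python
import numpy as np

def orthogonal_pair(x):
    """Return centered x and the part of log(x) uncorrelated with x.

    Regressing log(x) on x (with an intercept) and keeping the residuals
    yields a second regressor carrying only the nonlinear information.
    """
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    u = np.log(x)
    # OLS fit of log(x) on [1, x]; residuals are orthogonal to both columns.
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, u, rcond=None)
    resid = u - X @ beta
    return xc, resid

counts = np.arange(1, 651)
xc, resid = orthogonal_pair(counts)
print(abs(np.corrcoef(xc, resid)[0, 1]))  # essentially zero
```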

whuber
  • This is the answer I would prefer as it addresses the $\log(x)$ component of the question, which I failed to do below. – Brash Equilibrium Nov 08 '17 at 17:00
  • thank you for the answer, that made it perfectly clear! As a follow-up - I want to show diminishing returns to that variable, and am only aware of the option of introducing a squared term. What would be a more appropriate approach considering the use of a log? – statsnewby Nov 08 '17 at 17:14
  • Any of the four bulleted choices would be a possibility. – whuber Nov 08 '17 at 20:27

The source of the collinearity is that, over the range of the data, $x^2$ is nearly a linear function of $x$. One way to reduce the correlation between $x$ and $x^2$ is to center $x$: let $z=x-E(x)$ and compute $z^2$. Because $z$ takes both negative and positive values, $z^2$ is no longer a nearly linear (monotone) function of $z$, which weakens the correlation between $z$ and $z^2$. This advice comes from The Analysis Factor: http://www.theanalysisfactor.com/centering-for-multicollinearity-between-main-effects-and-interaction-terms/
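A quick numerical sketch of the effect (Python/numpy; the count range 1–650 comes from the question's comments, and $x$ here stands for the log-count the OP is using):

```python
import numpy as np

x = np.log(np.arange(1, 651, dtype=float))  # the log-transformed count
z = x - x.mean()                             # mean-centered version

r_raw = np.corrcoef(x, x ** 2)[0, 1]
r_cen = np.corrcoef(z, z ** 2)[0, 1]
print(f"corr(x, x^2) = {r_raw:.3f}")
print(f"corr(z, z^2) = {r_cen:.3f}")
```

Centering removes much, though not all, of the correlation here, which is consistent with the OP's report below of an improved but still elevated VIF.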

Note: When interpreting the effects, remember that you centered the covariate. Also, some researchers caution against centering because the transformation is data-dependent, so the results of your model depend on the sample. Here is some perspective from Andrew Gelman on that issue: http://andrewgelman.com/2009/07/11/when_to_standar/

Brash Equilibrium
  • Thanks! I have two questions about this approach: Firstly, is the x you are referring to the untransformed x or the ln(x)? Centering ln(x) did not lead to major improvements (VIF of 16). Secondly, do you mean the average of X with E(x), hence mean centering the variable? – statsnewby Nov 08 '17 at 16:21
  • Ah, good point, I forgot that part of your question. I would refer to the answer from @whuber. – Brash Equilibrium Nov 08 '17 at 17:00