
I am running a negative binomial model and one of my predictor variables is a count variable. Since this variable was heavily skewed, I decided to log-transform it.

However, the effect of this variable is hypothesized to be non-linear, so I also include a squared term. As soon as I do, I obtain VIFs above 20 for these two variables, while all other predictors remain stable at VIFs between 1 and 5.

To my current understanding, the relationship between a variable and its square is not linear, and hence multicollinearity should not arise.

Can anyone explain the cause of the multicollinearity and give possible solutions to this problem?
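A minimal reproduction sketch in Python (the count range 1–650 is taken from the comments below; everything else, including the use of natural log, is illustrative — the correlation is unchanged by the log base):

```python
import numpy as np

# Illustrative counts covering 1..650 (the range mentioned in the comments).
x = np.arange(1, 651, dtype=float)
u = np.log(x)      # log-transformed predictor
u2 = u ** 2        # squared term added to the model

# With just these two regressors, VIF = 1 / (1 - r^2),
# where r is their pairwise correlation.
r = np.corrcoef(u, u2)[0, 1]
vif = 1.0 / (1.0 - r ** 2)
print(f"corr = {r:.3f}, VIF = {vif:.1f}")  # large VIF, as in the question
```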

statsnewby
  • Well, f(x) = x^2, which is where the collinearity arises. If you want to reduce the collinearity between x and x^2, I suggest centering x and then squaring the centered covariate. See this post: http://www.theanalysisfactor.com/centering-for-multicollinearity-between-main-effects-and-interaction-terms/ – Brash Equilibrium Nov 08 '17 at 16:01
  • What is the domain of $x$? For very small values $x$ could be considered approximately $x^2$. – Dan Nov 08 '17 at 16:03
  • x is between 1 and 650, but after the log transformation the values are obviously much smaller (between 0 and 2.8) – statsnewby Nov 08 '17 at 16:06
  • It seems that you assume the relationship between a variable and its square (it happens to be the log of $x$ and the square of that log, but that is not so important here) is not a linear one and so they are uncorrelated. Others have explained the error already, but you may be interested in this related thread: [Pearson correlation between a variable and its square](https://stats.stackexchange.com/q/297685/22228). – Silverfish Nov 08 '17 at 22:34

2 Answers


Except for very small counts, $\log(x)^2$ is essentially a linear function of $\log(x)$:

Figure showing plots and linear fits

The colored lines are least squares fits to $\log(x)^2$ vs $\log(x)$ for various ranges of counts $x$. They are extremely good once $x$ exceeds $10$ (and still awfully good even when $x\gt 4$ or so).
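This is easy to check numerically; here is a small sketch using numpy, with lower cutoffs chosen to mirror the ranges in the figure:

```python
import numpy as np

# Correlation between log(x) and log(x)^2 over several count ranges.
# (Natural log is used; the correlation does not depend on the base.)
for lo in (1, 4, 10, 100):
    x = np.arange(lo, 651, dtype=float)
    u = np.log(x)
    r = np.corrcoef(u, u ** 2)[0, 1]
    print(f"x in [{lo:3d}, 650]: corr(log x, (log x)^2) = {r:.4f}")
```

The correlation climbs toward 1 as the lower end of the range moves away from $x = 1$.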

Introducing the square of a variable is sometimes used to test goodness of fit, but (in my experience) it is rarely a good choice as an explanatory variable. To account for a nonlinear response, consider these options:

  • Study the nature of the nonlinearity. Select appropriate variables and/or transformation to capture it.

  • Keep the count itself in the model. There will still be collinearity for larger counts, so consider creating a pair of orthogonal variables from $x$ and $\log(x)$ in order to achieve a numerically stable fit.

  • Use splines of $x$ (and/or $\log(x)$) to model the nonlinearity.

  • Ignore the problem altogether. If you have enough data, a large VIF may be inconsequential. Unless your purpose is to obtain precise coefficient estimates (which your willingness to transform suggests is not the case), collinearity scarcely matters anyway.
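As an illustration of the second bullet, here is one sketch of building a numerically stable pair from $x$ and $\log(x)$; residualizing $\log(x)$ against $x$ is just one of several possible orthogonalization schemes:

```python
import numpy as np

def orthogonal_pair(x):
    """Return centered x and the part of log(x) uncorrelated with x.

    Regressing log(x) on x (with an intercept) and keeping the residuals
    yields a second regressor carrying only the nonlinear information.
    """
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    u = np.log(x)
    # OLS fit of log(x) on [1, x]; residuals are orthogonal to both columns.
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, u, rcond=None)
    resid = u - X @ beta
    return xc, resid

counts = np.arange(1, 651)
xc, resid = orthogonal_pair(counts)
print(abs(np.corrcoef(xc, resid)[0, 1]))  # essentially zero
```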

whuber
  • This is the answer I would prefer as it addresses the $\log(x)$ component of the question, which I failed to do below. – Brash Equilibrium Nov 08 '17 at 17:00
  • thank you for the answer, that made it perfectly clear! As a follow-up - I want to show diminishing returns to that variable, and am only aware of the option of introducing a squared term. What would be a more appropriate approach considering the use of a log? – statsnewby Nov 08 '17 at 17:14
  • Any of the four bulleted choices would be a possibility. – whuber Nov 08 '17 at 20:27

The source of the collinearity is that, over the range of the data, $x^2$ is nearly a linear function of $x$. One way to reduce the correlation between $x$ and $x^2$ is to center $x$: let $z=x-E(x)$ and compute $z^2$. Because $z$ takes both negative and positive values, $z^2$ is no longer a nearly linear (monotone) function of $z$, which weakens the correlation between $z$ and $z^2$. This advice comes from The Analysis Factor: http://www.theanalysisfactor.com/centering-for-multicollinearity-between-main-effects-and-interaction-terms/
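A quick numerical sketch of the effect (Python/numpy; the count range 1–650 comes from the question's comments, and $x$ here stands for the log-count the OP is using):

```python
import numpy as np

x = np.log(np.arange(1, 651, dtype=float))  # the log-transformed count
z = x - x.mean()                             # mean-centered version

r_raw = np.corrcoef(x, x ** 2)[0, 1]
r_cen = np.corrcoef(z, z ** 2)[0, 1]
print(f"corr(x, x^2) = {r_raw:.3f}")
print(f"corr(z, z^2) = {r_cen:.3f}")
```

Centering removes much, though not all, of the correlation here, which is consistent with the OP's report below of an improved but still elevated VIF.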

Note: When interpreting the effects, remember that you centered the covariate. Also, some researchers caution against centering because the transformation is data-dependent, so the results of your model depend on the sample. Here is some perspective from Andrew Gelman on that issue: http://andrewgelman.com/2009/07/11/when_to_standar/

Brash Equilibrium
  • Thanks! I have two questions about this approach: Firstly, is the x you are referring to the untransformed x or the ln(x)? Centering ln(x) did not lead to major improvements (VIF of 16). Secondly, do you mean the average of X with E(x), hence mean centering the variable? – statsnewby Nov 08 '17 at 16:21
  • Ah, good point, I forgot that part of your question. I would refer to the answer from @whuber. – Brash Equilibrium Nov 08 '17 at 17:00