Abusing Linear Models under Multicollinearity: Simulation for 'realistic' movement of predictors

Question

I have a reasonable understanding of why multicollinearity is a problem in regression models, along the lines of this excellent post.

To summarise my understanding, for a regression model of $y = \alpha + \beta_1x + \beta_2z$ (where $x$ and $z$ are correlated), beta coefficient estimates (as well as being unstable) are difficult to interpret, as a situation where you might increase $z$ without increasing $x$ is unlikely to occur, and not supported by the data.

I understand multicollinearity is less harmful to purely predictive as opposed to explanatory or descriptive models.

I'm interested in another interpretation:

If I decided to increase $z$, and let $x$ vary as it pleases in reaction, what would I see happen to $y$, accounting for the fact that $x$ is likely to move with $z$, and also has its own effect?

In other words, accepting the causal interpretation that $x$ and $z$ both cause $y$, and are themselves correlated to some extent (0.7, say), how would all three variables move if $z$ is (linearly) increased by some amount?

I've tried to model this sort of thing before, fitting $y = \alpha + \beta_1x + \beta_2z$ (model 1), and $x = \alpha + \beta_1z$ (model 2, with its own intercept and slope, distinct from those of model 1). Hypothetical increased $z$ values are produced, and the resulting $x$ values are predicted with model 2. The hypothetical $x$ and $z$ values are then used to predict $y$ using model 1. However, this feels very unsatisfactory: complicated simulations are required to capture uncertainty (I used sim in the arm package). Additionally, my gut tells me that apart from being painfully inelegant, it's a bad idea for other reasons I can't put my finger on.
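To make the two-model procedure concrete, here is a minimal sketch on simulated data. It is written in plain Python only so that it is self-contained, and everything in it is an assumption for illustration: the true coefficients (1, 2, 3), the noise levels, the sample size, and the helper names `cross`, `fit_simple` and `fit_two`. Only the 0.7 correlation comes from the description above. The slope of model 2 is called `g1` to keep it distinct from model 1's coefficients.

```python
import random


def mean(u):
    return sum(u) / len(u)


def cross(u, v):
    """Centered cross-product sum of two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def fit_simple(pred, resp):
    """OLS of resp on one predictor; returns (intercept, slope)."""
    slope = cross(pred, resp) / cross(pred, pred)
    return mean(resp) - slope * mean(pred), slope


def fit_two(x, z, y):
    """OLS of y on x and z via the centered normal equations.

    Returns (intercept, coefficient on x, coefficient on z).
    """
    sxx, szz, sxz = cross(x, x), cross(z, z), cross(x, z)
    sxy, szy = cross(x, y), cross(z, y)
    det = sxx * szz - sxz ** 2
    b1 = (szz * sxy - sxz * szy) / det
    b2 = (sxx * szy - sxz * sxy) / det
    return mean(y) - b1 * mean(x) - b2 * mean(z), b1, b2


# Simulated data (assumed values): corr(x, z) is about 0.7, y = 1 + 2x + 3z + noise.
random.seed(1)
n = 2000
z = [random.gauss(0, 1) for _ in range(n)]
x = [0.7 * zi + (1 - 0.7 ** 2) ** 0.5 * random.gauss(0, 1) for zi in z]
y = [1 + 2 * xi + 3 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

a, b1, b2 = fit_two(x, z, y)   # model 1: y ~ x + z
g0, g1 = fit_simple(z, x)      # model 2: x ~ z

# Raise z by delta, let x follow via model 2, then predict the change in y via model 1.
delta = 1.0
dx = g1 * delta
dy = b1 * dx + b2 * delta

# Implied total effect of z on y, compared with the slope from y ~ z alone.
total_effect = b2 + b1 * g1
_, slope_y_on_z = fit_simple(z, y)

print("change in x:", dx)
print("change in y:", dy)
print("total effect:", total_effect, "slope of y ~ z:", slope_y_on_z)
```

With ordinary least squares and purely linear models, the two printed effect estimates agree up to floating-point error. One possible way to attach uncertainty without a dedicated package would be to resample rows with replacement and refit both models on each resample, though this sketch does not do that.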

Is such an 'observational'/conditional-when-I-feel-like-it interpretation possible?
Does anyone know of a better method for this interpretation?
Can anyone recommend a paper or R package along these lines?
Is the above multi-model mess at all valid?

I'm aware that a model along the lines of $y = \alpha + \beta_1z$ would yield a similar answer to the two-stage mess above, but would lose information in $x$.
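The link between the two approaches can be written out as a short derivation. This is a sketch that assumes both models are linear and fitted by ordinary least squares; the symbols $\gamma_0$ and $\gamma_1$ are introduced here for the intercept and slope of the $x$-on-$z$ model, to keep them distinct from model 1's coefficients. Substituting the second model into the first gives

$$
y = \alpha + \beta_1(\gamma_0 + \gamma_1 z) + \beta_2 z = (\alpha + \beta_1\gamma_0) + (\beta_2 + \beta_1\gamma_1)\,z ,
$$

so a unit increase in $z$, with $x$ free to follow, moves the predicted $y$ by $\beta_2 + \beta_1\gamma_1$. Under those assumptions the fitted slope from regressing $y$ on $z$ alone equals $\hat\beta_2 + \hat\beta_1\hat\gamma_1$ exactly in-sample, so the point estimates coincide. The equality need not carry over once a non-linear link function is involved.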

I understand that these ideas are similar to structural equation modelling, but apart from having scant knowledge of SEM, I'm yet to find an R package which allows flexibly extending these models with different link functions for proportional odds models, etc.

score 1 · Answer 1 · answered Jun 11 '15 at 21:54

Try lavaan. It's an R package that is supposedly being developed to handle link functions as well.

The problem with your question is the lack of a stated purpose. Statistical modeling is very difficult to translate and interpret when you are dealing with abstract variables and hypotheses.

X and Z are correlated. If there is large variation in either, you're bound to have a poor model when there is multicollinearity. The information from one is confounded by the other since they "move together".

On the other hand, if you're dealing with variables that are relatively reliable in their measurement, and you have an ample sample, it's worth keeping both since the correlation is not as high as, say, 0.85-0.95.

Lastly, if the goal is accurate prediction, keep them both. If the goal is statistical validity, use your fit statistics and use Wald tests, LR tests, AIC, BIC... etc. I'd also suggest writing code from scratch to ensure you really understand what you're doing. Packages are for the non-academics. If you want valid answers, you need to have a firm grip on everything happening "under the hood".

And it is usually true: the ends justify the means.
