
Please see the model below (link to bigger image). The independent variables are properties of 2500 companies from 32 countries, trying to explain companies' CSR (corporate social responsibility) score.

I am worried about the VIF scores of the LAW_, GCI_ and HOF_ variables in particular, but I really need them all included in the model to connect it to the theoretical background the model is built upon. All variables are discrete numeric values, except the LAW_ variables: they are dummies for which legal system applies in the company's country of origin (English, French, German, or Scandinavian).

[Image: regression model output]

Amongst other articles, I have read this article about dealing with collinearity. An often-suggested tip is to remove the variable with the highest VIF score (in my model this would be LAW_ENG), but then other VIF scores increase as a result. I do not have the proper knowledge to see through what is going on here or how to solve this problem.
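(For anyone who wants to see the mechanics behind the scores: the VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. A minimal pure-Python sketch with made-up toy data — not the SPSS file above — for the two-predictor case, where R² reduces to the squared Pearson correlation:)

```python
# Toy illustration of VIF: with exactly two predictors, the R^2 of
# regressing one on the other is just their squared correlation.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    r2 = pearson_r(x1, x2) ** 2
    # Blows up as r2 -> 1 (perfect collinearity).
    return 1.0 / (1.0 - r2)

# Two nearly collinear predictors: x2 is x1 plus a little noise.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 2.0, 3.2, 3.9, 5.1, 6.0]
print(vif_two_predictors(x1, x2))  # large, well above the common rule-of-thumb cutoff of 10
```

(This also shows why dropping one variable shifts the VIFs of the rest: the shared variance the dropped variable carried gets picked up by whatever remains correlated with it.)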

I have uploaded the corresponding data here (in SPSS .sav format). I would really appreciate somebody with more experience having a quick look and telling me a way to solve the collinearity problem without taking out any (or too many) variables.

Any help is greatly appreciated.

P.S. For reference, I am including a correlation table (link to bigger image):

[Image: correlation table]

Pr0no
  • Despite the collinearity, is there a reason your results are not interpretable as is? Take a look at this too perhaps: http://psycnet.apa.org/psycinfo/2008-08581-005 – Behacad Apr 18 '13 at 22:29
  • Thanks, this takes away some of my initial stress :-) From the article however: "At times, however, it may be reasonable to eliminate or combine highly correlated independent variables, but doing this should be theoretically motivated." The motivation here could be that `GCI_505` and `GCI_701` are measuring essentially the same thing. Can I then replace them by `NEW_VAR = GCI_505 + GCI_701`? – Pr0no Apr 18 '13 at 23:06

1 Answer


When variables are collinear, you can sometimes think of them as different manifestations of the same thing. Say I had a dataset of cats' happiness, a variable for whether they were soaking wet, and a variable for whether or not there were nearby children who thought it was fun to throw cats into water. Clearly cats don't like water, yet sometimes they will fall into it on their own. More often, however, they are thrown in by malevolent children. Sometimes, however, malevolent children fail to throw cats in the water.

So, wet cats and malevolent children are different, but can be thought of as a unitary dynamic. If a researcher was only interested in the effect of wetness on cat happiness, and didn't control for malevolent children, the estimates would be biased. Include them, and VIF goes up. This is because you simply don't have enough independent observations of wetness to know its effect apart from the effect of malevolent children.

Shrinkage estimators are one way to go. Basically, you increase the bias of your estimator in order to decrease its variance. Appealing for prediction, but not for inference.
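(To make the shrinkage idea concrete, here is a minimal sketch — toy data, not the asker's — of the closed-form ridge estimator, beta = (X'X + lambda*I)^{-1} X'y, for two nearly collinear predictors; the 2×2 system is solved by hand so nothing beyond the standard library is assumed:)

```python
# Ridge regression for two predictors (no intercept), solved in
# closed form: beta = (X'X + lam*I)^{-1} X'y.  lam = 0 gives OLS.
def ridge_2pred(x1, x2, y, lam):
    a = sum(v * v for v in x1) + lam                   # X'X[0,0] + lam
    b = sum(u * v for u, v in zip(x1, x2))             # X'X[0,1]
    d = sum(v * v for v in x2) + lam                   # X'X[1,1] + lam
    g1 = sum(u * v for u, v in zip(x1, y))             # X'y[0]
    g2 = sum(u * v for u, v in zip(x2, y))             # X'y[1]
    det = a * d - b * b
    return ((d * g1 - b * g2) / det, (a * g2 - b * g1) / det)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.1, 4.0, 5.1]   # nearly a copy of x1
y  = [2.0, 4.1, 6.0, 8.1, 9.9]   # roughly 2 * x1

ols   = ridge_2pred(x1, x2, y, 0.0)   # wild, offsetting coefficients
ridge = ridge_2pred(x1, x2, y, 1.0)   # shrunken, more evenly split
print(ols, ridge)
```

(With these toy numbers, OLS splits the effect erratically between the two near-duplicates, while the ridge estimates are smaller in norm and divide the shared effect more evenly — the bias-for-variance trade mentioned above.)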

If you're willing to put aside (or think differently about) inference on individual model terms, you could first do a principal components analysis, "interpret" your principal components somehow, and then fit your regression to the rotated dataset. Collinearity will be gone, but you're only able to conduct inference on these PCs, which may or may not have a convenient interpretation. In the case of wet cats and malevolent children, the first PC would increase as the probability of wetness got higher and as the probability of malevolent children increased. The other PC would be perpendicular to it, relating to wetness as the probability of malevolent children decreased. If you simply wanted to know the effect of wetness absent malevolent children, you'd be interested in the coefficient on the second PC. Most PC regressions don't have an interpretation this straightforward, however.
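(A bare-bones sketch of the principal-components route, again on invented toy data: the 2×2 covariance matrix is eigendecomposed by hand, and y is then regressed on the first PC's scores.)

```python
# Principal-components regression for two correlated predictors.
import math

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

x1 = center([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = center([1.2, 1.9, 3.1, 4.1, 4.9])   # nearly tracks x1
y  = center([1.9, 4.2, 5.9, 8.0, 10.1])

n = len(x1)
a = sum(v * v for v in x1) / (n - 1)                  # var(x1)
b = sum(u * v for u, v in zip(x1, x2)) / (n - 1)      # cov(x1, x2)
d = sum(v * v for v in x2) / (n - 1)                  # var(x2)

# Eigenvalues of [[a, b], [b, d]]; the larger one belongs to PC1.
disc = math.sqrt((a - d) ** 2 + 4 * b * b)
lam1 = (a + d + disc) / 2

# Unit eigenvector for lam1: (b, lam1 - a), normalized.
v1, v2 = b, lam1 - a
norm = math.hypot(v1, v2)
v1, v2 = v1 / norm, v2 / norm

# PC1 scores and a simple regression of y on PC1.
pc1 = [v1 * u + v2 * w for u, w in zip(x1, x2)]
beta_pc1 = sum(p * t for p, t in zip(pc1, y)) / sum(p * p for p in pc1)
print(lam1 / (a + d), beta_pc1)
```

(With near-collinear predictors, PC1 soaks up almost all the variance and PC2 keeps only the sliver along which they move apart — which is exactly why the data contain so little information about their separate effects.)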

It is also worth emphasizing that prediction from a model with high collinearity is fine. So if your F-stat is good and you don't care about any of the coefficients individually, leave the model as it is.

Scortchi - Reinstate Monica
generic_user
  • Thank you for a very comprehensive answer! Quote: "It is also worth emphasizing that prediction from a model with high collinearity is fine. So if your F-stat is good and you don't care about any of the coefficients individually, leave the model as it is." I care about the individual coefficients in the sense that I want to draw conclusions like "stronger shareholder rights (measured by `LAW_SR`) tend to decrease a company's CSR performance". I'm not really interested in the beta, only in the direction of the coefficient and its significance. With the F-stat being significant at 1%, can I do this? – Pr0no Apr 18 '13 at 22:53
  • No you can't, unfortunately. Read up on the F-stat. It is a measure of how different the whole model is from a null model. Your problem is that `LAW_SR` always moves together with other predictors of CSR. How are you able to say what causes what? They all move together. If you want to know what causes what, you need to see them move separately. – generic_user Apr 18 '13 at 22:55
  • "Your problem is that LAW_SR always moves together with other predictors of CSR" How are you able to tell this just by looking at the model? I know F-stat = mean square regression / mean square residual. Do you mean that the significantly large difference between the two mean squares rejects the H0 hypothesis of CSR being described by these independent variables? And if so, could you please elaborate on "you need to see them move separately"? Do I perform a regression for each independent variable on the DV? And what am I then looking for? A low, significant F-statistic? – Pr0no Apr 18 '13 at 23:11
  • I can't really see the picture you posted, so I don't know. Go back to the cat example. If you only see cats that are wet when there are malevolent children around, you have no idea if they are unhappy because they are wet or unhappy because there are malevolent children around. Are there variables that are either jointly or singly correlated with `LAW_SR`? (If not, you're done: null result.) If so, you can't separate the effect of `LAW_SR` from things it is correlated with. And yes, you have the correct technical definition of the F-stat, but you're missing the intuition/purpose. – generic_user Apr 18 '13 at 23:24
  • One thing you COULD do is to identify which factors are most highly correlated with `LAW_SR`. Trivially, let's just say that Babylonian legal heritage and incidence of vegetarianism were highly correlated with `LAW_SR`. You could do an F-test for whether those variables are jointly significant predictors of CSR. But you still can't make an empirical argument that it's `LAW_SR` causing the CSR. You could appeal to theory, maybe? Or maybe not, because Babylonians and vegetarians are likely to affect CSR. – generic_user Apr 18 '13 at 23:31
  • I have added the correlation table and links to the larger images. Most of the IVs are correlated with `LAW_SE`. But then again, there are _many_ significant correlations between all of the IVs; in fact, _most_ IVs are seriously correlated with one another. If I understand you correctly, this makes it very hard to distinguish between the separate effects, yes? Yet again, this model is the outcome of a theoretical debate. A higher `LAW_SR` theoretically should decrease `CSR`, as it does in the model, but other effects, such as the negative effect of `HOF_LTO`, cannot reasonably be explained. – Pr0no Apr 18 '13 at 23:40
  • So I think my question would be ... how to go forth from here, seeing all the intertwined effects? – Pr0no Apr 18 '13 at 23:42
  • That I can't tell you. Try to think of collinearity as a feature of the world, rather than a bug in your model. – generic_user Apr 18 '13 at 23:47
  • That sure has a nice ring to it, but you say that I cannot draw any conclusions based on this model. But I still need to graduate ;-) so what would your advice be? In the actual dataset, I have many more variables, but theoretically they have nothing to do with CSR performance. I would like not to have to conclude "the theoretical background I just gave you seems to make sense but I have no statistical evidence to back it up whatsoever" ;-) – Pr0no Apr 18 '13 at 23:55
  • If a tree falls in the forest with nobody there to hear it, does it make any sound? I don't mean to mock you, but some questions are just really hard to answer because the data to answer them is never observed. Maybe drop some of the co-linear variables if you don't need them? – generic_user Apr 19 '13 at 00:16