
If all of the coefficients in my logistic model have t-statistics showing high significance, but two coefficients have high VIFs of about 13-14 (with a sample size of 11,000 for each independent variable), can I ignore the multicollinearity, given the way I propose to use the results?

I have a logistic regression model with 6 independent variables, where each independent variable and the dependent variable have the same sample size of 11,000.

From this logistic regression I produce 6 predictions by changing the values of the independent variables. For prediction 1, I increase independent variable 1 by two units while all other independent variables are increased by only one unit. For prediction 2, I increase independent variable 2 by two units while all other independent variables are increased by only one unit, and so on.

This makes a total of 6 different $y$ values, which I need to use for my own purposes as below.

 $y_1 = a + 2b_1 + b_2 + b_3 + b_4 + b_5 + b_6 + \text{error}$
 $y_2 = a + b_1 + 2b_2 + b_3 + b_4 + b_5 + b_6 + \text{error}$
 $\;\;\vdots$
 $y_6 = a + b_1 + b_2 + b_3 + b_4 + b_5 + 2b_6 + \text{error}$

I do this for each independent variable in turn, so that, as shown above, a different independent variable receives the two-unit increase each time, producing a total of 6 different predicted $y$ values.

So I have 6 independent variables and 6 different predicted $y$ values, one for each independent variable changed separately. All intercepts and coefficients are highly significant, at the 0.01 level or below.

My main objective is then to use the numerical values of these 6 different predicted $y$ values as inputs to a separate function that produces a numerical output, a "utility value."

This "utility value" is the one I need. I want to show how this final utility value from a different model differs from that produced by this logistic regression model, as a function of how $y$ changes with different emphasis on independent-variable increases up to two units.

  • This question reads remarkably like your other two questions at http://stats.stackexchange.com/questions/220189 and http://stats.stackexchange.com/questions/220214, both of which ask exactly the same thing: "can I ignore the multicollinearity." Could you explain how the answers you got do not address the situation? – whuber Jun 24 '16 at 17:46
  • Yes, because I know that the multicollinearity problem can be ignored if I am only interested in using the $y$ value from my regression. However, even though I explained exactly what I was doing, the answers did not come down clearly on either side. May I kindly ask you about this again, then? – Eric Jun 24 '16 at 18:04
  • The sources supporting my argument that I can ignore multicollinearity if I am only interested in using the $y$ value are as follows: http://www.stat.tamu.edu/~hart/652/collinear.pdf – Eric Jun 24 '16 at 18:05
  • http://www.public.iastate.edu/~alicia/stat328/Model%20diagnostics.pdf – Eric Jun 24 '16 at 18:06
  • http://econweb.ucsd.edu/~rramanat/ec120c/spring98/ch5sum.htm – Eric Jun 24 '16 at 18:06
  • If they are not correct, then please point that out in the previous threads rather than starting a new thread! In the present instance, it is difficult to understand what you are trying to do. This leads me to suspect that the problems with the previous threads could stem from a lack of clarity in the *question*. – whuber Jun 24 '16 at 18:14
  • Then you're not here to help. You are one of the admins here to clear up the messes. – Eric Jun 24 '16 at 18:16
  • When you say that you have "sample size at least more than 2500 up to 11000 for each independent variable," it sounds like your regression coefficients $b_i$ are based on individual regressions of $y$ versus each of the predictor variables, rather than on a multiple regression. If so, then that modeling approach is much more of a problem than the collinearity among the predictors. Also, it's not clear that setting each of the predictors to a value of 1, except for one value set to 2, will accomplish what you want, except under some particular circumstances. – EdM Jun 24 '16 at 19:01
  • Oh no, you misunderstood. I have a total of 6 different logistic regressions. Each regression has an equal number of data points for each independent variable, but the total number of data points differs from regression to regression. – Eric Jun 24 '16 at 19:03
  • So for one regression model each independent variable has an equal number of data points, say 6,000; for another regression model, 10,000; and so on. – Eric Jun 24 '16 at 19:04
  • Please take a break, calm down, and edit your question (ideally, your original question) to say more about the whole problem that you are trying to address, not just this particular aspect of your present solution as it relates to collinearity. That will make it easier for those who take the time to try to answer questions on this site to provide you with useful help. What general question are you trying to answer? Why do you have 12 "different" logistic regressions? What makes them different? What is the nature of the procedure into which you will be placing your values of y? – EdM Jun 24 '16 at 19:04
  • I simplified my question to make it easier to understand. So I have 6 logistic regression models, each with 6 independent variables, where each independent variable and the dependent variable have the same sample size of 11,000. – Eric Jun 24 '16 at 19:13
  • For each logistic regression, I increase the units of the independent variables differently. For logistic regression 1, I increase independent variable 1 by two units while all other independent variables are increased by only one unit. For logistic regression 2, I increase independent variable 2 by two units while all other independent variables are increased by only one unit, and so on. This makes a total of 6 different $y$ values which I need to use for my own purpose. – Eric Jun 24 '16 at 19:13
  • What you're doing doesn't make any sense to me. What is your ultimate "purpose"? I suspect there will be a better way of achieving it than what you're doing. Moreover, what are your variables & your data? What is the larger situation? Whether you can "ignore multicollinearity" depends on that information. – gung - Reinstate Monica Jun 24 '16 at 19:28
  • I am using the $y$ value as my input in another function. – Eric Jun 24 '16 at 19:29
  • I am not using any coefficients or anything, just the different $y$ values resulting from the different combinations of independent-variable unit increases. That's it. – Eric Jun 24 '16 at 19:39
  • You understand correctly, Eric: I'm trying to prevent a mess from forming and spreading. I'm here to help you express your problem in a way that other people will understand as you intended. If that doesn't happen, then you might collect a set of answers that potentially mislead you and future readers (which is what you seem to have noticed already). So please keep working with the kind people like @EdM and gung who are patiently trying to find out what your problem is. If your questions get "put on hold" in the meantime, then don't take it the wrong way--that's just part of the process. – whuber Jun 24 '16 at 19:57
  • Is the following what you are trying to do? First, run a multiple logistic regression with a binary outcome variable and 6 predictor variables. Second, use the coefficients from that logistic regression to make 6 predictions of new $y$ values; each prediction is based on setting values of each of the 6 predictors to 1, except for one predictor whose value is set to 2. Third, use those 6 predicted $y$ values as input to some further function. If so, answering your question will require more information on what you are trying to accomplish with that further, yet unspecified, function. – EdM Jun 24 '16 at 20:22
  • Yes, that's correct. I am using those six $y$ values, interpreted as probabilities. I feed these probability values into a specific function to derive expected utility values, which will differ with the different emphasis placed on the independent variables in each of the 6 cases. – Eric Jun 24 '16 at 20:25
  • I have edited your question to incorporate critical issues that came up in the comments. If I have misinterpreted some of your comments feel free to re-edit. – EdM Jun 24 '16 at 21:52

1 Answer


In this case it seems that multicollinearity might substantially affect the way you will be able to interpret your ultimate results, the "utility values" determined from the predicted probabilities via your utility function. I would recommend that you look for a different approach to the overall problem you are trying to address.

This looks like an attempt to assess the relative importance of your independent variables (IVs) through their different effects on your utility values when you set the value of one IV to twice the value of the other 5, repeating this process for all 6 IVs. In general, assessing variable importance can be difficult, as discussed for example on this page. This particular way of evaluating your 6 IVs has some substantial problems.

In response to an earlier question on this matter, @DJohnson said:

Collinearity is a problem only if the model needs to enforce an assumption of the independence of predictors. It's known that the model coefficients usually aren't affected by collinearity, whereas std errors and t-values are ... Examples of where IV independence can be important include pricing models where the analyst needs to know what the impact of price changes are clear and free of the influence of the other predictors in the model.

In your use of this logistic regression model, it seems that you do need to "enforce an assumption of the independence of predictors" in order to make reliable interpretations of the utility values returned by the 6 predictions, each based on a change in a single predictor variable, that you propose.

If two IVs are highly correlated, it is unrealistic to expect that in practice you would ever find the value of one IV doubling while its correlated partner stays unchanged. If your multicollinearity comes from more complicated relations among your IVs, then doubling the value of only one IV at a time is even more unrealistic. You do not know that the "impact" of any one IV is "clear and free of the influence of the other predictors in the model." In fact, you know the opposite: some of your IVs are highly correlated.
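
As a concrete check on this point, VIFs like the 13-14 reported in the question can be computed directly. A minimal sketch, assuming simulated stand-in data with one deliberately collinear pair of IVs:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(1)
    n = 11000
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.3, size=n)   # deliberately collinear with x1
    X = sm.add_constant(np.column_stack([x1, x2, rng.normal(size=(n, 4))]))

    # VIF for each predictor; column 0 is the intercept, so skip it.
    vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
    print(np.round(vifs, 1))   # x1 and x2 show VIFs near 12; the rest near 1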

The way that you are constructing the 6 predictions from your logistic regression model is something like extrapolating regression results beyond the bounds of the data: you are asking it to make predictions for circumstances unlike those from which the model was built, in terms of the combinations of IV values. So even if you could "ignore" multicollinearity for some purposes, like making predictions based on new real-world cases, it would seem unwise to ignore it here, given the way that you propose to use your model.
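
One way to make this extrapolation visible is to measure how far each "one IV doubled" scenario sits from the cloud of observed IV combinations, for example by Mahalanobis distance. A sketch under the same simulated-data assumption:

    import numpy as np

    rng = np.random.default_rng(2)
    n, k = 11000, 6
    X = rng.normal(loc=1.0, size=(n, k))               # stand-in IVs near 1
    X[:, 1] = X[:, 0] + rng.normal(scale=0.3, size=n)  # collinear pair

    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

    scenarios = np.where(np.eye(k, dtype=bool), 2.0, 1.0)
    for s in scenarios:
        d = np.sqrt((s - mu) @ cov_inv @ (s - mu))
        # Scenarios that split the collinear pair (doubling only one of the
        # two correlated IVs) show much larger distances than the others.
        print(round(d, 2))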

Furthermore, your proposed use of the logistic regression coefficients does not seem to take into account any of the uncertainties in the regression coefficient estimates, and it's not clear whether you are using techniques like cross validation or bootstrapping to evaluate the quality of your model building or its dependence on the peculiarities of your present data sample. There are probably better ways to accomplish your ultimate goal, if it is in some way related to evaluating the relative importance of your 6 IVs.
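
For example, a simple bootstrap along the following lines (again with simulated stand-in data) would show how much the 6 predicted probabilities, and hence the utility values computed from them, vary from resample to resample:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n, k = 11000, 6
    X = rng.normal(size=(n, k))
    y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 0.2 * X.sum(axis=1)))))

    scenarios = sm.add_constant(np.where(np.eye(k, dtype=bool), 2.0, 1.0),
                                has_constant='add')

    boot_preds = []
    for _ in range(200):                     # 200 bootstrap resamples
        idx = rng.integers(0, n, size=n)     # resample rows with replacement
        fit = sm.Logit(y[idx], sm.add_constant(X[idx])).fit(disp=0)
        boot_preds.append(fit.predict(scenarios))

    boot_preds = np.array(boot_preds)
    print(boot_preds.mean(axis=0))   # bootstrap mean of each prediction
    print(boot_preds.std(axis=0))    # spread: uncertainty in each predicted y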

EdM
  • Maybe my last question: I think multicollinearity between an independent variable and the dependent variable is fine. Is this correct? – Eric Jun 25 '16 at 08:52
  • An independent variable that helps predict, linearly, a dependent variable will be correlated with the dependent variable. So that's desired. Technically that is not "multicollinearity," a term properly reserved for relations among independent variables. – EdM Jun 25 '16 at 11:42