
I'd like to get statistical support for hypotheses concerning the effects of the independent variables $D,CF,CT,P,H,LA,LP$ on $C$. For the regressor sets $\{D,CF,CT,P,LA,LP\}$ and $\{D,CF,CT,H,LA,LP\}$, all the independent variables are statistically significant:

Call:
clogit(C ~ D + CF + CT + P + LA + LP + strata(Case), clog_data)

                       coef exp(coef)  se(coef)      z       p
D                 -5.91e-03  9.94e-01  7.21e-05 -81.94 < 2e-16
CF                 3.78e-03  1.00e+00  8.10e-04   4.67 3.0e-06
CT                 2.60e-04  1.00e+00  5.18e-05   5.02 5.2e-07
P                  6.07e-01  1.84e+00  3.32e-02  18.32 < 2e-16
LA                 8.71e-02  1.09e+00  2.41e-02   3.62 0.00029
LP                -8.93e-02  9.15e-01  1.83e-02  -4.89 1.0e-06

Likelihood ratio test=11528  on 6 df, p=0
n= 71403, number of events= 12339 

Call:
clogit(C ~ D + CF + CT + H + LA + LP + strata(Case), clog_data)

                       coef exp(coef)  se(coef)      z       p
D                 -5.80e-03  9.94e-01  7.15e-05 -81.15 < 2e-16
CF                 7.87e-03  1.01e+00  7.73e-04  10.17 < 2e-16
CT                 3.13e-04  1.00e+00  5.10e-05   6.13 8.9e-10
H                 -9.47e-02  9.10e-01  2.01e-02  -4.71 2.5e-06
LA                 9.49e-02  1.10e+00  2.39e-02   3.97 7.2e-05
LP                -5.75e-02  9.44e-01  1.81e-02  -3.18  0.0015

Likelihood ratio test=11206  on 6 df, p=0
n= 71403, number of events= 12339 

For the full set of regressors $\{D,CF,CT,P,H,LA,LP\}$, one of the independent variables ($H$) is not statistically significant:

Call:
clogit(C ~ D + CF + CT + P + H + LA + LP + strata(Case), clog_data)

                       coef exp(coef)  se(coef)      z       p
D                 -5.90e-03  9.94e-01  7.24e-05 -81.53 < 2e-16
CF                 3.73e-03  1.00e+00  8.16e-04   4.57 4.8e-06
CT                 2.61e-04  1.00e+00  5.18e-05   5.03 4.9e-07
P                  6.03e-01  1.83e+00  3.40e-02  17.74 < 2e-16
H                 -1.10e-02  9.89e-01  2.08e-02  -0.53 0.59625
LA                 8.68e-02  1.09e+00  2.41e-02   3.61 0.00031
LP                -8.95e-02  9.14e-01  1.83e-02  -4.89 9.9e-07

Likelihood ratio test=11528  on 7 df, p=0
n= 71403, number of events= 12339 
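Since the six-regressor models are nested in the full model, a likelihood-ratio test between the fits is a cleaner way to ask whether $H$ adds anything beyond the other regressors than eyeballing its Wald $z$ alone. A minimal sketch, assuming `clog_data` as above:

```r
## Fit the nested and full models, then compare them; anova() on two
## nested coxph/clogit fits reports the LR chi-square for the term
## added in the larger model (here, H).
library(survival)

fit_noH  <- clogit(C ~ D + CF + CT + P + LA + LP + strata(Case), clog_data)
fit_full <- clogit(C ~ D + CF + CT + P + H + LA + LP + strata(Case), clog_data)

anova(fit_noH, fit_full)
```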

Are my hypotheses statistically supported?

Update 1: Correlation matrix

            Case     D     CF    CT    H     P     LP    LA     C
  Case      1.00    -0.01  0.00 -0.01 -0.01 -0.01  0.00 -0.01   0.00
  D        -0.01     1.00 -0.05  0.05  0.07  0.12  0.06  0.04  -0.35
  CF        0.00    -0.05  1.00 -0.18 -0.28  0.40  0.19  0.17   0.08
  CT       -0.01     0.05 -0.18  1.00  0.07 -0.03 -0.07 -0.04   0.00
  H        -0.01     0.07 -0.28  0.07  1.00 -0.31 -0.12 -0.12  -0.06
  P        -0.01     0.12  0.40 -0.03 -0.31  1.00  0.24  0.21   0.06
  LP        0.00     0.06  0.19 -0.07 -0.12  0.24  1.00  0.81   0.00
  LA       -0.01     0.04  0.17 -0.04 -0.12  0.21  0.81  1.00   0.01
  C         0.00    -0.35  0.08  0.00 -0.06  0.06  0.00  0.01   1.00
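(For reference, a matrix like this can be produced with `cor()`; the sketch below assumes the columns of `clog_data` carry the names shown above and are all numeric.)

```r
## Pairwise Pearson correlations, rounded to two decimals.
round(cor(clog_data[, c("Case", "D", "CF", "CT", "H", "P", "LP", "LA", "C")]), 2)
```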

Update 2: VIF

                      GVIF Df GVIF^(1/(2*Df))
D                 1.027972  1        1.013890
CF                1.210073  1        1.100033
CT                1.049540  1        1.024471
P                 1.245100  1        1.115841
H                 1.124559  1        1.060452
LA                2.684257  1        1.638370
LP                2.749423  1        1.658138
strata(Case)      3.470383  0             Inf
Warning message:
In vif.default(clogit(C ~ D + CF + CT + P + H +  :
No intercept: vifs may not be sensible.
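(Output like this comes from `car::vif()` applied to the full fit, as in the sketch below; the warning appears because a `clogit`/`coxph` model has no intercept.)

```r
## Generalized VIFs for the full model; vif() falls back to
## vif.default() for a clogit fit, hence the no-intercept warning.
library(car)
library(survival)

vif(clogit(C ~ D + CF + CT + P + H + LA + LP + strata(Case), clog_data))
```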
8k14
  • What are your variables? In particular, what are H and P? It looks like they might be collinear. – Peter Flom Nov 02 '17 at 12:42
  • Thanks. All independent variables are continuous factors for the choice $C$, which is binomial. The correlation between $H$ and $P$ is -0.31 – 8k14 Nov 02 '17 at 13:12
  • The problem is not limited to $H$ and $P$, as evidenced by the substantial changes in the estimates of $CF$ and $LP$ as $H$ and $P$ are added. – whuber Nov 02 '17 at 13:17
  • You didn't answer my question. Also, collinearity is not the same as correlation. – Peter Flom Nov 02 '17 at 21:02
  • @Peter Flom I'm sorry. What would you like to know about $H$ and $P$? – 8k14 Nov 03 '17 at 05:35
  • I want to know what all your variables are, what your research question is, what you are trying to find out. Like, is H blood pressure? Age? or what? And similarly for the other variables. – Peter Flom Nov 03 '17 at 11:43
  • @Peter Flom Thank you for your attention to my question. $C$ is a choice (1/0) made by customers depending on some factors. I'd like to check which factors affect the customer's choice. May I ask why the nature of the variables and my research question matters here? – 8k14 Nov 03 '17 at 12:04
  • They always matter. It's hard to say why they matter in your specific case, because we don't know what they are. But solving statistical problems without context is like boxing while blindfolded. You might knock your opponent out, or you might bash your head on the ring post. – Peter Flom Nov 03 '17 at 12:18

2 Answers


With 12000+ events you are far from overfitting, so using the full set of regressors allows you to minimize potential problems from omitted-variable bias. In that full model, variable H does not pass the standard $p < 0.05$ test of statistical significance, while all the others do (even if you feel compelled to correct for multiple comparisons for the 7 coefficients). That's the model to focus on.

The results from the models that omit one of H and P aren't surprising, given the correlation matrix and the coefficients found for those two variables. They have a reasonably large negative correlation and regression coefficients of opposite signs. When you omit P from the regression, H picks up some of the influence of P even if its direct relation to the outcome is minimal. That's consistent with omitted-variable bias in the regression omitting P. The simplest explanation is that H isn't closely related to the outcome; it only appears to be if you ignore P, which is related both to the outcome and to H.
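As an illustration of that mechanism, here is a hedged sketch with simulated data, using ordinary `glm()` logistic regression for simplicity rather than the conditional version; all names and numbers are made up:

```r
## Simulate an outcome that depends on P only, with H merely
## correlated (negatively) with P.
set.seed(1)
n <- 10000
P <- rnorm(n)
H <- -0.5 * P + rnorm(n)            # corr(H, P) is negative
y <- rbinom(n, 1, plogis(0.6 * P))  # outcome driven by P alone

## Omitting P: H "borrows" P's effect and looks highly significant.
summary(glm(y ~ H, family = binomial))$coefficients["H", ]

## Including P: H's apparent effect largely disappears.
summary(glm(y ~ H + P, family = binomial))$coefficients["H", ]
```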

VIF tells you how much the variance of the estimated coefficient might be inflated by correlations with other variables. A low VIF doesn't rule out omitted-variable bias.

This page is one of many on this site that discuss this general issue.

EdM
  • Thanks a lot for the detailed answer. I think the picture is now clear to me. Back to my initial question: the effects for all the independent variables except $H$ are statistically supported and the effect for $H$ is not, right? – 8k14 Nov 02 '17 at 17:19
  • @8k14 Yes, that would be my interpretation of the analyses that you presented. – EdM Nov 02 '17 at 17:20
  • Thanks. On the other hand, if I'm presented only with the second regression ($P$ omitted), I have solid reasons to believe that the hypothesis for $H$ is supported, right? – 8k14 Nov 02 '17 at 17:24
  • If you had no information about `P` and its relation both to outcome and to `H` then you might claim to have found statistical evidence for a role of `H` based on that limited model. That claim, however, would not be a good model of reality; with omitted-variable bias a "statistically significant" result can be grossly erroneous. – EdM Nov 02 '17 at 17:42
  • Thanks again. How can I know that this limited model is not a good model of reality? Are there any statistical methods for that, or is it only in the nature of the process? – 8k14 Nov 02 '17 at 17:53
  • [This page](https://stats.stackexchange.com/q/30131/28500) suggests that if you have an "instrumental variable" you might be able to test for omitted-variable bias. I have no experience with instrumental variables. But that's for OLS, not logistic regression as in your example. In logistic regression you can have omitted-variable bias even if the omitted variable is uncorrelated to the predictors in your model. – EdM Nov 02 '17 at 18:58
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/68102/discussion-between-8k14-and-edm). – 8k14 Nov 02 '17 at 19:32
  • Let me ask you once again. $H$ and $P$ are product attributes and $C$ is the choice (1/0) the customer makes. Trying to catch the "inertia effect" of these attributes, I ran two conditional logistic regressions including interaction terms $H:H0$ and $P:P0$, respectively, where $P0$ and $H0$ are the attributes of the product just purchased. The corresponding coefficient in the regression for $P$ is not statistically significant, while for $H$ everything is fine. May I say that I have this effect for $H$ even if $H$ is not directly related to the outcome? – 8k14 Nov 11 '17 at 18:40
  • @8k14 It's possible to have a significant interaction effect without a significant main effect. If that's what your "conditional logistic regression" (evidently including information about a prior purchase) says, what you describe is OK. This, however, is a change from the model originally described in your question. With correlated predictors like these, changes in models may well lead to switches between which predictor is deemed "significant." A predictor can still be "directly related to the outcome" yet be deemed not "significant" due to collinearity, too few cases, etc. – EdM Dec 07 '17 at 19:30
  • @EdM Thanks again for your answer, but I have to say that I didn't quite understand it. Is it legitimate to consider an interaction effect of a predictor that is not significant on its own? What model change do you mean? – 8k14 Dec 08 '17 at 03:06
  • @8k14 It is OK to consider an interaction with a predictor that is not "significant" on its own; its interaction with another predictor might be hiding any overall effect of the predictor on its own. For "model change" I meant your inclusion of interaction terms like H:H0 in the conditional logistic regression in your comment; such terms don't seem to appear in the original models presented in the question. So if that's the case, the models presented in the question are different from the models discussed in your comment. But I might have misunderstood. – EdM Dec 08 '17 at 18:01
  • @EdM Thanks a lot for your explanation. You are right, the predictor H is not in my base model. To check the inertia effect I include the interaction term H:H0. Should I include the single term H in the model as well? Without it, all the regressors are statistically significant and H is not, similarly to the case of the base model. – 8k14 Dec 15 '17 at 15:28
  • @8k14 the usual rule to avoid omitted-variable bias is to include all predictors that might reasonably be expected to have individual relations to the outcome or to affect the relations of other predictors to the outcome, if you can do so without overfitting. You have a very large number of events relative to the number of predictors, so overfitting is not an issue here. Your best approach would be to include all these predictors and an appropriate set of interactions in the final model (see the sketch after these comments). – EdM Dec 15 '17 at 16:48
  • @EdM Thank you again. Do you mean that I should include the factor H along with the factor P in my base model even if the former is not statistically significant and affects the outcome mostly through the latter? – 8k14 Dec 18 '17 at 06:26
  • @8k14 absolutely, yes. Otherwise you run a risk of omitted-variable bias, at least. – EdM Dec 18 '17 at 14:16
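A hedged sketch of the final model this comment thread converges on, assuming `H0` and `P0` are columns of `clog_data` holding the attributes of the previously purchased product (names as used in the comments); the crossed terms `H * H0` and `P * P0` expand to main effects plus interactions:

```r
## Hypothetical final model: all main effects plus the "inertia"
## interactions discussed above (P * P0 = P + P0 + P:P0, etc.).
library(survival)

clogit(C ~ D + CF + CT + P * P0 + H * H0 + LA + LP + strata(Case),
       clog_data)
```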

You should check the VIF, because you probably have the variables P and H highly correlated, which breaks an assumption of the regression in the 3rd model.

If the information from one variable is already incorporated into the model, the second one doesn't have a chance to be significant.

  • The high correlation between `P` and `H` doesn't break any assumptions about the logistic regression, but as you point out it does make it difficult to separate out any individual contributions of those 2 variables to the outcome. – EdM Nov 02 '17 at 13:45
  • Thanks. VIFs and correlation matrix are now included in my question. I may be wrong but I don't see collinearity here. – 8k14 Nov 02 '17 at 14:40