
I'm trying to fit a glm to some data that are highly correlated (I can see this from the data). However, when I fit the glm it gives a p-value of almost 1, which seems to indicate I'm not using the right test or have made a mistake. Does anyone know where I'm going wrong? I've provided some example data that are perfectly correlated to illustrate my point. Example data/code:

x <- c(0,0,0,0,1,1,1,0,0,1,0,1,0,1,0,1,0,0,1,1,0,0,0,1,0,0,1,0,1)
y <- c(0,0,0,0,1,1,1,0,0,1,0,1,0,1,0,1,0,0,1,1,0,0,0,1,0,0,1,0,1)
fit <- glm(y ~ x, family = binomial('logit'))
summary(fit)
– unknown

2 Answers


The problem is that the logistic regression has fitted values near 0 and 1, and the asymptotic formulas for standard errors in a binary regression are not at all accurate in this situation. The regression itself is fine; it just means you have to use a likelihood ratio test instead of a z-test to assess significance:

> x <- c(0,0,0,0,1,1,1,0,0,1,0,1,0,1,0,1,0,0,1,1,0,0,0,1,0,0,1,0,1)
> y <- c(0,0,0,0,1,1,1,0,0,1,0,1,0,1,0,1,0,0,1,1,0,0,0,1,0,0,1,0,1)
> fit <- glm(y ~ x, family = binomial('logit'))
> anova(fit, test="Chi")
Analysis of Deviance Table

Model: binomial, link: logit

Response: y

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                    28     39.336              
x     1   39.336        27      0.000 3.568e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As you can see, the p-value for the regression of $y$ on $x$ is $3.6\times 10^{-10}$. Highly significant!

This problem occurs in logistic regression whenever the fit is "too good" (a situation known as complete separation) and the estimated regression coefficient becomes infinitely large. In this case, dividing the coefficient by its standard error to get a z-statistic becomes meaningless (infinity divided by infinity), so you have to switch to the much better likelihood ratio test provided by anova.
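
For reference, here is a minimal sketch of the same likelihood ratio test computed by hand from the deviances stored in the fitted object (using the fit object from the code above); the p-value matches the one anova() reports:

lr_stat <- fit$null.deviance - fit$deviance   # drop in deviance when x is added
df <- fit$df.null - fit$df.residual           # 1 degree of freedom here
pchisq(lr_stat, df = df, lower.tail = FALSE)  # p-value of the likelihood ratio test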

– Gordon Smyth

Depending on the number of covariates you have ($p$), I suggest you try one of the following:

  1. Use LASSO regression to decrease $p$ (a sketch covering options 1-3 follows this list).

  2. Run PCA to get a sense of the effective dimension of your data, or use an SVD to see which singular values are close to 0.

  3. Check the condition number of your covariate matrix in order to detect collinearity.

  4. Examine a no-intercept model (in R: glm(y ~ x - 1, family = binomial('logit'))), especially if you have only one covariate.
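
A minimal sketch of checks 1-3, assuming your covariates are collected in a numeric matrix (the X and y below are toy stand-ins, not your data) and that the glmnet package is installed:

# Toy data: 100 observations, 3 covariates, with x3 nearly collinear with x1.
set.seed(1)
X <- matrix(rnorm(300), nrow = 100, ncol = 3)
colnames(X) <- c("x1", "x2", "x3")
X[, 3] <- X[, 1] + rnorm(100, sd = 0.01)
y <- rbinom(100, 1, plogis(X[, 1]))

# 3. Condition number of the covariate matrix: large values flag collinearity.
kappa(X)

# 2. Singular values: values near 0 suggest a lower effective dimension.
svd(X)$d

# 1. LASSO (alpha = 1) with a cross-validated penalty to shrink/drop covariates.
library(glmnet)
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.min")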

Once you understand the source of the problem, a proper solution can be found. Personally, I refrain from using logistic regression when $p=1$, as there are better classifiers (such as LDA); a toy illustration follows.
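
A hypothetical one-predictor illustration of the LDA alternative, assuming the MASS package is installed (x0 and y0 are toy names, not variables from the question):

library(MASS)
set.seed(1)
x0 <- rnorm(100)                                        # toy single predictor
y0 <- rbinom(100, 1, plogis(2 * x0))                    # toy binary response
lda_fit <- lda(factor(y0) ~ x0)                         # fit linear discriminant analysis
table(predicted = predict(lda_fit)$class, actual = y0)  # confusion table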

– Spätzle
  • Hi! Thanks for your answer. I'm not sure I really understand, though; the problem isn't that I need to decrease p per se. The problem is that at the moment the result for the example data shows that x has no effect on y, when in fact y can be predicted perfectly if you have x (i.e. if x = 1, y = 1 too). Given that, it seems to me that I am doing something wrong! – unknown Nov 06 '17 at 09:50
  • This IS NOT what it says. The reported p-value is the minimal significance level at which we would reject the null hypothesis $H_0:\beta_x=0$, under the model assumptions of logistic regression. If you only have vectors $y$ and $x$ and their correlation is high, there is no need for logistic regression. – Spätzle Nov 06 '17 at 10:03
  • I must really be missing the point here, sorry for being slow. The reason for doing Logistic regression is that there are three predictors in the model (the other two are continuous predictors). The model described is similar to that in this answer: https://stats.stackexchange.com/a/311580/52956 – unknown Nov 06 '17 at 10:24
  • Okay, so the model you use is nothing like the example you originally gave. Please post a screenshot of the glm summary or copy it, so I can see what's going on. – Spätzle Nov 06 '17 at 10:29
  • OK, I have edited the output from the glm summary into the original question. Thanks for the help! (presencet0 is the binomial predictor) – unknown Nov 06 '17 at 10:33
  • Which predictor is highly correlated with y? is it `presencet01`? – Spätzle Nov 06 '17 at 10:37
  • Yes, presencet0 (I don't know where the 1 comes from in the output?) is a binomial predictor that is highly correlated with y – unknown Nov 06 '17 at 10:39
  • Well, as I've explained before, the high p-value suggests there is no evidence that `presencet01`'s coefficient differs from 0. Back to my original answer: try the 4th option and report back the results. – Spätzle Nov 06 '17 at 10:41
  • OK, that's good to know. I've posted the output in the original question now. Not sure what presencet01 and presencet00 are, because I only provide presencet0. – unknown Nov 06 '17 at 10:49
  • R sees `presencet0` as a factor, so `presencet00` is an indicator vector marking the places where `presencet0` = 0, and the same applies to `presencet01`. Try running the following: `glm(formula= f_dists1$presence ~ f_dists1$length+ f_dists1$sizeF+ f_dists1$presencet0 -1, family=binomial(link="logit"))` – Spätzle Nov 06 '17 at 10:55
  • Ah I see! So the model finds significance for the vector of presencet0 = 0, but not when it is 1. Yep, I've posted that above now. – unknown Nov 06 '17 at 10:59
  • Damn it, I forgot something before. Try running `glm(formula= f_dists1$presence ~ f_dists1$length+ f_dists1$sizeF+ as.numeric(as.character(f_dists1$presencet0)) -1, family=binomial(link="logit"))`. It should treat presencet0 as a numeric now. – Spätzle Nov 06 '17 at 11:06
  • OK, I've done that now. I think presence should be a factor, though, because it is binomial and so not a numeric predictor? – unknown Nov 06 '17 at 11:08
  • OK, try running again **with** the intercept (i.e. remove the `-1` from your formula). – Spätzle Nov 06 '17 at 11:18
  • OK, that's done now! – unknown Nov 06 '17 at 11:23
  • Well, all in all it seems that despite our efforts the correlated variable isn't helping to predict the outcome. Given its relatively large coefficient (20.64), it is probably also highly correlated with the other two predictors, so with the linear predictor already high, it just pushes towards higher certainty that $\hat{y}=1$. – Spätzle Nov 06 '17 at 11:31