Logistic glm with good predictors is giving p-values = 1

Question

I have the following data frame, on which I ran a logistic regression with `response` as the outcome. Some of these variables should be good predictors, so I expected at least some significant coefficients.

structure(list(response = c(0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 
    0L, 0L, 1L, 0L, 1L, 0L), HIST1H3F_rna = c(1.09861228866811, 0.693147180559945, 
    2.07944154167984, 1.09861228866811, 1.79175946922805, 0, 0, 0, 
    2.39789527279837, 1.38629436111989, 1.6094379124341, 1.6094379124341, 
    0.693147180559945, 1.79175946922805, 0), NCF1_rna = c(2.77258872223978, 
    3.09104245335832, 2.63905732961526, 2.19722457733622, 2.30258509299405, 
    2.56494935746154, 3.09104245335832, 3.98898404656427, 2.56494935746154, 
    4.06044301054642, 3.87120101090789, 2.07944154167984, 3.49650756146648, 
    3.17805383034795, 3.95124371858143), WDR66_rna = c(5.06890420222023, 
    4.49980967033027, 5.11799381241676, 3.40119738166216, 3.25809653802148, 
    4.02535169073515, 5.8348107370626, 5.89440283426485, 3.87120101090789, 
    5.67675380226828, 5.35185813347607, 4.15888308335967, 6.23441072571837, 
    5.91889385427315, 3.68887945411394), PTH2R_rna = c(0.693147180559945, 
    5.08759633523238, 0.693147180559945, 1.09861228866811, 0, 6.01126717440416, 
    6.56526497003536, 5.18178355029209, 0, 4.36944785246702, 2.19722457733622, 
    1.09861228866811, 3.49650756146648, 1.38629436111989, 5.93753620508243
    ), HAVCR2_rna = c(4.48863636973214, 3.40119738166216, 3.09104245335832, 
    2.94443897916644, 3.2188758248682, 3.76120011569356, 3.95124371858143, 
    2.83321334405622, 2.07944154167984, 4.36944785246702, 3.58351893845611, 
    1.94591014905531, 4.23410650459726, 3.43398720448515, 2.56494935746154
    ), CD200R1_rna = c(2.484906649788, 2.94443897916644, 0.693147180559945, 
    1.94591014905531, 0.693147180559945, 2.89037175789616, 2.56494935746154, 
    1.6094379124341, 1.6094379124341, 1.94591014905531, 2.19722457733622, 
    0.693147180559945, 4.26267987704132, 1.6094379124341, 0.693147180559945
    )), .Names = c("response", "HIST1H3F_rna", "NCF1_rna", "WDR66_rna", 
    "PTH2R_rna", "HAVCR2_rna", "CD200R1_rna"), row.names = c(NA, 
    -15L), class = "data.frame")

However, when I fit the model with the lines below and look at its summary, every variable has a p-value of 1 and the standard errors are enormous. What's going on here?

```
fullmod <- glm(response ~ ., data=final_model, family='binomial')
summary(fullmod)
```

```
Call:
glm(formula = response ~ ., family = "binomial", data = final_model)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-6.515e-06  -2.404e-06  -2.110e-08   2.110e-08   7.470e-06  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.460e+02  5.598e+05       0        1
HIST1H3F_rna  2.135e+01  5.145e+05       0        1
NCF1_rna     -4.133e+01  3.388e+05       0        1
WDR66_rna     1.296e+01  6.739e+05       0        1
PTH2R_rna     1.975e+00  3.775e+05       0        1
HAVCR2_rna   -2.477e+01  1.191e+06       0        1
CD200R1_rna  -1.420e+01  1.315e+06       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2.0190e+01  on 14  degrees of freedom
Residual deviance: 2.2042e-10  on  8  degrees of freedom
AIC: 14

Number of Fisher Scoring iterations: 25
```

In response to your comments I'll show the feature selection step (and the complete dataframe I'm working with below that).

```
# Forward feature selection, scored by leave-one-out cross-validation (LOOCV)
library(boot)  # provides cv.glm()

z <- c()  # LOOCV error for each allowed number of forward steps
nullmod <- glm(response ~ 1, data = final_model, family = 'binomial')  # 'empty' model
fullmod <- glm(response ~ ., data = final_model, family = 'binomial')  # full model
first <- TRUE
for (x in 1:ncol(final_model)) {
  # Allow at most x forward steps, starting from the empty model
  stepmod <- step(nullmod, scope = list(lower = formula(nullmod), upper = formula(fullmod)),
                  direction = "forward", data = final_model, steps = x, trace = FALSE)
  # K = nrow(final_model) gives leave-one-out CV; delta[1] is the raw CV error estimate
  cv.err <- cv.glm(data = final_model, glmfit = stepmod, K = nrow(final_model))$delta[1]
  if (first) {
    first <- FALSE
    final_features <- stepmod
  } else if (cv.err < min(z)) {
    final_features <- stepmod  # lowest LOOCV error so far
  }
  z[x] <- cv.err
  print(paste(x, cv.err))
  print(colnames(final_features$model))
}

plot(z, main = 'Forward Feature Selection GLM Final Model',
     xlab = 'Number of Steps', ylab = 'LOOCV-error', col = 'red', type = 'l')
points(z)
colnames(final_features$model)
summary(final_features)
```

structure(list(response = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 1L, 1L, 1L), HIST1H3F_rna = c(1.09861228866811, 2.07944154167984, 
1.09861228866811, 1.79175946922805, 0, 0, 0, 2.39789527279837, 
1.38629436111989, 1.6094379124341, 1.6094379124341, 0.693147180559945, 
2.19722457733622, 2.39789527279837, 2.89037175789616), NCF1_rna = c(2.77258872223978, 
2.63905732961526, 2.19722457733622, 2.30258509299405, 2.56494935746154, 
3.09104245335832, 3.98898404656427, 2.56494935746154, 4.06044301054642, 
3.87120101090789, 2.07944154167984, 3.49650756146648, 2.07944154167984, 
2.07944154167984, 1.09861228866811), WDR66_rna = c(5.06890420222023, 
5.11799381241676, 3.40119738166216, 3.25809653802148, 4.02535169073515, 
5.8348107370626, 5.89440283426485, 3.87120101090789, 5.67675380226828, 
5.35185813347607, 4.15888308335967, 6.23441072571837, 4.0943445622221, 
4.21950770517611, 3.95124371858143), PTH2R_rna = c(0.693147180559945, 
0.693147180559945, 1.09861228866811, 0, 6.01126717440416, 6.56526497003536, 
5.18178355029209, 0, 4.36944785246702, 2.19722457733622, 1.09861228866811, 
3.49650756146648, 0, 0.693147180559945, 1.38629436111989), 
HAVCR2_rna = c(4.48863636973214, 
3.09104245335832, 2.94443897916644, 3.2188758248682, 3.76120011569356, 
3.95124371858143, 2.83321334405622, 2.07944154167984, 4.36944785246702, 
3.58351893845611, 1.94591014905531, 4.23410650459726, 1.38629436111989, 
1.09861228866811, 1.38629436111989), CD200R1_rna = c(2.484906649788, 
0.693147180559945, 1.94591014905531, 0.693147180559945, 2.89037175789616, 
2.56494935746154, 1.6094379124341, 1.6094379124341, 1.94591014905531, 
2.19722457733622, 0.693147180559945, 4.26267987704132, 1.94591014905531, 
0, 0.693147180559945), GDF7 = c(0.2232, -0.7281, 0.0655, -0.7919, 
0.175, 0.0891, 0.4396, -0.2774, -0.4079, 0.4069, 0.3057, 0.7371, 
-0.4978, -0.5096, -0.0827), HS1BP3 = c(0.2232, -0.7281, 0.0655, 
-0.7919, 0.175, 0.0891, 0.4396, -0.2774, -0.4079, 0.4069, 0.3057, 
0.7371, -0.4978, -0.5096, -0.0827), NKAIN3 = c(0.4072, 0.3216, 
-0.5466, -0.1588, 0.4515, 0.2849, 0.1675, 0.0847, 0.6601, 0.6331, 
-0.135, 1.3532, -0.503, -0.1241, 0.2061), UG0898H09 = c(0.4072, 
0.3216, -0.5466, -0.1588, 0.4515, 0.2849, 0.1675, 0.0847, 0.6601, 
0.6331, -0.135, 1.3532, -0.503, -0.1241, 0.2061), C15orf41 = c(0.122, 
-0.7519, -1.1267, -0.7882, -0.1117, -0.5105, -0.3905, -0.6834, 
-0.5944, 0.0714, -0.8134, -0.0115, -1.1112, -1.1488, -0.4878), 
    FAM98B = c(-0.1871, -0.7519, -1.1267, -0.7882, -0.1117, -0.5105, 
    -0.3905, -0.6834, -0.5944, 0.0714, -0.8134, -0.0115, -1.1112, 
    -1.1488, -0.4878), SPRED1 = c(-0.1871, -0.7519, -1.1267, 
    -0.7882, -0.1117, -0.5105, -0.3905, -0.6834, -0.5944, 0.0714, 
    -0.8134, -0.0115, -1.1112, -1.1488, -0.4878), MPDZ_ex = c(1, 
    0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0), TPR_ex = c(0, 
    0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), BUB1B_ex = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0), APC_ex = c(0, 
    0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), ATM_ex = c(0, 
    0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0), DYNC1LI1_ex = c(0, 
    0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), TTK_ex = c(0, 
    0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0), PSMG2_ex = c(1, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), NegRegMitosis = c(1, 
    0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0), brca1ness = c(0.037719, 
    0.900878, 0.013261, 0.900878, 0.659963, 0.005629, 9.8e-05, 
    0.996336, 0.910072, 0.850776, 0.000613, 0.104428, 0.978114, 
    0.938767, 0.041696), Methylation = c(0L, 0L, 0L, 1L, 1L, 
    1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L), LinoleicAcid_Metab = structure(c(2L, 
    2L, 2L, 2L, 1L, 3L, 2L, 2L, 1L, 5L, 2L, 5L, 1L, 2L, 2L), .Label = c("CYP2E1_high", 
    "CYP2E1_med", "high", "low", "PLA2G2A_high"), class = "factor"), 
    Neuro_lr = structure(c(2L, 2L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 
    1L, 1L, 3L, 3L, 1L, 1L), .Label = c("1", "2", "3", "4"), class = "factor"), 
    NOX_signalling = structure(c(2L, 2L, 2L, 2L, 1L, 2L, 1L, 
    2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L), .Label = c("high", "low"
    ), class = "factor")), .Names = c("response", "HIST1H3F_rna", 
"NCF1_rna", "WDR66_rna", "PTH2R_rna", "HAVCR2_rna", "CD200R1_rna", 
"GDF7", "HS1BP3", "NKAIN3", "UG0898H09", "C15orf41", "FAM98B", 
"SPRED1", "MPDZ_ex", "TPR_ex", "BUB1B_ex", "APC_ex", "ATM_ex", 
"DYNC1LI1_ex", "TTK_ex", "PSMG2_ex", "NegRegMitosis", "brca1ness", 
"Methylation", "LinoleicAcid_Metab", "Neuro_lr", "NOX_signalling"
), row.names = c(NA, -15L), class = "data.frame")

Summary now gives the following:

```
Call:
glm(formula = response ~ NegRegMitosis, family = "binomial", 
    data = final_model)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-3.971e-06  -3.971e-06   3.971e-06   3.971e-06   3.971e-06  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)       25.57   76367.61       0        1
NegRegMitosis    -51.13  111790.71       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2.0728e+01  on 14  degrees of freedom
Residual deviance: 2.3655e-10  on 13  degrees of freedom
AIC: 4

Number of Fisher Scoring iterations: 24
```

Again, even in a single-predictor model, my p-value is 1. The predictor in this case is equal to the response, so it should predict perfectly. Then why is my p-value 1?

See `table(final_model$response, final_model$NegRegMitosis)`: IV and DV are mutually exclusive. Combined with so few observations, I would assume that the model just won't fit well. – Daniel, Nov 24 '15 at 13:43
Hmm, I thought binary predictors could definitely be used in a logistic model. Fitting such data should be an easy, error-free task. – Ansjovis86, Nov 24 '15 at 13:48
It's not about binary predictors in general, but about your particular case: `NegRegMitosis` is never 1 when your "event" occurs (`response = 1`). This may be just another reason why your model fits so badly. – Daniel, Nov 24 '15 at 14:15
I don't think it's that. If I invert the 1s and 0s in response, it's the same thing again. If you plot that table of yours, then by eye I could fit a logistic function to it without making errors, so it must be something else. – Ansjovis86, Nov 24 '15 at 14:35
Not so. No logistic function ever goes through 0 and 1, as already pointed out below my answer. Your data do not give enough information to do anything without major uncertainty. – Nick Cox, Nov 24 '15 at 14:55
OK, somebody has the same problem [here](http://stats.stackexchange.com/questions/165158/glm-high-standard-errors-but-variables-are-definitely-not-collinear). So I guess we've got a Hauck-Donner phenomenon here. – Ansjovis86, Nov 24 '15 at 17:12
Why not? Don't I have a perfectly separating variable as well? – Ansjovis86, Nov 24 '15 at 17:38

Nick Cox · Answer 1 · 2015-11-24T12:53:30.840

6

Here are your data shown more plainly:

   response HIST1H3F_rna NCF1_rna WDR66_rna PTH2R_rna HAVCR2_rna CD200R1_rna
1         0    1.0986123 2.772589  5.068904 0.6931472   4.488636   2.4849066
2         0    0.6931472 3.091042  4.499810 5.0875963   3.401197   2.9444390
3         1    2.0794415 2.639057  5.117994 0.6931472   3.091042   0.6931472
4         1    1.0986123 2.197225  3.401197 1.0986123   2.944439   1.9459101
5         1    1.7917595 2.302585  3.258097 0.0000000   3.218876   0.6931472
6         0    0.0000000 2.564949  4.025352 6.0112672   3.761200   2.8903718
7         0    0.0000000 3.091042  5.834811 6.5652650   3.951244   2.5649494
8         0    0.0000000 3.988984  5.894403 5.1817836   2.833213   1.6094379
9         1    2.3978953 2.564949  3.871201 0.0000000   2.079442   1.6094379
10        0    1.3862944 4.060443  5.676754 4.3694479   4.369448   1.9459101
11        0    1.6094379 3.871201  5.351858 2.1972246   3.583519   2.1972246
12        1    1.6094379 2.079442  4.158883 1.0986123   1.945910   0.6931472
13        0    0.6931472 3.496508  6.234411 3.4965076   4.234107   4.2626799
14        1    1.7917595 3.178054  5.918894 1.3862944   3.433987   1.6094379
15        0    0.0000000 3.951244  3.688879 5.9375362   2.564949   0.6931472

You have a small sample with 15 observations, so throwing 6 predictors at them is asking too much. There are rules of thumb in the literature and people have friendly arguments about what to advise, but I know of no advice that 2.5 observations per predictor is enough. Indeed given the enormous standard errors (essentially 3, 4 or 5 orders of magnitude larger than the coefficient estimates) the model found is useless (and it's a puzzle to me that it converged at all; I couldn't reproduce these results in Stata).
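
The arithmetic behind that can be checked directly. A minimal sketch, using only the `response` vector from the question's first data frame; the variable names (`n_obs`, `n_pred`, `n_events`) are illustrative, and any threshold you compare the results against is a rule of thumb rather than a hard limit:

```r
# response column from the question's first data frame
response <- c(0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0)

n_obs  <- length(response)  # 15 observations
n_pred <- 6                 # predictors in the full model

# Observations per predictor
n_obs / n_pred              # 2.5

# Events per predictor, counting the less frequent outcome
n_events <- min(sum(response == 1), sum(response == 0))
n_events / n_pred           # 1
```

Counting events rather than rows is the stricter view, since a binary outcome carries little information per observation.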

I see no evidence in simple models that any of these individually is a good predictor. Evidently logit is the default link in R for what you asked and I agree that it seems the appropriate model with which to start. With Stata I find the P-values for individual predictors in one-predictor models:

HIST1H3F_rna  0.065   
    NCF1_rna  0.053 
   WDR66_rna  0.118   
   PTH2R_rna  0.140   
  HAVCR2_rna  0.061   
 CD200R1_rna  0.051
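
For anyone who wants to repeat the one-predictor fits in R rather than Stata, here is a sketch. It assumes the data as printed in the table above (so the values are rounded), and the names `d`, `predictors` and `p_values` are illustrative. It extracts the Wald p-value of each slope; I have not checked how closely R's output matches the Stata figures.

```r
# Data as printed in the table above (rounded values)
d <- data.frame(
  response     = c(0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0),
  HIST1H3F_rna = c(1.0986123, 0.6931472, 2.0794415, 1.0986123, 1.7917595,
                   0, 0, 0, 2.3978953, 1.3862944, 1.6094379, 1.6094379,
                   0.6931472, 1.7917595, 0),
  NCF1_rna     = c(2.772589, 3.091042, 2.639057, 2.197225, 2.302585,
                   2.564949, 3.091042, 3.988984, 2.564949, 4.060443,
                   3.871201, 2.079442, 3.496508, 3.178054, 3.951244),
  WDR66_rna    = c(5.068904, 4.499810, 5.117994, 3.401197, 3.258097,
                   4.025352, 5.834811, 5.894403, 3.871201, 5.676754,
                   5.351858, 4.158883, 6.234411, 5.918894, 3.688879),
  PTH2R_rna    = c(0.6931472, 5.0875963, 0.6931472, 1.0986123, 0,
                   6.0112672, 6.5652650, 5.1817836, 0, 4.3694479,
                   2.1972246, 1.0986123, 3.4965076, 1.3862944, 5.9375362),
  HAVCR2_rna   = c(4.488636, 3.401197, 3.091042, 2.944439, 3.218876,
                   3.761200, 3.951244, 2.833213, 2.079442, 4.369448,
                   3.583519, 1.945910, 4.234107, 3.433987, 2.564949),
  CD200R1_rna  = c(2.4849066, 2.9444390, 0.6931472, 1.9459101, 0.6931472,
                   2.8903718, 2.5649494, 1.6094379, 1.6094379, 1.9459101,
                   2.1972246, 0.6931472, 4.2626799, 1.6094379, 0.6931472)
)

predictors <- setdiff(names(d), "response")

# Fit one logistic model per predictor and keep the Wald p-value of its slope
p_values <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "response"), data = d, family = binomial)
  coef(summary(fit))[v, "Pr(>|z|)"]
})

round(p_values, 3)
```

These are six separate models, not one combined model, so the p-values are not adjusted for each other or for having looked at six candidates.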

Although in principle you might do better with two predictors, you immediately run into the problem with which we started, that this is too small a sample. Naturally it could well be that each data point is a lot of work scientifically. Others who know more about the science (not difficult, as I think I recognise "rna" as a nucleic acid reference, but no more) may be able to add (or subtract) usefully.

edited Nov 24 '15 at 12:53

answered Nov 24 '15 at 12:14


To my understanding, an often-repeated recommendation is 10 outcome events per predictor variable. This might be too strict in some circumstances, but in this case there is only 1 outcome event per predictor variable: http://www.ncbi.nlm.nih.gov/pubmed/17182981 – JonB Nov 24 '15 at 12:26
@JonB Indeed; number of observations is only a start as a criterion. – Nick Cox Nov 24 '15 at 12:28
I've also incorporated a forward feature selection step that selects the predictors. The new data frame also has more features, as the data frame above was just a selection. However, I still get a model, now with just 2 predictors, but again with P-values of 1. Could this be because that combination is a perfect predictor of the response? – Ansjovis86 Nov 24 '15 at 12:43
Forward selection can't help you here; nor is it evident in your syntax, unless there is some previous stage you are not showing us. It makes no difference to the underlying problem. Nor am I getting better results in Stata, unless not converging is deemed better than what you got. – Nick Cox Nov 24 '15 at 12:47
Comments are crossing here. We can't see your extra results, but they sound no better than before. – Nick Cox Nov 24 '15 at 12:49
I edited my post to underline that the P-values I cited are for separate single-predictor models, not a combined model. – Nick Cox Nov 24 '15 at 12:54
Wait, let me expand the data and feature selection. – Ansjovis86 Nov 24 '15 at 12:55
Please don't overwrite what you first posted. Otherwise the answer and the comments won't make much sense. Add further material if you think you can push the discussion forward, but 15 observations are totally inadequate for what you are trying and adding more predictors or being smarter about which to use won't really help. – Nick Cox Nov 24 '15 at 13:01
OK, it's added and not overwritten. I feel that the single-predictor model with NegRegMitosis that comes out of the feature selection is a different problem, because if I apply this feature selection to the first dataset I showed, I do get a model back with a significant predictor. – Ansjovis86 Nov 24 '15 at 13:09
Look at your data closely: when `response` is 1 then `NegRegMitosis` is 0 and vice versa. That's not what you say, but setting that aside, your intuition is perhaps based on fitting straight lines. In this case the model cannot go through the data points at all as no logistic curve can ever predict 0 or 1. I'd suggest seeking statistical help locally as I don't seem able to convince you. – Nick Cox Nov 24 '15 at 13:43
The concurrent thread http://stats.stackexchange.com/questions/183337/mse-huge-when-estimating-regression-from-small-samples is pertinent too. – Nick Cox Nov 24 '15 at 13:47
For the record, I agree totally with Nick Cox. The number of observations is far too small for what you're trying to do, regardless of which model you select. I think that your best option is to just present the results of the bivariate regression models and omit the multivariable analysis. – JonB Nov 24 '15 at 15:03

score 6 · Accepted Answer · edited Apr 13 '17 at 12:44

Your expanded dataset still has only $N = 15$ observations, so @NickCox's point still applies. The rule of thumb I've heard for logistic regression is that you should have $15$ observations of the less frequent outcome for each variable to avoid a model that approaches saturation. (That is, a minimum of $N = 30$ to support a single variable when your response is split perfectly $50\%-50\%$.) This contrasts with the rule of thumb for linear regression, where you need $10$ data points per variable. Something to bear in mind here is that there is little information in a binary data point. You have enough data to estimate the marginal proportion, but not enough to add a single predictor variable to your model.
Stepwise variable selection is generally not a good thing to do. (It may help you to read my answer here: Algorithms for automatic model selection.) You have tried to mitigate those problems by using cross-validation. That is certainly an improvement, but the result is still at best valid only for out-of-sample predictive accuracy (and your model is almost certainly overfit even for that). You cannot use or interpret p-values after stepwise variable selection.
You do have separation in your dataset, albeit not in any single variable. There are several hints that this is the case: the number of Fisher scoring iterations is 24, a very high number (4-5 is typical); also your coefficient estimates are large with huge standard errors. Here are some bivariate scatterplots of some of your predictor variables with the responses marked by color and symbol, and the separation marked with a dashed gray line:

The separation is the single biggest cause of your high p-values. (Stepwise selection makes p-values invalid, but will make them too low.) Consider—even with insufficient data—you get standard errors that are on the order of $10^0$ (not $10^5$), just 5 Fisher scoring iterations, and lower p-values using only WDR66_rna and HAVCR2_rna (which overlap a good deal):
```
summary(glm(response~WDR66_rna+HAVCR2_rna, d, family=binomial))
# ...
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)  
# (Intercept)   7.8350     4.3765   1.790   0.0734 .
# WDR66_rna    -0.3627     0.7132  -0.509   0.6111  
# HAVCR2_rna   -2.0144     1.2934  -1.557   0.1194  
# ...
#     Null deviance: 20.190  on 14  degrees of freedom
# Residual deviance: 13.882  on 12  degrees of freedom
# AIC: 19.882
# 
# Number of Fisher Scoring iterations: 5
```
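
The hints of separation listed above can be checked in a few lines. This is a sketch only, using the `response` and `NegRegMitosis` columns copied from the question's expanded data frame, where (as the comments under the question point out) `NegRegMitosis` on its own splits the response. The object names `tab` and `fit` are illustrative. Depending on how far the iterations run, `glm` may or may not warn about fitted probabilities of 0 or 1, so the checks do not rely on a warning.

```r
# Two columns copied from the question's expanded data frame
response      <- c(0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1)
NegRegMitosis <- c(1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0)

# A 2x2 table with two empty cells means the predictor splits the outcome perfectly
tab <- table(response, NegRegMitosis)
print(tab)

# Fit the one-predictor logistic model from the question
fit <- glm(response ~ NegRegMitosis, family = binomial)

fit$iter                            # Fisher scoring iterations used
coef(summary(fit))[, "Std. Error"]  # standard errors of the estimates
c(null = fit$null.deviance, residual = fit$deviance)
```

A residual deviance near zero alongside standard errors several orders of magnitude larger than the estimates is the pattern to look for: the fit is essentially perfect, but the likelihood is flat in the direction of ever-larger coefficients, so the Wald z statistics collapse to 0.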

+1. This dives deeper and explains much more than my earlier answer. The graphical diagnosis of separation is especially instructive. — Nick Cox, Nov 25 '15 at 09:17
