Stepwise logistic regression: non-significant variables (high p-values)

Question

I am doing stepwise selection for a logistic regression, and the p-values of all the selected variables were higher than 0.05. Following the post "Stepwise regression in R – Critical p-value", I changed the code to the following:

step(glm(y ~ ., data = mydat, family = "binomial"), direction = "both", k = 9)
# 9.22952 = qchisq(0.05, 3.84, lower.tail = FALSE)

but the problem persists: I still get the following p-values

0.21
0.19

What should I do? Thanks a lot in advance for any help.

qchisq(0.05, 1, lower.tail = FALSE) does not equal 9 in my version of R. I suppose it is pointless telling you that the post you link to does say this (stepwise modelling) is a bad idea anyway. – mdewey, Dec 04 '16 at 14:23
Yes mdewey, you are right, I modified the post. Why is stepwise modelling a bad idea? – prep, Dec 04 '16 at 14:29
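
For readers following along, the critical value discussed in these comments can be checked directly. This is a small added sketch, not part of the original thread, and the variable names are made up for illustration. In `step()`, `k` is the penalty per degree of freedom (`k = 2` gives the usual AIC), so for a term with one degree of freedom, `k = qchisq(0.05, 1, lower.tail = FALSE)` corresponds roughly to a 0.05 likelihood-ratio threshold.

```r
# Upper 5% critical value of a chi-square with 1 degree of freedom
crit_1df <- qchisq(0.05, df = 1, lower.tail = FALSE)

# What the question computed: 3.84 was passed as the degrees of freedom
crit_op <- qchisq(0.05, df = 3.84, lower.tail = FALSE)

print(c(crit_1df = crit_1df, crit_op = crit_op))
```

The first value is about 3.84. The second is the much larger value (around 9) that was used as `k` in the question, which makes the selection far stricter than a 0.05 threshold.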

score 7 · Answer 1 · answered Dec 04 '16 at 16:41

I think there is a more serious issue here than the use of stepwise regression:

"But the use is a must for me! How can I get 0.05 p-values, meaning significant variables, using stepwise?"

Science is not a quest for p < 0.05. Science is a quest for discovering repeatable and understandable patterns in our world. If you go into research looking for p < 0.05, you will find it with enough effort. Unfortunately, to do so, you sell out the soul of true science, and your results will no longer be scientific.

The idea behind the p < 0.05 threshold is to guarantee that at most $5\%$ of research findings are false positives. But this guarantee makes a lot of assumptions about the honesty and integrity of the scientists using the statistical tools. Dredging your data to find p < 0.05 is about the worst thing you can do; it annihilates all of the guarantees the statistical framework is supposed to provide.

So yes, we could tell you how to torture your data until you get the magical p < 0.05, but we will not do so. To do so would be to sell out the thing we truly love, science.
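
To make the point about dredging concrete, here is a small simulation sketch, added as an illustration and not taken from the answer. It assumes pure-noise predictors, so no variable has any real effect on `y`; the sizes `n`, `p`, and `n_sim` are arbitrary choices, and `any_sig` and `frac_sig` are hypothetical names.

```r
set.seed(1)
n <- 200      # observations per simulated data set
p <- 10       # noise predictors, none related to y
n_sim <- 20   # number of simulated data sets

# Penalty that mimics a 0.05 threshold for 1-df terms in step()
k_crit <- qchisq(0.05, df = 1, lower.tail = FALSE)

any_sig <- logical(n_sim)
for (s in seq_len(n_sim)) {
  dat <- as.data.frame(matrix(rnorm(n * p), n, p))
  dat$y <- rbinom(n, 1, 0.5)  # outcome independent of all predictors

  full <- glm(y ~ ., data = dat, family = binomial)
  sel <- step(full, direction = "both", k = k_crit, trace = 0)

  # Wald p-values of the selected predictors (intercept dropped)
  pv <- coef(summary(sel))[-1, "Pr(>|z|)"]
  any_sig[s] <- any(pv < 0.05)
}

# Share of data sets where selection left at least one "significant" predictor
frac_sig <- mean(any_sig)
print(frac_sig)
```

Under these assumptions every "significant" predictor is a false positive, yet a noticeable share of the runs will typically end with one, because the p-values are computed as if no selection had taken place.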

Matthew Drury


I agree with Frank. You need a deeper understanding of the procedures you are using. You need to understand why p-values are invalidated by stepwise selection. There are many, many explanations of this in the history of this site. There is no simple fix, and there is no recipe for correct science. Only a deeper understanding will help you. – Matthew Drury Dec 04 '16 at 17:43
Honestly, if you are having trouble dredging for p-values meeting an arbitrary threshold, it seems likely that the scientifically honest conclusion is that your desired research result is probably false. – Matthew Drury Dec 04 '16 at 17:44
So what can I use in place of stepwise, or how should I comment on these p-values? I should add that I am studying the influence of quantitative variables on a qualitative variable (normal/abnormal); this is why I am using logistic regression and stepwise selection! – prep Dec 04 '16 at 17:44
Perhaps the OP would like to look at http://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection which is one of the 'many, many explanations' – mdewey Dec 04 '16 at 17:51
I should maybe add the detail that when I ran `glm(y~.,data=mydat,family="binomial")` I got these warnings: "glm.fit: algorithm did not converge" and "glm.fit: fitted probabilities numerically 0 or 1 occurred". When I looked at the summary of the result, all the variables were significant (p-values < 0.05), but after stepwise selection none of the selected variables is significant! – prep Dec 04 '16 at 17:56
Did this influence the results? – prep Dec 04 '16 at 20:12
Just to point this out: a p<0.05 threshold certainly does not guarantee that at most 5% of research findings are false positives. It guarantees that at most 5% of p-values will be <0.05 under the null hypothesis, but depending on how many of the investigated null hypotheses hold, the false positive rate could be way higher than 5%. – Björn Dec 05 '16 at 11:06
Did lrtest() solve the problem? – prep Dec 05 '16 at 10:33
Fair point @Björn, my classical hypothesis testing is weak. If you'd like to edit for more truthfulness I would appreciate it. – Matthew Drury Dec 05 '16 at 14:27
@prep I don't know what you mean by "solve the problem". Lack of significance is not in itself a problem. – Matthew Drury Dec 05 '16 at 16:04
So what, Matthew, I will not have a model if all the variables are non-significant?! – prep Dec 05 '16 at 20:14
You still have a model if all the variables are not significant. It only means that, in that model, the contribution of each individual predictor is not distinguishable from noise. The overall model may still be predictive. If your goal is to answer a research question such as "this variable has a statistically significant effect on this thing", then, indeed, you have not proven that to be the case. – Matthew Drury Dec 05 '16 at 21:39
Good morning. What surprises me is that all variables were significant before stepwise selection! What can I conclude from this result? Should I keep the first model or the model after selection? – prep Dec 06 '16 at 09:19
Any response, please? As I said, I have 50 variables. In the first step I fit glm(y~all variables,family=binomial,data=mydat), then looked at the summary, and all 50 variables were significant. Then I ran stepwise selection; it selected only 27 of the 50 variables, but none of them was significant. What is the conclusion then? For example, if a variable x was significant and after stepwise selection is non-significant, should I say that x influences the qualitative y or not? Thanks in advance. – prep Dec 06 '16 at 17:59
@prep is the full model what gave you the error message: "fitted probabilities numerically 0 or 1 occurred"? – Matthew Drury Dec 06 '16 at 19:11
@MatthewDrury Yes, the full model gave me this message, and the stepwise model also gave me 50 warnings: "algorithm did not converge" and "fitted probabilities numerically 0 or 1 occurred". Thanks a lot for any help. – prep Dec 07 '16 at 09:45
What should I do, please? – prep Dec 09 '16 at 10:08
Given your error messages, nothing reported from your regression is usable. Please look up "perfect separation" on the site; this is probably what is causing your convergence/identification issues. – Matthew Drury Dec 09 '16 at 16:33
@matthew I looked at the posts but could not identify the cause. In addition, I computed a confusion matrix for both models: for the full model the accuracy was 0.99, and for the stepwise model containing the non-significant variables the accuracy was 1. What can I conclude or correct, please? – prep Dec 09 '16 at 19:42
@prep We're at the point in this conversation where you should ask a separate question. Please post as much detail as you can, including data, code, and a complete description of your intent. – Matthew Drury Dec 09 '16 at 19:57
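
As a rough illustration of the "perfect separation" mentioned in the comments above (an added sketch on hypothetical toy data, not the OP's data), a single predictor that splits the two classes cleanly is enough to trigger the same kind of warning:

```r
# Toy data (hypothetical): x perfectly separates the two classes
x <- c(-5:-1, 1:5)
y <- as.integer(x > 0)

# Collect the warnings instead of letting them print
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(msgs)                # warnings raised while fitting
print(coef(summary(fit)))  # estimates, standard errors, Wald p-values
```

In a fit like this the slope estimate and its standard error are both very large and the Wald p-value is close to 1, even though `x` predicts `y` perfectly. That may be how very high accuracy and "non-significant" coefficients can appear together, but whether it explains the OP's results cannot be told without the data.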

score 5 · Answer 2 · answered Dec 04 '16 at 14:42

Without penalizing for the variable selection algorithm, your results are very likely to be overstated and misleading: P-values will be too low and confidence intervals too narrow.

Frank Harrell


Hi Frank, I don't understand. What shall I do? – prep Dec 04 '16 at 14:51
Avoid stepwise regression until you understand all the ramifications of data dredging. My course notes go into some detail: http://biostat.mc.vanderbilt.edu/rms – Frank Harrell Dec 04 '16 at 14:52
But the use is a must for me! How can I get 0.05 p-values, meaning significant variables, using stepwise? Thank you in advance. – prep Dec 04 '16 at 14:57
Learning any field takes time. I would not try to be an expert in your field with a few minutes of study. – Frank Harrell Dec 04 '16 at 15:02
Yes, I know that it takes time, but I need a solution for my problem. What is the meaning of "penalizing for the variable selection algorithm"? – prep Dec 04 '16 at 15:18
E.g. using penalized maximum likelihood estimation such as lasso or elastic net. – Frank Harrell Dec 04 '16 at 15:23
How can the use of stepwise regression be "a must"? One usually starts with a question to be investigated and then identifies an appropriate statistical method. Starting with a method (that also happens to be inappropriate for any other purpose than somehow generating p<0.05 no matter whether it means anything) is the wrong way around. – Björn Dec 05 '16 at 11:09
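
For the penalized estimation suggested in the comments above (lasso or elastic net), a minimal sketch might look like the following. It is an added illustration that assumes the `glmnet` package is installed; the simulated data, the "true" effects of `x1` and `x2`, and the name `selected` are all hypothetical.

```r
set.seed(1)
n <- 200
p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))

# Hypothetical truth: only x1 and x2 affect the outcome
y <- rbinom(n, 1, plogis(1.5 * x[, "x1"] - 1.5 * x[, "x2"]))

selected <- character(0)
if (requireNamespace("glmnet", quietly = TRUE)) {
  # alpha = 1 is the lasso; 0 < alpha < 1 gives an elastic net
  cvfit <- glmnet::cv.glmnet(x, y, family = "binomial", alpha = 1)

  # Coefficients at the more heavily penalized "lambda.1se" choice
  b <- as.matrix(coef(cvfit, s = "lambda.1se"))
  selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
}
print(selected)  # predictors with non-zero penalized coefficients
```

Here the shrinkage is part of the estimation itself, with the penalty strength chosen by cross-validation. Note that the lasso does not by itself produce valid p-values for the selected variables, so it is not a route to "p < 0.05" either.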
