I have a sample of almost 15,000 cases. The dependent variable is dichotomous, stating whether the patient has the disease or not: Yes = 1, No = 0. I also have 12 independent variables, some continuous and some dichotomous. My first question is: before I apply the logistic regression, I need to know which parameters have a great impact on the DV; how do I do that? Secondly, there are 7 different methods of running binary logistic regression, like Enter, Forward Conditional, Forward LR, etc.; what is the basic difference between these? I couldn't find anything about this on the internet.

- You don't need to know beforehand which have a great impact on the dependent variable - that's what the regression's for. How many "Yes"s have you? Too few compared to the degrees of freedom for independent variables & there's danger of over-fitting. – Scortchi - Reinstate Monica Aug 05 '14 at 10:57
- @Scortchi Thanks for your reply. I have 1240 Yes's and 13527 No's; shall I reduce my sample size to an equal no. of Yes's and No's? Please help, I really need some guidance – Zehra Aug 06 '14 at 01:51
- (1) The usual rules of thumb suggest 10 - 20 observations in the minority class (here that's "Yes"s) per degree of freedom for the regression should avoid serious overfitting. Unless you've lots of interactions or non-linear effects you're well within that guideline. Fit the full model & validate it. (2) If you've already gone to the trouble of measuring the independent variables on all 14,767 then don't throw any data away. If you haven't then (depending on how troublesome it will be) you might want to knock off after 2 to 6 thousand or so "No"s. – Scortchi - Reinstate Monica Aug 06 '14 at 09:07
- (3) I'd guess you're using SPSS: "Enter" refers to fitting a model with the independent variables you specify; the other options carry out variable selection by [stepwise methods](http://stats.stackexchange.com/questions/20836/). I'd strongly suggest not carrying out any kind of automatic variable selection based on relationship with the dependent variable unless you have a clear idea what you want to get out of it, & know how to check whether you got that or not, as well as understanding the disadvantages. – Scortchi - Reinstate Monica Aug 06 '14 at 09:39
- @Scortchi (1) There are non-linear effects, so I guess it should fit in. (2) I can easily knock off 2-6 thousand of the No's, as I did not take any trouble gathering them. (3) Yes, I'm using SPSS; what if I use the 'Enter' and 'Stepwise' methods and publish results of both? I'm a PhD student and new to regression, so please excuse my ignorance. Thanks anyway – Zehra Aug 07 '14 at 02:09
- (1) Doesn't make sense. Do you mean "there are only a few non-linear effects"? In any case work out the total degrees of freedom used for regression. (2) The point's whether you already have all the data or not. If so there's no sense in discarding any. If not there's diminishing returns to collecting more "No"s. (3) If you're new to regression you need to study it before using it in earnest. Read [Steyerberg (2009), *Clinical Prediction Models*](http://www.clinicalpredictionmodels.org/) then [Harrell (2001), *Regression Modelling Strategies*](http://biostat.mc.vanderbilt.edu/wiki/Main/RmS). – Scortchi - Reinstate Monica Aug 07 '14 at 10:08
- @Scortchi Consider answering this question with your comments so it counts as answered "([When _shouldn't_ I comment?](https://stats.stackexchange.com/help/privileges/comment): Answering a question or providing an alternate solution to an existing answer; instead, post an actual answer (or edit to expand an existing one)" :) – Firebug Jan 29 '18 at 15:30
1 Answer
The methods of "doing the regression" that you refer to (with SPSS terminology) are really methods of variable selection. You shouldn't do any kind of automatic variable selection: just use all your candidate variables in the model, and let them stay there. Using the data for variable selection has a very poor track record of actually finding the correct variables, and it invalidates the subsequent inference; this has been discussed very thoroughly elsewhere on this site.
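To make "use all your candidate variables" concrete, here is a minimal sketch in Python with statsmodels that fits the full model in one step (the same idea as SPSS's "Enter" method). The file name and column names (`disease` and the predictor list) are hypothetical placeholders, and the dichotomous predictors are assumed to already be coded 0/1:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("patients.csv")                        # hypothetical file, ~15,000 rows
candidates = [c for c in df.columns if c != "disease"]  # all 12 candidate predictors

X = sm.add_constant(df[candidates])       # design matrix with an intercept
fit = sm.Logit(df["disease"], X).fit()    # dependent variable: 1 = disease, 0 = no disease
print(fit.summary())                      # coefficients, standard errors and Wald tests for every variable
```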
The regression model itself tells you which variables are important and which are not. For logistic regression it is often advised that the effective number of observations $N$ (the number of observations in the minority class) should be at least 15 times the number of parameters, in line with the rules of thumb mentioned in the comments above. With 15,000 observations that should not be a problem for you, even with only 10% in the minority class; with 1% in the minority class you could have problems. Just fit the full model and validate it!
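As a quick sanity check of that rule of thumb with the counts reported in the comments (1,240 "Yes" cases and 12 candidate predictors, assuming no interaction or spline terms yet):

```python
# Events-per-parameter check: minority-class size divided by candidate
# regression degrees of freedom.
n_events = 1240
n_params = 12
print(n_events / n_params)   # ~103 events per parameter, comfortably above the suggested 15
```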
You say you have some non-linearities. That can be handled by representing the variables that act nonlinearly with splines; for instance, I would always represent age via splines. See also the useful information in the comments.
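As an illustration of the spline idea, here is a small sketch using statsmodels' formula interface, where patsy's `bs()` expands a continuous predictor into a B-spline basis so its effect need not be linear. The file and the column names (`disease`, `age`, `sex`, `bmi`) are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("patients.csv")   # hypothetical data file

# bs(age, df=4) replaces a single linear age term with a 4-df B-spline basis,
# letting the log-odds of disease vary smoothly and nonlinearly with age.
spline_fit = smf.logit("disease ~ bs(age, df=4) + sex + bmi", data=df).fit()
print(spline_fit.summary())
```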
