I am currently doing an analysis for my Master's thesis and have encountered some results I cannot explain.
In my paper, I am trying to explore which factors determine whether or not people joined a local energy initiative. Since I have a lot of different variables, my instructor suggested a model-building approach: concretely, I add sets of predictors to my logistic regression and keep only those that are significant in the model before adding the next set. To assess model fit, I was told to use classification tables.
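To make the procedure concrete, here is a minimal sketch of one step (Python/statsmodels, with hypothetical column names such as "joined" and hypothetical predictor lists; not my actual code):

```python
import statsmodels.api as sm

def fit_logit(df, predictors, outcome="joined"):
    """Fit a logistic regression of `outcome` on `predictors`, with an intercept."""
    X = sm.add_constant(df[predictors])          # df: pandas DataFrame, one row per case
    return sm.Logit(df[outcome], X).fit(disp=0)

def keep_significant(model, alpha=0.05):
    """Return the names of predictors whose Wald p-value is below alpha (intercept excluded)."""
    pvals = model.pvalues.drop("const")
    return list(pvals[pvals < alpha].index)

# Step 1: neighbourhood dummies only.
# m1 = fit_logit(df, neighbourhood_dummies)
# Step 2: add the next set, then retain only the significant predictors.
# m2_full = fit_logit(df, neighbourhood_dummies + individual_vars)
# kept = keep_significant(m2_full)
# m2 = fit_logit(df, neighbourhood_dummies + kept)
```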
My problem now is the following:
I start with a set of dummies to control for participants coming from different neighbourhoods. This basic model classifies 56% of cases correctly. Then I add the second set of predictors; some of them are significant, so I keep those in the model. But when I look at the classification table again, the classification has gotten worse, even worse than chance (48%).
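For reference, by "classification table" I mean the usual cross-tabulation of observed versus predicted membership at a 0.5 cutoff. A sketch of how I get the percentage correct (hypothetical names, continuing the snippet above; not my actual code):

```python
import numpy as np
import pandas as pd

def classification_table(model, X, y, cutoff=0.5):
    """Cross-tabulate observed vs. predicted membership at `cutoff` and
    return the table plus the percentage of cases classified correctly."""
    pred = (np.asarray(model.predict(X)) >= cutoff).astype(int)
    table = pd.crosstab(np.asarray(y), pred, rownames=["observed"], colnames=["predicted"])
    pct_correct = float((pred == np.asarray(y)).mean()) * 100
    return table, pct_correct

# Hypothetical use (X must include the constant column, as in the fitting step):
# table1, pct1 = classification_table(m1, sm.add_constant(df[neighbourhood_dummies]), df["joined"])
# table2, pct2 = classification_table(m2, sm.add_constant(df[neighbourhood_dummies + kept]), df["joined"])
```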
How can I find significant predictors and yet end up with a model that classifies worse than chance?
EDIT FOR ADDITIONAL INFO:
My dataset consists of 636 cases: 318 are participants in the initiative and 318 are not. The sets of variables I use are structured as follows:
1) "Control": People come from 30 different neighbourhoods, so I added 29 dummy variables to control for differences due to neighbourhood membership (not the best approach, I know, but I´m just following orders on this one)
2) Individual predictors: 15 demographic and psychological variables
3) Assessment of group predictors: 8 variables that measure how individuals perceive the group of potential participants
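Regarding 1): with 30 neighbourhoods, 29 dummies correspond to standard reference coding. A toy sketch of what that looks like (not my actual code or data):

```python
import pandas as pd

# Toy example of reference-coding a neighbourhood variable into k-1 dummies.
df = pd.DataFrame({"neighbourhood": ["A", "B", "C", "A", "C"]})
dummies = pd.get_dummies(df["neighbourhood"], prefix="nbhd", drop_first=True)  # "A" becomes the reference level
df = pd.concat([df, dummies], axis=1)
neighbourhood_dummies = list(dummies.columns)  # ["nbhd_B", "nbhd_C"]; with 30 neighbourhoods this gives 29 dummies
```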
I computed the classification tables on the same data that I used for building the model. Unfortunately, I only have this one dataset, and I'm trying to figure out which predictors are most promising for future (causal) research.