I used a plain multivariable logistic regression (the default you get from glm() with the logit link in R) on a binary classification problem with approximately 100 predictors, i.e. quite a lot of variables in the regression. My approach was quite naive; I just wanted to see some quick classification results, and they were not bad: about 85% classification accuracy.
Many of the variables' coefficients have reasonably low p-values (<0.05), but quite a few have p-values >0.6, sometimes even 0.8. Intuitively I would tell myself: never mind those variables, get rid of them, because you can't be sure their estimated coefficients are even approximately correct. But when I removed a couple of the variables with high coefficient p-values (e.g. >0.5) at random, the resulting classification accuracy dropped, even on the validation dataset.
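For concreteness, here is a minimal sketch of what I did, with simulated data standing in for my real dataset (the sample size, the 20 predictors, and the assumption that only a few of them truly matter are all made up for the example; my actual data has ~100 predictors):

```r
# Simulated stand-in for my real dataset: 20 predictors, only the first 5 matter
set.seed(42)
n <- 500; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 5), rep(0, p - 5))
y <- rbinom(n, 1, plogis(X %*% beta))
dat <- data.frame(y = y, X)

# Full model: logistic regression on all predictors
full <- glm(y ~ ., data = dat, family = binomial)

# Coefficient p-values (column 4 of the coefficient table; drop the intercept row)
pvals <- summary(full)$coefficients[-1, 4]

# Reduced model: refit without predictors whose p-value exceeds 0.5
keep <- names(pvals)[pvals <= 0.5]
reduced <- glm(reformulate(keep, response = "y"), data = dat, family = binomial)

# In-sample classification accuracy at a 0.5 probability cutoff
acc <- function(m) mean((predict(m, type = "response") > 0.5) == dat$y)
c(full = acc(full), reduced = acc(reduced))
```

(In my real setup I also evaluated accuracy on a held-out validation set rather than only in-sample, as above.)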
Should I then keep those variables in the model just for the sake of classification accuracy, even though their coefficients are quite likely zero (or simply quite different from what the model estimated)? Why is this happening? Am I overfitting the model? Or am I just worrying over nothing because there is something about properly interpreting p-values that I don't know? Or is it a coincidence of my particular training/testing split, or does this (removing presumably insignificant variables and seeing classification accuracy deteriorate) happen often in similar situations?
Thank you very much in advance.