The direction of an effect may be of particular interest to the business, but that doesn't imply that a change in the direction of its estimate between training & test is a stronger indication of problems with the model selection & fitting process than any other change of similar magnitude. The test set would be expected to show some differences owing to randomness, plus an overall shrinkage of effect estimates, since it doesn't share the optimistic bias introduced by model selection on the training set; wildly different estimates, though, are a sign of problems.

It's usual to address such concerns using a validation set, on which you merely evaluate the performance of the final model without estimating anything further. Good performance metrics are in general proper scoring rules, though you may want to use e.g. the area under the receiver operating characteristic curve if only discrimination matters to you. (Note that unless your sample size runs into the thousands, cross-validation or bootstrap validation is a better approach.)
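For concreteness, here's a minimal sketch of that evaluation step in Python with scikit-learn. The simulated data, the logistic regression model, and the split proportions are placeholders for illustration only, not anything from your setup; the point is just that the held-out set is used purely for scoring, with proper scoring rules (Brier score, log loss) alongside AUC, and that cross-validation of the whole procedure replaces the single split when the sample is small:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score

# Placeholder data standing in for the real problem
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a validation set used *only* to score the final model
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

final_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_val = final_model.predict_proba(X_val)[:, 1]

print("Brier score:", brier_score_loss(y_val, p_val))  # proper scoring rule
print("Log loss:   ", log_loss(y_val, p_val))           # proper scoring rule
print("ROC AUC:    ", roc_auc_score(y_val, p_val))      # discrimination only

# With smaller samples, cross-validating the whole fitting procedure is less
# noisy than relying on a single held-out split
cv_brier = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                           scoring="neg_brier_score", cv=10)
print("10-fold CV Brier score:", -cv_brier.mean())
```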
Note also that model selection invalidates significance tests, so there's no reason to be performing them on the training set in the first place.
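To see why, here's a small simulation of my own (pure-noise data, nothing to do with your model): every predictor is unrelated to the outcome, yet if you select the predictor with the smallest p-value, that p-value falls below 0.05 far more often than the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k, n_sims = 100, 50, 2000
selected_p = []

for _ in range(n_sims):
    X = rng.standard_normal((n, k))
    y = rng.standard_normal(n)          # outcome unrelated to every predictor
    # p-value of the univariate correlation test for each predictor
    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(k)]
    selected_p.append(min(pvals))       # "model selection": keep the best one

selected_p = np.array(selected_p)
print("Share of selected p-values below 0.05:", (selected_p < 0.05).mean())
# well above the nominal 0.05, so the post-selection test is anti-conservative
```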
PS I think a more common, perhaps more correct, nomenclature is "train, validation, test" for my "train, test, validation"—sorry for any confusion.