Pros/Cons of recoding ordinal/nominal variables to target mean for logistic regression?

Question

Say I have an independent variable with the following relationship to the binary dependent variable, DV:

 ___________________________________________________________________________
|verx_s                          |  # Recs |    % Recs   |  # DV   | DV Rt  |
|________________________________|_________|_____________|_________|________|
|0                               |   75,700|      6.4467%|      941| 1.243% |
|1                               |  277,129|     23.6009%|    1,471| .5308% |
|2                               |   51,662|      4.3996%|      219| .4239% |
|3                               |  769,737|     65.5526%|    2,269| .2948% |
|All                             |1,174,228|    100.0000%|    4,900| .4173% |
|________________________________|_________|_____________|_________|________|

It's common practice at my company to recode the values of verx_s to the value of DV Rt and treat it as a continuous variable when fitting a logistic regression. Confidence intervals are not important. All we care about is whether the model validates on an out-of-time sample. Is there anything inherently wrong with taking this shortcut?
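To make the recoding concrete, here is a minimal sketch in plain Python, using only the counts from the table above. The names `counts`, `dv_rate` and `recode` are illustrative, not part of any existing codebase.

```python
# Counts copied from the table above: verx_s level -> (records, DV events).
counts = {
    0: (75_700, 941),
    1: (277_129, 1_471),
    2: (51_662, 219),
    3: (769_737, 2_269),
}

# Target-mean recoding: each level is replaced by its observed DV rate.
dv_rate = {level: events / n for level, (n, events) in counts.items()}


def recode(levels):
    """Map raw verx_s levels to their DV rates (hypothetical helper)."""
    return [dv_rate[v] for v in levels]


# Overall DV rate, for comparison with the "All" row of the table.
total_n = sum(n for n, _ in counts.values())
total_events = sum(e for _, e in counts.values())
overall = total_events / total_n

for level, rate in dv_rate.items():
    print(f"verx_s={level}: DV rate = {rate:.4%}")
print(f"overall: {overall:.4%}")
```

The recoded column then enters the model as a single continuous predictor, so the four levels share one slope.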

It should also be mentioned that in most cases the independent variable is crafted in such a way that it makes intuitive sense to our customers. Therefore the ordering of the target mean is important, which is why we can't use simple dummy variables.

I don't understand. If DVRt is the dependent variable and verx_s is the independent variable, then you don't need logistic regression. But you can't recode your IV to match your DV! That would make no sense. It would help to provide context here. What are these variables? What are you trying to do? — Peter Flom, Apr 09 '13 at 21:33
My apologies, DV is the dependent variable. It is a 1,0 field and DV Rt is the mean of DV in each bucket of verx_s. Verx_s is a candidate variable for the logistic regression model and it is ordinal. I am trying to predict a binary response. — Zelazny7, Apr 09 '13 at 21:43
Kneejerk reactions (in order of importance): (1) It seems fishy to let the IV coding be so intimately tangled with the DV - danger of overfit. (2) You say the ordering of the IV is important, but your method can re-order its categories by the average value of the DV in each. (3) No account's taken of other IVs or the intercept - wouldn't recoding as odds be more sensible for a logistic link? — Scortchi - Reinstate Monica, Apr 09 '13 at 22:19
Thanks for your questions. The `verx_s` variable is the result of several crossings and interactions with other fields. It is a compound attribute that has been created in an intuitive manner (for the consumer credit risk world). The relationship HAS to be monotonic for legal reasons. A person has a right to know why they are denied credit, for example, and the monotonic relationship supports that kind of disclosure. Your point about recoding as odds is one I've brought up to my colleagues. The pros/cons of such choices are what I'm interested in from a theoretical perspective. Thanks! — Zelazny7, Apr 09 '13 at 23:36

score 4 · Accepted Answer · answered Apr 10 '13 at 07:55

If I understand you correctly, you are assigning values to categories of your explanatory/independent/right-hand-side/x variable based on average values of your explained/dependent/left-hand-side/y variable. I suspect that the purpose of that exercise is to assign values to the categories of verx_s such that the linear effect of the resulting variable in your model is (close to) maximal. If that is the case, then what you are looking for is a sheaf coefficient (Heise 1972). However, for one categorical or ordinal variable this just boils down to a different way of presenting the results you would get by adding your variable to the model as a set of indicator (dummy) variables. If all you care about is out-of-sample prediction, then the easiest way to achieve your goal is to just add verx_s as a set of indicator variables.
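As a rough illustration of the indicator-variable route: when verx_s is the only predictor, the dummy-coded logistic model is saturated, so its maximum-likelihood coefficients can be written down directly from the category rates without running an optimizer. The sketch below assumes the counts from the question's table and takes level 0 as the reference category (an arbitrary choice).

```python
import math

# verx_s level -> (records, DV events), from the question's table.
counts = {
    0: (75_700, 941),
    1: (277_129, 1_471),
    2: (51_662, 219),
    3: (769_737, 2_269),
}


def logit(p):
    return math.log(p / (1 - p))


log_odds = {k: logit(e / n) for k, (n, e) in counts.items()}

# Saturated dummy-coded model: the intercept is the log-odds of the reference
# level, and each coefficient is the difference in log-odds from that level.
reference = 0
intercept = log_odds[reference]
dummy_coef = {k: log_odds[k] - intercept for k in counts if k != reference}

for k, b in dummy_coef.items():
    print(f"verx_s={k}: coef = {b:+.3f}, odds ratio vs level {reference} = {math.exp(b):.3f}")
```

With other predictors in the model these closed forms no longer hold, and the coefficients have to be estimated jointly.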

As Peter Flom already suggested, you could try and do some programming to impose monotonicity, but the resulting program will in all likelihood fail to converge or converge at unreasonable values when the pattern in the data is not monotonic.

Heise, David R. 1972. "Employing nominal variables, induced variables, and block variables in path analysis." Sociological Methods & Research 1(2): 147-173.

score 3 · Answer 2 · answered Apr 09 '13 at 21:52

You can use dummy codes with ordinal independent variables. The effect may not be monotonic, but that's OK; in fact, it may be revealing.

I am not aware of any standard methods that impose a monotonic relationship with an ordinal independent variable; there may be some. You could also probably write some function that would do it, if you are ingenious enough (e.g. in SAS, using `PROC NLMIXED`, or programming something in `R`). I wouldn't recommend that, however.

Recoding an ordinal variable to a continuous one is possible, too, but it needs substantive justification (and should probably be tested with sensitivity analysis). I have done this, e.g. with specified Likert variables such as

0 - Never
1 - Once a week
2 - 2 or 3 x a week
3 - Daily
4 - Twice a day

or something like that, which could be recoded to "times per week" and then 0 = 0, 1 = 1, 2 = 2.5, 3 = 7 and 4 = 14.
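A minimal sketch of that kind of substantively justified recoding, in Python; the mapping is exactly the one listed above, while the names `TIMES_PER_WEEK` and `to_times_per_week` are made up for illustration.

```python
# Ordinal code -> approximate times per week, as described above.
TIMES_PER_WEEK = {0: 0.0, 1: 1.0, 2: 2.5, 3: 7.0, 4: 14.0}


def to_times_per_week(codes):
    """Recode ordinal frequency codes to a numeric times-per-week scale."""
    return [TIMES_PER_WEEK[c] for c in codes]


print(to_times_per_week([0, 2, 4, 3]))
```

The point is that the spacing comes from the meaning of the categories, not from the dependent variable.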

Peter Flom

You don't necessarily want to impose a monotonic relationship, but if you have a lot of categories, a smooth relationship might be nice. You can penalize differences in the coefficients of adjacent categories: http://cran.r-project.org/web/packages/ordPens/ordPens.pdf – Scortchi - Reinstate Monica Apr 09 '13 at 22:00
Thanks, Peter. Doesn't the recoding to mean take care of the kind of transformation you describe in your example? By coding to the mean DV, a natural distance arises between the levels of the variable. That's why my boss uses the method. I'm curious if it's a valid approach or if it's not sound. Your point about monotonicity is valid, but unfortunately we are governed by the Fair Credit Reporting Act and "counter-intuitive" relationships cannot be used. (Even though the aggregate effect of a variable might trend the right way, we still can't use it if it comes in "backwards".) – Zelazny7 Apr 09 '13 at 23:41
Point 1: Maybe I am missing something, but you can't code the IV to the mean DV! Isn't that assuming a perfect relationship? Point 2: If counter-intuitive results can't be used, I would refuse to do the work and, if this is for the government, I'd make a stink. – Peter Flom Apr 09 '13 at 23:45

score 2 · Answer 3 · answered Apr 11 '13 at 10:25

This approach seems more futile than wrong. Why not go the whole hog & recode the independent variable as the log-odds of the dependent variable in each category? Then you have a perfect linear relationship, with all coefficients unity & zero intercept. But you haven't actually gained anything over using dummy variables; you've just re-described the relationship in a confusing way, & any apparent parsimony is only apparent.
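The log-odds claim can be checked numerically. The sketch below (plain Python, counts taken from the question's table, function names my own) recodes each level to its empirical log-odds and evaluates the gradient of the binomial log-likelihood at intercept 0 and slope 1. Because the logistic log-likelihood is concave, a zero gradient there means that point is the maximum-likelihood fit.

```python
import math

# verx_s level -> (records, DV events), from the question's table.
counts = {
    0: (75_700, 941),
    1: (277_129, 1_471),
    2: (51_662, 219),
    3: (769_737, 2_269),
}


def logit(p):
    return math.log(p / (1 - p))


def inv_logit(z):
    return 1 / (1 + math.exp(-z))


# Recode each level to the log-odds of the DV within that level.
x = {k: logit(e / n) for k, (n, e) in counts.items()}


def score(intercept, slope):
    """Gradient of the binomial log-likelihood for logit(p) = intercept + slope * x."""
    g_intercept = g_slope = 0.0
    for k, (n, e) in counts.items():
        resid = e - n * inv_logit(intercept + slope * x[k])
        g_intercept += resid
        g_slope += resid * x[k]
    return g_intercept, g_slope


g0, g1 = score(0.0, 1.0)
print(f"gradient at (intercept=0, slope=1): ({g0:.2e}, {g1:.2e})")
```

Both components come out at zero up to floating-point error: the fit is perfect because the recoded variable already contains the answer.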

I can't follow the argument about monotonicity. If the DV should (in whatever sense of 'should') change monotonically with the IV as originally coded, & doesn't, it's simply evading the issue to create a different coding which re-orders the IV's categories. And of course if it does change monotonically with the IV as originally coded, there's no problem in using dummy variables.
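A small sketch of that point, in Python. The first list holds the DV rates from the question's table in the original verx_s order; the `bumpy` list is a made-up non-monotone pattern used only for illustration.

```python
# DV rates by verx_s level, in the original ordinal order (from the question's table).
rates = [941 / 75_700, 1_471 / 277_129, 219 / 51_662, 2_269 / 769_737]


def is_monotone(values):
    """True if values never increase, or never decrease, along the given order."""
    pairs = list(zip(values, values[1:]))
    return all(a >= b for a, b in pairs) or all(a <= b for a, b in pairs)


original_ok = is_monotone(rates)

# Hypothetical rates with a reversal between the second and third levels.
bumpy = [0.012, 0.004, 0.006, 0.003]
bumpy_ok = is_monotone(bumpy)

# Recoding to the target rate amounts to ordering the levels by that rate,
# which is monotone by construction and so hides the reversal.
recoded_ok = is_monotone(sorted(bumpy))

print(original_ok, bumpy_ok, recoded_ok)
```

For the table in the question the original coding is already monotone, so dummy variables would preserve the ordering anyway.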

Scortchi - Reinstate Monica

Your point about the log-odds is well taken and one I've raised with my colleagues in the past. I think we've settled on this approach because it forces the monotonic relationship. The audience for these models is not statisticians but consumers and regulators. Therefore, "should" in reference to monotonicity means I can point to a variable and say, you lost this many points because you had this condition. They aren't going to understand the subtleties of interactions and aggregate effects. – Zelazny7 Apr 11 '13 at 13:32
(1) My point about the log-odds was meant more as a reductio ad absurdum than as a suggestion. (2) What's the difference between pointing to an independent variable that's been recoded to be monotonic with the dependent variable, and pointing straight at the dependent variable? – Scortchi - Reinstate Monica Apr 11 '13 at 18:29
