
Recently, whilst doing an assignment with the PIMA Diabetes data set, I ran logistic regression using, amongst other approaches:

  • the age predictor as is
  • the age segmented into ranges and one-hot encoded (with and without scaling).

There was a slight performance increase in some of the trained and validated models when using OHE.

My question is: why would OHE be better than the (scaled) age predictor? I cannot find a suitable explanation.
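For concreteness, here is a minimal numpy sketch of the binning-plus-OHE encoding described above (the cut points 30/45/60 are hypothetical, not the ones used in the assignment):

```python
import numpy as np

ages = np.array([22, 35, 47, 58, 63, 71])
bins = np.array([30, 45, 60])          # hypothetical cut points
idx = np.digitize(ages, bins)          # bin index for each age
onehot = np.eye(len(bins) + 1)[idx]    # one dummy column per bin
```

Each age then contributes exactly one non-zero dummy column instead of its raw value.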

kjetil b halvorsen
thebluephantom
  • It is probably an artifact of the data set and the variables you used. Fine-grained continuous variables will always be better than OHE dummies -- particularly for updating the weights. – John Stud Jan 25 '21 at 16:25
  • OK, so why is it that many examples and courses attempt to do this then? I agree with you. @JohnStud – thebluephantom Jan 25 '21 at 16:28
  • 1
    Not sure. It makes little theoretical sense, on those grounds alone, to "throw out" continuous data in exchange for a dummy. Modeling decisions should always be justified, as we can predict most data sets that we have by just generating enough random Xs. – John Stud Jan 25 '21 at 16:32
  • @JohnStud but you are not giving an answer – thebluephantom Jan 25 '21 at 16:36
  • 1
    That's right, I am giving you a comment. – John Stud Jan 25 '21 at 16:37

1 Answer


If the relationship between the predictor and the target is not linear in log-odds, then binning and one-hot encoding may perform better, because the model gets to learn a separate, unconstrained weight for each bin. This will be particularly noticeable if the true relationship is not monotonic. There is a tradeoff in the size/number of bins as well: too many narrow bins make it easy to overfit, but too few throw away too much of the continuous information.
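A small simulated illustration of this point, assuming a hypothetical U-shaped (non-monotonic) relationship between age and the log-odds, and using a plain gradient-descent logistic fit rather than any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(20, 80, n)
# Hypothetical U-shaped relationship: risk is high for both young and old
true_logit = 0.004 * (age - 50) ** 2 - 2.0
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(float)

def mean_log_loss(X, y, lr=0.1, steps=5000):
    """Fit logistic regression by plain gradient descent; return training log-loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    p = 1 / (1 + np.exp(-X @ w))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# (a) intercept + standardized continuous age: forced to be monotonic in age
X_lin = np.column_stack([np.ones(n), (age - age.mean()) / age.std()])
# (b) four age bins, one-hot encoded: a free per-bin intercept each
X_ohe = np.eye(4)[np.digitize(age, [35, 50, 65])]

loss_lin = mean_log_loss(X_lin, y)
loss_ohe = mean_log_loss(X_ohe, y)
```

On this simulated data the binned fit reaches a lower training log-loss, because the dummies can encode the U-shape while a single slope cannot; on data where the log-odds really are close to linear in age, the comparison flips.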

Probably a better approach is to fit a spline, which accommodates the nonlinearity without throwing away the information in the continuous predictor. There are several answers/comments here on the downsides of binning, e.g. "What is the benefit of breaking up a continuous predictor variable?"; see the related/linked questions and the tag for more.

Ben Reiniger