How to categorize a predictor measured on a continuous scale?

Asked Sep 21 '15 at 12:47

How can I select the optimal number of categories to represent a continuous predictor in a single-variable linear regression model?

I constructed this scatter plot:

    import matplotlib.pyplot as plt
    # train holds the predictor in 'pred' and the response in 'resp'
    plt.scatter(train['pred'], train['resp'])
    plt.show()

But the plot does not suggest any obvious cut points. Is there an "automated" way to choose them?

Klausos

See also [Best way to bin continuous data](http://stats.stackexchange.com/q/166592/17230). But don't do it: see [What is the benefit of breaking up a continuous predictor variable?](http://stats.stackexchange.com/q/68834/17230). – Scortchi - Reinstate Monica Sep 21 '15 at 12:56
@Scortchi: Do you know of a way to do this in Python rather than R? – Klausos Sep 21 '15 at 13:05
@Scortchi: As I see it, binning can improve the model's accuracy by reducing noise or by helping to capture nonlinearity. That's why I'm asking. – Klausos Sep 21 '15 at 13:09
I do discuss those misconceptions in the post I linked to. I don't know of any Python implementations of optimal binning algorithms, but you'd be most likely to find one in the `scikit-learn` library. Or if there's one implemented in R that you like, call it with `rpy2`. – Scortchi - Reinstate Monica Sep 21 '15 at 13:17
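
For anyone who still wants to try the `scikit-learn` route mentioned above, here is a rough sketch of one possible approach. It is not an established "optimal binning" routine and not something proposed in this thread. It fits a one-variable regression tree, picks the number of leaves (bins) by cross-validation, and reads the split thresholds off as cut points. The function name `tree_cut_points`, the candidate bin counts, the minimum bin size, and the synthetic data standing in for `train['pred']` and `train['resp']` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor


def tree_cut_points(x, y, candidate_bins=(2, 3, 4, 5, 6, 8), min_bin_size=20):
    """Return sorted cut points for x, chosen by a cross-validated regression tree.

    Hypothetical helper: the candidate bin counts and the minimum bin size
    are illustrative defaults, not recommendations.
    """
    X = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float)
    search = GridSearchCV(
        DecisionTreeRegressor(min_samples_leaf=min_bin_size, random_state=0),
        param_grid={"max_leaf_nodes": list(candidate_bins)},  # leaves == bins
        cv=5,
        scoring="neg_mean_squared_error",
    )
    search.fit(X, y)
    tree = search.best_estimator_.tree_
    # Internal nodes split on feature 0; leaves are marked with feature == -2.
    return np.sort(tree.threshold[tree.feature >= 0])


# Synthetic stand-in for train['pred'] and train['resp'] (made-up data).
rng = np.random.default_rng(0)
pred = rng.uniform(0, 10, size=500)
resp = np.where(pred < 3, 1.0, np.where(pred < 7, 4.0, 2.0))
resp = resp + rng.normal(0, 0.3, size=500)

cuts = tree_cut_points(pred, resp)
bins = np.digitize(pred, cuts)  # integer category label per observation
print(cuts)
```

The cut points are tuned to the response, so a regression fitted afterwards on the resulting categories will look better on the same data than it really is. Choosing the cuts on one part of the data and evaluating on another would give a more honest comparison against keeping the predictor continuous.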
