
How can I select the optimal number of categories to represent a continuous predictor in a single-variable linear regression model?

I constructed this scatter plot:

import matplotlib.pyplot as plt
plt.scatter(train['pred'], train['resp'])
plt.show()


But the plot does not suggest an obvious number of bins. Is there an "automated" way to choose it?

Klausos
  • See also [Best way to bin continuous data](http://stats.stackexchange.com/q/166592/17230). But don't do it: see [What is the benefit of breaking up a continuous predictor variable?](http://stats.stackexchange.com/q/68834/17230). – Scortchi - Reinstate Monica Sep 21 '15 at 12:56
  • @Scortchi: Do you know the procedure in Python, not R? – Klausos Sep 21 '15 at 13:05
  • @Scortchi: To me, binning can improve accuracy of the model by reducing noise or helping model nonlinearity. That's why I'm asking. – Klausos Sep 21 '15 at 13:09
  • I do discuss those misconceptions in the post I linked to. I don't know of any Python implementations of optimal binning algorithms, but you'd be most likely to find one in the `scikit-learn` library. Or if there's one implemented in R that you like, call it with `rpy2`. – Scortchi - Reinstate Monica Sep 21 '15 at 13:17
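
One way to approximate what is being asked, following the `scikit-learn` pointer in the last comment: this is only a sketch, not an established "optimal binning" routine, and it does not address the objections to binning raised in the linked posts. It assumes a one-feature regression tree is an acceptable stand-in for supervised binning: the number of leaves plays the role of the number of categories and is picked by cross-validated R². The function name `choose_bins` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor


def choose_bins(x, y, max_bins=10, n_splits=5, random_state=0):
    """Pick a bin count by cross-validated R^2 of a one-feature regression tree.

    Hypothetical helper: each leaf of the tree is treated as one bin.
    """
    X = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float)
    # Shuffle the folds so that data sorted by x does not distort the scores.
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    scores = {}
    for k in range(2, max_bins + 1):
        tree = DecisionTreeRegressor(max_leaf_nodes=k, random_state=random_state)
        scores[k] = cross_val_score(tree, X, y, cv=cv, scoring="r2").mean()

    best_k = max(scores, key=scores.get)
    tree = DecisionTreeRegressor(max_leaf_nodes=best_k, random_state=random_state)
    tree.fit(X, y)
    # Internal nodes store a split threshold; leaves have a negative feature index.
    edges = np.sort(tree.tree_.threshold[tree.tree_.feature >= 0])
    return best_k, edges, scores
```

With the data from the question this could be called as `k, edges, scores = choose_bins(train['pred'], train['resp'])`, and `np.digitize(train['pred'], edges)` would then assign each observation to a bin. Comparing `scores` against the cross-validated R² of the unbinned linear fit is a reasonable check on whether binning helps at all.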

0 Answers