I have around 40,000 records of continuous weight data (ranging from 0.15 kg to 5300 kg). The data is right-skewed.

I tried binning the data with fixed-width bins. The binned data is used as input for a binary classifier.

Please tell me the best way to approach the binning.

[Image: Binning method I tried] [Image: Binned data]

Rahul
  • 7
    [Don't bin your continuous data at all.](https://stats.stackexchange.com/q/68834/1352) Feed them into your algorithm as-is; potentially transform them using (e.g.) restricted cubic splines (see, e.g., Frank Harrell's *Regression Modeling Strategies*) to capture any nonlinearity. – Stephan Kolassa Oct 20 '17 at 11:48
  • 4
    Orthogonal to the question, but still likely to be practical. (a) Look at these data on logarithmic scale. Here max/min gives >4 orders of magnitude. (b) If you must have a histogram, make sure the bars touch (because the bins do). (c) You can have more bins than this with your sample size. – Nick Cox Oct 20 '17 at 12:54

1 Answer

You should not bin continuous features in a regression model; doing so only reduces the model's ability to fit the data.

I discuss this extensively here, where I compare binning with other methods of capturing non-linearity in regression modeling. The high-level summary is:

Binning, while conceptually simple, was seen to suffer from a few issues in comparison to the other methods:

With a small number of estimated parameters, the binned regression suffered from a higher bias than its competitors. It was seen across multiple experiments to achieve its minimal hold-out error at a larger number of estimated parameters than the other methods, and furthermore, often the minimal error achieved by the binning method was larger than the minimum achieved by the other methods.

The binned regression's hold-out error often had a higher variance than the other methods'. This means that, even if it performed just as well as another method on average, any individual binned regression is less trustworthy than one fit with another method.

Additionally, the binned regression method has the disadvantage of producing discontinuous functions, while we expect most processes we encounter in nature or business to vary continuously in their inputs. This is philosophically unappealing, and also accounts for some of the bias seen when comparing the binning regressions to the other basis expansions.
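As a rough illustration of the bias and discontinuity points, here is a minimal sketch on synthetic data. It is not the asker's data or the experiments summarized above. The log-normal weights, the `true_prob` relationship, the choice of 10 bins, and all function names are assumptions made up for the example. It fits a fixed-width binned model (the per-bin event rate, which is what a logistic regression on one-hot bin indicators would estimate) and a two-parameter logistic regression on `log10(weight)`, then compares their predictions at two weights.

```python
import math
import random

random.seed(0)
LO, HI, N_BINS = 0.15, 5300.0, 10  # weight range from the question; 10 bins is an assumption


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def true_prob(w):
    # Hypothetical ground truth: P(y = 1) rises smoothly with log10(weight).
    return sigmoid(1.5 * math.log10(w) - 2.0)


# Synthetic right-skewed weights, clipped to the stated range.
weights = [min(max(random.lognormvariate(3.0, 1.5), LO), HI) for _ in range(40000)]
labels = [1 if random.random() < true_prob(w) else 0 for w in weights]

# Fixed-width binning: each bin predicts its own observed event rate.
width = (HI - LO) / N_BINS


def bin_index(w):
    return min(int((w - LO) / width), N_BINS - 1)


counts = [0] * N_BINS
events = [0] * N_BINS
for w, y in zip(weights, labels):
    b = bin_index(w)
    counts[b] += 1
    events[b] += y
overall_rate = sum(labels) / len(labels)
bin_rates = [events[b] / counts[b] if counts[b] else overall_rate for b in range(N_BINS)]


def binned_predict(w):
    return bin_rates[bin_index(w)]


# Continuous alternative: logistic regression on log10(weight), fit by Newton's method.
xs = [math.log10(w) for w in weights]
b0 = b1 = 0.0
for _ in range(25):
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, labels):
        p = sigmoid(b0 + b1 * x)
        v = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += v
        h01 += v * x
        h11 += v * x * x
    det = h00 * h11 - h01 * h01
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    b0 += d0
    b1 += d1
    if abs(d0) + abs(d1) < 1e-10:
        break


def smooth_predict(w):
    return sigmoid(b0 + b1 * math.log10(w))


print("share of records in first bin:", counts[0] / len(weights))
for w in (20.0, 500.0):
    print(w, "kg  binned:", binned_predict(w), " continuous:", smooth_predict(w))
```

Under these assumptions, almost all records land in the first fixed-width bin, so the binned model assigns a 20 kg item and a 500 kg item the same probability, while the continuous fit separates them using only two parameters. Restricted cubic splines on the log-transformed weight would be the natural next step if the relationship is not linear on that scale.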

Just don't do it.

Matthew Drury
  • The question is not about regression. It says "data is used as input for binary classifier." – G5W Feb 28 '20 at 15:39
  • 1
    I was using regression in the sense of "logistic regression", so a parametric predictive model for the conditional probability of the target. – Matthew Drury Feb 28 '20 at 20:56