Questions tagged [binning]

Binning means grouping a continuous variable into discrete categories. It is particularly used in reference to histograms, but could also be used more generally in the sense of coarsening.

Various rules have been proposed for choosing the number of bins in a histogram. As is often the case, there is a tradeoff: with too many bins, the histogram is very bumpy and depends heavily on the particular data set; with too few, necessary detail is lost. This is discussed in this thread.

One problem with histograms is that different binning can result in histograms that appear quite different.
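
As a rough illustration of the two points above, the sketch below bins one synthetic sample in several ways. The exponential sample, the seed, and the choice of the Sturges, Scott and Freedman-Diaconis rules are assumptions made for the example rather than anything taken from the linked thread; `numpy.histogram_bin_edges` is used to apply each rule.

```python
import numpy as np

# Synthetic right-skewed sample; the seed and the size are arbitrary choices.
rng = np.random.default_rng(0)
data = rng.exponential(scale=1.0, size=200)

# Bin edges chosen by three common rules that NumPy implements by name.
edges_by_rule = {
    rule: np.histogram_bin_edges(data, bins=rule)
    for rule in ("sturges", "scott", "fd")
}

# Same data, different rule: the number of bins and the counts both change.
for rule, edges in edges_by_rule.items():
    counts, _ = np.histogram(data, bins=edges)
    print(f"{rule:8s} bins={len(edges) - 1:2d} counts={counts.tolist()}")

# A deliberately coarse and a deliberately fine binning of the same sample.
coarse_counts, _ = np.histogram(data, bins=3)
fine_counts, _ = np.histogram(data, bins=50)
print("coarse:", coarse_counts.tolist())
print("fine:  ", fine_counts.tolist())
```

Every histogram here accounts for the same 200 observations; only the bin edges differ, which is why the apparent shape of a histogram should not be read too literally.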

233 questions

116

votes

4 answers

Assessing approximate distribution of data based on a histogram

Suppose I want to see whether my data is exponential based on a histogram (i.e. skewed to the right). Depending on how I group or bin the data, I can get wildly different histograms. One set of histograms will make it seem that the data is…

asked Mar 08 '13 at 17:58

guestoeijreor


votes

8 answers

What is the benefit of breaking up a continuous predictor variable?

I'm wondering what the value is in taking a continuous predictor variable and breaking it up (e.g., into quintiles), before using it in a model. It seems to me that by binning the variable we lose information. Is this just so we can model…

regression continuous-data regression-strategies binning faq

asked Aug 31 '13 at 05:32

Tom


votes

2 answers

When should we discretize/bin continuous independent variables/features and when should not?

When should we discretize/bin independent variables/features and when should we not? My attempts to answer the question: In general, we should not bin, because binning will lose information. Binning is actually increasing the degrees of freedom of the…

machine-learning continuous-data feature-engineering binning

asked Aug 19 '16 at 17:31

Haitao Du


votes

4 answers

Benefits of using QQ-plots over histograms

In this comment, Nick Cox wrote: Binning into classes is an ancient method. While histograms can be useful, modern statistical software makes it easy as well as advisable to fit distributions to the raw data. Binning just throws away detail that is…

references histogram binning qq-plot

asked Jul 11 '13 at 12:00

MvG

votes

2 answers

Impact of data-based bin boundaries on a chi-square goodness of fit test?

Leaving aside the obvious issue of the low power of the chi-square in this sort of circumstance, imagine doing a chi-square goodness of fit test for some density with unspecified parameters, by binning the data. For concreteness, let's say an…

chi-squared-test goodness-of-fit binning

asked Oct 04 '13 at 01:48

Glen_b


votes

3 answers

Best way to put two histograms on same scale?

Let's say I have two distributions I want to compare in detail, i.e. in a way that makes shape, scale and shift easily visible. One good way to do this is to plot a histogram for each distribution, put them on the same X scale, and stack one…

data-visualization histogram density-function binning

asked Mar 03 '11 at 16:28

dsimcha


votes

5 answers

Why should binning be avoided at all costs?

So I've read a few posts about why binning should always be avoided. A popular reference for that claim being this link. The main takeaway being that the binning points (or cutpoints) are rather arbitrary as well as the resulting loss of information,…

classification categorical-data continuous-data splines binning

asked Feb 04 '19 at 11:32

Readler

votes

2 answers

Optimal Binning with respect to a given response variable

I'm looking for an optimal binning method (discretization) of a continuous variable with respect to a given binary response (target) variable, with the maximum number of intervals as a parameter. Example: I have a set of observations of people with…

r dataset optimization discrete-data binning

asked Apr 29 '15 at 00:03

Dominix

votes

3 answers

Is binning data valid prior to Pearson correlation?

Is it acceptable to bin data, calculate the mean of the bins, and then derive the Pearson correlation coefficient on the basis of these means? It seems a somewhat fishy procedure to me in that (if you think of the data as a population sample) the…

correlation binning

asked Jun 02 '13 at 18:53

James

votes

4 answers

Interpolating binned data such that bin average is preserved

Say I have this binned data as input. The average value $\bar{y}_i$ is given for each successive $\Delta x_i$ interval. For simplicity, let's assume sampling density is uniform within each bin. Now I want to estimate the underlying function $y(x)$…

algorithms interpolation binning

asked Apr 26 '16 at 10:04

Jean-François Corbett

votes

2 answers

How to 'intelligently' bin a collection of sorted data?

I am trying to intelligently bin a sorted collection. I have a collection of $n$ pieces of data. But I know that this data fits into $m$ unequally sized bins. I don't know how to intelligently choose the endpoints to properly fit the data. for…

clustering histogram binning

asked Aug 13 '12 at 17:36

Matthew Kemnetz

votes

5 answers

Interpretation of Bayes Theorem applied to positive mammography results

I'm trying to wrap my head around the result of Bayes Theorem applied to the classic mammogram example, with the twist of the mammogram being perfect. That is, Incidence of cancer: $.01$ Probability of a positive mammogram, given the patient has…

bayesian binning faq diagnosis

asked Dec 09 '15 at 00:11

user2666425

votes

3 answers

Number of bins when computing mutual information

I want to quantify the relationship between two variables, A and B, using mutual information. The way to compute it is by binning the observations (see example Python code below). However, what factors determine what number of bins is reasonable? I…

information-theory mutual-information binning

asked Nov 01 '15 at 15:45

pir


votes

1 answer

Should we bin continuous variables?

I know this has been asked before, and I have read through the responses to the earlier queries related to binning continuous variables. I do understand that generally we should avoid binning, given that it potentially results in throwing away…

regression modeling splines binning

asked May 21 '15 at 15:15

Dataminer

votes

5 answers

Binning By Equal-Width

I have a dataset: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215. The formula for binning into equal widths is this (as far as I know): $$\text{width} = \frac{\max - \min}{N}$$ I think N is a number that divides the length of the list nicely. So in this case it…

data-mining binning

asked Nov 07 '13 at 09:31

Mike John
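
The equal-width formula quoted in the last question above can be tried directly on the dataset given there. This is only a sketch: the excerpt is cut off before it says which N the asker settled on, so `n_bins = 3` below is an assumption made for illustration, and the variable names are hypothetical.

```python
import numpy as np

# Dataset as listed in the question excerpt.
values = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

n_bins = 3  # assumed for illustration; the excerpt does not state N
width = (values.max() - values.min()) / n_bins  # width = (max - min) / N

# Equally spaced edges: min, min + width, ..., max.
edges = values.min() + width * np.arange(n_bins + 1)

# np.histogram treats each bin as half-open [a, b), except the last, which is closed.
counts, _ = np.histogram(values, bins=edges)

print("width:", width)
print("edges:", edges.tolist())
print("counts:", counts.tolist())
```

With N = 3 the width is 70 and the edges are 5, 75, 145 and 215, so nine of the twelve values land in the first bin. Equal-width bins follow the range of the data rather than its density, which is why skewed data tends to pile up in one or two bins.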
