Questions tagged [binning]

Binning means grouping a continuous variable into discrete categories. The term is used especially in reference to histograms, but it can also refer more generally to any coarsening of data.

Various rules have been proposed for choosing the number of bins in a histogram. As is often the case, there is a tradeoff: with too many bins, the histogram is very bumpy and depends heavily on the particular data set; with too few, necessary detail is lost. This is discussed in this thread.

One problem with histograms is that different binning choices can produce histograms of the same data that look quite different.
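
As a rough illustration of how much the rules of thumb can disagree, the sketch below computes the number of bins implied by three commonly cited rules (Sturges, Scott, Freedman-Diaconis) for a simulated right-skewed sample. The function names, the simulated data, and the default quartile method of `statistics.quantiles` are assumptions made for this example; they are not part of the tag description.

```python
import math
import random
import statistics

def sturges_bins(x):
    # Sturges: k = ceil(log2(n)) + 1, depends only on the sample size
    return math.ceil(math.log2(len(x))) + 1

def width_to_bins(x, h):
    # Convert a bin width h into a bin count covering the data range
    return max(1, math.ceil((max(x) - min(x)) / h))

def scott_bins(x):
    # Scott: h = 3.49 * s * n^(-1/3), with s the sample standard deviation
    h = 3.49 * statistics.stdev(x) * len(x) ** (-1 / 3)
    return width_to_bins(x, h)

def fd_bins(x):
    # Freedman-Diaconis: h = 2 * IQR * n^(-1/3)
    q1, _, q3 = statistics.quantiles(x, n=4)
    h = 2 * (q3 - q1) * len(x) ** (-1 / 3)
    return width_to_bins(x, h)

# Hypothetical data: 500 draws from an exponential distribution
rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(500)]

bin_counts = {
    "sturges": sturges_bins(sample),
    "scott": scott_bins(sample),
    "freedman-diaconis": fd_bins(sample),
}
for rule, k in bin_counts.items():
    print(f"{rule:18s} -> {k} bins")
```

Because the three rules generally give different bin counts for the same sample, the resulting histograms can suggest different shapes, which is the problem described above.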

233 questions
116
votes
4 answers

Assessing approximate distribution of data based on a histogram

Suppose I want to see whether my data is exponential based on a histogram (i.e. skewed to the right). Depending on how I group or bin the data, I can get wildly different histograms. One set of histograms will make it seem that the data is…
guestoeijreor
  • 1,161
  • 3
  • 8
  • 3
98
votes
8 answers

What is the benefit of breaking up a continuous predictor variable?

I'm wondering what the value is in taking a continuous predictor variable and breaking it up (e.g., into quintiles), before using it in a model. It seems to me that by binning the variable we lose information. Is this just so we can model…
Tom
  • 1,511
  • 1
  • 12
  • 17
29
votes
2 answers

When should we discretize/bin continuous independent variables/features and when should we not?

When should we discretize/bin independent variables/features and when should we not? My attempts to answer the question: In general, we should not bin, because binning loses information. Binning actually increases the degrees of freedom of the…
Haitao Du
  • 32,885
  • 17
  • 118
  • 213
24
votes
4 answers

Benefits of using QQ-plots over histograms

In this comment, Nick Cox wrote: Binning into classes is an ancient method. While histograms can be useful, modern statistical software makes it easy as well as advisable to fit distributions to the raw data. Binning just throws away detail that is…
MvG
  • 480
  • 4
  • 11
18
votes
2 answers

Impact of data-based bin boundaries on a chi-square goodness of fit test?

Leaving aside the obvious issue of the low power of the chi-square in this sort of circumstance, imagine doing a chi-square goodness of fit test for some density with unspecified parameters, by binning the data. For concreteness, let's say an…
Glen_b
  • 257,508
  • 32
  • 553
  • 939
14
votes
3 answers

Best way to put two histograms on same scale?

Let's say I have two distributions I want to compare in detail, i.e. in a way that makes shape, scale and shift easily visible. One good way to do this is to plot a histogram for each distribution, put them on the same X scale, and stack one…
dsimcha
  • 7,375
  • 7
  • 32
  • 29
13
votes
5 answers

Why should binning be avoided at all costs?

So I've read a few posts about why binning should always be avoided. A popular reference for that claim is this link. The main takeaway is that the binning points (or cutpoints) are rather arbitrary, as well as the resulting loss of information,…
13
votes
2 answers

Optimal Binning with respect to a given response variable

I'm looking for an optimal binning method (discretization) of a continuous variable with respect to a given binary response (target) variable, with the maximum number of intervals as a parameter. Example: I have a set of observations of people with…
Dominix
  • 231
  • 1
  • 2
  • 5
12
votes
3 answers

Is binning data valid prior to Pearson correlation?

Is it acceptable to bin data, calculate the mean of the bins, and then derive the Pearson correlation coefficient on the basis of these means? It seems a somewhat fishy procedure to me in that (if you think of the data as a population sample) the…
James
  • 223
  • 2
  • 4
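
To make the concern in the question above concrete, here is a small simulation sketch (not taken from the question or its answers) comparing the Pearson correlation of raw data with the correlation of bin means. The simulated data, the noise level, the choice of 10 equal-count bins, and the helper names `pearson` and `bin_means` are all assumptions made for illustration.

```python
import math
import random

def pearson(xs, ys):
    # Plain Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def bin_means(xs, ys, n_bins):
    # Sort by x, split into n_bins equal-count groups, return the group means
    pairs = sorted(zip(xs, ys))
    size = len(pairs) // n_bins
    mean_x, mean_y = [], []
    for i in range(n_bins):
        chunk = pairs[i * size:(i + 1) * size]
        mean_x.append(sum(p[0] for p in chunk) / len(chunk))
        mean_y.append(sum(p[1] for p in chunk) / len(chunk))
    return mean_x, mean_y

# Hypothetical data: a linear relationship buried in substantial noise
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(1000)]
y = [xi + rng.gauss(0, 3) for xi in x]

r_raw = pearson(x, y)
r_binned = pearson(*bin_means(x, y, 10))
print(f"raw r = {r_raw:.3f}, r of bin means = {r_binned:.3f}")
```

Averaging within bins removes most of the individual-level noise, so the correlation between bin means is typically much closer to 1 than the correlation in the raw data. It describes the relationship between group averages, not between individual observations.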
12
votes
4 answers

Interpolating binned data such that bin average is preserved

Say I have this binned data as input. The average value $\bar{y}_i$ is given for each successive $\Delta x_i$ interval. For simplicity, let's assume sampling density is uniform within each bin. Now I want to estimate the underlying function $y(x)$…
11
votes
2 answers

How to 'intelligently' bin a collection of sorted data?

I am trying to intelligently bin a sorted collection. I have a collection of $n$ pieces of data. But I know that this data fits into $m$ unequally sized bins. I don't know how to intelligently choose the endpoints to properly fit the data. for…
Matthew Kemnetz
  • 213
  • 1
  • 2
  • 7
11
votes
5 answers

Interpretation of Bayes Theorem applied to positive mammography results

I'm trying to wrap my head around the result of Bayes Theorem applied to the classic mammogram example, with the twist of the mammogram being perfect. That is, Incidence of cancer: $.01$ Probability of a positive mammogram, given the patient has…
user2666425
  • 229
  • 2
  • 4
11
votes
3 answers

Number of bins when computing mutual information

I want to quantify the relationship between two variables, A and B, using mutual information. The way to compute it is by binning the observations (see example Python code below). However, what factors determine what number of bins is reasonable? I…
pir
  • 4,626
  • 10
  • 38
  • 73
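
The asker's own Python code is not reproduced in this listing. As a stand-in, the following is a minimal sketch of a plug-in (histogram-based) mutual information estimate using equal-width bins. The helper names `bin_index` and `mutual_information`, the simulated independent data, and the bin counts 4 and 20 are assumptions made for this example.

```python
import math
import random
from collections import Counter

def bin_index(values, n_bins):
    # Map each value to the index of its equal-width bin
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value is clamped into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(a, b, n_bins):
    # Plug-in estimate in nats: sum of p(i,j) * log(p(i,j) / (p(i) * p(j)))
    ia, ib = bin_index(a, n_bins), bin_index(b, n_bins)
    n = len(a)
    joint = Counter(zip(ia, ib))
    pa, pb = Counter(ia), Counter(ib)
    return sum(
        (c / n) * math.log((c / n) / ((pa[i] / n) * (pb[j] / n)))
        for (i, j), c in joint.items()
    )

# Hypothetical data: two independent Gaussian samples (true MI is 0)
rng = random.Random(2)
a = [rng.gauss(0, 1) for _ in range(200)]
b = [rng.gauss(0, 1) for _ in range(200)]

mi_coarse = mutual_information(a, b, 4)
mi_fine = mutual_information(a, b, 20)
print(f"4 bins: {mi_coarse:.3f} nats, 20 bins: {mi_fine:.3f} nats")
```

Even though the two samples are independent, the estimate is positive and grows with the number of bins, because sparsely filled cells inflate the plug-in estimate. This is one reason the bin count matters for this kind of calculation.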
11
votes
1 answer

Should we bin continuous variables?

I know this has been asked before, and I have read through the responses to the earlier queries related to binning continuous variables. I do understand that generally we should avoid binning, given that it potentially results in throwing away…
Dataminer
  • 365
  • 3
  • 12
10
votes
5 answers

Binning By Equal-Width

I have a dataset: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215. The formula for binning into equal widths is this (as far as I know): $$\text{width} = (\max - \min) / N$$ I think N is a number that divides the length of the list nicely. So in this case it…
Mike John
  • 624
  • 3
  • 6
  • 19
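
The equal-width formula quoted in the question above can be sketched as follows, using the dataset from the question. The value N = 3 is chosen purely for illustration and is not necessarily the value the asker had in mind; the function name `equal_width_bins` and the convention of placing the maximum in the last bin are likewise assumptions.

```python
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

def equal_width_bins(values, n_bins):
    # width = (max - min) / N, as in the formula above
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything goes in the first bin
        else:
            # Clamp so that the maximum value falls in the last bin
            idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return width, bins

width, bins = equal_width_bins(data, 3)
print(width)  # 70.0
for i, b in enumerate(bins):
    print(f"bin {i}: {b}")
```

With N = 3 the bin edges are 5, 75, 145 and 215, so nine of the twelve values land in the first bin. Equal-width binning can be very unbalanced for skewed data like this.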