Is binning data valid prior to Pearson correlation?

Question

Is it acceptable to bin data, calculate the mean of each bin, and then derive the Pearson correlation coefficient from these means? The procedure seems somewhat fishy to me: if you think of the data as a population sample, the scatter of each bin mean is the standard error of the mean, and hence very tight if $n$ is large. So you will probably get a much better correlation coefficient than from the primary data, and that seems wrong. On the other hand, people often average replicate measurements before a correlation calculation, which isn't very different.
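The inflation described above can be checked with a small simulation. This is only a sketch under assumed conditions, not the procedure of any particular paper: the sample size, the slope of 0.2, the equal-count binning along x, and the helper name `binned_means` are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# Assumed toy data: a weak linear relationship buried in noise.
n = 5000
x = rng.normal(size=n)
y = 0.2 * x + rng.normal(size=n)

def binned_means(x, y, n_bins):
    """Split the points into equal-count bins along x; return per-bin means."""
    order = np.argsort(x)
    x_bins = np.array_split(x[order], n_bins)
    y_bins = np.array_split(y[order], n_bins)
    return (np.array([b.mean() for b in x_bins]),
            np.array([b.mean() for b in y_bins]))

# Pearson r on the primary data.
r_raw = np.corrcoef(x, y)[0, 1]

# Pearson r on the bin means, for progressively coarser binning.
r_binned = {}
for k in (50, 10, 2):
    mx, my = binned_means(x, y, k)
    r_binned[k] = np.corrcoef(mx, my)[0, 1]

print(f"raw data: r = {r_raw:.3f}")
for k, r in r_binned.items():
    print(f"{k:>2} bins:  r = {r:.3f}")
```

Averaging within bins shrinks the noise in each mean while the spread between bins stays roughly the same, so r climbs as the bins get coarser. With only two bins there are just two points, and the magnitude of r is 1 whatever the underlying data look like.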

What would be the purpose of binning in this particular case? — chl, Jun 02 '13 at 20:41
There is no evident point to binning before correlation unless you are directly interested in looking at the relationship between binned variables. — Nick Cox, Jun 02 '13 at 20:51
Binning data that is continuous and then computing a correlation is like cutting off your leg and then getting crutches. — Peter Flom, Jun 02 '13 at 21:04
My guess is that the binning was done to make the correlation look better than it was in reality. The primary data gave a poor correlation but, when binned and averaged, it looked much better. I think that because each bin mean will have a tiny standard error (there were 100s of points in each bin), the averaged values give an apparently beautiful correlation. — James, Jun 02 '13 at 21:28
No - it is a published paper (in chemistry) I've recently read that uses this approach and I was wondering whether it is valid. — James, Jun 03 '13 at 07:57
Why stop there? By using just two bins you can always get a correlation coefficient of $100$% :-). In contrast, averaging replicate measurements *is* different because it invokes a different model of data behavior and leads to a different inference (about the expectations of the replicates rather than the replicates themselves). — whuber, Aug 30 '13 at 21:59
I'd be interested to know more (or to be pointed to a reference) regarding the differences between binning and replicates. While intuitively there is a difference, I can't define it precisely. — James, Sep 04 '13 at 18:49

score 2 · Accepted Answer · answered May 20 '14 at 01:49

Not exactly the same as your question, but on a related note: I remember reading an article a while back (in either The American Statistician or Chance magazine, sometime between 2000 and 2003) showing that, for any dataset of two variables that are essentially uncorrelated, you can find a way to bin the "predictor" variable and average the response variable within each bin so that, depending on how you do the binning, a table or simple plot shows either a positive or a negative relationship.
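A rough way to see this numerically is to generate two independent variables and try many random binnings of the predictor. This is a brute-force sketch under assumed settings (200 points, 5 bins, 1000 random sets of cut positions, a hypothetical helper `binned_r`), not the construction used in the article.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

n = 200
x = np.sort(rng.normal(size=n))  # "predictor", sorted so bins are contiguous
y = rng.normal(size=n)           # response, generated independently of x

def binned_r(cuts):
    """Pearson r between the per-bin means of x and y for given cut indices."""
    mx = np.array([b.mean() for b in np.split(x, cuts)])
    my = np.array([b.mean() for b in np.split(y, cuts)])
    return np.corrcoef(mx, my)[0, 1]

n_bins = 5
results = []
for _ in range(1000):
    # Distinct cut positions in 1..n-1, so every bin holds at least one point.
    cuts = np.sort(rng.choice(np.arange(1, n), size=n_bins - 1, replace=False))
    results.append(binned_r(cuts))
results = np.array(results)

r_raw = np.corrcoef(x, y)[0, 1]
print(f"raw r = {r_raw:.3f}")
print(f"binned r ranges from {results.min():.3f} to {results.max():.3f}")
```

Because the binned correlation is computed from only a handful of noisy means, its sign and size depend heavily on where the cuts fall, which leaves room to pick a binning that tells whichever story is wanted.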

Greg Snow

The excellent article you allude to is Wainer, Howard (2006), "Finding what is not there through the unfortunate binning of results: The Mendel effect", Chance 19(1), 49-56. Annotation: can find bins that yield either positive or negative association; especially pertinent when effects are small; "With four parameters, I can fit an elephant; with five, I can make it wiggle its trunk." (John von Neumann) – Frank Harrell May 20 '14 at 02:29
@FrankHarrell, thanks for the reference; my memory was off by a couple of years. – Greg Snow May 20 '14 at 14:04
@GregSnow maybe you could edit for the reference? https://www.tandfonline.com/doi/abs/10.1080/09332480.2006.10722771 – Tim Jun 16 '21 at 06:57

score 0 · Answer 2 · answered Jun 16 '21 at 00:58

The correlation coefficient is a measure of the uncertainty in predicting the value Y from the value X measured for an individual sampled object, such as a patient. When the focus is on this prediction you do not 'bin' individual readouts: your blood sample is your blood sample, not someone else's. If, however, the nature of your data permits binning or averaging, it normally means that you do not care about individual readouts (you are happy to mix them together) but want to see whether there is a TREND in your sample. In that case, look at a (linear) regression and its significance instead, because the correlation coefficient depends directly on the way you bin your data. Somehow, most biological papers ignore this.
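One way to illustrate the contrast between trend and correlation is to compare the fitted slope and Pearson r on the same data under different binnings. The sketch below assumes simulated data with a true slope of 0.2 and equal-count bins formed along X; the helper `slope_and_r` and all numeric settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

# Assumed toy data: a weak linear trend with a lot of individual scatter.
n = 5000
x = rng.normal(size=n)
y = 0.2 * x + rng.normal(size=n)

order = np.argsort(x)
xs, ys = x[order], y[order]

def slope_and_r(u, v):
    """Least-squares slope of v on u, and the Pearson r between them."""
    slope = np.polyfit(u, v, 1)[0]
    r = np.corrcoef(u, v)[0, 1]
    return slope, r

summary = {"raw": slope_and_r(xs, ys)}
for k in (100, 20, 5):
    # Equal-count bins along x; each bin is replaced by its mean point.
    mx = np.array([b.mean() for b in np.array_split(xs, k)])
    my = np.array([b.mean() for b in np.array_split(ys, k)])
    summary[f"{k} bins"] = slope_and_r(mx, my)

for label, (slope, r) in summary.items():
    print(f"{label:>8}: slope = {slope:.3f}, r = {r:.3f}")
```

In this setup the slope estimate stays close to the same value however the data are binned, whereas r moves with the number of bins. The sketch only compares point estimates; it does not address how to obtain a valid significance test from binned data.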

score -2 · Answer 3 · answered May 20 '14 at 01:46

The main reason to bin data is to allow for the possibility of a nonlinear relationship between the variables. The Pearson correlation measures strength of linear association, so it doesn't work well when the relationship is nonlinear.

There are obviously much better ways to handle this issue than binning. For example, you might fit a nonlinear or local regression model and correlate the predicted and actual response values (although this assumes that a predictor-response approach is valid, whereas correlation is symmetric). Binning is just a way of solving the problem of nonlinearity that people without a statistics background or statistical tools might use.
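As a hedged sketch of the alternative mentioned above, the example below fits a simple curve and correlates fitted with observed values. It assumes a purely quadratic relationship and uses a degree-2 polynomial from `numpy.polyfit` as a stand-in for a nonlinear or local regression; the data and settings are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

# Assumed toy data: a strong but symmetric nonlinear relationship.
n = 2000
x = rng.uniform(-2, 2, size=n)
y = x**2 + rng.normal(scale=0.5, size=n)

# Pearson r on the raw pairs measures only the linear association.
r_linear = np.corrcoef(x, y)[0, 1]

# Fit a quadratic and correlate the fitted values with the observed response.
coeffs = np.polyfit(x, y, deg=2)
y_hat = np.polyval(coeffs, x)
r_fit = np.corrcoef(y_hat, y)[0, 1]

print(f"Pearson r between x and y:        {r_linear:.3f}")
print(f"r between fitted and observed y:  {r_fit:.3f}")
```

Here the linear correlation is close to zero even though y depends strongly on x, while the correlation between fitted and observed values reflects how well the curve describes the data. Unlike Pearson r, this quantity treats x as the predictor and y as the response, so it is not symmetric in the two variables.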

Binning has absolutely nothing to do with helping to find a nonlinear relationship. — Frank Harrell, May 20 '14 at 02:27
