
I have a data set of over three million numeric values. Close to 20% of them are either 0 or 1, and the maximum is nearly 18500, so the data is clearly quite heavily positively skewed.

I am trying to categorize some of this data by putting it into bins of equal width, in order to use the Chi-square test and Cramér's V to look for associations between this variable and a categorical one, so I decided to try to find the optimal number of bins. The Freedman-Diaconis rule gave me a value of 126044.0262335108, which is clearly a ridiculously large number of bins for this data.
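For reference, this is roughly the calculation (the sample below is a simulated stand-in, not my actual data):

```python
import numpy as np

# Simulated heavily skewed sample standing in for the real data.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.0, sigma=1.2, size=3_000_000)

# Freedman-Diaconis: bin width h = 2 * IQR * n^(-1/3); number of bins = range / h.
q25, q75 = np.percentile(x, [25, 75])
h = 2 * (q75 - q25) * len(x) ** (-1 / 3)
n_bins = (x.max() - x.min()) / h
print(n_bins)
# On the real data the IQR is only a few units while the range is ~18500,
# which is how the count ends up around 126,000.

# numpy's built-in version of the same rule:
edges = np.histogram_bin_edges(x, bins="fd")
print(len(edges) - 1)
```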

Breaking the set into deciles also proved fruitless, giving me boundaries of [0, 1, 1, 2, 3, 5, 8, 17, 47].
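The repeated values there come from the mass of 0s and 1s; a sketch of the decile calculation (again on a simulated stand-in):

```python
import numpy as np

# Simulated stand-in sample (not the real data).
rng = np.random.default_rng(0)
x = np.round(rng.lognormal(mean=1.0, sigma=1.2, size=3_000_000))

# Inner decile boundaries (10th..90th percentiles). With ~20% of the values
# sitting at 0 or 1, the lowest edges collapse onto the same numbers, which
# is why the real data gives [0, 1, 1, 2, 3, 5, 8, 17, 47].
edges = np.percentile(x, np.arange(10, 100, 10))
print(edges)

# pandas can build quantile bins and merge the duplicated edges:
# import pandas as pd
# bins = pd.qcut(x, q=10, duplicates="drop")
```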

Reading elsewhere, the square root of the sample size was suggested; this gave 1732.05081, which is more reasonable. However, the method is quite crude.
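That is just the square root of n; numpy also exposes this rule directly:

```python
import numpy as np

n = 3_000_000
k = int(np.ceil(np.sqrt(n)))  # sqrt(3e6) ~ 1732.05, so 1733 bins
print(k)

# equivalently, for an array x of the data:
# edges = np.histogram_bin_edges(x, bins="sqrt")
```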

I also looked into Doane's formula, given here. But reading up on this method, it seems to have been based on an incorrect hypothesis.
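For completeness, Doane's formula as I understand it is 1 + log2(n) + log2(1 + |g1| / sigma_g1), where g1 is the sample skewness; the helper name below is mine and the sample is simulated:

```python
import numpy as np
from scipy.stats import skew

def doane_bins(x):
    """Doane's rule: 1 + log2(n) + log2(1 + |g1| / sigma_g1)."""
    n = len(x)
    g1 = skew(x)  # sample skewness
    sigma_g1 = np.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    return int(np.ceil(1 + np.log2(n) + np.log2(1 + abs(g1) / sigma_g1)))

# Simulated stand-in sample (not the real data):
rng = np.random.default_rng(0)
print(doane_bins(rng.lognormal(mean=1.0, sigma=1.2, size=100_000)))
```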

How should I deal with this level of skew in the data?

What is the best way to categorize this data?

kopo222
  • Try a different transform to make the data more normal. – Carl Apr 16 '17 at 11:01
  • Is there any reason you've asked this question twice from different accounts? https://stats.stackexchange.com/questions/273880/calculating-the-optimal-number-of-bins-for-severly-skewed-data – einar Apr 16 '17 at 11:07
  • I asked the question as a guest the first time and was auto logged in on this account, didn't realise. As to why I asked it again, I left out some reasoning as to why I was doing this and couldn't edit the original – kopo222 Apr 16 '17 at 11:54
  • Please visit https://stats.stackexchange.com/help/merging-accounts to link your accounts: that will enable you to edit the original. – whuber Apr 16 '17 at 12:18

0 Answers