
I have a data set of over three million numeric values. Close to 20% of them are either 0 or 1, and the maximum is nearly 18500, so the data is clearly quite heavily positively skewed.

I am trying to categorize some of this data by putting it into bins of equal width, in order to use the Chi-square test and Cramér's V to look for associations between this variable and a categorical one, so I decided to try to find the optimal number of bins. The Freedman-Diaconis rule gave me a value of 126044.0262335108, which is clearly a ridiculously large number of bins for this data.
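For reference, this is roughly the calculation (the sample below is a simulated stand-in, not my actual data):

```python
import numpy as np

# Simulated heavily skewed sample standing in for the real data.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.0, sigma=1.2, size=3_000_000)

# Freedman-Diaconis: bin width h = 2 * IQR * n^(-1/3); number of bins = range / h.
q25, q75 = np.percentile(x, [25, 75])
h = 2 * (q75 - q25) * len(x) ** (-1 / 3)
n_bins = (x.max() - x.min()) / h
print(n_bins)
# On the real data the IQR is only a few units while the range is ~18500,
# which is how the count ends up around 126,000.

# numpy's built-in version of the same rule:
edges = np.histogram_bin_edges(x, bins="fd")
print(len(edges) - 1)
```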

Breaking the set into deciles also proved fruitless, giving me boundaries of [0, 1, 1, 2, 3, 5, 8, 17, 47].
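The repeated values there come from the mass of 0s and 1s; a sketch of the decile calculation (again on a simulated stand-in):

```python
import numpy as np

# Simulated stand-in sample (not the real data).
rng = np.random.default_rng(0)
x = np.round(rng.lognormal(mean=1.0, sigma=1.2, size=3_000_000))

# Inner decile boundaries (10th..90th percentiles). With ~20% of the values
# sitting at 0 or 1, the lowest edges collapse onto the same numbers, which
# is why the real data gives [0, 1, 1, 2, 3, 5, 8, 17, 47].
edges = np.percentile(x, np.arange(10, 100, 10))
print(edges)

# pandas can build quantile bins and merge the duplicated edges:
# import pandas as pd
# bins = pd.qcut(x, q=10, duplicates="drop")
```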

Reading elsewhere, the square root of the sample size was suggested; this gave 1732.05081, which is more reasonable. However, the method is quite crude.
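That is just the square root of n; numpy also exposes this rule directly:

```python
import numpy as np

n = 3_000_000
k = int(np.ceil(np.sqrt(n)))  # sqrt(3e6) ~ 1732.05, so 1733 bins
print(k)

# equivalently, for an array x of the data:
# edges = np.histogram_bin_edges(x, bins="sqrt")
```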

I also looked into Doane's formula, given here. But reading up on this method, it seems to have been based on an incorrect hypothesis.
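For completeness, Doane's formula as I understand it is 1 + log2(n) + log2(1 + |g1| / sigma_g1), where g1 is the sample skewness; the helper name below is mine and the sample is simulated:

```python
import numpy as np
from scipy.stats import skew

def doane_bins(x):
    """Doane's rule: 1 + log2(n) + log2(1 + |g1| / sigma_g1)."""
    n = len(x)
    g1 = skew(x)  # sample skewness
    sigma_g1 = np.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    return int(np.ceil(1 + np.log2(n) + np.log2(1 + abs(g1) / sigma_g1)))

# Simulated stand-in sample (not the real data):
rng = np.random.default_rng(0)
print(doane_bins(rng.lognormal(mean=1.0, sigma=1.2, size=100_000)))
```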

How should I deal with this level of skew in the data?

What is the best way to categorize this data?

kopo222
  • Try a different transform to make the data more normal. – Carl Apr 16 '17 at 11:01
  • Is there any reason you've asked this question twice from different accounts? https://stats.stackexchange.com/questions/273880/calculating-the-optimal-number-of-bins-for-severly-skewed-data – einar Apr 16 '17 at 11:07
  • I asked the question as a guest the first time and was auto logged in on this account, didn't realise. As to why I asked it again, I left out some reasoning as to why I was doing this and couldn't edit the original – kopo222 Apr 16 '17 at 11:54
  • Please visit https://stats.stackexchange.com/help/merging-accounts to link your accounts: that will enable you to edit the original. – whuber Apr 16 '17 at 12:18

0 Answers