Why group them? Instead, how about estimating the probability density function (PDF) of the distributions from which the data arise? Here's an R-based example:
set.seed(123)
dat <- c(sample(2000000, 500), rnorm(100, 1000000, 1000),
         rnorm(150, 1500000, 100), rnorm(150, 500000, 10),
         rnorm(180, 10000, 10), rnorm(10, 1000, 5), 1:10)
dens <- density(dat)
plot(dens)
If the data are strictly bounded on (0, 2,000,000), then the kernel density estimate is perhaps not best suited, because it will smear some probability mass outside those bounds. You could fudge things by asking it to evaluate the density only between the bounds:
dens2 <- density(dat, from = 0, to = 2000000)
plot(dens2)
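As a rough illustration of why this is only a fudge, here is a sketch (using the dens2 object from above) that approximates how much probability mass the truncated estimate retains inside the bounds; anything noticeably below 1 has leaked past the boundaries:

## Trapezoidal approximation of the area under the truncated density;
## kernels centred near 0 or 2,000,000 spill mass outside the bounds,
## so this will typically come out somewhat below 1
mass <- sum(diff(dens2$x) * (head(dens2$y, -1) + tail(dens2$y, -1)) / 2)
mass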
Alternatively, there is the histogram - a coarse version of the kernel density estimate. What you specifically describe is binning your data. There are lots of rules/approaches for selecting equal-width bins (i.e. the number of bins) from the data. In R the default is Sturges' rule, but it also includes the Freedman-Diaconis rule and Scott's rule. There are others as well - see the Wikipedia page on histograms.
hist(dat)
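If you want to see how the choice of rule affects the histogram, hist() accepts the rule name via its breaks argument. A quick sketch (note these are only suggestions; hist() may adjust the break points to round values):

op <- par(mfrow = c(1, 3))
hist(dat, breaks = "Sturges", main = "Sturges")
hist(dat, breaks = "Scott", main = "Scott")
hist(dat, breaks = "FD", main = "Freedman-Diaconis")
par(op)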
If you are not interested in the kernel density plot or the histogram per se, but rather just in the binned data, then you can compute the number of bins using the nclass.X family of functions, where X is one of Sturges, scott, or FD, and then use cut() to bin your data:
cut.dat <- cut(dat, breaks = nclass.FD(dat), include.lowest = TRUE)
table(cut.dat)
which, in R, gives:
cut.dat
[-2e+03,2.21e+05] (2.21e+05,4.43e+05] (4.43e+05,6.65e+05] (6.65e+05,8.88e+05]
247 60 215 61
(8.88e+05,1.11e+06] (1.11e+06,1.33e+06] (1.33e+06,1.56e+06] (1.56e+06,1.78e+06]
153 51 205 50
(1.78e+06,2e+06]
58
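If you just want to compare how many bins each rule would suggest for these data, a small sketch like this will do:

## Number of bins suggested by each rule for dat
sapply(list(Sturges = nclass.Sturges, Scott = nclass.scott, FD = nclass.FD),
       function(f) f(dat))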
However, binning is fraught with problems, most notably: how do you know that your choice of bins hasn't influenced the resulting impression you get of the way the data are distributed?
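To see that problem in action, plot the same data with two quite different numbers of bins and compare the impression each gives (the counts passed to breaks are, again, only suggestions to R):

op <- par(mfrow = c(1, 2))
hist(dat, breaks = 5, main = "~5 bins")
hist(dat, breaks = 50, main = "~50 bins")
par(op)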