Why group them? Instead, how about estimating the probability density function (PDF) of the distributions from which the data arise? Here's an R-based example:
set.seed(123)
dat <- c(sample(2000000, 500), rnorm(100, 1000000, 1000),
         rnorm(150, 1500000, 100), rnorm(150, 500000, 10),
         rnorm(180, 10000, 10), rnorm(10, 1000, 5), 1:10)
dens <- density(dat)
plot(dens)
If the data are strictly bounded on (0, 2,000,000), then the kernel density estimate is perhaps not best suited, because it will smear some probability mass outside those bounds. You could fudge things by asking it to evaluate the density only between the bounds:
dens2 <- density(dat, from = 0, to = 2000000)
plot(dens2)
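As a rough illustration of why this is only a fudge, here is a sketch (using the dens2 object from above) that approximates how much probability mass the truncated estimate retains inside the bounds; anything noticeably below 1 has leaked past the boundaries:

## Trapezoidal approximation of the area under the truncated density;
## kernels centred near 0 or 2,000,000 spill mass outside the bounds,
## so this will typically come out somewhat below 1
mass <- sum(diff(dens2$x) * (head(dens2$y, -1) + tail(dens2$y, -1)) / 2)
mass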
Alternatively, there is the histogram - a coarse version of the kernel density estimate. What you specifically describe is binning your data. There are lots of rules/approaches for selecting equal-width bins (i.e. the number of bins) from the data. In R the default is Sturges' rule, but it also includes the Freedman-Diaconis rule and Scott's rule. There are others as well - see the Wikipedia page on histograms.
hist(dat)
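If you want to see how the choice of rule affects the histogram, hist() accepts the rule name via its breaks argument. A quick sketch (note these are only suggestions; hist() may adjust the break points to round values):

op <- par(mfrow = c(1, 3))
hist(dat, breaks = "Sturges", main = "Sturges")
hist(dat, breaks = "Scott", main = "Scott")
hist(dat, breaks = "FD", main = "Freedman-Diaconis")
par(op)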
If you are not interested in the kernel density plot or the histogram per se, but rather just in the binned data, then you can compute the number of bins using the nclass.X family of functions, where X is one of Sturges, scott, or FD, and then use cut() to bin your data:
cut.dat <- cut(dat, breaks = nclass.FD(dat), include.lowest = TRUE)
table(cut.dat)
which, in R, gives:
cut.dat
[-2e+03,2.21e+05] (2.21e+05,4.43e+05] (4.43e+05,6.65e+05] (6.65e+05,8.88e+05]
247 60 215 61
(8.88e+05,1.11e+06] (1.11e+06,1.33e+06] (1.33e+06,1.56e+06] (1.56e+06,1.78e+06]
153 51 205 50
(1.78e+06,2e+06]
58
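If you just want to compare how many bins each rule would suggest for these data, a small sketch like this will do:

## Number of bins suggested by each rule for dat
sapply(list(Sturges = nclass.Sturges, Scott = nclass.scott, FD = nclass.FD),
       function(f) f(dat))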
However, binning is fraught with problems, most notably: how do you know that your choice of bins hasn't influenced the resulting impression you get of the way the data are distributed?
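To see that problem in action, plot the same data with two quite different numbers of bins and compare the impression each gives (the counts passed to breaks are, again, only suggestions to R):

op <- par(mfrow = c(1, 2))
hist(dat, breaks = 5, main = "~5 bins")
hist(dat, breaks = 50, main = "~50 bins")
par(op)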