I'm trying to find a way to measure whether a categorical distribution of any size is concentrated in only a few bins, i.e. not uniform. The best idea I have is entropy, but that is hard to interpret on its own: it only tells me the distribution is close to uniform or it isn't, unless there's something I'm missing. I have also heard of kurtosis, but that is more a measure of tailedness. For example, if I have 5 classes and 2 of them make up 80% of the distribution, I'd like a metric that reflects this.

- You stated you want a metric that measures deviation from uniformity across the bins. The expected size of each bin is then equal, and you can compare this with the observed sizes; see the chi-squared test. Whether this makes sense probably depends on the context. – epp Nov 13 '21 at 03:03
- Maybe look at Wikipedia on [diversity indexes](https://en.wikipedia.org/wiki/Diversity_index), especially the Simpson index. – BruceET Nov 13 '21 at 08:59
- What's wrong with entropy? Or with the sum of squared probabilities (or its complement, or its reciprocal), invented and re-invented by many under different names for about a century? – Nick Cox Nov 13 '21 at 10:40
- I am not sure whether you have a misunderstanding, but statistical metrics are typically deviations from the "boring" null hypothesis. So, for example, you would measure deviation from uniformity rather than "clumpiness" directly. As others have said, have a look at chi-squared tests, and see also [G-tests](https://en.wikipedia.org/wiki/G-test), which may make the connection to entropy clearer. – seanv507 Nov 14 '21 at 18:54
3 Answers
Imagine your categorical variable (with many levels) is species (in biostatistics context). Then your question can be formulated as about how to measure biodiversity. The same question can be asked in many contexts, for instance, in economics about income inequality. So you are asking for a way of measuring diversity or inequality.
There are many such indices, for instance the Gini coefficient or the Simpson index; see also the other posts on this site about diversity indices.
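As a quick illustration (a minimal sketch, with made-up numbers matching the question's 5-class example), both the Simpson index and normalised entropy reflect this kind of concentration:

```r
# Hypothetical 5-class distribution: 2 classes carry 80% of the mass
p <- c(0.40, 0.40, 0.10, 0.05, 0.05)

# Simpson index: sum of squared probabilities.
# Equals 1/k for a uniform distribution over k classes, 1 for total concentration.
simpson <- sum(p^2)
simpson       # 0.335, well above the uniform value 1/5 = 0.2

# Normalised entropy: 1 for uniform, approaching 0 as mass concentrates.
norm_entropy <- -sum(p * log(p)) / log(length(p))
norm_entropy  # about 0.78, below the uniform value 1
```

The Simpson index's excess over $1/k$ (or its complement, the Gini–Simpson index) is then directly readable as a concentration measure.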

- This is the only answer so far that answers the question. Others change the question to how to test against some null hypothesis, e.g. uniformity. – Nick Cox Dec 17 '21 at 10:35
If you want to know whether several categories have significantly different proportions of the data, then you might use prop.test in R. (Because this test is much the same as a chi-squared test, this is essentially a repeat of the suggestion in @epp's comment.)
For example, if you have four categories with counts 10, 42, 20, and 15 out of 87, then you could use prop.test as below. The very small P-value tells you that there are significant differences among the proportions 0.115, 0.483, 0.230, and 0.172.
```r
barplot(c(10, 42, 20, 15))
prop.test(c(10, 42, 20, 15), rep(87, 4))

        4-sample test for equality of proportions
        without continuity correction

data:  c(10, 42, 20, 15) out of rep(87, 4)
X-squared = 36.582, df = 3, p-value = 5.639e-08
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4
0.1149425 0.4827586 0.2298851 0.1724138
```
Thus, it is worth looking to see whether the largest proportion is markedly larger than the next-largest one.
```r
prop.test(c(42, 20), c(62, 62))$p.val
[1] 0.0001621318
```
To avoid false discovery from repeated tests on the same data (ad hoc testing), it may be best to answer Yes only if the P-value is less than 5%/4 = 1.25%.
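If you run several such ad hoc comparisons, the same Bonferroni-style correction can be applied with R's built-in p.adjust (a minimal sketch; the p-values below are placeholders, not computed from the data above):

```r
# Hypothetical p-values from four ad hoc pairwise comparisons (placeholders)
pvals <- c(0.00016, 0.03, 0.20, 0.86)

# Bonferroni adjustment multiplies each p-value by the number of tests,
# capping at 1; comparing adjusted values with 5% is equivalent to
# comparing raw values with 5%/4 = 1.25%.
adj <- p.adjust(pvals, method = "bonferroni")
adj  # approximately 0.00064 0.12 0.80 1.00
```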
By contrast, it is not clear to me what you would make of the data 10, 32, 30, and 15. Here the proportions are clearly different, but the largest is not significantly larger than the next largest. Is there a "peak"?
```r
prop.test(c(10, 32, 30, 15), rep(87, 4))$p.val
[1] 6.943152e-05
prop.test(c(32, 30), c(62, 62))$p.val
[1] 0.8574624
```
Sometimes the ad hoc P-value may help you decide, but not always. If you have 100 times as much data, "everything" is significant [and the bar plot (omitted) looks just the same except for the numbers on the vertical axis].
```r
prop.test(c(1000, 3200, 3000, 1500), rep(8700, 4))$p.val
[1] 0
prop.test(c(3200, 3000), c(6200, 6200))$p.val
[1] 0.0003513735
```
In general, I wouldn't want to use the chi-squared test statistic as a 'metric'. See @NickCox's comment below.
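If what is wanted is a descriptive measure rather than a test statistic, one common companion to the chi-squared goodness-of-fit test is Cramér's V (a sketch, not part of the answer above; the counts reuse the first example):

```r
# Cramér's V as an effect size for a goodness-of-fit test against
# uniformity: V = sqrt(X^2 / (n * (k - 1))), which is 0 for a perfectly
# uniform sample and 1 when all mass sits in one bin.
counts <- c(10, 42, 20, 15)                    # counts from the example above
chisq  <- unname(chisq.test(counts)$statistic) # default null is uniform
V <- sqrt(chisq / (sum(counts) * (length(counts) - 1)))
V  # about 0.32
```

Unlike the raw chi-squared statistic, V does not grow with sample size or degrees of freedom, so it is comparable across tables.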

- The chi-squared test here is, evidently, a test of uniformity, but it's not an especially good measure. Even in its own terms, a chi-squared statistic can't be assessed without knowing the associated number of degrees of freedom. – Nick Cox Nov 13 '21 at 10:38
You can use the classical occupancy test
Another test you can use here is the classical occupancy test, which uses the classical occupancy distribution (see e.g., O'Neill 2021). If the true distribution is uniform over the categories, then the number of occupied bins when $n$ balls are allocated over $m$ bins follows the classical occupancy distribution. Moreover, any deviation from uniformity will tend to decrease the number of occupied bins, since it causes balls to concentrate in a smaller number of bins. Consequently, the occupancy number can be used as a test statistic, with lower values more conducive to the alternative hypothesis of non-uniformity.
Implementation: Here is some code in R to create the classical occupancy test. The test computes the p-value using the occupancy distribution in the `occupancy` package. The function takes a value `n` for the number of balls, `m` for the number of bins, and `occupancy` for the occupancy number in the data. The null hypothesis of the test is that the allocation is uniform and the alternative hypothesis is that it is non-uniform.
```r
occupancy.test <- function(n, m, occupancy) {

  # Check inputs
  if (!is.numeric(n))         { stop("Error: Input n should be a positive integer") }
  if (length(n) != 1)         { stop("Error: Input n should be a single positive integer") }
  if (n != as.integer(n))     { stop("Error: Input n should be a positive integer") }
  if (n <= 0)                 { stop("Error: Input n should be a positive integer") }
  if (!is.numeric(m))         { stop("Error: Input m should be a positive integer") }
  if (length(m) != 1)         { stop("Error: Input m should be a single positive integer") }
  if (m != as.integer(m))     { stop("Error: Input m should be a positive integer") }
  if (m <= 0)                 { stop("Error: Input m should be a positive integer") }
  if (!is.numeric(occupancy)) { stop("Error: Input occupancy should be an integer") }
  k <- as.integer(occupancy)
  if (length(k) != 1)         { stop("Error: Input occupancy should be a single integer") }
  if (k != occupancy)         { stop("Error: Input occupancy should be an integer") }
  if (occupancy < 0)          { stop("Error: Input occupancy should be a non-negative integer") }
  if (occupancy > min(n, m))  { stop("Error: Input occupancy cannot be larger than n or m") }

  # Set test content
  method      <- 'Classical occupancy test'
  data.name   <- paste0('Occupancy number ', occupancy, ' from allocating ', n,
                        ' balls to ', m, ' bins')
  alternative <- 'Allocation distribution is non-uniform'
  statistic   <- k
  attr(statistic, 'names') <- 'occupancy number'
  p.value     <- occupancy::pocc(k, size = n, space = m)

  # Create htest object
  TEST <- list(method = method, data.name = data.name,
               null.value = NULL, alternative = alternative,
               statistic = statistic, p.value = p.value)
  class(TEST) <- 'htest'
  TEST
}
```
Here is an example of a classical occupancy test for $n = 18$ data points over $m = 12$ bins. In the example we generate data from a categorical distribution that is uniform over the categories. We get a p-value of $0.8264271$, so we do not reject the null hypothesis of uniformity.
```r
#Generate some random categorical data (for the uniform case)
set.seed(1)
DATA <- sample.int(12, size = 18, replace = TRUE)

#Compute and print occupancy number
OCC <- length(unique(DATA))
OCC
[1] 10

#Conduct the classical occupancy test
occupancy.test(n = 18, m = 12, occupancy = OCC)

        Classical occupancy test

data:  Occupancy number 10 from allocating 18 balls to 12 bins
occupancy number = 10, p-value = 0.8264
alternative hypothesis: Allocation distribution is non-uniform
```
Here is another example of a classical occupancy test for $n = 18$ data points over $m = 12$ bins. In the example we generate data from a random categorical distribution where the probabilities are concentrated on only a few bins. We get a p-value of $0.0001046$, so we reject the null hypothesis of uniformity.
```r
#Set non-uniform bin probabilities (concentrated on only a few bins)
PROBS <- c(0.01, 0.31, 0.01, 0.01, 0.01, 0.01, 0.44, 0.16, 0.01, 0.01, 0.01, 0.01)

#Generate some random categorical data (for the non-uniform case)
set.seed(1)
DATA2 <- sample.int(12, size = 18, replace = TRUE, prob = PROBS)

#Compute and print occupancy number
OCC2 <- length(unique(DATA2))
OCC2
[1] 5

#Compute and print p-value for classical occupancy test
occupancy.test(n = 18, m = 12, occupancy = OCC2)

        Classical occupancy test

data:  Occupancy number 5 from allocating 18 balls to 12 bins
occupancy number = 5, p-value = 0.0001046
alternative hypothesis: Allocation distribution is non-uniform
```
