
I'm trying to think of a way to measure whether a categorical distribution of any size is concentrated in only a few bins, i.e. not uniform. The best way I can think of is checking entropy, but that's hard to assess: it only tells me whether the distribution is close to uniform or not, unless there's something I'm missing. I've also heard about kurtosis, but that's more a measure of tailedness. What I'd like is, for example, if I have 5 classes and 2 of the classes make up 80% of the distribution, a metric that can reflect this.

user8714896
    You stated you wanted a metric that measures against uniformity of the bins. Therefore your expected size of each bin is equal, then you can measure against the observed sizes. See Chi-squared test. Whether this makes sense probably depends on the context. – epp Nov 13 '21 at 03:03
    Maybe look at Wikipedia on [diversity indexes](https://en.wikipedia.org/wiki/Diversity_index), especially Simpson Index. – BruceET Nov 13 '21 at 08:59
    What's wrong with entropy? Or with the sum of squared probabilities (or its complement, or its reciprocal) invented and re-invented by many under different names for about a century? – Nick Cox Nov 13 '21 at 10:40
  • So I am not sure if you have a misunderstanding, but statistical metrics are typically deviations from 'boring', the null hypothesis. So e.g. you would measure deviation from uniform, rather than 'clumpiness' directly. As others have said, have a look at chi-squared tests, and see also G-tests (https://en.wikipedia.org/wiki/G-test), which might make the connection to entropy clearer. – seanv507 Nov 14 '21 at 18:54

3 Answers


Imagine your categorical variable (with many levels) is species (in a biostatistics context). Then your question can be formulated as asking how to measure biodiversity. The same question arises in many contexts, for instance in economics with income inequality. So you are asking for a way of measuring diversity or inequality.

There are many such indices, for instance the Gini coefficient or the Simpson index; there are other posts on this site about diversity indices.
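As a quick illustration (sketched in Python rather than the R used elsewhere in this thread), here are three such indices computed for the hypothetical 5-class example from the question, where two classes carry 80% of the mass:

```python
import math

p = [0.4, 0.4, 0.1, 0.05, 0.05]   # hypothetical class proportions from the question

# Simpson index: sum of squared proportions; equals 1/k for a uniform
# k-class distribution and approaches 1 as mass concentrates in one class.
# (Ecologists often report the complement 1 - sum(p^2) instead.)
simpson = sum(q * q for q in p)

# Normalized Shannon entropy: 1 for uniform, 0 for a point mass
entropy = -sum(q * math.log(q) for q in p if q > 0) / math.log(len(p))

# Gini coefficient of the proportions: 0 for perfectly equal shares
sorted_p = sorted(p)
n = len(sorted_p)
gini = sum((2 * i - n + 1) * q for i, q in enumerate(sorted_p)) / (n * sum(sorted_p))

print(simpson, entropy, gini)
```

For these proportions the Simpson index is 0.335 (versus 0.2 for uniform over 5 classes), the normalized entropy is about 0.78 (versus 1 for uniform), and the Gini coefficient is 0.42 (versus 0 for uniform) — each reflects the concentration directly, without a hypothesis test.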

kjetil b halvorsen
    This is the only answer so far that answers the question. Others change the question to how to test given some null hypothesis, e.g. uniformity. – Nick Cox Dec 17 '21 at 10:35

If you want to know whether several categories have significantly different proportions of the data, then you might use prop.test in R. (Because this test is much the same as a chi-squared test, this is essentially a repeat of the suggestion in @epp's comment.)

For example, if you have four categories with counts 10, 42, 20, and 15 out of 87, then you could use prop.test as below. The very small P-value tells you that there are significant differences among the proportions 0.115, 0.483, 0.230, and 0.172.

barplot(c(10, 42, 20, 15))

[bar plot of the counts 10, 42, 20, 15]

prop.test(c(10, 42, 20, 15), rep(87, 4))

        4-sample test for equality of proportions 
        without continuity correction

data:  c(10, 42, 20, 15) out of rep(87, 4)
X-squared = 36.582, df = 3, p-value = 5.639e-08
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4 
0.1149425 0.4827586 0.2298851 0.1724138 
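(For comparison, here is a sketch in Python of a plain one-sample chi-squared goodness-of-fit test of these counts against equal expected counts. This is not the same setup as prop.test's four-sample comparison, so the statistic differs from the 36.582 above, but it leads to the same conclusion.)

```python
import math

counts = [10, 42, 20, 15]
expected = sum(counts) / len(counts)   # 87/4 = 21.75 per category under uniformity

# Pearson chi-squared statistic against the uniform expectation
stat = sum((c - expected) ** 2 / expected for c in counts)

# Upper-tail p-value of the chi-squared distribution with df = 3
# (closed form available for odd degrees of freedom)
p = math.erfc(math.sqrt(stat / 2)) + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2)
print(stat, p)   # tiny p-value: the counts are far from uniform
```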

Thus, it is worth looking to see whether the largest proportion is markedly larger than the next largest one.

prop.test(c(42, 20), c(62, 62))$p.val
[1] 0.0001621318

To avoid false discovery from repeated tests on the same data (ad hoc testing), it may be best to answer Yes only if the P-value is less than 5%/4 = 1.25%.

By contrast, it is not clear to me what you would think of the data 10, 32, 30, and 15. Here the proportions are clearly different, but the largest is not significantly larger than the next largest. Is there a "peak"?

[bar plot of the counts 10, 32, 30, 15]

prop.test(c(10, 32, 30, 15), rep(87, 4))$p.val
[1] 6.943152e-05
prop.test(c(32,30), c(62,62))$p.val
[1] 0.8574624

Sometimes the ad hoc P-value may help you decide, but not always. If you have 100 times as much data, "everything" is significant [and the bar plot (omitted) looks just the same except for the numbers on the vertical axis.]

prop.test(c(1000, 3200, 3000, 1500), rep(8700, 4))$p.val
[1] 0
prop.test(c(3200,3000), c(6200,6200))$p.val
[1] 0.0003513735

In general, I wouldn't want to use the chi-squared test statistic as a 'metric'. See @NickCox's comment below.

BruceET
    The chi-square test here is, evidently, a test of uniformity, but it's not an especially good measure. Even in its own terms a chi-square statistic can't be assessed without knowing the associated number of degrees of freedom. – Nick Cox Nov 13 '21 at 10:38
  • Agreed. Hence my final paragraph. – BruceET Nov 13 '21 at 10:40
    Glad we agree, but I think the point deserves a little emphasis. – Nick Cox Nov 13 '21 at 10:40

You can use the classical occupancy test

Another test you can use here is the classical occupancy test, which uses the classical occupancy distribution (see e.g., O'Neill 2021). If the true distribution is uniform over the categories, then the number of occupied bins with $n$ balls allocated over $m$ bins follows the classical occupancy distribution. Moreover, any deviation from uniformity will tend to decrease the number of occupied bins, since it tends to cause balls to concentrate in a smaller number of bins. Consequently, the occupancy number can be used as a test statistic, with lower values more conducive to the alternative hypothesis of non-uniformity.
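As a sketch of the underlying distribution (in Python, using only the standard library; the implementation below instead relies on the occupancy R package): the classical occupancy pmf is $P(K = k) = \binom{m}{k}\,\mathrm{Surj}(n,k)/m^n$, where $\mathrm{Surj}(n,k) = k!\,S(n,k)$ counts surjections of $n$ balls onto $k$ labelled bins, and the test's p-value is the lower tail $P(K \le k_{obs})$:

```python
from math import comb

def surjections(n, k):
    # Number of onto maps from n labelled balls to k labelled bins,
    # by inclusion-exclusion (equals k! times a Stirling number of the 2nd kind)
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))

def occupancy_pvalue(n, m, k_obs):
    # Lower-tail probability P(K <= k_obs) under uniform allocation
    total = m ** n
    return sum(comb(m, k) * surjections(n, k) for k in range(1, k_obs + 1)) / total

# The example used below: 18 balls in 12 bins, 10 bins occupied
print(occupancy_pvalue(18, 12, 10))
```

For 18 balls, 12 bins, and 10 occupied bins this reproduces the p-value of about 0.826 reported in the example below.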


Implementation: Here is some R code implementing the classical occupancy test. The test computes the p-value using the occupancy distribution from the occupancy package. The function takes a value n for the number of balls, m for the number of bins, and occupancy for the observed occupancy number. The null hypothesis of the test is that the allocation is uniform and the alternative hypothesis is that it is non-uniform.

occupancy.test <- function(n, m, occupancy) {

  #Check inputs
  if (!is.numeric(n))         { stop("Error: Input n should be a positive integer") }
  if (length(n) != 1)         { stop("Error: Input n should be a single positive integer") }
  if (n != as.integer(n))     { stop("Error: Input n should be a positive integer") }
  if (n <= 0)                 { stop("Error: Input n should be a positive integer") }
  if (!is.numeric(m))         { stop("Error: Input m should be a positive integer") }
  if (length(m) != 1)         { stop("Error: Input m should be a single positive integer") }
  if (m != as.integer(m))     { stop("Error: Input m should be a positive integer") }
  if (m <= 0)                 { stop("Error: Input m should be a positive integer") }
  if (!is.numeric(occupancy)) { stop("Error: Input occupancy should be an integer") }
  if (length(occupancy) != 1) { stop("Error: Input occupancy should be a single integer") }
  k <- as.integer(occupancy)
  if (k != occupancy)         { stop("Error: Input occupancy should be an integer") }
  if (k < 0)                  { stop("Error: Input occupancy should be a non-negative integer") }
  if (k > min(n, m))          { stop("Error: Input occupancy cannot be larger than n or m") }

  #Set test content
  method      <- 'Classical occupancy test'
  data.name   <- paste0('Occupancy number ', k, ' from allocating ', n,
                        ' balls to ', m, ' bins')
  alternative <- 'Allocation distribution is non-uniform'
  statistic   <- k
  attr(statistic, 'names') <- 'occupancy number'
  p.value     <- occupancy::pocc(k, size = n, space = m)

  #Create htest object
  TEST <- list(method = method, data.name = data.name,
               null.value = NULL, alternative = alternative,
               statistic = statistic, p.value = p.value)
  class(TEST) <- 'htest'
  TEST
}

Here is an example of a classical occupancy test for $n = 18$ data points over $m = 12$ bins. In the example we generate data from a random categorical distribution using uniformity over the categories. We get a p-value of $0.8264271$, so we do not reject the null hypothesis of uniformity.

#Generate some random categorical data (for the uniform case)
set.seed(1)
DATA <- sample.int(12, size = 18, replace = TRUE)

#Compute and print occupancy number
OCC <- length(unique(DATA))
OCC
[1] 10

#Conduct the classical occupancy test
occupancy.test(n = 18, m = 12, occupancy = OCC)

        Classical occupancy test

data:  Occupancy number 10 from allocating 18 balls to 12 bins
occupancy number = 10, p-value = 0.8264
alternative hypothesis: Allocation distribution is non-uniform

Here is another example of a classical occupancy test for $n = 18$ data points over $m = 12$ bins. In the example we generate data from a random categorical distribution where the probabilities are concentrated on only a few bins. We get a p-value of $0.0001046$, so we reject the null hypothesis of uniformity.

#Set non-uniform bin probabilities (concentrated on only a few bins)
PROBS <- c(0.01, 0.31, 0.01, 0.01, 0.01, 0.01, 0.44, 0.16, 0.01, 0.01, 0.01, 0.01)

#Generate some random categorical data (for the non-uniform case)
set.seed(1)
DATA2 <- sample.int(12, size = 18, replace = TRUE, prob = PROBS)

#Compute and print occupancy number
OCC2 <- length(unique(DATA2))
OCC2
[1] 5

#Compute and print p-value for classical occupancy test
occupancy.test(n = 18, m = 12, occupancy = OCC2)

        Classical occupancy test

data:  Occupancy number 5 from allocating 18 balls to 12 bins
occupancy number = 5, p-value = 0.0001046
alternative hypothesis: Allocation distribution is non-uniform
Ben