testing whether categorical distributions differ

Question

i have data on how a large population (N ~ 1e8) is distributed into (many) categories (i.e. i have the count of instances in each category).

some categories have low counts, many have high counts. there is no meaningful ordering of the categories.

i also have information on how a particularly selected subset (n ~ 10k) from this population is distributed into the same categories (though the subset has zero counts in some of the population's categories).

i want to test the research hypothesis that the subset has a different distribution into the categories from that of the population. my null hypothesis is that the subset is a uniform random sample from the given population.

should the null hypothesis be rejected, i would furthermore like to identify which of the categories are significantly under/over represented in the subset compared to the population.

to this end i have tried this:

trimmed the set of categories under consideration to only include the categories realised by the subset.
computed the category ranking of each set
tried to fit the problem into a friedman test.

now, my questions to you are:

what is the most appropriate test statistic for the given hypothesis?
does the friedman test apply here?
how would you find which categories are over or under populated assuming the distributions are found to differ?

score 0 · Accepted Answer · answered Jan 01 '19 at 14:01

First, I don't think you mean a "uniform" random sample from a larger population; I think you mean a "simple random sample".

Second, I would not trim the population to match the sample.

Third, I wouldn't rank the results; that throws away information.

Fourth, I don't really see how a Friedman test applies here.

Fifth, I would consider a one-way chi-square (goodness-of-fit) test, treating the population proportions as fixed and seeing whether the sample matches them. You can then try partitioning chi-square; see e.g. this thread.
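A minimal sketch of the one-way chi-square test in Python, assuming SciPy is available. The counts are made up for illustration and are not the asker's data. The per-category Pearson residuals at the end are a simpler stand-in for spotting over- and under-represented categories; they are not the partitioning procedure mentioned above.

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical counts, for illustration only.
population_counts = np.array([5_000_000, 3_000_000, 1_500_000, 500_000], dtype=float)
sample_counts = np.array([480, 330, 140, 50], dtype=float)

# Expected sample counts if the subset were a simple random sample:
# the sample size times each category's population proportion.
n = sample_counts.sum()
expected = n * population_counts / population_counts.sum()

# One-way (goodness-of-fit) chi-square test with k - 1 degrees of freedom.
stat, p_value = chisquare(f_obs=sample_counts, f_exp=expected)
print(f"chi-square = {stat:.3f}, p = {p_value:.3f}")

# Pearson residuals: positive means over-represented in the sample,
# negative means under-represented.
residuals = (sample_counts - expected) / np.sqrt(expected)
print(residuals)
```

Since the population (1e8) is far larger than the sample (10k), treating the population proportions as fixed, as the test does, seems reasonable here.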

Finally, I would be leery of p values and look at effect size.
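One common effect size for this setting is Cohen's w; the answer does not name a specific measure, so this choice is an assumption. The sketch below uses made-up counts and only the standard library.

```python
import math

# Hypothetical counts, for illustration only.
population_counts = [5_000_000, 3_000_000, 1_500_000, 500_000]
sample_counts = [480, 330, 140, 50]


def cohens_w(observed, reference):
    """Cohen's w: sqrt of the sum over categories of (p_obs - p_ref)^2 / p_ref."""
    n_obs = sum(observed)
    n_ref = sum(reference)
    total = 0.0
    for obs, ref in zip(observed, reference):
        p_obs = obs / n_obs
        p_ref = ref / n_ref
        total += (p_obs - p_ref) ** 2 / p_ref
    return math.sqrt(total)


w = cohens_w(sample_counts, population_counts)
print(f"Cohen's w = {w:.4f}")
```

Cohen's w equals sqrt(chi-square / n), so unlike the p value it does not grow just because the sample is large. Cohen's conventional benchmarks are roughly 0.1 (small), 0.3 (medium) and 0.5 (large), though whether a given w matters depends on the application.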

Peter Flom

hello peter, thank you for your answer (i was not expecting one so soon!). yes, i meant a simple random sample. i take your point on not trimming, but believe i will still need some trimming, as even in the population some categories contain only a few observations, so their expected count in the sample is tiny, <<1. perhaps it is cleaner to collect all such categories/counts into a new bin (labelled 'other'). thanks for the link to the partitioned chi square test - i took a look at the links therein and will consider those methods for identifying the outliers. – Óskar Halldórsson Holm Jan 01 '19 at 16:34
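The pooling idea in this comment could be sketched as follows. The function name `pool_sparse`, the threshold of 5 expected counts (a common rule of thumb, not something stated in the thread) and the example counts are all assumptions for illustration. It also assumes no existing category is already named 'other'.

```python
def pool_sparse(population, sample, min_expected=5.0):
    """Merge categories whose expected sample count is below min_expected.

    population, sample: dicts mapping category -> count. Categories missing
    from the sample are treated as zero counts.
    """
    pop_total = sum(population.values())
    n = sum(sample.values())
    pooled_pop, pooled_sample = {}, {}
    for category, pop_count in population.items():
        # Expected count under simple random sampling from the population.
        expected = n * pop_count / pop_total
        key = category if expected >= min_expected else "other"
        pooled_pop[key] = pooled_pop.get(key, 0) + pop_count
        pooled_sample[key] = pooled_sample.get(key, 0) + sample.get(category, 0)
    return pooled_pop, pooled_sample


# Hypothetical counts, for illustration only.
population = {"a": 6_000_000, "b": 3_990_000, "c": 3_000, "d": 4_000, "e": 3_000}
sample = {"a": 5_900, "b": 4_088, "c": 7, "e": 5}

pooled_pop, pooled_sample = pool_sparse(population, sample)
print(pooled_pop)
print(pooled_sample)
```

The decision to pool depends only on the population counts and the sample size, not on the observed sample counts, so it does not peek at the outcome being tested. The pooled 'other' bin can still end up with a small expected count, which is worth checking before running the test.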
