i have data on how a large population (N ~ 1e8) is distributed into (many) categories (i.e. i have the count of instances in each category).

some categories have low counts, many categories have a high number of counts. there is no meaningful ordering of the categories.

i also have information on how a particularly selected subset (n ~ 10k) from this population is distributed into the same categories (though the subset has zero counts in some of the population's categories).

i want to test the research hypothesis that the subset has a different distribution into the categories from that of the population. my null hypothesis is that the subset is a uniform random sample from the given population.

should the null hypothesis be rejected, i would furthermore like to identify which of the categories are significantly under/over represented in the subset compared to the population.

to this end i have tried this:

  • trimmed the set of categories under consideration to only include the categories realised by the subset.
  • computed the category ranking of each set
  • tried to fit the problem into a friedman test.

now, my questions to you are:

  • what is the most appropriate test statistic for the given hypothesis?
  • does the friedman test apply here?
  • how would you find which categories are over or under populated assuming the distributions are found to differ?

1 Answer

First, I don't think you mean a "uniform" random sample from a larger population; I think you mean a "simple random sample".

Second, I would not trim the population to match the sample.

Third, I wouldn't rank the results; that throws away information.

Fourth, I don't really see how a Friedman test applies here.

Fifth, I would consider a one-way chi-square (goodness-of-fit) test, treating the population proportions as fixed and testing whether the sample matches them. You can then try partitioning chi-square, see e.g. this thread.
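As a rough sketch of the one-way chi-square test described above, assuming Python with NumPy and SciPy: the counts are made up for illustration (they are not the asker's data), and the per-category standardized residuals are a simpler stand-in for a formal partitioning of chi-square, not the method from the linked thread.

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical counts, for illustration only.
population_counts = np.array([5_000_000, 3_000_000, 1_500_000, 500_000])
subset_counts = np.array([520, 290, 120, 70])

n = subset_counts.sum()
# Population proportions are treated as fixed (known) under the null.
p = population_counts / population_counts.sum()
expected = n * p  # expected subset counts under simple random sampling

# One-way (goodness-of-fit) chi-square test; df = number of categories - 1.
stat, p_value = chisquare(f_obs=subset_counts, f_exp=expected)
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")

# Standardized residuals: under the null, Var(O_i) = n * p_i * (1 - p_i),
# so each residual is approximately standard normal.
std_resid = (subset_counts - expected) / np.sqrt(expected * (1 - p))
for i, r in enumerate(std_resid):
    direction = "over" if r > 0 else "under"
    print(f"category {i}: residual = {r:+.2f} ({direction}-represented)")
```

Categories whose residuals are large in absolute value are the candidates for over- or under-representation; with many categories, some correction for multiple comparisons (for example Bonferroni) would be needed before calling any of them significant.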

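Finally, I would be leery of p values and look at effect size: with a sample of about 10k, even very small departures from the population proportions can come out statistically significant.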
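One possible effect size for this setting is Cohen's w, which compares the sample proportions with the population proportions and does not grow with sample size. The sketch below is only an illustration, assuming NumPy and using made-up counts; the choice of Cohen's w is an assumption here, not something fixed by the problem.

```python
import numpy as np

# Hypothetical counts, for illustration only.
population_counts = np.array([5_000_000, 3_000_000, 1_500_000, 500_000])
subset_counts = np.array([520, 290, 120, 70])

p_pop = population_counts / population_counts.sum()
p_sub = subset_counts / subset_counts.sum()

# Cohen's w = sqrt(sum((p_observed - p_expected)^2 / p_expected)),
# equivalently sqrt(chi2 / n) for the goodness-of-fit statistic.
cohens_w = np.sqrt(np.sum((p_sub - p_pop) ** 2 / p_pop))
print(f"Cohen's w = {cohens_w:.3f}")

# Per-category ratio of sample share to population share:
# > 1 means over-represented, < 1 means under-represented.
ratio = p_sub / p_pop
print(np.round(ratio, 3))
```

Cohen's conventional benchmarks are roughly 0.1 (small), 0.3 (medium), and 0.5 (large), so a result can be clearly significant while the effect is still small.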

Peter Flom
  • hello peter, thank you for your answer (i was not expecting one so soon!). yes, i meant a simple random sample. i take your point on not trimming, but believe i will still need some trimming, as even in the population some categories contain only few observations so their expected count in the sample is tiny, <<1. perhaps it is cleaner to collect all such categories/counts into a new bin (labelled 'other'). thanks for the link to the partitioned chi square test - i took a look at the links therein and will consider those methods for identifying the outliers. – Óskar Halldórsson Holm Jan 01 '19 at 16:34