Is chi-squared necessary if comparing entire populations?

Question

I'm using 2010 Census data on race, which covers (as far as the Census Bureau can manage) the complete population of the U.S. rather than just a sample.

I've divided the U.S. into four geographic areas based on fire risk, and then calculated, for each race, the share of its total U.S. population that lives in each area (e.g. 20% of the U.S. White population and 7% of the U.S. Asian population live in Area A).

Initially, I wanted to compare these ratios in each area and see if there were significant differences between them and their expected values. This would help test my hypothesis that certain racial groups live disproportionately in areas with higher fire risk.

An example: if 15% of the U.S. population lives in Area A, we would expect 15% of Race A and 15% of Race B to live there. If 25% of Race B lives there, and that is significantly greater than the expected 15%, that might mean Race B lives disproportionately in Area A.
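To make the comparison concrete, here is a minimal sketch with made-up counts (not actual Census figures; the group names, area layout, and function names are all hypothetical). It computes each area's share of the total population, then divides each group's observed share in an area by that expected share, so a ratio above 1 means the group is over-represented there.

```python
# Hypothetical counts, invented for illustration only (not Census data).
# Rows: racial groups; columns: the four fire-risk areas A, B, C, D.
counts = {
    "Race A": [150, 350, 300, 200],
    "Race B": [250, 250, 300, 200],
}

def area_shares(counts):
    """Share of the total population living in each area (the expected share)."""
    n_areas = len(next(iter(counts.values())))
    area_totals = [sum(row[j] for row in counts.values()) for j in range(n_areas)]
    grand_total = sum(area_totals)
    return [t / grand_total for t in area_totals]

def representation_ratios(counts):
    """Each group's observed share in an area divided by the expected share."""
    expected = area_shares(counts)
    ratios = {}
    for group, row in counts.items():
        group_total = sum(row)
        ratios[group] = [(c / group_total) / e for c, e in zip(row, expected)]
    return ratios

shares = area_shares(counts)
ratios = representation_ratios(counts)
```

With these invented numbers, 20% of everyone lives in Area A, but 25% of Race B does, so its ratio for Area A is 1.25.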

The problem is that because these are proportions of each race's population in the whole U.S., not just in the area of interest, the expected percentages don't sum to 1. So I don't think chi-square would work.

My question: is chi-square even necessary? These are counts of the entire population, not a sample, so there isn't any random sampling variation that could account for the differences. The differences are true.

The role of random variation in population statistics is a matter of some judgement. — David Smith, Jun 09 '17 at 18:36
If you can elaborate on your objective ("make a statement about race and fire risk"?) a bit, you may get more useful answers. — GeoMatt22, Jun 09 '17 at 19:23
@GeoMatt22 Good suggestion, I've amended my question to be more clear. — la_leche, Jun 09 '17 at 20:56
Your null model seems like the following. You are given a pile of balls that vary in color but are the same size, and a set of 4 buckets that vary in size but can just hold all the balls when taken together. Then iterate as follows until the pile is empty: pick a ball at random from the pile, and place it in a randomly chosen un-filled bucket. So bucket assignments are exhaustive, but ignore color. There are multiple assignments consistent with this scheme, and your observed assignment is among them, with some probability. So there is randomness. — GeoMatt22, Jun 09 '17 at 21:21
If you cannot distinguish between finite-population and infinite-population settings in statistics, you will be confused by this kind of problem forever. — user158565, Jun 10 '17 at 05:01

David Smith · Answer 1 · 2017-06-09T21:44:47.020

The role of random variation in population statistics is a matter of some judgement.

Many people believe that possible variation in complete counts, including the census, has no bearing on analysis.

I believe that if you just want to count and describe what happened, and aren't really interested in a model, reasons for variation, or likely future patterns of events, then you don't need to account for any random variation or, perhaps more accurately, deviation from a model.

I have never had a problem like this myself. Nearly all analyses of census data need to account for variation over time in the population, particularly in small subpopulations.

If you are interested in making any statements that generalize in any way beyond the one-time experience of a population, then measuring and accounting for variation, often called random, is necessary.

There are a few circumstances in which you are only interested in what did happen, for example if you are a tax collector. But even a tax collector wants to make a budget for next year and has to take some account of possible change.


For the OP, say all counts are exact, and populations were randomly assigned to regions without considering demographics. Shouldn't there still be randomness in region demographics? (see my comment to OP) — GeoMatt22, Jun 09 '17 at 21:41

score 1 · Answer 2 · answered Jun 09 '17 at 23:48

This is not really my area, but I hope the following may be helpful (and not misleading!).

It is important to distinguish between uncertainty due to sampling error and that due to random assignment. If all population counts are exact there is no sampling error. But if all people were randomly assigned to a region, without regard to demographic categories, there would still be variability in the region demographics.
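A small simulation may help illustrate this. The sketch below uses hypothetical group and region sizes and a simplified stand-in for random assignment: shuffle everyone's group label, then fill the regions in order, so region sizes are fixed and group membership is ignored. The function name and all numbers are my own assumptions.

```python
import random

def simulate_null(group_sizes, region_sizes, rng):
    """Assign individuals to regions at random, ignoring group labels.

    Returns table[group][region] with the resulting counts. Region sizes
    and group sizes are held fixed; only the assignment varies.
    """
    if sum(group_sizes) != sum(region_sizes):
        raise ValueError("group and region sizes must cover the same population")
    # One label per individual, then a random permutation of the labels.
    labels = [g for g, n in enumerate(group_sizes) for _ in range(n)]
    rng.shuffle(labels)
    table = [[0] * len(region_sizes) for _ in group_sizes]
    start = 0
    for r, size in enumerate(region_sizes):
        for g in labels[start:start + size]:
            table[g][r] += 1
        start += size
    return table

rng = random.Random(0)                # fixed seed so runs are repeatable
group_sizes = [700, 300]              # hypothetical group totals
region_sizes = [150, 350, 300, 200]   # hypothetical region totals
tables = [simulate_null(group_sizes, region_sizes, rng) for _ in range(200)]

# Count of group 1 landing in region 0 across the simulated assignments.
group1_in_region0 = [t[1][0] for t in tables]
```

Even though every count is exact, `group1_in_region0` differs from one simulated assignment to the next, and that spread is the variability in question.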

It is this second type of randomness that is addressed by the chi-squared test.
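As a rough sketch of what the test computes, assuming a made-up table of counts (rows are groups, columns are regions; none of these numbers are Census figures), the Pearson statistic compares each observed count with the count expected if region were independent of group:

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical 2-by-4 table: two groups across four regions.
observed = [
    [150, 350, 300, 200],
    [250, 250, 300, 200],
]
stat, dof = chi_squared_statistic(observed)
```

The statistic would then be compared against a chi-squared distribution with `dof` degrees of freedom. Note that the input is a table of counts rather than percentages.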

For the simple case of 2 regions and 2 demographic categories, you could use an exact test that fully accounts for finite-population effects. But for contingency tables larger than 2 by 2, exact tests become less computationally feasible, so the chi-squared approximation is commonly used.
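For the 2 by 2 case, one such exact test is Fisher's exact test, which treats the top-left cell as hypergeometric with the row and column totals held fixed. The following is a minimal sketch with a toy table; the function name is my own, and the two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is one common convention among several.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k,
        # given the fixed row and column totals.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )

# Toy table: 2 regions (rows) by 2 demographic categories (columns).
p_value = fisher_exact_2x2(3, 1, 1, 3)
```

For tables larger than 2 by 2, the number of tables to enumerate grows quickly, which is why the chi-squared approximation is usually preferred there.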
