I'm using 2010 Census data on race, which covers (to the best of the Census Bureau's ability) the complete population of the U.S. rather than just a sample.

I've divided the U.S. into four geographic areas based on fire risk, and then calculated, for each race, the share of that race's total U.S. population living in each area (e.g. 20% of the U.S. White population and 7% of the U.S. Asian population live in Area A).

Initially, I wanted to compare these ratios in each area and see if there were significant differences between them and their expected values. This would help test my hypothesis that certain racial groups live disproportionately in areas with higher fire risk.

An example: if 15% of the U.S. population lives in Area A, we would expect 15% of Race A and 15% of Race B to live there. If 25% of Race B lives there, and that is significantly greater than the expected 15%, that might mean Race B lives disproportionately in Area A.
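To make the arithmetic of this example concrete, here is a minimal sketch. The counts are hypothetical (not real Census figures) and were picked only so that the overall share comes out at 15% and Race B's share at 25%; the names `total_pop` and `area_a_pop` are made up for illustration.

```python
# Hypothetical counts, NOT real Census data: chosen to reproduce the 15% / 25% example.
total_pop = {"Race A": 800_000, "Race B": 200_000}   # whole-U.S. population by race
area_a_pop = {"Race A": 100_000, "Race B": 50_000}   # residents of Area A by race

# Share of the entire population living in Area A (the "expected" share for every race).
overall_share = sum(area_a_pop.values()) / sum(total_pop.values())

# Observed share of each race living in Area A, and its ratio to the expected share.
ratios = {}
for race in total_pop:
    observed_share = area_a_pop[race] / total_pop[race]
    ratios[race] = observed_share / overall_share
    print(f"{race}: observed {observed_share:.1%}, expected {overall_share:.1%}, "
          f"ratio {ratios[race]:.2f}")
```

A ratio above 1 means the group is over-represented in Area A relative to the population as a whole; a ratio below 1 means it is under-represented.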

The problem is that because these are proportions of each race's total U.S. population, not proportions within the area of interest, the expected percentages for an area don't add up to 1. So a chi-square test doesn't seem to work.

My question: is chi-square even necessary? These are counts of the entire population, not a sample, so there isn't any random sampling variation that could account for the differences. The differences are true.

la_leche
  • The role of random variation in population statistics is a matter of some judgement. – David Smith Jun 09 '17 at 18:36
  • If you can elaborate on your objective ("make a statement about race and fire risk"?) a bit, you may get more useful answers. – GeoMatt22 Jun 09 '17 at 19:23
  • @GeoMatt22 Good suggestion, I've amended my question to be more clear. – la_leche Jun 09 '17 at 20:56
  • See https://stats.stackexchange.com/q/68886/17230. – Scortchi - Reinstate Monica Jun 09 '17 at 21:20
  • Your null model seems like the following. You are given a pile of balls that vary in color but are the same size, and a set of 4 buckets that vary in size but can just hold all the balls when taken together. Then iterate as follows until the pile is empty: pick a ball at random from the pile, and place it in a randomly chosen un-filled bucket. So bucket assignments are exhaustive, but ignore color. There are multiple assignments consistent with this scheme, and your observed assignment is among them, with some probability. So there is randomness. – GeoMatt22 Jun 09 '17 at 21:21
  • If you cannot distinguish between finite and infinite populations in statistics, you will be confused by this kind of problem forever. – user158565 Jun 10 '17 at 05:01

2 Answers

The role of random variation in population statistics is a matter of some judgement.

Many people believe that possible variation in complete counts, including the census, has no bearing on analysis.

I believe that if you just want to count and describe what happened, and aren't really interested in a model, reasons for variation, or likely future patterns of events, then you don't need to account for any random variation or, perhaps more accurately, deviation from a model.

I have never had a problem of that purely descriptive kind. Nearly all analyses of census data need to account for variation over time in the population, particularly in small subpopulations.

If you are interested in making any statements that generalize in any way beyond the one time experience of a population, then measuring and accounting for variation, often called random, is necessary.

There are a few circumstances in which you are only interested in what did happen, for example if you are a tax collector. But even a tax collector wants to make a budget for next year and has to take some account of possible change.

David Smith

This is not really my area, but I hope the following may be helpful (and not misleading!).

It is important to distinguish between uncertainty due to sampling error and that due to random assignment. If all population counts are exact there is no sampling error. But if all people were randomly assigned to a region, without regard to demographic categories, there would still be variability in the region demographics.

It is this second type of randomness that is addressed by the chi-squared test.
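A small simulation may make the random-assignment idea more tangible. This is only a sketch on a hypothetical toy population (1,000 people, 200 of them in a group labelled "B", with a region that holds 15% of everyone); the function name `share_of_b_in_region_a` is made up for illustration. Every count is exact, yet the share of group B landing in the region still varies from one random assignment to the next.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy population: 800 people in group "A", 200 in group "B".
labels = ["A"] * 800 + ["B"] * 200
region_a_size = 150  # region A holds 15% of the whole population


def share_of_b_in_region_a(labels, region_a_size):
    """Fill region A at random, ignoring group, and return the
    fraction of all group-B members who ended up there."""
    in_region = random.sample(labels, region_a_size)
    return in_region.count("B") / labels.count("B")


# Repeat the random assignment many times to see how much the share moves around.
shares = [share_of_b_in_region_a(labels, region_a_size) for _ in range(2000)]
mean_share = sum(shares) / len(shares)
print(f"mean share of B in region A: {mean_share:.3f}")
print(f"range over simulations: {min(shares):.3f} to {max(shares):.3f}")
```

The mean should sit close to the 15% expected under random assignment, while individual runs scatter around it. An observed share is then judged against that scatter, not against sampling error.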

For the simple case of 2 regions and 2 demographic categories, you could use an exact test that fully accounts for finite-population effects. But for contingency tables larger than 2 by 2, exact tests become less computationally feasible, so the chi-squared approximation is commonly used.
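As a rough illustration of the two approaches, here is a standard-library sketch that computes the Pearson chi-squared statistic for a table of counts (rows = regions, columns = demographic categories) and a two-sided exact test for the 2 by 2 case based on the hypergeometric distribution. The table values are hypothetical, and the helper names are mine. In practice a library routine such as `scipy.stats.chi2_contingency` or `scipy.stats.fisher_exact` would be the more usual choice. Note that the test is run on counts, not on percentages.

```python
from math import comb, erfc, sqrt


def chi2_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts
    (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def fisher_exact_two_sided(table):
    """Two-sided exact p-value for a 2 by 2 table: sum the hypergeometric
    probabilities of all tables (same margins) no more likely than the observed one."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical 2 regions x 2 categories table of counts.
table = [[12, 8], [5, 15]]
stat, dof = chi2_statistic(table)
p_chi2 = erfc(sqrt(stat / 2))  # chi-squared tail probability, valid for dof == 1 only
p_exact = fisher_exact_two_sided(table)
print(f"chi-squared = {stat:.3f} (dof = {dof}), approximate p = {p_chi2:.4f}")
print(f"exact two-sided p = {p_exact:.4f}")
```

The `erfc` shortcut for the p-value only applies with one degree of freedom; for the 4 areas by several races table in the question, the statistic is computed the same way but the p-value needs the chi-squared distribution with more degrees of freedom.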

GeoMatt22