26

From what I understand, hypothesis testing is done to identify whether a finding in a sample is statistically significant. But if I have census data, do we really need hypothesis tests?

I was thinking that maybe I should draw multiple random samples from the census data and see whether there is any random behavior.

  • 6
    No, there is no hypothesis testing if you have the whole population; the data are exactly what they show. Whether that is "significant" is up to you to decide. – user2974951 Jul 21 '20 at 05:24
  • 3
    But you might use statistical ideas to summarize or graph the data. – BruceET Jul 21 '20 at 06:11
  • 3
    How many censuses have perfectly accurate data? – whuber Jul 21 '20 at 19:32
  • How big is your data? This might be relevant [Practical Significance](https://online.stat.psu.edu/stat200/lesson/6/6.4#:~:text=Practical%20significance%20refers%20to%20the,may%20depend%20on%20the%20context.) – user2974951 Jul 22 '20 at 06:22
  • 1
    Check out randomization inference. It's a prime example of statistical inference that is not about sampling from a larger population of individuals but rather sampling possible treatment assignments applied to the same set of individuals, be it a population or sample. – Noah Jul 23 '20 at 08:21
  • 2
    Suppose I have reliable measures on *every* resident of all 50 United States. Moreover, suppose I have these measures for every year back to, say, 1962. **I still have need of statistical inference because I *do not* have any measures of *future* years in these (and any future) US states, and I care about *predicting* or *explaining* future experiences.** (Of course statistical inference is not limited to p-values, but the point still holds.) – Alexis Jul 23 '20 at 20:53

7 Answers

52

It all depends on your goal.

If you want to know how many people smoke and how many people die of lung cancer you can just count them, but if you want to know whether smoking increases the risk for lung cancer then you need statistical inference.

If you want to know high school students' educational attainments, you can just look at complete data, but if you want to know the effects of high school students' family backgrounds and mental abilities on their eventual educational attainments you need statistical inference.

If you want to know workers' earnings, you can just look at census data, but if you want to study the effects of educational attainment on earnings, you need statistical inference (you can find more examples in Morgan & Winship, Counterfactuals and Causal Inference: Methods and Principles for Social Research.)

Generally speaking, if you are only looking for summary statistics in order to communicate the largest amount of information as simply as possible, you can just count, sum, divide, plot etc.

But if you wish to predict what will happen, or to understand what causes what, then you need statistical inference: assumptions, paradigms, estimation, hypothesis testing, model validation, etc.

Sergio
  • 6
    Good answer, but I'd argue that in cases where you are trying to generate predictions or develop a causal model, you are typically applying it to unseen data, so these are not actually cases where you "have all the population". Generating "predictions" on data where you already know the answer is a purely academic exercise - it's only done in practice if there are unseen members of the population. Whether smoking is associated with lung cancer is interesting mainly because we can infer something about unseen members of the population with unknown cancer status. – Nuclear Hoagie Jul 21 '20 at 20:32
  • Would you include predicting what has already happened in the set of things that require inference? Suppose I have the population of sales going back to $t=0$. I am trying to decide if today's sales are low given that history. Would I need inference to answer this purely descriptive, non-causal question? – dimitriy Jul 21 '20 at 20:49
  • @DimitriyV.Masterov What does "low" mean? It is not a purely descriptive question. And there are predictive questions and causal questions. If "low" means "less than expected", I should have predicted higher sales, and prediction is not description. – Sergio Jul 21 '20 at 21:08
  • To me, "low" means in the left tail of some prediction interval for today's data, that makes promises about its coverage. This doesn't seem causal to me, and it's not exactly a prediction about the future. – dimitriy Jul 21 '20 at 22:14
  • Given that OP has census data we might conclude that the "sample" is big, and if it is big then any hypothesis tests are likely pointless because all the results are likely to be highly significant (the point of practical significance). – user2974951 Jul 22 '20 at 06:16
  • 2
    @NuclearWang: Or from another perspective, our population (in the statistical sense) is not the current population (in the demographic sense), but all possible future populations (demographic) under some assumptions (such as equal tobacco consumption). – Wrzlprmft Jul 22 '20 at 11:58
  • I think this answer would be even better by including a more appropriate definition of statistical population, as @Wrzlprmft gives. The point being that it is actually impossible to have the whole statistical population in your examples unless you have all recorded data at the end of time. – Fnguyen Jul 22 '20 at 12:57
  • @Fnguyen It's very simple: I may apply predictions or causal models to unseen data, but this happens because _I can_. You can't predict or look for causes by merely summarizing data. Summary statistics can't be applied to unseen data. You need statistical inference. And statistical inference is always based on available data, not on future, unknown data :) – Sergio Jul 22 '20 at 18:04
  • @Sergio I fully understand, I was just adding that your answer would be even better if you added the definition of population to it. OP is basically wondering why there is unseen data in the "population data" (as in census population), which isn't the statistical population. Your answer explains very well the difference between summary and inference, but not why we still need inference even if we survey the whole (demographical) population. So no need to explain it to me again, just a suggestion to improve your answer. – Fnguyen Jul 22 '20 at 18:24
  • @Fnguyen Maybe, but "it is actually impossible to have the whole statistical population in your examples unless you have all recorded data at the end of time" does not make sense to me :) – Sergio Jul 22 '20 at 20:12
  • @Sergio take your smoking and cancer example. Why is there unseen data at all? Because the "population" isn't everyone living now but rather every smoker who has ever lived or will ever live until the end of time. If we actually had this data we would not need inference; we'd really have the whole population, and summary statistics would be all we'd need. Since we do not, we need inference to make decisions about how best to deal with statistical uncertainty and how to predict on unseen data. – Fnguyen Jul 22 '20 at 20:22
  • @Fnguyen You can see perpetual correlation between smoking and lung cancer, but that would not be causation, because there might be a confounding variable (Fisher's objection). I can't agree with you. Sorry. – Sergio Jul 22 '20 at 20:29
  • @Sergio Fair enough to stop this at this point, but it's hard for me to see how you fail to understand me, as I'm 100% agreeing with you and just adding a different aspect you haven't considered to explain what OP isn't understanding. – Fnguyen Jul 22 '20 at 21:17
  • I posted an answer expanding my line of thought (CC @Fnguyen). – Wrzlprmft Jul 23 '20 at 08:13
  • "If you want to know how many people smoke and how many people die of lung cancer you can just count them, but if you want to know whether smoking increases the risk for lung cancer then you need statistical inference." Please note that the reason for this being true is that you probably want to use those results to talk about a different population in the future, and you are using the current population as a sample of the population of "all potential humans that could exist" – David Jul 23 '20 at 08:16
20

To illustrate my points, I will assume that everybody has been asked whether they prefer Star Trek or Doctor Who and has to choose one of them (there is no neutral option). To keep things simple, let’s also assume that your census data is actually complete and accurate (which it rarely ever is).

There are some important caveats about your situation:

  1. Your demographic population hardly ever is your statistical population. In fact, I cannot think of a single example where it is reasonable to ask the kind of questions answered by statistical tests about a statistical population that is a demographic population.

    For example, suppose you want to settle once and for all the question whether Star Trek or Doctor Who is better, and you define better via the preference of everybody alive at the time of the census. You find that 1234567 people prefer Star Trek and 1234569 people prefer Doctor Who. If you want to accept this verdict as it is, no statistical test is needed.

    However, you may want to find out whether this difference reflects an actual preference or can be explained by forcing undecided people to make a random choice. For example, you can now investigate the null model that people choose between the two randomly and see how extreme a difference of 2 is for your demographic population size. In that case, your statistical population is not your demographic population, but the aggregated outcome of an infinite number of censuses performed on your current demographic population.

  2. If you have data the size of the population of a reasonably sized administrative region and for the questions usually answered by it, you should focus on effect size, not on significance.

    For example, there are no practical implications whether Star Trek is better than Doctor Who by a small margin, but you want to decide practical stuff like how much time to allot to the shows on national television. If 1234567 people prefer Star Trek and 1234569 people prefer Doctor Who, you would decide to allot both an equal amount of screen time, whether that tiny difference is statistically significant or not.

    On a side note, once you care about effect size, you may want to know its margin of error, and this can indeed be determined by the kind of random sampling you are alluding to in your question, namely bootstrapping.

  3. Using demographic populations tends to lead to pseudoreplication. Your typical statistical test assumes uncorrelated samples. In some cases you can avoid this requirement if you have good information on the correlation structure and build a null model based on it, but that’s rather the exception. Instead, for smaller samples, you avoid correlated samples by explicitly avoiding sampling two people from the same household or similar. When your sample is the entire demographic population, you cannot do this, and thus you inevitably have correlations. If you treat them as independent samples nonetheless, you commit pseudoreplication.

    In our example, people do not arrive at a preference of Star Trek or Doctor Who independently, but instead are influenced by their parents, friends, partners, etc. and their fates align. If the matriarch of some popular clan prefers Doctor Who, this is going to influence many other people thus leading to pseudoreplication. Or, if four fans are killed in a car crash on their way to a Star Trek convention, boom, pseudoreplication.
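
As a minimal sketch of the bootstrap idea from point 2 (the preference counts here are hypothetical and scaled down so the example runs quickly):

```python
import random

random.seed(1)

# Hypothetical preferences: 1 = prefers Star Trek, 0 = prefers Doctor Who.
data = [1] * 520 + [0] * 480

# Percentile bootstrap for the share preferring Star Trek.
n_boot = 2000
shares = []
for _ in range(n_boot):
    resample = random.choices(data, k=len(data))  # sample with replacement
    shares.append(sum(resample) / len(resample))

shares.sort()
lo, hi = shares[int(0.025 * n_boot)], shares[int(0.975 * n_boot)]
# (lo, hi) is a rough 95% interval for the share, quantifying the margin
# of error around the observed 52%.
```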

To give another perspective on this, let’s consider another example that avoids the second and third problem as much as possible and is somewhat more practical: Suppose you are in charge of a wildlife reserve featuring the only remaining pink elephants in the world. As pink elephants stand out (guess why they are endangered), you can easily perform a census on them. You notice that you have 50 female and 42 male elephants and wonder if this indicates a true imbalance or can be explained by random fluctuations. You can perform a statistical test with the null hypothesis that the sex of pink elephants is random (with equal probability) and uncorrelated (e.g., no monozygotic twins). But here again, your statistical population is not your ecological population, but all pink elephants ever in the multiverse, i.e., it includes infinite hypothetical replications of the experiment of running your wildlife reserve for a century (details depend on the scope of your scientific question).
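
The elephant example can be written as an exact two-sided binomial test using only the standard library (a sketch; the counts are the ones from the example above):

```python
from math import comb

n, females = 92, 50  # 50 female + 42 male pink elephants

# Null model: each elephant's sex is an independent fair coin flip,
# so P(X = k females) = C(n, k) / 2^n.  The exact two-sided p-value
# sums P(X = k) over all outcomes at least as extreme as the observed
# one, i.e. all k with P(X = k) <= P(X = 50).
p_value = sum(comb(n, k) for k in range(n + 1)
              if comb(n, k) <= comb(n, females)) / 2**n
```

Here the p-value comes out far above any conventional threshold, so a 50:42 split is easily explained by random fluctuations alone.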

Wrzlprmft
  • 2
    Thank you for adding this explanation and perspective as well! I think this gets perfectly to the point of OP's confusion why having the "population" isn't enough. – Fnguyen Jul 23 '20 at 08:31
  • Sounds all well, but - what about New Who vs. Old Who? – Hagen von Eitzen Jul 23 '20 at 09:59
  • 1
    @HagenvonEitzen: It gets complicated due to age dependency, and I am not sure whether Elizabeth Mountbatten-Windsor’s preference on this is known. – Wrzlprmft Jul 23 '20 at 10:02
  • 3
    +1 for your point (1). Statistical population means "the population of all possible Americans", not just the finite number of Americans who happen to exist (EDIT: ooops, I assumed the OP was from the US. Realise now they never actually stated that) – Michael Reid Jul 23 '20 at 14:01
  • @MichaelReid In census data you actually have the finite number of Americans who happen to exist. Statistical inference requires "the population of all possible Americans", because it does not simply summarize data, but tends to draw inferences that can be applied to unseen data. E.g.: how many people _will_ prefer _Star Trek_? – Sergio Jul 23 '20 at 18:22
7

Funny. I spent years explaining to clients that in cases with true census information there was no variance and therefore statistical significance was meaningless.

Example: If I have data from 150 stores in a supermarket chain that says 15000 cases of Coke and 16000 cases of Pepsi were sold in a week, we can definitely say that more cases of Pepsi were sold. [There might be measurement error, but not sampling error.]

But, as @Sergio notes in his answer, you might want an inference. A simple example might be: is this difference between Pepsi and Coke larger than it typically is? For that, you'd look at the variation in the sales difference versus the sales difference in previous weeks, and you'd draw a confidence interval or do a statistical test to see if this difference was unusual.
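
As a minimal sketch of that comparison (the historical weekly differences below are made up for illustration):

```python
from statistics import mean, stdev

# Hypothetical Pepsi-minus-Coke differences (cases) for the previous 8 weeks:
history = [800, 950, 1100, 700, 1200, 900, 1050, 850]
this_week = 16000 - 15000  # this week's difference from the example

# Standardize this week's difference against the recent history:
z = (this_week - mean(history)) / stdev(history)
# |z| is well below 2 here, so this week's gap is in line with
# the usual week-to-week variation.
```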

zbicyclist
  • 3
    There's still error, but any error is systematic. The CLT depends on error being reasonably independent, so modeling systematic error as Gaussian is problematic. Sometimes Zipf's law is more accurate. – Acccumulation Jul 22 '20 at 06:53
4

In typical applications of hypothesis testing, you do not have access to the whole population of interest, but you want to make statements about the parameters that govern the distribution of the data in the population (mean, variance, correlation, ...). Then you take a sample from the population and assess whether the sample is compatible with the hypothesis that the population parameter is some pre-specified value (hypothesis testing), or you estimate the parameter from your sample (parameter estimation).

However, when you really have the whole population, you are in the rare position that you have direct access to the true population parameters - for example, the population mean is just the mean of all the values of the population. Then you don't need to perform any further hypothesis testing or inference - the parameter is exactly what you have.
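
A tiny illustration of this point (the ten values are an arbitrary toy population):

```python
from statistics import mean, pvariance

# A toy "whole population" of ten values:
population = [4, 8, 15, 16, 23, 42, 7, 19, 11, 30]

# With the complete population there is nothing to estimate or test:
mu = mean(population)           # the population mean, exactly
sigma2 = pvariance(population)  # the population variance, exactly
```

Note `pvariance` (divide by N) rather than `variance` (divide by N-1): the N-1 correction exists to remove bias when estimating variance from a sample, which is precisely the step that disappears here.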

Of course, the situations where you really have data from the whole population of interest are exceptionally rare, and mostly constrained to textbook examples.

2

Let's say you are measuring height in the current world population and you want to compare male and female height.

To check the hypothesis "the average height of men alive today is greater than that of women alive today", you can just measure every man and woman on the planet and compare the results. If male height is on average 0.0000000000000001 cm greater, even with a standard deviation trillions of times bigger, your hypothesis is proven correct.

However, such a conclusion is probably not useful in practice. Since people are constantly being born and dying, you probably don't care about the current population, but about a more abstract population of "potentially existing humans" or "all humans in history" of which you take people alive today as a sample. Here you need hypothesis testing.

David
1

I would be very wary about anyone claiming to have knowledge about the complete population. There is a lot of confusion about what this term means in a statistical context, leading to people claiming they have the complete population, when they actually don't. And where the complete population is known, the scientific value is not clear.

Assume you want to figure out if higher education leads to higher income in the US. So you get the level of education and the annual income of every person in the US in 2015. That's your whole population, it seems.

But it isn't. The data are from 2015, but the question was about the relation in general. The actual statistical population would be the data from every person in the US in every year, past and yet to come. There is no way to ever get data for this statistical population.

Also, if you look at the definition of a theory given e.g. by Popper, then a theory is about predicting something unknown. That is, you need to generalize. If you have a complete population, you are merely describing that population. That may be relevant in some fields but in theory driven fields, it doesn't have much value.

In psychology there have been some researchers who abused this misunderstanding between population and sample. There have been cases where researchers claimed that their sample is the actual population, i.e. the results only apply to those people that have been sampled, and therefore a failure to replicate the results is just due to the use of a different population. A nice way out, but I really don't know why I should read a paper that only makes a theory about a small number of anonymous people whom I will probably never encounter and that may not be applicable to anyone else.

LiKao
0

Let me add something to the good answers above. Some of them, like the accepted one, mainly address the reliability of the premise of "having all the population" and related practical points. I propose a more theoretical perspective, related to Sergio's answer but not identical to it.

If you say you "have all the population", I focus first on the case where the population is finite; I also consider the case of infinite data below. Another aspect seems relevant to me as well: the data concern one variable only (case 1) or several variables (case 2):

  1. If the data concern one variable, you can compute exactly all the moments and all the indicators you want. Moreover, by plotting, you know the exact distribution. Note that, if the variable is continuous, finite data hardly ever fit any parametric distribution perfectly. Ideally, if the data are infinite, all incorrect distributions can be definitively rejected by some test and only the correct one is not rejected (the test remains useful only because it is possible to lose something by plotting). In this case, parameters can also be computed exactly. Hypothesis testing about the reliability of some statistical quantity (its proper meaning) becomes senseless.

  2. If several variables are collected, the considerations above still hold, but something must be added. In a purely descriptive situation, as in case 1, it is relevant to note that multivariate concepts like correlations and any other dependency metrics become perfectly known.

    However, I am wary of description in the multivariate case, because in my experience any multivariate measure, above all regression, leads one to think about some kind of effect, which has more to do with causation and/or prediction than with description (see: Regression: Causation vs Prediction vs Description). If you want to use the data to answer causal questions, the fact that you know the entire population (the exact joint distribution) does not guarantee anything. The causal effects that you try to measure with your data by regression or other metrics can be completely wrong. The standard deviation of these effects is $0$, but a bias can remain.

    If your goal is prediction, the question gets a bit more complicated. If the population is finite, nothing remains to predict. If the data are infinite, you cannot have all of them. From a purely theoretical point of view (let me stay with the regression case), you can have an infinite amount of data that permits you to compute (rather than estimate) the parameters, so you can predict new data. However, which data you have still matters. It is possible to show that, with an infinite amount of data, the best prediction model coincides with the true model (the data-generating process), as in the causal question (see the reference in the previous link). Yet the prediction model you actually fit can still be far from this best one. As before, the standard deviation is $0$, but a bias can remain.

markowitz
  • You talk quite a lot about the case of infinite data. How is this ever relevant? Obviously you cannot collect an infinite amount of samples. – Wrzlprmft Aug 27 '20 at 16:53
  • I focused primarily on the finite case, and then also on the infinite case; several useful links exist. I stayed on theoretical ground, even if several practical suggestions can be found. "How is this ever relevant? Obviously you cannot collect an infinite amount of samples" sounds like "the case of samples of infinite dimension is not relevant". I disagree. Obviously we can never collect an infinite amount of data in practice. However, does this fact rule out any utility of reasoning about samples of infinite dimension? – markowitz Aug 28 '20 at 09:40
  • No. In fact, the entire asymptotic theory deals with them. Asymptotic results give us the possibility to understand what happens in large samples. How large these samples should be in practice, in order to clarify the reliability of any particular result, is another question. More generally, the concept of "infinity", even if only theoretical, is widely used and very useful in science. – markowitz Aug 28 '20 at 09:41
  • I don’t dispute the usefulness of the concept of infinity in general (in fact I wrote [an answer about this](https://math.stackexchange.com/a/1888971/65502)). My issue is rather that your answer does not make clear how your insights for the infinite case translate to the reality of very large sample sizes. Moreover, some of your insights do not appear to survive this translation. – Wrzlprmft Aug 28 '20 at 13:39
  • My answer is about theoretical points, analyzed informally. In a few words, I said that at the population level precision problems are ruled out, but not every statistical problem is. In particular, in the multivariate case, correlations and other dependency measures are perfectly precise, but frequently we are interested in something else. – markowitz Aug 28 '20 at 15:04
  • That said, links between theory and reality are almost always disputable. Precisely which part of what I wrote seems problematic to you? – markowitz Aug 28 '20 at 15:06
  • Well, to begin with, it’s not very clear what the link is, in particular to somebody who is not knee-deep in the topic, like (presumably) the asker. – Wrzlprmft Aug 28 '20 at 15:15
  • You addressed mainly the definitional problem about population and census, and the reliability of those concepts. If the asker accepted your answer, it means they were satisfied. Good for you. I stayed focused primarily on the title and considered census and population as synonyms. I hope that my reply can be useful for the asker, but also for anyone who reads the title/question. That said, your question about the link becomes too general in my view, even more so here in the comments. A focused reply depends on the asker's scope. – markowitz Aug 29 '20 at 07:14
  • That said, about the question above, I suppose that the data are given, finite, and multivariate. So moments, correlations, etc. can be computed precisely. No hypothesis test is needed. Questions about prediction disappear. Questions about causality remain debatable. – markowitz Aug 29 '20 at 07:18
  • If the sample is very large but not at the population level, then, under the usual assumptions, precision problems tend to disappear. Prediction and causal questions remain debatable. I am only saying these things to the asker. – markowitz Aug 29 '20 at 07:21