
I have always been confused about the use of the term “population” in statistics. In my first statistics course I was taught that we need a sample because surveying the whole population is too costly. So there is the whole population, and there is a small sample from it which we study.

The problem is that this intuition is just wrong outside of a few toy examples, where the population is literally the whole population of the US (or the world). Actually, even in those few examples it is probably wrong, since the world population is just one of the hypothetical repeated random samples from the data-generating process (DGP). So when we started estimating multivariate models in subsequent statistics courses, I struggled to understand what the population is now and how it differs from the sample.

So I am really confused by the way statistics is taught. I feel like people use the term “population” partly for historical reasons and partly because it makes it easier to explain the concept of a sample in Stat 101. The problem is that it teaches the wrong intuition, which students have to unlearn later, and creates a hole in their understanding of the most fundamental statistical concepts. On the other hand, the concept of the DGP is harder to introduce in an elementary statistics course, but once students understand it, they will have a solid conceptual foundation in statistics.

I have two questions:

  1. I would guess that there is an ongoing discussion among statisticians on this issue, so can anybody give me references on it?

  2. And more importantly, do you know of any introductory-level statistics textbooks that forgo “population” and instead introduce statistics based on the concepts of the DGP and the sample? Ideally, such a textbook would devote substantial space to explaining the conceptual foundations of statistics and statistical inference.

  • _Population_ can be used in statistics in either a literal or an abstract sense. This is also true of many other words in the sciences. I think it is a horrible idea to invent new terminology (especially something as awkward-sounding as 'data generating process', with yet another unnecessary acronym, DGP) when there is no clear need. There seems to be a lot of that going around lately. Maybe I will start referring to such terms as NRT (needlessly redundant terminology). – BruceET May 08 '21 at 19:10
  • @BruceET The problem is that population in the abstract sense has nothing to do with population in the literal sense. This problem is made worse by the attempt of most introductory statistics textbooks to equate these two meanings when introducing the idea of population and sample. – Moysey Abramowitz May 08 '21 at 19:33
  • @BruceET I feel like part of the problem is that textbooks are written by professors who have been teaching stats for decades and think it is obvious that when they say “population”, it means “DGP” or “statistical model”. For the vast majority of students this is not obvious. Especially if you are teaching students outside the stat department (i.e., students for whom introductory statistics will probably be the only statistics class they ever take), how can you say “population” and assume that they understand you mean the DGP? – Moysey Abramowitz May 08 '21 at 19:33
  • That some textbooks may have weaknesses is not in dispute. The issue is whether making up new terminology will help cure that. Euclid's abstract lines of zero width have survived nicely for over two millennia--some badly written geometry textbooks notwithstanding. – BruceET May 08 '21 at 19:50
  • You make two statements that are extreme, and you provide no evidence or citations in support. These are (1) "The problem is that it teaches the wrong intuition...and creates a hole in their understanding of the most fundamental statistical concepts", and (2) "the concept of the DGP is harder to introduce in an elementary statistics course, but once students understand it, they will have a solid conceptual foundation in statistics." Please provide some type of justification/evidence for these claims. – Gregg H May 09 '21 at 23:22
  • @Gregg H The “population/sample” framework obscures the nature of statistics. Statistical inference is about using data to learn something useful about the model. In other words, the most fundamental concept in statistics is the dichotomy between the theoretical world (i.e., the model/DGP) and the real world (the data/sample). Introductory statistics textbooks define the population as a set of items of interest and the sample as a subset of the population. Obviously, this is not the way statisticians/empirical researchers think about statistical inference. The population/DGP is a model, not a bunch of observations. – Moysey Abramowitz May 10 '21 at 21:56
  • @Gregg H This way of teaching statistics creates a hole in statistical education, since students fail to understand the conceptual framework of statistics. They fail to understand the difference between the data and the model, between the sample and the DGP, and between statistics and probability. Students memorize a bunch of equations without understanding what statistical inference is and how it relates to the scientific method. – Moysey Abramowitz May 10 '21 at 21:57
  • @Gregg H Regarding references, the only highly cited paper I have found is Kass (2011), “Statistical Inference: The Big Picture”. Indeed, my question is itself a request for references. – Moysey Abramowitz May 10 '21 at 22:04
  • @Moysey Abramowitz When I teach survey research or intro to research methods, I want students to literally understand that if they conduct a survey using a random(ish) sample, there is a possibility that the answer they get from the survey will be different from what they would get if they surveyed the whole population, and that there are ways to quantify this possibility. I want them to understand why political polls have margins of error around them and why a psych experiment with N=20 might find one group did better due to dumb luck. Sample/population seems to helpfully describe these situations. – Graham Wright May 10 '21 at 23:34
  • @Graham Wright I agree, the sample/population framework works for political polls. No wonder this framework was dominant 60 years ago, when political polls were a major application of statistics. These days, probably more than 90% of applications of statistics (Big Data, ML, AI, and other buzzwords) are about estimating statistical models, usually for predictive purposes. The population/sample framework does not seem useful for 21st-century statistics. – Moysey Abramowitz May 10 '21 at 23:56
  • @Moysey Abramowitz You seem to have a lot of strong views about how (frequentist?) statistics works, and I'm curious why you seem so certain about these things. I'll freely admit that I have no freaking idea what the "true" nature of stats is, or how I would even argue that claim. And I have no idea how you would justify a claim that 90% of it is done one way or another. How do you measure the "amount" of stats that is being done by pollsters vs sociologists vs Google? Does it even make sense to talk of one "best" way to do statistics when it is used for so many different things? – Graham Wright May 11 '21 at 12:07
  • @Moysey Abramowitz The goal of statistics is definitely not to use data to learn something about the model. First, model-free inference exists. Second, models are just models; they model stuff. We want to learn about the stuff, and models might or might not help us with that, but the primary goal is never to learn something about the model. – rep_ho May 13 '21 at 09:23
  • The introductory stats course traditionally focuses on quantifying *sampling error*. This is definitely not all that statisticians deal with! But many of us think of it as a foundational prereq. to understanding other issues that could arise from general DGPs (measurement error etc.) So, the "population vs sample" setting is a simple concrete way to get across the key ideas involved in sampling error in an intro course. That's not to say you *couldn't* teach a DGP-first course, but it's my guess as to why we start with populations and only introduce other DGPs later when needed. – civilstat Jan 07 '22 at 01:37
  • You mention Kass (2011) “Statistical Inference: the Big Picture” above. That's a great paper -- but it sounds like you're suggesting we start with Kass' Figure 4 instead of Figure 3. As a teacher I find it's hard enough to get across Figure 3 already :) – civilstat Jan 07 '22 at 01:39
  • Finally, you say "the world population is just one of the hypothetical repeated random samples from the DGP." On the contrary, for many national statistical agencies (like the US Census Bureau), the current population of their country *is* the target of inference. They really do use samples to estimate how many people in geographic area A fall into demographic category B or are eligible for economic support program C, etc. Downstream data users might *also* treat it as a sample from a hypothetical super-population, but the original data collection and stat inference is deliberately about the finite pop. – civilstat Jan 07 '22 at 01:43

1 Answer


There are certainly already many contexts where statisticians do refer to a process rather than a population when discussing statistical analysis (e.g., a time-series process, a stochastic process, etc.). Formally, a stochastic process is a set of random variables with a common domain, indexed over some set of values. This includes time series, sequences of random variables, etc. The concept is general enough to encompass most situations where we have a set of random variables of interest in a statistical problem, and so statistics already has a sufficiently well-developed language to refer to hypothesised stochastic "processes" and also to refer to actual "populations" of things.
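For concreteness, a minimal formal statement of that definition (standard notation, not tied to any particular textbook) is

$$\{ X_t : t \in T \}, \qquad X_t : \Omega \to \mathbb{R},$$

where each random variable $X_t$ is defined on a common probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and $T$ is the index set; e.g., $T = \{1, 2, 3, \ldots\}$ for a sequence of random variables, or $T = [0, \infty)$ for a continuous-time process.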

Whilst statisticians do refer to and model "processes", these are abstractions that are formed by considering infinite sequences (or continua) of random variables, and so they involve hypothesising quantities that are not all observable. The term "data-generating process" is itself problematic (and not as helpful as the existing terminology of a "stochastic process"), and I see no reason that its wide deployment would add greater understanding to statistics. Specifically, by referring to the generation of "data", this terminology pre-empts the question of what quantities are actually observed or observable. (Imagine a situation in which you want to refer to a "DGP" but then stipulate that some aspect of that process is not directly observable. Is it still appropriate to call the values in that process "data" if they are not observable?) In any case, setting aside the terminology, I see deeper problems in your approach, which go back to foundational issues in philosophy and the formulation of research questions.


**Existents vs processes in empirical research:** I see a number of premises in your view that strike me as problematic, and that appear to me to misunderstand the goal of most empirical research that uses statistics. When we undertake empirical research, we often want to know about relationships between things that exist in reality, not hypothesised "processes" that exist only in our models (i.e., as mathematical abstractions from reality). Indeed, in sampling problems it is usually the case that we merely wish to estimate some aspect of the distribution of some quantity pertaining to a finite population. In this context, when we refer to a "population" of interest, we are merely designating a set of things that are of interest to us in a particular research problem. Consequently, if we are presently interested in all the people currently living in the USA, we would call this group the "population" (or the "population of interest"). However, if we are interested only in the people currently living in Maine, then we would call this smaller group the "population". In each case, it does not matter whether the population can be considered as only part of a larger group --- if it is the group of interest in the present problem then we will designate it as the "population".

(I note that statistical texts often engage in a slight equivocation between the population of objects of interest, and the measurements of interest pertaining to those objects. For example, an analysis on the height of people might at various times refer to the set of people as "the population" but then refer to the corresponding set of height measurements as "the population". This is a shorthand that allows statisticians to get directly to describing a set of numbers of interest.)
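As a minimal illustration of this finite-population view (the numbers, distribution, and sample size below are invented purely for the example), the target of inference is a property of a fixed, finite set of measurements, and a random sample merely estimates it:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical finite population of one million height measurements (cm).
# In the finite-population view, the target of inference is a property of
# this specific, fixed set of numbers; here, its mean.
population = rng.normal(loc=170, scale=10, size=1_000_000)
target = population.mean()  # fixed and, in principle, knowable exactly

# A simple random sample (without replacement) from that population.
sample = rng.choice(population, size=100, replace=False)
estimate = sample.mean()

print(f"Population mean: {target:.2f} cm; sample estimate: {estimate:.2f} cm")
```

Surveying all one million people is costly, so we settle for the sample estimate; but nothing here requires hypothesising a "process" that generated the population in the first place.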

Your philosophical approach here is at odds with this objective. You seem to be adopting a kind of Platonic view of the world, in which real-world entities are considered to be less real than some hypothesised "data-generating process" that (presumptively) generated the world. For example, in regard to the idea of referring to all the people on Earth as a "population", you claim that "...it is probably wrong, since the world population is just one of the hypothetical repeated random samples from the DGP". This bears a substantial similarity to Plato's theory of forms, in which Plato regarded observation of the world as mere imperfect observation of the eternal Forms. In my view, a much better approach is the Aristotelian view that the things in reality exist, and we abstract from them to form our concepts. (This is a simplification of Aristotle, but you get the basic idea.)$^\dagger$

[Image: Plato and Aristotle, each captioned with an invented quote contrasting their positions on this issue; see the footnote below.]

If you would like to get into the literature on this issue, I think you will find that it goes deeper into the territory of philosophy (specifically metaphysics and epistemology), rather than the field of statistics. Essentially, your views here bear on the broader issue of whether the things existing in reality are the proper objects of human knowledge, or whether (contrarily) they are merely an epiphenomenon of some broader hypothesised "process" that is the proper object of human inference. This is a philosophical question that has been a major part of the history of Western philosophy going back to Plato and Aristotle, so there is an enormous literature that could potentially shed light on it.

I hope that this answer sets you off on an interesting journey into the field of epistemology. For present purposes, you might wish to take a practical view that also considers the objectives that researchers set for themselves in their research. Ask yourself: would researchers generally prefer to know about the properties of the people living on Earth, or would they prefer to find out about your hypothesised "repeated random samples" of people who might have lived on Earth instead of us?


$^\dagger$ To avoid any possible confusion among those lacking historical knowledge, please note that these are not real quotes from Plato and Aristotle --- I have merely taken poetic license to liken their philosophical positions to the present issue.

Ben
  • Not sure I follow the argument in your 2nd paragraph. *Data-generating process* isn't intended to oust *stochastic process*, but just refers to the particular stochastic process used to model the data, the observations. So you might have one stochastic process that generates times of death, another that generates times of censoring; you put the two together to get a process that generates times of death/censoring for each patient: that's your data-generating process (see the sketch after these comments). – Scortchi - Reinstate Monica May 13 '21 at 10:25
  • That is fine, but even then, notice that you had to be careful not to refer to the intermediate processes as DGPs because they do not produce observable "data". That is my point in the second paragraph. In any case, the OP seems to consider the DGP much more broadly than this. – Ben May 25 '21 at 22:53
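To make the point in the comments concrete, here is a minimal sketch of the censoring example (the exponential distributions and their parameters are invented purely for illustration): two latent component processes are combined into a single data-generating process, and only part of the result is observable.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Two latent component processes (neither is observed directly):
death_time = rng.exponential(scale=10.0, size=n)    # latent time to death
censor_time = rng.exponential(scale=15.0, size=n)   # latent time to censoring

# The combined data-generating process: for each patient we observe only
# the earlier of the two times, plus an indicator of which event occurred.
observed_time = np.minimum(death_time, censor_time)
event_observed = death_time <= censor_time   # True = death, False = censored

print(f"{event_observed.mean():.0%} of observations are uncensored")
```

Note that the two component processes generate values that are never fully observed, which is precisely why it feels awkward to call either of them a "data"-generating process on its own.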