Wikipedia defines a sample as:

> a subset of a population.

While exploring the reason why we divide by $(n-1)$ instead of $n$ when calculating standard deviation (discussed in this question), I came across this PDF demonstrating why $(n-1)$ is better.

When listing all possible samples of $n=2$ from a population of three cards numbered 0, 2, and 4, it includes the samples (0,0), (2,2), and (4,4). I am having trouble reconciling this with the definition of a sample that I thought I knew (and that is given by Wikipedia).

A sample of 2 playing cards from a population of 52 would not include the Three of Hearts twice, would it? Likewise, I'd guess a survey of a sample of voters would not include the same voter multiple times.

Other sources back the method described in the PDF. What am I misunderstanding here?

Corey
  • Have you read the introduction section of the [Wikipedia page](http://en.wikipedia.org/wiki/Simple_random_sample) on simple random sample? There the difference between sampling _with replacement_ and sampling _without replacement_ is mentioned. – Macro Feb 21 '13 at 13:49
  • @Macro Good point. But the OP is correct in intimating that Wikipedia is wrong (or, at best, inconsistent) in defining a sample as a "subset." At most it could be said a sample is a "multisubset." Even then I think such purely mathematical terminology fails to capture how statistics thinks of the word "sample," because our usage makes it clear that a sample is a result of some *procedure* used to obtain data (or to model their relationship to reality). Ignoring details of that procedure is perhaps the most common and most critical mistake made by people with data to analyze. – whuber Feb 21 '13 at 15:29

2 Answers

Considering your thoughtful question and the comment stream, I think the answer is:

The Wikipedia article is (or rather, "was") incorrect. A correct definition would be:

> A sample is a set of observations drawn from a population by a defined procedure. It may be drawn without replacement, in which case it is a subset of the population; or with replacement, in which case it is a multisubset.
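To make the distinction concrete, here is a minimal Python sketch using the three-card population (0, 2, 4) from the question. It is only an illustration, not necessarily the enumeration the PDF performs: the helper names `mean` and `variance` and the variable names are my own. It lists the samples of size 2 drawn with and without replacement, then averages the sample variance over the with-replacement samples using each divisor.

```python
from fractions import Fraction
from itertools import combinations, product

population = [0, 2, 4]  # the three cards from the question
n = 2                   # sample size


def mean(values):
    """Exact arithmetic mean as a Fraction."""
    return Fraction(sum(values), len(values))


def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    m = mean(values)
    return sum((Fraction(v) - m) ** 2 for v in values) / divisor


# With replacement: ordered draws, the same card may appear twice.
with_replacement = list(product(population, repeat=n))
# Without replacement: plain subsets of the population.
without_replacement = list(combinations(population, n))

population_variance = variance(population, len(population))
avg_unbiased = mean([variance(s, n - 1) for s in with_replacement])
avg_biased = mean([variance(s, n) for s in with_replacement])

print(with_replacement)     # 9 samples, including (0, 0), (2, 2), (4, 4)
print(without_replacement)  # 3 samples: (0, 2), (0, 4), (2, 4)
print(population_variance, avg_unbiased, avg_biased)  # 8/3 8/3 4/3
```

Under sampling with replacement, the average of the $(n-1)$ version over all nine samples equals the population variance of $8/3$, while the $n$ version averages only $4/3$. The pairs (0,0), (2,2), and (4,4) are needed for that average to come out right, which is why they appear in the list.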

Peter Ellis

The problem is the confusion of "plain" English with specialist jargon. All academic disciplines and other groups of people do this, e.g. the military, individual companies, government departments, and sports. Within a discipline it is perfectly reasonable to use a term with more specialised overtones than it has in everyday language, provided one remembers that a qualifier is usually advisable when communicating with the general public, e.g. "statistical significance". Even within a discipline, as you have just found and others have already pointed out, a qualifier is sometimes appropriate, because the mathematical abstraction of a physical sample allows constructs that would otherwise be physically impossible.

Robert Jones