8

Tutorials demonstrating simple random sampling, first list the full number of population members (i.e., sampling frame) in a column and then assign a random number from $0$ to $1$ from a uniform distribution to each.

Question: BUT WHY (as shown in this tutorials), before getting a sample (e.g., of $n = 40$) from our sampling frame we must "sort" the sampling frame based on the random numbers for each population member? (What role does sorting etc. play?)

Simon Harmel
  • 317
  • 1
  • 7
  • The sampling method is to take the $n$ values with the lowest random numbers. You do not have to use the particular sorting method shown: for example you could use something based on `RANK()` or something based on ` LARGE()` or `SMALL()` – Henry Aug 09 '20 at 20:32
  • The video does shuffle the IDs. Other form of shuffling are possible – Henry Aug 09 '20 at 21:01
  • @Henry, I appreciate it if you could specifically clarify: *why we **sort/order/rank etc.** the sampling frame before being able to pick a random sample?* What role does ***sort/order/rank etc.*** play? – Simon Harmel Aug 09 '20 at 21:12
  • Excel does not have a sample function. But it does have a random number generator. So you can allocate a random number to each value you want to sample from, and then choose those with the smallest (or largest) random numbers as your sample – Henry Aug 09 '20 at 22:18

2 Answers2

12

Sorting a list of objects based on an accompanying set of IID continuous random variables (such as uniform random variables) is equivalent to shuffling those objects into a random order (i.e., by a random permutation). Since the random values are independent continuous random variables, every possible permutation is equally likely, and that is the definition of simple random sampling. This method is used in computer programs that have facilities to create pseudo-random numbers, but do not have an existing sampling function.

Ben
  • 91,027
  • 3
  • 150
  • 376
  • 1
    This was what I was looking for thanks! – Simon Harmel Aug 09 '20 at 22:31
  • BTW, could we instead of generating our random number from a `uniform(0,1)`, generate our random number from `uniform(3,9)`? – Simon Harmel Aug 09 '20 at 22:41
  • 3
    You can use *any* continuous distribution --- the sorting properties are the same (see [this related question](https://stats.stackexchange.com/questions/350923/)). – Ben Aug 09 '20 at 23:06
  • (One slight complication to this is that computer PRNGs never generate perfectly continuous random variables, since there is a limit to the number of digits for the generated number. So technically there is a possibility of ties, but this hardly ever happens.) – Ben Aug 10 '20 at 22:34
  • Let me answer that question with a [link](https://stats.stackexchange.com/search?q=user%3A173082+%5Br%5D). ; ) – Ben Aug 10 '20 at 22:38
  • Oh Awesome! Well, I know you're a stat. math guy! Is [this](https://stats.stackexchange.com/questions/482256/obtaining-standard-error-of-an-estimate-knowing-its-likelihood-function-in-r) something you might know about? – Simon Harmel Aug 10 '20 at 22:43
2

It should be emphasized that you don't need to sort in order to sample. The method given in the tutorial works, but it is extremely inefficient. It basically does $\Theta(n \log n)$ operations for what can be done in $\Theta(1)$.

If you can sample a random floating point number from 0 to 1, you can sample a random integer from 1 to n. And Excel can give you the value in a list at a specific index. You can use that for sampling.

(Note, though, that Excels tends to re-roll all random values whenever you do anything, so you'll want to copy the value of the random index before proceeding.)

Meni Rosenfeld
  • 568
  • 2
  • 11
  • asymptotic efficiency is rarely an issue in spreadsheets. If you have enough data that you need to worry about the amount of time that sorting will take, then you probably shouldn't be using Excel to analyse them.... And if you have little enough data that you can fit it into excel, why are you sampling them.... making the whole question moot. – James K Aug 10 '20 at 17:05
  • @JamesK There are many reasons to sample data small enough to fit in excel. For example, when conducting field investigations that take several weeks each. Even a single-digit number of selections is huge! – indigochild Aug 10 '20 at 17:31
  • @JamesK Even if the actual time of sorting is not noticeable in practice - I find that using such an egregiously inefficient algorithm reflects a fundamental misunderstanding of what sampling is. Such confusion can lead to other errors. Highlighting how the action should be done sheds light on the underlying issues. – Meni Rosenfeld Aug 10 '20 at 17:31