2

Where by "why" I do not mean "list of use cases for randomness." If one has a quantitative question Q about topic X, it does not seem intuitive that values that by definition have absolutely nothing whatsoever to do with topic X would be so crucial in answering Q. And yet, a source of randomness is absolutely essential to most statistical operations. Probably the most common example is the need to obtain random samples for inferential methods; however, the need for randomness is so ubiquitous that it seems almost fantastical. So, by "why" I am perhaps looking for something information-theoretical, or even possibly philosophical?

To create a context around Q and X, suppose a statistics student has an assignment to estimate the average height of students at the local college. They come to you and ask about where they should start. Naturally, one place to start involves collecting a random sample of students, so you say: "well, first, you'll need to obtain a Geiger counter." (with the intent that the student use it for generating random numbers). The student adopts a confused expression.

@Tim has suggested that requiring a Geiger counter for this task "at face value is rather ridiculous", which is precisely my point. Other options for obtaining randomness to help with the experiment include lottery balls, atmospheric noise, or repeatedly squaring your zip code: none of which have anything whatsoever to do with measuring heights. In fact, absent a priori knowledge about the population distribution of heights, involving students' heights in the method for obtaining your sample is probably a bad plan. An important part of estimating the average height of the students is finding an activity that is totally unrelated to the heights of students in any way, and obtaining measurements of that thing.

Is there some intuitive explanation to illustrate why a Geiger counter would be so immensely useful for someone interested in investigating the average height of students at the local college?

Ariel
Him
  • "it does not seem intuitive that values that by definition have absolutely nothing whatsoever to do with topic X would be so crucial in answering Q" -- who exactly claimed that it does?! Obviously, something that has nothing to do with X, has nothing to do with X. – Tim May 18 '21 at 15:51
  • @Tim every statistics book ever written? Job #1 when trying to estimate a quantity is to take a random sample. Taking a random sample requires random numbers. Thus, the random numbers are absolutely essential to the problem. – Him May 18 '21 at 15:54
  • @Tim I am happy to attempt to clarify the question, but it is not clear exactly what your problem with the question is. Can you be more specific as to what you aren't understanding here? – Him May 18 '21 at 15:55
  • 1
    Are you referring to any particular quote in general? I have never seen a statistics handbook suggest using a Geiger counter to measure height. – Tim May 18 '21 at 16:01
  • "To understand the nature of a simple random sample... put the 10,000 tickets in a drum, mix them thoroughly, and then one by one draw 5 tickets out" -- Statistics for Engineers and Scientists, Navidi W. (this is just a text I happen to have on hand). There is an analogy here to a lottery, but the 5 tickets out of 10,000 are intended to apply to generating a simple random sample for the purposes of answering arbitrary questions. Why are lottery drawings useful for measuring student heights? My example is only bc radioactive stuff are considered extremely reliable sources of randomness. – Him May 18 '21 at 16:12
  • 1
    The quote does not say anything about using lottery drawings to estimate height. It only describes process of randomly drawing 5 tickets. Analogy for measuring students height: randomly pick 5 students from the class and measure their height—what you find controversial about that? – Tim May 18 '21 at 16:16
  • "pick 5 students from the class" -- how would you suggest to pick them? I am only suggesting that the standard for choosing those students is to do so randomly. As a general rule, if you do not choose them randomly, the results would be considered suspect. Thus, the random numbers are a crucial ingredient in the process. – Him May 18 '21 at 16:18
  • 4
    I have answered some questions of this nature at https://stats.stackexchange.com/a/54894/919. – whuber May 18 '21 at 16:19
  • In order to choose them randomly, you need a source of randomness. Usual sources of randomness have nothing to do with the heights of students. So, the first step in estimating the heights of students is to obtain a Geiger counter. – Him May 18 '21 at 16:20
  • @whuber Looking through the comment thread on your answer, I am thinking that your implied answer to my question here is that random numbers are mathematically convenient for the sorts of analyses that we intend to do, and that that is why they are useful. I think that I intended this to elicit something more tangible, but perhaps this shows that my question is just a re-hash of [unreasonable effectiveness](https://www.maths.ed.ac.uk/~v1ranick/papers/wigner.pdf) musings. I will attempt to contemplate an explanation for my hypothetical student in this light. – Him May 18 '21 at 16:38
  • @whuber in any case, as a side note, [comments are not a safe place to put stuff](https://meta.stackexchange.com/questions/130975/are-comments-ephemeral-and-what-should-be-done-with-informative-comments). It seems that your comments on the answer you linked to are actually useful/informative, so it may be worthwhile incorporating them into your answer there, or here, or otherwise preserving them for posterity. I'm sure you have a full crossvalidated plate already, though. :) – Him May 18 '21 at 16:41
  • @Tim sooo.... is my question more clear due to this comment thread? Will incorporating some of this discussion into the body of the question clarify this question to your satisfaction, oh mighty closer of questions? :) – Him May 18 '21 at 16:52
  • 2
    At least for me it isn’t. You seem to be misunderstanding what is meant by random variables in probability theory & statistics, but if that’s the case, it seems to be answered in the thread linked by @whuber. – Tim May 18 '21 at 17:20
  • @Tim "seem to be misunderstanding what is meant by a random variable" this could be, and perhaps the answer to why a Geiger counter is useful in height estimation is obvious to you, but I thought that answering questions was the point of Stack Exchange? Or are you stating that meta-statistical questions that are not logical statements within the mathematical calculus of probability theory are out of scope for CrossValidated? – Him May 18 '21 at 18:01
  • @Tim I would like to make my question clear, but I've not been given much to go on besides an insinuation that I'm not knowledgeable enough about the subject to even formulate a meaningful question.... – Him May 18 '21 at 18:02
  • @Tim, or, are you denying that a Geiger counter would be a useful tool for someone interested in estimating the average heights of students? – Him May 18 '21 at 18:13
  • You need to improve your question to be more clear. Currently, you are using an example of using Geiger counter for measuring height, which taken at face value is rather ridiculous. The first part, as I noticed in the first comment, is self contradictory so hard to comment on it. – Tim May 18 '21 at 18:58
  • @Tim "at face value is rather ridiculous" At last, you've understood the question perfectly! Other options include lottery balls (as we've already discussed) [atmospheric noise](https://www.random.org/history/), or [repeatedly squaring your zip code](https://en.wikipedia.org/wiki/Middle-square_method). None of which have anything whatsoever to do with measuring heights. In fact, absent a priori knowledge about the population distribution of heights, involving students' heights in the method for obtaining your sample is probably a bad plan. – Him May 18 '21 at 19:21
  • 2
    Ok, but you do not connect the ridiculous claim anyhow with statistics. Nobody uses Geiger counter to measure human height in statistical research. I will not be continuing this discussion because it is already overly long. – Tim May 18 '21 at 19:42
  • @Tim ah. I suppose I had believed it was clear from the link that the purpose of the Geiger counter was to obtain random numbers, but perhaps you are right that leaning on the link for clarity was incorrect on my part. I have added a clarifying blurb near the offending sentence. Please consider reopening my question if this has helped you understand. – Him May 18 '21 at 19:48
  • 2
    You're not just using something random (Geiger counter) to infer something about student height; you're measuring heights and using those measurements in a calculation. – Dave May 18 '21 at 20:08
  • 1
    @Dave sure, but I never meant to insinuate that gathering randomness was the only task at hand. My point is that the randomness is 1) an incredibly important part (if not the only one) and 2) seemingly irrelevant. Why on earth would we *at any stage of the experiment* need to measure the time between release of alpha particles of some radioactive lump when our goal is to measure student heights? It seems bizarrely unrelated. Admittedly, the radioactive lump isn't the only source of randomness, but it's not special: they *all* seem bizarrely unrelated. – Him May 18 '21 at 20:12
  • I'm beginning to think that all of this confusion just illustrates how truly bizarre it really must seem to everyone. :) – Him May 18 '21 at 20:26
  • After the recent edits, I am minded to reopen this question. But I think the title might need a further edit to make the intent of the question more specific. Does "Why are *sources of randomness* so useful?" get to the heart of the matter more clearly? – Silverfish May 18 '21 at 20:57
  • 1
    Also it isn't clear which statistical methods you want covered. You list a lot that do indeed require a source of (at least pseudo-)random numbers: bootstrapping, random forest, Monte Carlo etc though your main focus seems to be on random sampling. Arguably if you want all of these covered, the question is too broad. Would restricting to the need for random sampling alone be sufficient for your needs? I do feel like you're trying to reach for a broader philosophical point that you suspect underlies the ubiquity in other methods, so you may find this unsatisfactory. – Silverfish May 18 '21 at 21:01
  • @Silverfish This is fine with me if you think it is better. In my view, the sources of randomness are only useful because the randomness is useful. Also, I am somewhat worried that "why are sources of randomness so useful" will elicit answers along the lines of "because we need randomness in order to [list of use cases of randomness]" – Him May 18 '21 at 21:01
  • @Silverfish I think that I don't want any of them covered. :D I understand *how* random numbers are used in sampling, and the bootstrap, and ensembles and quicksort etc. What I don't get is why they are useful *at all*. Hence my statement that this is perhaps a philosophical question... Although, if you think that this could be answered by focusing on a single topic, then any arbitrary one will do. – Him May 18 '21 at 21:03
  • 1
    From my experience studying statistics and probability, those in statistics tend not to care so much about *randomness* or *pseudo-randomness* that is inherent in the conceptual machinery they use. These tend to be brushed under the carpet with statements like $X_i$ are $iid$ random variables from some distribution $P$. However, those designing the computational tools to implement this conceptual machinery, as well as those in the simulation and MCMC literature do tend to scrutinise this more e.g. in pseudo random number generators, methods of sampling from non-uniform distributions etc. – microhaus May 18 '21 at 21:05
  • *Also, I am somewhat worried that "why are sources of randomness so useful" will elicit answers along the lines*... hm, I can understand that, but if your Q that follows make it clear why that's not what you're looking for (and I think it does though you may wish to clarify!), I think you'll find any answers would respect that. Re narrowing it down: I understand your point is essentially philosophical so you want something underlying, rather than specific. I actually think this is a reasonable, indeed good, question. I'm more trying to anticipate possible objections from other reopen reviewers – Silverfish May 18 '21 at 21:05
  • 2
    Would a statement such as "uniform randomness maximizes entropy" satisfy your question? – Kuku May 18 '21 at 21:06
  • 1
    Perhaps you might consider looking at Devroye, L. (1986) *Non uniform random variate generation.* Springer. In particular, Chapter XII on implementing random sampling algorithms. Lastly, if the community of those voting on closing/re-opening are unable to understand your question or its relevance as stated, and determine not to re-open, you might stand more chance of receiving an open-minded engagement with your question on the role of randomness in algorithms in a cryptography forum, rather than being told your claims are "ridiculous." – microhaus May 18 '21 at 21:19
  • 4
    I'm probably repeating everything already said, but we need randomness in order to not "leak" any external information to the collected sample. We can imagine the collection process as orthogonal (independent) to the data generating process, therefore it actually feels natural that various unrelated stuff could be used, as you mention, for the draws (as long as we agree that the outcomes are unpredictable & uniform). Cosmic noise works, ticket draws works, etc., although in practice we use a combination of pseudo-random numbers with [cpu-collected-randomness](https://lwn.net/Articles/584005/). – runr May 18 '21 at 21:21
  • @Kuku, maybe? I would certainly be interested in a Bayesian take on the matter. Since the usefulness of the randomness stems from its lack of relationship to anything, then perhaps the maximum entropy principle comes into play somehow. I think it's possible that my question might be more related to the more low-level: Why is maximum entropy a thing that we want?... this probably has answers elsewhere. If you think that these questions provide a satisfactory answer you may want to dupe this. – Him May 18 '21 at 21:33
  • @cpu-collected-randomness I would upvote an answer that discussed something like this. – Him May 18 '21 at 21:35
  • @microhaus in @Tim's defense, my question has expanded somewhat since he closed it (due largely to my discussion with him, in fact), so it is perhaps unfair to judge based on the current state of my post. – Him May 18 '21 at 21:42
  • Ah okay, I wasn't aware the question had undergone substantial editing. Please accept my apologies @Tim for quoting without that context in mind. – microhaus May 18 '21 at 21:47
  • 4
    @Him If you want to measure the average height of students at the university, there is a quite simple way to do this without a Geiger counter (or without any random generator). Simply measure the height of *every* student at the school and take the average. The issue, of course, is that this is hard to do in most applications. You can try to take a sample without randomness, but many forms of bias can be introduced to your estimate. Simple random sampling simply allows us to use mathematical theorems (e.g., law of large numbers) which guarantee our results say what we want them to say. – knrumsey May 19 '21 at 01:55
  • @knrumsey I would upvote an answer that discussed something like this. – Him May 19 '21 at 05:55
  • @Him, I have expanded my comment into an answer. – knrumsey May 20 '21 at 15:12

2 Answers

4

If you want to know the average height of students at a university, there is a simple way to do this without a Geiger counter or any other random generator: record the height of every student at the university and compute the average.

Of course, this is hard to do in practice because the population is usually too large to measure in full. The natural thing to do is to take a sample of your population and use the average of this sample as a proxy. You can try to take a sample without using randomness, but your resulting answer is susceptible to several forms of bias. Here are two simple examples where you can obtain a biased answer.

  1. Suppose you stand at the front door of the gymnasium and measure the height of students as they walk in. This is known as a convenience sample. Are you confident that the heights of students who frequent the gymnasium are representative of the student body? What if the basketball team is arriving for practice today?

  2. Suppose you send an email to the entire student body and request that they fill out an optional survey recording their height. Not every student is going to respond: are you confident that those who do are representative of the population? It may be the case that taller men are more likely to respond. This is an example of voluntary response bias (albeit, not a very good one). In this scenario, you also have to deal with the fact that respondents may not be truthful, with a few subjects exaggerating their height.

By taking a simple random sample, we are able to avoid these (and many other) forms of bias. The Law of Large Numbers essentially guarantees that you can get as close to the true answer as you would like, by taking a large enough sample. But this result (at least the basic version) only holds under simple random sampling.
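The contrast between the gymnasium sample and a simple random sample can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population, the 10% share of taller "athletes", and the tenfold chance that an athlete walks past the gym door are all hypothetical numbers chosen for illustration, and `simulate` is not a library function.

```python
import random
import statistics


def simulate(seed=0, population_size=20000, sample_size=500):
    """Compare a simple random sample with a gym-door convenience sample."""
    rng = random.Random(seed)

    # Hypothetical population: roughly 10% are athletes, who are taller on average.
    population = []
    for _ in range(population_size):
        athlete = rng.random() < 0.1
        height_cm = rng.gauss(185 if athlete else 170, 8)
        population.append((height_cm, athlete))
    true_mean = statistics.fmean(h for h, _ in population)

    # Simple random sample: every student is equally likely to be chosen.
    srs = rng.sample(population, sample_size)
    srs_mean = statistics.fmean(h for h, _ in srs)

    # Convenience sample at the gym door: athletes are assumed to be
    # 10 times more likely to walk past (drawn with replacement for simplicity).
    weights = [10.0 if athlete else 1.0 for _, athlete in population]
    gym = rng.choices(population, weights=weights, k=sample_size)
    gym_mean = statistics.fmean(h for h, _ in gym)

    return true_mean, srs_mean, gym_mean


if __name__ == "__main__":
    true_mean, srs_mean, gym_mean = simulate()
    print(f"population mean:        {true_mean:.1f} cm")
    print(f"simple random sample:   {srs_mean:.1f} cm")
    print(f"gym convenience sample: {gym_mean:.1f} cm")
```

Under these assumptions the random sample should land close to the population mean, while the gym sample overshoots it no matter how large the sample is, because the selection mechanism is correlated with height.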

knrumsey
1

Okay, so I think the question you are asking relates a little bit to the causal inference literature and to RCTs (randomized controlled trials). I am going to change your example a little bit to a classic question studied by econometricians: what is the effect of college on wages? Now we might be tempted to run the regression,

$$wage_i = \beta\times college_i + \epsilon_i$$

and take $\beta$ to be our effect, but this would be wrong. One reason is that there may be selection bias. In particular, it is not hard to imagine that those who choose to enter college do so because they believe it will increase their wage. That might be a problem because those students may have some intrinsic quality that makes them achieve higher wages. This is related to an endogeneity problem: students who enter college may have higher ability, and so they would earn higher wages regardless. Essentially, these issues confound our ability to infer what the effect of college on wages is. So we turn to randomness.

The idea behind randomness is that if we are able to randomly assign and force individuals to go to college (the treatment group) or not go to college (the control group) and then compare their wages, we should be able to extract an effect of college on wages. Why? Essentially, randomizing lets us average out the effect of ability: both the treatment and control groups have the same distribution of ability, since they were randomly chosen and no selection into treatment (in this case, a college education) occurred.
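A small simulation can make the "averaging out" of ability concrete. This is a hedged sketch, not an analysis of real data: the wage equation, the true effect of 5, and the way ability drives self-selection are all invented for illustration, and `compare_designs` is a hypothetical helper.

```python
import random
import statistics


def compare_designs(seed=1, n=20000, true_effect=5.0):
    """Difference in mean wages under self-selection vs. random assignment."""
    rng = random.Random(seed)
    ability = [rng.gauss(0, 1) for _ in range(n)]

    def wage(a, college):
        # Hypothetical wage equation: ability raises wages with or without college.
        return 30 + 4 * a + true_effect * college + rng.gauss(0, 2)

    # Self-selection: people with higher ability are more likely to attend.
    selected = [1 if a + rng.gauss(0, 1) > 0 else 0 for a in ability]
    # Random assignment: a fair coin flip that ignores ability entirely.
    randomized = [rng.randint(0, 1) for _ in ability]

    def diff_in_means(assignments):
        wages = [wage(a, d) for a, d in zip(ability, assignments)]
        treated = [w for w, d in zip(wages, assignments) if d == 1]
        control = [w for w, d in zip(wages, assignments) if d == 0]
        return statistics.fmean(treated) - statistics.fmean(control)

    return diff_in_means(selected), diff_in_means(randomized)


if __name__ == "__main__":
    naive, randomized = compare_designs()
    print(f"self-selected difference: {naive:.2f}")
    print(f"randomized difference:    {randomized:.2f}")
```

With these made-up numbers the self-selected comparison overstates the effect, because it mixes the effect of college with the ability gap between the groups, while the coin-flip comparison should sit near the assumed true effect of 5.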

Formally we want to know the counterfactual effect of what would have happened if person $i'$ who went to college had not gone to college:

$$Y_{1i'}-Y_{0i'}$$

Where the 1 indicates going to college and the 0 indicates not going to college. This notation is called potential outcomes notation and I highly encourage you to check out some of the references at the bottom that go into more detail about it.

It turns out that under random assignment we can actually identify the distribution of $Y_d$, where $d$ indexes the treatment status (college or no college here). To see this, consider

$$\Pr[Y_d\leq y] = \Pr[Y_d\leq y|D=d]=\Pr[Y\leq y|D=d]$$

The first equality holds due to random assignment: since the assignment is random, conditioning on $D=d$ does not change the distribution of $Y_d$. The second equality holds because, for individuals with $D=d$, the observed outcome $Y$ is exactly the potential outcome $Y_d$. Thus we have identified the distribution of the potential outcomes as the conditional distribution of $Y|D$, for which we have data.

This is essentially the idea behind RCTs, and this idea has been expanded on in many different cases. I encourage you to look up terms like "Average Treatment Effect" and "Selection on Observables".

As you might have guessed, we usually cannot force someone to go to college or not go to college. In these cases, another popular technique that can be used to "induce randomness" is instrumental variables. In particular, the "Local Average Treatment Effect" is a very popular way to apply the same idea as above in observational studies where we cannot randomly assign individuals.
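As a sketch of where this argument leads (the expectation notation $E[\cdot]$ is introduced here for illustration and assumes the means exist), applying the same two steps to means rather than distributions gives the average effect of college as a difference of two observable group means:

$$E[Y_1 - Y_0] = E[Y_1] - E[Y_0] = E[Y_1|D=1] - E[Y_0|D=0] = E[Y|D=1] - E[Y|D=0]$$

The middle equality is the only place random assignment is used. Without it, $E[Y_1|D=1]$ and $E[Y_0|D=0]$ describe two different kinds of people, and the difference mixes the effect of college with selection.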

I will leave you with some references:

Notes on Treatment effects: https://economics.mit.edu/files/32

Textbook: https://www.amazon.com/Mostly-Harmless-Econometrics-Empiricists-Companion/dp/0691120358

Paper on LATE: https://www.jstor.org/stable/2951620

Ariel