
In Example 5 here, the authors discuss polling. Voting for candidate 'A' is given the numerical value of 1, and voting for 'B' has the value of 0. The CLT says that the average of a sample of $n$ people has the approximate distribution $$\overline{X} \approx N(p_0,\sigma/\sqrt{n}) $$ where $p_0$ is the true proportion of people who would vote 'A'. From this, and some general knowledge of the normal distribution they write:

This means that we can conservatively say that in 95% of polls of $n$ people the sample mean $\bar{X}$ is within $1/\sqrt{n}$ of the true mean. The frequentist statistician then takes the interval $\bar{X} \pm 1/\sqrt{n}$ and calls it the 95% confidence interval for $p_0$.
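Just to convince myself of the coverage claim, here is a quick simulation sketch of my own (the true proportion $p_0$, the poll size $n$, and the number of simulated polls are arbitrary choices, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

p0 = 0.37        # hypothetical true proportion voting 'A' (my choice)
n = 400          # poll size
trials = 10_000  # number of simulated polls

# Each poll: n votes, each equal to 1 ('A') with probability p0, else 0 ('B').
votes = rng.random((trials, n)) < p0
x_bar = votes.mean(axis=1)

# The book's conservative interval: x_bar +/- 1/sqrt(n).
half_width = 1 / np.sqrt(n)
covered = np.abs(x_bar - p0) <= half_width

print(f"Fraction of intervals containing p0: {covered.mean():.3f}")
# ~0.95 or slightly more, since 1/sqrt(n) >= 2*sigma/sqrt(n) for any p0
```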

I'm fine with this. However, they go on to write:

A word of caution: it is tempting and common, but wrong, to think that there is a 95% probability the true fraction $p_0$ is in the confidence interval. This is subtle, but the error is the same one as thinking you have a disease if a 95% accurate test comes back positive. It’s true that 95% of people taking the test get the correct result. It’s not necessarily true that 95% of positive tests are correct.

This remark has me puzzled. Earlier they said that in 95% of polls (of $n$ people) the confidence interval would contain $p_0$. Why doesn't this mean that $p_0$ would be in the interval with a probability of 95%?

Also, I can't see the analogy with the disease/test scenario (which sounds like the base rate fallacy). I'd appreciate some elaboration on that too.

Thank you!

user1337

1 Answer


I agree that the wording they chose was poor.

What they are trying to get at, I think, is that there is more than one way to be wrong. Their point is that the test is not right 95% of the time in the sense people usually assume: maybe the test catches 95% of the people who actually have the disease, but it also comes back positive for 10% of the people who don't. Then, if 20 out of every 100 people have the disease and 100 people take the test, about 27 of them get a positive result ($0.95 \cdot 20 + 0.10 \cdot 80 = 27$), but only 19 of those positives actually have the disease.

This is a real problem. In the early days of HIV testing, the fraction of people infected was quite small ($<5\%$), but the test had a high false-positive rate. That combination leads to a situation where a huge share of your 'diagnoses' are false positives. With the numbers I gave above, use Bayes' theorem to work out the chance that you actually have the disease given a positive result: it is not 95%.
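Here is that bookkeeping spelled out (a sketch using the illustrative numbers above, not figures for any real test):

```python
# Illustrative numbers from the example above, not real data for any actual test.
prevalence = 0.20            # P(disease)
sensitivity = 0.95           # P(positive | disease): the test catches 95% of true cases
false_positive_rate = 0.10   # P(positive | no disease)

# Total probability of a positive result (law of total probability).
p_positive = prevalence * sensitivity + (1 - prevalence) * false_positive_rate
# = 0.19 + 0.08 = 0.27

# Bayes' theorem: P(disease | positive).
p_disease_given_positive = prevalence * sensitivity / p_positive

print(f"P(positive)           = {p_positive:.2f}")                # 0.27
print(f"P(disease | positive) = {p_disease_given_positive:.2f}")  # about 0.70, not 0.95
```

So even though the test gets 95% of sick people right, only about 70% of the positive results are correct, and the gap gets much worse when the disease is rare.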

Part of this is just what I will call 'bookkeeping' on true positives, false positives, true negatives, and false negatives, which must also take into account how common the condition is in the first place (the base rate). This basic theory was worked out during WWII (see 'Detection theory': https://en.wikipedia.org/wiki/Detection_theory).

The author is also trying to introduce the dichotomy between Bayesian and frequentist statistics. Frequentist statistics is what we are mostly taught and involves the standard tests (t-tests, etc.); it treats the data at hand as the only thing we look at. Bayesian statistics, on the other hand, interprets the data we have in light of prior data, or even prior beliefs.

There is a lot written on the differences, but since they raised polling it is worth noting some historical (and recent) polling failures due to this split. Have you ever wondered how they can call a U.S. state in the presidential race from exit polls with only 2% of the vote counted?

It is because we know, from prior years, exactly what happened in each county. If a Democrat has won a county by a 10-20% margin in the last 20 elections, a Bayesian pollster reads a 2% lead for the Democrat as a very strong sign that he or she will win. But the same 2% lead in a county that has always voted Republican is a much weaker case, even though in both cases the lead is 2%. We are 'informed' by prior information, in Bayesian-speak.
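To make that concrete, here is a toy sketch (my own illustration with made-up vote counts and priors, not anything the author describes) of how the same 2% lead in early returns is read differently under different priors, using a simple Beta-Binomial update on the Democrat's vote share:

```python
from scipy.stats import beta

# Early returns: say 1020 votes counted, 520 for the Democrat (roughly a 2% lead).
dem_votes, total_votes = 520, 1020

def prob_dem_wins(prior_a, prior_b):
    """Beta(prior_a, prior_b) prior on the Democrat's vote share, updated with
    the early returns; returns the posterior probability that the share exceeds 0.5."""
    posterior = beta(prior_a + dem_votes, prior_b + (total_votes - dem_votes))
    return 1 - posterior.cdf(0.5)

# County with a long Democratic history (prior mean ~58%): the 2% lead is near-conclusive.
print(prob_dem_wins(580, 420))

# County that has always voted ~55% Republican: the same 2% lead is far less convincing.
print(prob_dem_wins(450, 550))

# Flat prior, closer to the purely frequentist reading of the raw 2% lead.
print(prob_dem_wins(1, 1))
```

The data are identical in all three calls; only the prior changes, and with it the conclusion.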

The UK Brexit vote had this problem: no one had voted on that issue before, so there was no prior history to lean on, and the frequentist polling got it wrong.

eSurfsnake