
I have a follow-up question to the answers given in the question "US Election results 2016: What went wrong with prediction models?". I suspect that election polling violates an independence assumption among the samples taken.

Specifically, I think that poll results affect the sample population. I think that people have become more sophisticated about polls and deliberately try to skew the results by working in concert with others through social media. For instance, if you are part of a demographic that is polled, it is likely that you or someone you know is also being polled. It would not be a stretch to think that this demographic is loosely networked through social media, so its members are able to influence one another.

My question is this: What sort of statistical test can one employ to test for the independence of the samples in the sample population?
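One simple check for serial dependence in a two-valued sequence of responses is the Wald-Wolfowitz runs test, which asks whether equal answers cluster together (or alternate) more than chance would allow. The following is only a minimal sketch, assuming the responses are available in the order they were collected and take exactly two values; the function name `runs_test` and the toy data are hypothetical, and the p-value relies on a normal approximation that needs a reasonably long sequence.

```python
import math
import random


def runs_test(seq):
    """Wald-Wolfowitz runs test for a sequence with exactly two distinct values.

    Returns (z, p) where z is the standardized number of runs and p is the
    two-sided p-value under the normal approximation.
    """
    values = sorted(set(seq))
    if len(values) != 2:
        raise ValueError("sequence must contain exactly two distinct values")

    n1 = sum(1 for x in seq if x == values[0])
    n2 = len(seq) - n1
    n = n1 + n2

    # A run is a maximal block of identical consecutive values.
    runs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

    # Mean and variance of the run count if the order were random.
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))

    z = (runs - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p


# Toy data (made up): answers that cluster in time versus shuffled answers.
clustered = ["C"] * 20 + ["T"] * 20
shuffled = clustered[:]
random.Random(0).shuffle(shuffled)

print("clustered:", runs_test(clustered))
print("shuffled: ", runs_test(shuffled))
```

A strongly negative z means fewer runs than expected (answers cluster), and a strongly positive z means more runs than expected (answers alternate). This only detects dependence along the ordering of the sequence; other kinds of dependence need other tests.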

grldsndrs
  • Obviously polls affect voters. But I do not think that there is any point in discussing whether there is a conspiracy to bias the polls... Moreover, the chance of being sampled to take part in a poll is so small that such a conspiracy wouldn't work. – Tim Nov 10 '16 at 18:02
  • Also, for this to be on-topic you'd need to define precisely what kind of independence you are talking about. Samples for polls are drawn randomly, so obviously it's not about dependence in sampling. – Tim Nov 10 '16 at 18:10
  • "Independent and identically distributed" implies an element in the sequence is independent of the random variables that came before it. In this way, an IID sequence is different from a Markov sequence, where the probability distribution for the nth random variable is a function of the previous random variable in the sequence (for a first order Markov sequence). BTW I am not interested in discussing any conspiracy theories. The question simply asks how to test for the validity of an assumption of independence. – grldsndrs Nov 10 '16 at 19:03
  • *Perfect* independence does not really hold for most real-life problems. Also, there are time trends in political preferences, so obviously there is some kind of autocorrelation, and samples drawn in similar time periods *would* be similar. Surely you can test for autocorrelation, but even without looking closely at the data I can tell you that it will be autocorrelated. – Tim Nov 10 '16 at 21:14
  • So is autocorrelation the only test for independence? What other methods might one employ? – grldsndrs Nov 10 '16 at 21:29
  • No not only, but *what kind* of dependence will you expect? – Tim Nov 10 '16 at 21:50
  • I would speculate that there is dependence based on social media clusters. For instance, if I had a network map of social media connections, I would expect that the shorter the distance between two nodes (representing two samples), the greater the dependence. – grldsndrs Nov 10 '16 at 23:08
  • So this would be impossible to check, because you would need the raw, de-anonymized data of the answers people gave, all their social media accounts with all their connections, and possibly the complete connection graphs of all the portals and the interconnections between them. I guess the U.S. NSA and the British GCHQ have part of this data, but ordinary people do not. Moreover, such a conspiracy wouldn't work, since not all age groups use social media actively, so you would have to argue that young people alone influenced the polls, which afaik is not true. – Tim Nov 11 '16 at 08:33
  • Impossible is subjective/relative. But I don't want to discuss who or what has access to data. The theory doesn't depend on the social media that young people frequent. Good old-fashioned talking has also been enhanced in this digital age, as has communication through public broadcasts. Keep in mind that people don't need to 'agree to think alike'; circumstances create the space for people to form similar ideas and act in a sort of concert. A school of fish dodging a predator or a flock of starlings moving synchronously are analogies for the behavior I would like to quantify statistically. – grldsndrs Nov 11 '16 at 09:02
  • The problem isn't non-independence within a poll, but that polling errors are correlated across polls. This is known, and some prediction models (538) took it into account, but many others didn't. Non-independence generally leads to confidence intervals that are too narrow. 538 gave Clinton a 65-70% chance of winning (which was probably approximately correct), but other models that didn't properly take this into account gave Clinton a >99% chance. – gung - Reinstate Monica Nov 11 '16 at 15:54
  • @gung, thanks, that is insightful and edifying, but I am looking for ways to test for independence. The election provided a framework to ask the question. Do you have any suggestions for how to test for independence? Tim has suggested autocorrelation as a means of testing. What say you? – grldsndrs Nov 11 '16 at 19:22
  • You have to specify the type of independence you want to test for, as @Tim said. You seem to be focused on the idea of social networks, which doesn't seem promising to me & for which you won't have the relevant data, or on autocorrelations over time w/i a single poll, which I also suspect is largely irrelevant. The primary issue is that polling errors are correlated. This is well known in the right places. It has already been established so it isn't clear what there is for you to do. – gung - Reinstate Monica Nov 11 '16 at 20:08
  • OK, so it seems I am not clear on the types of independence. What are some keywords I could search for, besides independence? Just doing a quick search I found chi-square testing; can you add to this reading list? BTW, I am not sure how I gave the impression that I am trying to solve a specific problem, because I am not. I am trying to understand what independence testing is all about. The journey is important here, not the end result. – grldsndrs Nov 11 '16 at 20:25
  • Independence means that they aren't related. Non-independence means they are related, somehow. You need to specify *how*. Eg, the idea that responses are autocorrelated over time means that the people responding on day 1 are more likely to say Clinton, & those on day 3 are more likely to say Trump, although when averaged over you get 50-50. The idea that the errors are correlated is that if 1 poll is +2 points off towards Clinton, other polls will also tend to be +2 off towards Clinton. There are potentially infinite ways they can be non-independent. – gung - Reinstate Monica Nov 11 '16 at 21:30
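The point made in the comments above, that correlated polling errors make confidence intervals too narrow, can be illustrated with a small simulation. This is only a sketch under stated assumptions: every poll's error is modelled as one shared component (common to all polls) plus an independent component, both normal, and the standard deviations, the number of polls, and all function names are made up for illustration. It is not how any particular forecaster models polls.

```python
import math
import random
import statistics


def simulate_poll_errors(n_polls, shared_sd, indep_sd, rng):
    """Errors (in points) of n_polls polls that share one common error term."""
    shared = rng.gauss(0.0, shared_sd)  # the same miss affects every poll
    return [shared + rng.gauss(0.0, indep_sd) for _ in range(n_polls)]


def naive_interval_coverage(n_polls=20, shared_sd=2.0, indep_sd=3.0,
                            n_sims=2000, seed=0):
    """Fraction of simulations in which a naive 95% interval around the
    poll average contains the true value (an error of zero).

    The naive interval treats the polls as independent, so its standard
    error is stdev / sqrt(n_polls) and ignores the shared component.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        errors = simulate_poll_errors(n_polls, shared_sd, indep_sd, rng)
        mean_error = statistics.fmean(errors)
        naive_se = statistics.stdev(errors) / math.sqrt(n_polls)
        if abs(mean_error) <= 1.96 * naive_se:
            hits += 1
    return hits / n_sims


print("coverage, independent errors:", naive_interval_coverage(shared_sd=0.0))
print("coverage, correlated errors: ", naive_interval_coverage(shared_sd=2.0))
```

With no shared component the naive interval covers the truth at close to its nominal rate, while with a shared component the coverage drops well below it, because averaging more polls shrinks only the independent part of the error.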
