Confidence interval for raw number of correct binary predictions

Question

Let's say there are 600 games in an NBA season, and a competition among individuals to correctly forecast the winners of the games. The forecasts are not expressed probabilistically but simply in terms of raw outcomes (e.g. "Bulls will beat Spurs in Round 1").

Person A correctly forecasts the winner in 450 games (75% correct), while Person B correctly forecasts the winner in 400 (roughly 67% correct).

I'm interested in placing some kind of confidence interval around that 450, with a view to quantifying the distinction between Person A and Person B and establishing whether we have good grounds for saying Person A is more skilled than Person B (and Person C, and others not mentioned), or merely luckier. Ultimately, I'd like to put all the people on a chart with some sort of confidence interval (or maybe credible interval) around each person.

I'm unsure how to proceed. However, I will briefly explain an approach that occurred to me after I found the search term "binomial proportion confidence interval". As I understand from that Wikipedia page and from another answer, the normal approximation CI for a proportion is

$\hat{p} \pm z \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

So, I thought the right way forward might be to go

$0.75 \pm 1.96 \times \sqrt{\frac{0.75(1-0.75)}{600}}$

$= 0.75 \pm 1.96 \times \sqrt{\frac{0.1875}{600}}$

$= 0.75 \pm 1.96 \times \sqrt{0.0003125}$

$\approx 0.75 \pm 1.96 \times 0.01768$

$\approx 0.75 \pm 0.0346$

This gives a confidence interval of roughly (0.7154, 0.7846) around Person A's proportion correct of 0.75. The interval excludes Person B's proportion correct, which was about 0.67 (400/600).

We could then scale up to the number correct by multiplying by 600, giving a confidence interval of roughly (429.2, 470.8) around Person A's raw score of 450.
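As a check on the arithmetic above, here is a minimal R sketch of the normal-approximation (Wald) interval for 450 correct out of 600. The variable names are illustrative only. For comparison it also computes two other standard binomial intervals from base R, the Wilson score interval via `prop.test()` and the exact Clopper-Pearson interval via `binom.test()`; these are offered as possible alternatives, not as the required method, and all three assume the 600 predictions are independent.

```r
x <- 450   # number of correct predictions (Person A)
n <- 600   # number of games

p_hat <- x / n
z     <- qnorm(0.975)                     # about 1.96 for a 95% interval
se    <- sqrt(p_hat * (1 - p_hat) / n)    # standard error of the proportion

# Normal-approximation (Wald) interval on the proportion and count scales
wald_prop  <- p_hat + c(-1, 1) * z * se
wald_count <- n * wald_prop

# Alternatives from base R (assumption: independent predictions)
wilson_prop <- prop.test(x, n, correct = FALSE)$conf.int  # Wilson score interval
exact_prop  <- binom.test(x, n)$conf.int                  # Clopper-Pearson interval

print(wald_prop)
print(wald_count)
print(wilson_prop)
print(exact_prop)
```

With a sample this large and a proportion this far from 0 or 1, the three intervals should be close to one another; they tend to diverge more for small samples or proportions near the boundaries.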

Is this a sensible way to proceed? Should I be approaching the issue using bootstrapping, or using some other method?

There's a further potential complication which is that Person A and Person B (and others) have been playing this game for many years, and thus I've accumulated yearly totals for them that span across multiple years. I'm unsure if I should be considering some sort of 'rolling' confidence interval over the years, or treating each year as separate.

I recommend looking into bootstrapping to estimate the confidence intervals. — Jeffrey Girard, Oct 02 '18 at 12:12
Thanks! In relation to that, would the original data set be the end-of-season scores, e.g. 450 for Person A, 400 for Person B, 390 for Person C, and so on? Would it be a problem that there were only a small number of persons (say 8-10 persons) involved? — user1205901 - Reinstate Monica, Oct 02 '18 at 12:19

Jeffrey Girard · Accepted Answer · 2018-10-02T17:16:19.263

I'm not sure why you prefer counts over proportions, but in either case you can estimate confidence intervals using bootstrapping. Let's say you have two people who each make 1000 independent predictions. Person A has a 75% chance of being correct overall and Person B has a 66% chance of being correct overall (these are the "true" proportions; the observed proportions may differ due to sampling error). In R, we can simulate such data and estimate bias-corrected and accelerated (BCa) bootstrap confidence intervals for the counts as follows:

library(boot)
set.seed(12345)
personA <- rbinom(n = 1000, size = 1, prob = 0.75) #replace with real data
personB <- rbinom(n = 1000, size = 1, prob = 0.66) #replace with real data

# Statistic passed to boot(): the number of correct predictions in each resample
boot_fun <- function(data, idx) {
  resample <- data[idx]
  sum(resample)
}

bootA <- boot(data = personA, R = 2000, statistic = boot_fun)
boot.ci(bootA, type = "bca")
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 2000 bootstrap replicates
#> 
#> CALL : 
#> boot.ci(boot.out = bootA, type = "bca")
#> 
#> Intervals : 
#> Level       BCa          
#> 95%   (716, 771 )  
#> Calculations and Intervals on Original Scale

bootB <- boot(data = personB, R = 2000, statistic = boot_fun)
boot.ci(bootB, type = "bca")
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 2000 bootstrap replicates
#> 
#> CALL : 
#> boot.ci(boot.out = bootB, type = "bca")
#> 
#> Intervals : 
#> Level       BCa          
#> 95%   (608, 665 )  
#> Calculations and Intervals on Original Scale

Also note that comparing two confidence intervals is not as simple as checking whether the point estimate of one falls inside the confidence interval of the other. Two 95% confidence intervals can overlap even when the difference between the estimates is statistically significant at $\alpha=.05$. See the following article:

Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals and how to read pictures of data. The American Psychologist, 60(2), 170–180.
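One way to avoid judging significance from overlap is to put a confidence interval on the difference itself. The sketch below uses base R's `prop.test()` with the totals from the question (450/600 and 400/600). It rests on an assumption worth stating loudly: it treats the two sets of predictions as independent samples. If both people forecast the same games, the data are paired, and a paired method applied to the game-level results may be more appropriate.

```r
# Totals from the question (assumption: the two samples are independent)
correct <- c(450, 400)   # Person A, Person B
games   <- c(600, 600)

res <- prop.test(correct, games)

diff_ci <- res$conf.int  # 95% CI for (proportion A - proportion B)
print(diff_ci)
print(res$p.value)
```

If the interval for the difference excludes zero, that is evidence of a difference at the corresponding level, regardless of whether the two individual intervals overlap.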

I can see that the vector for personA in your answer will be something like (1, 0, 0, ...). How should my approach be if I only know personA got 450/600 and do not know which particular games they predicted correctly, and which they predicted incorrectly? In some cases I will have the raw data, and in others only the total. — user1205901 - Reinstate Monica, Oct 03 '18 at 03:55
Assuming that all predictions are independent, you can just take the 450/600 and generate the raw data as 450 ones and 150 zeroes: `c(rep(1, 450), rep(0, 150))` — Jeffrey Girard, Oct 03 '18 at 22:49
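The suggestion in the comment above can be sketched as follows. This is a minimal base-R illustration under the same independence assumption. It uses a simple percentile bootstrap rather than the BCa interval from the answer, and the variable names are illustrative. It also shows that, for 0/1 data, resampling the reconstructed vector and summing is equivalent to drawing counts from a binomial distribution with the observed proportion, so the two intervals should agree up to simulation noise.

```r
set.seed(12345)

n_games   <- 600
n_correct <- 450

# Reconstruct a 0/1 vector from the totals (the order does not matter here)
personA <- c(rep(1, n_correct), rep(0, n_games - n_correct))

# Percentile bootstrap of the count: resample games with replacement and sum
boot_counts <- replicate(2000, sum(sample(personA, replace = TRUE)))
ci_resample <- quantile(boot_counts, c(0.025, 0.975))

# Equivalent shortcut: each resampled sum is a Binomial(n_games, n_correct / n_games) draw
binom_counts <- rbinom(2000, size = n_games, prob = n_correct / n_games)
ci_binom <- quantile(binom_counts, c(0.025, 0.975))

print(ci_resample)
print(ci_binom)
```

Because the reconstructed vector carries no information beyond the two totals, this only helps when game-level data are unavailable; it cannot capture dependence between games.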
