
I built a detector to detect a binary outcome and then took a random sample from the population. From this, I can create a signal detection/confusion matrix (hit, miss, false alarm, correct rejection) [aka: TP, FP, FN, TN] and then calculate metrics such as Recall and Precision.

My question: How do you calculate confidence intervals for the Recall and Precision of the population from which I sampled?

I was thinking of this formula: $\hat{p} - z\sqrt{\hat{p}(1-\hat{p})/n} < p < \hat{p} + z\sqrt{\hat{p}(1-\hat{p})/n}$

where $\hat{p}$ = the sample statistic (e.g., Recall), $p$ = its population value, $n$ = the number of cases the statistic is computed from, and $z$ = the z-score for the desired confidence level.
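For concreteness, here is a minimal sketch of that normal-approximation (Wald) interval applied to recall, using made-up counts (the `tp`/`fn` values are purely illustrative):

```python
import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) CI for a binomial proportion."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# Hypothetical counts: recall = TP / (TP + FN)
tp, fn = 80, 20
lo, hi = wald_interval(tp, tp + fn, z=1.96)  # 95% interval
print(f"recall 95% CI: ({lo:.3f}, {hi:.3f})")  # (0.722, 0.878)
```

Note the well-known caveat that the Wald interval behaves poorly when $\hat{p}$ is near 0 or 1 or $n$ is small.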

4 Answers


The following approach might be more accurate and more efficient.

Goutte, C., & Gaussier, E. (2005, March). A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In European Conference on Information Retrieval (pp. 345-359). Springer Berlin Heidelberg.

Marsu_
    In addition to providing a reference, please briefly explain why this approach would be better – Antoine Feb 13 '17 at 15:37

I'll summarise the approaches which are sketched by the other two answers.

  1. Suggested by @Marsu_. Assume that your confusion matrix $C$ has a multinomial distribution $M(n; \pi)$; then the distribution of $TP$ is binomial. Assume a symmetric beta prior for precision $p$ and recall $r$, that is, $p, r \sim Beta(\lambda, \lambda)$. Then, given your data $D$, the posterior for $p$ is $p \mid D \sim Beta(TP + \lambda, FP + \lambda)$ and the posterior for $r$ is $r \mid D \sim Beta(TP + \lambda, FN + \lambda)$. You can then use software to calculate the appropriate interval, as outlined here: Calculate the confidence interval for the mean of a beta distribution.

  2. Suggested by @Fred. Generate $N$ bootstrap datasets $D_1, \dots, D_N$ by sampling with replacement from your underlying dataset $D$. For each $D_n$, fit your classifier and calculate the confusion matrix $C_n$; from each $C_n$ compute the precision $p_n$ and recall $r_n$. The confidence intervals for these quantities can then be read directly from the resulting bootstrap distributions.
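A minimal sketch of approach 1, with hypothetical confusion-matrix counts and a symmetric $Beta(0.5, 0.5)$ prior. To stay dependency-free it approximates the posterior quantiles by Monte Carlo sampling; with scipy you would instead call `scipy.stats.beta.ppf` for exact quantiles:

```python
import random

# Hypothetical confusion-matrix counts
TP, FP, FN = 80, 10, 20
lam = 0.5  # symmetric Beta(lam, lam) prior

def beta_interval(a, b, level=0.95, draws=100_000, seed=0):
    """Equal-tailed interval of Beta(a, b), approximated by Monte Carlo draws."""
    rng = random.Random(seed)
    xs = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1 - level) / 2
    return xs[int(tail * draws)], xs[int((1 - tail) * draws) - 1]

# Posterior: precision ~ Beta(TP + lam, FP + lam), recall ~ Beta(TP + lam, FN + lam)
prec_lo, prec_hi = beta_interval(TP + lam, FP + lam)
rec_lo, rec_hi = beta_interval(TP + lam, FN + lam)
print(f"precision 95% interval: ({prec_lo:.3f}, {prec_hi:.3f})")
print(f"recall    95% interval: ({rec_lo:.3f}, {rec_hi:.3f})")
```

The point estimates $TP/(TP+FP)$ and $TP/(TP+FN)$ fall inside these intervals, which shrink as the counts grow.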

MachineEpsilon
  • An alternative to the bootstrap approach in option (2) is the jackknife. For jackknife estimates of variance and standard errors of a given statistic (e.g. the mean), see e.g. Efron, B. and Tibshirani, R.J., 1994. "An Introduction to the Bootstrap", p. 141. Note that the jackknife should not be used for non-smooth statistics, where a small change to the data can cause a large change in the statistic, such as the median (or any other quantile). – mloning Oct 16 '18 at 10:06
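A minimal sketch of the leave-one-out jackknife standard error that comment refers to, illustrated on a toy dataset with the sample mean as the statistic (for which the jackknife SE coincides with the familiar $s/\sqrt{n}$):

```python
import math

def jackknife_se(data, stat):
    """Leave-one-out jackknife standard error of `stat` evaluated on `data`."""
    n = len(data)
    # Recompute the statistic with each observation removed in turn
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    var = (n - 1) / n * sum((t - mean_loo) ** 2 for t in loo)
    return math.sqrt(var)

mean = lambda xs: sum(xs) / len(xs)
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy sample
print(jackknife_se(data, mean))
```

The same `stat` argument could be replaced by a function that refits the classifier and returns precision or recall, though for such non-smooth pipelines the bootstrap is usually safer, per the caveat above.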

An answer here suggests using bootstrapped statistics; we've done this at my place of employment and it seems to do the right thing.

Confidence interval for precision and recall in classification
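A sketch of what this looks like in practice, assuming you keep the per-example `(y_true, y_pred)` pairs (the labels below are synthetic, purely for illustration): resample the pairs with replacement, recompute precision and recall on each resample, and take percentile intervals.

```python
import random

def precision_recall(pairs):
    """pairs: list of (y_true, y_pred) tuples with 0/1 labels."""
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def bootstrap_intervals(pairs, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap intervals for precision and recall."""
    rng = random.Random(seed)
    n = len(pairs)
    precs, recs = [], []
    for _ in range(n_boot):
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        p, r = precision_recall(resample)
        precs.append(p)
        recs.append(r)
    precs.sort()
    recs.sort()
    lo = int((1 - level) / 2 * n_boot)
    hi = int((1 + level) / 2 * n_boot) - 1
    return (precs[lo], precs[hi]), (recs[lo], recs[hi])

# Synthetic sample: a detector that is right ~90% of the time on 500 cases
rng = random.Random(1)
pairs = [(y, y if rng.random() < 0.9 else 1 - y)
         for y in [int(rng.random() < 0.3) for _ in range(500)]]
prec_ci, rec_ci = bootstrap_intervals(pairs)
print(f"precision 95% CI: {prec_ci}")
print(f"recall    95% CI: {rec_ci}")
```

Note this resamples the evaluation pairs only; refitting the classifier on each resample (as in the answer above) is the fuller but much more expensive version.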

Fred
    Please include the details of the link in the answer, to help avoid readers having to click on it. As it stands, this looks more like a comment. – Greenparker May 11 '16 at 00:48
    The link that you posted doesn't have any answer now. – aerin May 22 '18 at 19:00

Please note that the approach described in the paper suggested by @Marsu_ is Bayesian rather than frequentist. This means that the intervals it provides, despite what the article claims, are credible intervals, not confidence intervals; the two are in fact very different in interpretation.

The Bayesian approach assumes that the parameter of interest is a random variable with a prior distribution, and the credible interval bounds are fixed so as to encompass a given probability mass of the parameter's posterior distribution. The prior is chosen through considerations external to the inference problem; the article itself suggests several alternatives.

From the frequentist standpoint, the parameter is constant and the interval bounds are random variables; the confidence level represents how often, on average, the true value of the parameter would fall into the resulting confidence interval if the sample were re-drawn many times from the distribution.

See Credible interval and Confidence interval: Meaning and interpretation for more information.

So it seems the only remaining options for proper confidence intervals are resampling-based methods like bootstrap or jackknife proposed by others.