
I built a detector to detect a binary outcome and then took a random sample from the population. From this, I can create a signal detection/confusion matrix (hit, miss, false alarm, correct rejection) [aka: TP, FP, FN, TN] and then calculate metrics such as Recall and Precision.

My question: How do you calculate confidence intervals for the Recall and Precision of the population from which I sampled?

I was thinking of this formula: $\hat{p} - z\sqrt{\hat{p}(1-\hat{p})/n} < p < \hat{p} + z\sqrt{\hat{p}(1-\hat{p})/n}$

where $\hat{p}$ = the sample statistic (e.g., Recall), $p$ = its population value, $n$ = the number of cases the statistic is computed from, and $z$ = the z-score for the desired confidence level.
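For concreteness, here is a minimal sketch of that normal-approximation (Wald) interval applied to recall, using made-up counts (the `tp`/`fn` values are purely illustrative):

```python
import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) CI for a binomial proportion."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# Hypothetical counts: recall = TP / (TP + FN)
tp, fn = 80, 20
lo, hi = wald_interval(tp, tp + fn, z=1.96)  # 95% interval
print(f"recall 95% CI: ({lo:.3f}, {hi:.3f})")  # (0.722, 0.878)
```

Note the well-known caveat that the Wald interval behaves poorly when $\hat{p}$ is near 0 or 1 or $n$ is small.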

4 Answers


The following approach might be more accurate and more efficient.

Goutte, C., & Gaussier, E. (2005, March). A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In European Conference on Information Retrieval (pp. 345-359). Springer Berlin Heidelberg.

Marsu_
    In addition to providing a reference, please briefly explain why this approach would be better – Antoine Feb 13 '17 at 15:37

I'll summarise the approaches which are sketched by the other two answers.

  1. Suggested by @Marsu_. Assume that your confusion matrix $C$ has a multinomial distribution $M(n; \pi)$; then the distribution of $TP$ is binomial. Assume a symmetric beta prior for precision $p$ and recall $r$, that is, $p, r \sim Beta(\lambda, \lambda)$. Then, given your data $D$, the posterior for $p$ is $p \mid D \sim Beta(TP + \lambda, FP + \lambda)$ and the posterior for $r$ is $r \mid D \sim Beta(TP + \lambda, FN + \lambda)$. You can then use software to calculate the appropriate interval, as outlined here: Calculate the confidence interval for the mean of a beta distribution.

  2. Suggested by @Fred. Generate $N$ bootstrap datasets $D_1, \dots, D_N$ by sampling with replacement from your underlying dataset $D$. For each $D_n$, fit your classifier and calculate the confusion matrix $C_n$; from each $C_n$ compute the precision $p_n$ and recall $r_n$. The confidence intervals for these quantities can then be read directly from the resulting bootstrap distributions.
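A minimal sketch of approach 1, with hypothetical confusion-matrix counts and a symmetric $Beta(0.5, 0.5)$ prior. To stay dependency-free it approximates the posterior quantiles by Monte Carlo sampling; with scipy you would instead call `scipy.stats.beta.ppf` for exact quantiles:

```python
import random

# Hypothetical confusion-matrix counts
TP, FP, FN = 80, 10, 20
lam = 0.5  # symmetric Beta(lam, lam) prior

def beta_interval(a, b, level=0.95, draws=100_000, seed=0):
    """Equal-tailed interval of Beta(a, b), approximated by Monte Carlo draws."""
    rng = random.Random(seed)
    xs = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1 - level) / 2
    return xs[int(tail * draws)], xs[int((1 - tail) * draws) - 1]

# Posterior: precision ~ Beta(TP + lam, FP + lam), recall ~ Beta(TP + lam, FN + lam)
prec_lo, prec_hi = beta_interval(TP + lam, FP + lam)
rec_lo, rec_hi = beta_interval(TP + lam, FN + lam)
print(f"precision 95% interval: ({prec_lo:.3f}, {prec_hi:.3f})")
print(f"recall    95% interval: ({rec_lo:.3f}, {rec_hi:.3f})")
```

The point estimates $TP/(TP+FP)$ and $TP/(TP+FN)$ fall inside these intervals, which shrink as the counts grow.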

MachineEpsilon
  • An alternative to the bootstrap approach in option (2) is the jackknife. For jackknife estimates of variance and standard errors of a given statistic (e.g. the mean), see e.g. Efron, B. and Tibshirani, R.J., 1994. "An Introduction to the Bootstrap", p. 141. Note that the jackknife should not be used for non-smooth statistics, where a small change to the data can cause a large change in the statistic, such as the median (or any other quantile). – mloning Oct 16 '18 at 10:06
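A minimal sketch of the leave-one-out jackknife standard error that comment refers to, illustrated on a toy dataset with the sample mean as the statistic (for which the jackknife SE coincides with the familiar $s/\sqrt{n}$):

```python
import math

def jackknife_se(data, stat):
    """Leave-one-out jackknife standard error of `stat` evaluated on `data`."""
    n = len(data)
    # Recompute the statistic with each observation removed in turn
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    var = (n - 1) / n * sum((t - mean_loo) ** 2 for t in loo)
    return math.sqrt(var)

mean = lambda xs: sum(xs) / len(xs)
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy sample
print(jackknife_se(data, mean))
```

The same `stat` argument could be replaced by a function that refits the classifier and returns precision or recall, though for such non-smooth pipelines the bootstrap is usually safer, per the caveat above.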

An answer here suggests using bootstrapped statistics; we've done this at my place of employment and it seems to do the right thing.

Confidence interval for precision and recall in classification
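A sketch of what this looks like in practice, assuming you keep the per-example `(y_true, y_pred)` pairs (the labels below are synthetic, purely for illustration): resample the pairs with replacement, recompute precision and recall on each resample, and take percentile intervals.

```python
import random

def precision_recall(pairs):
    """pairs: list of (y_true, y_pred) tuples with 0/1 labels."""
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def bootstrap_intervals(pairs, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap intervals for precision and recall."""
    rng = random.Random(seed)
    n = len(pairs)
    precs, recs = [], []
    for _ in range(n_boot):
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        p, r = precision_recall(resample)
        precs.append(p)
        recs.append(r)
    precs.sort()
    recs.sort()
    lo = int((1 - level) / 2 * n_boot)
    hi = int((1 + level) / 2 * n_boot) - 1
    return (precs[lo], precs[hi]), (recs[lo], recs[hi])

# Synthetic sample: a detector that is right ~90% of the time on 500 cases
rng = random.Random(1)
pairs = [(y, y if rng.random() < 0.9 else 1 - y)
         for y in [int(rng.random() < 0.3) for _ in range(500)]]
prec_ci, rec_ci = bootstrap_intervals(pairs)
print(f"precision 95% CI: {prec_ci}")
print(f"recall    95% CI: {rec_ci}")
```

Note this resamples the evaluation pairs only; refitting the classifier on each resample (as in the answer above) is the fuller but much more expensive version.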

Fred
    Please include the details of the link in the answer, to help avoid readers having to click on it. As it stands, this looks more like a comment. – Greenparker May 11 '16 at 00:48
    The link that you posted doesn't have any answer now. – aerin May 22 '18 at 19:00

Please note that the approach described in the paper suggested by @Marsu_ is Bayesian rather than frequentist. This means that the intervals it provides, despite what the article claims, are credible intervals, not confidence intervals; the two are in fact very different in interpretation.

The Bayesian approach assumes that the parameter of interest is a random variable with a prior distribution, and the credible interval bounds are fixed so as to encompass a given probability mass of the parameter's posterior distribution. The prior is chosen through considerations external to the inference problem; the article itself suggests several alternatives.

From the frequentist standpoint, the parameter is constant and the interval bounds are random variables; the confidence level represents how often, on average, the true value of the parameter would fall into the resulting confidence interval if the sample were re-drawn many times from the distribution.

See Credible interval and Confidence interval: Meaning and interpretation for more information.

So it seems the only remaining options for proper confidence intervals are resampling-based methods like bootstrap or jackknife proposed by others.