
I have a population including rare events (let's call them "event A") and I want to evaluate the precision and recall of a new algorithm to detect the rare events. In the actual population, I have 100 million events, including 10,000 A events of interest (positives). But to evaluate the new model, I cannot process the whole population, as it would take too long. So I want to use a sample that contains all 10k A events, but only 100k negatives, drawn uniformly at random from the whole population of negatives. From the results on this sample, I would evaluate the precision (TP/(TP+FP)) and recall (TP/(TP+FN)) of the algorithm on the whole population.

My reasoning is that in the test sample, the ratio of positives to negatives is 1e4/1e5 = 1e-1, but in the actual population, that ratio is 1e4/1e8 = 1e-4. So, to undo the bias I introduced in the sample, I should weight each negative by (1e-1)/(1e-4) = 1e3, i.e. the inverse of the fraction of negatives that made it into the sample (roughly 1e8/1e5 = 1e3). I would still weight the positives by 1. With those weights, I would count the TP, FP and FN in the test sample, then calculate:

$$\text{precision} = \frac{TP}{TP + 10^3\,FP}$$ and $$\text{recall} = \frac{TP}{TP + FN}$$

The number of FPs needs to be corrected, as a single FP in the test sample "stands in" for $10^3$ FPs in the actual population. For the FNs, however, a false negative is actually a positive, and the positives were not subsampled, so I should not correct them.
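
For concreteness, here is a minimal sketch (in Python) of how I would compute these weighted estimates on the test sample; the array names `y_true`/`y_pred` and the helper function are just placeholders, not existing code:

```python
import numpy as np

def weighted_precision_recall(y_true, y_pred, neg_weight=1e3):
    """Estimate population precision/recall from the biased test sample,
    weighting each sampled negative by neg_weight (~1e8 / 1e5)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    tp = np.sum((y_true == 1) & (y_pred == 1))  # positives keep weight 1
    fn = np.sum((y_true == 1) & (y_pred == 0))  # positives keep weight 1
    fp = np.sum((y_true == 0) & (y_pred == 1))  # each stands in for neg_weight negatives

    precision = tp / (tp + neg_weight * fp)
    recall = tp / (tp + fn)
    return precision, recall
```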

Is that correct?

PS: after this post, I actually found a way to better estimate the precision and sidestep the over-counting of FPs in the whole population by the formula above. I draw a second sample that contains negatives only, which allows me to estimate the actual number of FPs in the population much more precisely. I then calculate the precision from the number of TPs in the first sample and the estimated number of FPs in the whole population from the second sample. This two-sample procedure seems to evaluate the precision on the whole population much better than the formula above.
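
A rough sketch of that two-sample calculation (hypothetical names; the 1e8 figure for the population negatives comes from the setup above):

```python
def precision_two_samples(tp_sample_1, fp_sample_2, n_sample_2,
                          n_population_negatives=1e8):
    """Estimate population precision from the TP count of sample 1 (all positives
    plus sampled negatives) and the false-positive rate measured on sample 2
    (negatives only, drawn uniformly at random)."""
    fp_rate = fp_sample_2 / n_sample_2                 # P(alarm | negative)
    fp_population = fp_rate * n_population_negatives   # estimated FPs in the population
    return tp_sample_1 / (tp_sample_1 + fp_population)
```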

Frank
  • Don't use accuracy, precision, recall, sensitivity, specificity, or the F1 score. Every criticism at the following threads applies equally to all of these, and indeed to all evaluation metrics that rely on hard classifications: [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) Unbalanced classes are almost certainly not a problem, and oversampling will not solve a non-problem: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Aug 29 '21 at 08:53
  • @StephanKolassa - that's all good - what kind of evaluation should one use on that type of problem then? – Frank Aug 29 '21 at 15:16
  • Sorry for only getting back now, I was on vacation. I would strongly recommend looking at probabilistic predictions, and assessing these using proper scoring rules. Take a look at the two threads I linked to in my previous comment, as well as at [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352). – Stephan Kolassa Sep 07 '21 at 15:06
  • I ended up settling on "precision @k", which means, take e.g. the top 100 model scores, and compute the precision/recall based on that only. The premise is that when it comes to putting a system in production, it would be a valid thing to do in this case. Needless to say, the precision/recall become very good – Frank Sep 08 '21 at 16:48
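
A minimal sketch of that "precision @ k" computation (Python; `scores` and `y_true` are hypothetical placeholders for the model scores and true labels on the test sample):

```python
import numpy as np

def precision_recall_at_k(y_true, scores, k=100):
    """Rank by model score, keep the top k, and evaluate on that slice only."""
    y_true = np.asarray(y_true)
    order = np.argsort(scores)[::-1]          # highest scores first
    top_k = y_true[order[:k]]
    precision_at_k = top_k.sum() / k          # fraction of the top k that are true positives
    recall_at_k = top_k.sum() / y_true.sum()  # fraction of all positives caught in the top k
    return precision_at_k, recall_at_k
```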

1 Answer


I'm not sure I'd call this stratified sampling, but your calculations for precision and recall make sense. It's as if you cluster the negative examples into groups of ~1,000, choose a representative from each group, and calculate approximate statistics based on those representatives.

Of course, the effect of this procedure on training is another issue; e.g., for Bayesian approaches, your priors will be calculated differently. So you may need to account for sample weights.
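
For example, a minimal sketch of the same reweighting expressed as per-example sample weights, using scikit-learn's `sample_weight` argument (the toy arrays below are just placeholders, not the poster's data):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy stand-ins for the test sample's labels and predictions.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# Same weighting as in the question: each sampled negative counts for 1e3.
weights = np.where(y_true == 0, 1e3, 1.0)

precision = precision_score(y_true, y_pred, sample_weight=weights)  # TP / (TP + 1e3*FP)
recall = recall_score(y_true, y_pred, sample_weight=weights)        # TP / (TP + FN)
```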

gunes
  • Agree -- except that I *would* call this stratified sampling. Actually, I'd probably call it case-control sampling, but that's a special case of stratified sampling – Thomas Lumley Aug 28 '21 at 23:23
  • That's really interesting and good to know; a stratified subsample would contain only 10 positives under the definition I'm accustomed to (splitting the larger population into homogeneous smaller samples). – gunes Aug 28 '21 at 23:30
  • It's surprising to me that this way of drawing the sample affects the precision calculation but not the recall - is that correct? And yes, I think this is stratified sampling, as I'm working with a partition of the original population. The only catch is that I don't subsample one of the subsets, because it's small enough, so I take it all; only the subset that's too large is sampled. But I should be allowed to do that and still call it "stratified sampling". – Frank Aug 28 '21 at 23:46
  • It might be surprising at first glance, but makes total sense because the weights of positive samples are still $1$. – gunes Aug 28 '21 at 23:48
  • For training, by the way, the positives are so rare that I need to balance the training set with a ratio of 1:1 for positives:negatives, or maybe 1:5, or the classifier "does not see anything" and reverts to always predicting a negative - I guess I could also change the loss function (see the class-weight sketch at the end of this thread). – Frank Aug 28 '21 at 23:59
  • One problem I can see, though, is that I'm assuming every FP in the sample stands in for 1,000 FPs in the population, which might be incorrect. – Frank Aug 29 '21 at 15:15
  • Yes, but it's the trade-off you pay for. – gunes Aug 29 '21 at 15:17
  • @gunes - I think I've found a way around this, by drawing a second sample that contains negatives only, and evaluating the number of FPs in the population from that second sample. I then plug in that estimate of the FPs and the count of TPs from the first sample, and the precision is much closer to the actual precision on the population. The second sample gives a better estimate of the FPs. – Frank Sep 04 '21 at 20:27
  • That is also a sensible approach. But, how is the sample size of your second sample compared to your first one? And how do you split your train/test sets in the two configurations? – gunes Sep 05 '21 at 21:24
  • I take the same sample sizes, in the test set only. The train set is not affected and completely disjoint (hopefully), in time among other criteria. But in the end, this is equivalent to having a beefier test set with more negative samples, IMHO. – Frank Sep 08 '21 at 16:51
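
A minimal sketch of the loss-reweighting alternative mentioned in the comments, using scikit-learn's `class_weight` option (the classifier and toy data are illustrative assumptions, not the actual setup):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy imbalanced training set (~0.1% positives) as a stand-in for the real data.
X_train, y_train = make_classification(n_samples=10_000, weights=[0.999],
                                        random_state=0)

# Instead of undersampling negatives to 1:1 or 1:5, reweight the loss so the rare
# positives are not ignored; "balanced" derives weights from class frequencies.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
```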