I'm trying to evaluate the performance of a simple binary classifier that predicts whether or not a Twitter user is a doctor.
We have a very large dataset of unlabeled users, so we need to sample the data before annotating it. Because non-doctors vastly outnumber doctors, a uniform random sample would contain almost no doctors, so instead we sampled 500 predicted "doctors" and 500 predicted "non-doctors" and counted the false positives and false negatives in each sample, respectively.
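To make the sampling scheme concrete, here is a toy simulation of what we did (all names and numbers below are made up for illustration, not our real pipeline):

```python
import random

random.seed(0)

# Toy setup: a rare positive class ("doctor") and an imperfect classifier.
# PREVALENCE, SENS, SPEC are invented values purely for the simulation.
N, PREVALENCE = 100_000, 0.02
SENS, SPEC = 0.80, 0.95

population = []
for _ in range(N):
    is_doctor = random.random() < PREVALENCE
    if is_doctor:
        predicted = random.random() < SENS   # detected with prob. SENS
    else:
        predicted = random.random() > SPEC   # false positive with prob. 1 - SPEC
    population.append((predicted, is_doctor))

# Our sampling scheme: 500 users from each *predicted* class, then annotate.
pred_pos = random.sample([u for u in population if u[0]], 500)
pred_neg = random.sample([u for u in population if not u[0]], 500)

# What annotation lets us estimate:
fdr_hat = sum(not truth for _, truth in pred_pos) / len(pred_pos)  # FP/(TP+FP)
for_hat = sum(truth for _, truth in pred_neg) / len(pred_neg)      # FN/(TN+FN)
print(f"FP/(TP+FP) ≈ {fdr_hat:.3f}   FN/(TN+FN) ≈ {for_hat:.3f}")
```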
However, when I try to apply the usual classification metrics to these counts, the denominators don't match. The quantities we can estimate are $$\frac{FP}{TP+FP}\quad \text{ and } \quad\frac{FN}{TN+FN},$$ whereas sensitivity and specificity would require denominators of $$TP+FN\quad \text{ and } \quad TN+FP,$$ respectively.
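If I have the terminology right, the two quantities we can estimate are the false discovery rate and the false omission rate, i.e. the complements of precision (PPV) and NPV, whereas sensitivity and specificity condition on the true class instead:
$$\text{FDR} = \frac{FP}{TP+FP} = 1-\text{PPV}, \qquad \text{FOR} = \frac{FN}{TN+FN} = 1-\text{NPV},$$
$$\text{sensitivity} = \frac{TP}{TP+FN}, \qquad \text{specificity} = \frac{TN}{TN+FP}.$$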
Similarly, precision is directly measurable from the first sample, but recall is not.
It seems like the underlying problem is that our samples are drawn from the predicted classes rather than the true classes. Are there established ways of transforming results from samples like these into the standard performance metrics?
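For concreteness, here is the kind of transformation I have in mind (I don't know whether this is the established approach, which is what I'm asking). Since the unlabeled dataset is very large, we can also measure the overall predicted-positive rate $p = P(\hat{y}=1)$, and Bayes' rule would then seem to give
$$\pi = P(y=1) = \text{PPV}\cdot p + (1-\text{NPV})(1-p),$$
$$\text{sensitivity} = \frac{\text{PPV}\cdot p}{\pi}, \qquad \text{specificity} = \frac{\text{NPV}\,(1-p)}{1-\pi}.$$
Is this kind of inversion valid, and is there a standard reference or name for it?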