I'm trying to evaluate the performance of a simple binary classifier that predicts whether or not a Twitter user is a doctor.
We have a very large dataset of unlabeled users, so we need to sample the data before annotating it. Because non-doctors vastly outnumber doctors, a uniform random sample would contain almost no doctors, so instead we sampled 500 predicted "doctors" and 500 predicted "non-doctors" and counted the false positives and false negatives in each sample, respectively.
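To make the sampling scheme concrete, here is a toy simulation of what we did (all names and numbers below are made up for illustration, not our real pipeline):

```python
import random

random.seed(0)

# Toy setup: a rare positive class ("doctor") and an imperfect classifier.
# PREVALENCE, SENS, SPEC are invented values purely for the simulation.
N, PREVALENCE = 100_000, 0.02
SENS, SPEC = 0.80, 0.95

population = []
for _ in range(N):
    is_doctor = random.random() < PREVALENCE
    if is_doctor:
        predicted = random.random() < SENS   # detected with prob. SENS
    else:
        predicted = random.random() > SPEC   # false positive with prob. 1 - SPEC
    population.append((predicted, is_doctor))

# Our sampling scheme: 500 users from each *predicted* class, then annotate.
pred_pos = random.sample([u for u in population if u[0]], 500)
pred_neg = random.sample([u for u in population if not u[0]], 500)

# What annotation lets us estimate:
fdr_hat = sum(not truth for _, truth in pred_pos) / len(pred_pos)  # FP/(TP+FP)
for_hat = sum(truth for _, truth in pred_neg) / len(pred_neg)      # FN/(TN+FN)
print(f"FP/(TP+FP) ≈ {fdr_hat:.3f}   FN/(TN+FN) ≈ {for_hat:.3f}")
```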
However, when I try to apply the usual classification metrics to these counts, the denominators don't match. The quantities we can estimate are $$\frac{FP}{TP+FP}\quad \text{ and } \quad\frac{FN}{TN+FN},$$ whereas sensitivity and specificity would require denominators of $$TP+FN\quad \text{ and } \quad TN+FP,$$ respectively.
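If I have the terminology right, the two quantities we can estimate are the false discovery rate and the false omission rate, i.e. the complements of precision (PPV) and NPV, whereas sensitivity and specificity condition on the true class instead:
$$\text{FDR} = \frac{FP}{TP+FP} = 1-\text{PPV}, \qquad \text{FOR} = \frac{FN}{TN+FN} = 1-\text{NPV},$$
$$\text{sensitivity} = \frac{TP}{TP+FN}, \qquad \text{specificity} = \frac{TN}{TN+FP}.$$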
Similarly, precision is directly measurable from the first sample, but recall is not.
It seems like the underlying problem is that our samples are drawn from the predicted classes rather than the true classes. Are there established ways of transforming results from samples like these into the standard performance metrics?
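For concreteness, here is the kind of transformation I have in mind (I don't know whether this is the established approach, which is what I'm asking). Since the unlabeled dataset is very large, we can also measure the overall predicted-positive rate $p = P(\hat{y}=1)$, and Bayes' rule would then seem to give
$$\pi = P(y=1) = \text{PPV}\cdot p + (1-\text{NPV})(1-p),$$
$$\text{sensitivity} = \frac{\text{PPV}\cdot p}{\pi}, \qquad \text{specificity} = \frac{\text{NPV}\,(1-p)}{1-\pi}.$$
Is this kind of inversion valid, and is there a standard reference or name for it?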