
For a binary classification problem, I have a training data set that gets divided into calibration and validation sets: some data is used to train the classifier, and some held-out data (truth data) is used to measure precision/recall. In theory you would only need one test case per class to calculate an F-score, which raises the question of how reliable an F-score is given the number of test cases used. Is there a statistical metric that describes the "power" of an F-score as a function of the amount (sample size?) of the validation data set? Intuitively, an F-score calculated with more truth data should be a more reliable description of the classifier's precision/recall, but after searching I can't find a metric that actually quantifies this.

RyanG
  • A larger test sample will tend to get you closer to the *true* precision and recall (in effect binomial proportions) of your model, and so to the *true* F-score. See https://stats.stackexchange.com/questions/363382/confidence-interval-of-precision-recall-and-f1-score for some discussion of confidence intervals; a bootstrap sketch follows after these comments. – Henry Jun 30 '21 at 18:57
  • Don't use precision, recall, sensitivity, specificity, or the F1 score at all. Every criticism in the following threads applies equally to them, and indeed to all evaluation metrics that rely on hard classifications: [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) [Is accuracy an improper scoring rule in a binary classification setting?](https://stats.stackexchange.com/q/359909/1352) [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352) – Stephan Kolassa Jun 30 '21 at 20:04
  • Instead, use probabilistic classifications, and evaluate these using [proper scoring rules](https://stats.stackexchange.com/tags/scoring-rules/info); a minimal example of that follows below as well. – Stephan Kolassa Jun 30 '21 at 20:04
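
A minimal sketch of the confidence-interval idea from the comments, assuming hypothetical placeholder arrays `y_true` and `y_pred` standing in for your validation labels and hard predictions. It bootstraps the validation set to show how much the F-score varies at a given sample size:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Placeholder validation data -- replace with your own truth labels/predictions.
y_true = rng.integers(0, 2, size=200)
y_pred = (rng.random(200) < 0.5).astype(int)

n = len(y_true)
boot_f1 = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)  # resample validation cases with replacement
    boot_f1.append(f1_score(y_true[idx], y_pred[idx], zero_division=0))

lo, hi = np.percentile(boot_f1, [2.5, 97.5])  # 95% percentile interval
print(f"F1 = {f1_score(y_true, y_pred):.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```

Rerunning this with a larger validation set narrows the interval, which is one concrete way to express how "reliable" the reported F-score is for a given amount of truth data.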
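
And a short sketch of evaluating probabilistic predictions with proper scoring rules, as suggested in the last comment; `y_prob` is a hypothetical array of predicted positive-class probabilities standing in for your model's output:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

rng = np.random.default_rng(1)
# Placeholder data -- replace with your own labels and predicted probabilities.
y_true = rng.integers(0, 2, size=200)
y_prob = rng.random(200)

print(f"Brier score: {brier_score_loss(y_true, y_prob):.3f}")  # lower is better
print(f"Log loss:    {log_loss(y_true, y_prob):.3f}")          # lower is better
```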

0 Answers