I've trained a binary classifier for a language identification problem. The training data consists of $n$ sentences from language A and $n$ sentences from language B, where the $n$ language-B sentences are selected uniformly at random from a larger corpus of $N$ sentences belonging to language B. The rationale is to create a balanced dataset for training.
Next, I'm using the classifier to identify the language of $m$ sentences from a new corpus, for which I don't have the ground-truth language. For each such sentence, the classifier predicts a probability $p$ that the sentence belongs to language A (or equivalently, a probability $1-p$ that it belongs to language B).
I'd like to have some measure of how strongly the classifier believes the sentence belongs to A rather than B, so I've used $dp := |p - (1-p)| = |2p - 1|$. One concern is how the random selection of $n$ out of $N$ sentences for the training set affects $p$. One simple solution that comes to mind is to conduct $k$ experiments (say $k = 10$), each with an independently drawn training set, and for each unknown sentence report the average and standard deviation of $dp$ over the $k$ experiments.
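For concreteness, here is a minimal sketch of the procedure I have in mind, assuming a scikit-learn-style text classifier. The corpora, feature choice, and classifier below are placeholder assumptions for illustration, not my actual setup:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the real corpora (illustrative assumption; in practice
# these would be the actual sentence lists).
corpus_a = ["the cat sat on the mat", "where is the station",
            "i would like some coffee", "it rains a lot here"]        # language A (size n)
corpus_b_full = ["el gato esta en la alfombra", "donde esta la estacion",
                 "quisiera un cafe por favor", "aqui llueve mucho",
                 "buenos dias a todos", "no entiendo la pregunta",
                 "la casa es muy grande", "gracias por la ayuda"]     # language B (size N)
unknown = ["the rain in spain", "el perro come pan"]                  # m unlabeled sentences

n = len(corpus_a)   # balanced training set: n sentences from A, n from B
k = 10              # number of repetitions

rng = np.random.default_rng(0)
dp_runs = []
for _ in range(k):
    # Draw a fresh balanced training set: all n A-sentences plus n random B-sentences.
    sample_b = list(rng.choice(corpus_b_full, size=n, replace=False))
    X_train = corpus_a + sample_b
    y_train = [1] * n + [0] * n   # 1 = language A, 0 = language B

    clf = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(X_train, y_train)

    p = clf.predict_proba(unknown)[:, 1]   # column 1 = P(label 1) = P(language A)
    dp_runs.append(np.abs(2 * p - 1))      # dp = |p - (1 - p)| = |2p - 1|

dp_runs = np.asarray(dp_runs)              # shape (k, m)
dp_mean = dp_runs.mean(axis=0)             # average dp per unknown sentence
dp_std = dp_runs.std(axis=0, ddof=1)       # spread due to the random subsample of B

for sent, mu, sd in zip(unknown, dp_mean, dp_std):
    print(f"{sent!r}: mean dp = {mu:.3f}, std = {sd:.3f}")
```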
Questions:
- Does it make sense to use $dp$ as such a measure?
- Does it make sense to report the average and std of $dp$ over $k$ experiments? If so, what should be the value of $k$?
Note that there are, of course, questions of how we know the classifier is any good, whether the probability $p$ is calibrated, etc. I'm ignoring these issues for simplicity's sake.