I have $N$ data points, each of which has an associated label that is either $0$ or $1$, and for each data point I know the true label. Say I put all the true labels in a vector $t$.
Next I ask Alice to label each data point (without access to the true labels) and collect her labels in $v^{(a)}$. I do the same with Bob to produce $v^{(b)}$.
By comparing $v^{(a)}$ with $t$, I can compute the F-score $f_a$, which shows how good Alice was at recognising the true labels. I can do the same for Bob to get $f_b$.
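For concreteness, here is a minimal sketch of this setup. It assumes scikit-learn is available, treats label $1$ as the positive class, and uses synthetic annotator labels purely for illustration (none of these choices are stated in the question):

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

N = 200
t = rng.integers(0, 2, size=N)              # true labels (0/1)

# Hypothetical annotators: each flips the true label with some probability.
v_a = np.where(rng.random(N) < 0.8, t, 1 - t)   # Alice, noisier
v_b = np.where(rng.random(N) < 0.9, t, 1 - t)   # Bob, more accurate

f_a = f1_score(t, v_a)                      # Alice's F-score against the truth
f_b = f1_score(t, v_b)                      # Bob's F-score against the truth
print(f"f_a = {f_a:.3f}, f_b = {f_b:.3f}")
```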
Question: assume $f_a$ and $f_b$ are different. How do I know if the difference is significant, or just occurred by chance?
What I've tried:
1. An independent-samples t-test between $v^{(a)}$ and $v^{(b)}$. The issue is that the $p$-values were above $0.05$ in most experiments, indicating insignificance, even when $f_a$ and $f_b$ were massively different. I have the feeling that I should not be comparing these two vectors directly, but should instead somehow compare how well each captures the true labels.
2. So I tried an independent-samples t-test between $v^{(a)}\times t$ and $v^{(b)}\times t$, where "$\times$" denotes element-wise multiplication (both attempts are sketched in code after this list). The results looked better, but not by much. I have the feeling that by comparing sample "means" I do not really capture what the F-score measures, and therefore cannot draw pertinent conclusions about the difference between the two F-scores $f_a$ and $f_b$.
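A minimal sketch of the two attempts above, assuming SciPy and reusing the hypothetical arrays `t`, `v_a`, `v_b` from the earlier snippet:

```python
from scipy.stats import ttest_ind

# Attempt 1: independent-samples t-test directly between the two label vectors.
stat1, p1 = ttest_ind(v_a, v_b)

# Attempt 2: the same test on the element-wise products with the true labels.
stat2, p2 = ttest_ind(v_a * t, v_b * t)

print(f"attempt 1: p = {p1:.3f}")
print(f"attempt 2: p = {p2:.3f}")
```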
Is there any significance-testing method that would help me properly assess the F-score difference between Alice and Bob?
Follow-up: what would I do if I could have more than two possible labels for a data point, say $0$, $1$, or $2$?
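As a side note on the follow-up: with three labels the F-score is no longer a single number by default, since it is defined per class. One common convention (an assumption here, not something fixed by the question) is to macro-average the per-class F-scores. A minimal sketch, again assuming scikit-learn:

```python
from sklearn.metrics import f1_score

# Hypothetical three-class labels, for illustration only.
t3 = [0, 1, 2, 2, 1, 0, 0, 2, 1, 2]
v3 = [0, 1, 2, 1, 1, 0, 2, 2, 1, 2]

per_class = f1_score(t3, v3, average=None)      # one F-score per label
macro     = f1_score(t3, v3, average="macro")   # unweighted mean of the above
print(per_class, macro)
```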