I have $N$ data points, each of which has an associated label that is either $0$ or $1$, and for each data point I know the true label. Say I put all the true labels in a vector $t$.
Next I ask Alice to label each data point (without access to the true labels) and collect her labels in $v^{(a)}$. I do the same with Bob to produce $v^{(b)}$.
By comparing $v^{(a)}$ with $t$, I can compute the F-score $f_a$, which shows how good Alice was at recognising the true labels. I can do the same for Bob to get $f_b$.
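For concreteness, here is a minimal sketch of this setup. It assumes scikit-learn is available, treats label $1$ as the positive class, and uses synthetic annotator labels purely for illustration (none of these choices are stated in the question):

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

N = 200
t = rng.integers(0, 2, size=N)              # true labels (0/1)

# Hypothetical annotators: each flips the true label with some probability.
v_a = np.where(rng.random(N) < 0.8, t, 1 - t)   # Alice, noisier
v_b = np.where(rng.random(N) < 0.9, t, 1 - t)   # Bob, more accurate

f_a = f1_score(t, v_a)                      # Alice's F-score against the truth
f_b = f1_score(t, v_b)                      # Bob's F-score against the truth
print(f"f_a = {f_a:.3f}, f_b = {f_b:.3f}")
```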
Question: assume $f_a$ and $f_b$ are different. How do I know if the difference is significant, or just occurred by chance?
What I've tried:
1. An independent-samples t-test between $v^{(a)}$ and $v^{(b)}$. The issue is that the $p$-values were above $0.05$ in most experiments, indicating insignificance, even when $f_a$ and $f_b$ were massively different. I have the feeling that I should not be comparing these two vectors directly, but should instead somehow compare how well each captures the true labels.
2. So I tried an independent-samples t-test between $v^{(a)}\times t$ and $v^{(b)}\times t$, where "$\times$" denotes element-wise multiplication (both attempts are sketched in code after this list). The results looked better, but not by much. I have the feeling that by comparing sample "means" I do not really capture what the F-score measures, and therefore cannot draw pertinent conclusions about the difference between the two F-scores $f_a$ and $f_b$.
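A minimal sketch of the two attempts above, assuming SciPy and reusing the hypothetical arrays `t`, `v_a`, `v_b` from the earlier snippet:

```python
from scipy.stats import ttest_ind

# Attempt 1: independent-samples t-test directly between the two label vectors.
stat1, p1 = ttest_ind(v_a, v_b)

# Attempt 2: the same test on the element-wise products with the true labels.
stat2, p2 = ttest_ind(v_a * t, v_b * t)

print(f"attempt 1: p = {p1:.3f}")
print(f"attempt 2: p = {p2:.3f}")
```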
Is there any significance-testing method that would help me properly assess the F-score difference between Alice and Bob?
Follow-up: what would I do if I could have more than two possible labels for a data point, say $0$, $1$, or $2$?
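As a side note on the follow-up: with three labels the F-score is no longer a single number by default, since it is defined per class. One common convention (an assumption here, not something fixed by the question) is to macro-average the per-class F-scores. A minimal sketch, again assuming scikit-learn:

```python
from sklearn.metrics import f1_score

# Hypothetical three-class labels, for illustration only.
t3 = [0, 1, 2, 2, 1, 0, 0, 2, 1, 2]
v3 = [0, 1, 2, 1, 1, 0, 2, 2, 1, 2]

per_class = f1_score(t3, v3, average=None)      # one F-score per label
macro     = f1_score(t3, v3, average="macro")   # unweighted mean of the above
print(per_class, macro)
```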