Why does the BERT NSP head linear layer have two outputs?

Question

Here's the code in question:

from torch import nn

class BertOnlyNSPHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Projects the pooled representation to 2 values.
        self.seq_relationship = nn.Linear(config.hidden_size, 2)

    def forward(self, pooled_output):
        # Output shape: (batch_size, 2)
        seq_relationship_score = self.seq_relationship(pooled_output)
        return seq_relationship_score

I thought NSP just scored how likely one sentence is to follow another. Wouldn't that be a single score?

Lerner Zhang · Accepted Answer · 2020-05-31T00:34:06.363

We can see from these two lines that the loss is cross-entropy, so the outputs are unnormalized class scores (logits), one per class:

loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

NSP is a binarized next sentence prediction task, so it is trained as two-class classification with one logit per class. A single logistic output would be equivalent; see this question: Is binary logistic regression a special case of multinomial logistic regression when the outcome has 2 levels?
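To make the equivalence concrete, here is a minimal sketch in plain Python (standard library only, not the library's own code). It checks that a softmax over two logits gives the same probability as a sigmoid applied to the difference of the logits, and that the two losses agree. The function names and the logit values are made up for illustration.

```python
import math

def softmax2(z0, z1):
    """Softmax over two logits, returned as (p0, p1)."""
    m = max(z0, z1)  # subtract the max for numerical stability
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cross_entropy2(z0, z1, label):
    """Cross-entropy for one example with two logits; label is 0 or 1."""
    return -math.log(softmax2(z0, z1)[label])

def binary_cross_entropy(logit, label):
    """Binary cross-entropy on a single logit for class 1."""
    p1 = sigmoid(logit)
    return -math.log(p1 if label == 1 else 1.0 - p1)

z0, z1 = 0.3, 1.7  # made-up logits for the two classes
p0, p1 = softmax2(z0, z1)

# The two-output head and a one-output head on (z1 - z0) agree.
print(p1, sigmoid(z1 - z0))
print(cross_entropy2(z0, z1, 1), binary_cross_entropy(z1 - z0, 1))
```

So the two-output layer carries one redundant degree of freedom: only the difference between the two logits affects the predicted probabilities.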

When the number of labels for sequence classification is 1 (num_labels == 1), the model instead does regression with mean squared error loss.
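The switch between regression and classification can be sketched as follows. This is an illustrative plain-Python version, not the library's implementation: `sequence_loss`, `mse_loss`, and `cross_entropy_loss` are hypothetical helpers, and the inputs are made-up numbers.

```python
import math

def mse_loss(predictions, targets):
    """Mean squared error over a batch of scalar predictions."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy_loss(logit_rows, labels):
    """Mean cross-entropy over a batch; each row holds one logit per class."""
    total = 0.0
    for row, label in zip(logit_rows, labels):
        m = max(row)  # log-sum-exp with the max subtracted for stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum_exp - row[label]
    return total / len(labels)

def sequence_loss(outputs, labels, num_labels):
    """Hypothetical helper: regression if num_labels == 1, else classification."""
    if num_labels == 1:
        return mse_loss([row[0] for row in outputs], labels)
    return cross_entropy_loss(outputs, labels)

# One output per example: treated as a regression target.
regression_loss = sequence_loss([[0.5], [2.0]], [1.0, 2.0], num_labels=1)

# Two outputs per example: treated as class logits, as in the NSP head.
classification_loss = sequence_loss([[0.0, 0.0], [0.0, 0.0]], [0, 1], num_labels=2)

print(regression_loss, classification_loss)
```

With a single output there is no softmax to normalize over, which is why that case falls back to regression rather than binary classification.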
