The subset accuracy is indeed a harsh metric. To get a sense of how good or bad 0.29 is, here are some ideas:
- look at how many labels you have on average for each sample (see the sketch after this list)
- look at the inter-annotator agreement, if available (if not, try it yourself to see what subset accuracy you obtain when you act as the classifier)
- think about whether the topics are well defined
- look at how many samples you have for each label
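For the labels-per-sample and samples-per-label counts, here is a minimal sketch, assuming your labels are stored in a 0/1 indicator matrix (Y is a hypothetical placeholder for your own label matrix):
import numpy as np

# Hypothetical indicator matrix: one row per sample, one column per label.
Y = np.array([[0, 1, 0],
              [0, 1, 1],
              [1, 0, 1],
              [0, 0, 1]])

# Average number of labels per sample (label cardinality).
print('Labels per sample (mean): {0}'.format(Y.sum(axis=1).mean()))

# Number of samples per label, to spot rare labels.
print('Samples per label: {0}'.format(Y.sum(axis=0)))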
You may also want to compute the Hamming score, to see whether your classifier is clueless or is instead decently good but has issues predicting all the labels correctly. See below for how to compute the Hamming score.
At the same time, from what I understand I cannot use scikit.metrics with OneVsRestClassifier so how can I get some metrics (F1, Precision, Recall, etc.) so as to figure out what is wrong?
See How to compute precision/recall for multiclass-multilabel classification?. I forget whether sklearn supports it; I recall it had some limitations, e.g. sklearn does not support a multi-label confusion matrix. It would indeed be a good idea to look at these numbers; a rough sketch of what is typically possible is shown below.
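If y_true and y_pred are binary indicator matrices (which is what OneVsRestClassifier produces for multi-label targets), recent versions of sklearn.metrics can compute precision, recall, and F1 with an averaging strategy. This is a minimal sketch under that assumption, not a verified recipe for the asker's exact setup:
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical indicator matrices; replace with your own y_true and y_pred.
y_true = np.array([[0, 1, 0], [0, 1, 1], [1, 0, 1], [0, 0, 1]])
y_pred = np.array([[0, 1, 1], [0, 1, 1], [0, 1, 0], [0, 0, 0]])

# 'micro' aggregates over all label decisions, 'macro' averages per label,
# 'samples' averages per sample. Warnings may appear for samples or labels
# with no positive predictions (precision is then ill-defined).
for average in ('micro', 'macro', 'samples'):
    print('{0}: P={1:.2f} R={2:.2f} F1={3:.2f}'.format(
        average,
        precision_score(y_true, y_pred, average=average),
        recall_score(y_true, y_pred, average=average),
        f1_score(y_true, y_pred, average=average)))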
Hamming score:
In a multilabel classification setting, sklearn.metrics.accuracy_score
only computes the subset accuracy (3): i.e. the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
This way of computing the accuracy is sometimes called, perhaps less ambiguously, the exact match ratio (1):

$$ \text{ExactMatchRatio} = \frac{1}{|D|} \sum_{i=1}^{|D|} I(x_i = y_i), $$

where \\(|D|\\) is the number of samples, \\(y_i\\) is the set of true labels for the \\(i\\)-th sample, \\(x_i\\) is the set of predicted labels, and \\(I\\) is the indicator function.
Another typical way to compute the accuracy is defined in (1) and (2), and is less ambiguously referred to as the Hamming score (4) (since it is closely related to the Hamming loss) or as label-based accuracy. It is computed as follows:

$$ \text{HammingScore} = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|x_i \cap y_i|}{|x_i \cup y_i|}, $$

i.e. for each sample, the number of correctly predicted labels divided by the number of labels in the union of the true and predicted label sets, averaged over all samples (a sample with no true and no predicted labels counts as 1, as in the code below).
Here is a Python method to compute the Hamming score:
# Code by https://stackoverflow.com/users/1953100/william
# Source: https://stackoverflow.com/a/32239764/395857
# License: cc by-sa 3.0 with attribution required
import numpy as np
y_true = np.array([[0,1,0],
                   [0,1,1],
                   [1,0,1],
                   [0,0,1]])

y_pred = np.array([[0,1,1],
                   [0,1,1],
                   [0,1,0],
                   [0,0,0]])

def hamming_score(y_true, y_pred, normalize=True, sample_weight=None):
    '''
    Compute the Hamming score (a.k.a. label-based accuracy) for the multi-label case
    https://stackoverflow.com/q/32239577/395857
    '''
    acc_list = []
    for i in range(y_true.shape[0]):
        set_true = set( np.where(y_true[i])[0] )
        set_pred = set( np.where(y_pred[i])[0] )
        #print('\nset_true: {0}'.format(set_true))
        #print('set_pred: {0}'.format(set_pred))
        tmp_a = None
        if len(set_true) == 0 and len(set_pred) == 0:
            tmp_a = 1
        else:
            tmp_a = len(set_true.intersection(set_pred))/\
                    float( len(set_true.union(set_pred)) )
        #print('tmp_a: {0}'.format(tmp_a))
        acc_list.append(tmp_a)
    return np.mean(acc_list)

if __name__ == "__main__":
    print('Hamming score: {0}'.format(hamming_score(y_true, y_pred))) # 0.375 (= (0.5+1+0+0)/4)

    # For comparison sake:
    import sklearn.metrics

    # Subset accuracy
    # 0.25 (= 0+1+0+0 / 4) --> 1 if the prediction for one sample fully matches the gold. 0 otherwise.
    print('Subset accuracy: {0}'.format(sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)))

    # Hamming loss (smaller is better)
    # $$ \text{HammingLoss}(x_i, y_i) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{xor(x_i, y_i)}{|L|}, $$
    # where
    #  - \\(|D|\\) is the number of samples
    #  - \\(|L|\\) is the number of labels
    #  - \\(y_i\\) is the ground truth
    #  - \\(x_i\\) is the prediction.
    # 0.416666666667 (= (1+0+3+1) / (3*4) )
    print('Hamming loss: {0}'.format(sklearn.metrics.hamming_loss(y_true, y_pred)))
Outputs:
Hamming score: 0.375
Subset accuracy: 0.25
Hamming loss: 0.416666666667
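If you want to use the Hamming score during model selection, one option (a sketch, assuming the hamming_score function defined above and an estimator whose predict returns an indicator matrix) is to wrap it with sklearn.metrics.make_scorer:
from sklearn.metrics import make_scorer

# Wrap the hamming_score function above; greater_is_better=True since higher is better.
hamming_scorer = make_scorer(hamming_score, greater_is_better=True)

# It can then be passed to model-selection utilities, e.g.
# cross_val_score(clf, X, Y, scoring=hamming_scorer), where clf, X, and Y are
# your (hypothetical) estimator, feature matrix, and label indicator matrix.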
(1) Sorower, Mohammad S. "A literature survey on algorithms for multi-label learning." Oregon State University, Corvallis (2010).
(2) Tsoumakas, Grigorios, and Ioannis Katakis. "Multi-label classification: An overview." Dept. of Informatics, Aristotle University of Thessaloniki, Greece (2006).
(3) Ghamrawi, Nadia, and Andrew McCallum. "Collective multi-label classification." Proceedings of the 14th ACM international conference on Information and knowledge management. ACM, 2005.
(4) Godbole, Shantanu, and Sunita Sarawagi. "Discriminative methods for multi-labeled classification." Advances in Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2004. 22-30.