After working on this text classification problem for a while, I realized that some documents belong to more than one class. I am using multinomial logistic regression, which provides a probability distribution over the classes (labels). I wonder whether it is a good idea to use this distribution for multi-label classification. For example, if the probabilities are [0.3, 0.6, 0.1] for classes A, B, and C respectively, I could assign the document every class whose probability exceeds a predefined threshold (say 0.25), which here gives A and B.
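To make the idea concrete, here is a minimal sketch of the thresholding step, using the example numbers above. The function name `labels_above_threshold` is hypothetical, and the probabilities are hard-coded rather than taken from a fitted model.

```python
def labels_above_threshold(probs, classes, threshold=0.25):
    """Return every class whose predicted probability exceeds the threshold.

    `probs` is assumed to be the per-class probability vector produced by
    the multinomial model for a single document.
    """
    return [c for c, p in zip(classes, probs) if p > threshold]


classes = ["A", "B", "C"]
probs = [0.3, 0.6, 0.1]  # example distribution from the question

selected = labels_above_threshold(probs, classes)
print(selected)
```

Because the probabilities sum to 1, a threshold of 0.25 can never select more than three classes for a document, no matter how many classes exist.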

Is this a good idea? I searched Google but couldn't find any document mentioning a similar method. How reliable is this method? What do you think?

To clarify my problem space: there are about 20 classes, and most documents belong to either one or two of them.

hrzafer

1 Answer

Multinomial logistic regression assumes that each outcome belongs to exactly one class, but you can redefine the classes. E.g., if there are two original classes A and B, you can label each document as belonging to one of three mutually exclusive classes:

I - document is A only

II - document is B only

III - document is both A and B.
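The relabeling above could be sketched as follows. This is only an illustration: the helper names `to_combined_class` and `from_combined_class` and the `+` separator are assumptions, not part of any library. Each distinct set of original labels becomes one class that a multinomial model can treat as mutually exclusive.

```python
def to_combined_class(label_set):
    """Map a set of original labels to one mutually exclusive class name."""
    # Sorting makes {"A", "B"} and {"B", "A"} map to the same class.
    return "+".join(sorted(label_set))


def from_combined_class(combined):
    """Recover the original label set from a combined class name."""
    return set(combined.split("+"))


# Three documents: A only, B only, both A and B (classes I, II, III above).
doc_labels = [{"A"}, {"B"}, {"A", "B"}]
combined = [to_combined_class(s) for s in doc_labels]
print(combined)
```

The combined names would serve as the training targets, and a predicted class such as `"A+B"` can be mapped back to its original labels with `from_combined_class`.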

Nik Tuzov
  • Yes, that is one of the approaches. But I have nearly 20 classes, which could lead to many combinations. – hrzafer Nov 10 '16 at 19:22
  • If you have no use for that many classes, then collapse a few classes into one. E.g. I, II and III can be collapsed into a single class, "A or B". – Nik Tuzov Nov 14 '16 at 20:31
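For a sense of scale in the exchange above, the sketch below counts the combined classes for 20 original classes and shows one way to collapse several of them into one. The `collapse_map` dictionary and the class names in it are hypothetical.

```python
from math import comb

n_classes = 20

# Combined classes if a document has exactly one or exactly two labels.
one_or_two = comb(n_classes, 1) + comb(n_classes, 2)

# Combined classes if any non-empty subset of labels is allowed.
all_nonempty = 2 ** n_classes - 1

print(one_or_two, all_nonempty)

# Collapsing: merge classes I, II and III into a single class "A or B".
collapse_map = {"A only": "A or B", "B only": "A or B", "A and B": "A or B"}


def collapse_label(label):
    """Return the merged class for a label, or the label itself if unmapped."""
    return collapse_map.get(label, label)
```

With documents limited to one or two labels, the count stays at 210 combined classes rather than over a million, and collapsing reduces it further at the cost of no longer distinguishing the merged cases.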