I am working on a multi-label classification problem. Each sample can take more than one label, and some samples don't have any labels associated with them.
In my dataset, 50% of the samples have 1 or more labels associated with them. The remaining 50% have no labels at all. I am sure that the future "test" samples will also include a population with no labels attached.
So far, I've been dropping the 50% of samples with no labels and training a multilabel classifier on the rest. Recently, I realized that this model will end up predicting labels for a sample even when none of the labels are appropriate for it. This leaves me with 2 options -
- Add a new label called "NONE", which is equal to 1 for samples with no labels and 0 for label-annotated samples.
- Simply train the multilabel classifier on all the standard labels. Let the model figure out on its own which combination of features qualifies for no labels at all.
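To make the two options concrete, here is a minimal sketch of how the targets would look under each one, assuming the labels are stored as a binary indicator matrix. The toy array `Y`, the `decode` helper, the example scores, and the 0.5 threshold are all illustrative assumptions, not taken from my actual setup.

```python
import numpy as np

# Hypothetical toy targets: 4 samples, 3 standard labels (binary indicator matrix).
# The second and fourth samples have no labels at all.
Y = np.array([
    [1, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
])

# Option 1: append an explicit "NONE" column that is 1 only for rows with no labels.
none_col = (Y.sum(axis=1) == 0).astype(int)
Y_with_none = np.column_stack([Y, none_col])

# Option 2: keep Y unchanged and train on every sample, treating all-zero rows
# as valid targets. At prediction time, "no label" is the case where no
# per-label score reaches the threshold.
def decode(scores, threshold=0.5):
    """Turn per-label scores of shape (n_samples, n_labels) into 0/1 predictions."""
    return (np.asarray(scores) >= threshold).astype(int)

# Hypothetical per-label probabilities produced by some classifier.
scores = np.array([
    [0.9, 0.2, 0.7],
    [0.1, 0.3, 0.2],
])
pred = decode(scores)
has_no_label = pred.sum(axis=1) == 0

print(Y_with_none)
print(pred)
print(has_no_label)
```

In this sketch, option 1 changes the target matrix, while option 2 changes only which rows are kept for training and how an all-zero prediction is interpreted.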
Am I thinking in the right direction? I'd also like to know your suggestions on this problem.