Can a multi-class classification model be trained and used for multi-label classification, with any mathematical or theoretical guarantee?
Imagine the following model, actually used in one machine learning library for (text) classification:
- A multi-class classifier (a softmax-terminated MLP fed from word embeddings, though it could be anything else) is trained on multi-label data, i.e. some or most data items carry multiple class labels in the training data.
- The loss per training item is computed against only a single target label, selected at random at each epoch from among the labels applying to that item in the training data (exact loss function here, excuse the C++). This is just a small speed-motivated variant of standard stochastic gradient descent, which should average out over epochs (see the training-loop sketch after this list).
- For actual usage, a threshold on the network's (softmax-normalized) output confidences is chosen so as to maximize the aggregate Jaccard index over the entire test set, and is then used to filter the labels returned as the prediction output (see the thresholding sketch after this list).
- For each prediction made by the model, only the labels whose confidence exceeds that threshold are kept and treated as the final, actionable prediction.
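To make the training scheme concrete, here is a minimal sketch of what I understand it to do, written in PyTorch rather than the library's actual C++. The model class, layer sizes, and data layout below are my own illustrative assumptions, not the library's API: a plain softmax/cross-entropy classifier where each epoch re-draws one target label uniformly from the item's full label set.

```python
import random
import torch
import torch.nn as nn

# Illustrative multi-class classifier: averaged word embeddings into a linear
# layer; the softmax is implicit in CrossEntropyLoss.
class SoftmaxTextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim)   # default mode='mean'
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids, offsets):
        return self.fc(self.embed(token_ids, offsets))        # raw logits

model = SoftmaxTextClassifier(vocab_size=50_000, embed_dim=100, num_classes=20)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()                               # softmax + NLL over a single target

def train_epoch(dataset):
    # dataset yields (token_ids, offsets, label_set) per item;
    # label_set holds *all* labels assigned to the item in the training data.
    for token_ids, offsets, label_set in dataset:
        target = torch.tensor([random.choice(label_set)])     # one label, re-drawn every epoch
        loss = loss_fn(model(token_ids, offsets), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```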
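And for the inference side, a sketch of what I mean by threshold tuning: sweep candidate thresholds and keep the one that maximizes the mean per-item Jaccard index between the thresholded prediction set and the true label set. I'm reading "aggregate" as a mean over items (a micro-averaged variant would also be possible), and the function names here are mine, not the library's.

```python
import numpy as np

def jaccard(pred_labels, true_labels):
    # Jaccard index between predicted and true label sets for a single item
    if not pred_labels and not true_labels:
        return 1.0
    return len(pred_labels & true_labels) / len(pred_labels | true_labels)

def pick_threshold(probs, label_sets, candidates=np.linspace(0.01, 0.99, 99)):
    # probs: (n_items, n_classes) softmax outputs; label_sets: list of true label sets
    best_t, best_score = 0.5, -1.0
    for t in candidates:
        score = np.mean([
            jaccard({c for c, p in enumerate(row) if p > t}, truth)
            for row, truth in zip(probs, label_sets)
        ])
        if score > best_score:
            best_t, best_score = t, score
    return best_t

def predict_labels(prob_row, threshold):
    # Keep only the labels whose softmax confidence exceeds the tuned threshold
    return {c for c, p in enumerate(prob_row) if p > threshold}
```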
This may feel like coercing a multi-class model into a multi-label interpretation. Are there any theoretical guarantees, or counter-guarantees, for this being useful under multi-label semantics? Or, how would you reduce a multi-label problem to this?