
Can a multi-class classification model be trained and used for multi-label classification, with any mathematical or theoretical guarantee?

Imagine the following model, actually used in one machine learning library for (text) classification:

  1. A multi-class classifier ― a softmax-terminated MLP fed from word embeddings, though it could be anything else ― is trained on multi-label data (i.e. some or most data items have multiple class designations in the training data).
  2. The per-item loss is computed against only a single target label, selected at random at each epoch from among the labels that apply to the item in the training data (exact loss function here; excuse the C++). This is just a small, speed-motivated variant of standard stochastic gradient descent... which should average out over epochs.
  3. For actual usage, the confidence threshold that maximizes the aggregate Jaccard index over the entire test set is then used to filter the labels returned as the network's (softmax-normalized) prediction output.
  4. For each prediction made by the model, only those labels whose confidence exceeds the threshold are kept and considered the final, actionable prediction. (A rough sketch of steps 2–4 follows below.)
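For concreteness, here is a minimal PyTorch sketch of steps 2–4. This is only an illustration of the description above, not the library's actual C++ implementation; `model`, `train_items`, and `test_items` are hypothetical placeholders, and "aggregate Jaccard" is read here as the mean per-item Jaccard index.

```python
import random
import torch
import torch.nn as nn

# Step 2: cross-entropy against a single target index, matching a softmax output.
ce_loss = nn.CrossEntropyLoss()

def train_epoch(model, optimizer, train_items):
    """One epoch: each item carries a *set* of true labels, but only one,
    drawn uniformly at random, contributes to the loss."""
    for features, label_set in train_items:
        target = torch.tensor([random.choice(sorted(label_set))])
        loss = ce_loss(model(features.unsqueeze(0)), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def jaccard(pred, true):
    """Jaccard index between a predicted and a true label set."""
    union = pred | true
    return len(pred & true) / len(union) if union else 1.0

def tune_threshold(model, test_items):
    """Step 3: pick the softmax threshold that maximizes the mean Jaccard
    index over the held-out set."""
    with torch.no_grad():
        scored = [(torch.softmax(model(x.unsqueeze(0)), dim=1).squeeze(0), y)
                  for x, y in test_items]
    best_t, best_score = 0.5, -1.0
    for t in (i / 100 for i in range(1, 100)):
        score = sum(jaccard({i for i, p in enumerate(pv) if p > t}, y)
                    for pv, y in scored) / len(scored)
        if score > best_score:
            best_t, best_score = t, score
    return best_t

def predict(model, x, threshold):
    """Step 4: keep every label whose softmax probability clears the threshold."""
    with torch.no_grad():
        probs = torch.softmax(model(x.unsqueeze(0)), dim=1).squeeze(0)
    return {i for i, p in enumerate(probs) if p > threshold}
```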

This may feel like a coercion of a multi-class model into a multi-label interpretation. Are there any theoretical guarantees, or counter-guarantees, for this being useful under multi-label semantics? Or, how would you reduce a multi-label problem to this?

matt
  • I mean that, for example, the results will not bear a meaning saying which single class is most likely to apply (as in multi-class classification), but that multi-label semantics will apply to the output. New data being from a different distribution is a standard issue with machine learning models and has little to do with this question. Thanks for the title edit though. – matt Apr 04 '18 at 07:20
  • Very interesting question, but very complicated to answer -- not that I would know an answer or where to find it. There are several helpful restrictions, but to me the abstraction of the determination threshold -- the decision rule whether to accept a label -- seems rather awkward to formalize. With only ANN in mind and considering that some of the "easy" proofs are already quite complicated, I'd be sceptical that you could prove any guarantee. I'm not familiar with text classification, but in my fields of application a different approach proved to be more successful: start with regression... – cherub Apr 06 '18 at 09:13
  • ... so, instead of using classification, one tries to model the CDF. A theoretical proof is not yet given, but from general considerations it is clear that the CDF holds all the information. Regarding the training, it has the unfortunate exponential scaling behaviour w.r.t. the number of dimensions. One remedy is to use different networks for each category, so you don't have to restrict the number of categories. Classification might then be done in a second step with any means that yields the best answer, evaluated a posteriori. – cherub Apr 06 '18 at 09:16
  • Sorry, I didn't fully grasp where this is going. Happy to learn more... – matt Apr 10 '18 at 10:51
  • What you described is in some aspects similar to the standard method: expensive, but it works very well. The theoretical judgement can be made by constructing a generative process and analyzing the likelihood. Other methods, such as the one suggested in Björn's answer, are faster but less accurate. – THN Aug 16 '19 at 02:04

1 Answer


A softmax output layer does not seem to make sense here: it coerces the class probabilities to sum to 1, which is inappropriate in a multi-label setting. Using a sigmoid instead would seem more logical, since it allows multiple classes to have high probability (e.g. close to 1) at the same time, as the toy numbers below illustrate.
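A small numeric illustration of the difference (made-up logits): when two classes are both strongly indicated, softmax forces them to split the probability mass, while per-class sigmoids let both sit near 1.

```python
import numpy as np

logits = np.array([3.0, 2.8, -4.0])  # two classes strongly indicated, one not

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1 / (1 + np.exp(-logits))

print(softmax)  # ~[0.55, 0.45, 0.0005] -- the two true classes split the mass
print(sigmoid)  # ~[0.95, 0.94, 0.018]  -- both can be near 1 independently
```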

Perhaps what is being done in steps 2, 3 and 4 is a consequence of having to compensate for using a softmax activation function in the output layer? At least the bit in step 2 ensures that the targets match up with a softmax activation. I am not so sure that this evens out over epochs: each category will occur less frequently in the effective training data than it really does, and you throw away all information on which categories tend to occur together (unless you have so little data that there is a concern about overfitting to that?!). Additionally, I assume you would get better performance (not speed-wise, but from the prediction perspective) if you did use the multiple labels at each step. I find it hard to believe that this really costs that much in speed, and I would expect using all the labels at all times to improve your predictions.
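To make the frequency point concrete, a toy count (hypothetical label sets, not from the question): under random single-label selection, an item with k true labels contributes each of them only 1/k of the time in expectation, so labels that mostly co-occur with others become under-represented.

```python
from collections import Counter

# Hypothetical training set: each item's set of true labels.
items = [{"a", "b"}, {"a", "b"}, {"a"}, {"c"}]

full = Counter(l for s in items for l in s)  # label counts in the real data
expected = Counter()                          # expected counts under random selection
for s in items:
    for l in s:
        expected[l] += 1 / len(s)             # each label drawn with prob. 1/|s|

print(full)      # Counter({'a': 3, 'b': 2, 'c': 1})
print(expected)  # Counter({'a': 2.0, 'b': 1.0, 'c': 1.0})
```

Here 'b' ends up as frequent as 'c' even though it is twice as common in the real data, and the information that 'a' and 'b' co-occur is lost entirely.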

In short, I see some reasons why what you describe could go wrong, and assume (without really knowing) that some of the contortions in the approach try to compensate for these. I do not know enough and have not tried this, so I cannot say how successful it would be. I conjecture that the correlation between the multiple labels is something this approach would not capture.

Personally, I would be tempted to just do this as a proper multi-label prediction with a final dense layer with a number of units equal to the number of classes (i.e. targets encoded as e.g. 1 0 0 0 1 0 0 1 0 0 ... if an item falls into the 1st, 4th and 8th class) and a sigmoid activation, using e.g. binary cross-entropy as the loss function. I believe this is the standard approach normally recommended for this situation; a minimal sketch follows.
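For illustration, a minimal sketch of that standard setup in PyTorch (the answer names no framework, and the layer sizes here are assumed):

```python
import torch
import torch.nn as nn

n_features, n_hidden, n_classes = 300, 128, 10  # assumed dimensions

model = nn.Sequential(
    nn.Linear(n_features, n_hidden),
    nn.ReLU(),
    nn.Linear(n_hidden, n_classes),  # one unit per class, no softmax
)

# BCEWithLogitsLoss applies a per-class sigmoid with binary cross-entropy,
# so each class is an independent yes/no decision and several can be "on".
loss_fn = nn.BCEWithLogitsLoss()

x = torch.randn(4, n_features)  # a dummy batch of 4 items
y = torch.zeros(4, n_classes)
y[0, [0, 3, 7]] = 1.0           # item 0 falls into the 1st, 4th and 8th class
loss = loss_fn(model(x), y)
loss.backward()
```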

Björn
  • Many thanks for the great deliberation! Something small ― you mention "in the created training data", but the random choice is made during training, so I think that could be rephrased or put differently... and I think that in expectation it will be equivalent, hence the justification for that step ― which I agree is mostly an undesirable tradeoff of speed against accuracy/performance ― is nonetheless rationalized. – matt Apr 10 '18 at 10:52
  • Additionally, to clarify the answer: the final layer is already fully connected to the hidden layer and has a number of units equal to the number of classes... or are you suggesting an additional layer, or am I misreading this? – matt Apr 10 '18 at 11:02
  • My point was that the true training data seems to have multiple labels per item, but the approach would not preserve that and instead generate training data that has one class present. This throws away all information on what classes occur together, even if some other adjustments somehow manage to get their overall frequencies right. – Björn Apr 10 '18 at 16:47
  • The standard approach would not differ in terms of what is fully connected, but rather in what training data is used (i.e. with multi-labelled instances) and what the activation function for the final layer is. – Björn Apr 10 '18 at 16:49
  • I should apologize: a single label is selected at random per datum _per epoch_ during the training. I've only now reflected that more explicitly in the question. In case this changes your perspective, feel free to update the answer; we'd probably remove our comments here to clean up anyway, I guess... – matt Apr 10 '18 at 17:43
  • Hi @Björn, thanks for your answer. Could you share some thoughts on this: what if I had multi-class training data (one sample, one label) but multiple possible labels overall, trained a multi-class model with softmax as we do, but then suggested the top 5 categories (labels) for a new test instance based on rank-ordering the class probabilities... is this correct? The data had only one label per instance, but we are suggesting multiple labels as the top 5 predicted labels... is this theoretically correct, or do the other 4 labels (beyond the correct one) make sense here? – Baktaawar Oct 11 '19 at 23:31
  • @Baktaawar you mean you have N possible labels, but each example can have only one at a time? In that case, yes, it makes sense to show the top matches (e.g. 5, or anything with predicted probability >0.1 or so). Of course, only one can apply, but often a model will not be completely 'sure' (and can be wrong even if it is). – Björn Oct 12 '19 at 09:14
  • No. In reality a ticket can have multiple labels. But the data which is used for training is only multi-class - one instance with one label. In that case it doesn't make sense to show Top N beyond top 1 as the model never really learns the other labels since those were never in the data – Baktaawar Oct 13 '19 at 00:02
  • @Baktaawar when your training data does not match your actual task (or at least has a different distribution), then you likely have a serious problem getting decent performance in the real world no matter what you do. Asking the model to perform the wrong task won't help, though. Allowing for multiple labels in training (and then in production) and showing multiple outputs is likely to produce better results than hard-coding into the model the wrong information that there can only ever be one label at a time. – Björn Oct 13 '19 at 05:37
  • Let's for example assume the training data is multi-class and is fine. Even then we can't comment on the top N if the data just has one label mapped to one instance... Beyond the top 1, the remaining N−1 would be pure random guesses and not something the model learnt. – Baktaawar Oct 14 '19 at 00:52
  • @Baktaawar by the very definition of what you are saying, the training data are not really fine. However, it is entirely possible to predict multiple classes based on such training data. Think of it as separate logistic regressions for each class; of course you can get high predicted probabilities for multiple classes, and they can make sense. – Björn Oct 14 '19 at 05:33
  • I don't think that is correct. If you have correct multi-class data, even then it's not right to make top-N predictions, as the data has not seen any other labels beyond the one it is mapped to. Also, you end up using softmax for multi-class, and that does not mean probabilities. – Baktaawar Oct 14 '19 at 21:18
  • @Baktaawar if you wrongly use softmax, things are of course messed up. – Björn Oct 15 '19 at 03:20
  • We are saying that if the data does not have any single training instance mapped to multiple labels, then doing top N doesn't make sense. Also, the softmax used here when showing the top N is not right either. But the main issue is the data not having multiple labels; then it's not correct to make top-N predictions, as the model does not see any of multiple labels mapped to each instance. – Baktaawar Oct 15 '19 at 16:20
  • @Baktaawar I think you have that backwards, but I guess I am not going to convince you of that. I would suggest trying what happens when you at least don't use a fundamentally wrong model, but it's up to you whether you want to try it. – Björn Oct 15 '19 at 16:37