Is it wise to use multiple binary models as opposed to a multinomial model?

Question

Let's say that you have a classification problem where the dependent variable has MANY levels (say 20) and you CANNOT transform the target (i.e. no clustering, combining of levels, etc.). Is it a good idea to fit a binary model for each level (i.e. Y = 1 if specific level versus Y= 0 otherwise) as opposed to a model where you are trying to predict all of the level of Ys simultaneously?

I know that when you are using logistic regression, the multinomial approach is more efficient, but will this hold when employing machine learning algorithms? So far the best model was a boosted tree. Any thoughts on this?

For the choice of the predicted value, I was thinking that I would just predict whichever category corresponded to the highest probability.

is there a reason you can't do say... 1 vs {2,...,20}...then 2 vs {3,...,20}...then 3 vs {4,...,20}...and so on until the 19th model is 19 vs 20.... then you use your favourite 2-class algorithm for each step — probabilityislogic, Aug 04 '19 at 04:23

score 1 · Answer 1 · answered Feb 17 '16 at 18:21

The approach using binary classifiers against multiclass data instead of methods designed especially for such data is commonly used in machine learning and implemented in multiple software packages. Clear description of such procedures can be found in Python's scikit-learn library documentation:

One-Vs-The-Rest

This strategy, also known as one-vs-all, is implemented in OneVsRestClassifier. The strategy consists in fitting one classifier per class. For each classifier, the class is fitted against all the other classes.

(...)

One-Vs-One

OneVsOneClassifier constructs one classifier per pair of classes. At prediction time, the class which received the most votes is selected. In the event of a tie (among two classes with an equal number of votes), it selects the class with the highest aggregate classification confidence by summing over the pair-wise classification confidence levels computed by the underlying binary classifiers.

The approach that you mention

For the choice of the predicted value, I was thinking that I would just predict whichever category corresponded to the highest probability.

is a description of one-vs-rest classifier (see Wikipedia article on multiclass classification). You can find more detailed description of these approaches in most of the introductory machine learning handbooks.

The idea of working with the multinomial distribution is that the probabilities of the classes sum to one. I don’t see how this is guaranteed in the approaches described in this answer. — Dimitris Rizopoulos, Aug 04 '19 at 05:17
@DimitrisRizopoulos it is not guaranteed and is unlikely to happen. Those approaches are used only to make classification decision (highest score wins), the "probabilities" are meaningless. — Tim, Aug 04 '19 at 06:03
But aren’t the classification decisions based on the probabilities? How the decisions will be valid if the probabilities in the first place are not valid? — Dimitris Rizopoulos, Aug 04 '19 at 06:27
@DimitrisRizopoulos not every classifier returns probabilities, even if they do, often the probabilities are not [calibrated](http://scikit-learn.org/stable/modules/calibration.html) so they are meaningless as so and cannot be interpreted. Classifiers return scores, that tell you about ranking, but only in some cases they are valid probabilities. If you have further questions, feel free to ask them as separate questions. — Tim, Aug 04 '19 at 07:21

Is it wise to use multiple binary models as opposed to a multinomial model?

1 Answers1

One-Vs-The-Rest

One-Vs-One

Linked