How to compute the AUROC for a single categorical variable

Question

I am building new features for a binary classifier. The new features fall into two groups: categorical and ordinal. An example of the first would be the colours red, blue, green; an example of the second would be integer counts 1, 2, 3, ...

For the ordinal variables I can get a rough idea of how good each feature is by computing the area under the ROC curve (AUROC). If the AUROC is close to 1, it means that there is a good threshold value for the new feature such that it separates the positive class from the negative class well.

I would like to have a similar measure for the categorical features. For example, I know the rate of 1's in each category. However, it is hard to compare this rate across many category levels. I would be keen to hear your suggestions on what to do.

One thought I had was to fit a logistic regression with the categorical variable as the only predictor, and then calculate the AUROC for the predicted probabilities under this regression.

geekoverdose · Answer 1 · 2016-06-10T19:31:19.003

In short: yes, you could use a simple model to compute the AUC (AUROC) for categorical features too.

When you compute the AUC for an ordinal feature, you use the feature itself as you would use the output of a classification model and apply a threshold to it (one class lies below the threshold and the other above). Note that the complexity is determined by the model, which in this case does not exist: using a threshold on an ordinal feature boils down to a linear separation that divides the feature into two parts. If you used a more complex model instead (e.g. a tree), you could easily obtain multiple parts too. For a categorical feature, doing so might make sense. This essentially just answers the question "how likely is class 1 if my feature is blue?", for which you could employ many model types (small trees, etc.). Note that you can of course overfit this too, so using models with little complexity might be reasonable (like the linear separation for the ordinal feature).
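One simple way to answer "how likely is class 1 if my feature is blue?" is to score every row with the observed rate of 1's in its category and feed those scores to an AUROC routine. The following is only a minimal sketch of that idea, assuming pandas and scikit-learn are available; the function name `categorical_auroc` and the toy data are made up for illustration. The rates are estimated on the same rows they are evaluated on, so the result is optimistic when there are many rare levels.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score


def categorical_auroc(categories, y):
    """AUROC obtained by scoring each row with its category's rate of 1's.

    Hypothetical helper for illustration; rates are estimated in-sample.
    """
    df = pd.DataFrame({"cat": categories, "y": y})
    rate = df.groupby("cat")["y"].mean()  # estimated P(y = 1 | category)
    scores = df["cat"].map(rate)          # one score per row
    return roc_auc_score(df["y"], scores)


# Toy, made-up data: three colours with different rates of 1's.
colour = ["red"] * 4 + ["blue"] * 4 + ["green"] * 4
label = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0]

auc = categorical_auroc(colour, label)
print(round(auc, 3))
```

Because the score takes only as many distinct values as there are categories, the ROC curve has at most that many steps, and rows within the same category count as ties.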

PS: you might need to one-hot encode your categorical variable for some models (those that cannot handle categories directly), e.g. if you want to use it in logistic regression. This makes the problem $N$-dimensional instead, with $N$ being the number of categories of your variable (though most implementations do this automatically).
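As a rough sketch of the logistic regression route, the snippet below one-hot encodes a single categorical column, fits a logistic regression on it, and computes the AUROC from out-of-fold predicted probabilities to limit the overfitting mentioned above. It assumes scikit-learn; the simulated data and the per-colour rates in `p_one` are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Simulated data (made-up rates of 1's per colour).
rng = np.random.default_rng(0)
p_one = {"red": 0.8, "blue": 0.2, "green": 0.5}
colour = rng.choice(np.array(list(p_one)), size=600)
y = np.array([rng.random() < p_one[c] for c in colour]).astype(int)

# One column of categories -> N one-hot columns -> logistic regression.
X = colour.reshape(-1, 1)
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LogisticRegression())

# Out-of-fold probabilities of class 1, so each row is scored by a model
# that did not see it during fitting.
proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
auc_cv = roc_auc_score(y, proba)
print(round(auc_cv, 3))
```

With only one categorical predictor, the fitted probabilities are close to the per-category rates of 1's (up to regularisation), so this should give roughly the same ranking of categories as scoring by those rates directly.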

Use correct terminology. A classification model has a 0-1 output. You seem to be addressing prediction models, e.g., probability models. With a probability model no classification is needed, or it can be deferred to occur much later than the analysis once utilities are defined. — Frank Harrell, May 27 '18 at 11:46
