
AUC is a popular classification evaluation metric.

  1. This is a measure of aggregate performance—do any of the standard loss functions (functions of an individual example's label & prediction) optimize for this?

  2. Are there any other ML approaches (not necessarily loss-function-based optimization algorithms) that are designed to optimize AUC?

Yang
    Probably not what you're interested in, but AUC is very often optimized during hyperparameter search. Furthermore, the objective function of SVM classifiers is closely related to optimizing AUC, but I don't have a reference at hand currently. – Marc Claesen Mar 31 '15 at 08:19
  • What are you referring to, AUC of ROC or AUC of a TAC? – Carl Mar 06 '18 at 18:27
  • See https://stats.stackexchange.com/questions/235089/optimizing-auc-vs-logloss-in-binary-classification-problems, https://stats.stackexchange.com/questions/193138/roc-curve-drawbacks/193141#193141 – kjetil b halvorsen Nov 13 '19 at 00:45
  • I found this paper, [Scalable Learning of Non-Decomposable Objectives](https://arxiv.org/pdf/1608.04802.pdf), to be helpful when searching on this topic. It covers loss functions to optimize not only AUC but also other related concepts such as precision at fixed recall. Relevant code by the authors [here](https://github.com/tensorflow/models/tree/master/research/global_objectives). – J Trana Jan 17 '20 at 22:35

1 Answer

There are actually multiple ways to do this.

Remember that the AUC is a normalized form of the Mann-Whitney U statistic, which is computed from the rank sum of one of the classes. This means that maximizing AUC is the problem of ordering all scores $s_1,\ldots,s_N$ so that the scores of one class rank above those of the other.
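To make the rank-sum connection concrete, here is a minimal sketch (assuming NumPy, SciPy and scikit-learn; the data are just illustrative) that computes the AUC directly from the Mann-Whitney U statistic and checks it against `roc_auc_score`:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)            # binary labels
s = y + rng.normal(scale=1.5, size=y.size)   # noisy scores

# Mann-Whitney U from the rank sum of the positive class
ranks = rankdata(s)                          # average ranks handle ties
n_pos, n_neg = (y == 1).sum(), (y == 0).sum()
U = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2

auc_rank = U / (n_pos * n_neg)               # normalized U = AUC
auc_sklearn = roc_auc_score(y, s)

print(auc_rank, auc_sklearn)                 # should agree up to floating-point error
```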

This can be framed, for example, as a linear-programming problem that is infeasible to solve exactly at scale but can be attacked heuristically with appropriate relaxations. The method that interests me more, though, is to find approximate gradients of the AUC so that we can optimize with stochastic gradient descent.

There's plenty to read about this; here is a naive approach:

Using '$[\cdot]$' as the Iverson bracket, another way to state the sought ordering of the scores is that

$[s_i\leq s_j]=1$ for all pairs $i,j$ where the responses are $y_i=0$ and $y_j=1$.

So if the scores are a function of the inputs and parameters, $s_i = f(x_i,\theta)$, we want to maximize $$M^*=\max_\theta \sum_{i:\,y_i=0}\ \sum_{j:\,y_j=1}[s_i\leq s_j]$$

Consider the relaxation $\tanh(\alpha(s_j-s_i)) \leq [s_i\leq s_j]$, which holds for any $\alpha>0$ because $\tanh$ is bounded above by $1$ and is negative whenever $s_j<s_i$.

So, for any $\theta$, $$M^*\geq \sum_{i:\,y_i=0}\ \sum_{j:\,y_j=1}\tanh(\alpha(s_j-s_i)),$$ and this lower bound is differentiable in $\theta$.

We can then sample $i$ from the negative class and $j$ from the positive class to get stochastic contributions $\nabla_\theta \tanh(\alpha(s_j-s_i))$ to the full gradient.
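A minimal sketch of this idea, assuming PyTorch and a simple linear scorer $f(x,\theta)=\theta^\top x$ (the data, batch size and learning rate below are purely illustrative), could look like:

```python
import torch

torch.manual_seed(0)

# toy data: two classes with shifted means
n, d = 500, 10
x_pos = torch.randn(n, d) + 0.5   # class y = 1
x_neg = torch.randn(n, d) - 0.5   # class y = 0

theta = torch.zeros(d, requires_grad=True)
alpha = 2.0                        # sharpness of the tanh relaxation
opt = torch.optim.SGD([theta], lr=0.1)

for step in range(200):
    # sample i from the negative class and j from the positive class
    i = torch.randint(n, (64,))
    j = torch.randint(n, (64,))
    s_neg = x_neg[i] @ theta       # s_i, y_i = 0
    s_pos = x_pos[j] @ theta       # s_j, y_j = 1

    # maximize the tanh lower bound on the pairwise ordering count
    loss = -torch.tanh(alpha * (s_pos - s_neg)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    scores = torch.cat([x_neg @ theta, x_pos @ theta])
    labels = torch.cat([torch.zeros(n), torch.ones(n)])
    # empirical AUC: fraction of correctly ordered (negative, positive) pairs
    auc = (scores[labels == 1].unsqueeze(1) > scores[labels == 0]).float().mean()
    print(float(auc))
```

Note that larger $\alpha$ makes the surrogate closer to the step function but also flattens the gradient away from the decision region, so in practice $\alpha$ acts as a tuning knob.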

ragulpr