
I'm using the scikit-learn package with RandomForestClassifier, trying to predict binary or multi-label classifications.

I'm looking for a way to estimate the reliability of the model but really can't figure out whether to use the Brier score or a log loss scorer.

I understand that both can estimate the reliability of the probabilities that the model outputs.

Can anyone clarify what the pros and cons of each method are, and why/when I should choose one over the other?

Guy Manzur

3 Answers


Since the log likelihood function (combined with the prior if Bayesian modeling is being used) is the gold standard optimality criterion, it is best to use the log likelihood (a linear translation of the logarithmic accuracy scoring rule). This automatically extends to ordinal and multinomial (polytomous) $Y$. There are only three reasons I can think of for not using the log likelihood in summarizing the model's predictive value:

  1. you seek to describe model performance using a measure the model was not optimizing (not a bad idea; often why we use the Brier score)
  2. you have a single predicted probability of one or zero that was "wrong", rendering an infinite value for the logarithmic score (see the sketch after this list)
  3. it's often hard to know "how good" a value of the index is (same for Brier score, not so much for $c$-index, i.e., concordance probability or AUROC)
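
To make point 2 concrete, here is a minimal sketch (my own illustration, with made-up numbers, not part of the original answer) of how a single predicted probability of exactly 0 for the true class makes the hand-computed logarithmic score infinite; scikit-learn's log_loss clips probabilities away from 0 and 1 (the exact behaviour is version-dependent), so it typically reports a large finite value instead:

```python
import numpy as np
from sklearn.metrics import log_loss

# True labels and predicted probabilities of the positive class.
# The last prediction is a hard 0 for an observation whose true label is 1.
y_true = np.array([0, 1, 1, 1])
p_hat = np.array([0.1, 0.8, 0.9, 0.0])

# Hand-computed logarithmic score: the single hard-zero probability on the
# true class drives the average to infinity (NumPy warns about log(0)).
manual = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
print(manual)  # inf

# scikit-learn's log_loss clips probabilities (details depend on the version),
# so it typically returns a large but finite value here rather than inf.
print(log_loss(y_true, p_hat))
```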
Frank Harrell

Either of these measures may be appropriate, depending on what you want to concentrate on.

The Brier score is basically the sum of squared errors of the classwise probability estimates. It will inform you as to both how accurate the model is and how "confidently" accurate the model is.

You would not want to use the Brier score for scoring an ordinal classification problem, where, for example, missing class 1 by predicting class 2 is better than predicting class 3: the Brier score weights all misses equally.
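
As a rough sketch of what that means in code (the multiclass_brier helper and the probabilities below are made up for illustration; scikit-learn's brier_score_loss is aimed at the binary case, at least in the versions I've used), the multiclass Brier score is just the squared error against a one-hot encoding of the labels, and two equally confident misses on different wrong classes receive the same score:

```python
import numpy as np

def multiclass_brier(y_true, proba):
    """Mean squared error between class-wise probabilities and one-hot labels.

    Illustrative helper, not a scikit-learn function.
    """
    onehot = np.eye(proba.shape[1])[y_true]
    return np.mean(np.sum((proba - onehot) ** 2, axis=1))

# True class is 0. One model puts its mass on the "adjacent" class 1,
# the other on the "distant" class 2.
y_true = np.array([0])
miss_near = np.array([[0.1, 0.8, 0.1]])
miss_far = np.array([[0.1, 0.1, 0.8]])

print(multiclass_brier(y_true, miss_near))  # 1.46
print(multiclass_brier(y_true, miss_far))   # 1.46 -- identical score for both misses
```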

Cross entropy (log loss) essentially measures how much uncertainty your model's predicted class probabilities leave about the true classes. Over the past decade or so, it has become one of the standard scoring statistics for multiclass (and binary) classification problems.
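
As a quick illustration (the probability tables are made up), scikit-learn's log_loss rewards sharp probabilities placed on the true class and penalises hedged ones:

```python
import numpy as np
from sklearn.metrics import log_loss

# Three-class toy example: labels and two sets of predicted probabilities.
y_true = [0, 1, 2]

confident = np.array([[0.90, 0.05, 0.05],
                      [0.05, 0.90, 0.05],
                      [0.05, 0.05, 0.90]])
hesitant = np.array([[0.50, 0.30, 0.20],
                     [0.30, 0.50, 0.20],
                     [0.20, 0.30, 0.50]])

# Lower is better: sharp, correct probabilities are rewarded,
# hedged ones are penalised.
print(log_loss(y_true, confident))  # ~0.11
print(log_loss(y_true, hesitant))   # ~0.69
```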

Thomas Cleberg

This paper seems to discuss the comparison a bit: http://faculty.engr.utexas.edu/bickel/Papers/QSL_Comparison.pdf

And I got it from this answer: Justifying and choosing a proper scoring rule

Lily Long