8

Consider the three scoring rules in the case of a binary prediction:

  1. Log: sum(log(ifelse(outcome, probability, 1-probability))) / n
  2. Brier: sum((outcome-probability)**2) / n
  3. Sphere: sum(ifelse(outcome, probability, 1-probability)/sqrt(probability**2+(1-probability)**2)) / n
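
For concreteness, here is a minimal runnable R sketch of the three scores — just the formulas above wrapped in functions, assuming `outcome` is a logical (or 0/1) vector and `probability` is the predicted P(outcome = 1):

```r
# Mean score per prediction; log and sphere are higher-is-better,
# Brier (squared error) is lower-is-better.
log_score <- function(outcome, probability) {
  mean(log(ifelse(outcome, probability, 1 - probability)))
}
brier_score <- function(outcome, probability) {
  mean((outcome - probability)^2)
}
sphere_score <- function(outcome, probability) {
  mean(ifelse(outcome, probability, 1 - probability) /
         sqrt(probability^2 + (1 - probability)^2))
}
```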

What is the intuition behind them? When should I use one and not the other? I am especially interested in the case of low prevalence (e.g., 0.1%).

PS. This is to evaluate the results from my calibration algorithm which I asked about before.

sds
  • possible duplicate of [Justifying and choosing a proper scoring rule](http://stats.stackexchange.com/questions/126965/justifying-and-choosing-a-proper-scoring-rule) – sds Apr 24 '15 at 20:31
  • 2
    do you think your own post is a duplicate? As I read the linked thread, it does not (currently) answer all the questions I understand in your Q here. I would not vote to close, as I would be interested in answers to your questions (+1 from before). But you can always delete your own Q, if you want. – gung - Reinstate Monica Apr 24 '15 at 21:07
  • 1
    @gung: I would love to see an answer too, but the referenced question and its answer is highly related and I wanted to point that out. I think a "possible dupe" is a good way, especially since you clearly indicated your disagreement (thank you!) and thus made the actual closing unlikely. :-) – sds Apr 24 '15 at 21:27
  • 1
    You can simply add a comment to your Q w/ a link saying that it is related or may also be of interest to readers. That would accomplish what you set out to do here. I would not flag your Q for closing as a duplicate. – gung - Reinstate Monica Apr 24 '15 at 23:17
  • Regarding only 1.: the intuition is that it is the log-likelihood function for a binary outcome $Y$, which we know has certain optimality properties when maximized to fit statistical models. – Frank Harrell Nov 15 '20 at 11:52

2 Answers

2

One place where log scoring may be inappropriate: the comparison of human forecasters (who may tend to overstate their confidence).

Log scoring strongly penalizes very overconfident wrong predictions: a wrong prediction made with 100% confidence receives an infinite penalty. For example, suppose a commentator says "I am 100% sure that Smith will win the election," and Smith then loses. Under log scoring, the average score of all the commentator's predictions is now permanently stuck at $-\infty$, the worst possible value. Yet somebody who has made a single wrong 100%-confidence prediction is surely a better forecaster than somebody who makes them all the time, and a scoring rule should be able to distinguish the two.
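
A quick numeric illustration with hypothetical forecasts, showing how a single $\log(0)$ term dominates the average:

```r
# Three reasonable forecasts plus one 100%-confident miss:
outcome     <- c(TRUE, TRUE, FALSE, FALSE)   # last one: Smith loses
probability <- c(0.9,  0.8,  0.1,   1.0)     # "100% sure Smith wins"
mean(log(ifelse(outcome, probability, 1 - probability)))
#> [1] -Inf   -- one log(0) term drags the average to -Inf forever
```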

fblundun
  • 1
    I would say that "over-penalizing over-confidence" is a _feature_, not a _bug_. – sds Dec 20 '20 at 18:44
0

Log

The expected surprisal of the prediction when we discover the actual value (up to sign: the surprisal of an event we assigned probability $p$ is $-\log p$, so a higher average log score means the outcomes surprised us less).
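
As a sanity check of this reading, the log score is proper: under a true probability $q$, its expectation is maximized by honestly forecasting $p = q$. A quick numerical check in R:

```r
# Expected log score of forecast p when the true probability is q = 0.3.
q <- 0.3
expected_log <- function(p) q * log(p) + (1 - q) * log(1 - p)
optimize(expected_log, c(0.001, 0.999), maximum = TRUE)$maximum
#> approximately 0.3 -- honesty is optimal
```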

Brier

$L^2$, RMSE, OLS: the mean squared distance between the forecast probability and the 0/1 outcome.

However, the fact that $p=2$ is the only value that turns the $L^p$ norm into a proper scoring rule detracts from this intuition.
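
To see why other exponents fail, take $p=1$: under a true probability $q = 0.3$, the expected absolute error is minimized at the boundary, not at $q$, so it rewards dishonest extreme forecasts:

```r
# Expected absolute error of forecast p when the true probability is q = 0.3.
q <- 0.3
expected_abs <- function(p) q * abs(1 - p) + (1 - q) * abs(0 - p)
optimize(expected_abs, c(0, 1))$minimum
#> close to 0 (the boundary), not 0.3 -- the L^1 "score" is not proper
```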

Sphere

The cosine of the angle between the prediction vector $(p, 1-p)$ and the outcome vector $(0,1)$ or $(1,0)$.

Note that the angle itself is not a proper scoring rule, which also detracts from the intuition.
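
A one-prediction check that the formula in the question really is this cosine, using outcome TRUE so that the outcome vector is $(1,0)$:

```r
p        <- 0.7
forecast <- c(p, 1 - p)
actual   <- c(1, 0)        # unit vector, so no |actual| term needed below
cosine   <- sum(forecast * actual) / sqrt(sum(forecast^2))
sphere   <- p / sqrt(p^2 + (1 - p)^2)
all.equal(cosine, sphere)
#> [1] TRUE
```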

sds
  • Although you answered your own question, your answer is incoherent as a response to your interesting question. For example, "Log: the expected surprisal of the prediction when we discover the actual value" is incoherent as an answer to the question, "What is the intuition behind them? When should I use one and not the other?" – Tripartio Nov 11 '20 at 07:00
  • @Tripartio: do you have a better answer? – sds Nov 11 '20 at 11:46
  • 1
    No I don't. I upvoted the question because I would really like to learn the answer. However, I posted a related question: https://stats.stackexchange.com/questions/495935/non-mathematical-explanation-of-how-to-interpret-and-evaluate-scoring-rules-in-r – Tripartio Nov 11 '20 at 13:03