
I've been reading up on scoring rules, and on posts such as "Why is LogLoss preferred over other proper scoring rules?"

It's clear to me that if I have a set of biased coins, and after examining each one I predict that the probability of heads is $p_i$ for coin $i$, then after flipping each one I can estimate my performance by taking the average of $\log(p_{\text{outcome}})$ over many flips.

So if my $p_{\text{heads}}$ forecasts are:

0.3, 0.4, 0.7, 0.3 and I observe HHTH, then my score will be:

$ \dfrac{1}{4} \left[\log(0.3) + \log(0.4) + \log(0.3) + \log(0.3)\right] $

However what if my predictions were:

Beta(3,7), Beta(4,6), Beta(7,3), Beta(3,7)

What would my score be then?

evan54

1 Answer


Your setup is slightly different from a "standard" probabilistic prediction situation, because you have a probabilistic prediction for $p\sim\text{Beta}(\alpha,\beta)$, but you do not directly observe the quantity $p$ for which you made the prediction. Instead, you observe only the outcome of a Bernoulli experiment with parameter $p$.

So you have a case of a compound distribution, specifically, a Beta-Bernoulli one, which is a very simple case of a Beta-binomial distribution, with $n=1$ and $k\in\{0,1\}$. Fortunately, we can derive the predictive density of the coin toss in a Beta-Bernoulli directly from the predictive density of the parameter $p$. Namely, per the Wikipedia page, if $p$ denotes the probability of throwing heads,

$$ \begin{align*}P(\text{Heads}) &= {1\choose 1} \frac{B(1+\alpha,\beta)}{B(\alpha,\beta)} \\ &= \frac{\Gamma(\alpha+1)\Gamma(\beta)}{\Gamma(\alpha+\beta+1)}\cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \\ &= \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)}\cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+1)} \\ &= \frac{\alpha}{\alpha+\beta}, \end{align*}$$ which is of course just the expectation of the $\text{Beta}(\alpha,\beta)$ distribution. Analogously (or simply by subtracting from $1$), $$ P(\text{Tails})=\frac{\beta}{\alpha+\beta}. $$
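As a quick sanity check in R (a minimal sketch; the helper name `p_heads` is mine, not base R), the closed form agrees with numerically integrating $p$ against the Beta density:

p_heads <- function(a, b) a / (a + b)                  # predictive P(Heads) under Beta(a, b)
p_heads(3, 7)                                          # 0.3
integrate(function(p) p * dbeta(p, 3, 7), 0, 1)$value  # 0.3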

To calculate the log score, you now simply take the logarithm of this predictive probability mass for the correct outcome, and average over trials.

So if you have four trials, with predictive densities for $p$ of $\text{Beta}(3,7)$, $\text{Beta}(4,6)$, $\text{Beta}(7,3)$ and $\text{Beta}(3,7)$ and observed outcomes of H, H, T, H, then your score would be

$$\frac{1}{4}\big(\log(0.3)+\log(0.4)+\log(0.3)+\log(0.3)\big) \approx -1.13.$$
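In R, a minimal sketch with the forecasts and outcomes hard-coded to match the example (`phat` and `heads` are illustrative names):

phat  <- c(0.3, 0.4, 0.7, 0.3)  # predictive P(Heads), i.e. the Beta means
heads <- c(1, 1, 0, 1)          # observed H, H, T, H
mean(log(ifelse(heads == 1, phat, 1 - phat)))
# [1] -1.132052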

We see that this is precisely the same score as for point predictions of $p$ equal to the expectations of the corresponding Betas.

(Note that by using $\log$ and not $-\log$, you have "positively oriented scores", where larger scores are better. There is also the opposite convention, where smaller scores are better, which is more in keeping with scores as losses.)


EDIT: as you note in the comments, since this log loss only depends on the expectation of the Beta-Bernoulli compound, it does not differentiate between Beta predictive distributions on $p$ with the same expectation but different variances. For instance, we could have two density forecasts $p\sim\text{Beta}(3,7)$ and $p\sim\text{Beta}(30,70)$, which have the same expectation, but the second one is much more certain, and the log loss should really reflect this by penalizing the more confident forecast more heavily when it is wrong.

[Figure: densities of Beta(3,7) (left) and Beta(30,70) (right), produced by the following R code]

xx <- seq(0, 1, .01)
opar <- par(mfrow = c(1, 2), las = 1, mai = c(.5, .5, .5, .5))
# Both densities have mean 0.3, but Beta(30,70) is far more concentrated
plot(xx, dbeta(xx, 3, 7), type = "l", xlab = "", ylab = "", main = "Beta(3,7)")
plot(xx, dbeta(xx, 30, 70), type = "l", xlab = "", ylab = "", main = "Beta(30,70)")
par(opar)

However, this is not a problem of the log scoring rule, but one of our sampling scheme. After all, the problem is that the PMF of the resulting Beta-Bernoulli compound is the same for both Betas, $$P(\text{Heads})=\frac{\alpha}{\alpha+\beta}=\frac{10\alpha}{10\alpha+10\beta},$$ so it's unsurprising that the log loss cannot distinguish between the two resulting compounds based on the observable outcome of a single toss.

The solution is that we need to observe multiple coin tosses with the same $p$. Then instead of the degenerate Beta-Bernoulli for $n=1$, which is simply a Bernoulli again, we see a true Beta-binomial compound, with PMF

$$ P(X=k) = {n \choose k} \frac{B(k+\alpha,n-k+\beta)}{B(\alpha,\beta)}.$$
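This PMF can be coded directly from the formula. A minimal R sketch (the helper name `dbetabinom` is mine, not a base R function; packages such as VGAM or extraDistr ship equivalents):

# Beta-binomial PMF: P(k heads in n tosses) when p ~ Beta(a, b)
dbetabinom <- function(k, n, a, b) {
  choose(n, k) * beta(k + a, n - k + b) / beta(a, b)
}
dbetabinom(3, 10, 3, 7)    # 0.1857586
dbetabinom(3, 10, 30, 70)  # 0.254338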

And this compound distribution indeed distinguishes between two different predictive distributions on $p$ that only differ in the variance. For instance, in the case of $p\sim\text{Beta}(3,7)$ and $p\sim\text{Beta}(30,70)$ above, both forecasts give us an expected probability for heads of $0.3$, but actually observing $k=3$ heads out of $n=10$ trials is more likely under $p\sim\text{Beta}(30,70)$ than under $p\sim\text{Beta}(3,7)$ (the common factor ${10\choose 3}$ is omitted below, as it cancels in the comparison):

> beta(3+3,10-3+7)/beta(3,7)
[1] 0.001547988
> beta(3+30,10-3+70)/beta(30,70)
[1] 0.002119483

And this difference in PMFs then carries directly through to the log loss.
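Concretely, with the hand-rolled `dbetabinom` helper from above (the shared factor ${10\choose 3}$ shifts both log scores by the same constant, so it does not affect the comparison):

log(dbetabinom(3, 10, 3, 7))    # approx. -1.68
log(dbetabinom(3, 10, 30, 70))  # approx. -1.37

With the positive orientation noted earlier, the higher score for $\text{Beta}(30,70)$ rewards its (justified) higher confidence.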

Stephan Kolassa
  • Hi Stephan, thank you for the answer. One problem I have with this is that it doesn't take into account how confident one is in their prediction. So if someone says Beta(3,7) or Beta(30,70) they get the same score, even though the second is a lot more confident in their prediction. I would expect this to somehow influence their score. Is there a way to incorporate this in the score? – evan54 May 02 '21 at 12:37
  • @evan54: that is an extremely good point! I edited my answer. The answer is that a single trial will simply not give us enough information to distinguish between the two densities on $p$, so we have to run multiple ones. – Stephan Kolassa May 03 '21 at 06:13
  • Why the uncertainty should change the score? We should use the expectation as this is the "best" prediction we have for this point no? – ofer-a May 03 '21 at 06:57
  • @ofer-a: a more uncertain forecast of $p$ is more consistent with a wider spread in the outcomes in the coin toss.As an extreme case, suppose one forecast for $p$ is a Beta(3,7), and the other one is a degenerate distribution $p=0.3$. If we then run 100 coin tosses, and see exactly 30 heads, then the degenerate distribution is more compatible with that. And the scoring rule should reflect that. – Stephan Kolassa May 03 '21 at 07:03
  • @StephanKolassa yes, I understand the implication of uncertainty. But my question is: during the optimization process, what should be changed? I.e., which value should we put in the log loss for this point (isn't it the expected value of the beta distribution)? – ofer-a May 03 '21 at 08:13
  • @ofer-a: the log loss is a scoring rule, take a look at the [tag wiki](https://stats.stackexchange.com/tags/scoring-rules/info). It uses the log of the predictive density (or predictive PMF in the discrete case, as here). Which is why we here need the density of the *observed* outcome (the coin toss), which is the beta-binomial, not the density of the *unobservable* $p$. And of course a scoring rule like the log loss still allows us to discriminate between different predictive densities for $p$. – Stephan Kolassa May 03 '21 at 08:32
  • hm... I need to think about this more, but I think this still isn't satisfactory enough - meaning that if you have a forecaster who produces Beta(a1, b1) for event 1 and Beta(a2, b2) for event 2 etc., and events 1, 2, etc. are "unique", i.e. you can't repeat the experiment, then the only commonality is the forecaster. In that case would it make sense to start with a Beta(1, 1) for the forecaster, and the predicted outcome is something like P(forecaster) x P(forecast for event 1), and that way you get to observe multiple forecasts by the same forecaster and arrive at a P(forecaster) distribution? – evan54 May 03 '21 at 13:03
  • Well, it looks like if you can't repeat experiments and you have a beta-Bernoulli compound, then you indeed can't distinguish between two Betas on $p$ with the same expectation. It sounds intuitively correct enough to me. And a Beta(1,1) would be uniform, so the expectation would be $0.5$. I don't quite understand your second point, I have to admit. – Stephan Kolassa May 03 '21 at 14:56
  • My second point was whether formulating the problem as follows: probability used = p(forecaster is right) x p(forecast the forecaster provided) + p(forecaster is wrong) x (1 - p(forecast the forecaster provided)). In this case even though I can only flip each coin once the p(forecaster is right) is what I'm trying to get a good estimate for and that to me seems like trying to test a single coin through this noisy process... happy to move this to chat if you want but don't know how – evan54 May 03 '21 at 18:11
  • Hm. I don't quite see how this helps. After all, $p$ is continuous, so P(forecaster is right) is probably rather zero... And I don't really see the connection to scoring rules, probabilistic predictions and beta compounds... – Stephan Kolassa May 03 '21 at 18:52
  • sorry, not sure I follow what you mean by "probably rather zero". I think the relation to scoring rules is that a scoring rule is an assessment of how good a forecast is, and another way to say it is: what is the probability that the forecast is right? So intuitively I would expect some relation, but I could very well be wrong here... – evan54 May 03 '21 at 19:13
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/123752/discussion-between-evan54-and-stephan-kolassa). – evan54 May 03 '21 at 19:30