
I have a question about choosing the right scoring rule. I am building a system that predicts the spatial (2D) probability of an event. The label data contains continuous values between 0 and 1, indicating the probability of the event for each pixel. I constructed my NN and applied a sigmoid at the end to squash the outputs into $[0, 1]$. I started with the Brier score (MSE) as the loss function, but after pondering it a bit, it doesn't seem like a 'fair' scoring system in my case.

Take a scenario with two pixels: one has a correct label of 0.5 and the other of 1. If we predict 0 for the first pixel and 0.5 for the second, both predictions receive the same Brier score, $(0.5 - 0)^2 = (1 - 0.5)^2 = 0.25$. However, I feel that the first case is actually worse than the second. The second prediction is far from correct, but it sits halfway between wrong and correct; it could have been much worse (it merely matches the average error an untrained model would make). In the first case we made the worst prediction we possibly could (much worse than what an untrained model would predict on average). Intuitively, the worst possible prediction for the first pixel should receive a worse score than the halfway prediction for the second pixel.
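For concreteness, here is a minimal NumPy sketch of the tie (the two arrays simply hold the two example pixels above):

```python
import numpy as np

labels = np.array([0.5, 1.0])  # true per-pixel probabilities
preds  = np.array([0.0, 0.5])  # worst-possible vs. "halfway" prediction

squared_errors = (labels - preds) ** 2
print(squared_errors)          # [0.25 0.25] -- MSE/Brier cannot tell the two cases apart
```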

The second function I was considering was log-loss. However, since we're working with continuous target values, taking $P(x)$ seems like a bad idea, as every $P(x)$ will be very small.

I have searched the relevant topics for existing posts but haven't found a satisfying answer (or I didn't understand it).

What would be a 'good' scoring rule in this case?

  • I suspect you are misunderstanding the situation. You are discussing *single point predictions*, and your error measure is actually the squared error. Scoring rules assess *probabilistic* predictions. So the input to the Brier or log score would be a probabilistic prediction, i.e., a predictive *density* for each pixel. – Stephan Kolassa Jul 29 '21 at 07:39
  • The two concepts coincide if you have a degenerate probabilistic prediction that indeed assigns a probability of $1$ to a particular outcome, but that would mean that you just witnessed two zero-probability events (and indeed, the log loss would then be infinite - [some people consider that a feature, others a bug](https://stats.stackexchange.com/q/274088/1352)), so your more urgent problem is your "probabilistic" (indeed, deterministic!) model, not the choice of a scoring function. Does this help in any way? – Stephan Kolassa Jul 29 '21 at 07:41
  • @StephanKolassa Thank you very much for your help. Your answer (and your answer in the link you posted) definitely helped me move forward. I understand now that I was talking about error functions and not scoring rules. However, I don't understand what you imply with your last sentence '_your more urgent problem is your "probabilistic" (indeed, deterministic!) model, not the choice of a scoring function_'. I feel like I am still at the same spot, only now looking for a good error function instead of a scoring rule. Would it be possible to elaborate a bit more? – Seppe Lampe Jul 29 '21 at 09:36
  • If you are really dealing with *point* predictions and looking for error functions, rather than scoring rules, then that comment of mine is not really pertinent. (But here goes: if your probabilistic prediction is "$0$ with probability $1$", then your prediction is that any other outcome has probability zero. If you then observe $0.5$, then this is an impossible event by your prediction. Like predicting your arrival time by car, only halfway there, your car spontaneously turns into a bowl of petunias. If you observe impossible events, your probability model is broken.) – Stephan Kolassa Jul 29 '21 at 20:55

1 Answer


We have established in the comments that you are not really looking for scoring rules to assess predictive densities, but for error measures to assess point predictions.

One tool that forecasters use very often, and that would probably be helpful to you, is the Mean Squared Error (MSE) you are already using - but applied in a relative way, against a benchmark. That is, you check whether your prediction improves on that benchmark.

For instance, if your outcomes are always constrained to lie between 0 and 1, then one very simple benchmark prediction would be a flat 0.5 across all pixels. First, calculate the MSE of your predictions, as you have been doing (squared errors per pixel, averaged across pixels). Second, calculate the MSE of this flat benchmark prediction. Third, divide your prediction MSE by the benchmark MSE. If this ratio comes out greater than one, you did worse than the very simple benchmark.
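A minimal sketch of these three steps, assuming NumPy arrays of per-pixel values (the function name and numbers are purely illustrative):

```python
import numpy as np

def relative_mse(preds, labels, benchmark):
    """Ratio of the model's MSE to the benchmark's MSE; < 1 means the model beats the benchmark."""
    model_mse = np.mean((preds - labels) ** 2)
    benchmark_mse = np.mean((benchmark - labels) ** 2)
    return model_mse / benchmark_mse

labels = np.array([0.8, 0.2, 0.6])        # true per-pixel probabilities (illustrative)
preds  = np.array([0.3, 0.1, 0.7])        # model output after the sigmoid
flat   = np.full_like(labels, 0.5)        # "know-nothing" benchmark: 0.5 everywhere

print(relative_mse(preds, labels, flat))  # > 1 would mean the flat benchmark did better
```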

Note how this approach indeed penalizes cases where your prediction is on the wrong side of 0.5, compared to the actual: if the actual is 0.8 and you predicted 0.3, then the flat prediction of 0.5 across all pixels would indeed have been better.

Another, possibly even better, benchmark would be the average across all pixels in your training data. Maybe your pixels are all between 0 and 1, but the average is 0.4, not 0.5. In that case, a "naive" prediction would be 0.4 rather than 0.5. (Forecasters like to use exactly this kind of "naive" forecast, the historical mean, as a sanity check. If your extremely sophisticated method cannot even beat this simple benchmark, you really don't have anything to be proud of. You would be surprised how often this happens.)
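The calculation is the same with this benchmark; the only change is how the benchmark array is built. A self-contained sketch (again with purely illustrative numbers):

```python
import numpy as np

train_labels = np.array([0.1, 0.4, 0.7, 0.4])  # per-pixel values from the training data
test_labels  = np.array([0.8, 0.2, 0.6])
test_preds   = np.array([0.3, 0.1, 0.7])

# "Naive" benchmark: predict the historical (training-set) mean everywhere.
naive = np.full_like(test_labels, train_labels.mean())

ratio = np.mean((test_preds - test_labels) ** 2) / np.mean((naive - test_labels) ** 2)
print(ratio)  # > 1 means the historical-mean benchmark beat the model
```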

Stephan Kolassa
  • This is along the lines of how $R^2$ functions in linear regression. – Dave Jul 29 '21 at 21:14
  • Yes, this is the answer I was looking for! Benchmarking like this indeed seems like the appropriate way to obtain an 'interpretable' measure of performance. My apologies for the confusion with the scoring rule; I mistakenly assumed it was synonymous with error function. Thank you for your help. – Seppe Lampe Jul 29 '21 at 21:59
  • No apologies necessary, comments are precisely to ask for clarification. Do look at the [scoring-rules tag wiki](https://stats.stackexchange.com/tags/scoring-rules/info) some day, probabilistic predictions are really often much better than simple point predictions. Also, [this might be interesting](https://doi.org/10.1016/j.ijforecast.2019.02.017). – Stephan Kolassa Jul 29 '21 at 22:06