Averaging Brier score

Question

I am tuning a RandomForestClassifier with GridSearchCV for a multiclass classification problem, and I decided to use the Brier score as the scoring metric.

However, I could only manage to compute the Brier score for each class separately.

Is it reasonable to average the per-class scores into an overall performance measure? Or is there a better way?

Edit: I am aware this question is similar, so I'll explain why I think it's a different problem:

When I run my model with the Brier score as defined by that question's author (`brier_multi`), the score obtained for the best model is 202.3.

However, when I apply the following code (written by me):

import numpy as np
from sklearn.metrics import brier_score_loss
from sklearn.preprocessing import label_binarize

def brier_score_multi(y_true, y_pred):
    # One-hot encode the true and the predicted class labels (classes 0, 1, 2).
    y_true_bin = label_binarize(y_true, classes=[0, 1, 2])
    y_pred_bin = label_binarize(y_pred, classes=[0, 1, 2])
    # Average the per-class (one-vs-rest) Brier scores.
    return np.mean([brier_score_loss(y_true_bin[:, k], y_pred_bin[:, k]) for k in range(3)])

The best score is 0.0432.

As you can see, this is a big difference, and given the definition of the Brier score, I'm inclined to trust the second result.
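
One property of the function above is worth spelling out. If `y_pred` holds hard class labels (an assumption here; it is what a scorer receives when it is built on `predict` rather than `predict_proba`), every binarized forecast is 0 or 1, so each squared error is 0 or 1. The sketch below uses plain Python and made-up labels to show the consequence; `one_vs_rest_brier` is a hypothetical helper, not part of scikit-learn.

```python
def one_vs_rest_brier(y_true, y_forecast, k):
    """Mean squared difference between the class-k indicator and the class-k forecast."""
    return sum(
        (f - (1.0 if t == k else 0.0)) ** 2 for t, f in zip(y_true, y_forecast)
    ) / len(y_true)

# Made-up labels for illustration only.
y_true = [0, 1, 2, 2, 1, 0, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]   # hard labels containing two mistakes

classes = [0, 1, 2]
per_class = []
for k in classes:
    # Binarized hard labels: 1.0 where class k was predicted, else 0.0.
    hard_forecast = [1.0 if p == k else 0.0 for p in y_pred]
    per_class.append(one_vs_rest_brier(y_true, hard_forecast, k))

averaged = sum(per_class) / len(classes)
error_rate = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Each misclassified sample flips exactly two class indicators
# (the true class and the predicted class).
print(averaged, 2 * error_rate / len(classes))
```

Under that assumption the averaged score equals the misclassification rate multiplied by 2 and divided by the number of classes, so it ranks models the same way accuracy does and does not use the predicted probabilities at all.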

EDIT 2:

Seeing as the first result is incorrect, I started thinking... maybe the sum of the per-class Brier scores makes more sense than their average?
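
For reference, here is a minimal sketch of how the sum and the average relate when the forecasts are probabilities (assumed here to be rows of `predict_proba` output; the numbers are made up). It assumes the usual multiclass definition, which sums the squared differences over classes and then averages over samples. `multiclass_brier` and `per_class_brier` are hypothetical helper names written in plain Python so the arithmetic stays visible.

```python
def multiclass_brier(y_true, probs, classes):
    """Mean over samples of the squared differences summed across classes."""
    total = 0.0
    for t, row in zip(y_true, probs):
        total += sum((p - (1.0 if t == k else 0.0)) ** 2 for k, p in zip(classes, row))
    return total / len(y_true)

def per_class_brier(y_true, probs, classes):
    """One-vs-rest Brier score for each class, computed on its probability column."""
    n = len(y_true)
    return [
        sum((row[j] - (1.0 if t == k else 0.0)) ** 2 for t, row in zip(y_true, probs)) / n
        for j, k in enumerate(classes)
    ]

# Made-up data: four samples, three classes, each row sums to 1.
classes = [0, 1, 2]
y_true = [0, 1, 2, 2]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.4, 0.3],
]

scores = per_class_brier(y_true, probs, classes)
summed = sum(scores)                   # sum of the one-vs-rest scores
averaged = summed / len(classes)       # their average
full = multiclass_brier(y_true, probs, classes)

print(summed, averaged, full)
```

With probability inputs, the sum of the per-class scores matches the multiclass score and the average is the same value divided by the number of classes. The two therefore differ only by a positive constant and select the same model in a grid search.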

Not really. For some reason when I try the `brier_multi` that he applies, it returns ridiculous numbers. I think it might have sth to do with what the random forest does backstage. Actually, let me edit my post and include that — amestrian, Sep 24 '20 at 23:32
The answer to the other question shows that `brier_multi` is the correct extension of the Brier score to multiple classes. So if your question is not "how do I extend Brier score to multiple classes?" then I don't know what it is. — Sycorax, Sep 24 '20 at 23:54
As a loss function, it doesn’t matter if you divide by the sample size or not. Does your software divide by the sample size? — Dave, Sep 25 '20 at 00:06
If we read a citation in the other answer, the Brier score in the multi-class case is bounded between 0 and 2, so obtaining values of 200 implies some kind of programming or user error. See: https://www.wikiwand.com/en/Brier_score#/overview — Sycorax, Sep 25 '20 at 00:29
@Sycorax that's what I thought... between the random forest itself and the grid search it's a bit of a black box to know exactly what's going on. I'm currently trying some variations to see if I can figure it out but I don't see it likely... would it be too terrible to end up using the one I made? — amestrian, Sep 25 '20 at 01:00
@Dave hmm good question I guess... I'm not sure, just did a google check but couldn't find anything. Anyways I don't think that should affect this score, since I implemented self-made scorers before and the results were reasonable — amestrian, Sep 25 '20 at 01:02
What if instead of averaging it I sum the brier score for the three classes? — amestrian, Sep 25 '20 at 01:33
Dave is correct that we don’t care about rescaling by a positive constant, because the two forms will have the same minima. But it’s still not clear to me why you’re fixated on making a new variation on Brier score. You’ve demonstrated that you’ve got a programming or user error. Find that and you’re done. — Sycorax, Sep 25 '20 at 01:38
because I've tried many things so far and nothing is working, and I don't have enough time to go to the source code to figure out exactly what is wrong with it... — amestrian, Sep 25 '20 at 01:41
But you’re convinced that you have enough time to invent a new, untested idea and make decisions based on it? — Sycorax, Sep 25 '20 at 01:41
well.. i have to use something. I asked something else in [here](https://stats.stackexchange.com/questions/489010/what-is-the-limit-to-consider-something-is-overfitting) and everyone recommended to use this scoring technique (or log loss, but it's the same issue), because my previous one was not good, so I'm trying it. — amestrian, Sep 25 '20 at 01:45
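
The two points raised in the comments above, the 0 to 2 range and the division by the sample size, can be checked numerically. This is a plain-Python sketch with made-up forecasts, not a diagnosis of the 202.3 value. It compares a score averaged over samples with one that is only summed over samples; `squared_error`, `brier_mean` and `brier_sum` are hypothetical names, and classes are assumed to be coded 0, 1, 2.

```python
def squared_error(true_class, row):
    """Squared distance between a probability row and the one-hot true class."""
    return sum((p - (1.0 if k == true_class else 0.0)) ** 2 for k, p in enumerate(row))

def brier_sum(y_true, probs):
    """Squared errors summed over samples (no division by the sample size)."""
    return sum(squared_error(t, row) for t, row in zip(y_true, probs))

def brier_mean(y_true, probs):
    """Squared errors averaged over samples."""
    return brier_sum(y_true, probs) / len(y_true)

y_true = [0, 1, 2] * 50   # 150 made-up samples

# Perfect forecasts: all probability mass on the true class.
perfect = [[1.0 if k == t else 0.0 for k in range(3)] for t in y_true]
# Worst forecasts: all probability mass on a single wrong class.
worst = [[1.0 if k == (t + 1) % 3 else 0.0 for k in range(3)] for t in y_true]

best_score = brier_mean(y_true, perfect)
worst_score = brier_mean(y_true, worst)
unnormalised_worst = brier_sum(y_true, worst)

print(best_score, worst_score, unnormalised_worst)
```

The averaged version stays between 0 and 2 at these extremes, while the summed version grows with the number of samples. A value far above 2 is therefore at least consistent with a missing division by the sample size, although other causes are possible.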
