
To score a RandomForestClassifier using GridSearchCV for multiclass classification, I decided to use the Brier score.

However, I could only manage to get the Brier score for each class.

Is it reasonable to average the per-class scores as an overall performance measure? Or can you think of a better way?

Edit: I am aware this question is similar, so I'll explain why I think it's a different problem:

When I run my model with the Brier score as defined by that question's author (`brier_multi`), the score obtained for the best model is 202.3.

However, when I apply the following function, which I wrote myself:

from numpy import mean
from sklearn.metrics import brier_score_loss
from sklearn.preprocessing import label_binarize

def brier_score_multi(y_true, y_pred):
    # One-hot encode both inputs; this assumes y_pred holds predicted class labels
    y_true_bin = label_binarize(y_true, classes=[0, 1, 2])
    y_pred_bin = label_binarize(y_pred, classes=[0, 1, 2])
    return mean([brier_score_loss(y_true_bin[:, k], y_pred_bin[:, k]) for k in range(3)])

The best score is 0.0432.

As you can see, this is a big difference, and given the definition of the Brier score, I'm biased towards the second result.
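For comparison, here is a minimal sketch of a multiclass Brier score computed from predicted probabilities rather than from binarized hard labels, which is what `brier_score_multi` above receives when it calls `label_binarize` on `y_pred`. The names `multiclass_brier` and `neg_brier_scorer` are hypothetical, not part of scikit-learn, and the sketch assumes that column k of the probability matrix corresponds to `classes[k]`, as with `predict_proba` and `classes_`.

```python
import numpy as np


def multiclass_brier(y_true, proba, classes):
    """Multiclass Brier score: mean over samples of the squared distance
    between the predicted probability vector and the one-hot true label.
    Column k of `proba` must correspond to classes[k]."""
    y_true = np.asarray(y_true)
    classes = np.asarray(classes)
    # One-hot encode the true labels against the given class order
    onehot = (y_true[:, None] == classes[None, :]).astype(float)
    proba = np.asarray(proba, dtype=float)
    return float(np.mean(np.sum((proba - onehot) ** 2, axis=1)))


def neg_brier_scorer(estimator, X, y):
    """Callable with the (estimator, X, y) signature accepted by the
    `scoring` argument of GridSearchCV. Negated because grid search
    keeps the highest score, and a lower Brier score is better."""
    proba = estimator.predict_proba(X)
    return -multiclass_brier(y, proba, estimator.classes_)
```

Something like `GridSearchCV(RandomForestClassifier(), param_grid, scoring=neg_brier_scorer)` should then select the model with the lowest Brier score, where `param_grid` stands for whatever grid is being searched. With this definition the score stays between 0 (perfect) and 2 (full confidence in a wrong class for every sample).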

EDIT 2:

Seeing as the first result is incorrect, I started thinking... maybe the sum of the per-class Brier scores makes more sense than their average?
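A quick numerical check of the sum-versus-average question, as a sketch on synthetic data (random labels and random probability rows, purely illustrative, not my actual model output). When the per-class scores are computed on probability columns, their sum equals the multiclass Brier score and their average is the same quantity divided by the number of classes, so the two differ only by a constant factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classes = 200, 3

# Synthetic data, purely illustrative: random labels and random probability rows
y_true = rng.integers(0, n_classes, size=n_samples)
proba = rng.dirichlet(np.ones(n_classes), size=n_samples)
onehot = np.eye(n_classes)[y_true]

# One-vs-rest Brier score per class: mean squared error of that class's column
per_class = np.mean((proba - onehot) ** 2, axis=0)

# Multiclass Brier score: squared errors summed over classes, averaged over samples
multiclass = np.mean(np.sum((proba - onehot) ** 2, axis=1))

sum_of_classes = per_class.sum()    # equals the multiclass score
mean_of_classes = per_class.mean()  # equals the multiclass score / n_classes
```

Because the two versions differ only by the positive factor `n_classes`, they rank candidate models identically in a grid search.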

kjetil b halvorsen
amestrian
  • Not really. For some reason when I try the `brier_multi` that he applies, it returns ridiculous numbers. I think it might have sth to do with what the random forest does backstage. Actually, let me edit my post and include that – amestrian Sep 24 '20 at 23:32
  • The answer to the other question shows that `brier_multi` is the correct extension of the Brier score to multiple classes. So if your question is not "how do I extend Brier score to multiple classes?" then I don't know what it is. – Sycorax Sep 24 '20 at 23:54
  • Does it make sense to get a score of 200 then? – amestrian Sep 25 '20 at 00:02
  • As a loss function, it doesn’t matter if you divide by the sample size or not. Does your software divide by the sample size? – Dave Sep 25 '20 at 00:06
  • If we read a citation in the other answer, the Brier score in the multi-class case is bounded between 0 and 2, so obtaining values of 200 implies some kind of programming or user error. See: https://www.wikiwand.com/en/Brier_score#/overview – Sycorax Sep 25 '20 at 00:29
  • @Sycorax that's what I thought... between the random forest itself and the grid search it's a bit of a black box to know exactly what's going on. I'm currently trying some variations to see if I can figure it out but I don't see it likely... would it be too terrible to end up using the one I made? – amestrian Sep 25 '20 at 01:00
  • @Dave hmm good question I guess... I'm not sure, just did a google check but couldn't find anything. Anyways I don't think that should affect this score, since I implemented self-made scorers before and the results were reasonable – amestrian Sep 25 '20 at 01:02
  • What if instead of averaging it I sum the brier score for the three classes? – amestrian Sep 25 '20 at 01:33
  • Dave is correct that we don’t care about rescaling by a positive constant, because the two forms will have the same minima. But it’s still not clear to me why you’re fixated on making a new variation on Brier score. You’ve demonstrated that you’ve got a programming or user error. Find that and you’re done. – Sycorax Sep 25 '20 at 01:38
  • because I've tried many things so far and nothing is working, and I don't have enough time to go to the source code to figure out exactly what is wrong with it... – amestrian Sep 25 '20 at 01:41
  • But you’re convinced that you have enough time to invent a new, untested idea and make decisions based on it? – Sycorax Sep 25 '20 at 01:41
  • well.. i have to use something. I asked something else in [here](https://stats.stackexchange.com/questions/489010/what-is-the-limit-to-consider-something-is-overfitting) and everyone recommended to use this scoring technique (or log loss, but it's the same issue), because my previous one was not good, so I'm trying it. – amestrian Sep 25 '20 at 01:45

0 Answers