
I have two classifiers (linear1 and linearGP). linearGP has better accuracy, but its CE loss is higher than the CE loss of linear1.

linearGP is trained with a different loss. The data set is balanced. The x axis represents the samples seen during the training process; by the end of training, 30,000 samples had passed through both models.

[Plot: cross-entropy loss over training samples for the linear1, linearGP, and MLL linearGP models]

What is the reason?

I think that one model returns very high probabilities for its predictions, whereas the other one doesn't, even though it is better in its predictions.

I created a simulated Jupyter notebook example: https://github.com/cherepanovic/omwtuss/blob/master/CE_Acc_sim.ipynb
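The same idea can be shown in a minimal standalone sketch (this is a toy simulation, not the actual linear1/linearGP models; the accuracy and confidence values are made up for illustration). An overconfident model that is wrong on 10% of samples pays a huge log-loss penalty on each mistake, so it can have both better accuracy and higher cross-entropy than a less confident model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30_000
y = rng.integers(0, 2, n)  # balanced binary labels

def cross_entropy(y, p1):
    """Mean binary cross-entropy of predicted P(class = 1)."""
    return -np.mean(y * np.log(p1) + (1 - y) * np.log(1 - p1))

def simulate(accuracy, confidence):
    """Toy model: assigns `confidence` to the true class when it is
    correct and `1 - confidence` when it is wrong."""
    correct = rng.random(n) < accuracy
    p_true = np.where(correct, confidence, 1 - confidence)
    # convert "probability assigned to the true class" into P(class = 1)
    p1 = np.where(y == 1, p_true, 1 - p_true)
    return correct.mean(), cross_entropy(y, p1)

# Model A: more accurate but overconfident; each error costs ~ -log(0.001)
acc_a, ce_a = simulate(accuracy=0.90, confidence=0.999)
# Model B: less accurate but moderately confident
acc_b, ce_b = simulate(accuracy=0.80, confidence=0.70)

print(f"A: acc={acc_a:.3f}  CE={ce_a:.3f}")
print(f"B: acc={acc_b:.3f}  CE={ce_b:.3f}")
# Model A wins on accuracy yet loses on cross-entropy
```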

Would you agree, or do you have other explanations?

Thanks a lot!

malocho
  • https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models/312783#312783 – Sycorax Dec 30 '19 at 15:31
  • @SycoraxsaysReinstateMonica what is the point? – malocho Dec 30 '19 at 17:54
  • Different error metrics can give different results for different models. Moreover, as mentioned in the thread referred to by @SycoraxsaysReinstateMonica, there are many problems with accuracy as an error metric. – Tim Dec 30 '19 at 18:02
  • Please be more specific about the loss used for linearGP and which model is returning "very high probabilities". The link from @SycoraxsaysReinstateMonica notes that accuracy is not a good measure of performance of these types of models; the better apparent accuracy of your linearGP model might be misleading. – EdM Dec 30 '19 at 18:03
  • @Tim yes, I am aware of these problems – malocho Dec 31 '19 at 19:00
  • @EdM, the loss of linearGP is not a well-known loss; how does this information help in interpreting the plots? Regarding high probabilities, that was my assumption. – malocho Dec 31 '19 at 19:04
  • Without knowing the criteria used for a model fit it's going to be hard to say much about what might be going on. Also, it's not clear what the "Samples" along the horizontal axis represent (although I have some suspicions) or how the "MLL linearGP" model (with the lowest CE loss) differs from the other 2. Please provide more information about those, and describe how you are intending to use the models (particularly if you don't know how they were built). – EdM Jan 01 '20 at 18:13
  • @EdM samples on the x axis are samples seen during training; I thought that was obvious, but I have added it to my post. The loss of linearGP is based on KL divergence/ELBO. The models are used for a classification task... I could describe both models, but linearGP is not such a well-known approach. – malocho Jan 01 '20 at 20:55
  • So by 'samples' you mean samples from the posterior distribution in a Bayesian modeling approach? – EdM Jan 01 '20 at 21:01
  • Samples that were passed through the model during the learning process, for example 10,000 in one epoch. – malocho Jan 02 '20 at 00:40

1 Answer


First, accuracy can be a poor choice for building or evaluating a model. When you say that "linearGP has a better accuracy" that doesn't necessarily mean it's the "better one."

Second, from your comments it's clear that what you are plotting is training error. An overfit model could well have a lower training error but a higher test error. So the model with the lower training error is not necessarily the "better one," either.
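This training-vs-test gap is easy to demonstrate. The sketch below is a generic illustration (it uses a polynomial regression with squared error rather than your classifiers, purely because it overfits in a few lines): a more flexible model always achieves lower *training* error, but typically does worse on held-out data.

```python
import numpy as np

rng = np.random.default_rng(1)

# True relationship is linear; the noise is what an overfit model chases
x_train = rng.uniform(-1, 1, 30)
y_train = x_train + 0.3 * rng.standard_normal(30)
x_test = rng.uniform(-1, 1, 1000)
y_test = x_test + 0.3 * rng.standard_normal(1000)

def train_test_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train, test

tr_simple, te_simple = train_test_mse(1)     # matches the true model
tr_flex, te_flex = train_test_mse(15)        # overfits the noise

# The flexible model always has lower training error (nested least squares),
# yet it typically generalizes worse than the simple model.
print(f"degree  1: train={tr_simple:.3f}  test={te_simple:.3f}")
print(f"degree 15: train={tr_flex:.3f}  test={te_flex:.3f}")
```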

Third, it can be good to consider different loss functions for training, as you evidently have done. The choice of loss function might differ depending on how you intend to use your model; the last half of this answer gives a brief overview. Make sure that your loss function provides a proper scoring rule. That said, it's not clear why the linearGP model differs from the linear1 model if linearGP is based on KL divergence and linear1 on cross-entropy: the two differ only by the entropy of the label distribution, which is a constant with respect to the model (and zero for hard labels), so minimizing one minimizes the other. Perhaps the models involve different predictors, but that's not clear from the question.
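A quick numeric check of that identity (the distributions here are arbitrary made-up examples): cross-entropy decomposes as CE(p, q) = H(p) + KL(p || q), and with one-hot labels H(p) = 0, so cross-entropy and KL divergence coincide exactly.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (soft labels)
q = np.array([0.5, 0.3, 0.2])   # model's predicted distribution

H = -np.sum(p * np.log(p))      # entropy of p
CE = -np.sum(p * np.log(q))     # cross-entropy of q relative to p
KL = np.sum(p * np.log(p / q))  # KL divergence from q to p

# CE(p, q) = H(p) + KL(p || q)
assert np.isclose(CE, H + KL)

# With a one-hot label, H(p) = 0, so CE and KL are identical:
p_onehot = np.array([1.0, 0.0, 0.0])
ce_onehot = -np.sum(p_onehot * np.log(q))                 # = -log q[0]
kl_onehot = np.sum(p_onehot[:1] * np.log(p_onehot[:1] / q[:1]))
assert np.isclose(ce_onehot, kl_onehot)
```

So a KL-based training objective should rank models the same way cross-entropy does; any difference between the curves has to come from something else (e.g., the ELBO's extra regularization terms, or different model classes).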

EdM