
I'm dealing with a multiclass classification problem. The data is textual and highly imbalanced. The models I'm building on character-level or word-level n-grams always assign the highest probability to the class with the most samples. This leads to a very weak macro F1 score compared to accuracy (20% macro F1 vs. 40% accuracy).
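
For illustration, here is a minimal sketch (with made-up labels, using scikit-learn) of how predictions biased toward the majority class can score reasonably on accuracy while macro F1, which averages the per-class F1 scores with equal weight, stays low:

```python
# Toy illustration with hypothetical labels: a classifier that mostly predicts
# the majority class looks acceptable on accuracy but weak on macro F1.
from sklearn.metrics import accuracy_score, f1_score

# 10 samples, class 0 is the majority class
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]  # majority-class bias

print("accuracy:", accuracy_score(y_true, y_pred))                  # 0.70
print("macro F1:", f1_score(y_true, y_pred, average="macro"))       # ~0.49
```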

Since the data is textual, I can't use data balancing techniques. Moreover, the class distribution is the same in the train and test sets.

How can I improve the macro F1 score?

  • Don't use the F1 score at all. Every single criticism at [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) applies equally to it. Instead, use probabilistic classifiers, per my answer there. – Stephan Kolassa May 05 '20 at 15:58
  • Thank you for your response. I've read your answer there; it's really helpful for the binary classification case, and I was looking into thresholding and scoring rules. But it seems this works only for binary classification? Any ideas on how to apply it to multiclass classification? – John Karimov May 05 '20 at 16:07
  • No, not at all. Probabilistic classification works also with multiple classes (of course, you need appropriate models and algorithms, like multinomial logistic regression, or Random Forests with probabilistic output), and so do scoring rules. [Thresholding is not a good idea.](https://stats.stackexchange.com/a/312124/1352) – Stephan Kolassa May 05 '20 at 16:12
  • Yes, exactly. I was working with probabilistic models, but I still can't figure out how a vector of 10 probabilities (10 classes) can be used by the scoring rules. Can you please show me a small example? I still can't figure out the difference between scoring rules and thresholding :) Thank you – John Karimov May 05 '20 at 16:20
  • Sure. Let's suppose you have 10 possible classes. Your probabilistic prediction for a single instance therefore is a vector $(\hat{p}_1, \dots, \hat{p}_{10})$ that sums to 1. Suppose the actual outcome is class number $i\in\{1, \dots, 10\}$. Now, for the [logarithmic score](https://en.wikipedia.org/wiki/Scoring_rule) for instance, this would contribute $\log \hat{p}_i$ to the total score (which you get by averaging over many such instances, each with its own 10-vector of predicted class probabilities). Maximizing this is equivalent to maximizing the predicted probability for the true class. – Stephan Kolassa May 05 '20 at 16:24
  • I think I get the idea now. This is like log-loss minimization, which means maximizing the predicted probabilities of the samples' true classes. I've tried that with neural networks, minimizing the loss. But unfortunately I'm aiming for a better macro F1 score: since the data is imbalanced, the loss is computed over the whole dataset, so I'm looking for a way to make the model see the weak classes as well. – John Karimov May 05 '20 at 16:36
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/107638/discussion-between-john-karimov-and-stephan-kolassa). – John Karimov May 05 '20 at 16:48
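
A minimal sketch of the logarithmic score described in the comments above, assuming the predicted class-probability vectors and true labels are available as NumPy arrays (the names and numbers below are illustrative only; 4 classes are used instead of 10 for brevity):

```python
# Mean logarithmic score for multiclass probabilistic predictions:
# each instance contributes log(p_hat[true class]); average over instances.
import numpy as np

# Predicted class probabilities for 3 instances and 4 classes; each row sums to 1.
p_hat = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.80, 0.10],
])
y_true = np.array([0, 3, 2])  # actual class index of each instance

log_score = np.log(p_hat[np.arange(len(y_true)), y_true]).mean()
print(log_score)  # higher (closer to 0) is better; equals the negative log loss
```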

0 Answers