I'm dealing with a multiclass classification problem. The data is textual and highly imbalanced. The models I'm building on character-level or word-level n-grams always assign the highest probability to the majority class. This leads to a macro F1 score that is much weaker than accuracy (20% macro F1 vs. 40% accuracy).
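To make the symptom concrete, here's a minimal, self-contained sketch of the kind of pipeline I mean (scikit-learn assumed; the toy corpus, vectorizer settings, and classifier are illustrative, not my actual data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, f1_score

# Toy imbalanced corpus: class "a" dominates the other two classes.
texts = ["good product", "great item", "love it", "nice buy",
         "awful service", "broken on arrival"]
labels = ["a", "a", "a", "a", "b", "c"]

# Character n-grams; word n-grams (analyzer="word") behave similarly for me.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

preds = model.predict(texts)
print("accuracy:", accuracy_score(labels, preds))
print("macro F1:", f1_score(labels, preds, average="macro"))
```

Macro F1 averages the per-class F1 scores with equal weight, so the minority classes that the model ignores drag it far below accuracy.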
Since the data is textual, I can't apply standard data balancing techniques (synthetic oversampling like SMOTE doesn't apply directly to raw text). Also, the class distribution is the same in the train and test sets.
How can I improve the macro F1 score in this setting?