Fellow like-minded people,
I'm writing my thesis on fake news detection using scraped Twitter data and facing an issue (among many others). Fake news makes up less than 10% of the total tweets or news content (during events), which means there is a genuine real-life class imbalance. I know there are models that handle such imbalances well, and I also have the option of building my own training and validation sets.
Question: Given the real-life class imbalance, would my models benefit from preserving the heavily imbalanced proportions in the training data, or should I build a more balanced set? I understand the accuracy paradox, confusion matrices, and so on, but what is your take?
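For context, here's a minimal sketch of the kind of setup I'm weighing: keep the natural imbalance in the data but compensate with class weights, and score with an imbalance-aware metric instead of raw accuracy. The dataset and model below are hypothetical stand-ins, not my actual pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.model_selection import train_test_split

# ~10% positive ("fake") class, mirroring the real-world proportion
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0
)

# Stratified split preserves the class ratio in both train and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights the loss by inverse class frequency,
# so the minority class is not drowned out during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

preds = clf.predict(X_val)
print("F1 (minority class):", f1_score(y_val, preds))
print(confusion_matrix(y_val, preds))
```

The alternative would be to resample (over- or under-sample) toward a balanced training set while keeping the validation set at the real-world ratio.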