
I am working with a data set of fake job postings, and it has the following columns:

data.columns
Out[18]: 
Index(['title', 'location', 'description', 'requirements', 'telecommuting',
       'has_company_logo', 'has_questions', 'fraudulent', 'title_tokenized',
       'description_tokenized', 'requirements_tokenized'],
      dtype='object')

The issue is:

pos_instances = data[data['fraudulent']==1].shape[0]
neg_instances = data[data['fraudulent']==0].shape[0]

print('There are {} data points for the positive class, and {} data points for the negative class.'.format(pos_instances, neg_instances))
print('The ratio of positive class to negative class is {}.'.format(round(pos_instances/neg_instances,2)))
print('The data is highly imbalanced.')

del pos_instances, neg_instances
There are 705 data points for the positive class, and 14310 data points for the negative class.
The ratio of positive class to negative class is 0.05.
The data is highly imbalanced.

As the output shows, the data is highly imbalanced. Imputing synthetic examples is not viable because the data is textual; I cannot impute a fake job posting.

Any ideas for dealing with this issue are welcome. At present I cannot see any way to handle it other than under-sampling the negative class, along the lines of the sketch below.
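For concreteness, here is a minimal sketch of the under-sampling I have in mind, assuming the `data` frame shown above (the `random_state` values are arbitrary):

import pandas as pd

# Random under-sampling: keep every positive row, sample an equal number of negatives.
pos = data[data['fraudulent'] == 1]
neg = data[data['fraudulent'] == 0].sample(n=len(pos), random_state=42)

balanced = pd.concat([pos, neg]).sample(frac=1, random_state=42)  # shuffle rows
print(balanced['fraudulent'].value_counts())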

rxp3292
  • [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Aug 30 '20 at 05:08
  • @StephanKolassa Unbalanced datasets leave machine learning models prone to overfitting, which ultimately affects the prediction performance of the model, so, yes, unbalanced datasets are problematic. Oversampling simply duplicates the data and might result in overfitting, which kind of defeats the purpose. There are better techniques such as SMOTE or ADASYN which create artificial data points imitating the characteristics of the original data points. – rxp3292 Aug 30 '20 at 05:43
  • My strong impression is that overfitting is exclusively due to using inappropriate objective functions, e.g., [accuracy](https://stats.stackexchange.com/q/312780/1352). Do you know of any methods that optimize on a proper scoring rule and still overfit? I would be interested in any references. – Stephan Kolassa Aug 30 '20 at 06:14
  • @StephanKolassa Thanks for sharing the link to your answer. It was an informative read. I personally never knew that proper scoring rules like Brier existed. So, I'm really thankful to you for nudging me in the right direction. However, there are a few questions I have in my mind: Does the ratio of each class not affect the equation of decision boundary irrespective of the scoring rule? Even if we are training the models using a proper scoring rule such as Brier, does the ratio of representation of each class not matter? More importantly, do decision boundaries even truly exist? – rxp3292 Aug 30 '20 at 07:43
  • In my opinion, the *statistical* part of the exercise ends with probabilistic (!) predictions. Proper scoring rules help in calibrating these. Once we have such a probabilistic prediction for a new instance, we can take this and add costs of mis-handling the instance, and come to a *decision*. This is where boundaries or [thresholds](https://stats.stackexchange.com/a/312124/1352) come in. IMO, there is no sense in discussing thresholds/boundaries without specifying the costs we are trying to minimize (in expectation, usually). – Stephan Kolassa Aug 30 '20 at 08:20
  • Note that even if there are two possible classes (sick/healthy), we may be best served in terms of costs by having three or more possible *decisions* (treat as sick/run more tests/treat as healthy). I believe that this conflation between *statistical modeling* and the subsequent *decisions* has been a cause for great and unending confusion. – Stephan Kolassa Aug 30 '20 at 08:22
  • @StephanKolassa Thanks for taking time to share your knowledge. May I send you a connection request on LinkedIn? I wish to follow your posts and work. – rxp3292 Aug 30 '20 at 08:25
  • Certainly! Please just include a reference to this exchange, or your CV username, so I know it's you ([see my profile](https://stackexchange.com/users/204092/stephan-kolassa)). – Stephan Kolassa Aug 31 '20 at 05:59
  • @StephanKolassa I have sent you a connection request, Stephan. Thanks for adding me to your network and allowing me to follow your work. – rxp3292 Aug 31 '20 at 06:05
  • Thanks! I'll accept in two weeks - I'm on holiday right now and am not checking my emails (experience shows that I can't "just address this one email" - I wouldn't get out of the rabbit hole before dinner). So please don't worry if it takes a while. – Stephan Kolassa Aug 31 '20 at 13:15

2 Answers


I believe that there might be different ways of answering your question. One possibility that comes to mind is the following: take all the examples of the minority class (say $m$ rows) and construct $N$ balanced datasets of $2m$ rows each by sampling, without replacement, $m$ records from the majority class $N$ times.

Basically, you are under-sampling $N$ times so that most of your majority-class data is used.

Now, train $N$ possibly highly biased classifiers on these balanced datasets using your favourite metric - accuracy, for instance. Finally, combine the classifiers in some suitable way, as sketched below.
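A minimal sketch of this scheme, assuming a numeric feature matrix X (for example a TF-IDF encoding of the tokenized text) and a label array y; the logistic regression base model, n_models=10, and the probability-averaging combination step are placeholder choices:

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def balanced_bag(X, y, n_models=10, base=LogisticRegression(max_iter=1000), seed=0):
    # Build N balanced datasets by pairing all minority rows with an equally
    # sized, freshly drawn subset of majority rows, and fit one classifier each.
    rng = np.random.default_rng(seed)
    pos_idx = np.where(y == 1)[0]   # minority class (m rows)
    neg_idx = np.where(y == 0)[0]   # majority class
    models = []
    for _ in range(n_models):
        sampled_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
        idx = np.concatenate([pos_idx, sampled_neg])
        models.append(clone(base).fit(X[idx], y[idx]))
    return models

def ensemble_proba(models, X):
    # Combine the classifiers by averaging their predicted P(y=1).
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)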

fmars
  • Thanks for your answer. I agree that creating ensembles of weak classifiers, and then combining their results through some kind of voting, should work just fine. However, after reading Stephan's answers, I would think twice about using accuracy straight away as a scoring rule. I was trying to see if anyone had any newer perspectives other than creating ensembles to deal with the data imbalance issue. Either way, thank you for taking the time to write this answer. – rxp3292 Aug 31 '20 at 06:17

With 5% y=1 and 705 positive observations to work with, I wouldn't consider this an extremely unbalanced problem to the point where I'd worry about sampling strategies.

You're probably considering how accuracy wouldn't be a good metric to evaluate your model with - and you're right! But this doesn't mean you have to change your sampling strategy. Try using a different optimization metric such as log loss or AUC to tune the model. Then you'll want to diagnose the model's probabilities to make sure that they are "calibrated" (there are plenty of discussions on this topic as well).

Long story short, don't worry about classifying 1 or 0 until much later in the process. Your dataset is probably good enough to tune a decent model that provides realistic probabilities.
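A minimal sketch of that workflow, assuming an existing train/test split (X_train, X_test, y_train, y_test) and using a logistic regression as a stand-in for whatever model you end up tuning:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.calibration import calibration_curve

# Fit any probabilistic classifier and score it on log loss and AUC instead of accuracy.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

print('log loss:', log_loss(y_test, proba))
print('AUC:', roc_auc_score(y_test, proba))

# Reliability check: mean predicted probability vs. observed positive fraction
# per bin; well-calibrated probabilities give pairs that lie close together.
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f'predicted {p:.2f} -> observed {f:.2f}')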

Josh