A few thoughts:
First, even if you ultimately need to evaluate the accuracy of your model, training and testing it on accuracy is probably not the best way to proceed. This issue is discussed extensively on this site, with this page being a good place to start. That probably explains why your cross-entropy losses (log losses) agree much better between your test and training sets than your accuracy assessments do. Stick with cross-entropy.
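For concreteness, here is a minimal sketch of that train/test comparison; the synthetic data and logistic model are just stand-ins for your own 30-class problem and deep learning model:

```python
# Sketch: accuracy vs. cross-entropy (log loss) on train and test sets.
# The data and model here are placeholders, not your actual setup.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=40, n_informative=20,
                           n_classes=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

model = LogisticRegression(max_iter=2000).fit(X_train, y_train)

# Accuracy collapses the predicted probabilities into all-or-none class calls.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))

# Log loss scores the predicted probabilities themselves (a proper scoring
# rule), and typically agrees better between train and test.
print("train log loss:", log_loss(y_train, model.predict_proba(X_train)))
print("test log loss: ", log_loss(y_test, model.predict_proba(X_test)))
```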
Second, as you seem to have further processing beyond this initial modeling, consider doing that processing in a way that carries the predicted class probabilities through to the end, rather than depending on an early all-or-none assignment of cases to one of your 30 classes. That could lead to more reliable final results.
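A hedged illustration of the difference, where `proba` stands in for your model's predicted probabilities and `class_cost` for some hypothetical per-class quantity used in the downstream processing:

```python
# Sketch: hard assignment vs. carrying probabilities through downstream.
import numpy as np

rng = np.random.default_rng(0)
proba = rng.dirichlet(np.ones(30), size=100)  # placeholder predicted probabilities
class_cost = rng.uniform(10, 100, size=30)    # placeholder per-class quantity

# All-or-none: commit each case to its argmax class, then process.
hard_total = class_cost[proba.argmax(axis=1)].sum()

# Probability-weighted: each case contributes to every class in
# proportion to its predicted probability.
soft_total = (proba @ class_cost).sum()

print(f"hard-assignment total:      {hard_total:.1f}")
print(f"probability-weighted total: {soft_total:.1f}")
```

The probability-weighted version lets uncertain cases contribute fractionally to several classes instead of forcing each one to a single winner, so early classification errors don't propagate as all-or-none mistakes.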
Third, you haven't said much about the nature of your "deep learning model." You might need to consider a different type of model, or adjust the learning characteristics of the current one (as with the $\ell_2$ penalization you seem to be considering).
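If you do pursue $\ell_2$ penalization, here is one minimal way it might look in Keras; the layer sizes and the penalty strength of `1e-4` are purely illustrative assumptions, not recommendations for your problem:

```python
# Sketch: L2 (weight decay) penalties on a small classification network.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dense(30, activation="softmax",  # 30 output classes
                 kernel_regularizer=regularizers.l2(1e-4)),
])
# Train on cross-entropy, per the first point above.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

The penalty strength is a hyperparameter worth tuning (e.g., by cross-validation on the cross-entropy loss) rather than fixing in advance.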
Fourth, it's possible that you just don't have enough data of a type that can discriminate among class memberships, particularly for the low-prevalence classes. Even the best attempts at such problems can hit unavoidable barriers.