Training on a 3-million-sample dataset with unbalanced labels

Question

I have a dataset with 3 million samples and unbalanced labels.

I have tried many neural network approaches, but I couldn't get a good result.

Which approach would you suggest I follow in this case to get better results?

Thanks,

score 5 · Answer 1 · answered Feb 21 '16 at 20:18

The main reason analysts have trouble with unbalanced cases is that they optimize an improper accuracy scoring rule, such as the proportion classified correctly. If you instead use a probability estimation method (e.g., logistic regression) with a proper objective function (e.g., the likelihood), you will not have that problem.
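To make the point concrete, here is a minimal sketch on toy data (the 95/5 class split and the predicted probabilities are made up for illustration, not taken from the question). It compares thresholded accuracy with the mean negative log-likelihood (log loss) for two hypothetical models: one that effectively ignores the minority class and one that assigns it noticeably higher probability without ever crossing 0.5.

```python
import math

def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of samples classified correctly after thresholding."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)

def log_loss(y_true, p_pred, eps=1e-12):
    """Mean negative Bernoulli log-likelihood (a proper scoring rule)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Toy labels (assumed for illustration): 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# Hypothetical model A: treats every sample as almost surely negative.
p_always_negative = [0.001] * 100

# Hypothetical model B: gives positives a much higher probability,
# but still below the 0.5 threshold.
p_informative = [0.02] * 95 + [0.45] * 5

acc_a = accuracy(y_true, p_always_negative)
acc_b = accuracy(y_true, p_informative)
ll_a = log_loss(y_true, p_always_negative)
ll_b = log_loss(y_true, p_informative)

print(f"accuracy  A={acc_a:.2f}  B={acc_b:.2f}")
print(f"log loss  A={ll_a:.4f}  B={ll_b:.4f}")
```

Both models get the same accuracy, so accuracy cannot tell them apart, while the log loss clearly favors the model whose probabilities carry information about the rare class.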

score 1 · Answer 2 · edited Apr 13 '17 at 12:44


You can try using stratified cross-validation [1], which keeps the class proportions roughly the same in every fold. There are also some other suggestions, such as those in [2]. These of course don't guarantee success, but they can help with issues related to unbalanced labels.

[1] Understanding stratified cross-validation

[2] https://stats.stackexchange.com/a/133385/64720
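As a rough illustration of what stratification does, the sketch below assigns sample indices to folds class by class, so every fold ends up with about the same label proportions. The function name `stratified_folds` and the 90/10 toy labels are assumptions made for this example; in practice a library routine such as scikit-learn's `StratifiedKFold` would be the more usual choice.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions.

    Hypothetical helper for illustration: indices of each class are
    shuffled and then dealt round-robin across the folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Toy labels (assumed for illustration): 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, k=5)

# Count minority-class samples in each fold.
minority_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(minority_per_fold)
```

With a plain random split, some folds could end up with very few or even no minority samples, which makes the per-fold scores noisy; dealing each class out separately avoids that.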


answered Feb 21 '16 at 21:52

erensezener

