
I have a highly imbalanced test data set: the positive set consists of 100 cases, while the negative set consists of 1500 cases. On the training side, I have a larger candidate pool: the positive training set has 1200 cases and the negative training set has 12000 cases. For this kind of scenario, I have two choices:

1) Train a weighted SVM on the whole training set (P: 1200, N: 12000).

2) Train a plain SVM on a balanced training set (P: 1200, N: 1200), where the 1200 negative cases are sampled from the 12000 available.

Is there any theoretical guidance for deciding which approach is better? Since the test data set is highly imbalanced, should I use an imbalanced training set as well?
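
To make the two options concrete, here is a minimal sketch of both (assuming scikit-learn; the data below is a synthetic stand-in for my real training set):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the real training data: 1200 positives, 12000 negatives.
X_train = np.vstack([rng.normal(1.0, 1.0, size=(1200, 10)),
                     rng.normal(0.0, 1.0, size=(12000, 10))])
y_train = np.concatenate([np.ones(1200, dtype=int), np.zeros(12000, dtype=int)])

# Option 1: weighted SVM on the full training set (P: 1200, N: 12000).
# class_weight="balanced" weights each class inversely to its frequency.
weighted_svm = SVC(kernel="rbf", class_weight="balanced")
weighted_svm.fit(X_train, y_train)

# Option 2: plain SVM on a balanced subsample (P: 1200, N: 1200).
pos_idx = np.flatnonzero(y_train == 1)
neg_idx = np.flatnonzero(y_train == 0)
neg_sample = rng.choice(neg_idx, size=pos_idx.size, replace=False)
subset = np.concatenate([pos_idx, neg_sample])

sampled_svm = SVC(kernel="rbf")
sampled_svm.fit(X_train[subset], y_train[subset])
```

Note that scikit-learn's `class_weight="balanced"` assigns each class the weight `n_samples / (n_classes * class_count)`, so with a 1:10 ratio, mistakes on the rare positive class are penalized roughly ten times more heavily.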

bit-question
    Please check the following questions: [Supervised learning with “rare” events](http://stats.stackexchange.com/questions/9398/supervised-learning-with-rare-events-when-rarity-is-due-to-the-large-number-o) and [Best way to handle unbalanced multiclass dataset with SVM](http://stats.stackexchange.com/questions/20948/best-way-to-handle-unbalanced-multiclass-dataset-with-svm). Does this help? Frankly, your question sounds rather similar ;). – mlwida Nov 06 '12 at 22:14

2 Answers


The reply by datapraxis in a recent post on reddit will be of interest.

Edit: the paper mentioned is Haibo He and Edwardo A. Garcia, "Learning from Imbalanced Data," IEEE Transactions on Knowledge and Data Engineering, pp. 1263–1284, September 2009.

user728785

Other methods that can handle class imbalance include pairwise expanded logistic regression, ROC-based learning, boosting and bagging (bootstrap aggregating), link-based cluster ensembles (LCE), Bayesian networks, nearest-centroid classifiers, other Bayesian techniques, weighted rough sets, and k-NN, plus a wide range of sampling methods.
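
As a rough illustration (my own sketch, not part of the original list's sources), here are two of these methods in scikit-learn, run on synthetic data with the question's 100:1500 test-set imbalance:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(1)

# Synthetic imbalanced data: 100 positives, 1500 negatives.
X = np.vstack([rng.normal(1.0, 1.0, size=(100, 10)),
               rng.normal(0.0, 1.0, size=(1500, 10))])
y = np.concatenate([np.ones(100, dtype=int), np.zeros(1500, dtype=int)])

# Nearest-centroid classifier: label each point by the closest class mean.
nc = NearestCentroid().fit(X, y)

# Bagging (bootstrap aggregating): many classifiers fit on bootstrap resamples
# and combined by voting; the default base estimator is a decision tree.
bag = BaggingClassifier(n_estimators=100, random_state=1).fit(X, y)
```

Bagging pairs naturally with the sampling methods mentioned above: each bootstrap resample can be drawn to be class-balanced, so no single balanced subsample throws away most of the majority-class data.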

Vladimir Chupakhin