I understand what K-fold cross-validation is, but what does "balanced folds" mean? I thought k-fold cross-validation was already balanced. What is the difference between the two?
Could you give us a link and/or some context? I've not heard the term before, but maybe it is a kind of [stratification](http://en.wikipedia.org/wiki/Stratified_sampling)? – cbeleites unhappy with SX Oct 06 '13 at 00:38
1 Answer
The term *balanced* usually refers to having a fairly equal representation of the data classes in each sample, while *imbalance* refers to one or more classes having very low relative frequencies compared to the others. In the context of cross-validation, one would ideally like a balanced representation of the classes in each 'fold' of data: if a class is only sparsely represented in the training folds, a model can become biased towards predicting the majority class most of the time. Ways to deal with this problem include changing the probability thresholds (and hence the specificity/sensitivity trade-off) to focus more on the underrepresented class, re-balancing the training data via selective or up/down sampling, stratifying the folds, and re-weighting to focus more on classes with errors (as in boosting).
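To make the stratification idea concrete, here is a minimal sketch (my own illustration, not from any particular library) of how stratified fold assignment works: the indices for each class are dealt round-robin across the folds, so every fold keeps roughly the overall class ratio. In practice you would use something like scikit-learn's `StratifiedKFold` rather than rolling your own.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Return k folds (lists of indices), stratified by class label."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's samples across the folds like cards,
        # so each fold gets its proportional share of every class.
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds

# Imbalanced toy data: 90 samples of class 0, 10 of class 1
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, 5)
for f in folds:
    counts = (sum(labels[i] == 0 for i in f),
              sum(labels[i] == 1 for i in f))
    print(counts)  # each fold gets (18, 2), preserving the 90/10 ratio
```

With plain (unstratified) k-fold on these data, a random fold could easily contain zero examples of the minority class; stratification guarantees each fold mirrors the overall class distribution.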
There was a related prior question here that included a link to a relevant paper: Haibo He and Edwardo A. Garcia, "Learning from Imbalanced Data," IEEE Transactions on Knowledge and Data Engineering, pp. 1263–1284, September 2009. Another related paper is "Resampling Methods in Software Quality Classification," Wasif Afzal et al., May 2012.