Machine Learning with Skewed Classes in R

Question

I am looking for suggestions on which methods are appropriate for training on a dataset with a high skew in the outcome classes. The ratio of Class 0 to Class 1 is about 20:1, and I am looking to maximize the accuracy for identifying Class 1 outcomes. This is similar to oft-discussed topics such as cancer detection.

I have used some methods before, but I am trying to find out whether there is a comprehensive resource or set of suggestions that covers the different methods for these cases. Examples of how they are applied in R (packages, etc.) or with caret would be useful. It is a sparse dataset with about 100K examples, of which 5000 belong to Class 1 and the rest to Class 0. Each example has about 20 features, some of which contain null values.

score 1 · Answer 1 · answered Feb 28 '14 at 17:32

There's a reference I use for classifying with skewed data: Cohen, 2006. In it, the author describes a method for weighted over-sampling based on class prevalence in the data set. You should read the paper, but, briefly, the cost function he proposes takes the form

$$P(c)=\frac{\text{Cost}(c)}{\max_{c' \in C}\,\text{Cost}(c')}$$
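
To make the formula concrete, here is a minimal base-R sketch. It is only one possible reading of the method, not the paper's implementation: the cost values are made up, and using P(c) as a per-row sampling weight in a resample with replacement is my assumption about how the weighting would be applied.

```r
# Hypothetical per-class costs (assumed values, not taken from the paper):
# a Class 1 error is treated as 20 times as costly as a Class 0 error.
cost <- c("0" = 1, "1" = 20)

# P(c) = Cost(c) / max over all classes of Cost(c')
p <- cost / max(cost)
print(p)

# Toy labels with the 95000:5000 split described in the question
set.seed(42)
y <- factor(rep(c("0", "1"), times = c(95000, 5000)))

# Assumption: use P(c) as the sampling weight of every row of class c
# and draw a weighted resample with replacement.
idx <- sample(seq_along(y), size = length(y), replace = TRUE,
              prob = p[as.character(y)])

# Class balance before and after the weighted resample
print(prop.table(table(y)))
print(prop.table(table(y[idx])))
```

With these assumed costs, the weights of the two classes sum to similar totals, so the resample comes out roughly balanced. Different costs would give a different balance.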

score 1 · Answer 2 · answered Mar 02 '14 at 03:12

The AppliedPredictiveModeling package includes the scripts associated with the book. Chapter 16 is about class imbalances and shows how to deal with them using caret and a number of other packages.
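
As a rough illustration of the kind of tool caret offers for this (not a reproduction of the chapter's code), the sketch below uses caret's `downSample()` and `upSample()` helpers on made-up data with a 20:1 class ratio. The data frame, feature names, and class labels are all hypothetical.

```r
# Assumes the caret package is installed.
library(caret)

set.seed(1)
n <- 2100
# Toy predictors and a 20:1 outcome (values are made up for illustration)
x <- data.frame(f1 = rnorm(n), f2 = rnorm(n))
y <- factor(rep(c("Class0", "Class1"), times = c(2000, 100)))

# Down-sampling: randomly drop majority rows until both classes are the same size
down <- downSample(x = x, y = y, yname = "Class")

# Up-sampling: resample minority rows with replacement up to the majority size
up <- upSample(x = x, y = y, yname = "Class")

print(table(down$Class))
print(table(up$Class))
```

Either balanced data frame can then be passed to a model-fitting function. caret's `trainControl()` also accepts a `sampling` argument (for example `sampling = "down"`) that applies the resampling inside each resample during training, which may be preferable to balancing the data once up front.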

The code for each chapter is in the AppliedPredictiveModeling package and, once loaded, those files can be found using the scriptLocation function.
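
A small sketch of how the scripts might be located, assuming the package is installed and that `scriptLocation()` returns the script directory as a character string. The variable names are my own.

```r
# Assumes AppliedPredictiveModeling is installed; otherwise nothing is listed.
if (requireNamespace("AppliedPredictiveModeling", quietly = TRUE)) {
  # Directory holding the per-chapter scripts shipped with the package
  script_dir <- AppliedPredictiveModeling::scriptLocation()
  chapter_scripts <- list.files(script_dir, full.names = TRUE)
} else {
  chapter_scripts <- character(0)
}

# Show the file names so the Chapter 16 script can be picked out
print(basename(chapter_scripts))
```

Once the file is identified, it can be opened in an editor (for example with `file.edit()`) and run section by section.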
