
Let's say I have a dataset with 100,000 class A training observations and 400 class B training observations. I want to use a support vector machine for this binary classification problem. Instead of applying random undersampling or SMOTE, I want to apply a method as follows: I will divide my class A observations into 400 distinct batches (100,000/400 = 250 observations per batch) and add all 400 class B observations to each of the 400 batches. Then I will take the average of all the results (accuracy, F1, average precision) obtained from each of the 400 batches.
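The batching scheme described above can be sketched as follows. This is only an illustration: the synthetic Gaussian data, the scaled-down sizes (1,000 majority / 50 minority instead of 100,000 / 400), and the shared held-out test set are all assumptions made for the sake of a quick, runnable example.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Scaled-down synthetic imbalanced data: class 0 is the majority (1,000),
# class 1 the minority (50). Two Gaussian blobs are an assumption.
X_maj = rng.normal(0.0, 1.0, size=(1000, 2))
X_min = rng.normal(2.0, 1.0, size=(50, 2))

# A shared held-out test set on which every batch model is scored.
X_test = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                    rng.normal(2.0, 1.0, size=(20, 2))])
y_test = np.array([0] * 200 + [1] * 20)

batch_size = 50                        # majority samples per batch
n_batches = len(X_maj) // batch_size   # 1000 / 50 = 20 batches

scores = []
perm = rng.permutation(len(X_maj))     # shuffle before slicing into batches
for b in range(n_batches):
    idx = perm[b * batch_size:(b + 1) * batch_size]
    # Each batch = one distinct slice of class 0 + ALL class-1 observations.
    X_b = np.vstack([X_maj[idx], X_min])
    y_b = np.array([0] * batch_size + [1] * len(X_min))
    clf = SVC(kernel="rbf").fit(X_b, y_b)
    scores.append(f1_score(y_test, clf.predict(X_test)))

print(f"mean F1 over {n_batches} batches: {np.mean(scores):.3f}")
```

Note that the same 50 minority observations appear in every one of the 20 batches, which is the reuse the answer below warns about.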

Is following such a method completely wrong? Does it give me very optimistic results? What are the possible misleading effects?

Thank you.

glslmn
  • Unbalanced classes are almost certainly not a problem, and oversampling will not solve a non-problem: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Jul 24 '19 at 15:49
  • Do not use accuracy, precision or f1 to evaluate a classifier: [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) [Is accuracy an improper scoring rule in a binary classification setting?](https://stats.stackexchange.com/q/359909/1352) [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352) – Stephan Kolassa Jul 24 '19 at 15:49
  • In addition: is the problem really binary (as opposed to one-class?) – cbeleites unhappy with SX Jul 26 '19 at 15:10

1 Answer


As per the method described, you are reusing the same 400 class B observations in every batch, which effectively duplicates class B until it matches the volume of class A across the ensemble. The model over-learns those few class B examples and treats whatever is present in the training set as the whole truth. This leads to overfitting, high variance and unstable models. If you are using an SVM, use the class_weight parameter instead to specify how important correctly classifying class B is, and use cross-validation to identify a specific weight for B. This modifies the optimisation objective to penalise class B misclassifications more heavily.
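A minimal sketch of tuning the class weight via cross-validation, assuming scikit-learn's SVC. The synthetic data and the candidate weight grid are illustrative assumptions, not a recommendation for any particular values.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Small synthetic imbalanced set: 500 class-0 vs 25 class-1 observations.
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.normal(2.0, 1.0, size=(25, 2))])
y = np.array([0] * 500 + [1] * 25)

# Search over how heavily class-1 misclassifications are penalised.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"class_weight": [{1: w} for w in (1, 5, 10, 20)]},
    scoring="f1",   # or a proper scoring rule, per the comments above
    cv=5,
)
grid.fit(X, y)
print("best class_weight:", grid.best_params_["class_weight"])
```

Passing `class_weight={1: w}` multiplies the SVM's penalty parameter C by w for class-1 samples, so the optimiser pays w times more for each minority-class error; `class_weight="balanced"` is a shortcut that sets the weights inversely proportional to class frequencies.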