
I have a binary text classification problem where texts of class 0 account for ~95% of cases and class 1 for ~5%. I put in some effort to build a decently sized, balanced, manually labeled subset (7k) of my overall data set (700k). I train an SVM classifier on (100-p)% of the labeled data and test it on the remaining p% (no overlap between train and test). The classifier performs well, but the test scenario is not realistic. To test how the classifier performs on realistic data (i.e. when applying it to unlabeled texts), I construct a more realistic test set by removing class 1 (minority class) texts until the test set has a realistic 5/95 balance. The model performs alright there (and much better than when the training set is imbalanced as well). However, two questions remain.

My first question: Is there a way in sklearn to apply cross-validation / hyperparameter tuning for my scenario? I am aware of stratified k-fold cross-validation, but as far as I understand, it preserves the same class balance in the training and test folds, so it covers imbalance in both rather than my case of balanced training folds with imbalanced test folds.
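One idea I have (just a sketch, not a built-in sklearn feature, and I am not sure it is the intended way): `GridSearchCV` accepts any iterable of `(train, test)` index arrays via its `cv` argument, so one could generate folds where the training part stays balanced while the test part is subsampled to the realistic class ratio. The helper function below and the parameter values are made up for illustration, and the data are a synthetic stand-in for the labeled subset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC


def balanced_train_imbalanced_test_splits(y, n_splits=5, minority_frac=0.05, seed=0):
    """Yield (train_idx, test_idx) pairs: the training fold keeps the balanced
    labeled data as-is, while the test fold is subsampled so that class 1 makes
    up roughly `minority_frac` of it (the realistic prevalence)."""
    rng = np.random.default_rng(seed)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(np.zeros(len(y)), y):
        pos = test_idx[y[test_idx] == 1]
        neg = test_idx[y[test_idx] == 0]
        # Subsample class 1 in the test fold down to ~minority_frac of the fold.
        n_pos = int(minority_frac / (1 - minority_frac) * len(neg))
        pos_sub = rng.choice(pos, size=min(n_pos, len(pos)), replace=False)
        yield train_idx, np.concatenate([neg, pos_sub])


# Synthetic stand-in for the 7k balanced, manually labeled (and vectorized) texts.
X, y = make_classification(n_samples=7000, weights=[0.5, 0.5], random_state=0)

# GridSearchCV accepts any iterable of (train, test) index arrays via cv=...
search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10]},  # illustrative values only
    scoring="f1",
    cv=list(balanced_train_imbalanced_test_splits(y)),
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```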

My second question: Currently, my model performs well on recall but not so well on precision. Assuming this is a robust finding, is there a way to adjust for it? Specifically, is it valid to move the classification threshold to improve F1? (False positives and false negatives are equally costly.)

The picture shows how F1 increases when moving the threshold; moving it to 0.6 seems to yield a better F1. (Values are averages over 10 runs, randomizing which labeled texts go into the balanced training set vs. the imbalanced test set.)

[Figure: Tradeoff of precision/recall when moving the classification threshold]
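For reference, this is roughly how such a threshold sweep can be computed with sklearn's `precision_recall_curve` (a sketch on synthetic stand-in data; in practice the threshold would be tuned on a validation split rather than on the final test set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; in the real setting this would be the balanced
# training set and a held-out validation set with the realistic 5/95 balance.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = SVC(probability=True, random_state=0).fit(X_train, y_train)
probs = clf.predict_proba(X_val)[:, 1]

# precision_recall_curve evaluates every candidate threshold; F1 follows directly.
precision, recall, thresholds = precision_recall_curve(y_val, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])  # the last precision/recall pair has no threshold
print(f"best threshold ~ {thresholds[best]:.2f}, F1 = {f1[best]:.3f}")

# Apply the tuned threshold instead of the default 0.5 at prediction time.
y_pred = (probs >= thresholds[best]).astype(int)
```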

Appreciate your help!

– mire

  • Unbalanced classes are almost certainly not a problem, and oversampling will not solve a non-problem: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Apr 05 '21 at 19:51
  • Do not use accuracy to evaluate a classifier: [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) [Is accuracy an improper scoring rule in a binary classification setting?](https://stats.stackexchange.com/q/359909/1352) [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352) ... – Stephan Kolassa Apr 05 '21 at 19:52
  • ... The same problems apply to sensitivity and specificity, and indeed to all evaluation metrics that rely on hard classifications. Instead, use probabilistic classifications, and evaluate these using [proper scoring rules](https://stats.stackexchange.com/tags/scoring-rules/info). – Stephan Kolassa Apr 05 '21 at 19:52
  • Appreciate the quick and insightful answers! One add-on question: Frank Harrell writes "It is simply the case that a classifier trained to a 1/2 prevalence situation will not be applicable to a population with a 1/1000 prevalence." I get it. However, since I already oversampled the minority class, throwing the "oversampled" share away wouldn't be sensible either, right? Then I would end up with realistic but (way) less training data overall. – mire Apr 05 '21 at 20:48
  • Well, if I understand you correctly, that would amount to simply taking a subsample from your original sample, which would indeed just throw information away (unless your original sample was too large to handle). – Stephan Kolassa Apr 06 '21 at 06:41
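A minimal sketch of evaluating the probabilistic predictions with proper scoring rules, as suggested in the comments (using sklearn's Brier score and log loss; `probs` and `y_val` continue from the threshold sketch above):

```python
from sklearn.metrics import brier_score_loss, log_loss

# Both are proper scoring rules for probabilistic predictions; lower is better.
print("Brier score:", brier_score_loss(y_val, probs))
print("Log loss:   ", log_loss(y_val, probs))
```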

0 Answers