I have a binary text classification problem where class 0 accounts for ~95% of cases and class 1 for ~5%. I put some effort into building a decently sized, balanced, manually labeled subset (7k texts) of my overall data set (700k). I train an SVM classifier on a fraction (1 − p) of the labeled data and test it on the remaining fraction p (no overlap between train and test). The classifier performs well, but this test scenario is not realistic. To test how the classifier performs on realistic data (i.e. when applying it to unlabeled texts), I construct a more realistic test set by removing class 1 (minority class) texts until the test set has the realistic 5/95 balance. The model performs alright on it (and much better than when the training set is imbalanced as well). However, two questions remain.
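For reference, a stripped-down sketch of my setup (TfidfVectorizer, LinearSVC, and the synthetic texts/labels are stand-ins, not my actual pipeline):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Placeholder for the ~7k balanced, manually labeled subset
texts = np.array([f"example document number {i}" for i in range(7000)], dtype=object)
labels = np.repeat([0, 1], 3500)

# Balanced labeled data, fraction p held out for testing
train_txt, test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# Downsample class 1 in the test split to restore the realistic 5/95 balance
idx0 = np.where(y_test == 0)[0]
idx1 = np.where(y_test == 1)[0]
n_keep = max(1, int(round(len(idx0) * 5 / 95)))
keep = np.concatenate([idx0, rng.choice(idx1, size=n_keep, replace=False)])
test_txt, y_test = test_txt[keep], y_test[keep]

# Train on the balanced split, evaluate on the imbalanced split
vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(train_txt), y_train)
print(classification_report(y_test, clf.predict(vec.transform(test_txt))))
```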
My first question: Is there a way in sklearn to apply cross-validation / hyperparameter tuning for this scenario? I am aware of Stratified K-Fold cross-validation, but as far as I understand, it addresses imbalance in both the training and test folds, not my case of balanced training folds with imbalanced test folds.
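What I imagine is something along these lines (a sketch only: `GridSearchCV` accepts `cv` as an iterable of `(train_idx, val_idx)` index pairs, so the training folds could stay balanced while class 1 is downsampled in each validation fold; the helper `imbalanced_val_folds`, the 5% target, and the synthetic X/y are my own, not an sklearn feature):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.svm import LinearSVC

def imbalanced_val_folds(y, n_splits=5, minority_frac=0.05, seed=0):
    """Yield (train_idx, val_idx): balanced training folds, downsampled validation folds."""
    rng = np.random.default_rng(seed)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(np.zeros((len(y), 1)), y):
        val0 = val_idx[y[val_idx] == 0]
        val1 = val_idx[y[val_idx] == 1]
        # keep only enough class-1 texts to make up ~5% of the validation fold
        n_keep = max(1, int(round(len(val0) * minority_frac / (1 - minority_frac))))
        val1 = rng.choice(val1, size=min(n_keep, len(val1)), replace=False)
        yield train_idx, np.concatenate([val0, val1])

# Synthetic stand-in for the vectorized balanced subset
rng = np.random.default_rng(0)
X = rng.normal(size=(7000, 20))
y = np.repeat([0, 1], 3500)

grid = GridSearchCV(
    LinearSVC(),
    param_grid={"C": [0.1, 1, 10]},
    scoring="f1",                       # scored on the realistic, imbalanced validation folds
    cv=list(imbalanced_val_folds(y)),
)
grid.fit(X, y)
print(grid.best_params_)
```

Is this a reasonable way to do it, or is there built-in support I am missing?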
My second question: Currently, my model performs well on recall but not so well on precision. Assuming this is a robust finding, is there a way to adjust for it? Specifically, is it valid to move the classification threshold to improve F1? (False positives and false negatives are equally costly.)
The picture shows how F1 increases when moving the threshold. Moving it to 0.6 seems to yield a better F1. (Values are averages over 10 runs, randomizing which labeled texts end up in the balanced training set vs. the imbalanced test set.)
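The threshold sweep behind the plot looks roughly like this (SVC with `probability=True` and the synthetic data are placeholders for my actual pipeline):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve

# Synthetic placeholders for the real train/test data
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 20)), rng.integers(0, 2, size=1000)
X_test, y_test = rng.normal(size=(500, 20)), rng.integers(0, 2, size=500)

# probability=True enables predict_proba, so the threshold can be moved away from 0.5
clf = SVC(probability=True).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# F1 at every candidate threshold; the last precision/recall pair has no threshold
precision, recall, thresholds = precision_recall_curve(y_test, proba)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best = np.argmax(f1[:-1])
print(f"best threshold ~{thresholds[best]:.2f}, F1 = {f1[best]:.3f}")

# Classify with the shifted threshold instead of the default 0.5
y_pred = (proba >= thresholds[best]).astype(int)
```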
Appreciate your help!