I am working on an NLP problem where we obtain labels through a Mechanical Turk–like system. We started with a random sample, experimented with models, and determined that rebalancing the minority class improved model performance. We also know that adding more data will likely improve performance further.

The next step would be to get more labels, but ideally we'd like to oversample the minority class. The catch is that we don't know the ground truth until someone labels the data, yet we'd need the label in order to stratify the sample we want labeled in the first place.

One idea is to use the model to predict labels on unlabeled data and use those predictions to oversample what we pass to the labelers. However, if we used high-confidence predictions (e.g., < 0.1 or > 0.9), it seems we'd mostly be collecting samples that the model has already learned to predict well. If we used low-confidence predictions (e.g., between 0.4 and 0.6), there wouldn't really be any confidence in the predicted label, which would undermine the oversampling strategy. A rough sketch of this selection step follows.
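To make the idea concrete, here is a minimal sketch of the selection step I'm describing, using scikit-learn. The toy data, model choice, and the 0.1/0.9 and 0.4/0.6 cutoffs are all illustrative placeholders, not a settled design:

```python
# Sketch: score an unlabeled pool with the current model and draw a
# labeling batch from two confidence bands. All names and thresholds
# here are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the already-labeled seed set and the unlabeled pool.
labeled_texts = ["good service", "terrible bug", "works fine", "crashes often"]
labels = [0, 1, 0, 1]  # 1 = minority class
unlabeled_texts = ["mostly fine", "awful crash", "not sure", "ok I guess"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(labeled_texts)
model = LogisticRegression().fit(X, labels)

# Predicted probability of the minority class for every unlabeled item.
p = model.predict_proba(vectorizer.transform(unlabeled_texts))[:, 1]

# Band 1: likely minority (high confidence) -- candidates for oversampling,
# but the model already predicts these well.
likely_minority = np.where(p > 0.9)[0]

# Band 2: uncertain region -- classic uncertainty-sampling candidates,
# but the predicted label is unreliable for stratification.
uncertain = np.where((p > 0.4) & (p < 0.6))[0]

# One option would be a mixed batch that hedges between the two failure modes.
to_label = np.concatenate([likely_minority, uncertain])
print([unlabeled_texts[i] for i in to_label])
```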

Any ideas or papers on this topic you're aware of?

Comments:

• [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Oct 07 '21 at 17:02
• @StephanKolassa, this is about getting data to be labeled. – Ben Reiniger Oct 07 '21 at 17:03
• @BenReiniger: that is why I am commenting, not answering. It sounds like the OP might be interested in whether or not the (ancillary) topic of oversampling is useful. – Stephan Kolassa Oct 07 '21 at 17:05
• I am with @StephanKolassa and believe there might not be a problem here, because Ken should not be optimizing a threshold-based metric like accuracy and oversampling to achieve such optimization. – Dave Oct 07 '21 at 17:52

0 Answers