I am working on an NLP problem where we obtain labels through a Mechanical Turk–like system. We started with a random sample, experimented with models, and found that rebalancing the training data toward the minority class improved model performance. We also know that adding more data will likely improve model performance.
The next step would be to get more labels, but ideally we'd like to oversample the minority class. Obviously we don't know the ground truth until someone labels the data, but we'd need the label in order to stratify the sample we want labeled in the first place.
One idea is to use the current model to predict labels for the unlabeled data, oversample based on those predictions, and pass the selected examples to the labelers. However, if we used only high-confidence predictions (i.e. scores below 0.1 or above 0.9), it seems we'd be collecting more examples that the model has already learned to predict well. If we used low-confidence predictions (i.e. scores between 0.4 and 0.6), we couldn't trust the predicted label, which would undermine the oversampling strategy.
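To make the trade-off concrete, here is a rough sketch of the selection step, assuming a binary classifier that outputs a score for P(minority class) on each unlabeled example. The function name `select_for_labeling`, the `minority_share` parameter, and the band edges are all hypothetical choices for illustration, not an established recipe. It splits the labeling budget between examples the model predicts as minority and examples in the uncertain band.

```python
import numpy as np

def select_for_labeling(probs, budget, minority_share=0.5,
                        low=0.4, high=0.6, seed=0):
    """Pick indices of unlabeled examples to send to the labelers.

    probs: predicted P(minority class) per unlabeled example (assumed input).
    minority_share: fraction of the budget drawn from examples scored
        above `high`; the remainder comes from the band [low, high].
    """
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)

    # Candidates the model believes are minority-class.
    likely_minority = np.flatnonzero(probs > high)
    # Candidates the model is unsure about.
    uncertain = np.flatnonzero((probs >= low) & (probs <= high))

    # Cap each quota by how many candidates actually exist in the band.
    n_minority = min(int(round(budget * minority_share)), likely_minority.size)
    n_uncertain = min(budget - n_minority, uncertain.size)

    # Shuffle each band and take the first n, i.e. sample without replacement.
    picked_minority = rng.permutation(likely_minority)[:n_minority]
    picked_uncertain = rng.permutation(uncertain)[:n_uncertain]
    return np.concatenate([picked_minority, picked_uncertain])

# Hypothetical scores for ten unlabeled examples.
scores = np.array([0.05, 0.1, 0.45, 0.5, 0.55, 0.7, 0.8, 0.95, 0.02, 0.58])
to_label = select_for_labeling(scores, budget=4)
```

In this sketch the predictions only decide which examples get sent out; the labels that end up in the training set would still come from the human labelers.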
Are there any ideas or papers on this topic that you're aware of?