I am working on a classification problem with a small amount of labelled data (~200 instances) and a larger sample of unlabelled data (~500 instances).

To increase the size of the training data, I intend to use an oversampling technique (e.g. SMOTE). Is there a way to use the unlabelled data to improve the oversampling? This matters particularly because I think the unlabelled data is more representative of the underlying population: certain factors influenced which samples were chosen for testing, and therefore for labelling.

A. Bollans
  • Statisticians do not see class imbalance as a problem, and there is no need to use undersampling, oversampling, or artificial balancing to solve a non-problem. It might be helpful if you say why you find the imbalance problematic. https://stats.stackexchange.com/questions/357466 https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/ https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Jan 05 '22 at 11:19
  • Hi, thanks for your comment and the links. Maybe I should have phrased my question slightly differently as class imbalance is not the main issue. The real problem for me seems to be that the labelled (training) data is not a very representative sample of the underlying population, and so a model trained on this data may not be useful in the wild. The unlabelled data is much more representative and it seems like I might be able to use information from the unlabelled data to improve the labelled data in this regard. – A. Bollans Jan 05 '22 at 15:45
  • It appears that this paper describes the process that I'm looking for: Huang, Jiayuan, et al. "Correcting sample selection bias by unlabeled data." Advances in Neural Information Processing Systems 19 (2006). I would ideally like to find a robust implementation of their method but am struggling to find one. – A. Bollans Feb 15 '22 at 12:20
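
For readers with the same selection-bias problem, here is a minimal sketch of one way to reweight labelled samples using the unlabelled ones. It is not the kernel mean matching method from the Huang et al. paper. It swaps in a simpler, related idea: train a logistic regression to tell labelled rows from unlabelled rows, then use its odds as an estimate of the density ratio p_unlabelled(x) / p_labelled(x). The function names, the learning rate, the number of epochs, the L2 penalty, the clipping threshold and the toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_domain_classifier(labelled, unlabelled, lr=0.1, epochs=2000, l2=1e-3):
    """Logistic regression by batch gradient descent.

    Target 0 marks a labelled row, target 1 an unlabelled row.
    All hyperparameters are illustrative defaults, not tuned values.
    """
    rows = [(x, 0.0) for x in labelled] + [(x, 1.0) for x in unlabelled]
    n, dim = len(rows), len(rows[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in rows:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def importance_weights(labelled, unlabelled, clip=10.0):
    """Estimate p_unlabelled(x) / p_labelled(x) for each labelled row."""
    w, b = fit_domain_classifier(labelled, unlabelled)
    # Correct the classifier odds for the different sizes of the two sets.
    prior = len(labelled) / len(unlabelled)
    raw = []
    for x in labelled:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        # exp(z) is the odds P(unlabelled | x) / P(labelled | x);
        # cap the exponent and the weight to avoid extreme values.
        raw.append(min(prior * math.exp(min(z, 50.0)), clip))
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]  # rescaled so the weights average 1


# Toy one-feature data (made up): labelled rows cluster at low values,
# unlabelled rows are spread over higher values.
labelled_X = [[0.0], [0.1], [0.2], [0.3], [1.0]]
unlabelled_X = [[0.0], [0.5], [0.8], [1.0], [1.2], [0.9]]
weights = importance_weights(labelled_X, unlabelled_X)
```

The resulting weights could then be passed as per-sample weights to a learner that supports them, or used as sampling probabilities when choosing which labelled rows to oversample. With only ~200 labelled rows the estimated ratios can be noisy, which is why the sketch clips and rescales them.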

0 Answers