Improving synthetic oversampling with unlabelled data

Question

I am working on a classification problem with a small amount of labelled data (~200 instances) and a larger sample of unlabelled data (~500 instances).

To increase the size of the training data, I intend to use an oversampling technique (e.g. SMOTE). Is there a way to use the unlabelled data to improve the oversampling? This matters particularly because I think the unlabelled data is more representative of the underlying population: certain factors influenced which samples were chosen for testing, and therefore for labelling.

Statisticians do not see class imbalance as a problem, and there is no need to use undersampling, oversampling, or artificial balancing to solve a non-problem. It might be helpful if you say why you find the imbalance problematic. https://stats.stackexchange.com/questions/357466 https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/ https://twitter.com/f2harrell/status/1062424969366462473?lang=en — Dave, Jan 05 '22 at 11:19
Hi, thanks for your comment and the links. Maybe I should have phrased my question slightly differently as class imbalance is not the main issue. The real problem for me seems to be that the labelled (training) data is not a very representative sample of the underlying population, and so a model trained on this data may not be useful in the wild. The unlabelled data is much more representative and it seems like I might be able to use information from the unlabelled data to improve the labelled data in this regard. — A. Bollans, Jan 05 '22 at 15:45
It appears that the paper Huang, Jiayuan, et al. "Correcting sample selection bias by unlabeled data." Advances in Neural Information Processing Systems 19 (2006) describes the process I'm looking for. I would ideally like to find a robust implementation of their method, but am struggling to find one. — A. Bollans, Feb 15 '22 at 12:20
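
The idea in the comments above, reweighting the labelled sample so that it resembles the more representative unlabelled sample, can be illustrated with a minimal sketch. This is not the kernel mean matching method from the cited paper. It swaps in a simpler, commonly used stand-in: a logistic regression that discriminates labelled from unlabelled rows, whose odds give an estimate of the density ratio. The function name `importance_weights`, its parameters, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def importance_weights(X_lab, X_unl, lr=0.1, n_iter=2000, l2=1e-2):
    """Estimate w(x) ~ p_unlabelled(x) / p_labelled(x) for each labelled row.

    Assumption: a linear logistic model separating the two samples is an
    adequate density-ratio estimator. This is a stand-in technique, not
    kernel mean matching.
    """
    n_lab, n_unl = len(X_lab), len(X_unl)
    X = np.vstack([X_lab, X_unl])
    # Domain indicator: 0 = labelled sample, 1 = unlabelled sample.
    d = np.concatenate([np.zeros(n_lab), np.ones(n_unl)])

    # Standardise features and append a bias column.
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-12
    Z = np.hstack([(X - mu) / sd, np.ones((len(X), 1))])

    # Plain gradient descent on the L2-regularised logistic loss.
    theta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ theta)))
        theta -= lr * (Z.T @ (p - d) / len(d) + l2 * theta)

    # P(unlabelled | x) for the labelled rows, clipped for numerical safety.
    p_lab = 1.0 / (1.0 + np.exp(-(Z[:n_lab] @ theta)))
    p_lab = np.clip(p_lab, 1e-6, 1.0 - 1e-6)

    # Bayes' rule: density ratio = posterior odds * (n_lab / n_unl).
    w = (p_lab / (1.0 - p_lab)) * (n_lab / n_unl)
    return w / w.mean()  # normalise so the weights average to 1


# Synthetic illustration (made-up data): the labelled sample is shifted
# relative to the unlabelled one, mimicking selection bias.
rng = np.random.default_rng(0)
X_unl = rng.normal(0.0, 1.0, size=(500, 2))
X_lab = rng.normal(1.0, 1.0, size=(200, 2))
w = importance_weights(X_lab, X_unl)
```

Weights like these could be passed as `sample_weight` to estimators whose `fit` method accepts it (many scikit-learn estimators do), or used as sampling probabilities when choosing which labelled points to oversample. Whether either helps depends on how well the selection mechanism is captured by the observed features.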

0 Answers