
I am currently working on the topic of imbalanced data, and I found an R function called ROSE (paper). I understand at a high level how the function works; unfortunately, I do not have a very strong background in statistics, so I can't reproduce the algorithm in Python.

So my questions are: can someone help me understand in depth how ROSE works, or at least point me towards a good reference for understanding it in depth?

Secondly, roughly how hard would it be to reproduce it in Python?

Much appreciated!

ombk
    Why is class imbalance such a problem? Much of statistics says otherwise. https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://stats.stackexchange.com/questions/368949/example-when-using-accuracy-as-an-outcome-measure-will-lead-to-a-wrong-conclusio – Dave Aug 06 '21 at 03:11
  • Getting to the software issue, there is a Python library called RPy2 that lets you call R functions from Python, kind of a reverse of “reticulate” in R; see the sketch after these comments. – Dave Aug 06 '21 at 03:13
  • Imbalance can be a problem for very small datasets, but otherwise it tends to be just an unequal-misclassification-costs problem, and the imbalance itself is irrelevant. The important thing is to know what criterion you actually want to optimise (e.g. the misclassification costs for each type of error, false positives and false negatives). Until you have decided that, you probably don't want to resample the data. – Dikran Marsupial Jan 08 '22 at 15:40
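Following up on the RPy2 suggestion above: here is a minimal sketch of what calling R's ROSE from Python might look like. It assumes R and the ROSE package are installed locally (install.packages("ROSE") in R); the toy data frame and column names are made up, and the exact pandas conversion calls vary between rpy2 versions, so treat this as a starting point rather than a tested recipe.

    import pandas as pd
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri
    from rpy2.robjects.conversion import localconverter
    from rpy2.robjects.packages import importr

    rose = importr("ROSE")  # loads the installed R package through rpy2

    # Made-up toy data: a 2-vs-4 class imbalance on two numeric features.
    df = pd.DataFrame({
        "x1": [0.1, 0.3, 2.2, 2.4, 2.5, 2.7],
        "x2": [1.0, 1.2, 0.2, 0.1, 0.3, 0.4],
        "cls": ["pos", "pos", "neg", "neg", "neg", "neg"],
    })

    # Convert the pandas frame to an R data.frame, call ROSE(), convert back.
    with localconverter(ro.default_converter + pandas2ri.converter):
        r_df = ro.conversion.py2rpy(df)
        res = rose.ROSE(ro.Formula("cls ~ ."), data=r_df)
        # ROSE returns a list; its "data" element holds the resampled frame.
        balanced = ro.conversion.rpy2py(res.rx2("data"))

    print(balanced["cls"].value_counts())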

1 Answer


I saw your question six months later, so my answer may be too late for you, but I want to answer for users who find this question later.

I'm not sure it's exactly the same as the ROSE package in R, but the Python package imblearn implements ROSE-style sampling. Below is an excerpt from here: https://imbalanced-learn.org/stable/over_sampling.html

If repeating samples is an issue, the parameter shrinkage allows creating a smoothed bootstrap. However, the original data need to be numerical. The shrinkage parameter controls the dispersion of the newly generated samples. We show an example illustrating that the new samples no longer overlap once a smoothed bootstrap is used. This way of generating a smoothed bootstrap is also known as Random Over-Sampling Examples (ROSE) [MT14].

[Figure from the imbalanced-learn documentation: comparison between naive random over-sampling and smoothed random over-sampling.]
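To make the excerpt concrete, here is a short usage sketch of the API it describes; the shrinkage value 0.2 and the toy dataset are arbitrary choices for illustration.

    from collections import Counter

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler

    # Toy imbalanced dataset: roughly a 95% / 5% class split.
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

    # shrinkage=None (the default) repeats rows exactly, i.e. naive
    # over-sampling; a positive shrinkage draws smoothed-bootstrap
    # (ROSE-style) samples around the original points instead.
    ros = RandomOverSampler(shrinkage=0.2, random_state=0)
    X_res, y_res = ros.fit_resample(X, y)

    print(Counter(y))      # original, imbalanced counts
    print(Counter(y_res))  # balanced counts after resampling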

I don't know ROSE in depth, but the algorithm seems to perform over-sampling with smoothing in a multivariate way (bootstrapping rows from the dataset and then smoothing each drawn sample with noise?). I hope someone who is familiar with this algorithm can explain it in depth in another answer.
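For what it's worth, here is a minimal sketch of my reading of the paper (Menardi & Torelli, 2014): each synthetic point is a randomly chosen existing row plus Gaussian noise, with per-feature bandwidths from Silverman's multivariate rule of thumb. The function name rose_sample is mine, and the bandwidth constant should be checked against the R source before relying on it.

    import numpy as np

    def rose_sample(X, n_new, seed=None):
        """Draw n_new smoothed-bootstrap samples from the rows of X."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        # Diagonal bandwidth: h_q = (4 / ((d + 2) * n)) ** (1 / (d + 4)) * sd_q
        h = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4)) * X.std(axis=0, ddof=1)
        seeds = X[rng.integers(0, n, size=n_new)]  # bootstrap: pick rows with replacement
        return seeds + rng.normal(size=(n_new, d)) * h  # smooth: jitter around each seed

    # Example: bring the minority class (y == 1) up to the majority count.
    # X_new = rose_sample(X[y == 1], (y == 0).sum() - (y == 1).sum(), seed=0)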

So, I think you can use that package instead of implementing the algorithm yourself, if you just want to use this sampling. Alternatively, it can be helpful to check the imblearn source code on GitHub directly: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/over_sampling/_random_over_sampler.py

eyShin