We are building a predictive model with a binary outcome (a classic classifier: the patient either developed the disease or did not) from longitudinal/panel data. Each patient contributes one or more observations, depending on whether they left the study early.

The outcome [disease: yes (positive class) or no (negative class)] is imbalanced: the negative class is far more common than the positive class.

Now the question is: would it make any sense to use an oversampling technique (e.g. SMOTE and similar) to balance the outcome classes, given that we have longitudinal/panel data? Could SMOTE/oversampling, by ignoring the correlation among observations from the same patient, just introduce more noise into the analysis?
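To make the concern concrete, here is a minimal sketch of the interpolation step that SMOTE-style methods rely on, written in plain Python rather than with any particular library. The patient ids, visits, and feature values are invented toy data, and `smote_like_sample` is a hypothetical helper using a single nearest neighbour; it is only meant to illustrate the idea, not to reproduce a specific implementation.

```python
import random

# Hypothetical toy data (not from our study): positive-class rows only,
# stored as (patient_id, visit_number, feature_vector).
minority_rows = [
    ("A", 1, (1.0, 2.0)),
    ("A", 2, (1.2, 2.1)),
    ("B", 1, (1.1, 2.0)),
    ("B", 2, (3.0, 0.5)),
]


def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def smote_like_sample(rows, rng):
    """Create one synthetic row per minority row (SMOTE-style, k = 1).

    The neighbour search uses only the feature vectors, so it is blind
    to which patient each row belongs to.
    """
    synthetic = []
    for i, (pid, _, x) in enumerate(rows):
        others = [r for j, r in enumerate(rows) if j != i]
        nn_pid, _, nn_x = min(others, key=lambda r: squared_distance(x, r[2]))
        lam = rng.random()  # interpolation weight in [0, 1)
        new_x = tuple(a + lam * (b - a) for a, b in zip(x, nn_x))
        synthetic.append({"parents": (pid, nn_pid), "x": new_x})
    return synthetic


synthetic_rows = smote_like_sample(minority_rows, random.Random(0))
mixed = [s for s in synthetic_rows if s["parents"][0] != s["parents"][1]]
print(f"{len(mixed)} of {len(synthetic_rows)} synthetic rows blend two different patients")
```

In this toy case, every row's nearest positive-class neighbour belongs to the other patient, so each synthetic observation is a blend of two patients and has no well-defined patient id or visit time. That is the within-patient correlation structure we are worried about losing.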

Thanks a lot in advance for your support.

  • Welcome to Cross Validated! Statisticians do not see class imbalance as such a problem, and there is no need to use artificial balancing to solve a non-problem. It might be helpful if you say why you find the imbalance problematic. https://stats.stackexchange.com/questions/357466 https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/ https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Mar 03 '22 at 10:47
  • Related to Dave's excellent comment: you have fallen into a very common machine learning trap of wrongly conceptualizing a sensible analysis of _tendencies_ to get a disease (probability of disease) as a forced-choice, premature-decision classification problem. Medical diagnosis and prognosis are areas for which classification is singularly unhelpful, and tendencies are everything. – Frank Harrell Mar 03 '22 at 12:36
