machine learning and class imbalance

Question

I'm trying to apply machine learning to classify around 50 diseases according to their protein intensities (i.e., disease X is characterized by abnormal levels of proteins a, b, and c). I've tried random forest and XGBoost, but both do poorly for rare diseases that have only 1-2 samples. These rare diseases suffer from class imbalance, as normal samples and more common diseases can be 10-100x more prevalent in the dataset. A literature search shows that both class imbalance and low sample numbers are problematic for machine learning. Under- and over-sampling (e.g., via SMOTE) have been ruled out as not acceptable here. In randomForest, I've tried classwt and sampsize + strata (2:2:2, etc., which is just down-sampling), but the gains for the rare diseases were minimal; instead, many of the true negatives turned into false positives. Interestingly, adding synthetic rare-disease samples (made-up intensities for the rare disease's proteins) immensely increases the corresponding prediction accuracy without increasing the false positive rate.
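To make the class-weighting idea concrete, here is a minimal sketch of one common heuristic, inverse-frequency weights. It is written in plain Python rather than R, and the label counts are hypothetical, chosen only to mirror the 10-100x imbalance described above. It is not necessarily how classwt was set in the actual runs.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical label distribution: 200 normal, 20 common-disease, 2 rare-disease samples.
labels = ["normal"] * 200 + ["common"] * 20 + ["rare"] * 2
weights = inverse_frequency_weights(labels)

for cls, w in weights.items():
    print(f"{cls}: {w:.2f}")
```

With these counts, each rare sample is weighted 100 times as heavily as a normal sample. Reweighting changes how much each error costs, but it adds no information about how the rare class varies, which may be part of why the gains were small.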

It feels like there should be a solution, because the protein intensities in these rare diseases can be 3-5 logs higher than those found in normal samples. Classification improves when the total number of diseases is reduced to single digits. Hyperparameters have been tuned exhaustively via caret. It's as if the learner just doesn't look at these rare diseases.
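To illustrate how large a 3-5 log difference is relative to normal variation, here is a minimal sketch that scores each protein's log10 intensity against a normal reference panel. All intensities, the protein names, and the threshold of 5 are made-up assumptions for illustration. This is a simple screening rule, not a trained classifier.

```python
import math
from statistics import mean, stdev

def log10_z_scores(sample, normal_samples):
    """z-score of each protein's log10 intensity relative to the normal samples."""
    scores = {}
    for protein, value in sample.items():
        ref = [math.log10(s[protein]) for s in normal_samples]
        scores[protein] = (math.log10(value) - mean(ref)) / stdev(ref)
    return scores

def flag_elevated(sample, normal_samples, threshold=5.0):
    """Return the proteins whose z-score exceeds the (assumed) threshold."""
    scores = log10_z_scores(sample, normal_samples)
    return sorted(p for p, z in scores.items() if z > threshold)

# Hypothetical normal reference panel (arbitrary intensity units).
normals = [
    {"a": 100, "b": 120, "c": 90},
    {"a": 110, "b": 100, "c": 95},
    {"a": 95, "b": 130, "c": 105},
    {"a": 105, "b": 110, "c": 100},
]

# Hypothetical rare-disease sample: proteins a and c are several logs above normal.
patient = {"a": 1_000_000, "b": 115, "c": 200_000}

print(flag_elevated(patient, normals))
```

A signal this strong stands out without any training, so the difficulty is less about detecting elevated proteins and more about mapping them to one of about 50 labels from only 1-2 examples.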

Are there ML models that specialize in this scenario?

It's highly unlikely that *any* method will work with only 1-2 samples. The problem isn't class imbalance, it's that there's too little data to learn from. And, there's not enough data to properly check performance of a trained model. — user20160, Oct 28 '19 at 05:43
+1 to [@user20160's comment](https://stats.stackexchange.com/questions/433406/machine-learning-and-class-imbalance#comment808461_433406). Also, that unbalanced classes are a problem is a piece of essentially baseless folklore: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) — Stephan Kolassa, Oct 28 '19 at 07:50


0 Answers