Imbalanced dataset - Majority positive class

Question

My dataset consists of 150 patients where 50 are controls/healthy (negative) and 100 are sick (positive).

If I want my model to have high sensitivity at high specificity (the left side of the ROC curve), in other words a low false positive rate, should I correct for the imbalance by applying class weights? Usually the positive class is the minority class, and I see why you would correct for that, but should I in my case?

You may profit from [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) and from [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) - all the criticisms of accuracy as a KPI apply equally to sensitivity and specificity. — Stephan Kolassa, Feb 10 '20 at 06:37
For a *binary* problem, changing the class composition by duplicating observations from one class won't change the ROC curve. Citations can be found here: https://stats.stackexchange.com/questions/111478/unbalanced-dataset-roc-curve-to-compare-classifiers/185059#185059 — Sycorax, Feb 10 '20 at 17:44
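
The point about duplication can be checked numerically. The snippet below is a minimal sketch with made-up scores and labels (the values and the helper `roc_points` are hypothetical, not from the thread). It holds the scores fixed and only duplicates the negative observations at evaluation time. It says nothing about what happens when a model is refit with class weights, which can change the scores themselves.

```python
def roc_points(scores, labels):
    """Return the ROC curve as a list of (FPR, TPR) points.

    An observation is predicted positive when its score >= threshold,
    and every distinct score is used as a threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores: 4 positives (label 1) and 3 negatives (label 0).
scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30]
labels = [1, 1, 0, 1, 0, 1, 0]

# Duplicate every negative observation twice more (3 copies in total).
neg_scores = [s for s, y in zip(scores, labels) if y == 0]
dup_scores = scores + neg_scores * 2
dup_labels = labels + [0] * (len(neg_scores) * 2)

original = roc_points(scores, labels)
duplicated = roc_points(dup_scores, dup_labels)
print(original == duplicated)
```

Both the false positive count and the number of negatives are multiplied by the same factor, so every FPR value is unchanged, and the TPR values never depended on the negatives in the first place.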

Haitao Du · Answer 1 · 2020-02-10T05:23:30.230

What you described is not an imbalanced classification problem. A positive-to-negative ratio of 2:1 is a well-balanced dataset, and most models (such as logistic regression) will be perfectly fine.

Usually, when people talk about the imbalanced problem, they mean ratios of 1000:1 or even worse. Think about credit card fraud detection, where most transactions are legitimate and fraudulent transactions can be as rare as 1 in 10K. Even at such a ratio, logistic regression is still "fine" in most cases (depending on how people will use the model output).

If you want to adjust sensitivity or specificity, try applying a different threshold to the model's output probability. Each threshold corresponds to one point on the ROC curve, so check the ROC curve for details.
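
As a rough illustration of threshold selection, the sketch below picks the lowest threshold that reaches a target specificity and then reports the sensitivity at that threshold. Everything in it is an assumption for illustration: the scores are simulated (50 controls and 100 patients, to mirror the question), and the helper names `threshold_for_specificity`, `specificity`, and `sensitivity` are hypothetical. In practice the scores would be predicted probabilities from a fitted model.

```python
import math
import random

random.seed(0)


def clip(x):
    """Keep a simulated score inside [0, 1]."""
    return min(max(x, 0.0), 1.0)


# Simulated predicted probabilities (assumption: not real patient data).
neg_scores = [clip(random.gauss(0.35, 0.15)) for _ in range(50)]   # controls
pos_scores = [clip(random.gauss(0.65, 0.15)) for _ in range(100)]  # patients


def threshold_for_specificity(neg, target):
    """Smallest threshold t such that 'predict positive if score > t'
    classifies at least a fraction `target` of the negatives correctly."""
    ordered = sorted(neg)
    k = max(1, math.ceil(target * len(ordered)))
    return ordered[k - 1]


def specificity(neg, t):
    """Fraction of negatives with score <= t (true negatives)."""
    return sum(s <= t for s in neg) / len(neg)


def sensitivity(pos, t):
    """Fraction of positives with score > t (true positives)."""
    return sum(s > t for s in pos) / len(pos)


t = threshold_for_specificity(neg_scores, 0.90)
spec = specificity(neg_scores, t)
sens = sensitivity(pos_scores, t)
print(f"threshold={t:.3f} specificity={spec:.2f} sensitivity={sens:.2f}")
```

A threshold chosen and evaluated on the same data gives an optimistic estimate, so the specificity and sensitivity should be checked on held-out data or with cross-validation.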

What I want is to maximize the left side of the ROC curve (as attached in my question) which corresponds to sensitivity at high specificity. — Luis Pinto, Feb 10 '20 at 17:37
