The subject of my dissertation is predicting mortality in ICU patients. My dependent variable, mortality, has 150 dead and 2630 alive patients. With logistic regression, the accuracy, the F1 score, and the area under the ROC curve all come out equal to 1. I also found a few variables that are 99% correlated with the dependent variable. What should I do to solve this problem?
You should be using a proper scoring rule. https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://stats.stackexchange.com/questions/368949/example-when-using-accuracy-as-an-outcome-measure-will-lead-to-a-wrong-conclusio – Dave Jul 28 '21 at 09:49
1 Answer
"And I found a few variables that are 99% correlated with the dependent variable"
Are you sure you're not using a variable that you wouldn't have at the time of prediction?
For example, suppose your dataset (where the final outcome, dead or alive, is already known) has a variable HOUR_OF_DEATH that is empty if the patient survived and filled in if the patient died. Then you're using a variable you wouldn't have in a real case, where the patient is still alive and you don't know whether they will die or not. A 99% correlation seems too high to be legitimate (or else the problem doesn't need a model: you could just look at this variable and answer yes/no according to it).
Also, since your classes are really unbalanced, I'd suggest you stick with AUC / the ROC curve and forget accuracy as a metric.
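One way to screen for this kind of leakage is to check whether any single variable, or merely whether it is filled in, separates the two classes almost perfectly. The following is a minimal sketch using only the Python standard library. The toy data, the helper names (`auc_of_score`, `separation`, `flag_suspect_features`), and the 0.99 threshold are assumptions made for illustration, not part of the original question.

```python
from typing import Dict, List, Optional, Sequence

def auc_of_score(scores: Sequence[float], died: Sequence[int]) -> Optional[float]:
    """AUC of one score used alone (Mann-Whitney form); None if a class is absent."""
    pos = [s for s, d in zip(scores, died) if d == 1]
    neg = [s for s, d in zip(scores, died) if d == 0]
    if not pos or not neg:
        return None
    # Count pairs where the dead patient has the higher score; ties count half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def separation(auc: Optional[float]) -> float:
    """How far from chance the AUC is, regardless of direction (0.5 to 1.0)."""
    return 0.0 if auc is None else max(auc, 1.0 - auc)

def flag_suspect_features(
    columns: Dict[str, List[Optional[float]]],
    died: Sequence[int],
    threshold: float = 0.99,  # assumed cutoff, tune to taste
) -> List[str]:
    suspects = []
    for name, values in columns.items():
        # Check 1: does "is this field filled in?" already give away the outcome?
        filled = [0.0 if v is None else 1.0 for v in values]
        by_missing = separation(auc_of_score(filled, died))
        # Check 2: do the observed values alone separate the classes?
        observed = [(v, d) for v, d in zip(values, died) if v is not None]
        by_value = separation(
            auc_of_score([v for v, _ in observed], [d for _, d in observed])
        )
        if max(by_missing, by_value) >= threshold:
            suspects.append(name)
    return suspects

# Toy data (hypothetical): 1 = died, 0 = survived.
died = [1, 1, 0, 0, 0, 0]
columns = {
    "HOUR_OF_DEATH": [3.0, 14.0, None, None, None, None],
    "age": [70.0, 55.0, 60.0, 80.0, 45.0, 66.0],
}
suspects = flag_suspect_features(columns, died)
print(suspects)  # ['HOUR_OF_DEATH']
```

A flagged variable is not automatically leakage. It is a prompt to ask whether that value could really be known at the moment the prediction is supposed to be made.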
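To see why accuracy is uninformative with 150 deaths against 2630 survivors, consider a "model" that predicts "alive" for everyone. The sketch below uses the class counts from the question; the constant predictor and the helper `auc_of_score` are assumptions made for illustration.

```python
n_dead, n_alive = 150, 2630          # class counts from the question
y_true = [1] * n_dead + [0] * n_alive

# A useless baseline: predict "alive" (0) for every patient.
y_pred = [0] * len(y_true)

accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
recall_dead = sum(p == 1 and t == 1 for p, t in zip(y_pred, y_true)) / n_dead

def auc_of_score(scores, labels):
    """AUC in Mann-Whitney form: share of (dead, alive) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# The same baseline expressed as a constant risk score.
auc_constant = auc_of_score([0.0] * len(y_true), y_true)

print(round(accuracy, 3))   # 0.946
print(recall_dead)          # 0.0
print(auc_constant)         # 0.5
```

The baseline reaches about 94.6% accuracy while identifying none of the deaths, whereas its AUC sits at 0.5, the chance level.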
