
I am trying to run a binary classification problem separating people with diabetes from people without diabetes.

For labeling my dataset, I followed a simple rule: if a person has T2DM in his medical records, we label him as a positive case (diabetes), and if he doesn't have T2DM, we label him as Non-T2DM.
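In code terms, the rule is roughly the following (an illustrative pandas sketch; the column names are made up and not my real schema):

```python
# Illustrative labeling sketch (hypothetical column names, not the real schema).
import pandas as pd

records = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "diagnoses": [["T2DM", "hypertension"], ["asthma"], ["T2DM"]],
})

# Positive case (diabetes) if T2DM appears anywhere in the diagnosis list.
records["label"] = records["diagnoses"].apply(lambda dx: int("T2DM" in dx))
```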

Since there are a lot of data points for each subject (many lab measurements, many drugs taken, many recorded diagnoses, and so on), I ended up with 1370 features for each patient.

My training set has 2475 patients and my test set has 2475 patients. (I already tried a 70:30 split; I am now trying 50:50 and get the same result as with 70:30.)

My results are too good to be true, as shown below.

[Screenshots of the classification results: accuracy, sensitivity, specificity, and related metrics.]

a) Should I reduce the number of features?

b) Is it overfitting?

c) Should I retain only the top features, e.g., the top 20 or top 10 features?

d) Can you help me understand why this is happening?

The Great
    [Please don't just cross-post nearly-identical questions across multiple sites.](https://datascience.stackexchange.com/q/84567/2853) Or at least link them together, so we don't duplicate work. – Stephan Kolassa Oct 27 '20 at 21:07

1 Answer


Yes, you are overfitting, and very much so.

Use a holdout sample to assess your model's predictive capabilities. Optimizing in-sample fits will usually lead to overfitting, especially with a large number of predictors, as in your case.
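As a minimal sketch of what holdout evaluation looks like (assuming a scikit-learn workflow; the synthetic data below merely stands in for your patient matrix):

```python
# Holdout evaluation sketch: fit on one half, score on the other.
# Synthetic stand-in data; assumes scikit-learn is available.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=4950, n_features=1370,
                           n_informative=20, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))
# A large gap between the two numbers is the classic signature of overfitting.
```

If the test score stays suspiciously close to perfect even on a properly held-out sample, suspect target leakage instead, e.g., features (such as diabetes medications) that directly encode the label.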

To deal with your large number of predictors, consider a regularized/penalized model, like GLMNET or similar.
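GLMNET itself is an R package; a rough scikit-learn analogue (a sketch only, with illustrative hyperparameters, not tuned recommendations) is elastic-net-penalized logistic regression with cross-validated regularization strength:

```python
# Elastic-net logistic regression, roughly analogous to R's glmnet.
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(
    StandardScaler(),               # penalties assume comparably scaled features
    LogisticRegressionCV(
        penalty="elasticnet",
        solver="saga",              # the sklearn solver that supports elastic net
        l1_ratios=[0.1, 0.5, 0.9],  # mix between L1 (sparse) and L2 (ridge)
        Cs=10,                      # grid of inverse regularization strengths
        cv=5,
        max_iter=5000,
    ),
)
model.fit(X_train, y_train)         # reusing the split from the sketch above
```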

Finally, most of the KPIs your screenshot shows are misleading. The usual criticisms of accuracy (it is an improper scoring rule that depends on an arbitrary classification threshold) apply just as much to sensitivity, specificity, etc.
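Proper scoring rules instead evaluate the predicted probabilities directly rather than thresholded class labels; a short sketch, reusing the fitted model and test split from above:

```python
# Proper scoring rules: assess predicted probabilities, not hard labels.
from sklearn.metrics import brier_score_loss, log_loss

proba = model.predict_proba(X_test)[:, 1]                # P(T2DM) per test patient
print("Brier score:", brier_score_loss(y_test, proba))   # lower is better
print("log loss:   ", log_loss(y_test, proba))           # lower is better
```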

Stephan Kolassa
    My understanding is that OP is evaluating on a test set, so I disagree with the diagnosis of overfitting: either there is data leakage, in which case the test set is invalid and we can't conclude anything, or the high performance is obtained on a valid test set, and this is not consistent with overfitting. – Erwan Oct 27 '20 at 21:41