I have a binary classification problem with a fairly balanced dataset: 56% class 0 and 44% class 1. I trained RandomForest, XGBoost, and LightGBM models. My features are a mix of categorically encoded, frequency-encoded, and ordinally encoded columns, plus two numerical features. Accuracy, AUC, and precision all look good (all in the high 90s). But when I look at the confusion matrix, the TP count is way higher than the total number of actual positives in the test set, and the TN count similarly exceeds the actual negatives. That should be impossible, since TP + FN must equal the number of actual positives. I have done the following:
- Checked class imbalance – the dataset is fairly balanced (56/44).
- Feature encoding – tried full integer encoding as well as the setup described above (see the sketch after this list).
- Models – tried different ones, but did not tune hyperparameters.
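
For context, my preprocessing looks roughly like the sketch below (made-up column names and data; scikit-learn has no built-in frequency encoder, so that step is done manually on the DataFrame):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical columns standing in for my real features.
cat_cols = ["cat_a"]           # categorically (one-hot) encoded
ord_cols = ["ord_a"]           # ordinally encoded
freq_cols = ["freq_a"]         # frequency encoded (done manually below)
num_cols = ["num_a", "num_b"]  # the two numerical features

df = pd.DataFrame({
    "cat_a": ["x", "y", "x", "z"],
    "ord_a": ["low", "mid", "high", "low"],
    "freq_a": ["p", "p", "q", "r"],
    "num_a": [1.0, 2.5, 3.1, 0.7],
    "num_b": [10, 20, 30, 40],
})

# Frequency encoding: replace each category by its relative frequency.
for c in freq_cols:
    df[c] = df[c].map(df[c].value_counts(normalize=True))

pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("ord", OrdinalEncoder(categories=[["low", "mid", "high"]]), ord_cols),
], remainder="passthrough")  # frequency + numeric columns pass through

X = pre.fit_transform(df)
```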
Is there anything else I should look at to diagnose this problem?
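
For reference, here is a minimal, self-contained sketch of the sanity check I would expect to pass (synthetic data standing in for my real test labels and predictions). By construction, TP + FN must equal the actual positives and TN + FP the actual negatives, so TP alone can never exceed the actual positive count:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def check_confusion_counts(y_true, y_pred):
    """Verify confusion-matrix counts against the actual class totals."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # A length mismatch between labels and predictions (e.g. test labels
    # paired with train predictions) is one way impossible counts arise.
    assert len(y_true) == len(y_pred), "label/prediction length mismatch"

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    actual_pos = int((y_true == 1).sum())
    actual_neg = int((y_true == 0).sum())

    # Row sums of the confusion matrix must match the actual class counts.
    print(f"TP={tp}, FN={fn}, TP+FN={tp + fn}, actual positives={actual_pos}")
    print(f"TN={tn}, FP={fp}, TN+FP={tn + fp}, actual negatives={actual_neg}")

# Example with synthetic labels and predictions:
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)
y_pred = rng.integers(0, 2, size=100)
check_confusion_counts(y_true, y_pred)
```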