I am new to machine learning and need help. I have a dataset with two classes (0, 1), where 0 is Profitable and 1 is Unprofitable. The ratio of 0:1 in the training set is 150:52.
Taking "1" (Unprofitable) as the positive class and "0" (Profitable) as the negative class, a False Negative costs $4900 and a False Positive costs $4000. The objective is to maximize profit = 4000*(No. of True Negatives - No. of False Positives) - 4900*(No. of False Negatives), and the result needs to be greater than $1775 (the base profit without any model).
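To make the objective concrete, here is a minimal sketch of the profit formula in Python. The function name `profit` and its parameter names are hypothetical, chosen only for illustration; the labels follow the convention above (1 = Unprofitable = positive).

```python
def profit(y_true, y_pred, tn_gain=4000, fp_cost=4000, fn_cost=4900):
    """Profit = 4000*(TN - FP) - 4900*FN, with 1 (Unprofitable) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    # True Negative: actually Profitable (0) and predicted Profitable (0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # False Positive: actually Profitable (0) but predicted Unprofitable (1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    # False Negative: actually Unprofitable (1) but predicted Profitable (0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tn_gain * tn - fp_cost * fp - fn_cost * fn


# Toy labels (made up for illustration, not from my dataset)
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
print(profit(y_true, y_pred))  # TN=2, FP=1, FN=1 -> 4000*(2-1) - 4900 = -900
```

True Positives do not appear in the formula, so they contribute nothing to the profit either way.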
From EDA I found Years at Employer, Debt/Income Ratio and Age to be the most important predictors. With xgboost and scale_pos_weight = 3 I get excellent results on the training set, but the model fails badly (overfits) on the test set.
No matter what I try, I am not able to improve the profit on the test set beyond $1375 (as mentioned above, it needs to be greater than $1775).
Even rpart with a loss function does not help much. Can anyone please provide any input?
However, if I take an alternative approach (i.e. keep only observations with Years at Employer < 20) and then apply glm or rpart, the results are really good on both the train and test sets. But is this approach even right? (I did this because EDA shows that, in the train set, all unprofitable customers had Years at Employer < 20.)
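Read one way, the filter is a two-stage predictor: anyone at or above the cutoff is predicted Profitable (0), and the fitted model only scores the remaining rows. The sketch below illustrates that reading in Python. The names `predict_with_rule`, `model_predict`, `years_at_employer` and `debt_income` are all hypothetical, and the stand-in model is a toy rule, not my actual glm or rpart fit.

```python
def predict_with_rule(rows, model_predict, cutoff=20):
    """Predict 0 (Profitable) when years at employer >= cutoff, else defer to the model."""
    preds = []
    for row in rows:
        if row["years_at_employer"] >= cutoff:
            # Rule taken from EDA on the train set: no unprofitable customers seen here
            preds.append(0)
        else:
            # Below the cutoff, use whatever model was fitted on the filtered data
            preds.append(model_predict(row))
    return preds


# Toy stand-in for a fitted model (an assumption for illustration only)
def toy_model(row):
    return 1 if row["debt_income"] > 0.5 else 0


rows = [
    {"years_at_employer": 25, "debt_income": 0.9},
    {"years_at_employer": 5, "debt_income": 0.9},
    {"years_at_employer": 10, "debt_income": 0.1},
]
print(predict_with_rule(rows, toy_model))  # [0, 1, 0]
```

Written this way, the combined rule produces a prediction for every row, so it can be scored on the full test set rather than only on the filtered subset.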