Binary Classification problem for imbalanced dataset

Question

I am new to machine learning and need help. I have a dataset with two classes (0, 1), where 0 is Profitable and 1 is Unprofitable. The ratio of 0:1 in the training set is 150:52.

Taking "1" (Unprofitable) as positive and "0" (Profitable) as negative, the False Negative cost is 4900 and the False Positive cost is 4000. The objective is to maximize f(profit) = 4000*(No of True Negative - No of False Positive) - (4900*No of False Negative), such that it exceeds $1775 (the base profit without any model).
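For concreteness, the objective above can be written as a small function. This is only a sketch of the formula as stated in the question; the function name `profit`, its parameter names, and the toy labels are made up for illustration and are not from the actual dataset.

```python
def profit(y_true, y_pred, gain_tn=4000, cost_fp=4000, cost_fn=4900):
    """Profit as stated in the question: 4000*(TN - FP) - 4900*FN.

    Labels: 1 = Unprofitable (positive), 0 = Profitable (negative).
    """
    tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1  # profitable customer, predicted profitable
        elif actual == 0 and predicted == 1:
            fp += 1  # profitable customer, predicted unprofitable
        elif actual == 1 and predicted == 0:
            fn += 1  # unprofitable customer, predicted profitable
        # true positives (actual 1, predicted 1) do not enter the formula
    return gain_tn * tn - cost_fp * fp - cost_fn * fn


# Hypothetical toy labels, only to show the call
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0]
print(profit(y_true, y_pred))  # 3 TN, 1 FP, 1 FN -> 12000 - 4000 - 4900 = 3100
```

Evaluating candidate models with this function on held-out data, rather than with accuracy, keeps the comparison aligned with the stated objective.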

From EDA I found Years at employer, Debt/Income Ratio and Age to be the most important predictors. With xgboost and scale_pos_weight = 3 I get excellent results on the training set, but the model fails badly (overfits) on the test set.

No matter how much I try, I am not able to improve the profit on the test set beyond $1375 (as mentioned above, it needs to exceed $1775).

Even rpart with a loss function does not help much. Can anyone please provide any input?

However, if I take an alternative approach (i.e. keep only observations with Years at employer < 20) and then apply glm or rpart, the results are really great on both the train and test sets. But is this approach even right? (I did this because EDA shows that all unprofitable customers in the train set had Years at employer < 20.)

Maybe dups: https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression, https://stats.stackexchange.com/questions/283170/when-is-unbalanced-data-really-a-problem-in-machine-learning, https://stats.stackexchange.com/questions/235808/binary-classification-with-strongly-unbalanced-classes, https://stats.stackexchange.com/questions/247871/what-is-the-root-cause-of-the-class-imbalance-problem, https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models — kjetil b halvorsen, Dec 27 '19 at 12:21

score 0 · Answer 1 · answered Dec 27 '19 at 12:43

Sometimes the base profit without any model is truly the best one can do. Compare "How to know that your machine learning problem is hopeless?" and, tangentially related, "Is it unusual for the MEAN to outperform ARIMA?"

If your alternative approach works in production, then it works. There is little "right" or "wrong" here. Do you automatically classify everyone with <20 years tenure as "unprofitable"? Note that "all unprofitable customers had <20 years tenure" is not the same as "everyone with <20 years tenure is unprofitable"!
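The distinction between those two statements can be checked directly on the training data by computing both conditional shares. The sketch below uses invented toy records and hypothetical helper names (`share`, `short_tenure`, `unprofitable`); it only illustrates that the two quantities can differ widely.

```python
# Hypothetical toy records: (years_at_employer, label), with 1 = Unprofitable
records = [(3, 1), (8, 0), (12, 1), (15, 0), (18, 0), (25, 0), (30, 0), (5, 0)]


def share(rows, condition, event):
    """Fraction of the rows satisfying `condition` that also satisfy `event`."""
    selected = [r for r in rows if condition(r)]
    return sum(event(r) for r in selected) / len(selected)


def short_tenure(row):
    return row[0] < 20


def unprofitable(row):
    return row[1] == 1


# "All unprofitable customers had <20 years tenure"
p_short_given_unprofitable = share(records, unprofitable, short_tenure)
# "Everyone with <20 years tenure is unprofitable"
p_unprofitable_given_short = share(records, short_tenure, unprofitable)

print(p_short_given_unprofitable)  # 1.0 in this toy data
print(p_unprofitable_given_short)  # 2 of 6 short-tenure rows, about 0.33
```

In this toy data the first share is 1.0 while the second is only about one third, so a tenure below 20 years on its own says little about whether a customer is unprofitable.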

My answer to "Classification probability threshold" may be helpful.

Thanks for your input. No, I do not automatically classify everyone with <20 years as Unprofitable. I just took the subset where Years at employer was less than 20 and then built the model to predict Profitable/Unprofitable. — Ps1979, Dec 28 '19 at 19:39
