
I am new to ML & DS and I have a dataset with a 9:1 class imbalance for binary classification, as an assignment. Could you please guide me in this regard? Also, which classifier is best for imbalanced binary classification?

Regards.

Sid_Mirza
  • Related: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Apr 08 '19 at 19:10

1 Answer


You got off on the wrong foot by conceptualizing this as a classification problem. The fact that $Y$ is binary has nothing to do with trying to make classifications. And when the balance of $Y$ is far from 1:1 you need to think about modeling tendencies for $Y$, not modeling $Y$. In other words, the appropriate task is to estimate $P(Y=1 | X)$ using a model such as the binary logistic regression model. The logistic model is a direct probability estimator. Details may be found here and here.
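As a minimal sketch of this approach (using scikit-learn and a synthetic 9:1 dataset as stand-ins, since the real data isn't shown), fitting a binary logistic regression and reading off estimated probabilities rather than hard class labels looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data with a 9:1 class imbalance (stand-in for the real dataset)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Logistic regression estimates P(Y=1 | X) directly; no resampling or
# "rebalancing" of the classes is needed for this.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # estimated P(Y=1 | X) per row
```

Note that `predict_proba` is used instead of `predict`: the point is to keep the tendency estimates, not to force an arbitrary 0.5 cutoff.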

Once you have a validated probability model and a utility/cost/loss function you can generate optimum decisions. The probabilities help to trade off the consequences of wrong decisions.

Frank Harrell
  • Thanks, Sir Frank Harrell. The dataset is in floating point values, but the target is binary, the 'Y' you mentioned. I applied linear regression, random forests, decision trees, and some ensemble methods; linear regression gave an AUC of 78.2%, whereas random forests and LightGBM performed better. Now I want to increase the AUC. Here is the list of parameters I used for lgb: – Sid_Mirza Apr 08 '19 at 17:18
  • `params = {"objective": "binary", "metric": "auc", "boosting": "gbdt", "max_depth": -1, "num_leaves": 13, "learning_rate": 0.01, "bagging_freq": 5, "bagging_fraction": 0.4, "feature_fraction": 0.05, "min_data_in_leaf": 80, "min_sum_hessian_in_leaf": 10, "tree_learner": "serial", "boost_from_average": "false", "bagging_seed": random_state, "verbosity": 1, "seed": random_state}` (note: the parameter name was originally misspelled `min_sum_heassian_in_leaf`, which LightGBM would silently ignore) – Sid_Mirza Apr 08 '19 at 17:21
  • Is the binary target a derivation from a floating point continuous outcome variable? If so you will need to go back to that variable and not use an information-losing dichotomization. – Frank Harrell Apr 09 '19 at 04:11
  • Yes, all 100+ attributes have continuous values, on the basis of which we have to classify the target as binary, either yes or no. – Sid_Mirza Apr 09 '19 at 19:29
  • I assume by that you mean that the target originated as binary in its rawest form. You are still trying to cast the problem inappropriately as classification. You cannot do anything but estimate tendencies, nor should you. Once you have probability estimates you can make optimum decisions given the loss function. – Frank Harrell Apr 10 '19 at 11:31