Suppose a binary classification task on an unbalanced dataset (10% positive records). I am using LightGBM and would like to better understand the difference between the combination of pos_bagging_fraction and neg_bagging_fraction vs is_unbalance vs scale_pos_weight. I initially thought that 1) setting neg_bagging_fraction to the prior target probability (the inverse proportion, to balance the data) would produce the same results as 2) setting is_unbalance to True and 3) setting scale_pos_weight to the number of negative records divided by the number of positive records.
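As a quick worked example of the quantities involved (illustrative 90/10 counts here; the exact values for my dataset are computed further below):
# Illustrative counts for a 90/10 split, just to fix ideas
n_neg, n_pos = 900, 100
prior = n_pos / (n_neg + n_pos)  # 0.1 -> my candidate neg_bagging_fraction
ratio = n_neg / n_pos            # 9.0 -> my candidate scale_pos_weight
print(prior, ratio)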
While the latter two (2 and 3) do produce identical results, I understand that scale_pos_weight is more versatile than is_unbalance because it accepts a broad range of values, letting the user tune the weighting to improve the various models.
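For instance (a hypothetical sketch reusing the train_data and test_data built below), one could sweep scale_pos_weight like any other hyperparameter:
# Hypothetical sweep: unlike the boolean is_unbalance, scale_pos_weight
# accepts any positive value, so it can be tuned directly.
for w in [1.0, 4.0, 8.694444444444445, 12.0]:
    sweep_params = {"seed": 1, "objective": "binary", "metric": "auc", "scale_pos_weight": w}
    evals = {}
    lgb.train(params=sweep_params, train_set=train_data, num_boost_round=100,
              valid_sets=[test_data], valid_names=["test"],
              callbacks=[lgb.record_evaluation(evals)])
    print(w, evals["test"]["auc"][-1])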
The former (1), however, outputs different results from the other two. Is it because of the sampling...? I set bagging_fraction to 0.7 in all 3 cases, so 70% of the records are sampled whatever the case... So I am not sure.
Also, based on this, which parameters would you advise for dealing with unbalanced data: scale_pos_weight, or pos_bagging_fraction together with neg_bagging_fraction?
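To summarize, the three configurations I am comparing are (sketch; full code below):
# Option 1: class-dependent bagging rates
cfg1 = {"pos_bagging_fraction": 0.896848, "neg_bagging_fraction": 0.103152}
# Option 2: let LightGBM rebalance the labels automatically
cfg2 = {"is_unbalance": True}
# Option 3: weight positives by the negative/positive ratio
cfg3 = {"scale_pos_weight": 8.694444444444445}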
This post might somehow be related but doesn't address my question: Difference between class_weight and scale_pos_weight in LightGBM
Code below:
Creation of the dataset
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
def function_stratified_sampling(df, target_label, training_perc):
    """Function sampling records based on target density.
    Warning: the sampling doesn't consider the design data (i.e. levels of
    categorical features might not be complete in the training set)."""
    # Creating design and response dataframes
    df_X = df.drop([target_label], axis=1)
    df_Y = df[[target_label]]
    # Test proportion
    test_perc = 1 - training_perc
    # Stratified sampling
    X_train, X_test, Y_train, Y_test = train_test_split(df_X, df_Y, test_size=test_perc, random_state=0, stratify=df_Y)
    # Training set
    df_training = df[df.index.isin(X_train.index)]
    # Test set
    df_test = df[df.index.isin(X_test.index)]
    return df_training, df_test
X, Y = make_classification(n_samples=500, n_features=10, n_informative=3, n_classes=2, weights=[0.9, 0.1], random_state=1)
X_df = pd.DataFrame(X)
Y_df = pd.DataFrame(Y, columns=["label"])
all_df = pd.concat([X_df, Y_df], axis=1)
tr, te = function_stratified_sampling(df=all_df, target_label="label", training_perc=0.7)
train_data = lgb.Dataset(data=tr.drop(["label"], axis=1), label=tr[["label"]], free_raw_data=False)
test_data = lgb.Dataset(data=te.drop(["label"], axis=1), label=te[["label"]], free_raw_data=False)
Target investigation
print(tr.label.value_counts())
print(tr.label.value_counts() / tr.shape[0])
# prior target probability = 0.103152
print(tr.label.value_counts()[0] / tr.label.value_counts()[1])
# number of negative / number of positive = 8.694444444444445
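As a sanity check, these two quantities are linked by number_negative / number_positive = (1 - prior) / prior:
prior = 0.103152
print((1 - prior) / prior)  # ~8.694444444444445, matching the ratio above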
LightGBM models
#Overall parameters
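# Note: "num_iterations" in params takes precedence over the num_boost_round
# argument passed to lgb.train below, so each model trains 100 rounds.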
param = {"seed": 1, "objective": "binary", "tree_learner": "serial", "metric": "auc", "boosting": "gbdt", "linear_tree": True, "learning_rate": 0.01,
"num_iterations": 100, "feature_fraction": 0.1, "bagging_fraction": 0.7}
#Model 1
param["pos_bagging_fraction"]=0.896848
param["neg_bagging_fraction"]=0.103152 #prior target probability
lightgbm1 = lgb.train(train_set=train_data, params=param,
valid_sets=[train_data, test_data],
valid_names=["training", "test"],
num_boost_round=10, callbacks=[lgb.early_stopping(stopping_rounds=5)],
keep_training_booster=True)
#Model 2
del param["pos_bagging_fraction"]
del param["neg_bagging_fraction"]
param["is_unbalance"]=True
lightgbm2 = lgb.train(train_set=train_data, params=param,
valid_sets=[train_data, test_data],
valid_names=["training", "test"],
num_boost_round=10, callbacks=[lgb.early_stopping(stopping_rounds=5)],
keep_training_booster=True)
#Model 3
del param["is_unbalance"]
param["scale_pos_weight"]=8.694444444444445 #number of negative/number of positive
lightgbm3 = lgb.train(train_set=train_data, params=param,
valid_sets=[train_data, test_data],
valid_names=["training", "test"],
num_boost_round=10, callbacks=[lgb.early_stopping(stopping_rounds=5)],
keep_training_booster=True)
Evaluation
print(lightgbm1.eval_valid(feval=None)) #0.8578431372549019
print(lightgbm2.eval_valid(feval=None)) #0.9044117647058824
print(lightgbm3.eval_valid(feval=None)) #0.9044117647058824
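To double-check that models 2 and 3 are truly identical (not just tied on AUC), one can compare their raw predictions element-wise:
import numpy as np
# If (2) and (3) really coincide, the predicted probabilities should match
# element-wise, not merely produce the same AUC.
preds2 = lightgbm2.predict(te.drop(["label"], axis=1))
preds3 = lightgbm3.predict(te.drop(["label"], axis=1))
print(np.allclose(preds2, preds3))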