
Suppose a binary classification task with an unbalanced dataset (10% positive records). I am using LightGBM and would like to better understand the difference between the combined pos_bagging_fraction/neg_bagging_fraction parameters vs is_unbalance vs scale_pos_weight. My first thought was that 1) setting neg_bagging_fraction to the prior target probability (the inverse proportion, to balance the data) would give the same results as 2) setting is_unbalance to True and 3) setting scale_pos_weight to the number of negative records divided by the number of positive records.

While the two latter options (2 and 3) do give identical results, I understand that scale_pos_weight is more flexible than is_unbalance because it accepts a broader range of values, letting the user tune the weighting across the various models.

The former option (1) gives different results from the other two. Is it because of the sampling? I set bagging_fraction to 0.7 in all three cases, so 70% of the records are sampled whatever the case... So I am not sure.
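For reference, the back-of-the-envelope arithmetic behind option 1 (assuming the per-class fractions are applied independently to each class on every bagging round; the counts are the ones implied by the prints in the target investigation below):

# Rough sketch of what the per-class bagging fractions should do to the
# training sample, assuming each fraction is applied to its own class.
n_pos, n_neg = 36, 313                                # counts from the training set below
pos_bagging_fraction = 1 - n_pos / (n_pos + n_neg)    # ~0.896848
neg_bagging_fraction = n_pos / (n_pos + n_neg)        # ~0.103152
print(n_pos * pos_bagging_fraction)                   # ~32 positives kept per round
print(n_neg * neg_bagging_fraction)                   # ~32 negatives kept per round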

Also, based on that, which parameters would you advise for dealing with unbalanced data: scale_pos_weight, or pos_bagging_fraction and neg_bagging_fraction?

This post might somehow be related but doesn't address my question: Difference between class_weight and scale_pos_weight in LightGBM

Code below:

Creation of the dataset

import pandas as pd
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

def function_stratified_sampling(df, target_label, training_perc):
    
    """ Function sampling records based on target density
        Warning: the sampling doesn't consider the design data (i.e. levels of categorical features might not be complete in the training set)"""
        
    #Creating design and response dataframe
    df_X=df.drop([target_label], axis=1)
    df_Y=df[[target_label]]
    #Test proportion
    test_perc=1-training_perc
    
    #Stratified sampling
    X_train, X_test, Y_train, Y_test=train_test_split(df_X, df_Y, test_size=test_perc, random_state=0, stratify=df_Y)
    
    #Training set
    df_training=df[df.index.isin(X_train.index)]
    #Test set
    df_test=df[df.index.isin(X_test.index)]
    
    return df_training, df_test

X, Y = make_classification(n_samples=500, n_features=10, n_informative=3, n_classes=2, weights=[0.9, 0.1], random_state=1)
X_df=pd.DataFrame(X)
Y_df=pd.DataFrame(Y, columns=["label"])
all_df=pd.concat([X_df, Y_df], axis=1)
tr, te=function_stratified_sampling(df=all_df, target_label="label", training_perc=0.7)
train_data = lgb.Dataset(data=tr.drop(["label"], axis=1), label=tr[["label"]], free_raw_data=False)
test_data = lgb.Dataset(data=te.drop(["label"], axis=1), label=te[["label"]], free_raw_data=False)

Target investigation

print(tr.label.value_counts())
print(tr.label.value_counts()/tr.shape[0]) 
#prior target probability=0.103152
print(tr.label.value_counts()[0]/tr.label.value_counts()[1]) 
#number of negative/number of positive=8.694444444444445
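
Side note: the hard-coded values used in the models below could also be derived from these counts directly; a minimal sketch:

# Derive the class-balance quantities used below from the training labels
# instead of hard-coding them (the values match the prints above).
counts = tr.label.value_counts()
neg_bag = counts[1] / tr.shape[0]   # ~0.103152 -> neg_bagging_fraction (model 1)
pos_bag = 1 - neg_bag               # ~0.896848 -> pos_bagging_fraction (model 1)
spw = counts[0] / counts[1]         # ~8.6944   -> scale_pos_weight (model 3)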

LightGBM models

#Overall parameters
param = {"seed": 1, "objective": "binary", "tree_learner": "serial", "metric": "auc", "boosting": "gbdt", "linear_tree": True, "learning_rate": 0.01, 
          "num_iterations": 100, "feature_fraction": 0.1, "bagging_fraction": 0.7}

#Model 1
param["pos_bagging_fraction"]=0.896848
param["neg_bagging_fraction"]=0.103152 #prior target probability

lightgbm1 = lgb.train(train_set=train_data, params=param,  
                      valid_sets=[train_data, test_data], 
                      valid_names=["training", "test"],
                      num_boost_round=10, callbacks=[lgb.early_stopping(stopping_rounds=5)], 
                      keep_training_booster=True)

#Model 2
del param["pos_bagging_fraction"]
del param["neg_bagging_fraction"]
param["is_unbalance"]=True

lightgbm2 = lgb.train(train_set=train_data, params=param,  
                      valid_sets=[train_data, test_data], 
                      valid_names=["training", "test"],
                      num_boost_round=10, callbacks=[lgb.early_stopping(stopping_rounds=5)], 
                      keep_training_booster=True)

#Model 3
del param["is_unbalance"]
param["scale_pos_weight"]=8.694444444444445 #number of negative/number of positive

lightgbm3 = lgb.train(train_set=train_data, params=param,  
                      valid_sets=[train_data, test_data], 
                      valid_names=["training", "test"],
                      num_boost_round=10, callbacks=[lgb.early_stopping(stopping_rounds=5)], 
                      keep_training_booster=True)

Evaluation

print(lightgbm1.eval_valid(feval=None)) #0.8578431372549019
print(lightgbm2.eval_valid(feval=None)) #0.9044117647058824
print(lightgbm3.eval_valid(feval=None)) #0.9044117647058824
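
As a cross-check, the same comparison can be made on the raw test predictions with scikit-learn (a minimal sketch, assuming sklearn is available):

from sklearn.metrics import roc_auc_score

# Score each booster on the held-out test features directly.
X_te = te.drop(["label"], axis=1)
for name, booster in [("model 1", lightgbm1), ("model 2", lightgbm2), ("model 3", lightgbm3)]:
    print(name, roc_auc_score(te["label"], booster.predict(X_te)))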