
I am dealing with a dataset composed of both numerical (discrete) and nominal variables and I have to classify a binary response.

Since the dataset is imbalanced, I decided to oversample the minority class using SMOTE (Synthetic Minority Over-sampling Technique). However, since all my numerical variables are discrete, I do not want to create synthetic observations with non-integer values. I therefore decided to address the class imbalance with SMOTE-N, the variant that handles only nominal variables. To do that, I must first bin my numerical variables.

I have read about methods that bin by maximizing the Information Value (IV) and about other entropy-based approaches. How can I choose the most suitable method? And does this choice have to depend on the model that I am going to use for classification?

I found the R package woeBinning, which uses a kind of IV approach and seems to suit the task. However, it does not provide an in-depth explanation, and I do not like to use methods without a solid understanding of the theory on which they are based. As I understand it, it merges bins with similar Weight of Evidence (WOE) and uses Information Value as the stopping criterion, but I could not find a theoretical explanation of the whole approach. Does anyone have any hints?
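To make the two quantities concrete, here is a minimal sketch of how WOE and IV are commonly defined for an already-binned variable. The function name `woe_iv` and the toy data are hypothetical, the sign convention for WOE varies between sources, and this is not the woeBinning implementation, only the textbook definitions as I understand them.

```python
import math
from collections import Counter


def woe_iv(bins, y):
    """Return ({bin: WOE}, IV) for bin labels `bins` and binary labels `y` (1 = event)."""
    events = Counter(b for b, t in zip(bins, y) if t == 1)
    non_events = Counter(b for b, t in zip(bins, y) if t == 0)
    n_events = sum(events.values())
    n_non_events = sum(non_events.values())

    woe, iv = {}, 0.0
    for b in sorted(set(bins)):
        if events[b] == 0 or non_events[b] == 0:
            # WOE is undefined for a bin missing one class; real tools smooth or merge.
            raise ValueError(f"bin {b!r} has no observations in one class")
        p_event = events[b] / n_events              # share of all events in this bin
        p_non_event = non_events[b] / n_non_events  # share of all non-events in this bin
        woe[b] = math.log(p_event / p_non_event)
        iv += (p_event - p_non_event) * woe[b]
    return woe, iv


# Toy example: bin "a" is event-heavy, bin "b" is non-event-heavy.
woe, iv = woe_iv(["a"] * 4 + ["b"] * 4, [1, 1, 1, 0, 1, 0, 0, 0])
print(woe, iv)
```

Under these definitions IV is a sum of non-negative terms, and it is zero when every bin has the same share of events and non-events, which is why it can serve as a measure of how much separation a given binning retains.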

elione30
  • Perhaps it is worth reexamining your rationale for oversampling https://stats.stackexchange.com/q/357466/240024 – Ryan Volpi Mar 19 '21 at 17:18
  • It was an interesting read, thank you. However, my professor told us to use SMOTE, so I am kind of "bound" to it, and I am now interested in optimal binning methods for the reasons I explained above. – elione30 Mar 19 '21 at 23:28

0 Answers