I am working on a classification problem where the outcome variable is either "Approved" or "Denied". Approvals make up roughly 60% of my dataset and denials roughly 30%. I have tried several models (random forest, decision tree, neural network, and gradient boosting machine). The highest specificity I can achieve is 0.69, with the random forest. I also tried balancing the data within the "train" function of the caret package using down-sampling, over-sampling, SMOTE, and ROSE. The sampling was performed only on the training dataset, inside 10-fold cross-validation. I am pretty new to machine learning, so any advice is appreciated. Unfortunately, I cannot provide the datasets or any code that I have written, so I am just looking for general suggestions relating to unbalanced datasets.

Thanks!

  • ... what are the remaining 10%, after you've considered the 60% approvals and 30% denials? And... this is hardly an unbalanced data set, not that unbalanced data is really a problem anyway: https://stats.stackexchange.com/questions/283170/when-is-unbalanced-data-really-a-problem-in-machine-learning – jbowman Jul 23 '19 at 15:26
  • Sorry, it's really 68.731% approvals and 31.269% denials. – user254529 Jul 23 '19 at 15:32
  • I actually tried running the same models without balancing the dataset, and the specificity was worse than with balancing. – user254529 Jul 23 '19 at 15:33
  • How many observations & features do you have in your RF, roughly? – jbowman Jul 23 '19 at 15:34
  • 23,348 rows for my training set. 24 features. 7 of those are continuous and the rest are categorical. The largest number of levels for my categorical variables is 28. – user254529 Jul 23 '19 at 15:38
  • An alternative is to use weights on your observations. – user2974951 Jul 24 '19 at 13:36
  • Yes, I read about this, but I couldn't find any application of it using the caret package in R. I'm also not sure exactly what I would set the weights to. Can you give an example? I can post the R code that I have for training the model if that would help too. – user254529 Jul 24 '19 at 13:40
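
On the question of what to set the weights to: one common heuristic (not the only one) is inverse class frequency, where each class gets weight n / (k * n_c) for n observations, k classes, and n_c observations in class c. The sketch below is plain Python rather than caret code, and the class counts (16,047 approved, 7,301 denied) are approximations reconstructed from the 23,348 training rows and the 68.731% / 31.269% split mentioned above; the function name is made up for illustration. In caret, a per-observation vector like this could in principle be passed through the `weights` argument of `train`, which only has an effect for models that accept case weights.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return (per-class weights, per-observation weights).

    Each class c gets weight n / (k * n_c), so every class contributes
    the same total weight and the weights sum to n overall.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    class_weight = {c: n / (k * m) for c, m in counts.items()}
    case_weights = [class_weight[y] for y in labels]
    return class_weight, case_weights


# Assumed counts, approximated from the percentages in the comments.
labels = ["Approved"] * 16047 + ["Denied"] * 7301

class_weight, case_weights = inverse_frequency_weights(labels)
for name, weight in sorted(class_weight.items()):
    print(f"{name}: {weight:.3f}")
```

With these weights a misclassified denial costs the model roughly twice as much as a misclassified approval, which is the same rebalancing that down-sampling or over-sampling aims for, but without discarding or duplicating rows.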

0 Answers