
I've been working on a random forest model for credit scoring in R. I've trained a model using `caret::train`.

My data `df_samples_rf` has the following structure: [screenshot of the data structure]

# 10-fold cross-validation repeated 5 times; class probabilities are
# needed so that twoClassSummary can compute ROC
control <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 5,
                        search = "grid",
                        classProbs = TRUE,
                        savePredictions = TRUE,
                        summaryFunction = twoClassSummary,
                        allowParallel = TRUE)
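
Note in passing: with `summaryFunction = twoClassSummary`, caret computes ROC from the class probabilities of the first factor level, and `classProbs = TRUE` requires the levels of `bad` to be valid R names (e.g. "no"/"yes" rather than 0/1). A quick sanity check:

# the first level of the outcome is treated as the event of interest
levels(df_samples_rf$bad)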

# grid of mtry values to evaluate
tune_grid <- expand.grid(.mtry = 3:15)

# fit the random forest over the mtry grid, selecting the model by ROC;
# column 27 is excluded from the training data
rf_gridsearch <- train(bad ~ .,
                       data = df_samples_rf[-27],
                       method = "rf",
                       metric = "ROC",
                       tuneGrid = tune_grid,
                       trControl = control)

I get the following results: [screenshot of the cross-validated results]

And: [screenshot of the final model output]

So I assume the model performs well.
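
For completeness, the same resampled metrics can be printed as text rather than screenshots, using the `rf_gridsearch` object above:

# cross-validated ROC, sensitivity and specificity for each mtry value
print(rf_gridsearch$results)

# the mtry value that maximized ROC
print(rf_gridsearch$bestTune)

# resampled performance of the selected model
getTrainPerf(rf_gridsearch)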

Then I try to predict like this:

# predict on the new data directly from the underlying randomForest object
rf_predictions <- predict(rf_gridsearch$finalModel,
                          newdata = df_no_yes)
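
As an aside, caret's documented route is to predict from the `train` object itself rather than from `$finalModel`; `predict.train` then applies any preprocessing learned during training before calling the underlying model. A minimal sketch with the same objects:

# class predictions via caret's predict.train
rf_pred_caret <- predict(rf_gridsearch, newdata = df_no_yes)

# class probabilities, e.g. for computing ROC on the new data
rf_prob_caret <- predict(rf_gridsearch, newdata = df_no_yes, type = "prob")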

Data "df_no_yes" has exactly the same structure and class variables as "df_samples_rf", predicted variable also have the same class and levels.

# collect the observed and predicted classes
results <- data.frame(real = df_no_yes$bad,
                      rf_predictions)

I run `confusionMatrix()` and get:

confusionMatrix(results$rf_predictions, results$real)

[screenshot of the confusion matrix]
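
By default `confusionMatrix()` treats the first factor level as the positive class; if that is not the intended event, it can be set explicitly (the level name "yes" below is only an assumption for illustration):

# declare the positive class explicitly (level name assumed)
confusionMatrix(results$rf_predictions, results$real, positive = "yes")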

Why does `caret::train` report good ROC-AUC and classification metrics, yet when I predict on a new data set the results are so poor? Am I doing something wrong? Am I applying `predict` the wrong way?

  • If you are performing poorly on new data, you are probably overfitting your model. For general modeling advice, you should ask for help at [stats.se]. This doesn't seem like a specific programming question that's appropriate for Stack Overflow. At the very least you would need to include a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data we can run ourselves to see what's going on. Pictures of data and results are not helpful because we can't copy/paste them into R. – MrFlick Feb 18 '22 at 17:08
  • That's what I thought at first, but Leo Breiman (2001) says that overfitting isn't a problem in random forests; the only limit is how much the generalization error can be reduced by adding more trees. I'll try asking on Cross Validated, thank you for your advice. –  Feb 18 '22 at 17:16
  • How did you split the training and testing data? I ask because there are a lot of ways that are *not great* for classification. Did you use `caret::createDataPartition()` with the outcome variable? It involves stratification of the possible outcomes (a minimal sketch follows below this list). As you stated about overfitting, I've never seen a random forest model actually overfit. However, I've definitely seen models that needed different parameters. – Kat Feb 18 '22 at 21:17
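
For reference, a minimal sketch of the stratified split described in the last comment, using `createDataPartition()` on the outcome; `df_samples_rf` and the outcome column `bad` come from the question, while the seed and split proportion are illustrative:

library(caret)

set.seed(42)  # illustrative seed for reproducibility

# stratify on the outcome so both sets keep the same good/bad proportion
train_idx <- createDataPartition(df_samples_rf$bad, p = 0.8, list = FALSE)

df_train <- df_samples_rf[train_idx, ]
df_test  <- df_samples_rf[-train_idx, ]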

0 Answers