
I've been working on a random forest model for credit scoring in R. I've trained a model using `caret::train`.

My data `df_samples_rf` has the following structure: [screenshot of the data structure]

# 10-fold cross-validation repeated 5 times; class probabilities are
# needed so that twoClassSummary can compute ROC
control <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 5,
                        search = "grid",
                        classProbs = TRUE,
                        savePredictions = TRUE,
                        summaryFunction = twoClassSummary,
                        allowParallel = TRUE)
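
Note in passing: with `summaryFunction = twoClassSummary`, caret computes ROC from the class probabilities of the first factor level, and `classProbs = TRUE` requires the levels of `bad` to be valid R names (e.g. "no"/"yes" rather than 0/1). A quick sanity check:

# the first level of the outcome is treated as the event of interest
levels(df_samples_rf$bad)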

# grid of mtry values to evaluate
tune_grid <- expand.grid(.mtry = 3:15)

# fit the random forest over the mtry grid, selecting the model by ROC;
# column 27 is excluded from the training data
rf_gridsearch <- train(bad ~ .,
                       data = df_samples_rf[-27],
                       method = "rf",
                       metric = "ROC",
                       tuneGrid = tune_grid,
                       trControl = control)

I get the following results: [screenshot of the cross-validated results]

And: [screenshot of the final model output]

So I assume the model performs well.
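
For completeness, the same resampled metrics can be printed as text rather than screenshots, using the `rf_gridsearch` object above:

# cross-validated ROC, sensitivity and specificity for each mtry value
print(rf_gridsearch$results)

# the mtry value that maximized ROC
print(rf_gridsearch$bestTune)

# resampled performance of the selected model
getTrainPerf(rf_gridsearch)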

Then I try to predict like this:

# predict on the new data directly from the underlying randomForest object
rf_predictions <- predict(rf_gridsearch$finalModel,
                          newdata = df_no_yes)
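
As an aside, caret's documented route is to predict from the `train` object itself rather than from `$finalModel`; `predict.train` then applies any preprocessing learned during training before calling the underlying model. A minimal sketch with the same objects:

# class predictions via caret's predict.train
rf_pred_caret <- predict(rf_gridsearch, newdata = df_no_yes)

# class probabilities, e.g. for computing ROC on the new data
rf_prob_caret <- predict(rf_gridsearch, newdata = df_no_yes, type = "prob")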

Data "df_no_yes" has exactly the same structure and class variables as "df_samples_rf", predicted variable also have the same class and levels.

# collect the observed and predicted classes
results <- data.frame(real = df_no_yes$bad,
                      rf_predictions)

I run `confusionMatrix()` and get:

confusionMatrix(results$rf_predictions, results$real)

[screenshot of the confusion matrix]
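
By default `confusionMatrix()` treats the first factor level as the positive class; if that is not the intended event, it can be set explicitly (the level name "yes" below is only an assumption for illustration):

# declare the positive class explicitly (level name assumed)
confusionMatrix(results$rf_predictions, results$real, positive = "yes")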

Why does `caret::train` report good ROC-AUC and classification metrics, yet when I predict on a new data set the results are so poor? Am I doing something wrong? Am I applying `predict` the wrong way?

  • If you are performing poorly on new data, you are probably overfitting your model. For general modeling advice, you should ask for help at [stats.se]. This doesn't seem like a specific programming question that's appropriate for Stack Overflow. At the very least you would need to include a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data we can run ourselves to see what's going on. Pictures of data and results are not helpful because we can't copy/paste them into R. – MrFlick Feb 18 '22 at 17:08
  • That's what I thought at first, but Leo Breiman (2001) says that overfitting isn't a problem in random forests; the only limit is how much the generalization error can be reduced by adding more trees. I'll try asking on Cross Validated, thank you for your advice. –  Feb 18 '22 at 17:16
  • How did you split the training and testing data? I ask because there are a lot of ways that are *not great* for classification. Did you use `caret::createDataPartition()` with the outcome variable? It involves stratification of the possible outcomes (a minimal sketch follows below this list). As you stated about overfitting, I've never seen a random forest model actually overfit. However, I've definitely seen models that needed different parameters. – Kat Feb 18 '22 at 21:17
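
For reference, a minimal sketch of the stratified split described in the last comment, using `createDataPartition()` on the outcome; `df_samples_rf` and the outcome column `bad` come from the question, while the seed and split proportion are illustrative:

library(caret)

set.seed(42)  # illustrative seed for reproducibility

# stratify on the outcome so both sets keep the same good/bad proportion
train_idx <- createDataPartition(df_samples_rf$bad, p = 0.8, list = FALSE)

df_train <- df_samples_rf[train_idx, ]
df_test  <- df_samples_rf[-train_idx, ]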

0 Answers