
I've trained a random forest for a regression problem. Now I want to check whether the model is overfitted. I have tuned the parameters and then compared the R-squared of the train and test datasets as below:

test_predict <- model %>% predict(test_data) %>% predictions()
R2_test <- 1 - sum((test_actual - test_predict)^2) / sum((test_actual - mean(test_actual))^2)

train_predict <- model %>% predict(train_data) %>% predictions()
R2_train <- 1 - sum((train_actual - train_predict)^2) / sum((train_actual - mean(train_actual))^2)

The R-squared on the train dataset is significantly higher (close to 0.9), which made me think the model is overfit. Then I came across this question: Random Forest - How to handle overfitting. It says that predict(model, newdata=train) "treats your training data as if it was a new dataset, and runs the observations down each tree. This will result in an artificially close correlation between the predictions and the actuals, since the RF algorithm generally doesn't prune the individual trees, relying instead on the ensemble of trees to control overfitting."

I don't know what the best way would be to compare the performance of the model between the train and test datasets.

1 Answer


Good afternoon!

  1. You have provided a good link in your post. It shows that you need
    predict(model)

Why? Because it returns the Out-Of-Bag (OOB) forecast: each observation is predicted only by the trees that did not include it in their bootstrap training sample, so, as far as I know, it behaves similarly to cross-validation. Using

predict(model, newdata=train)

you will make predictions on the training data with a model trained on that same data, which is misleading and may give unrealistically high model performance.
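For instance, here is a minimal sketch of the contrast, assuming the randomForest package (where predict with no new data returns the OOB predictions) and reusing train_data and train_actual from the question; the response name y is a placeholder:

library(randomForest)

# Illustrative fit; 'y' stands in for the actual response column
model <- randomForest(y ~ ., data = train_data)

# OOB forecast: each row is predicted only by trees whose bootstrap
# sample did not contain it
oob_predict <- predict(model)
R2_oob <- 1 - sum((train_actual - oob_predict)^2) / sum((train_actual - mean(train_actual))^2)

# Resubstitution forecast: every tree also predicts the rows it was
# trained on, inflating the fit
resub_predict <- predict(model, newdata = train_data)
R2_resub <- 1 - sum((train_actual - resub_predict)^2) / sum((train_actual - mean(train_actual))^2)

R2_oob is the honest training-side estimate to compare against R2_test, while R2_resub will typically come out much higher.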

  2. I personally would pay attention to model performance on the test set, because the model was fitted on the training set, so it 'knows' the training data and will perform better on it.

  3. Is R-squared the metric you really want to work with? It only indicates the percentage of the target variable's variance explained by the model. Since ML models are used for forecasting, the main concern is (surprisingly!) the accuracy of the forecast. So for regression problems people usually use RMSE, MAPE, etc., which better describe the average forecast error; see the sketch after this list.
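For example, a minimal sketch of both metrics in base R, reusing test_actual and test_predict from the question (note that MAPE is undefined when test_actual contains zeros):

# RMSE: average forecast error in the units of the target
RMSE <- sqrt(mean((test_actual - test_predict)^2))

# MAPE: average relative forecast error, in percent
MAPE <- mean(abs((test_actual - test_predict) / test_actual)) * 100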

rsx