How do I add cross validation for a random forest regression?

Question

The error percentage of my regression changes whenever the train and test data change, and I choose that split randomly. Cross-validation can overcome this, but how do I apply it to my regression model?

Possible duplicate of [Cross-Validation in plain english?](http://stats.stackexchange.com/questions/1826/cross-validation-in-plain-english) — Sycorax, Mar 31 '16 at 13:07

score 1 · Answer 1 · answered Jul 26 '17 at 20:46

If I understand the question, you're looking to use cross-validation to tune your random forest parameters, which results in two holdout sets:

one for cross-validation (model tuning)
one for a final test (from which you estimate overall performance: RMSE, MAE, etc.)

Is that correct?

Assuming it is, I would suggest first splitting your dataset into two sets -- train and the rest -- and then splitting "the rest" again into two datasets, which gives you a CV set and a test set.

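Example (Python 3.x and sklearn's train_test_split):

from sklearn.model_selection import train_test_split

# First split: 70% train, 30% held out (temporarily named X_test / y_test)
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.3, random_state=10)

# Second split: halve the held-out 30% into CV and final test sets (15% each)
X_cv, X_test, y_cv, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=10)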

I've used a seed so the datasets are repeatable across experiments and iterations. Note that the CV and test datasets are both derived from the first test split, and that I elected to make X_train 70% of the data, with a 15% / 15% split between CV and test.
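To show how the three sets might be used with a random forest, here is a minimal sketch. It assumes synthetic data from make_regression as a stand-in for X_all / y_all, and the candidate max_depth values and n_estimators=50 are arbitrary choices for illustration, not recommendations. The held-out 30% is named X_rest / y_rest here purely for readability.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_all / y_all (assumption: the real data is not shown).
X_all, y_all = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=10)

# 70% train, then the remaining 30% split evenly into CV and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X_all, y_all, test_size=0.3, random_state=10)
X_cv, X_test, y_cv, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=10)

# Hypothetical candidate values for a single hyperparameter.
candidates = [2, 5, None]
cv_rmse = {}
for depth in candidates:
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=10)
    model.fit(X_train, y_train)
    # Score each candidate on the CV set only; the test set stays untouched.
    cv_rmse[depth] = np.sqrt(mean_squared_error(y_cv, model.predict(X_cv)))

# Keep the setting with the lowest RMSE on the CV set.
best_depth = min(cv_rmse, key=cv_rmse.get)

# Refit with the chosen setting and report performance once on the test set.
final_model = RandomForestRegressor(n_estimators=50, max_depth=best_depth, random_state=10)
final_model.fit(X_train, y_train)
test_rmse = np.sqrt(mean_squared_error(y_test, final_model.predict(X_test)))
print("best max_depth:", best_depth, "test RMSE:", test_rmse)
```

The point of the design is that the test set is only touched once, after the tuning decision has been made on the CV set, so the reported RMSE is not biased by the tuning.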

score 0 · Answer 2 · answered Oct 23 '18 at 14:45

That may be due to overfitting. A common rule of thumb is an 80/20 split, which advises assigning 80% of your data to the training set and the rest to the test set, so you don't have to choose the split percentage at random.

Another cross-validation method, which seems to be the one you are suggesting, is k-fold cross-validation, where you partition your dataset into k folds and iteratively use each fold as the test set while training on the other k-1 folds. scikit-learn [1] has a KFold class, which you can import as follows:

from sklearn.model_selection import KFold
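As a rough sketch of how KFold could be combined with a random forest regressor, the snippet below uses cross_val_score from the same module. The data is synthetic (make_regression), and k=5 and n_estimators=50 are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the asker's data (assumption: the real data is not shown).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 5 folds; shuffle so the folds do not depend on the row order of the data.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)

# scikit-learn scorers follow "higher is better", so MSE comes back negated.
neg_mse = cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)

# The mean is the cross-validated error; the spread shows how much it
# depends on which rows land in the test fold.
print("RMSE per fold:", rmse_per_fold)
print("mean:", rmse_per_fold.mean(), "std:", rmse_per_fold.std())
```

Averaging over the folds gives an error estimate that is less sensitive to any single random split, which is the variation described in the question.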

[1] http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
