3

The error percentage of my regression model changes whenever the train and test data change, and I am choosing that split randomly. Cross-validation can overcome this, but how do I apply it to my regression model?

ekvall
    Possible duplicate of [Cross-Validation in plain english?](http://stats.stackexchange.com/questions/1826/cross-validation-in-plain-english) – Sycorax Mar 31 '16 at 13:07

2 Answers

1

If I understand the question, you're looking to use cross-validation to tune your random forest parameters, which means you need two holdout sets:

  • one for cross-validation // model tuning
  • one for a final test (from which you generate an estimated overall performance, RMSE, MAE, etc)

Is that correct?

Assuming it is, I would suggest first splitting your dataset into two sets (train and the rest), then splitting "the rest" again into two further datasets, which gives you a CV and a test dataset.

Example (Python 3.x && sklearn's train_test_split)

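from sklearn.model_selection import train_test_split

# First split: 70% train, 30% held out (temporarily named X_test / y_test).
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.3, random_state=10)

# Second split: divide the held-out 30% in half, giving 15% CV and 15% test.
X_cv, X_test, y_cv, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=10)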

I've used a seed so the datasets are repeatable across experiments // iterations. Note that the CV and test datasets are both derived from the first test split, and that I elected to make X_train 70% of the set, with a 15% / 15% split between CV and test.
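To show how the three sets might then be used, here is a minimal sketch rather than a definitive recipe. It tunes a single random forest parameter on the CV set and reports RMSE on the test set once at the end. The synthetic X_all / y_all, the candidate_depths grid, and the rmse helper are all assumptions made up for illustration; substitute your own data and parameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; replace with your own X_all, y_all).
rng = np.random.RandomState(10)
X_all = rng.uniform(-3, 3, size=(300, 4))
y_all = X_all[:, 0] ** 2 + 2 * X_all[:, 1] + rng.normal(scale=0.5, size=300)

# Same 70% / 15% / 15% split as in the answer.
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.3, random_state=10)
X_cv, X_test, y_cv, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=10)


def rmse(model, X, y):
    """Root mean squared error of a fitted model on (X, y). Hypothetical helper."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))


# Tune one parameter using the CV set only; the test set stays untouched.
candidate_depths = [2, 4, 8, None]  # hypothetical grid, None means unlimited depth
cv_scores = {}
for depth in candidate_depths:
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=10)
    model.fit(X_train, y_train)
    cv_scores[depth] = rmse(model, X_cv, y_cv)

best_depth = min(cv_scores, key=cv_scores.get)

# Refit with the chosen setting and evaluate on the test set once.
final_model = RandomForestRegressor(n_estimators=50, max_depth=best_depth, random_state=10)
final_model.fit(X_train, y_train)
test_rmse = rmse(final_model, X_test, y_test)
print("best max_depth:", best_depth, "| test RMSE:", round(test_rmse, 3))
```

Because the test set plays no part in choosing max_depth, its RMSE is a fairer estimate of performance on unseen data than the CV score used for tuning.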

jms
0

That may be due to overfitting. A common rule of thumb is the 80/20 split, which advises assigning 80% of your data to the train set and the remaining 20% to the test set, so you don't have to pick the proportions arbitrarily.

Another cross-validation method, which seems to be the one you are suggesting, is k-fold cross-validation, where you partition your dataset into k folds and iteratively use each fold as the test set, i.e. training on the remaining k-1 folds. scikit-learn [1] has a KFold class, which you can import as follows:

from sklearn.model_selection import KFold

[1] http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
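As a rough sketch of how KFold could be applied to a regression model, the example below scores a model on 5 folds with cross_val_score and reports the per-fold RMSE. The synthetic X and y, the choice of LinearRegression, and n_splits=5 are assumptions for illustration only; any scikit-learn regressor can take its place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; replace with your own X, y).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Shuffle before splitting so fold membership does not depend on row order.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports negated MSE so that a larger score is always better.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=kfold, scoring="neg_mean_squared_error"
)
rmse_per_fold = np.sqrt(-neg_mse)

print("RMSE per fold:", np.round(rmse_per_fold, 3))
print("mean RMSE:", rmse_per_fold.mean(), "| std:", rmse_per_fold.std())
```

Averaging over the folds gives an error estimate that depends much less on any single random split, and the standard deviation shows how much the error still varies from fold to fold.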

yosemite_k