
How do we build a model, cross-validate it, and use it to predict unknown data?

Say I have a known dataset of 100 points. My steps for 10-fold cross-validation are as follows (sketched in code after the list):

  1. Divide the data randomly into training and test datasets in a ratio of 90:10.
  2. Build a model on the training dataset (90 points); I used libSVM's grid.py to optimize C and gamma.
  3. Test the optimized model on the test dataset (10 points) and calculate the error.
  4. Repeat steps 1-3 ten times for 10-fold cross-validation, and average the error from each repeat to get the average error.
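Roughly, in code, the procedure looks like this (a minimal sketch only: scikit-learn's `SVC` and `GridSearchCV` are used as stand-ins for libSVM and grid.py, and the data are random placeholders for the 100 known points):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X, y = rng.rand(100, 5), rng.randint(0, 2, 100)  # placeholder for the 100 known points

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
errors = []
for repeat in range(10):
    # Step 1: random 90:10 split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=repeat)
    # Step 2: optimize C and gamma on the 90 training points
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)
    # Step 3: error of the optimized model on the 10 held-out points
    errors.append(1.0 - search.score(X_test, y_test))

# Step 4: average the error over the 10 repeats
print("average error:", np.mean(errors))
```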

Now, after repeating the steps 10 times, I will have 10 different optimized models. To predict for an unknown dataset (200 points), should I use the model that gave me the minimum error, OR should I do step 2 once again on the full data (run grid.py on the full data) and use that as the model for predicting the unknowns?

I would also like to know: is the procedure the same for other machine-learning methods (like ANN, Random Forest, etc.)?

jonsca
d.putto
  • Your steps are almost correct. Initially you want to split the data into 10 disjoint sets. You do not repeat your first step. You simply go through steps 2 and 3 ten times with the $i^{th}$ step having a training set of all data less the $i^{th}$ set and a test set of the $i^{th}$ set. – assumednormal Jul 04 '12 at 11:03
  • You've tagged this with `SVM` and `neural-networks`, but your question seems to be more general and not related to those methods in particular. If I'm wrong about that, please consider adding some text that explains what your question is _for these methods_. – MånsT Jul 04 '12 at 11:17
  • @Max - If I do not repeat the first step, I will train and test on the same data 10 times!! – d.putto Jul 04 '12 at 12:03
  • @d.putto - That's not what I'm suggesting. You want to randomly split your data into 10 (approximately) equally sized sets. You only do that once. Now on the first iteration, your training set consists of sets 2 through 10, while your test set is set 1. On the second iteration, your training set consists of sets 1 and 3 through 10, while your test set is set 2. This continues until you've gone through 10 iterations (until you've used each individual set as a test set); see the sketch after these comments. – assumednormal Jul 04 '12 at 12:08
  • @Max - I think I get your point. So for 3-fold cross-validation, divide the data into 3 sets (say X, Y, Z). Take X as the test set and the remaining data (Y+Z) as the training set. For the next iteration take Y as the test set and (X+Z) as the training set, and so on... – d.putto Jul 04 '12 at 12:16
  • @d.putto - Exactly. I have to apologize, though, because it appears that your original method is also a form of cross-validation that I hadn't seen before. I was suggesting the use of $K$-Fold Cross-validation, while you were suggesting the use of Repeated Random Sub-sampling Validation. You may want to read [Wikipedia's page on Cross-validation](http://en.wikipedia.org/wiki/Cross-validation_(statistics)) to weigh the advantages and disadvantages of both methods. – assumednormal Jul 04 '12 at 12:24
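A minimal sketch of the $K$-fold splitting scheme described in the comments above (scikit-learn's `KFold` is used here as one possible implementation; the thread itself does not name a library):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(100, 1)  # placeholder for the 100 known points

# The data are split into 10 disjoint sets once; each set is the test set exactly once.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Training set: all data except the i-th set. Test set: the i-th set.
    print(f"iteration {i}: {len(train_idx)} training points, {len(test_idx)} test points")
```

Every iteration therefore trains on 90 points and tests on the remaining 10, and no point is used as a test point more than once.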

2 Answers


Now, after repeating the steps 10 times, I will have 10 different optimized models.

Yes. Cross-validation (like other resampling-based validation methods) implicitly assumes that these models are at least equivalent in their predictions, so you are allowed to average/pool all those test results.

Usually there is a second, stronger assumption: that those 10 "surrogate models" are equivalent to the model built on all 100 cases:

To predict for an unknown dataset (200 points), should I use the model that gave me the minimum error, OR should I do step 2 once again on the full data (run grid.py on the full data) and use that as the model for predicting the unknowns?

Usually the latter is done (second assumption).

However, I personally would not run the grid optimization on the whole data again (though one can argue about that), but would instead use cost and γ parameters that turned out to be a good choice in the 10 optimizations you have already done (see below).
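For example (a sketch only, with scikit-learn's `SVC` standing in for libSVM; the parameter values and the data are purely illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data standing in for the 100 known and the 200 unknown points.
rng = np.random.RandomState(0)
X_known, y_known = rng.rand(100, 5), rng.randint(0, 2, 100)
X_unknown = rng.rand(200, 5)

# Reuse cost and gamma values that were (hypothetically) stable across the
# 10 surrogate models; no new grid search on the full data.
final_model = SVC(kernel="rbf", C=10.0, gamma=0.01)
final_model.fit(X_known, y_known)             # refit on all 100 known points
predictions = final_model.predict(X_unknown)  # predict the 200 unknowns
```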

However, there are also so-called aggregated models (e.g. random forest aggregates decision trees), where all 10 models are used to obtain 10 predictions for each new sample, and then an aggregated prediction (e.g. majority vote for classification, average for regression) is used. Note that you validate those models by iterating the whole cross-validation procedure with new random splits.
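A sketch of how such an aggregated prediction could look, assuming the 10 surrogate models from the cross-validation loop were kept (the names `surrogate_models` and `X_unknown` are hypothetical, and the class labels are assumed to be non-negative integers):

```python
import numpy as np

def aggregate_predict(surrogate_models, X_unknown):
    """Majority vote over the predictions of all fitted surrogate models."""
    # One row of predictions per surrogate model, shape (n_models, n_samples).
    all_preds = np.array([m.predict(X_unknown) for m in surrogate_models])
    # For each sample, return the most frequently predicted label.
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
```

For regression, the majority vote would simply be replaced by the mean of the 10 predictions.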

Here's a link to a recent question about what such iterations are good for: Variance estimates in k-fold cross-validation

I would also like to know: is the procedure the same for other machine-learning methods (like ANN, Random Forest, etc.)?

Yes, it can be applied very generally.
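For instance, the same scheme with a random forest only swaps the model and its parameter grid (a sketch, assuming scikit-learn; nothing in the thread prescribes a particular library):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.RandomState(0)
X, y = rng.rand(100, 5), rng.randint(0, 2, 100)  # placeholder data

# Tune the forest's parameters inside each training set (as grid.py tunes C and
# gamma for the SVM) and estimate the error with an outer 10-fold cross-validation.
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_features": ["sqrt", None]},
    cv=5,
)
scores = cross_val_score(rf_search, X, y, cv=10)
print("average error:", 1.0 - scores.mean())
```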

As you optimize each of the surrogate models, I recommend looking a bit more closely at those results (a small diagnostic sketch follows this list):

  • Are the optimal cost and γ parameters stable (i.e. equal or similar for all models)?

  • The difference between the error reported by the grid optimization and the test error you observe on the 10% held-out data is also important: if the difference is large, the models are likely to be overfit - particularly if the optimization reports very small error rates.
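A small sketch of these checks, assuming the chosen parameters and the two error estimates were recorded for each surrogate model (all numbers are purely illustrative):

```python
import numpy as np

# One entry per surrogate model:
# (C, gamma, error reported by the grid search, error on the held-out 10%)
results = [
    (10.0, 0.01, 0.05, 0.10),
    (10.0, 0.01, 0.04, 0.12),
    (100.0, 0.10, 0.02, 0.25),  # a fold like this one would be a warning sign
]

Cs = [r[0] for r in results]
gammas = [r[1] for r in results]
gaps = [r[3] - r[2] for r in results]

print("distinct C values:     ", sorted(set(Cs)))      # few distinct values -> stable parameters
print("distinct gamma values: ", sorted(set(gammas)))
print("mean optimization/test gap:", np.mean(gaps))    # a large gap suggests overfitting
```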

cbeleites unhappy with SX

@d.putto - That's not what I'm suggesting. You want to randomly split your data into 10 (approximately) equally sized sets. You only do that once. Now on the first iteration, your training set consists of sets 2 through 10, while your test set is set 1. On the second iteration, your training set consists of sets 1 and 3 through 10, while your test set is set 2. This continues until you've gone through 10 iterations (until you've used each individual set as a test set). – Max

@Max - I think I get your point. So for 3-fold cross-validation, divide the data into 3 sets (say X, Y, Z). Take X as the test set and the remaining data (Y+Z) as the training set. For the next iteration take Y as the test set and (X+Z) as the training set, and so on... – d.putto

I see a contradiction: according to Max, at each new iteration the test set is included in the training set and the size of the training set tends to increase (until all sets have been used as a test set), whereas according to d.putto the size of the training set stays constant.

Sihem