
Suppose I split my data into two parts -- a training set (containing 80% of my data) and a test set (the remaining 20%). I train a model on the training set and test it on the test set.

What do we learn from predicting on the test set? Are we just looking for a measure of generalization error (and perhaps noticing where it makes mistakes) or is there more information we can get out of it? What can we do with this information?
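For concreteness, here is a minimal sketch of the setup described above, assuming scikit-learn is available and using synthetic data as a stand-in for the real dataset:

```python
# Minimal sketch of an 80/20 split and a test-set error estimate.
# Assumes scikit-learn; the data, model, and metric are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                              # stand-in features
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)                    # 80% train / 20% test

model = LinearRegression().fit(X_train, y_train)

# The test-set error is an estimate of generalization error on unseen data.
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"test MSE: {test_mse:.3f}")
```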

Hatshepsut

2 Answers


Basically, you fit your model on a training set and then ask yourself: "My model fits the training data well, but how does it perform on new, unseen data?" Why is that important? It is hugely important because in applications, or whenever you are given an assignment, the greatest share of your responsibility is not just to fit a model to the data you have at your disposal, but to guarantee a good fit and good predictive power on new data.

That's as general as I can get.
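To make this concrete, one common check is to compare the error on the training set with the error on the held-out test set; a large gap suggests the model fits the training data but will not predict new data well. A minimal sketch, assuming scikit-learn (the data and model are illustrative only):

```python
# Sketch: a large train/test error gap signals poor generalization.
# Assumes scikit-learn; the data and model are illustrative only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# An unpruned tree fits the training data almost perfectly ...
tree = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)
train_mse = mean_squared_error(y_train, tree.predict(X_train))
test_mse = mean_squared_error(y_test, tree.predict(X_test))

# ... but only the test error shows how it handles unseen data.
print(f"train MSE: {train_mse:.3f}   test MSE: {test_mse:.3f}")
```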

Tommaso Guerrini
  • Yes, I agree that's important. I'm wondering whether there are any other benefits we can get from the test data, in addition to a measure of generalization error. – Hatshepsut Apr 01 '16 at 09:22
  • You may also consider partitioning the dataset into training + validation + test sets; maybe that's what you're referring to when asking what information we get from the test set. See http://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set. In that case the validation set is a sort of test set inside the training one: you perform a sort of testing on your training data to assess performance on future unseen data (a minimal sketch of such a three-way split follows these comments). – Tommaso Guerrini Apr 01 '16 at 09:27
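A minimal sketch of the three-way split mentioned in the comment above, assuming scikit-learn; the 60/20/20 proportions are only an illustration:

```python
# Sketch of a train/validation/test partition. Assumes scikit-learn;
# the data and the 60/20/20 proportions are illustrative only.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))
y = rng.normal(size=1000)

# First carve off the final test set (20%), touched only once at the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

# Then split the remainder into training (75% of the rest = 60% overall)
# and validation (25% of the rest = 20% overall) for model selection/tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=2)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```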

One very important benefit of a test set is that it lets you calculate the forecast error with standard metrics such as MSE or MAPE. This matters in statistical forecasting techniques like ARMA/ARIMA, because it is difficult to identify the best-fitting model on the basis of in-sample criteria like AIC alone: the model with the best in-sample fit may not forecast the best. A test set therefore lets you compare the forecasting accuracy of several candidate models and pick the best one among them. It is always important to compare the performance of different models on a test set.
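A minimal sketch of that kind of comparison, assuming statsmodels is available; the simulated series and the candidate orders are placeholders for real data and real candidate models:

```python
# Sketch: comparing two candidate ARIMA specifications by out-of-sample
# forecast error rather than by AIC alone. Assumes statsmodels; the series
# and the candidate orders are illustrative only.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
# Simulated AR(1) series around a level of 10, as a stand-in for real data.
y = np.full(220, 10.0)
for t in range(1, 220):
    y[t] = 10 + 0.7 * (y[t - 1] - 10) + rng.normal()

train, test = y[:200], y[200:]            # hold out the last 20 observations

for order in [(1, 0, 0), (2, 0, 1)]:      # candidate models
    fit = ARIMA(train, order=order).fit()
    forecast = fit.forecast(steps=len(test))
    mse = np.mean((test - forecast) ** 2)
    mape = np.mean(np.abs((test - forecast) / test)) * 100
    print(f"ARIMA{order}: AIC={fit.aic:.1f}  test MSE={mse:.3f}  MAPE={mape:.1f}%")
```

The model preferred by AIC and the model with the smallest out-of-sample error need not coincide, which is exactly why the held-out comparison is worth doing.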

A K R