Suppose I have a bunch of models and I need to figure out the best one. I calculate the likelihood for each model on the training data set. Why can't I use the likelihood on the training data set for model selection?
Because you can overfit the training data. The model that gives the smallest error on the training data could, due to overfitting, generalize poorly. That's why a hold-out set should be used for model selection. – Fariborz Ghavamian Jul 22 '19 at 19:10
2 Answers
If you have a complex model and you train it sufficiently, it can memorize the training data, even reaching an error of $0$ (this is called overfitting). The problem is that this doesn't necessarily mean that your model is any good because it might not be able to generalize well on unseen data.
Because of this, if you select your model just based on performance on the training set, you might select a model that has overfit and isn't actually as good as you think.
The solution is to use a test set, which isn't seen during training and can be used to evaluate the models' actual performance.
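As a quick illustration, here is a minimal sketch with NumPy (the synthetic data, the noise level, and the polynomial degrees 3 and 15 are arbitrary choices for the example): the more flexible model tends to score better on the training data but worse on held-out data.

```
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=x.shape)  # noisy synthetic targets

# Hold out 30% of the data as a test set.
split = int(0.7 * len(x))
x_train, y_train = x[:split], y[:split]
x_test, y_test = x[split:], y[split:]

for degree in (3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)            # fit on the training data only
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
# The degree-15 fit usually reaches a lower training error but a worse test error.
```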
I'd suggest you read this post for more details.

Okay. Suppose I have two criteria, BIC and AIC. I split the data into training and test sets. I fit the model on the training data and get the estimates based on BIC and AIC. Then I use these estimates on the test data to calculate the likelihood. I repeat this, say, 100 times and average the likelihood. But how do I select the best model? I have the likelihoods using BIC and AIC. – Hello Aug 01 '19 at 13:26
In order to select amongst models, we need some way of evaluating their performance.
You can't evaluate a model's hypothesis function with the training cost alone, because minimizing the training error can lead to overfitting.
A good approach is to take your data and split it randomly into a training set and a test set (e.g. a 70%/30% split). Then you train your model on the training set and see how it performs on the test set.
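For example, a minimal sketch of such a split using scikit-learn's `train_test_split`, with placeholder arrays `X` and `y` standing in for real data:

```
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3)   # placeholder feature matrix
y = np.random.rand(100)      # placeholder targets

# 70%/30% random split into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)   # train on the training set only
print("test R^2:", model.score(X_test, y_test))    # evaluate on the held-out test set
```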
For linear regression, you might do things this way:
Learn the parameter $\theta$ from the training data by minimizing the training error $J(\theta)$. Then compute the test set error using the squared error ($m_{test}$ is the test set size):
$$J_{test}(\theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\theta\big(x^{(i)}_{test}\big) - y^{(i)}_{test} \right)^2$$
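A small NumPy sketch of this test cost, assuming `X_test` already includes a bias column, `theta` holds the parameters learned on the training set, and the example numbers are made up:

```
import numpy as np

def linear_test_cost(theta, X_test, y_test):
    """J_test(theta) = 1/(2 m_test) * sum (h_theta(x) - y)^2, with h_theta(x) = X @ theta."""
    m_test = len(y_test)
    predictions = X_test @ theta          # hypothesis h_theta(x) for every test example
    return np.sum((predictions - y_test) ** 2) / (2 * m_test)

# Example usage with made-up numbers:
theta = np.array([1.0, 2.0])
X_test = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0]])   # first column is the bias term
y_test = np.array([2.1, 3.9, 5.2])
print(linear_test_cost(theta, X_test, y_test))
```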
For logistic regression, you might do things this way:
Learn the parameter $\theta$ from the training data by minimizing the training error $J(\theta)$. Then compute the test set error ($m_{test}$ is the test set size):
$$J_{test}(\theta) = -\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \left[ y^{(i)}_{test} \log h_\theta\big(x^{(i)}_{test}\big) + \big(1 - y^{(i)}_{test}\big) \log\big(1 - h_\theta(x^{(i)}_{test})\big) \right]$$
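A small NumPy sketch of this cost, again with made-up placeholder values for `theta`, `X_test`, and `y_test`, and a small `eps` clamp to avoid taking the log of zero:

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_test_cost(theta, X_test, y_test, eps=1e-12):
    """J_test(theta) = -1/m_test * sum[y log h + (1 - y) log(1 - h)]."""
    m_test = len(y_test)
    h = sigmoid(X_test @ theta)                        # h_theta(x) for every test example
    h = np.clip(h, eps, 1 - eps)                       # avoid log(0)
    return -np.sum(y_test * np.log(h) + (1 - y_test) * np.log(1 - h)) / m_test

theta = np.array([0.5, -1.0])
X_test = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.3]])  # first column is the bias term
y_test = np.array([0, 1, 1])
print(logistic_test_cost(theta, X_test, y_test))
```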
Alternatively, you can use the misclassification error ("0/1 misclassification error", read "zero-one"), which is just the fraction of examples that your hypothesis has mislabeled:
$$\mathrm{err}\big(h_\theta(x), y\big) = \begin{cases} 1, & \text{if } h_\theta(x) \ge 0.5 \text{ and } y = 0, \text{ or } h_\theta(x) < 0.5 \text{ and } y = 1 \\ 0, & \text{otherwise} \end{cases}$$

$$\text{test error} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \mathrm{err}\big(h_\theta(x^{(i)}_{test}),\, y^{(i)}_{test}\big)$$
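A small NumPy sketch, assuming `h` holds the hypothesis outputs $h_\theta(x)$ for the test examples and `y_test` the true labels:

```
import numpy as np

def misclassification_error(h, y_test):
    """Fraction of test examples whose thresholded prediction disagrees with the label."""
    predictions = (h >= 0.5).astype(int)   # threshold the hypothesis at 0.5
    return np.mean(predictions != y_test)

h = np.array([0.9, 0.2, 0.6, 0.4])
y_test = np.array([1, 0, 0, 1])
print(misclassification_error(h, y_test))  # 0.5: two of the four examples are mislabeled
```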
A better way of splitting the data is to not split it only into training and testing sets, but to also include a validation set. A typical ratio is 60% training, 20% validation, 20% testing.
So instead of just measuring the test error, you would also measure the validation error.
The validation set is used mainly to tune hyperparameters: you don't want to tune them on the training set because that can result in overfitting, nor on your test set because that gives an overly optimistic estimate of generalization. So we keep a separate set of data, the validation set, just for tuning hyperparameters.
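A minimal sketch of this workflow with scikit-learn; the model (ridge regression), the candidate regularization strengths, and the random placeholder data are illustrative assumptions:

```
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 5)
y = np.random.rand(200)

# First carve off 20% for the test set, then split the rest 75/25 -> 60/20/20 overall.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

best_alpha, best_score = None, -np.inf
for alpha in (0.01, 0.1, 1.0, 10.0):                  # candidate hyperparameter values
    model = Ridge(alpha=alpha).fit(X_train, y_train)  # fit on the training set only
    score = model.score(X_val, y_val)                 # compare candidates on the validation set
    if score > best_score:
        best_alpha, best_score = alpha, score

final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("chosen alpha:", best_alpha, "test R^2:", final_model.score(X_test, y_test))  # report once
```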
You can use these errors to identify what kind of problem you have if your model isn't performing well:
- If your training error is large and your validation/test set error is large, then you have a high bias (underfitting) problem.
- If your training error is small and your validation/test set error is large, then you have a high variance (overfitting) problem.

Because the test set is used to estimate the generalization error, it should not be used for "training" in any sense - this includes tuning hyperparameters. You should not evaluate on the test set and then go back and tweak things - this will give an overly optimistic estimation of generalization error.
Some ways of evaluating a model's performance on (some of) your known data are:
- Hold-out: just set aside some portion of the data for validation. This is less reliable if the amount of data is small, such that the held-out portion is very small.
- $k$-fold cross-validation (better than hold-out for small datasets; see the sketch after this list):
  - the training set is divided into $k$ folds
  - iteratively take $k-1$ folds for training and validate on the remaining fold
  - average the results
  - there is also "leave-one-out" cross-validation, which is $k$-fold cross-validation with $k=n$ ($n$ is the number of datapoints)
- Bootstrapping:
  - new datasets are generated by sampling with replacement (uniformly at random) from the original dataset
  - then train on the bootstrapped dataset and validate on the unselected data
- Jackknife resampling: essentially equivalent to leave-one-out cross-validation, since leave-one-out is basically sampling without replacement.
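For concreteness, a minimal sketch of $k$-fold cross-validation with scikit-learn; the model (ridge regression) and the random placeholder data are again illustrative assumptions:

```
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

X = np.random.rand(100, 4)
y = np.random.rand(100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)    # k = 5 folds
scores = []
for train_idx, val_idx in kf.split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])     # train on k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))  # validate on the remaining fold
print("mean validation R^2:", np.mean(scores))          # average the results
```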
