I am doing the following, but I am not sure if this is right or what behavior I should expect:
- A union B union C is the full dataset
- Training set: A union B
- Test set: C
- Validation set: B (so it is a subset of the training set)
I am using these datasets with a classifier to evaluate the quality of the training set. The training data is generated using two different methods, so I want to compare the methods according to a metric computed over the classifier's results.
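To illustrate what happens with this split, here is a minimal, self-contained sketch. The toy 2-D data and the hand-rolled 1-nearest-neighbour classifier are hypothetical stand-ins for the actual datasets and classifier: a model that memorises its training points will score perfectly on B (because every point of B is also a training point), while the disjoint set C gives an honest estimate.

```python
import random

random.seed(0)

def make_points(n, cx, cy, label):
    # Toy 2-D points scattered around a class centre (hypothetical data).
    return [((cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5)), label)
            for _ in range(n)]

A = make_points(50, 0, 0, "neg") + make_points(50, 3, 3, "pos")
B = make_points(20, 0, 0, "neg") + make_points(20, 3, 3, "pos")
C = make_points(20, 0, 0, "neg") + make_points(20, 3, 3, "pos")

train = A + B       # training set is A union B
validation = B      # validation set is a SUBSET of the training set
test = C            # test set is disjoint from the training set

def predict(x, train_data):
    # 1-NN: return the label of the closest training point.
    return min(train_data,
               key=lambda p: (p[0][0] - x[0])**2 + (p[0][1] - x[1])**2)[1]

def accuracy(data, train_data):
    return sum(predict(x, train_data) == y for x, y in data) / len(data)

# Every point of B is in the training set, so 1-NN finds the point itself
# (distance 0) and reproduces its label exactly:
print(accuracy(validation, train))  # 1.0 for a memorising model like 1-NN
print(accuracy(test, train))        # honest estimate on unseen data
```

With a classifier that memorises less aggressively than 1-NN the validation score on B would not be exactly 1.0, but it would still be optimistically biased, because the model was fit on those same samples.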
So, my questions are:
What is the effect of using B as the validation set? What result should I expect for the metric on B? I think it should be close to perfect classification, since the model has already seen those samples during training. Am I right?
Sorry if this is a silly question; I'm quite lost. Thanks!