
I have been reading several posts on testing multiple models on the same dataset, which can lead to problems with controlling type-1 errors. Mostly these posts concern data mining on big datasets: how to draw valid conclusions from big data and how to ensure that the test data doesn't influence training.

However, I know that this is done frequently. In fact, in my old job it was my task to find the best (logit) model given a dataset. To decide whether a model is predictive or not, you have to validate it against the test data. If the performance is poor, you start from zero and create a new model. By the end you may have checked the test dataset dozens of times.

I was recently asked if multiple re-sampling of the dataset would be a possible solution. I wanted to say 'no,' but I don't actually know why this is bad. To give a specific example:

Suppose I am looking for the best linear regression given a dataset of 1,000 observations. I split the data into training and test sets. I formulate a model, which turns out to be unsatisfactory on the test sample. So I redistribute the 1,000 observations into new training/test samples and attempt to find a new model. Each model is trained and tested on its own specific training/test split, all of which come from the same 1,000 original observations.
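
To make the procedure concrete, here is a minimal sketch in R (toy data and an arbitrary acceptance threshold; in reality a different model specification would be tried on each attempt):

set.seed(123)
# toy stand-in for the 1,000 observations
obs <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))

for (attempt in 1:50) {
  # fresh random 80:20 split of the same 1,000 observations on every attempt
  i.train <- sample(nrow(obs), size = 0.8 * nrow(obs))
  train <- obs[i.train, ]
  test  <- obs[-i.train, ]

  # in reality a different model would be formulated each time;
  # a single fixed formula keeps the sketch short
  model <- lm(y ~ x1 + x2, data = train)

  rmse <- sqrt(mean((test$y - predict(model, test))^2))
  if (rmse < 1.0) break   # arbitrary "satisfactory" threshold
}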

My question: Why is this incorrect? What are the problems that are created with this methodology?

Drew75

2 Answers


I think the problem is that you train (or rather: optimize) using your "test" set. In other words, you can do this, but then you need an additional independent test set for the final validation, or a nested validation setup from the beginning.
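
A sketch of such a set-up (the split fractions and object names are arbitrary placeholders):

set.seed (1)
# toy stand-in for the full data set (illustrative only)
alldata <- data.frame (y = rnorm (1000), x = rnorm (1000))

# lock away a final test set before any model building starts
i.final    <- sample (nrow (alldata), size = 0.2 * nrow (alldata))
final.test <- alldata [ i.final, ]
devel      <- alldata [-i.final, ]

# all of the repeated train/"test" splitting then happens inside devel only
i.train  <- sample (nrow (devel), size = 0.8 * nrow (devel))
train    <- devel [ i.train, ]
opt.test <- devel [-i.train, ]

# the single model that is finally chosen is evaluated once on final.test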

This is how I see the problem:

There can be combinations of training and test data (particular splits) where the model trained on that training data happens to work well with the given test set, regardless of how representative the test set is of the actual problem. Your strategy is in effect a search strategy that tries to find such combinations. As there is no guarantee that you'll encounter a genuinely satisfactory model before encountering one of these "fake satisfactory" models, there is trouble lurking.

Because you decide based on the test set performance whether to move on to a new model or not, your testing is not independent. I think this is related to the problems with other iterative model optimization approaches, where apparent increases in model quality also occur between models that are in fact equivalent.

Here's a simulation:

  • multivariate normally distributed data, sd = 1, for 25 variates; the first 4 are informative, with mean 0 for one class and mean 1 for the other.
  • 500 cases of each class in the data set, split 80:20 at random without replacement into training and "test" sets.
  • an independent test set with 50,000 cases per class.
  • repeat until an "acceptable" accuracy of 90% is reached according to the internal test set.

[Figure: simulation results. Circles: internal test set estimate; dots and whiskers: external independent test set with 95% CI (Agresti-Coull method); red line: cumulative maximum of the internal estimate.]

Your rule basically uses the cumulative maximum of the internal test set. In the example that means that within a few iterations you end up with an optimistic bias that claims 1/3 fewer errors than your models actually have. Note that the models here cannot be distinguished with a 200-case test set: the differences between the large external test set results are of the same order as the confidence interval width.
You can also see nicely what I mean by skimming variance: the internal test set estimate itself is unbiased. What causes the bias is doing a (potentially large) number of iterations and picking the maximum.
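
To see this skimming effect in isolation, here is a tiny stand-alone example (not part of the simulation above; just unbiased accuracy estimates from hypothetical 200-case test sets with a true accuracy of 0.85):

set.seed (42)
true.acc <- 0.85
# 50 internal test set estimates, each based on an independent 200-case test set
internal <- rbinom (50, size = 200, prob = true.acc) / 200

mean (internal)   # close to 0.85: each single estimate is unbiased
max (internal)    # clearly above 0.85: keeping the best split is optimistically biased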

Besides the optimization that is hidden in this procedure, the problem is of course the large variance of accuracy as a performance estimate. Other performance measures such as the Brier score have lower variance and thus do not lead to such serious overfitting as quickly.
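
For reference, the Brier score in the two-class case is just the mean squared difference between the predicted class-1 probability and the 0/1 label. A minimal sketch (the commented-out call assumes the lda posterior column for class 1 is labelled "1", as in the simulation code below):

# Brier score: mean squared difference between predicted probability and 0/1 outcome
brier <- function (truth, prob1){
  mean ((prob1 - as.numeric (truth == 1))^2)
}

# e.g. for the lda predictions in the simulation below:
# brier (test$class, pred$posterior [, "1"])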


The code of the simulation:

require ("binom")
require ("MASS")

set.seed (seed=1111)

# generate n cases of one class: p standard normal variates,
# with the informative ones shifted by the class label
randomdata <- function (class, n, p = 25, inf = 1:4){
  x <- matrix (rnorm (n * p), nrow = n)
  x [, inf] <- x [, inf] + class

  data.frame (class = class, X = I (x))
}


# 500 cases per class for model building, 50 000 per class as independent reference test set
data <- rbind (randomdata (class = 0, n = 500), 
               randomdata (class = 1, n = 500)) 
indeptest <- rbind (randomdata (class = 0, n = 5e4), 
                    randomdata (class = 1, n = 5e4)) 

internal.acc <- rep (NA, 100)
external.acc <- rep (NA, 100)

for (i in 1 : 100){
  # fresh random 80:20 split of the same 1000 cases in every iteration
  i.train <- sample (nrow (data), size = nrow (data) * .8, replace = FALSE)
  train <- data [ i.train, ]
  test  <- data [-i.train, ]

  model <- lda (class ~ X, train)

  pred <- predict (model, test)
  indep.pred <- predict (model, indeptest)

  # accuracy on the internal "test" split and on the large independent test set
  internal.acc [i] <- sum (diag (table (reference = test$class, prediction = pred$class))) / nrow (test)
  external.acc [i] <- sum (diag (table (reference = indeptest$class, prediction = indep.pred$class))) / nrow (indeptest)

  # stop as soon as the internal estimate looks "acceptable"
  if (internal.acc [i] >= 0.9) break
  cat (".")
}

# keep only the iterations that were actually run
internal.acc <- internal.acc [1 : i]
external.acc <- external.acc [1 : i]

plot (internal.acc, ylab = "accuracy", xlab = "iteration")
points (external.acc, pch = 20)
lines (cummax (internal.acc), col = "red")

# Agresti-Coull 95 % confidence intervals for the external test set accuracies
ci <- binom.agresti.coull (external.acc * nrow (indeptest), nrow (indeptest))
segments (x0 = seq_along (external.acc), x1 = seq_along (external.acc), y0 = ci$lower, y1 = ci$upper)
cbeleites unhappy with SX

The idea that an out-of-bag estimate may be useful in choosing a model is standard practice and sounds reasonable.

I think a fair number of people will get hung up on the practical issues with your method: (1) What is the motivation? Or is this just theoretical? It seems like this would be far more complicated than just using standard model building approaches. (2) The model building process you describe isn't how many might use validation. Usually it is used either to assess whether one has overfit the data or to compare completely different modelling approaches. I'm not convinced that the use of validation as part of an iterative model building process makes sense for a logistic model, though this practice may be common in other models (or even built into the model, as in MARS/EARTH). (3) You're using split-sample validation in a fairly small sample, so the estimates are likely to be unreliable. You may want to increase n to 10,000 or 20,000 to get better answers to the question. (4) As you start making adjustments to your method as a result of (3) above, you'll find you're describing LOOCV or K-fold validation.
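
For illustration, a bare-bones K-fold cross-validation of a single linear model could look like this (toy data; just a sketch of the mechanics, not a full modelling strategy):

set.seed(123)
d <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))

k <- 10
fold <- sample(rep(1:k, length.out = nrow(d)))   # random fold assignment
cv.rmse <- numeric(k)

for (j in 1:k) {
  fit  <- lm(y ~ x1 + x2, data = d[fold != j, ])
  pred <- predict(fit, newdata = d[fold == j, ])
  cv.rmse[j] <- sqrt(mean((d$y[fold == j] - pred)^2))
}

mean(cv.rmse)   # every observation is used for testing exactly once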

charles
  • I'm not sure you read the whole question. This is a theoretical question, and I'm more interested in why (or why not) this method is incorrect, not in alternatives. Also, the example is for linear regression, but that is not important. – Drew75 Nov 09 '13 at 07:11
  • To me the issue is split-sample validation. I'd look at the MARS model (http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines). Again, I might have misunderstood your intentions, but it takes your modeling approach and applies an appropriate validation method. This is a fairly common approach, but not with split-sample validation. – charles Nov 12 '13 at 18:13