While bumbling through cross-validation procedures for fitting random forests, I kept running into errors of this sort (in R):
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor <some.factor> has new levels
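For concreteness, here is a minimal reproduction with a toy data frame (the variable names are illustrative, and glm() stands in for brevity; the error comes from the same model.frame() check that trips up other formula-interface models):
# minimal reproduction of the "new levels" error (toy data)
train <- data.frame(y = c(0, 1, 0, 1), f = factor(c("a", "b", "a", "b")))
test <- data.frame(y = 0, f = factor("c"))
mod <- glm(y ~ f, data = train, family = "binomial")
# predict(mod, test)
# Error in model.frame.default(...) : factor f has new levels c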
As it happened, I had many high-cardinality categorical features and only a few hundred observations.
In this excellent answer, the recommendation is to use a clever stratification technique when creating folds, which definitely makes sense. But in testing my modelling approach, I was far more interested in whether model performance would be stable across different training-data configurations than in how a specific model, with specific hyperparameters, performs (I was doing nested cross-validation).
Inspired by this answer, I wrote the following functions to "solve" my problems, evaluating AUC with repeated k-fold CV.
# subset a data.frame for categorical variables
#' Indices for categorical data
#'
#' @param x `data.frame`
#'
#' @return Logical vector of `length(x)`, `TRUE` for categorical, `FALSE`
#'   otherwise
#' @export
#'
#' @examples
#' set.seed(123)
#' dat <- data.frame(
#'   y = sample(1:2, 5, replace = TRUE)
#'   , x1 = 1:5
#'   , x2 = letters[1:5]
#'   , x3 = LETTERS[1:5]
#' )
#'
#' is_categorical(dat)
is_categorical <- function(x) {
  # a column counts as categorical if it is a factor or a character vector
  vapply(x, function(y) is.factor(y) || is.character(y), logical(1))
}
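For the example data in the roxygen block, this returns a named logical vector (the integer columns y and x1 are not flagged):
is_categorical(dat)
#     y    x1    x2    x3
# FALSE FALSE  TRUE  TRUE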
# fix the model by blindly adding unseen factor levels
#' Add unseen levels to fit object
#'
#' @param mod model object
#' @param test.dat `data.frame`, the `newdata` on which to predict values
#'   from `mod`
#'
#' @return revised model object without missing factor levels
#' @export
#'
#' @examples
#'
#' # make data
#' set.seed(123)
#' dat <- data.frame(
#'   y = sample(0:1, 5, replace = TRUE)
#'   , x1 = 1:5
#'   , x2 = letters[1:5]
#'   , x3 = LETTERS[1:5]
#' )
#'
#' # split into test and train
#' test <- dat[1, ]
#' train <- dat[2:5, ]
#'
#' # fit model on training data
#' my_mod <- glm(y ~ ., data = train, family = "binomial")
#'
#' # predict throws an error
#' # predict(my_mod, test)
#'
#' # after the fix, predict does not throw an error
#' predict(fix.mod(my_mod, test), test)
#'
fix.mod <- function(mod, test.dat) {
  v.names <- is_categorical(test.dat)
  # union in any unseen levels; subset by the names stored in mod$xlevels so
  # Map() pairs each set of levels with the right variable
  lvls <- lapply(test.dat[v.names], unique)
  mod$xlevels <- Map(union, mod$xlevels, lvls[names(mod$xlevels)])
  mod
}
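To see what this is actually patching: for the toy model in the example above, mod$xlevels is just a named list of the factor levels seen at fit time, and fix.mod() unions in whatever levels appear in the test data.
# illustrative: inspect the levels stored on the toy model from the example
str(my_mod$xlevels)
# List of 2
#  $ x2: chr [1:4] "b" "c" "d" "e"
#  $ x3: chr [1:4] "B" "C" "D" "E"
# after fix.mod(my_mod, test), "a" and "A" are added to these sets, so the
# model.frame() check no longer flags them as new levels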
Then, each time I ran predict(), I would "fix" the model first so that a new factor level in one of my many categorical features would not throw an error. My intuition is that this was actually a fairly conservative way to assess the variance of AUC, and that stratified sampling would be less conservative (it would imply more stability than I could fairly expect, which seems to be echoed elsewhere on CV, e.g. here).
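To make the usage concrete, here is a minimal sketch of the kind of repeated k-fold CV loop I mean, using glm() from the toy example in place of my actual model; the cv_auc name, the fold construction, and the rank-based AUC computation are illustrative assumptions, not code from my real pipeline:
# sketch (hypothetical): repeated k-fold CV, "fixing" the model before each
# predict() call; folds are drawn completely at random (no stratification)
cv_auc <- function(dat, k = 5, repeats = 10) {
  replicate(repeats, {
    folds <- sample(rep(seq_len(k), length.out = nrow(dat)))
    sapply(seq_len(k), function(i) {
      train <- dat[folds != i, ]
      test <- dat[folds == i, ]
      mod <- glm(y ~ ., data = train, family = "binomial")
      # add unseen factor levels so predict() does not error on this fold
      p <- predict(fix.mod(mod, test), test, type = "response")
      # rank-based (Mann-Whitney) AUC, assuming y is coded 0/1
      pos <- p[test$y == 1]
      neg <- p[test$y == 0]
      mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
    })
  })
}
# cv_auc(dat) returns a k x repeats matrix of fold AUCs; its spread is the
# stability I was trying to assess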
Is there a hole in this thinking? Are there great risks in blindly "fixing" my fitted models before making predictions this way? Why isn't something like fix.mod() integrated into predict() methods in R already (or is it?)?