While bumbling through cross-validation procedures for fitting random forests, I kept running into errors of this sort (in R):
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor <some.factor> has new levels
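For concreteness, here is a minimal reproduction with a toy data frame (the variable names are illustrative, and glm() stands in for brevity; the error comes from the same model.frame() check that trips up other formula-interface models):
# minimal reproduction of the "new levels" error (toy data)
train <- data.frame(y = c(0, 1, 0, 1), f = factor(c("a", "b", "a", "b")))
test <- data.frame(y = 0, f = factor("c"))
mod <- glm(y ~ f, data = train, family = "binomial")
# predict(mod, test)
# Error in model.frame.default(...) : factor f has new levels c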
As it happened, I had many high-cardinality categorical features and only a few hundred observations.
In this excellent answer, the recommendation is to use a clever stratification technique when creating folds, which definitely makes sense. But in testing my modelling approach, I was far more interested in whether model performance would be stable across different training-data configurations than in how a specific model, with specific hyperparameters, performs (I was doing nested cross-validation).
Inspired by this answer, I wrote the following functions to "solve" my problems, evaluating AUC with repeated k-fold CV.
# subset a data.frame for categorical variables
#' Indices for categorical data
#'
#' @param x `data.frame`
#'
#' @return Logical vector of `length(x)`, `TRUE` for categorical, `FALSE`
#'   otherwise
#' @export
#'
#' @examples
#' set.seed(123)
#' dat <- data.frame(
#'   y = sample(1:2, 5, replace = TRUE)
#'   , x1 = 1:5
#'   , x2 = letters[1:5]
#'   , x3 = LETTERS[1:5]
#' )
#'
#' is_categorical(dat)
is_categorical <- function(x) {
  # a column counts as categorical if it is a factor or a character vector
  vapply(x, function(y) is.factor(y) || is.character(y), logical(1))
}
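For the example data in the roxygen block, this returns a named logical vector (the integer columns y and x1 are not flagged):
is_categorical(dat)
#     y    x1    x2    x3
# FALSE FALSE  TRUE  TRUE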
# fix the model by blindly adding unseen factor levels
#' Add unseen levels to fit object
#'
#' @param mod model object
#' @param test.dat `data.frame`, the `newdata` on which to predict values
#'   from `mod`
#'
#' @return revised model object without missing factor levels
#' @export
#'
#' @examples
#'
#' # make data
#' set.seed(123)
#' dat <- data.frame(
#'   y = sample(0:1, 5, replace = TRUE)
#'   , x1 = 1:5
#'   , x2 = letters[1:5]
#'   , x3 = LETTERS[1:5]
#' )
#'
#' # split into test and train
#' test <- dat[1, ]
#' train <- dat[2:5, ]
#'
#' # fit model on training data
#' my_mod <- glm(y ~ ., data = train, family = "binomial")
#'
#' # predict throws an error
#' # predict(my_mod, test)
#'
#' # after the fix, predict does not throw an error
#' predict(fix.mod(my_mod, test), test)
#'
fix.mod <- function(mod, test.dat) {
  v.names <- is_categorical(test.dat)
  # union in any unseen levels; subset by the names stored in mod$xlevels so
  # Map() pairs each set of levels with the right variable
  lvls <- lapply(test.dat[v.names], unique)
  mod$xlevels <- Map(union, mod$xlevels, lvls[names(mod$xlevels)])
  mod
}
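To see what this is actually patching: for the toy model in the example above, mod$xlevels is just a named list of the factor levels seen at fit time, and fix.mod() unions in whatever levels appear in the test data.
# illustrative: inspect the levels stored on the toy model from the example
str(my_mod$xlevels)
# List of 2
#  $ x2: chr [1:4] "b" "c" "d" "e"
#  $ x3: chr [1:4] "B" "C" "D" "E"
# after fix.mod(my_mod, test), "a" and "A" are added to these sets, so the
# model.frame() check no longer flags them as new levels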
Then, each time I ran predict(), I would "fix" the model first so that a new factor level in one of my many categorical features would not throw an error. My intuition is that this was actually a fairly conservative way to assess the variance of AUC, and that stratified sampling would be less conservative (it would imply more stability than I could fairly expect, which seems to be echoed elsewhere on CV, e.g. here).
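To make the usage concrete, here is a minimal sketch of the kind of repeated k-fold CV loop I mean, using glm() from the toy example in place of my actual model; the cv_auc name, the fold construction, and the rank-based AUC computation are illustrative assumptions, not code from my real pipeline:
# sketch (hypothetical): repeated k-fold CV, "fixing" the model before each
# predict() call; folds are drawn completely at random (no stratification)
cv_auc <- function(dat, k = 5, repeats = 10) {
  replicate(repeats, {
    folds <- sample(rep(seq_len(k), length.out = nrow(dat)))
    sapply(seq_len(k), function(i) {
      train <- dat[folds != i, ]
      test <- dat[folds == i, ]
      mod <- glm(y ~ ., data = train, family = "binomial")
      # add unseen factor levels so predict() does not error on this fold
      p <- predict(fix.mod(mod, test), test, type = "response")
      # rank-based (Mann-Whitney) AUC, assuming y is coded 0/1
      pos <- p[test$y == 1]
      neg <- p[test$y == 0]
      mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
    })
  })
}
# cv_auc(dat) returns a k x repeats matrix of fold AUCs; its spread is the
# stability I was trying to assess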
Is there a hole in this thinking? Are there great risks in blindly "fixing" my fitted models before making predictions this way? Why isn't something like fix.mod() integrated into predict() methods in R already (or is it?)?