In many applications of machine learning, the outcome vector is sparse - e.g. containing millions of 0s and a handful of 1s.
When the outcome vector is sparse, many of the training sets in k-fold cross-validation may exhibit no variation in the outcome. Some classifiers can still produce predictions in that case (e.g. k-nearest neighbors would trivially achieve perfect training classification at every value of k, since all neighbors share the same label), but other models (e.g. logistic regression with an intercept and some covariates) are not identified. When a model is not identified, there are no unique parameter values that satisfy the objective (maximization of the likelihood, minimization of the sum of squares, etc.). Statistical software will throw an error, leaving you with neither a fitted model to predict from nor an error estimate for model comparison.
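To see how easily this happens, here is a minimal sketch in plain Python (the counts and seed are arbitrary, chosen for illustration): with a single positive outcome among 1,000 observations, one of the ten training sets in 10-fold CV is guaranteed to contain no positives at all, since the training set that excludes the fold holding the lone positive has an all-zero outcome.

```python
import random

random.seed(0)
n, k = 1000, 10

# Extremely sparse outcome: one 1, the rest 0s.
y = [0] * n
y[random.randrange(n)] = 1

# Plain (unstratified) k-fold assignment.
idx = list(range(n))
random.shuffle(idx)
folds = [idx[i::k] for i in range(k)]

# Count training sets with no variation in the outcome.
degenerate = 0
for j in range(k):
    train = [i for f, fold in enumerate(folds) if f != j for i in fold]
    if sum(y[i] for i in train) == 0:  # no positives left to learn from
        degenerate += 1

print(degenerate)  # → 1: the training set complementing the fold with the positive
```

With only a handful of positives the same logic applies whenever all of them happen to land in one fold, which is exactly the situation where an intercept-plus-covariates logistic regression fails to fit.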
What are principled ways to deal with this sparsity and what are the consequences of the solutions?