2

In many applications of machine learning, the outcome vector is sparse: it may contain millions of 0s and only a handful of 1s.

When the outcome vector is sparse, many of the training sets in k-fold cross-validation may exhibit no variation in the outcome. While some classifiers may still produce predictions when there is no variation in the outcome (e.g. k-nearest neighbors would have perfect classification at all values of K), other models (e.g. logistic regression with an intercept and some covariates) are not identified. When a model is not identified, there are no unique parameter values that satisfy the objective (maximization of the likelihood, minimization of the sum of squares, etc.). Statistical software will throw an error, and you are left with neither a fitted model to predict with nor an error estimate for model comparison.

What are principled ways to deal with this sparsity and what are the consequences of the solutions?
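To make the failure mode concrete, here is a minimal pure-Python sketch (the data are made up for illustration): with a single positive outcome, plain unstratified k-fold necessarily produces one training set that contains no positives at all, and on that training set a logistic regression with an intercept is not identified.

```python
import random

random.seed(0)
n, k = 20, 10
y = [0] * n
y[random.randrange(n)] = 1   # a single positive outcome in the whole sample

# Plain (unstratified) k-fold: shuffle indices, deal them into k folds.
idx = list(range(n))
random.shuffle(idx)
folds = [set(idx[i::k]) for i in range(k)]

# Each training set is the complement of a held-out fold. Count training
# sets with zero positives: there, logistic regression is not identified.
bad = sum(1 for f in folds if sum(y[i] for i in range(n) if i not in f) == 0)
print(bad)  # 1: the fold holding the lone positive yields a degenerate training set
```

Whichever fold happens to hold the lone positive produces a training set with zero outcome variation, no matter how the shuffle comes out.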

user3614648
  • 319
  • 2
  • 9
  • 1
    Why not use Stratified K-fold? – usεr11852 Feb 01 '20 at 00:12
  • If you use regularisation, the solution will be identified? – seanv507 Feb 01 '20 at 00:23
  • @seanv507: Can you please explain your point a bit further, because as I see it regularisation is *not* a core aspect of the solution here? As I think the issue described is associated with the response vector being sparsely populated; the OP's data might have no multicollinearities, sparse explanatory features, etc. that would necessitate strong regularisation. (In general, I am a strong proponent of regularised models so I am curious what you meant!) – usεr11852 Feb 01 '20 at 00:37
  • I suspect an important misunderstanding about what cross validation (or other resampling schemes) is doing, from "While some classifiers may produce predictions when there is no variation in the outcome": a) any classifier will produce an outcome even for single test cases - if yours doesn't, something is amiss with your programming. b) and the performance can be anywhere: a kNN on a test set that comprises only one class may be wrong for all of them. – cbeleites unhappy with SX Feb 01 '20 at 15:20
  • @cbeleitessupportsMonica Consider the case of logistic regression with more parameters estimated than data points. Assume away any regularization to identify the model. I guess it is true that the classifier will produce an outcome even for single test cases. However, the model is not identified and we presumably don't get to the test case because we obtained a model with an infinite set of parameter values. The issue is NOT in the test set, it is in the training set. – user3614648 Feb 02 '20 at 02:33
  • @seanv507 If I'm not mistaken, adding a prior just means that we get the prior back in this case (the data contain no information) and so we don't even need the data for that model. Alternatively, it doesn't give a posterior - i'm actually not sure if that should identify the model, since the distribution of the likelihood is then improper. – user3614648 Feb 02 '20 at 02:41
  • Can you please define the problem a bit more carefully? I think that if you are focusing on a "sparse outcome" then it is an imbalanced problem (or a zero-inflated one in the context of regression). Just to be clear, sparsity usually refers to the vector of estimated coefficients (i.e. $Ax=b$, $x$ is mostly zeros, with few non-zero entries); this extends somewhat naturally to Sparse PCA as well as $n \ll p$ regression problems. – usεr11852 Feb 02 '20 at 10:34
  • I am considering logistic regression etc. If you have a single outcome, i.e. 100% or 0%, then the solution is undefined: the intercept needs to be +/- infinity. If you also regularise the intercept, then you have a solution; see Elements of Statistical Learning (or Bishop's book). Wrt ESL, one way to notice this is that the problem is equivalent to finding optimal parameters in a fixed neighbourhood of zero... so infinity is excluded. – seanv507 Feb 02 '20 at 18:46
  • I am not sure whether uniqueness is also relevant to your problem, but again regularisation will give you a unique solution, because of all possible solutions there will be one of minimum norm. – seanv507 Feb 02 '20 at 18:47
  • Going to leave the question open for now since people are having a hard time staying focused and no one is close to providing an answer. Matrices and vectors can be sparse and computer scientists and statistician using numerical analysis couldn't care less about whether it is a vector of coefficients or a vector of cats. Since people are obsessed with a "practical solution" of regularization when I am asking about a specific set of models that are not identified. I understand you don't understand uniqueness, but then perhaps you aren't understanding "identification" more generally. – user3614648 Feb 03 '20 at 13:25

1 Answer

1

The simplest thing to try is using stratification. In the case of $k$-fold cross-validation this will ensure that each fold has (approximately) the same percentage of samples of each class as the original sample.
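A minimal pure-Python sketch of the idea (in practice scikit-learn's `StratifiedKFold` does this for you; the function name and toy data here are mine): deal each class's indices round-robin into the folds, so every fold preserves the class proportions of the full sample.

```python
import random
from collections import defaultdict

def stratified_folds(y, k, seed=0):
    """Deal each class's indices round-robin into k folds so every fold
    keeps (approximately) the class proportions of the full sample."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for label, idx in by_class.items():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

# Toy sparse outcome: 95 negatives, 5 positives, 5 folds.
y = [0] * 95 + [1] * 5
folds = stratified_folds(y, k=5)
print([sum(y[i] for i in f) for f in folds])  # [1, 1, 1, 1, 1]
```

Every fold now holds exactly one positive, so every training set holds four and retains outcome variation. (As the comments below note, this can still be impossible when the positives are fewer than the folds.)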

A somewhat more involved solution would be to use over-sampling, under-sampling, or a synthetic sample generation procedure like SMOTE or ROSE. If done carefully (i.e. we ensure that synthetic examples are only used during training, that our test examples still represent the class balance observed in the real data, and that our metric is relevant for what we want), it can be quite helpful. Please notice that class imbalance in itself is not a huge problem; CV.SE has a few great threads on the matter: When is unbalanced data really a problem in Machine Learning? and What is the root cause of the class imbalance problem? are great for a start.
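As a sketch of the over-sampling idea (plain random duplication rather than SMOTE's synthetic interpolation; the function name and toy data are mine), applied to the training split only so the test fold keeps the real class balance:

```python
import random

def oversample_training(X, y, seed=0):
    """Randomly duplicate minority-class rows of a TRAINING split until the
    classes balance. The held-out test fold is left untouched."""
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy training split: 8 negatives, 2 positives.
X = [[v] for v in range(10)]
y = [0] * 8 + [1] * 2
Xb, yb = oversample_training(X, y)
print(sum(yb), len(yb) - sum(yb))  # 8 8: classes are now balanced
```

Note that this only helps identification when the training split contains at least one example of each class; duplication cannot conjure a class that is entirely absent.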

usεr11852
  • 33,608
  • 2
  • 75
  • 117
  • Thanks, I hadn't heard of stratified k-fold. I'm not sure if the unbalanced data problem is the same. Many classifiers (e.g. KNN) produce a unique classification for each observation even when there is zero outcome variation, but e.g. logistic regression won't produce a unique classification if the number of observations is smaller than the number of parameters or there is zero variation in the outcome (consider a logistic regression with an intercept and one covariate). In that case, it doesn't make sense to talk about "poor performance." When outcomes are sparse, CV is likely to lead to such problems. – user3614648 Feb 02 '20 at 02:54
  • Additional thought: stratified k-fold doesn't address this problem because it may be mathematically impossible for a given K to create folds with given means. I know we could just throw out those folds, but do we lose the theoretical properties of CV in that case? – user3614648 Feb 02 '20 at 03:15
  • Stratified $k$-fold is relevant for the classification (so why "means"?). It is not really relevant for regression. For regression, there are particular models like zero-inflated/hurdle models that address a similar problem. Going back to logistic regression, the fact we won't have a "unique solution" is not a huge problem from a prediction perspective. In most cases, regularisation will help a lot and similarly despite the Hauck-Donner effect (e.g. https://stats.stackexchange.com/questions/45803, https://stats.stackexchange.com/questions/11109) Logistic Regression predictions are accurate. – usεr11852 Feb 02 '20 at 10:12
  • Of course, I agree that many classifiers will work fine with folds containing instances of only one class but do note that in those cases, the whole CV procedure's variance might be very large. I would advocate using Stratified K-fold mostly to save time (we want to do ~20 folds and the model is expensive to evaluate 100's of time) and/or if algorithmically this imbalance invalidates some of the tests. – usεr11852 Feb 02 '20 at 10:12
  • Zero-inflated and hurdle models are also not identified without variation in the outcome. The lack of a unique solution is a huge problem insofar as the model does not make a prediction, but rather an infinite number of predictions. While you can change the objective function or the parameters you are estimating to identify *a* model, it is then *not* the same model. I'm not talking about practical differences - this is strictly regarding statistical theory. – user3614648 Feb 03 '20 at 13:09