
Context

I've got an active learning problem with an event rate of about 1%. The data are a panel: individuals observed over time. We have a proxy label that is highly correlated with the true label within individuals, but not in time. In other words, if the proxy label is present for individual $i$, the true label is also present for $i$, but at a different time $t$.

We lack the true label, so we're paying experts to label for us. We'll split the data into training, validation, and test sets; fit models to the training set; choose between them on the validation set; and report performance on the test set.

As we label, we'll predict outcomes for the examples in our training set and surface for labeling those about which the model is most uncertain.
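For concreteness, the selection step I have in mind is plain uncertainty sampling. Here is a minimal sketch, assuming a scikit-learn-style binary classifier (`most_uncertain`, `X_pool`, and `batch_size` are my hypothetical names):

```python
import numpy as np

def most_uncertain(model, X_pool, batch_size=20):
    """Return indices of the pool examples whose predicted probability
    of the positive class is closest to 0.5, i.e. maximally uncertain."""
    proba = model.predict_proba(X_pool)[:, 1]
    return np.argsort(np.abs(proba - 0.5))[:batch_size]
```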

For the training set, we're taking a stratified sample: 50% positive, 50% negative. However, the validation and test sets should ideally reflect the unsampled population, which is usually achieved by simple random sampling.
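In code, the balanced training draw could look something like this (a sketch assuming the labels sit in a pandas DataFrame; `balanced_sample` and the `label` column name are hypothetical):

```python
import pandas as pd

def balanced_sample(df, label_col="label", n_per_class=100, seed=0):
    """Draw an equal number of positive and negative examples."""
    return (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(n=min(n_per_class, len(g)), random_state=seed))
    )
```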

Problem

In our case, given the low event rate, we'd have fewer than 10 positive examples in each of the validation and test sets. Given the high dimensionality, it's near-certain that these ~10 observations would differ substantially between the two sets on at least a few dimensions. That's not ideal for model selection, and not ideal for estimating generalization performance.

Potential Solution

To solve this, perhaps I could artificially balance (stratify) the validation and test sets as well. To report performance on the overall population, I could then compute weighted versions of whatever accuracy measures we use, with the weights chosen so that these metrics provide an unbiased estimate of population accuracy.
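To make "weighted versions" concrete: because the sets are stratified on the label, each example can be weighted by (population share of its class) / (sample share of its class), i.e. inverse-probability weighting, and any metric that accepts sample weights then estimates population performance. A minimal sketch (the 1% rate and all names are placeholders):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def stratum_weights(y_true, pop_pos_rate=0.01):
    """Inverse-probability weights for a label-stratified sample:
    weight = (population share of the class) / (sample share of the
    class), so weighted metrics estimate population performance."""
    y_true = np.asarray(y_true)
    sample_pos_rate = y_true.mean()  # ~0.5 in a 50/50 balanced set
    w_pos = pop_pos_rate / sample_pos_rate
    w_neg = (1 - pop_pos_rate) / (1 - sample_pos_rate)
    return np.where(y_true == 1, w_pos, w_neg)

# toy illustration on a fake balanced test set
y_test = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1])
w = stratum_weights(y_test, pop_pos_rate=0.01)
print(accuracy_score(y_test, y_pred, sample_weight=w))
print(precision_score(y_test, y_pred, sample_weight=w))
print(recall_score(y_test, y_pred, sample_weight=w))
```

With a 50/50 test set and a 1% population rate, negatives get weight 1.98 and positives 0.02, so errors on negatives dominate the weighted metrics, as they would in the population.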

HELP!

While this solution seems simple and intuitive to me, two questions remain:

  1. I lack citations. Has this been done before? My collaborators are worried about being criticized for doing something that isn't considered standard.
  2. Is there a better way to approach this problem?
