Train/dev/test split with limited and skewed positive labels

Question

(Because of the sensitive nature of the actual project, I am using an analogy here. I hope it's clear, if not, please let me know!)

My goal is to classify images as cats or dogs (binary classification). I have a large dataset with images of cats and dogs. We know cats and dogs both come in the same 10 different colors. Our dataset contains many, many examples of dogs in all colors, but relatively few (let's say 0.1%) examples of cats, and those in only 3 colors.

Our model, in real life, will encounter images of cats of all colors. Additionally, we expect about 5% of images to be a cat.

How can I prepare my train, development, and test sets to make sure a model can learn to generalize to recognize cats of all colors?

Following up from comments below. It appears that you only care about cat/dog classification (the color is just to show that there's a spurious correlation in the available data). And 99.9% of all the data are dogs (of some color), but what you care about identifying are the cats? Is this all correct? — TravisJ, Jul 27 '21 at 12:59
It is difficult to discuss in analogies when problems are difficult. I posted something like this a few days ago when someone else had to be coy about their actual problem: your best bet might be to vet a statistician for suitability to see sensitive information, have her sign a non-disclosure agreement, and work behind closed doors. The shame of this is that your problem with sampling bias is an interesting one that I would like to see discussed here. — Dave, Jul 27 '21 at 14:09
Note, however, that [class imbalance is not inherently a problem, since you are modeling probabilities, not hard classifications.](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he) — Dave, Jul 27 '21 at 14:16
@Dave If this question doesn't lead anywhere, I'll try to reframe to make things more transparent. Thanks for your advice! — stinodego, Jul 27 '21 at 16:01

score 0 · Accepted Answer · answered Jul 27 '21 at 21:46

I'm speaking under the following assumptions:

The data is such that 99.9% of the data is dogs and the remaining is cats.
There is some color (say red) for which every example you have of that color is a dog. In principle cats could be red, but you don't have any such examples.
The concern is that, because of the lack of red-cat examples, a model could (probably would) learn that anything red is a dog.
The goal is to correctly identify cats (possibly with extra emphasis on red cats); presumably missing a cat is really bad.

There are a few things you could try (I make no guarantees that any will work, but these are the things I would try, in approximately the order I would try them).

Find or create "red cat" examples and add those to your dataset. Depending on the "real" application, it may be really hard to find or create such examples, and it could be expensive (in terms of time/energy). But if it's possible, this is probably the most likely route to success. Additionally, by finding/creating some examples you'll be able to quantitatively evaluate any model you produce. Without any "red cat" examples, you won't be able to get an accurate performance estimate.
Remove the "red" bias by removing all examples of red dogs from your dataset. Then partition your train/val/test splits randomly (fractions p, q, and 1-p-q of the data respectively; you can assign each example to a bucket at random with those probabilities). The advantage of doing this is that there is no longer a spurious red correlation. The disadvantage is that now not only are "red cats" outside your training distribution, but "red dogs" are too. This will likely degrade the performance on red dogs (previously you probably would have called everything red a dog, so you'd have been 100% correct on red dogs) and possibly improve your performance on "red cats."
If identifying "red cats" is more important than just identifying cats, then I'd consider training two models: one to recognize dog/cat, one to classify color. I'd train the cat/dog classifier as in the previous approach (removing all red examples so as not to include the red bias in cat/dog prediction). I might try training the color classifier in two ways: once with all the training data (p fraction), once equally balanced between cat and dog examples. I'd also choose a model (for the color classifier) that is extremely simple (very little "capacity") so that this second model would struggle to learn features useful for cat/dog classification (try to get it to focus on color, which, from experience, is a relatively simple task). At inference time, use both models. Hopefully, the first will at least be good at detecting cats and the second will be good at identifying color (in case "red cats" are extra important). Note: if your real task is not color, this simple model may not work as well.
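
The "remove the red examples, then split at random" step can be sketched in a few lines of Python. This is only an illustration under assumptions of mine: examples are dicts with hypothetical `label` and `color` keys, `filter_and_split` is a made-up helper name, and the toy data and the values of p and q are arbitrary.

```python
import random

def filter_and_split(examples, biased_color, p, q, seed=0):
    """Drop every example of `biased_color`, then assign each remaining
    example to train/val/test with probabilities p, q and 1 - p - q."""
    if not (0 < p and 0 < q and p + q < 1):
        raise ValueError("need p > 0, q > 0 and p + q < 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for ex in examples:
        if ex["color"] == biased_color:
            continue  # the color that only ever co-occurs with one label
        u = rng.random()
        if u < p:
            splits["train"].append(ex)
        elif u < p + q:
            splits["val"].append(ex)
        else:
            splits["test"].append(ex)
    return splits

# Toy data (hypothetical): 3000 dogs in three colors, 30 cats in two colors.
examples = [{"label": "dog", "color": c}
            for c in ("red", "black", "white") for _ in range(1000)]
examples += [{"label": "cat", "color": c}
             for c in ("black", "white") for _ in range(15)]

splits = filter_and_split(examples, biased_color="red", p=0.7, q=0.15)
```

With so few cats, a purely random assignment can leave very few of them in the val and test buckets. Splitting cats and dogs separately (stratifying by label) is a common variation, though it is not something the answer above prescribes.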

If those don't work, it's going to be a stretch.

Get creative. Perhaps train some variation of an autoencoder on just the dog images. Characterize the embeddings the autoencoder generates so that you have a good understanding of where in that space the "dog" examples lie. Then hope that any "cat" examples appear as outliers in the embedding space, and use the autoencoder as an "anomaly detector" (cats being the anomaly).
Google anomaly detectors and see if you can make anything fit for your application. You might try this sooner than as a last resort, but in my experience, most of the ML literature is either hard to reproduce, or not nearly as broadly applicable as advertised (so I've had little success adapting a lot of things to real datasets). Which is why this step isn't (always) my first step.
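
As a rough illustration of "characterize where the dog embeddings lie, then flag outliers", here is a minimal Python sketch. It does not train an autoencoder: it assumes the embeddings already exist (from some encoder not shown here) as lists of floats, and it swaps in a simple per-dimension standardized distance as the outlier score. The function names, the 0.99 quantile, and the synthetic data are all assumptions for the example.

```python
import math
import random
import statistics

def fit_dog_profile(dog_embeddings):
    """Per-dimension mean and standard deviation of the dog embeddings."""
    dims = list(zip(*dog_embeddings))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) or 1.0 for d in dims]  # guard against zero spread
    return means, stds

def anomaly_score(embedding, profile):
    """Distance from the dog centroid, with each dimension scaled by its std."""
    means, stds = profile
    return math.sqrt(sum(((x - m) / s) ** 2
                         for x, m, s in zip(embedding, means, stds)))

def threshold_from_dogs(dog_embeddings, profile, quantile=0.99):
    """Score below which roughly `quantile` of the known dogs fall."""
    scores = sorted(anomaly_score(e, profile) for e in dog_embeddings)
    return scores[min(len(scores) - 1, int(quantile * len(scores)))]

# Synthetic stand-in for real embeddings: dogs cluster near the origin,
# and the hypothetical "cat" sits far away from that cluster.
rng = random.Random(0)
dogs = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(500)]
cat = [6.0, 6.0, 6.0, 6.0]

profile = fit_dog_profile(dogs)
threshold = threshold_from_dogs(dogs, profile)
is_anomaly = anomaly_score(cat, profile) > threshold
```

Whether real cats land outside the dog cluster depends entirely on the encoder and the data; the synthetic cat here is placed far away by construction, so this only demonstrates the mechanics of scoring and thresholding.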

Thank you for the well thought-out response, @TravisJ! Your point 1 is an interesting take that I hadn't considered. Taking anomaly detection as a starting point is also definitely a worthwhile approach. I will accept this answer as it is clearly high-effort, relevant, and the best one so far :) — stinodego, Jul 29 '21 at 17:55
