
I am training a binary classification model with about 8000 observations in the training set and 500 in the test set (the sets are dictated to me, so I can't modify the split). In the training set the class balance is roughly 2/3 labeled 0 and 1/3 labeled 1; in the test set the split is 50/50. When I evaluate the model on the test set, it predicts classes at about the training-set ratio (2/3 zeros, 1/3 ones). I'm wondering whether this is just because the model isn't well optimized yet, or whether there's some fundamental problem with having this difference between the training and test distributions. And if so, are there good ways to deal with a problem like that?

Edit: Some more information. I'm attempting to train a neural network on it, and the class distribution of the actual population we would eventually apply it to is unknown.

user2355903
  • Please say more about the type of model and whether the distributions of predictors also differ between train and test sets. Do you know what the split is in the overall population of interest that you would like to apply this model to? – EdM Sep 09 '20 at 18:47
  • I added some clarification, but as far as the distribution of predictors goes, it's somewhat hard to say. I'm working with fairly high-dimensional image data, so the predictors are hard to build distributions for, at least to my knowledge. Open to ideas about how to explore that, though. – user2355903 Sep 09 '20 at 22:59

1 Answer


The usual idea behind setting aside separate training and test sets is that they represent two independent samples from some underlying population of interest. With training and test sets this large showing such a wide disparity in class frequencies, that clearly isn't the case here.

My first reaction is that you should explore this by playing with subsets of your training set, chosen to have different class frequencies (a code sketch of this appears after the next paragraph). A search on this site for the related method of oversampling, however, suggests that the problem will turn out to be a poorly optimized model. This answer in particular is on point, saying in part (the entire answer is worth studying):

... if the model does not describe reality correctly, it will minimize the deviation from the most frequently observed type of samples.

That seems to describe your situation pretty well. If you knew the class frequencies in the population of interest, a case-weighting approach during training might help, but you don't. This also raises the question of how useful your test set will be for evaluating model performance: what if the class ratio in the population of interest is more like 10/1 instead of the 2/1 or 1/1 you are now working with? Besides getting a better-optimized model, it seems important to explore the class distribution in the population of interest.
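For concreteness, here is a minimal sketch of both ideas, assuming NumPy arrays `X_train`/`y_train` with 0/1 labels and a Keras-style model. The array names, subset sizes, and weights are my own illustration, not anything from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

def subset_with_ratio(X, y, pos_fraction, n_total, rng):
    """Draw a subset of (X, y) with a chosen fraction of positive labels."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_pos = int(round(pos_fraction * n_total))
    n_neg = n_total - n_pos
    keep = np.concatenate([
        # Sample with replacement only if a class has too few members.
        rng.choice(pos_idx, size=n_pos, replace=n_pos > len(pos_idx)),
        rng.choice(neg_idx, size=n_neg, replace=n_neg > len(neg_idx)),
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Retrain on subsets at several class ratios and compare test-set behavior:
# X_sub, y_sub = subset_with_ratio(X_train, y_train, pos_fraction=0.5,
#                                  n_total=4000, rng=rng)

# If the population class frequencies *were* known, case weights could
# reflect them; with Keras this is a dict passed to fit(). The weights
# here are purely illustrative:
# model.fit(X_sub, y_sub, epochs=10, class_weight={0: 1.0, 1: 2.0})
```

Watching how the test-set prediction ratio shifts across models retrained on these subsets should help separate the effect of the class prior from problems with the model itself.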

Also, as with any classification scheme, what is your tradeoff going to be between false-positive and false-negative class assignments? That should matter more than an accuracy score per se. A search on this site for misclassification cost will turn up a good deal of discussion of such considerations; a small cost-based thresholding sketch follows below.
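As one way to make that concrete: if the model outputs a probability p = Pr(y = 1), the rule that minimizes expected misclassification cost is to predict class 1 whenever p exceeds cost_fp / (cost_fp + cost_fn). A minimal sketch, with purely hypothetical costs:

```python
import numpy as np

def cost_threshold(cost_fp, cost_fn):
    """Expected-cost-minimizing threshold on p = Pr(y = 1).

    Predicting 1 costs (1 - p) * cost_fp in expectation;
    predicting 0 costs p * cost_fn. Predict 1 when the former is
    smaller, i.e. when p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

# Hypothetical costs: a missed positive is 5x as costly as a false alarm.
threshold = cost_threshold(cost_fp=1.0, cost_fn=5.0)   # 1/6 ~= 0.167
print(f"decision threshold: {threshold:.3f}")

# probs = model.predict(X_test).ravel()   # predicted Pr(y = 1)
# y_hat = (probs > threshold).astype(int)
```

The point is that the 0.5 default threshold is itself an implicit cost assumption; making the costs explicit is usually more informative than tuning for raw accuracy.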

EdM