Balanced training data but imbalanced sampling

Question

I am working on a classic multi-class classification problem: forest cover type prediction. After a quick look at the training set and some predictions on the test set, I found that the test set consists mainly of two cover types, while in the training set all 7 cover types are equally represented.

So the training set is "balanced" across all the cover types, but it does not reflect the "real" proportions of the different types in the test data. To me, this seems to be an "imbalance" in the sense of how the training set was sampled relative to the test data.

My questions are:

Does this sampling influence the accuracy of prediction? (positively or negatively?) Why?
If negatively, how to improve? (e.g. resampling, or adding weights on different types)

A priori, I think this sampling will decrease prediction accuracy, with more false predictions on the types that are less frequent in the test set, and I think it should be improved by resampling (uniform sampling over the test data, for example).

Any insight is welcome. Please say so even if you think my reasoning makes no sense.

Yes, it does affect things. I use this dataset as an example in one of my papers ( http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.7651&rep=rep1&type=pdf ) and show how dealing with the disparate priors can improve performance (even if you don't know what the operational priors actually are). See section 3.1 - the error rate goes down from about 40.5% to about 28.5% when the appropriate correction is made. — Dikran Marsupial, Jan 05 '22 at 20:20
One more question: could we say that requiring a balanced training set regardless of the nature of the test set is senseless? I ask because many blogs and comments on the Internet discuss "balance" without considering the test data (or maybe they just assume the classes are equally distributed). — sicheng mao, Jan 05 '22 at 23:29

score 1 · Answer 1 · answered Jan 05 '22 at 20:13

Does this sampling influence the accuracy of prediction? (positively or negatively?) Why?

Yes, oversampling can bias the predictions of the model, as I have argued in this answer. If the various types of forest cover have been artificially oversampled so that the frequencies in the training data do not match the frequencies in nature, then the probability estimates will suffer. This leads to worse out-of-sample metrics such as the Brier score.
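To make the Brier score point concrete, here is a minimal sketch with made-up numbers (not the forest cover data). It assumes a hypothetical two-class test set in which class 0 is three times as frequent as class 1, and compares a model that always outputs balanced probabilities with one whose probabilities match the true frequencies. The function name `brier_score` and the toy arrays are illustrative assumptions.

```python
import numpy as np

def brier_score(probs, labels):
    """Multi-class Brier score: mean squared distance between the
    predicted probability vector and the one-hot true label (lower is better)."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(labels)]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

# Hypothetical test labels: class 0 is three times as common as class 1.
labels = [0, 0, 0, 1]

# Probabilities reflecting a balanced (50/50) training distribution.
balanced = np.tile([0.5, 0.5], (4, 1))
# Probabilities reflecting the true test frequencies (75/25).
matched = np.tile([0.75, 0.25], (4, 1))

print(brier_score(balanced, labels))  # 0.5
print(brier_score(matched, labels))   # 0.375
```

Neither toy model uses any features, so the gap between the two scores comes only from how well the assumed class frequencies match the test set.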

If negatively, how to improve? (e.g. resampling, or adding weights on different types)

You'll need to hold out additional data in order to calibrate the model estimates, but that won't work if your training data is artificially oversampled. You need data which reflects the true frequencies.
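One simple way to account for differing class priors, as raised in the comments under the question, is to rescale the predicted probabilities by the ratio of target to training priors and renormalize. The sketch below assumes the target (test) frequencies are known or have been estimated from representative data. It is not necessarily the procedure used in the paper linked in the comments, and the function name `adjust_for_priors` and the three-class numbers are hypothetical.

```python
import numpy as np

def adjust_for_priors(probs, train_priors, target_priors):
    """Rescale class probabilities from a model trained under `train_priors`
    so that they reflect `target_priors`, then renormalize each row."""
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(target_priors, dtype=float) / np.asarray(train_priors, dtype=float)
    adjusted = probs * weights
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Hypothetical three-class example: balanced training set, skewed test set.
train_priors = np.array([1 / 3, 1 / 3, 1 / 3])
target_priors = np.array([0.6, 0.3, 0.1])  # assumed known or estimated

# One prediction from the model trained on balanced data.
probs = np.array([[0.3, 0.3, 0.4]])
adjusted = adjust_for_priors(probs, train_priors, target_priors)

print(probs.argmax(axis=1))     # class 2 before the correction
print(adjusted.argmax(axis=1))  # class 0 after the correction
```

The rescaling follows from Bayes' rule if only the class frequencies change between training and test data while the distribution of features within each class stays the same.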
