I am working on a classic multi-class classification problem: forest cover type prediction. After a quick look at the training set and some predictions on the test set, I found that the test set is mainly composed of two cover types, while in the training set all 7 cover types are equally represented.
So the training set is "balanced" across all the cover types, but it does not reflect the "real" proportions of the different types in the test data. To me, this looks like an "imbalance" in the sense that the training set is not a representative sample of the test data.
My questions are:
- Does this sampling affect prediction accuracy (positively or negatively)? Why?
- If negatively, how can I improve it (e.g. by resampling, or by adding weights to the different types)?
A priori, I think this sampling will decrease prediction accuracy, with more false predictions of the types that are less frequent in the test set. I also think it could be improved by resampling (for example, sampling uniformly from the test data).
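To make the "adding weights" option concrete, here is a minimal sketch of what I have in mind: rescale the predicted class probabilities by the ratio of the test proportions to the training proportions, then renormalize. The function name `reweight_probabilities`, the test proportions, and the example probabilities are all made up for illustration; in practice the test proportions would have to be estimated somehow, which is an assumption on my part.

```python
import numpy as np

def reweight_probabilities(proba, train_priors, target_priors):
    """Rescale class probabilities from a model trained under train_priors
    so that they reflect target_priors instead (hypothetical helper)."""
    proba = np.asarray(proba, dtype=float)
    ratio = np.asarray(target_priors, dtype=float) / np.asarray(train_priors, dtype=float)
    adjusted = proba * ratio
    # Renormalize so that each row sums to 1 again
    return adjusted / adjusted.sum(axis=1, keepdims=True)

n_classes = 7
# Balanced training set: every cover type has proportion 1/7
train_priors = np.full(n_classes, 1.0 / n_classes)
# Assumed test proportions (invented numbers, dominated by two types)
target_priors = np.array([0.40, 0.45, 0.05, 0.02, 0.03, 0.03, 0.02])

# One made-up prediction from a model trained on the balanced set
proba = np.array([[0.30, 0.25, 0.35, 0.02, 0.03, 0.03, 0.02]])
adjusted = reweight_probabilities(proba, train_priors, target_priors)

print(proba.argmax(axis=1))     # predicted class before reweighting
print(adjusted.argmax(axis=1))  # predicted class after reweighting
```

In this toy case the reweighting moves the prediction from a rare type to one of the two dominant types, which is the kind of effect I would expect the mismatch to have.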
Any insight is welcome. Please say so if you think my reasoning does not make sense.