I think one basic assumption of machine learning or parameter estimation is that unseen data come from the same distribution as the training set. However, in some practical cases the distribution of the test set is almost certain to differ from that of the training set.
Consider, for example, a large-scale multi-class classification problem that tries to classify product descriptions into about 17,000 classes. The training set has highly skewed class priors: some classes have many training examples, while others have only a few. Suppose we are given a test set with unknown class labels from a client, and we try to classify each product in the test set into one of the 17,000 classes using the classifier trained on the training set. The test set probably also has a skewed class distribution, but one that is likely very different from the training set's, since the two might be related to different business areas. If the two class distributions are very different, the trained classifier might not work well on the test set. This seems especially obvious with a Naive Bayes classifier.
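As a minimal sketch of why I expect this (my own toy illustration, not tied to any particular library): for a Bayes classifier the posterior is p(y|x) ∝ p(x|y) p(y), so the class priors estimated from the skewed training set enter every prediction.

```python
import numpy as np

# Toy illustration: the same class-conditional likelihoods p(x|y) can lead to
# different decisions depending on which class priors p(y) are plugged in.
log_likelihood = np.log(np.array([0.02, 0.03]))  # hypothetical p(x|y) for two classes
train_prior = np.array([0.95, 0.05])             # skewed priors estimated from training data
test_prior = np.array([0.50, 0.50])              # what the client's test set might look like

def posterior(log_lik, prior):
    unnorm = np.exp(log_lik + np.log(prior))
    return unnorm / unnorm.sum()

print(posterior(log_likelihood, train_prior))  # class 0 wins under the training priors
print(posterior(log_likelihood, test_prior))   # class 1 wins under balanced priors
```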
Is there a principled way to handle the difference between the training set and a particular given test set for probabilistic classifiers? I have heard that the "transductive SVM" does something similar for SVMs. Are there similar techniques for learning a classifier that performs best on a particular given test set? We could then retrain the classifier for each given test set, as is allowed in this practical scenario.
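To make the kind of adjustment I have in mind more concrete, here is a minimal sketch of a simple prior re-weighting for a probabilistic classifier (again my own illustration; `reweight_posteriors` and the priors used here are hypothetical, and in practice the test-set priors would themselves have to be estimated somehow):

```python
import numpy as np

def reweight_posteriors(posteriors, train_prior, test_prior):
    """Re-weight posteriors p(y|x) produced under the training priors so that
    they reflect a different (assumed known) set of test-set priors."""
    adjusted = posteriors * (test_prior / train_prior)  # divide out old prior, multiply in new
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Hypothetical posteriors from a classifier trained on the skewed training set
posteriors = np.array([[0.93, 0.07],
                       [0.60, 0.40]])
train_prior = np.array([0.95, 0.05])
test_prior = np.array([0.50, 0.50])  # would have to be estimated for the client's test set

print(reweight_posteriors(posteriors, train_prior, test_prior))
```

Is something along these lines, or a more principled version of it, what is actually done in practice?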