I think one basic assumption of machine learning or parameter estimation is that unseen data come from the same distribution as the training set. However, in some practical cases the distribution of the test set is almost certain to differ from that of the training set.
Consider, for example, a large-scale multi-class classification problem that tries to classify product descriptions into about 17,000 classes. The training set has highly skewed class priors: some classes have many training examples, while others have only a few. Suppose we are given a test set with unknown class labels from a client, and we try to classify each product in the test set into one of the 17,000 classes using the classifier trained on the training set. The test set probably also has a skewed class distribution, but one that is likely very different from the training set's, since the two might be related to different business areas. If the two class distributions are very different, the trained classifier might not work well on the test set. This seems especially obvious with a Naive Bayes classifier.
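As a minimal sketch of why I expect this (my own toy illustration, not tied to any particular library): for a Bayes classifier the posterior is p(y|x) ∝ p(x|y) p(y), so the class priors estimated from the skewed training set enter every prediction.

```python
import numpy as np

# Toy illustration: the same class-conditional likelihoods p(x|y) can lead to
# different decisions depending on which class priors p(y) are plugged in.
log_likelihood = np.log(np.array([0.02, 0.03]))  # hypothetical p(x|y) for two classes
train_prior = np.array([0.95, 0.05])             # skewed priors estimated from training data
test_prior = np.array([0.50, 0.50])              # what the client's test set might look like

def posterior(log_lik, prior):
    unnorm = np.exp(log_lik + np.log(prior))
    return unnorm / unnorm.sum()

print(posterior(log_likelihood, train_prior))  # class 0 wins under the training priors
print(posterior(log_likelihood, test_prior))   # class 1 wins under balanced priors
```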
Is there a principled way to handle the difference between the training set and a particular given test set for probabilistic classifiers? I have heard that the "transductive SVM" does something similar for SVMs. Are there similar techniques for learning a classifier that performs best on a particular given test set? We could then retrain the classifier for each given test set, as is allowed in this practical scenario.
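To make the kind of adjustment I have in mind more concrete, here is a minimal sketch of a simple prior re-weighting for a probabilistic classifier (again my own illustration; `reweight_posteriors` and the priors used here are hypothetical, and in practice the test-set priors would themselves have to be estimated somehow):

```python
import numpy as np

def reweight_posteriors(posteriors, train_prior, test_prior):
    """Re-weight posteriors p(y|x) produced under the training priors so that
    they reflect a different (assumed known) set of test-set priors."""
    adjusted = posteriors * (test_prior / train_prior)  # divide out old prior, multiply in new
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Hypothetical posteriors from a classifier trained on the skewed training set
posteriors = np.array([[0.93, 0.07],
                       [0.60, 0.40]])
train_prior = np.array([0.95, 0.05])
test_prior = np.array([0.50, 0.50])  # would have to be estimated for the client's test set

print(reweight_posteriors(posteriors, train_prior, test_prior))
```

Is something along these lines, or a more principled version of it, what is actually done in practice?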