Several people have already asked "is more data helpful?":
- What impact does increasing the training data have on the overall system accuracy?
- Can increasing the amount of training data make overfitting worse?
- Will a model always score better on the training dataset than the test dataset?
I would like to ask "is more *external* data helpful?" By external I mean data from a similar, though not identical, domain. For example, if we want to detect cars in Japan, I would consider a U.S. dataset external, since the average car (and street) looks different there. Another example would be a dataset of the same kind of objects taken with a different camera.
The reason I'm asking is that many papers seem to use external datasets with great success. For example, depth estimation methods additionally train on the Cityscapes dataset to make predictions on the KITTI dataset, see paper. Similarly, external datasets are often used in Kaggle competitions. Lastly, a 2014 paper reports the "surprising effect" that pretraining on the first half of the ImageNet classes and then finetuning on the other half yields better results than training on the second half of classes alone. On the other hand, this paper reports in Fig. 2 that adding new datasets increases the error. So, what is your experience? Are there any guidelines or interesting review articles? Or do you always just have to try it out?
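For concreteness, here is the kind of two-stage recipe (pretrain on the external dataset, then fine-tune on the target dataset) that I have in mind. This is my own toy sketch, not the setup of any of the cited papers; the random tensors and the car/no-car labels are placeholders standing in for real image datasets.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

def run_stage(model, loader, lr, epochs, device):
    """One training stage: plain supervised training with Adam."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(num_classes=2).to(device)  # e.g. car / no-car

# Random tensors stand in for the external (e.g. U.S.) and target (e.g. Japan)
# image datasets; replace them with real Dataset objects.
external_data = TensorDataset(torch.randn(256, 3, 64, 64), torch.randint(0, 2, (256,)))
target_data = TensorDataset(torch.randn(64, 3, 64, 64), torch.randint(0, 2, (64,)))

# Stage 1: pretrain on the larger external dataset.
run_stage(model, DataLoader(external_data, batch_size=32, shuffle=True),
          lr=1e-3, epochs=5, device=device)

# Stage 2: fine-tune on the (smaller) target dataset, typically with a lower
# learning rate so the pretrained weights are not overwritten too quickly.
run_stage(model, DataLoader(target_data, batch_size=32, shuffle=True),
          lr=1e-4, epochs=5, device=device)
```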
EDIT: To clarify, by "more data" I mean more rows (not more columns/features). More specifically, I am assuming a computer vision problem where more data corresponds to more images.