Several people have already asked "is more data helpful?":
- What impact does increasing the training data have on the overall system accuracy?
- Can increasing the amount of training data make overfitting worse?
- Will a model always score better on the training dataset than the test dataset?
I would like to ask "is more *external* data helpful?" By external I mean data from a similar, though not identical, domain. For example, if we want to detect cars in Japan, I would consider a U.S. dataset external, since the average car (and street) looks different there. Another example would be a dataset of the same kind of objects taken with a different camera.
The reason I'm asking is that many papers seem to use external datasets with great success. For example, depth estimation methods additionally train on the Cityscapes dataset to make predictions on the KITTI dataset, see paper. Similarly, external datasets are often used in Kaggle competitions. Lastly, a 2014 paper reports the "surprising effect" that pretraining on the first half of the ImageNet classes and then finetuning on the other half yields better results than training on the second half of classes alone. On the other hand, this paper reports in Fig. 2 that adding new datasets increases the error. So, what is your experience? Are there any guidelines or interesting review articles? Or do you always just have to try it out?
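For concreteness, here is the kind of two-stage recipe (pretrain on the external dataset, then fine-tune on the target dataset) that I have in mind. This is my own toy sketch, not the setup of any of the cited papers; the random tensors and the car/no-car labels are placeholders standing in for real image datasets.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

def run_stage(model, loader, lr, epochs, device):
    """One training stage: plain supervised training with Adam."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(num_classes=2).to(device)  # e.g. car / no-car

# Random tensors stand in for the external (e.g. U.S.) and target (e.g. Japan)
# image datasets; replace them with real Dataset objects.
external_data = TensorDataset(torch.randn(256, 3, 64, 64), torch.randint(0, 2, (256,)))
target_data = TensorDataset(torch.randn(64, 3, 64, 64), torch.randint(0, 2, (64,)))

# Stage 1: pretrain on the larger external dataset.
run_stage(model, DataLoader(external_data, batch_size=32, shuffle=True),
          lr=1e-3, epochs=5, device=device)

# Stage 2: fine-tune on the (smaller) target dataset, typically with a lower
# learning rate so the pretrained weights are not overwritten too quickly.
run_stage(model, DataLoader(target_data, batch_size=32, shuffle=True),
          lr=1e-4, epochs=5, device=device)
```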
EDIT: To clarify, by "more data" I mean more rows (not more columns/features). More specifically, I am assuming a computer vision problem where more data corresponds to more images.