I am new to Statistics and Data Science. I would like to use "academic" data for training and for testing for overfitting. However, I would like to measure classifier accuracy on "real-world" data and do hyperparameter optimization based on the score obtained on that data (instead of the usual grid search with cross-validation). Would that be correct?
Note the definition of a hyperparameter given here: http://stats.stackexchange.com/questions/133067/what-exactly-is-a-hyperparameter – Carl Mar 21 '17 at 09:51
1 Answer
It's not really clear what you are asking, but:
- any data you use for training your model, including for hyperparameter selection, should not be used for testing
- any data you use to train models for an academic paper, from whatever source, should be cited; this includes data used "only" for hyperparameter selection
- using additional data sources could be considered a type of augmentation, I suppose, but either way, whether you treat it as an additional dataset or as augmentation, if you use it in an academic paper you need to state/cite those sources/augmentations. This lets people reproduce your results and understand clearly what you did
If your goal is instead to train a model for, e.g., an Android app that you are writing and selling yourself, then citations are less critical, unless you are using someone else's licensed/copyrighted data, which might require attribution/citation/license payments/etc. In that case, what you care about is generalization, i.e. runtime prediction performance. You'll want to keep a set of test data separate from whatever you used to evaluate hyperparameters etc., in order to evaluate your model. In theory you can only use test data once, or in practice a few times, before it becomes "worn out", since you'll basically start to overfit your hyperparameters and/or model selection against your test set.
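Here is a minimal sketch of that split discipline, assuming scikit-learn and using synthetic stand-ins for your "academic" and "real-world" datasets; the SVC model and the C grid are illustrative choices, not a recommendation:

    # Minimal sketch: train on one dataset, tune hyperparameters on a
    # separate validation set, and touch the held-out test set only once.
    # The datasets below are synthetic stand-ins, not real data.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Stand-in for the "academic" training data.
    X_train, y_train = make_classification(n_samples=1000, random_state=0)

    # Stand-in for the "real-world" data, split once into a validation set
    # (for hyperparameter selection) and a test set (evaluated exactly once).
    X_real, y_real = make_classification(n_samples=600, random_state=1)
    X_val, X_test, y_val, y_test = train_test_split(
        X_real, y_real, test_size=0.5, random_state=2)

    # Tune a hyperparameter against the validation set only.
    best_C, best_score = None, -np.inf
    for C in [0.01, 0.1, 1.0, 10.0]:
        model = SVC(C=C).fit(X_train, y_train)
        score = model.score(X_val, y_val)
        if score > best_score:
            best_C, best_score = C, score

    # Final, one-shot evaluation on data never seen during tuning.
    final_model = SVC(C=best_C).fit(X_train, y_train)
    print("test accuracy:", final_model.score(X_test, y_test))

Note that the test set here plays no role in choosing C; if you later re-tuned against the test score, it would stop being an honest estimate of generalization.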
