
I am new to statistics and data science. I would like to use "academic" data for training and for testing for overfitting. However, I would also like to measure the classifier's accuracy on "real-world" data and do hyperparameter optimization based on the score obtained on that data (instead of the usual grid search with cross-validation). Would that be correct?
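
Roughly, I mean something like the following sketch (scikit-learn with made-up placeholder data, just to illustrate; the dataset names are my own):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Placeholder stand-ins for my two datasets (assumption: numeric
# features, binary labels).
rng = np.random.RandomState(0)
X_academic, y_academic = rng.randn(200, 5), rng.randint(0, 2, 200)
X_real, y_real = rng.randn(100, 5), rng.randint(0, 2, 100)

best_score, best_C = -1.0, None
for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(C=C).fit(X_academic, y_academic)           # train on "academic" data
    score = accuracy_score(y_real, clf.predict(X_real))  # score on "real-world" data
    if score > best_score:                               # tune C on that real-world score
        best_score, best_C = score, C
```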

user6903745
  • Note the definition of a hyperparameter given here: http://stats.stackexchange.com/questions/133067/what-exactly-is-a-hyperparameter – Carl Mar 21 '17 at 09:51

1 Answer


It's not really clear what you are asking, but:

  • any data you use for training your model, including data used for hyperparameter selection, should not be used for testing (see the sketch after this list)
  • any data you use to train models for an academic paper, from whatever source, should be cited; this includes data used "only" for hyperparameter selection
  • use of additional data sources could be considered a type of augmentation, I suppose, but either way, whether you treat it as an additional dataset or as augmentation, if you are using it in an academic paper you need to state/cite those sources/augmentations, so that people can reproduce your results and understand clearly what you did
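
As a rough sketch of the first point (scikit-learn with synthetic placeholder data; the split sizes and the hyperparameter grid are arbitrary), hyperparameters are selected on a validation split and the test split is left untouched until the very end:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (assumption: numeric features, binary labels).
rng = np.random.RandomState(0)
X, y = rng.randn(1000, 10), rng.randint(0, 2, 1000)

# Split once into train / validation / test; the test set is never
# touched during hyperparameter selection.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_score, best_C = -1.0, None
for C in [0.01, 0.1, 1, 10]:
    clf = LogisticRegression(C=C).fit(X_train, y_train)
    score = accuracy_score(y_val, clf.predict(X_val))  # select C on validation only
    if score > best_score:
        best_score, best_C = score, C

# Refit with the chosen hyperparameter and report test accuracy once.
final = LogisticRegression(C=best_C).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, final.predict(X_test)))
```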

If your goal is to train a model for, e.g., an Android app that you are writing and selling yourself, then citations are less critical, unless you are using someone else's licensed/copyrighted data, which might require attribution/citation/license payments/etc. In that case, what you care about is generalization: runtime prediction performance. You'll want to keep a set of test data separate from whatever data you used to evaluate hyperparameters, in order to evaluate your final model. You can only use the test data once in theory, or a few times in practice, before it becomes "worn out", since you will gradually start to overfit your hyperparameters and/or model selection to your test set.
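
To make that last point concrete, one common pattern (sketched here with scikit-learn and synthetic placeholder data; the grid values are arbitrary) is to choose hyperparameters by cross-validation on the training portion, and to evaluate the held-out test set exactly once at the end:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)
X, y = rng.randn(500, 8), rng.randint(0, 2, 500)  # placeholder data

# Hold out a test set up front and do not look at it again until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# All hyperparameter selection happens via cross-validation on X_train.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=5)
search.fit(X_train, y_train)

# The single, final use of the test set.
print("test accuracy:", accuracy_score(y_test, search.predict(X_test)))
```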

Hugh Perkins