Given a Bernoulli Process, should my training set have a number of "1" examples in proportion to the process?

For example, suppose a Bernoulli process yields "1" 10% of the time and "0" otherwise. In a training set of 1,000,000, should I use 100,000 "1" examples and 900,000 "0" examples?

Back-story: I have a large training set of 100 billion rows. It contains about 200,000 "1" cases, which occur about 0.2% of the time. Training on everything would take forever, so I want to train on a subset of the data. If I take a straight sequential chunk, I'm afraid the subset won't contain any "1" cases. But now I'm wondering whether the way I sample the training data would affect my classifier.
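
To make the sampling concern concrete, here is a minimal sketch of one way to build such a subset: keep every "1" row and draw a random share of the "0" rows. The function name `downsample_negatives`, the `keep_fraction` parameter, and the toy data are all hypothetical, and NumPy is assumed; this is an illustration, not a recommendation.

```python
import numpy as np

def downsample_negatives(labels, keep_fraction, seed=0):
    """Return sorted row indices that keep every 1 and a random share of the 0s."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos_idx = np.flatnonzero(labels == 1)   # all rare "1" rows are kept
    neg_idx = np.flatnonzero(labels == 0)   # "0" rows are subsampled
    n_keep = int(round(keep_fraction * neg_idx.size))
    kept_neg = rng.choice(neg_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([pos_idx, kept_neg]))

# Toy stand-in for the real data (assumption): 1,000,000 rows, about 0.2% ones.
rng = np.random.default_rng(42)
y = (rng.random(1_000_000) < 0.002).astype(int)

idx = downsample_negatives(y, keep_fraction=0.01)
y_sub = y[idx]

# The share of 1s in the subset is far higher than in the full data.
print("full data rate:", y.mean())
print("subset rate:   ", y_sub.mean())
```

A subset built this way no longer reflects the 0.2% rate of the process, which is exactly the situation I'm asking about.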

nbui

For logistic regression, see [Does down-sampling change logistic regression coefficients?](http://stats.stackexchange.com/q/67903/17230). More generally, see [Why downsample?](http://stats.stackexchange.com/q/122409/17230) and [How to handle the difference between the distribution of the test set and the training set?](http://stats.stackexchange.com/q/43716/17230). – Scortchi - Reinstate Monica Apr 05 '16 at 08:07
