
My company wants to build a model that will be used to predict conversion, which typically runs at about 2%.
However, every sample we purchase (converted or unconverted) is expensive. So my questions are:

  • How many converted samples do I need? Is 500 or 1,000 enough?
  • How many unconverted samples do I need? The same number? I can't simply buy as many as possible.

If I build a model using a 50/50 split, will that be OK to use on a real-world sample of 98/2? Or do I have to do something like resample the unconverted to get a more real-world split?

Just wondering if there is any rule of thumb here? I'm not even sure of the name for my problem: domain adaptation or sampling bias?

Thank you.

khhc
  • I typically purchase as many samples as I need to get 2,000 of the thing I'm trying to predict. This rule of thumb is based on the fact that I may have plenty of explanatory variables. In your example, this means I would be buying 2000/0.02 = 100,000. Regarding your concerns about splitting and oversampling, if you can save money by purchasing fewer but still getting 2,000 y=1, then go for it; just remember you'll have to adjust your final model probabilities down to 2% to match the real world. I've personally not heard of such a thing, so I'd be stuck buying 100K. – Josh May 31 '17 at 10:34
  • Thanks Josh, I'm told I can't afford to do that. So I was wondering what an alternative approach might be. If I can afford 5,000, I assume an even split is best. However, if only 1,000 positive records exist, will it be OK to purchase these and 4,000 negatives? If I can fight to get 1,000 positive and 9,000 negative, will it really be much better? I realise how subjective this is but appreciate any feedback! – khhc May 31 '17 at 10:43
  • Yes, more is always better. Since you're limited by budget, buy what you can, and you can always address the challenges during the modeling process. If you only have the money for 4,000, then try to get 2,000 of each. If you can only get 1,000 positives, then get as many negatives as possible. Your advantage is that you already know the rate is around 2%, so you don't need to worry about the sample rate as much and can just focus on getting as much as possible. After you finish the modeling process, you can adjust the predicted probabilities down to near 2%. – Josh May 31 '17 at 11:29
  • Thank you very much. Can I just ask about "adjust the predicted probabilities down to 2%"? What exactly do you mean? If I train a model based on a 50/50 split, I can't just divide the predicted probabilities by 25 to get an idea of the real probability, right?! Sorry for asking the basics. – khhc May 31 '17 at 13:48
  • After you build your model on a 50/50 dataset, let's pick an average observation and see that it produces a predicted probability of 50%. In real life, that person would only have a 2% probability, but the model would still produce a predicted p of 50%. This could be a problem if you're using your probabilities for something where you need them to be accurate, so they all need to be adjusted. You can't simply subtract 48% from each one either, because you might get negatives. – Josh May 31 '17 at 22:26
  • You can use this equation: P_i** = (P_i* × R_0 × P_1) / ((1 − P_i*) × R_1 × P_0 + P_i* × R_0 × P_1), where P_i* is the unadjusted probability you get from your model, R_0 and R_1 are the sample proportions of 0s and 1s respectively, P_0 and P_1 are the population non-event and event rates, and P_i** is the adjusted (true) probability. (A code sketch of this adjustment follows these comments.) – Josh May 31 '17 at 22:27
  • Thank you for taking the time to write all that out, Josh. I'm a beginner at all this. I get ML concepts, but best practices in real life are always a bit confusing to me and difficult to get an answer on. Thanks again. – khhc Jun 05 '17 at 02:24
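
For later readers, here is a minimal Python sketch of Josh's adjustment formula from the comments above. The function name is made up for illustration, and the example numbers (50/50 sample, 2% population conversion rate) come from the thread's worked example:

```python
def adjust_probability(p_star, r0, r1, p0, p1):
    """Rescale a probability from a model trained on resampled data
    back to the population class priors.

    p_star -- unadjusted probability from the model (P_i*)
    r0, r1 -- sample proportions of class 0 and class 1 (R_0, R_1)
    p0, p1 -- population non-event and event rates (P_0, P_1)
    """
    num = p_star * r0 * p1
    den = (1 - p_star) * r1 * p0 + p_star * r0 * p1
    return num / den

# Model trained on a 50/50 sample; true conversion rate is 2%.
# An "average" observation scored at 0.5 maps back to 0.02,
# matching Josh's example in the comments.
print(adjust_probability(0.5, r0=0.5, r1=0.5, p0=0.98, p1=0.02))  # ~0.02
```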

1 Answer


The number of samples you need depends on how complex a model you want to use (and on the distribution of the samples in your feature space). There is no simple rule here. How many features do you anticipate your model will use as input? If it is only a few, then ~1000 samples should be adequate.

A 50/50 split should be fine for training the model, but for your test set you want the distribution to be similar to the one you would see in the real world. That said, it depends on what you want the model to do; look into the precision/recall tradeoff.
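
As a rough illustration of that setup (the synthetic data and the plain logistic regression are placeholders, not a recommendation): train on a balanced sample, evaluate on a test set that keeps the ~2% real-world rate, and report precision and recall rather than accuracy:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

def make_data(n_pos, n_neg):
    # Toy stand-in data: two features, positives shifted by +1.
    X = np.vstack([rng.normal(1.0, 1.0, size=(n_pos, 2)),
                   rng.normal(0.0, 1.0, size=(n_neg, 2))])
    y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    return X, y

# Balanced 50/50 training set; test set kept at the ~2% real-world rate.
X_train, y_train = make_data(1000, 1000)
X_test, y_test = make_data(40, 1960)

model = LogisticRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
```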

rinspy
  • Thanks for the response. I always assumed training and testing had to be the same ratio because people always talk about splitting it randomly. So if I can only get 1,000 converted applications but can get more than 1,000 unconverted (say 5,000), there is still benefit in doing that, right? – khhc May 31 '17 at 09:58
  • Random splits are usually performed, but are not always 'best practice'. For example, if you are concerned with the external validity (~ out-of-sample performance) of a prediction model, you might specifically want to test your model on a population with some characteristic different from the development set (think of sampling from a different region or time period). The same goes for the occurrence of the outcome; if you want to see whether it affects model performance, you might opt to have different ratios in the training and test sets. – IWS May 31 '17 at 10:04
  • Thanks, I will do that. So in the testing set, if I can have, let's say, 1,000 positive and 1,000 negative results, that's OK. If I can have 1,000 positive and 5,000 negative, that's better, right? And if I had 1,000 positive and 99,000 negative, would that be ideal, or actually overkill, since I read that some predictive algorithms don't work well with such imbalance? BTW, I anticipate the model to be simple (<10 predictors). – khhc May 31 '17 at 10:24
  • This question may be relevant to your problem: https://stats.stackexchange.com/questions/9398/supervised-learning-with-rare-events-when-rarity-is-due-to-the-large-number-o – rinspy May 31 '17 at 15:08