My company wants to build a model to predict conversion, which usually happens at a rate of about 2%.
However, every sample we purchase (converted or unconverted) is expensive. So my questions are:
- How many converted samples do I need? Would 500 or 1000 be enough?
- How many unconverted samples do I need? The same number? I can't simply buy as many as possible.
If I build a model using a 50/50 split, will it be OK to use on a real-world sample with a 98/2 split? Or do I have to do something like resample the unconverted cases to get a more realistic split?
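To make the 50/50 question concrete, here is a minimal sketch of the kind of adjustment I have in mind. It assumes the only difference between the training data and the real world is the class ratio (50/50 versus 98/2), and it rescales a predicted probability using the usual odds-based correction for a change in class priors. The function name `adjust_probability` and its parameters are made up for illustration, and I'm not sure this is the right approach for my case.

```python
def adjust_probability(p_train, train_rate=0.5, real_rate=0.02):
    """Rescale a probability predicted under the training class ratio
    so that it reflects the real-world conversion rate.

    Assumption: only the class ratio differs between the training
    sample and the real population (no other shift in the features).
    """
    # Reweight the "converted" and "unconverted" parts by how much
    # each class is over- or under-represented in the training data.
    pos = p_train * (real_rate / train_rate)
    neg = (1.0 - p_train) * ((1.0 - real_rate) / (1.0 - train_rate))
    return pos / (pos + neg)


if __name__ == "__main__":
    # Scores from a hypothetical model trained on a 50/50 split.
    for score in (0.1, 0.5, 0.9):
        print(score, round(adjust_probability(score), 4))
```

With these defaults, a score of 0.5 from the 50/50 model maps back to roughly the 2% base rate. The ranking of samples is unchanged, so this would only matter if I need calibrated probabilities rather than just an ordering.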
Is there any rule of thumb here? I'm not even sure what my problem is called: domain adaptation or sampling bias?
Thank you.