I have a large amount of labeled data (~40 million records) with a binary outcome variable that is split roughly 50% positive and 50% negative. The issue is that I know the true proportion for these 40 million records is closer to 75% positive and 25% negative. So when I test my model, I actually do not want to see low false positive and false negative rates; in fact, I would prefer to see a certain number of false positives.
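To make the mismatch concrete, here is a back-of-envelope sketch of the prevalence weights that would map my roughly 50/50 labeled sample onto the 75/25 split I believe is true (the 0.75 and 0.25 figures are my own estimates, not measured values):

```r
# Hypothetical prevalence-correction weights: ratio of believed population
# class proportion to observed sample class proportion.
pop_pos <- 0.75; pop_neg <- 0.25    # what I believe the true split is
samp_pos <- 0.50; samp_neg <- 0.50  # what the labels in my 40M records show

w_pos <- pop_pos / samp_pos  # 1.5: each labeled positive would count for 1.5
w_neg <- pop_neg / samp_neg  # 0.5: each labeled negative would count for 0.5
```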
That got me thinking about hyperparameter tuning. For example, I was fitting the LASSO with glmnet and using cross-validation to choose lambda, and then it occurred to me: this is the lambda value that gives me the lowest classification error, which, as I said above, may not actually be what I want.
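For reference, this is roughly what my current tuning step looks like (a minimal sketch; x and y below are placeholder data standing in for my actual predictors and 0/1 outcome, and `type.measure = "class"` is the misclassification-error criterion I am now questioning):

```r
library(glmnet)

# Placeholder data standing in for my real predictor matrix and binary outcome.
set.seed(1)
x <- matrix(rnorm(1000 * 20), nrow = 1000, ncol = 20)
y <- rbinom(1000, size = 1, prob = 0.5)

# Cross-validated LASSO (alpha = 1), choosing lambda by classification error.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                    type.measure = "class", nfolds = 10)

cv_fit$lambda.min  # lambda with the lowest cross-validated misclassification error
```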
Am I correct in thinking that if I want to use cross-validation to train my model, I should tune toward reproducing the known true proportion rather than toward the lowest classification error?