I am working on a logistic regression model that attempts to predict failure events in a population of devices using the previous 10 days of data (60 features derived from sensor data). The failure event is rare: I would expect approximately 20-30 failures on a given day, out of a total population of more than 10,000 devices.
Here is where I may be mistaken: I have access to years' worth of historical data, so I thought that I would be slick and make the event 'non-rare'. That is, collect 1000 failure examples and 1000 non-failure examples, then estimate the model on that balanced sample. This, I thought, would give the model enough information to determine the relationships between sensor readings and class membership.
However, I am starting to think that my intercept term (and perhaps other coefficients?) may be problematic, because the balanced sample does not reflect the true prior class probabilities, i.e. the real base rate of failure.
Is the model corrupted by this disregard of the real event probabilities?
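To make the concern concrete, here is a minimal sketch of the setup on synthetic data. Everything in it is an assumption for illustration: the three features, the `true_beta` coefficients, the population size, and the helper `fit_logistic` (a plain Newton's-method fit written in NumPy) are hypothetical stand-ins for the real 60 sensor features. It fits a logistic model on 1000 failures and 1000 non-failures, then applies the commonly cited prior correction for this kind of outcome-based sampling, which subtracts log[((1 - tau) / tau) * (ybar / (1 - ybar))] from the intercept, where tau is the population failure rate and ybar is the failure fraction in the sample.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton's method.

    Returns a vector [intercept, slope_1, ..., slope_k].
    Hypothetical helper for this sketch, not a library function.
    """
    Xb = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ beta)))
        w = p * (1.0 - p)
        grad = Xb.T @ (y - p)             # gradient of the log-likelihood
        hess = (Xb * w[:, None]).T @ Xb   # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)

# Synthetic "population" with a rare event (all numbers are made up).
n = 500_000
X = rng.normal(size=(n, 3))
true_beta = np.array([-6.0, 1.0, -0.5, 0.8])  # [intercept, slopes...]
p_fail = 1.0 / (1.0 + np.exp(-(true_beta[0] + X @ true_beta[1:])))
y = rng.binomial(1, p_fail)

tau = y.mean()  # population failure rate, assumed known from historical data

# Balanced sample: 1000 failures and 1000 non-failures.
pos = rng.choice(np.flatnonzero(y == 1), size=1000, replace=False)
neg = rng.choice(np.flatnonzero(y == 0), size=1000, replace=False)
idx = np.concatenate([pos, neg])
beta_bal = fit_logistic(X[idx], y[idx])

# Prior correction: shift only the intercept, leave the slopes alone.
ybar = y[idx].mean()  # 0.5 by construction
offset = np.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))
intercept_corrected = beta_bal[0] - offset

print("population failure rate:", tau)
print("true coefficients:      ", true_beta)
print("balanced-sample fit:    ", beta_bal)
print("corrected intercept:    ", intercept_corrected)
```

In this toy setting the slope estimates from the balanced sample land close to the true slopes, while the raw intercept is far too high and moves back toward the true value once the offset is subtracted. Whether the same holds for the real sensor data depends on the logistic model being a reasonable description of it and on tau being known accurately.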