I have a large imbalanced dataset (the target has ~1500x more 0's than 1's) on which I train a logistic regression model to predict the probability of success (not a binary outcome, but a real number between 0 and 1).
Some additional details:
- about 100,000,000 rows in my dataset
- I'm using online logistic regression with stochastic gradient descent (SGD)
- I apply a decay factor to my learning rate (a rough sketch of the update is below)
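
Roughly, the per-example update I use looks like this (a minimal sketch, assuming NumPy; `eta0`, `decay`, and the `stream` generator are placeholders, not my actual values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_logreg(stream, n_features, eta0=0.1, decay=1e-6):
    """One pass of SGD for logistic regression with a decaying learning rate."""
    w = np.zeros(n_features)
    b = 0.0
    for t, (x, y) in enumerate(stream, start=1):
        eta = eta0 / (1.0 + decay * t)   # learning-rate decay
        p = sigmoid(w @ x + b)           # predicted probability for this row
        grad = p - y                     # gradient of the log loss w.r.t. the logit
        w -= eta * grad * x              # update weights
        b -= eta * grad                  # update intercept
    return w, b
```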
The average value of the target variable in the training set is about 0.0005, and the average in the test set is almost exactly the same (~0.0005).
However, when I score the model on the test set, the overall average of my predictions is noticeably higher (~0.001, roughly double the observed mean). Should I expect logistic regression to centre its predictions around the observed mean? I know that the main purpose of the intercept term is to do so, but what level of difference should I expect? Is the difference between the observed and predicted means negligible here?
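
For reference, this is roughly how I compare the two means (`X_test` and `y_test` are placeholders for my held-out features and targets; `w`, `b`, and `sigmoid` come from the sketch above):

```python
# Compare the observed mean of the target with the mean predicted probability
# on the held-out set.
p_test = sigmoid(X_test @ w + b)

print("observed mean: ", y_test.mean())   # ~0.0005 in my case
print("predicted mean:", p_test.mean())   # ~0.001 in my case
```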