I have a large imbalanced dataset (the target has ~1500x more 0's than 1's) on which I train a logistic regression model to predict the probability of success (not a binary outcome, but a real number between 0 and 1).
Some additional details:
- about 100,000,000 rows in my dataset
- I'm using online logistic regression with stochastic gradient descent (SGD)
- I apply a decay factor to my learning rate (a rough sketch of the update is below)
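
Roughly, the per-example update I use looks like this (a minimal sketch, assuming NumPy; `eta0`, `decay`, and the `stream` generator are placeholders, not my actual values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_logreg(stream, n_features, eta0=0.1, decay=1e-6):
    """One pass of SGD for logistic regression with a decaying learning rate."""
    w = np.zeros(n_features)
    b = 0.0
    for t, (x, y) in enumerate(stream, start=1):
        eta = eta0 / (1.0 + decay * t)   # learning-rate decay
        p = sigmoid(w @ x + b)           # predicted probability for this row
        grad = p - y                     # gradient of the log loss w.r.t. the logit
        w -= eta * grad * x              # update weights
        b -= eta * grad                  # update intercept
    return w, b
```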
The average value of the target variable in the training set is about 0.0005, and the average in the test set is almost exactly the same (~0.0005).
However, when I score the model on the test set, the overall average of my predictions is noticeably higher (~0.001, roughly double the observed mean). Should I expect logistic regression to centre its predictions around the observed mean? I know that the main purpose of the intercept term is to do so, but what level of difference should I expect? Is the difference between the observed and predicted means negligible here?
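
For reference, this is roughly how I compare the two means (`X_test` and `y_test` are placeholders for my held-out features and targets; `w`, `b`, and `sigmoid` come from the sketch above):

```python
# Compare the observed mean of the target with the mean predicted probability
# on the held-out set.
p_test = sigmoid(X_test @ w + b)

print("observed mean: ", y_test.mean())   # ~0.0005 in my case
print("predicted mean:", p_test.mean())   # ~0.001 in my case
```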