I have a binary classification problem (let's say, whether or not an observation will experience action x). I train a random forest model on a training set in which about 50% of observations have done action x and 50% have not. I test the model on a similarly balanced test set, and it is about 85% accurate (an overall error rate of about 15%).

A year passes, I get new data, and I want to see how the model performed. It predicted that about 9% of the observations would experience action x, and about 9% in fact did, but it failed to identify the individual observations correctly. In other words, the observations it predicted would experience action x did not actually experience it, and the observations that did experience action x were not flagged by the model.
So essentially, what does it mean that the model gets the aggregate correct but fails to make accurate predictions at the micro level? Is there a mathematical explanation for how this can occur, perhaps something to do with how the predicted probabilities aggregate? Is the model still useful for predicting totals?
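To make the question concrete, here is a minimal simulation (made-up numbers, not my actual data) of the situation I'm describing: the predicted positive rate exactly matches the actual positive rate and overall accuracy still looks respectable, yet precision and recall are both zero because the predicted and actual positive sets don't overlap at all:

```python
import numpy as np

n = 1000
actual = np.zeros(n, dtype=bool)
predicted = np.zeros(n, dtype=bool)

# 9% of observations actually experience action x
actual[:90] = True
# the model also flags 9% positive, but a completely disjoint set
predicted[90:180] = True

accuracy = (actual == predicted).mean()
precision = (actual & predicted).sum() / predicted.sum()
recall = (actual & predicted).sum() / actual.sum()

print(f"predicted positive rate: {predicted.mean():.2%}")  # 9.00%
print(f"actual positive rate:    {actual.mean():.2%}")     # 9.00%
print(f"accuracy:  {accuracy:.2%}")   # 82.00% (820 of 1000 correct)
print(f"precision: {precision:.2%}")  # 0.00%
print(f"recall:    {recall:.2%}")     # 0.00%
```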