Dear big data experts,
I’ve been confused about how to think about machine learning models that are trained on population-level big data but cannot make good predictions at the individual level. I recently ran a logistic regression on around 1 million users’ data to see how different demographic features explain a self-reported questionnaire score. There were 4 levels of scores, but one level was chosen by over 50% of users, so always guessing that level already gives about 50% accuracy, which I treat as the chance level. For example, the model gives a coefficient of 0.02 for age and 0.9 for gender. Of course, with such big data, all p-values were significant. The goodness of fit is only around 55% (nearly chance level), meaning the model won’t do a good job of predicting scores for an individual. My question is: what is the value of big data here if it cannot make accurate individual-level predictions? Is the positive relationship between age and the self-reported score valuable information? I find it hard to know how to interpret the relationship trends coming out of the model while knowing that they don’t make good predictions on individual data.
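To make the puzzle concrete, here is a minimal simulated sketch, not the real survey data. It assumes a binary outcome instead of the 4-level score, made-up true coefficients, and a hand-written Newton-Raphson fit in NumPy; the function name `simulate_and_fit` is hypothetical. It shows coefficients that are very clearly non-zero at the population level while accuracy stays close to the majority-class baseline.

```python
import numpy as np

def simulate_and_fit(n=200_000, seed=0):
    """Simulate weak demographic effects and fit a logistic regression."""
    rng = np.random.default_rng(seed)

    # Synthetic predictors (assumed distributions, not real data).
    age_centered = rng.normal(0.0, 12.0, n)        # age minus its mean
    gender = rng.integers(0, 2, n).astype(float)   # coded 0/1

    # Hypothetical true effects: intercept, age, gender.
    true_beta = np.array([0.3, 0.02, 0.2])
    X = np.column_stack([np.ones(n), age_centered, gender])
    p_true = 1.0 / (1.0 + np.exp(-X @ true_beta))
    y = (rng.random(n) < p_true).astype(float)

    # Maximum-likelihood fit by Newton-Raphson.
    beta = np.zeros(X.shape[1])
    for _ in range(25):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        hessian = (X * w[:, None]).T @ X
        step = np.linalg.solve(hessian, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break

    # Standard errors from the inverse information matrix at the estimate.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    cov = np.linalg.inv((X * w[:, None]).T @ X)
    z = beta / np.sqrt(np.diag(cov))

    # Individual-level accuracy versus always guessing the majority class.
    accuracy = float(np.mean((p >= 0.5) == (y == 1.0)))
    baseline = float(max(y.mean(), 1.0 - y.mean()))
    return {"beta": beta, "z": z, "accuracy": accuracy, "baseline": baseline}

result = simulate_and_fit()
print("coefficients (intercept, age, gender):", result["beta"])
print("z-statistics:", result["z"])
print("accuracy:", result["accuracy"], "majority baseline:", result["baseline"])
```

In this sketch the z-statistics are large because the standard errors shrink with sample size, while accuracy depends on how much the predicted probabilities differ between individuals, which small effects barely change.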
Any thoughts or suggested reading will be appreciated.
Best,
Lily