Due to a lack of significance and the large size of the dataset (binomial responses, with 20,000 responses out of a sample of 15,000,000), my peer used random sampling to reduce the amount of data before importing it into our modelling software.
This adjusted dataset has the full 20,000 responses but under 1,000,000 tests.
A GLM was then fitted to this data with many parameters, and the resulting model felt overfit to me.
My argument is that this random sampling does two things:
- It changes the relativities the GLM would find.
- It could lead to factors with low statistical significance appearing highly significant, thereby invalidating our statistical tests.
I think these effects have led to overfitting.
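To make what I mean concrete, here is a minimal simulation sketch of the setup as I understand it, written in Python with statsmodels (not the modelling software we actually use). It keeps every simulated response, randomly drops most of the non-responses, and fits the same binomial GLM to both the full and downsampled data so the coefficients and p-values can be compared side by side. The single rating factor, its true coefficient, and the sampling fraction are all made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulate a low-response-rate portfolio. The real data has ~20,000 responses in
# 15,000,000 trials (~0.13%); a smaller population is used here so it runs quickly.
n = 1_000_000
x = rng.normal(size=n)                      # one hypothetical rating factor
true_logit = -6.5 + 0.3 * x                 # assumed intercept and true relativity
p = 1.0 / (1.0 + np.exp(-true_logit))
y = rng.binomial(1, p)
full = pd.DataFrame({"y": y, "x": x})

# Binomial (logistic) GLM on the full data.
full_fit = smf.glm("y ~ x", data=full, family=sm.families.Binomial()).fit()

# Downsample: keep every response, keep a random fraction of the non-responses,
# roughly mimicking 15,000,000 trials being cut to under 1,000,000.
keep_frac = 0.06
responses = full[full["y"] == 1]
non_responses = full[full["y"] == 0].sample(frac=keep_frac, random_state=0)
sampled = pd.concat([responses, non_responses])

# The same GLM on the downsampled data.
sampled_fit = smf.glm("y ~ x", data=sampled, family=sm.families.Binomial()).fit()

# Compare the coefficient estimates and p-values side by side.
print(full_fit.summary().tables[1])
print(sampled_fit.summary().tables[1])
```

Comparing the two coefficient tables on something like this is how I would expect to see whether the fitted relativity and its standard error move materially after the non-responses are sampled down.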
Please could somebody let me know if my concerns are justified?