
Owing to a lack of significance and the large size of the dataset (binomial responses, with 20,000 positive responses in a sample of 15,000,000), my peer has used random sampling to reduce the amount of data before importing it into our modelling software.

This adjusted dataset retains all 20,000 positive responses but contains fewer than 1,000,000 tests.

A GLM with many parameters was fitted to this data, and it looks overfitted to me.

My argument is that this random sampling does two things:

  1. It changes the relativities the GLM would find.
  2. It could lead to factors with low statistical significance appearing to have high statistical significance, thereby invalidating our statistical tests.

I think these effects have led to overfitting.

Please could somebody let me know if my concerns are justified?

Scortchi - Reinstate Monica
Simon Todd
  • What do you mean by "the adjusted dataset"? – StatsStudent Feb 05 '15 at 22:43
  • The dataset whereby the negative responses have been shrunk from 15,000,000 to 1,000,000 randomly. – Simon Todd Feb 05 '15 at 22:44
  • OK. You refer to stratified sampling in your title, but you only mention a "random sampling" in the body of your post. Was a purely random sample drawn or was some sort of stratification indeed used? If stratification was employed, how was it employed? What did you stratify on? – StatsStudent Feb 05 '15 at 22:46
  • As I understand it, the stratification is that we kept all of the positive responses but randomly removed the negative responses to shrink the overall dataset. If my terminology is incorrect then please correct me! – Simon Todd Feb 05 '15 at 22:48
  • What attributes did you stratify on? Please check a standard text, e.g., Model Assisted Survey Sampling by Särndal, Swensson, and Wretman. – Jan Galkowski Feb 06 '15 at 02:19
  • possible duplicate of [Does down-sampling change logistic regression coefficients?](http://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients) – Scortchi - Reinstate Monica Feb 06 '15 at 12:26
  • 20k responses might allow a rather complex model to be fitted without over-fitting. Typical rules of thumb suggest up to 1000 regression degrees of freedom would be unproblematic. Check by cross-validation or bootstrap validation. – Scortchi - Reinstate Monica Feb 06 '15 at 12:29
  • BTW "stratification" usually refers to sampling based on predictor values rather than response values. – Scortchi - Reinstate Monica Feb 06 '15 at 12:37
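
The two concerns raised in the question (changed relativities, inflated significance) can be checked on simulated data. The following is a minimal sketch, not the poster's actual model: it assumes a logit link and a single binary predictor, for which the logistic-regression estimates have a closed form from the 2x2 table of counts. The sample size, event rate, odds ratio, and the `KEEP` fraction are made-up illustration values, and all function names are hypothetical.

```python
import math
import random


def fit_logit_binary(rows):
    """Closed-form logistic MLE for one binary predictor x in {0, 1}.

    rows: iterable of (x, y) pairs with y in {0, 1}.
    Returns (intercept, slope, standard error of the slope).
    """
    n = [[0, 0], [0, 0]]  # n[x][y] = count of rows with that x and y
    for x, y in rows:
        n[x][y] += 1
    intercept = math.log(n[0][1] / n[0][0])          # log odds at x = 0
    slope = math.log(n[1][1] / n[1][0]) - intercept  # log odds ratio
    # Wald standard error of the log odds ratio from the 2x2 table.
    se_slope = math.sqrt(sum(1.0 / n[x][y] for x in (0, 1) for y in (0, 1)))
    return intercept, slope, se_slope


def simulate(n_obs, p_x, base_odds, odds_ratio, rng):
    """Draw (x, y) pairs from a logistic model (illustrative values only)."""
    rows = []
    for _ in range(n_obs):
        x = 1 if rng.random() < p_x else 0
        odds = base_odds * (odds_ratio if x else 1.0)
        y = 1 if rng.random() < odds / (1.0 + odds) else 0
        rows.append((x, y))
    return rows


def downsample_negatives(rows, keep_fraction, rng):
    """Keep every positive response; keep each negative with keep_fraction."""
    return [(x, y) for x, y in rows if y == 1 or rng.random() < keep_fraction]


rng = random.Random(0)
KEEP = 0.1  # assumed fraction of negatives retained

full = simulate(500_000, p_x=0.3, base_odds=0.01, odds_ratio=2.0, rng=rng)
sub = downsample_negatives(full, KEEP, rng)

b0_full, b1_full, se_full = fit_logit_binary(full)
b0_sub, b1_sub, se_sub = fit_logit_binary(sub)

# Dropping negatives inflates all odds by 1 / KEEP, so only the intercept
# shifts; adding log(KEEP) brings it back to the full-data scale.
b0_corrected = b0_sub + math.log(KEEP)

print(f"rows:      full={len(full)}  sub={len(sub)}")
print(f"slope:     full={b1_full:.3f}  sub={b1_sub:.3f}")
print(f"slope SE:  full={se_full:.4f}  sub={se_sub:.4f}")
print(f"intercept: full={b0_full:.3f}  sub={b0_sub:.3f}  "
      f"corrected={b0_corrected:.3f}")
```

Under these assumptions the slope (the log odds ratio) should come out nearly the same in both fits, the intercept should differ by roughly `log(1 / KEEP)`, and the slope's standard error should be slightly larger, not smaller, after down-sampling. This says nothing about whether the real model has too many parameters; that is what the cross-validation or bootstrap check suggested in the comments is for.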
