I have labour market data with 9 million observations for a single time period (i.e. cross-section data). I am studying the determinants of wages in a single-equation multiple regression with around 300 regressors. If $Y$ is the vector of wages (9 million $\times$ 1) and $X$ is a 9 million $\times$ 300 matrix, the model is $$Y = XB + e$$ For computational reasons, I want to scale down the sample. My intuition is that a sufficiently large subsample of size $N$ should preserve the information/variability of the whole sample. My goal is to select a method for choosing $N$.
I am considering the following two methods:
1) Comparison of $R^2$:
- Select subsample of size $N_{0}$
- Estimate the model with all regressors, and then again without the regressors of interest (around 7 of them).
- Record $R^2$ of both models, and define $D$ as the absolute difference between them.
- Run again for $N_{1} = N_{0} + 100$ (or any other incremental size).
- Keep iterating until $D$ is below a certain threshold $\mu$ (a sketch of this loop follows the list).
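For concreteness, here is a minimal sketch of method 1 in Python with statsmodels, assuming the data sit in a pandas DataFrame `df` with the wage column name in `y_col` and the regressor names split into `vars_of_interest` (the ~7 regressors of interest) and `controls`; these names and the default values for `n0`, `step` and `mu` are placeholders, not recommendations:

```python
import numpy as np
import statsmodels.api as sm

def r2_gap(df, y_col, vars_of_interest, controls, n, rng):
    """Draw a subsample of size n and return D = |R2_full - R2_restricted|."""
    sub = df.sample(n=n, random_state=rng)
    y = sub[y_col]
    X_full = sm.add_constant(sub[vars_of_interest + controls])
    X_restricted = sm.add_constant(sub[controls])
    r2_full = sm.OLS(y, X_full).fit().rsquared
    r2_restricted = sm.OLS(y, X_restricted).fit().rsquared
    return abs(r2_full - r2_restricted)

def find_n_by_r2(df, y_col, vars_of_interest, controls,
                 n0=10_000, step=100, mu=1e-3, seed=0):
    """Grow the subsample until the R^2 gap D drops below the threshold mu."""
    rng = np.random.RandomState(seed)
    n, d = n0, np.inf
    while n <= len(df):
        d = r2_gap(df, y_col, vars_of_interest, controls, n, rng)
        if d < mu:
            break
        n += step
    return n, d
```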
2) Placebo variable significance:
- Create a random indicator variable $Z$. By construction, it should have no effect on the dependent variable.
- Select subsample of size $N_{0}$
- Run the regression with $Z$ added as an extra regressor.
- Record the p-value of the coefficient on $Z$.
- Run again for $N_{1} = N_{0} + 100$ (or any other incremental size).
- Keep iterating until the p-value for the placebo coefficient is above a certain threshold $\phi$ (a sketch follows the list).
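A corresponding sketch of method 2, again in Python with placeholder names (`df`, `y_col`, `regressors`) and an arbitrary default for $\phi$; the placebo $Z$ is drawn independently of everything in the data:

```python
import numpy as np
import statsmodels.api as sm

def placebo_pvalue(df, y_col, regressors, n, rng):
    """Draw a subsample of size n, add a random indicator Z,
    and return the p-value on Z from the augmented regression."""
    sub = df.sample(n=n, random_state=rng).copy()
    sub["Z"] = rng.binomial(1, 0.5, size=len(sub))  # placebo: unrelated to Y by construction
    X = sm.add_constant(sub[regressors + ["Z"]])
    fit = sm.OLS(sub[y_col], X).fit()
    return fit.pvalues["Z"]

def find_n_by_placebo(df, y_col, regressors,
                      n0=10_000, step=100, phi=0.10, seed=0):
    """Grow the subsample until the placebo p-value exceeds the threshold phi."""
    rng = np.random.RandomState(seed)
    n, p = n0, 0.0
    while n <= len(df):
        p = placebo_pvalue(df, y_col, regressors, n, rng)
        if p > phi:
            break
        n += step
    return n, p
```

Note that, since $Z$ is unrelated to $Y$ by construction, the p-value from a single draw of $Z$ is noisy, so the loop may need to average over several placebo draws at each $N$.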
The intuition behind the second method is that a large enough sample should be able to detect the insignificance of $Z$. Yet perhaps this sample size differs from the one needed to detect a genuine effect (as in approach 1)?
Does either of these approaches make sense? Is there a better/faster/easier method?
Note: the post asking about rules of thumb for sample size (here) does not address the problem of subsampling. Given my dataset, I can actually measure how well a subsample represents the original sample; the problem is then to find the best methodology to measure that fit.