I have labour market data with 9 million observations for a single time period (i.e. cross-section data). I am studying the determinants of wages in a single-equation multiple regression with around 300 regressors. If $Y$ is the vector of wages (9 million $\times$ 1) and $X$ is a 9 million $\times$ 300 matrix, the model is $$Y = XB + e$$ For computational reasons, I want to scale down the sample. My intuition is that a sufficiently large subsample of size $N$ should preserve the information/variability of the whole sample. My goal is to select a method for choosing $N$.
I am considering the following two methods:
1) Comparison of $R^2$:
- Select subsample of size $N_{0}$
- Estimate the model with all regressors, and then again without the regressors of interest (around 7 of them).
- Record $R^2$ of both models, and define $D$ as the absolute difference between them.
- Run again for $N_{1} = N_{0} + 100$ (or any other incremental size).
- Keep iterating until $D$ is below a certain threshold $\mu$ (a sketch of this loop follows the list).
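For concreteness, here is a minimal sketch of method 1 in Python with statsmodels, assuming the data sit in a pandas DataFrame `df` with the wage column name in `y_col` and the regressor names split into `vars_of_interest` (the ~7 regressors of interest) and `controls`; these names and the default values for `n0`, `step` and `mu` are placeholders, not recommendations:

```python
import numpy as np
import statsmodels.api as sm

def r2_gap(df, y_col, vars_of_interest, controls, n, rng):
    """Draw a subsample of size n and return D = |R2_full - R2_restricted|."""
    sub = df.sample(n=n, random_state=rng)
    y = sub[y_col]
    X_full = sm.add_constant(sub[vars_of_interest + controls])
    X_restricted = sm.add_constant(sub[controls])
    r2_full = sm.OLS(y, X_full).fit().rsquared
    r2_restricted = sm.OLS(y, X_restricted).fit().rsquared
    return abs(r2_full - r2_restricted)

def find_n_by_r2(df, y_col, vars_of_interest, controls,
                 n0=10_000, step=100, mu=1e-3, seed=0):
    """Grow the subsample until the R^2 gap D drops below the threshold mu."""
    rng = np.random.RandomState(seed)
    n, d = n0, np.inf
    while n <= len(df):
        d = r2_gap(df, y_col, vars_of_interest, controls, n, rng)
        if d < mu:
            break
        n += step
    return n, d
```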
2) Placebo variable significance:
- Create a random indicator variable $Z$. By construction, it should have no effect on the dependent variable.
- Select subsample of size $N_{0}$
- Run the regression with $Z$ added as an extra regressor.
- Record the p-value of the coefficient on $Z$.
- Run again for $N_{1} = N_{0} + 100$ (or any other incremental size).
- Keep iterating until the p-value for the placebo coefficient is above a certain threshold $\phi$ (a sketch follows the list).
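A corresponding sketch of method 2, again in Python with placeholder names (`df`, `y_col`, `regressors`) and an arbitrary default for $\phi$; the placebo $Z$ is drawn independently of everything in the data:

```python
import numpy as np
import statsmodels.api as sm

def placebo_pvalue(df, y_col, regressors, n, rng):
    """Draw a subsample of size n, add a random indicator Z,
    and return the p-value on Z from the augmented regression."""
    sub = df.sample(n=n, random_state=rng).copy()
    sub["Z"] = rng.binomial(1, 0.5, size=len(sub))  # placebo: unrelated to Y by construction
    X = sm.add_constant(sub[regressors + ["Z"]])
    fit = sm.OLS(sub[y_col], X).fit()
    return fit.pvalues["Z"]

def find_n_by_placebo(df, y_col, regressors,
                      n0=10_000, step=100, phi=0.10, seed=0):
    """Grow the subsample until the placebo p-value exceeds the threshold phi."""
    rng = np.random.RandomState(seed)
    n, p = n0, 0.0
    while n <= len(df):
        p = placebo_pvalue(df, y_col, regressors, n, rng)
        if p > phi:
            break
        n += step
    return n, p
```

Note that, since $Z$ is unrelated to $Y$ by construction, the p-value from a single draw of $Z$ is noisy, so the loop may need to average over several placebo draws at each $N$.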
The intuition behind the second method is that a large enough sample should be able to detect the insignificance of $Z$. Yet perhaps this sample size differs from the one needed to detect a genuine effect (as in approach 1)?
Does either of these approaches make sense? Is there a better/faster/easier method?
Note: the post asking about rules of thumb for sample size (here) does not address the problem of subsampling. Given my dataset, I can actually measure how well a subsample represents the original sample; the problem is then to find the best methodology to measure that fit.