Proper Sampling - can I collect a two-group sample this way without issues?

Question

I need to collect a two-group sample for a comparison analysis (perhaps using logistic regression).

The population that I need to extract a sample from is all firms from country A with activities in country B. The firms are classified into two categories: having a subsidiary in country B (S), or not having a subsidiary in country B (NS). I expect the share of S firms to be small relative to NS firms (but I have no way of knowing for sure).

I already hold the entire population of S firms (because this data was available to me). However, data on NS firms is not readily available, so I have to collect it, and I will probably not be able to identify and collect all NS firms.

So my situation is that I have the entire population of S firms and need to collect enough NS firms for the subsequent analysis to yield significant results. Most likely my final sample will consist of all S firms and some share of the population of NS firms. Without much experience in doing these kinds of studies, I can't help thinking that there is some kind of bias/reliability issue when sampling this way (one group: entire group population; other group: some part of group population). I have learned that if the population of NS firms is indeed much larger than that of S firms (again, there is no way to know without data for the entire population of firms), and I end up with, e.g., similar-sized samples of each group, the minority group will be oversampled. However, I cannot find any remarks anywhere that consider this a problem for a comparison study, since a sample that correctly represents the entire population is less important in that setting.

Is my concern justified? Or is it fine to do it that way for e.g. logistic regression? If not, how can I get around the issue?

For logistic regression [it's fine](http://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients/68726), though you'll have to be careful that the sample of NS firms is not biased with respect to any predictors. — Scortchi - Reinstate Monica, Oct 15 '13 at 19:41
What I mean is that ideally you'd pick firms from country A at random until you have enough doing business in country B for the NS group in your sample. If you were to start, say, by looking at larger firms first, then it should be no surprise to find, when you compare NS with S, that larger firms appear more likely not to have a subsidiary in country B. If you can't do a random sample for some reason, then the more plausible such selection biases are, the less compelling your conclusions will be. — Scortchi - Reinstate Monica, Oct 17 '13 at 08:52
I get that there is some selection bias in my suggestion. I am thinking of some ways to reduce this without making the research impossible. I'll get back with some comments regarding this soon (just got back from vacation, sorry for the delay). — SuppaiKamo, Oct 26 '13 at 10:18
No selection bias would be expected *if* the share of the NS population used for the study is selected at random. — Scortchi - Reinstate Monica, Oct 27 '13 at 20:49

score 0 · Answer 1 · edited Apr 13 '17 at 12:44

If you are, say, comparing the means of the S vs. NS firms, then it absolutely does not matter whether you have 50 S firms vs. 50 or 2000 NS firms -- you get the means, you get their variances (the sampling variance of the S mean will be zero, though, since you hold the entire S population), and you form the $t$-test as the difference in means divided by the square root of the sum of the variances of these means.
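A minimal sketch of that statistic, using only the Python standard library. The function name `mean_difference_t` and the flag `s_is_full_population` are made up for illustration; treating the S mean as having zero sampling variance assumes the S data really are a complete census.

```python
import math
import statistics


def mean_difference_t(s_values, ns_values, s_is_full_population=True):
    """t-type statistic for the difference between the S and NS group means.

    Hypothetical helper: assumes ns_values is a simple random sample of
    NS firms and, by default, that s_values covers every S firm.
    """
    mean_s = statistics.fmean(s_values)
    mean_ns = statistics.fmean(ns_values)

    # Sampling variance of the NS mean: sample variance divided by n.
    var_mean_ns = statistics.variance(ns_values) / len(ns_values)

    # A complete census of S has no sampling error in its mean.
    if s_is_full_population:
        var_mean_s = 0.0
    else:
        var_mean_s = statistics.variance(s_values) / len(s_values)

    # Difference in means over the square root of the summed variances.
    return (mean_s - mean_ns) / math.sqrt(var_mean_s + var_mean_ns)
```

The group sizes enter only through the variance of each mean, so an unbalanced design changes the precision of the comparison, not its validity.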

If you want to do something more complicated, like run a regression on the combined sample, then you can weight your firms by their inverse selection probabilities, as is always done in the survey statistics world. If your dependent variable is whether the firm has a subsidiary (i.e., the variable on which the sampling is based), then, as shown multiple times in the literature (see overview here, and Scortchi's linked answer), the coefficients of the covariates will be estimated without (asymptotic) bias, while the intercept (which few people probably care about, unless you need to figure out the exact population proportions, which you know already) is biased but correctable.
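To make the two corrections concrete, here is a rough sketch under stated assumptions. The helper names and the selection probabilities `p_select_s` and `p_select_ns` are hypothetical; in the asker's setting `p_select_s` would be 1 (all S firms are included), while `p_select_ns` would have to be known or estimated from the size of the NS population. The intercept shift is the standard offset for sampling on the outcome: the sample log-odds exceed the population log-odds by the log of the ratio of selection probabilities.

```python
import math


def correct_intercept(sample_intercept, p_select_s, p_select_ns):
    """Shift an intercept fitted on the outcome-based sample to the population scale.

    Assumes selection depends only on group membership (S vs. NS),
    not on any covariate in the model.
    """
    # Sampling on the outcome adds log(p_select_s / p_select_ns) to the log-odds.
    return sample_intercept - math.log(p_select_s / p_select_ns)


def selection_weights(is_s, p_select_s, p_select_ns):
    """Inverse-selection-probability weight for each firm in the sample."""
    return [1.0 / p_select_s if flag else 1.0 / p_select_ns for flag in is_s]


# Illustrative numbers only: every S firm kept, 10% of NS firms sampled.
adjusted = correct_intercept(0.0, p_select_s=1.0, p_select_ns=0.1)
weights = selection_weights([True, False, False], p_select_s=1.0, p_select_ns=0.1)
```

Either route needs the NS sampling fraction; the slope coefficients do not, which is why the comparison itself survives the unequal sampling.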
