I have a question concerning sampling.
We are planning a study in which we aim to draw conclusions about the relationship between different sources of research funding and academic impact (measured through citations). We are aware that there are several complications with making this connection and we are taking other steps to deal with them, but I want to focus on only one of them here.
We plan to use data from four different funding agencies and know that data on around 700 projects and their principal investigators (PIs) are available from them. We then intend to scrape data on the PIs' publications, and on the funding sources acknowledged in those publications, from internet-based sources such as Scopus and Web of Science. We will then do a network analysis based on the publications. However, making the connection between the projects, their PIs, and the publications produced with funding from those projects is complicated, because the available data is incomplete.
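To give a rough idea of the kind of network analysis we have in mind, below is a minimal sketch of a bipartite PI-publication graph (assuming networkx; the field names `pi_id`, `pub_id` and `funder` are hypothetical placeholders for our actual schema, not settled choices):

```python
# Minimal sketch, assuming networkx; field names are hypothetical placeholders.
import networkx as nx

def build_network(records):
    """Build a bipartite PI-publication graph from scraped records."""
    G = nx.Graph()
    for rec in records:
        G.add_node(rec["pi_id"], kind="pi")
        # The funder attribute may be missing for ~50% of publications (see below).
        G.add_node(rec["pub_id"], kind="publication", funder=rec.get("funder"))
        G.add_edge(rec["pi_id"], rec["pub_id"])
    return G

records = [
    {"pi_id": "pi_001", "pub_id": "doi:10.1000/xyz", "funder": "Agency A"},
    {"pi_id": "pi_002", "pub_id": "doi:10.1000/abc", "funder": None},
]
G = build_network(records)
print(G.number_of_nodes(), G.number_of_edges())  # 4 nodes, 2 edges
```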
Our final sample needs to contain data on the PIs' publications and on the funding source behind the research. Based on preparatory work, we expect that for these ~700 PIs and their publications, data on the funding source will only be available for perhaps 50% of them. We also expect this availability to be biased: for example, it is common to report funding in natural science journals but uncommon in social science journals (other likely dimensions are age, gender, tenure status, etc.).
How do we draw a representative sample of the 700 PIs from the ~50% of them for which data is available?
Additional clarification of the nature of the data:
We will get the names of the PIs from the funding agencies' websites (all of them have databases available online). So far we have asked two of the funding agencies for access to internal data; they would assist us in choosing projects and provide data, but this is unlikely to differ from the data that is already available online.
In order to match and de-duplicate the names from the lists, we will use a database system for the scraping process. Persons (researchers) are stored in one table and their connections to texts and/or projects are stored in another. Each person is entered only once in the database; if the scraper finds the same person again, it only adds another connection. However, we need to check for typos and other writing mistakes (e.g. omitted middle names). Names are therefore split into components (first_names etc.) and Levenshtein distance is used to find very similar names. When the scraper finds something very similar (e.g. same last name and same first name but different middle names), it asks how to proceed and we manually check whether the two persons are the same.
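A minimal sketch of this matching step, assuming the python-Levenshtein package (the distance threshold and record fields shown here are illustrative, not our final values):

```python
# Minimal sketch of the fuzzy name-matching step, assuming python-Levenshtein.
# Thresholds and field names are illustrative assumptions.
import Levenshtein

def is_near_duplicate(a, b, max_dist=1):
    """Flag two person records as possible duplicates when last and first
    names are within a small edit distance; middle names are deliberately
    ignored because they are often omitted."""
    return (Levenshtein.distance(a["last_name"], b["last_name"]) <= max_dist
            and Levenshtein.distance(a["first_name"], b["first_name"]) <= max_dist)

p1 = {"first_name": "Anna", "middle_name": "M.", "last_name": "Svensson"}
p2 = {"first_name": "Anna", "middle_name": "",   "last_name": "Svenson"}  # typo
if is_near_duplicate(p1, p2):
    print("Possible duplicate - queue for manual review")
```

Matches flagged this way are not merged automatically; they only enter the manual review queue described above.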
We have already looked up how many projects there are, and the figure is approximately 700; this is thus not a target sample size but the actual number of projects available. The figure of 50% data availability is an estimate from an initial, pre-analysis look at the data, so it is very rough.
There are likely to be instances where we know for a fact that a funding body was behind published research (project funding), but where we cannot establish the connection well enough, or would have difficulty quantifying how precisely we could make it. For example, we might know that a researcher (PI) received funding in years 1-3 and published with a certain time lag after that, say in years 3-5. We would be able to make a vague connection based on the topics of the publications and how well they match the funded projects, but would not have a direct connection.
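To make the time-lag idea concrete, here is a minimal sketch of such a heuristic (the lag window bounds are illustrative assumptions, not validated values, and a hit would of course still only be a vague connection that needs topic matching on top):

```python
# Rough sketch of the time-lag heuristic; min_lag/max_lag are
# illustrative assumptions, not validated publication lags.
def plausible_link(grant_start, grant_end, pub_year, min_lag=0, max_lag=2):
    """Return True if a publication falls within a plausible lag window
    relative to the funding period, e.g. funding in years 1-3 and
    publication in years 3-5."""
    return grant_start + min_lag <= pub_year <= grant_end + max_lag

# Funding in 2015-2017; a 2018 publication is plausible, a 2021 one is not.
print(plausible_link(2015, 2017, 2018))  # True
print(plausible_link(2015, 2017, 2021))  # False
```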