I have a question concerning sampling.
We are planning a study in which we aim to draw conclusions about the relationship between different sources of research funding and academic impact (measured through citations). We are aware that there are several complications with making this connection and we are taking other steps to deal with them, but I want to focus on only one of them here.
We plan to use data from four different funding agencies and know that data on around 700 projects and their principal investigators (PIs) are available from them. We then intend to scrape data on the PIs' publications, and on the funding sources acknowledged in those publications, from internet-based sources such as Scopus and Web of Science. We will then do a network analysis based on the publications. However, making the connection between the projects, their PIs, and the publications produced with funding from those projects is complicated, because the available data is incomplete.
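To give a rough idea of the kind of network analysis we have in mind, below is a minimal sketch of a bipartite PI-publication graph (assuming networkx; the field names `pi_id`, `pub_id` and `funder` are hypothetical placeholders for our actual schema, not settled choices):

```python
# Minimal sketch, assuming networkx; field names are hypothetical placeholders.
import networkx as nx

def build_network(records):
    """Build a bipartite PI-publication graph from scraped records."""
    G = nx.Graph()
    for rec in records:
        G.add_node(rec["pi_id"], kind="pi")
        # The funder attribute may be missing for ~50% of publications (see below).
        G.add_node(rec["pub_id"], kind="publication", funder=rec.get("funder"))
        G.add_edge(rec["pi_id"], rec["pub_id"])
    return G

records = [
    {"pi_id": "pi_001", "pub_id": "doi:10.1000/xyz", "funder": "Agency A"},
    {"pi_id": "pi_002", "pub_id": "doi:10.1000/abc", "funder": None},
]
G = build_network(records)
print(G.number_of_nodes(), G.number_of_edges())  # 4 nodes, 2 edges
```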
Our final sample needs to contain data on the PIs' publications and on the funding source behind the research. Based on preparatory work, we expect that for these ~700 PIs and their publications, data on the funding source will only be available for perhaps 50% of them. We also expect this availability to be biased: for example, it is common to report funding in natural science journals but uncommon in social science journals (other likely dimensions are age, gender, tenure status, etc.).
How do we draw a representative sample of the 700 PIs from the ~50% of them for which data is available?
Additional clarification of the nature of the data:
We will get the names of the PIs from the funding agencies' websites (all of them have databases available online). So far we have asked two of the funding agencies for access to internal data; they would assist us in choosing projects and provide data, but this is unlikely to differ from the data that is already available online.
In order to match and de-duplicate the names from the lists, we will use a database system for the scraping process. Persons (researchers) are stored in one table and their connections to texts and/or projects are stored in another. Each person is entered only once in the database; if the scraper finds the same person again, it only adds another connection. However, we need to check for typos and other writing mistakes (e.g. omitted middle names). Names are therefore split into components (first_names etc.) and Levenshtein distance is used to find very similar names. When the scraper finds something very similar (e.g. same last name and same first name but different middle names), it asks how to proceed and we manually check whether the two persons are the same.
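A minimal sketch of this matching step, assuming the python-Levenshtein package (the distance threshold and record fields shown here are illustrative, not our final values):

```python
# Minimal sketch of the fuzzy name-matching step, assuming python-Levenshtein.
# Thresholds and field names are illustrative assumptions.
import Levenshtein

def is_near_duplicate(a, b, max_dist=1):
    """Flag two person records as possible duplicates when last and first
    names are within a small edit distance; middle names are deliberately
    ignored because they are often omitted."""
    return (Levenshtein.distance(a["last_name"], b["last_name"]) <= max_dist
            and Levenshtein.distance(a["first_name"], b["first_name"]) <= max_dist)

p1 = {"first_name": "Anna", "middle_name": "M.", "last_name": "Svensson"}
p2 = {"first_name": "Anna", "middle_name": "",   "last_name": "Svenson"}  # typo
if is_near_duplicate(p1, p2):
    print("Possible duplicate - queue for manual review")
```

Matches flagged this way are not merged automatically; they only enter the manual review queue described above.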
We have already looked up how many projects there are, and the figure is approximately 700; this is thus not a target sample size but the actual number of projects available. The figure of 50% data availability is an estimate from an initial, pre-analysis look at the data, so it is very rough.
There are likely to be instances where we know for a fact that a funding body was behind published research (project funding), but where we cannot establish the connection well enough, or would have difficulty quantifying how precisely we could make it. For example, we might know that a researcher (PI) received funding in years 1-3 and published with a certain time lag after that, say in years 3-5. We would be able to make a vague connection based on the topics of the publications and how well they match the funded projects, but would not have a direct connection.
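To make the time-lag idea concrete, here is a minimal sketch of such a heuristic (the lag window bounds are illustrative assumptions, not validated values, and a hit would of course still only be a vague connection that needs topic matching on top):

```python
# Rough sketch of the time-lag heuristic; min_lag/max_lag are
# illustrative assumptions, not validated publication lags.
def plausible_link(grant_start, grant_end, pub_year, min_lag=0, max_lag=2):
    """Return True if a publication falls within a plausible lag window
    relative to the funding period, e.g. funding in years 1-3 and
    publication in years 3-5."""
    return grant_start + min_lag <= pub_year <= grant_end + max_lag

# Funding in 2015-2017; a 2018 publication is plausible, a 2021 one is not.
print(plausible_link(2015, 2017, 2018))  # True
print(plausible_link(2015, 2017, 2021))  # False
```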