I have a dataset of library document holdings per publication year, which I can query for the frequency of terms, subjects, etc. That allows me to construct time series of the "popularity" of certain academic terms in the overall library corpus.

However, several effects are at play: changes in overall document volume per year (e.g. dips during WWI and WWII, changes in funding, etc.), growth in the share of English-language documents (the native language is not English), the growth of academic disciplines themselves (measured by generic terms like "sociology" and "psychology"), and the popularity of certain specific subjects, e.g. "social stratification", "terrorism", etc.

I would like to disentangle these effects, e.g. present the growth of studies about "immigration", net of the growth of the discipline and of overall document volume.

Is there a technique or package (preferably for R) I can look into? An added difficulty is that I do not have direct access to the full dataset of 13 million documents, but I can query for frequencies per year on various indices.

kjetil b halvorsen
mhermans

1 Answer

Economists have been working with decompositions like that for several decades now. Some of the favorite applications are labor discrimination (the gap in the wages of males and females: is it fully explained by the differences in the occupations they take?) and poverty analysis (what are the contributions of changes in inequality and of overall economic growth to changes in poverty?). I worked on this many years ago, and Google refers to Tony Shorrocks' paper on Shapley decomposition.

The mechanics of the Shapley decomposition are as follows. Each contributing factor can be thought of as being "on" or "off", i.e., as having only two levels. E.g., in poverty analysis, I can have the Swedish income distribution and the US income distribution (two levels), Swedish mean income and US mean income (two levels), and two poverty line definitions (half of the median income or a fixed figure like USD 10000 in purchasing power parity units). I consider all possible combinations of factors being "on" and "off" (with three factors, I will have 2^3 = 8 possible outcomes, such as poverty rates). I then collect for each factor the marginal effect of the change in that factor, keeping other things equal. Thus, for the inequality effect, I would consider the difference in poverty rates induced by changing the inequality from the Swedish levels to the US levels, for four different baseline scenarios (= 2^2 combinations of the remaining two factors, income level and poverty definition). Then I define the contribution of inequality to the poverty rate as the average across those four scenarios. (Strictly, with three or more factors the Shapley value is a weighted average: each scenario is weighted by the number of orderings of the factors in which it arises, which is what makes the contributions add up exactly to the total change.) The paper I cited shows the nice mathematical properties of the decomposition, and we had a couple of other papers where we applied what I described above to poverty in Russia (over time and across regions).
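The mechanics above can be sketched in a few lines of Python (the same logic ports directly to R). This is a minimal illustration, not the implementation from the cited paper: the function name `shapley_decomposition`, the toy `outcome` function, and all numbers are hypothetical. It averages each factor's marginal effect over all orderings in which the factors can be switched "on", which is equivalent to the weighted average over baseline scenarios.

```python
from itertools import permutations


def shapley_decomposition(outcome, n_factors):
    """Shapley contribution of each two-level factor to outcome(all on) - outcome(all off).

    outcome: function taking a tuple of 0/1 levels (0 = "off", 1 = "on").
    """
    totals = [0.0] * n_factors
    orderings = list(permutations(range(n_factors)))
    for order in orderings:
        levels = [0] * n_factors            # start with every factor "off"
        previous = outcome(tuple(levels))
        for j in order:
            levels[j] = 1                   # switch factor j "on", others unchanged
            current = outcome(tuple(levels))
            totals[j] += current - previous  # marginal effect of factor j
            previous = current
    return [t / len(orderings) for t in totals]


# Hypothetical two-level values for three factors a, b, c (made-up numbers).
VALUES = [(2, 3), (10, 20), (5, 1)]


def outcome(levels):
    """Toy outcome a * b + c; a real analysis would plug in counterfactual results here."""
    a, b, c = (VALUES[i][lvl] for i, lvl in enumerate(levels))
    return a * b + c


contributions = shapley_decomposition(outcome, 3)
total_change = outcome((1, 1, 1)) - outcome((0, 0, 0))
print(contributions, total_change)
```

The contributions sum to the total change by construction, which is the main practical appeal: nothing is left over as an unexplained interaction term.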

In your analysis, you would have to come up with meaningful ways to create the counterfactual results (what would the change in the number of immigration studies be if there were no change in the language? if there were no changes between disciplines? etc.) If you can come up with such counterfactuals, then it will be quite straightforward to apply this methodology.
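One possible way to build such counterfactuals, offered only as an assumption about how the queried frequencies could be combined, is an accounting identity. The symbols below are hypothetical: $N_t$ is the number of documents on the topic in year $t$, $V_t$ is the total document volume, $d_t$ is the share of documents belonging to the discipline, and $s_t$ is the share of the discipline's documents that deal with the topic.

$$N_t = V_t \cdot d_t \cdot s_t$$

A counterfactual then mixes years, e.g. "volume as in year 1, everything else as in year 0":

$$N^{\text{cf}} = V_1 \cdot d_0 \cdot s_0$$

Each of the three factors has two levels (its year-0 value and its year-1 value), so the $2^3 = 8$ combinations are exactly the scenarios the decomposition needs, and every term can be computed from per-year frequency queries. A language share could be added as a fourth factor in the same way.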

Another similar decomposition, in a regression context, was proposed by Gary Fields. For a regression equation $y=\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + \epsilon$, it essentially boils down to the covariance between $y$ and $\hat \beta_j x_j$ for each regressor $j = 1, \ldots, k$, as far as I remember. I did not work with this decomposition that much, although I did have it implemented in Stata.
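A minimal sketch of this regression-based decomposition, assuming the usual form of the shares, $s_j = \operatorname{cov}(\hat\beta_j x_j, y) / \operatorname{var}(y)$, which sum to the regression $R^2$. The helper names (`covariance`, `ols`, `fields_shares`) and the data are hypothetical, and the small hand-written least-squares solver is there only to keep the example free of third-party packages.

```python
from statistics import mean


def covariance(u, v):
    """Sample covariance of two equally long sequences."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def ols(columns, y):
    """Least-squares coefficients [intercept, b_1, ..., b_k] via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[r][p] / A[r][r] for r in range(p)]


def fields_shares(columns, y):
    """Share of var(y) attributed to each regressor: cov(b_j * x_j, y) / var(y)."""
    beta = ols(columns, y)
    var_y = covariance(y, y)
    return [covariance([b * x for x in col], y) / var_y
            for b, col in zip(beta[1:], columns)]


# Made-up illustrative data: two regressors and one outcome.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 12.2]

shares = fields_shares([x1, x2], y)

# R^2 of the same regression, for comparison with the sum of the shares.
beta = ols([x1, x2], y)
fitted = [beta[0] + beta[1] * a + beta[2] * b for a, b in zip(x1, x2)]
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_tot = sum((yi - mean(y)) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot
print(shares, r_squared)
```

The remainder, $1 - R^2$, is the share attributed to the residual.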

StasK