I have two datasets (1 million vs. 500 thousand records). The first (group 1) contains the citation counts of publications indexed by both of two prominent databases, db1 and db2. The second (group 2) contains the citation counts of publications indexed only by db1. Both samples have very long tails.
The publications in the first group were retrieved automatically by matching the two databases on paper titles. Thousands of the publications in the second group might still be indexed by both databases; however, since verifying this would require heavy manual work and is practically impossible, I treated them as not indexed by both sources.
With that background (hopefully it is clear), here is my issue. My hypothesis is that publications indexed by both databases are cited more than those indexed only by db1. To test this, I plan to compare the tails of the two datasets and check which one is heavier. However, since group 2 may still contain publications that belong to group 1, I could not figure out how to approach this.
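To make "comparing the heaviness of the tails" concrete, one option I am considering is the Hill estimator of the tail index. The following is only a minimal sketch under stated assumptions: the function name `hill_tail_index`, the choice of `k`, and the synthetic Pareto samples standing in for the two groups are all hypothetical and are not my real data.

```python
import numpy as np

def hill_tail_index(values, k):
    """Hill estimate of the tail index alpha, using the k largest values.

    A smaller alpha indicates a heavier tail.
    """
    x = np.sort(np.asarray(values, dtype=float))
    x = x[x > 0]  # logarithms need positive values, so zero counts are dropped
    if not 0 < k < x.size:
        raise ValueError("k must be at least 1 and smaller than the number of positive values")
    top = x[-k:]           # the k largest observations
    threshold = x[-k - 1]  # the (k+1)-th largest observation
    # Note: with heavily tied integer counts the mean log-ratio can be 0,
    # in which case the estimate is not usable.
    return 1.0 / np.mean(np.log(top / threshold))

# Synthetic stand-ins for the two groups (assumption: classical Pareto tails).
rng = np.random.default_rng(0)
both_dbs = rng.pareto(1.5, size=200_000) + 1.0      # hypothetical group 1
db1_only = rng.pareto(2.5, size=100_000) + 1.0      # hypothetical group 2

alpha_both = hill_tail_index(both_dbs, k=2000)
alpha_db1_only = hill_tail_index(db1_only, k=2000)
print("group 1 tail index:", alpha_both)
print("group 2 tail index:", alpha_db1_only)
```

The estimate is known to be sensitive to `k`, so in practice I would have to check it over a range of `k` values rather than trust a single number.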
Inspecting both datasets, I observed that the values in the long tail of group 1 are denser than those in the long tail of group 2. Based on this, two points confuse me:
How might the values in group 2 that actually belong to group 1 affect the heaviness of group 1's tail?
1) When they are poorly cited? 2) When they are highly cited?
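One way I could write the two cases down is as a mixture. The notation is my own assumption: $S_1(x)$ and $S_2(x)$ are the true survival functions $P(X > x)$ of the citation counts for papers indexed by both databases and by db1 only, and $\varepsilon$ is the unknown fraction of my group 2 that actually belongs to group 1. If the misclassified papers were a random draw from the papers indexed by both databases, the observed group 2 tail would be

$$S_2^{\text{obs}}(x) = (1 - \varepsilon)\, S_2(x) + \varepsilon\, S_1(x).$$

Under that assumption, wherever $S_1(x) > S_2(x)$ the observed group 2 tail lies between the two true tails, so the contamination would shrink the observed gap rather than create one. If the misclassified papers are not a random draw (mostly poorly cited, or mostly highly cited), $S_1$ in the mixture would have to be replaced by the survival function of that subset, which is the distinction between the two cases above.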
My rough hunch is that if they are highly cited, they would support my hypothesis, that is, that papers indexed by both sources are cited more. On the other hand, if they are poorly cited, they won't have an effect on the tail of group 1. Given this, can I compare the skewness or heaviness of the tails? If my approach is sound, how should I proceed? For example, is the following a good start?