I've got a large collection of geotagged tweets, each linking to an article by a given author. For each author, I'd like to derive a number that describes how diverse their list of tweeting countries is.
Of course I could just count the number of countries represented, but I'd like to normalize by the number of tweets linking to each author. For instance, an author whose 1000 tweets come from 50 countries should rank lower in geographic diversity than an author whose tweets also come from 50 countries but number only 100.
A naive way would be to use tweets per country, but this seems less useful given that there is only a limited number of countries to choose from: one's 150th country is less likely to show up than one's 15th, and the simple ratio doesn't reflect this.
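To make that concern concrete, here is a minimal sketch of what I mean. It assumes, purely for illustration, that there are 200 possible countries and that every tweet is equally likely to come from any of them; neither assumption holds for my real data, and the function name `expected_distinct` is my own. Under independent draws, the expected number of distinct countries after n tweets is the sum over countries of 1 - (1 - p_i)^n, so the naive ratio (shown here as countries per tweet, the reciprocal of tweets per country) can be computed directly:

```python
def expected_distinct(n_tweets, probs):
    """Expected number of distinct countries seen after n_tweets
    independent draws, where probs[i] is the chance that a single
    tweet comes from country i."""
    return sum(1 - (1 - p) ** n_tweets for p in probs)


# Assumption for illustration only: 200 equally likely countries.
N_COUNTRIES = 200
uniform = [1 / N_COUNTRIES] * N_COUNTRIES

for n in (10, 100, 1000):
    d = expected_distinct(n, uniform)
    print(f"{n:5d} tweets -> {d:6.1f} expected countries, "
          f"{d / n:.3f} countries per tweet")
```

Even though the underlying spread of countries is identical in all three cases, the ratio falls as the tweet count grows, because the count of distinct countries saturates toward the fixed ceiling. So a simple proportion would penalize heavily tweeted authors regardless of how diverse their audience actually is.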
I've got some vague ideas about using a binomial distribution, but would love to get a more experienced perspective.