Community detection in network

Question

I'm fairly new to the subject of network theory and community detection, and I'm trying to apply it to some data that I have. To start, my data essentially looks like this:

Basically, what I have is a list of cities, a list of people, and whether or not each person has been to each city. I have no data on how frequently a person visits a city, the order in which they visit cities, or the time between visits, just whether or not they were there. (Technically speaking, a 0 does not guarantee they weren't there, only that they weren't detected there. For simplicity, I think it is best not to worry about this at this point.)

What I'm trying to do is use this information with community detection algorithms to see if I can identify how cities are clustered together without using any kind of geographic data. At the highest level, you might expect some kind of regional clustering of cities at the scale of a state or country. If that regional clustering exists, then within each region the next level might be clusters of major urban areas made up of many cities. And of course, there might be solitary rural cities. My expectation is that people are more likely to visit areas that are more convenient to travel to, whether for work, recreation, shopping, etc., and that this can be used to identify community structure.

I look at this data and can see it being represented as a graph in several different ways. It could easily be viewed as a hypergraph, as a multigraph, or as a bipartite graph. For some of what I've tried, I'm collapsing it into a complete weighted graph: I create a pairwise adjacency matrix of the cities with a single similarity or distance metric for each pair of cities (in my case, the Jaccard index), and then use this adjacency matrix with the community detection algorithms in igraph that try to maximize modularity.

To a degree, this works. I can see regional clustering that makes sense based on geographic features. However, repeating the same process within these regions does not seem to work as well. I also notice that individuals who occur at more cities tend to make things worse, and the community detection process works better when they are removed. From a randomized sampling standpoint, though, arbitrarily removing these people is terrible. I'm also not sure whether these community detection algorithms are really intended to be used with complete graphs. On top of that, I don't understand modularity well enough yet to know if its limitations are coming into play.
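For concreteness, the Jaccard step described above can be sketched roughly as follows. This is only an illustration under assumptions: the `visits` dictionary (city name mapped to the set of people detected there) and the function name `jaccard_similarities` are hypothetical and not part of the original data or code.

```python
from itertools import combinations


def jaccard_similarities(visits):
    """Pairwise Jaccard index between cities.

    `visits` is a hypothetical mapping: city -> set of people detected there.
    Returns a dict keyed by (city_a, city_b) with city_a < city_b.
    """
    sims = {}
    for a, b in combinations(sorted(visits), 2):
        union = visits[a] | visits[b]
        shared = visits[a] & visits[b]
        # Jaccard index = |intersection| / |union|; defined as 0 if both sets are empty.
        sims[(a, b)] = len(shared) / len(union) if union else 0.0
    return sims


# Toy data, made up purely for illustration.
example_visits = {
    "A": {"p1", "p2", "p3"},
    "B": {"p2", "p3", "p4"},
    "C": {"p5"},
}
example_sims = jaccard_similarities(example_visits)
```

The resulting values are the edge weights of the complete weighted city graph; pairs with no shared visitors get weight 0.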

Another interesting approach I've seen but haven't tried is using a simulated annealing algorithm with the data in a bipartite graph to maximize modularity.

I guess my question is, what approaches would you recommend for community detection with this type of data, and where there is the potential for hierarchical structure?

Do you want to force the clusters to be spatially contiguous? See [Nelson and Rae (2016)](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0166083) for one recent example of regionalization. I've had reasonable success using [this community detection approach](http://stats.stackexchange.com/a/140129/1036) to cluster the original bipartite data, but sometimes reality is complex. — Andy W, Dec 20 '16 at 14:13
@Andy There is an expectation that the clusters will be spatially contiguous, but I don't want to force it, per se. The reason is that one of the next things I want to do is test how geographic features might be shaping the structure of the network, so I want to avoid incorporating it in this step to keep the two sets of data independent — anjama, Dec 20 '16 at 17:59
This is a simpler approach to modularity, which would perhaps be of interest. Newman, "Modularity and community structure in networks," (2006) http://www.pnas.org/content/103/23/8577.full — Sycorax, Dec 23 '16 at 03:40

score 1 · Answer 1 · answered Dec 23 '16 at 00:19

My favorite tool is Kovács' generalized similarity, defined for bipartite graphs in a mutually recursive way: two cities are similar if they have been visited by similar people; two people are similar if they visit similar cities. The similarities range from -1 to 1. The solution is a pair of square similarity matrices (MP for people and MC for cities). As the next step, dichotomize the matrix MC by keeping only the most significant similarities, and apply any community detection algorithm to it. There is a Python module that calculates generalized similarity for a bipartite graph: https://github.com/dzinoviev/generalizedsimilarity.
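The dichotomization step could look something like the sketch below. It does not compute the generalized similarity itself (the linked module does that); it only assumes that a square city similarity matrix is already available as a list of lists. The function name `dichotomize` and the choice of a fixed cutoff are assumptions made for illustration; how to pick "the most significant" similarities is left open above.

```python
def dichotomize(sim, threshold):
    """Turn a square similarity matrix into a 0/1 adjacency matrix.

    `sim` is a list of lists with values in [-1, 1]; `threshold` is a
    hypothetical cutoff chosen by the analyst. Self-similarities on the
    diagonal are dropped so the result has no self-loops.
    """
    n = len(sim)
    return [
        [1 if i != j and sim[i][j] >= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


# Made-up similarity values for three cities.
example_mc = [
    [1.0, 0.8, -0.2],
    [0.8, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
]
example_adjacency = dichotomize(example_mc, threshold=0.5)
```

The resulting unweighted adjacency matrix can then be passed to whichever community detection routine is preferred.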

Reference: Balázs Kovács, "A generalized model of relational similarity," Social Networks, 32(3), July 2010, pp. 197–211.

This idea is close to correspondence analysis. — kjetil b halvorsen, Jun 26 '19 at 07:35

score 1 · Answer 2 · answered Dec 23 '16 at 05:08

I would suggest the spectral clustering method for bipartite graphs proposed by Dhillon in the context of text mining (it is not specific to text, though; see page 4 of the paper for the algorithm): http://www.cs.utexas.edu/users/inderjit/public_papers/kdd_bipartite.pdf

This gives you the communities of both cities as well as people. You can ignore the people labels if you are not interested.
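As a rough illustration of the two-way split in that approach, here is a minimal pure-Python sketch. It scales the city-by-person matrix by the square roots of its row and column degrees, finds the second singular vectors by power iteration with the trivial leading vector projected out, and splits on their sign. Two things are my simplifications rather than the paper's procedure: the sign split stands in for the 1-D k-means step, and power iteration stands in for a proper SVD routine. The function name and toy matrix are hypothetical, and the sketch assumes every city and every person has at least one visit and that the graph is not a single undivided block.

```python
import math
import random


def spectral_bipartition(A, iters=500, seed=0):
    """Split rows (cities) and columns (people) of a 0/1 matrix into two groups."""
    n_rows, n_cols = len(A), len(A[0])
    d1 = [sum(row) for row in A]                                   # city degrees
    d2 = [sum(A[i][j] for i in range(n_rows)) for j in range(n_cols)]  # person degrees
    # Normalized matrix An = D1^{-1/2} A D2^{-1/2} (requires all degrees > 0).
    An = [[A[i][j] / math.sqrt(d1[i] * d2[j]) for j in range(n_cols)]
          for i in range(n_rows)]
    # The leading right singular vector is proportional to sqrt(column degrees).
    v1 = [math.sqrt(d) for d in d2]
    norm = math.sqrt(sum(x * x for x in v1))
    v1 = [x / norm for x in v1]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n_cols)]
    for _ in range(iters):
        # One step of power iteration on An^T An.
        u = [sum(An[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(An[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        # Project out the trivial vector so the iteration converges to the second one.
        dot = sum(a * b for a, b in zip(v, v1))
        v = [a - dot * b for a, b in zip(v, v1)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(An[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    # Sign split (a simplification of the 1-D k-means step).
    city_labels = [int(x > 0) for x in u]
    person_labels = [int(x > 0) for x in v]
    return city_labels, person_labels


# Toy matrix: rows are cities, columns are people; person 2 bridges the two blocks.
example_A = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
example_city_labels, example_person_labels = spectral_bipartition(example_A)
```

Which group ends up labeled 0 or 1 is arbitrary, since a singular vector is only defined up to sign. For more than two communities, the paper uses several singular vectors followed by k-means, which this sketch does not attempt.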

I would also visualize the network first (my favorite is the DrL layout in the igraph package) to see whether any communities exist; this also gives you some idea of how many communities to choose.
