2

I have a simple, undirected graph where I'd like to detect "natural" subgraphs where vertices are connected intensively internally but sparsely externally. The problem is that I have no exact hint what could be here "intense" and "sparse", I wonder which graph statistics could give some hints here.

I'm also thinking to apply some kind of clustering on the vertices but no idea how to convert the "closeness" in a graph into feature for a node to apply traditional unsupervised method on these.

Fredrik
  • 671
  • 1
  • 5
  • 8
  • Do you have any distances associated with the graph? Weights? You could try to do this based on the number of connections. – user2974951 Sep 24 '18 at 12:41
  • 3
    Possible duplicate of [Partitioning of a graph](https://stats.stackexchange.com/questions/247175/partitioning-of-a-graph) or https://stats.stackexchange.com/questions/329361/clustering-network-usign-modularity-maximization-algorithm – Sycorax Sep 24 '18 at 15:47

3 Answers3

2

Newman's modularity clustering is a common way to get at this idea. He writes

A good division of a network into communities is not merely one in which there are few edges between communities; it is one in which there are fewer than expected edges between communities. If the number of edges between two groups is only what one would expect on the basis of random chance, then few thoughtful observers would claim this constitutes evidence of meaningful community structure. On the other hand, if the number of edges between groups is significantly less than we expect by chance, or equivalent if the number within groups is significantly more, then it is reasonable to conclude that something interesting is going on.

This idea, that true community structure in a network corresponds to a statistically surprising arrangement of edges, can be quantified by using the measure known as modularity (17). The modularity is, up to a multiplicative constant, the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random. (A precise mathematical formulation is given below.)

M. E. J. Newman, "Modularity and community structure in networks", PNAS

Note that Newman's article proposes a strategy of recursively partitioning a network. This can be slow for very large networks with many modular clusters. Instead, the Louvian algorihtm is a faster way to get at modularity; see "Fast unfolding of communities in large networks" by Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre

Sycorax
  • 76,417
  • 20
  • 189
  • 313
  • 2
    +1. Louvian algorihtm all the way. – usεr11852 Sep 24 '18 at 20:40
  • @usεr11852 One of my classes in grad school introduced Newman's modularity & its algorithm but neglected to mention that the Louvian algorithm existed, so I wasted quite a bit of time implementing tricks to speed up Newman's modularity applied to a large graph. *sigh* Now I try to mention it whenever modularity clustering comes up. – Sycorax Sep 24 '18 at 20:46
  • Ooops... Well you now know! :) (I had to do study them from a Statistical Mechanics perspective.) – usεr11852 Sep 24 '18 at 21:04
2

+1 to both existing answers. Louvain's and the Triangle counting algorithms are great ways to go about this. Complementary to those, you might want to consider the following community detection algorithms (one newer and two oldies) early on in your analysis to understand a graph’s structure

  1. Label Propagation algorithm. This has been presented in "Near linear time algorithm to detect community structures in large-scale networks" by Raghavan et al (2007). This is related to the greater affinity propagation clustering approach where labels/information propagate between the nodes of the network forming clusters based on consensus.
  2. Connected Components / Union-find algorithm. This was presented by Galler & Fischer in 1964; it very fast and nowadays is mostly used as pre-processing step in large databases to detect disconnected components.
  3. Strongly Connected Components algorithm, introduced by Tarjan in 1972. It has similar utility as the Union-Find algorithm doing a depth-first search but it is particular for directed graphs.
usεr11852
  • 33,608
  • 2
  • 75
  • 117
1

This depends a little on your use case I guess:

If your nodes are people, you might be interested to look at the global clustering coefficient (lets call this GCC): number of closed triangles within a group of nodes compared to the number of open triangles within the same group. The idea behind that is, that a group of 3 people is more likely to build a close and friendly group if all are friends/followers with each ohter: in other words, the edges connecting the three form a closed triangle. If this is not the case, it is more likely that the two unconnected nodes either do not know or like each other which indicates less social cohesion.

In order to create a formal definition of the problem, we could go with this:

Given a Graph G=(V,E) and the integer k > 0, find k disjunct(? ... maybe not) connected subgraphs G_i=(V_i,E_i) i<=k, so that the sum of GCC(G_i) is max over all potential splits of G into the G_i.

I would start with something like this. If you want to read more on that topic google for "triadic closure".

  • 1
    You might be interested to know that math typesetting is implemented using a latex-like syntax. More information: https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference – Sycorax Sep 25 '18 at 18:55
  • Thank you, Sycorax! I already wondered how to do that properly but read that Latex Code is not rendered on all subsites of the StackExchange network. I will try that. – Elmar Macek Sep 26 '18 at 08:37