Pages

Sunday, 1 April 2012

Simple community detection algorithms

The detection of community networks is an important subject because of its ability to point out the presence of communities where we might have thought were none, to classify different communities on the basis of their origin, functions, etc., and to study their behaviour and change in different circumstances. They have a wide range of applications, and I'm interested in them because of their ability to mimic real-life social groupings. Although the behavioural equivalence of the theoretical and practical models are only approximate, they're fun to work with because of the simulations' capacity to provide valuable insights into the workings of social groups.

[caption id="attachment_22901" align="aligncenter" width="540" caption="A typical social network showing communities (each coloured differently)"][/caption]

Within any network, a community network is defined as a group of nodes that are more densely connected internally than they are to the rest of the network. Essentially, there are two simple ways to detect the presence of communities in complex structures. In both methods, the first step is identify what's called the edge betweennesses. The edge betweenness (EB) of an edge is defined as the shortest path between a pair of nodes that runs through it. If EB is high for a given edge, then it means that that edge belongs to many short paths between different pairs of nodes and is therefore an important connector. Edges with a high EB are always found between communities in complex structures and not within them.

To understand this concept better, imagine a group of people with names A, B, C and so on. If B is part of many communication-channels that are the shortest between different sets of people, then information is more likely to pass through B than with anyone else if those sets of people want to communicate with each other.

Removing such connectors between two pairs of nodes affects the arrangement of edges in the entire structure just as the removing of one person from a social network affects the relationships between all other persons in the network. When an edge is removed, the connection between any number of nodes is lost and the centrality of the entire network changes. Centrality is an indicator of the importance of a node, a measure of "how central" any node is with respect to the other nodes. Analogously, the central-most person in a network is he who influences others in the network the most. Just like EB possesses a graphical definition that's capable of making a computer understand it and work with it, centrality is formulated as the number links a node has. However, unlike EB, centrality doesn't have to be identified when the detection algorithm starts working.

[caption id="attachment_22899" align="aligncenter" width="492" caption="In the Green community, everyone is equally interconnected. Therefore, the centrality is evenly distributed. As far as information flow is concerned, 20 has the most incoming links and 24 has the most outgoing links. Since the only outgoing links of 20 is to 21, 20 has the highest node betweenness between any of 22, 23 and 24, and 21."][/caption]

In the first method, we start with an empty canvas: no nodes, no edges, no communities. However, as we add one node after another in the picture, the interactions between the nodes (which have predefined properties in keeping with the real situation) begin to take shape and the edges begin to acquire an EB closer and closer to what the value of their final EB will be. At each step, the computer calculates the EBs of all nodes and where the EBs are low, points those groups out as the communities. This particular method is slow because at every step, more calculations have to be performed, making the process slower for larger networks.

The second method works exactly the other way around: instead of beginning with an empty canvas, it takes the entire network and successively removes that edge which has the highest EB at that stage. After the removal of each edge, the EB of some number of the other edges changes, and this is the reason this method is faster than the previous one. All the EBs don't have to be calculated at each stage: only those whose EB stands to be affected because they run a path between the same pair of nodes. Even though this method, called the Girvan-Newman algorithm, may not seem too fast, it's at least faster than the other option.

[caption id="attachment_22900" align="aligncenter" width="496" caption="In 1977, W W Zachary recorded interactions in a karate club for two years. During observation, a conflict developed between the administrator and the instructor. As a result, the club broke into two. In wanting to know who joined which club, he found that there was close agreement between a theoretical model that showed that the split had occurred along a line on either side of which the administrator/instructor had highest centrality."][/caption]

Between these two methods, a lot of the complex networks can be analysed and communities spotted, but there is a third method that is much more interesting. This, the clique percolation method (CPM), accounts for the overlapping of communities and assessing the changes within a community and how much they affect farther-away regions of the community. This method doesn't make use of EB but something else called a k-clique. A k-clique is a fully connected sub-graph of k nodes: a k-clique with k = 3 is a triangle, a k-clique with k = 4 is a tetrahedron, and so on, where each vertex is a node.

The CPM algorithm proceeds by identifying all the maximum and maximal cliques. A maximum clique is the largest k-clique within a network. A maximal clique is that k-clique that cannot be extended by adding an adjacent node, where adjacency is defined by the sharing of k - 1 nodes between two k-cliques. Maximalism is also defined on the basis of a k-clique not being a subset of a larger clique. For example, if a tetrahedral 4-clique remains a tetrahedral 4-clique instead of changing into an octahedron upon the addition of a node from an adjacent 4-clique, then it is maximal.

Here's the analogue: There are 5 people, and 3 of them belong to two different groups, each group containing one of the remaining 2 people. If those 2 people managed to get connected directly but the configuration still remained as two groups instead of becoming one, then each group is a maximal 4-clique. This way, once all the maximum and maximal k-cliques have been identified, the structures of various communities emerges, where each community is a clique.

[caption id="" align="aligncenter" width="410" caption="A graph with 23 1-vertex cliques (its vertices), 42 2-vertex cliques (its edges), 19 3-vertex cliques (the light blue triangles), and 2 4-vertex cliques (dark blue). Six of the edges and 11 of the triangles form maximal cliques. The two dark blue 4-cliques are both maximum and maximal - From Wikipedia"][/caption]

If you have an aptitude for programming/algorithm-design, or if you're simply interested in looking at applications that show community detection algorithms at work, you might like this.

No comments:

Post a Comment