Method for Batch Identification of Closed Graphs

It’s the dead of night. I’m running a data processing batch that will take several minutes to finish. If it fails, I’d really like to restart it before turning in, so I’m thinking of re-writing my graph completion verification function to keep myself awake while I wait. Figured this would make a good post and maybe be useful to anyone else working social network analysis within a relational database framework. So here goes…

First, a little bit of background…in social network analysis, it is often useful to know when a sub-graph within the network is complete, meaning that every person has bidirectional communication with every other person in the group – denoting a team or clique. If we think of every person as a node, then a link exists between nodes A and B if there is bidirectional communication between those two nodes. Depending on the kind of social network we are looking at, this may mean that A and B follow each other (twitter), A and B are friends (facebook), A and B have inboxed each other (email), etc.

Note: Determining all complete subgraphs (cliques) within a network is an NP-complete problem. It is also fixed-parameter intractable and difficult to approximate, but there are a number of fascinating and adventurous tricks for rapidly arriving at a decent set of probable complete subgraphs. The Relevance engine I developed for Mozzo Analytics includes a few novel techniques for doing just this.

If we want to see if a given sub-graph is complete, we need to make sure that every node (person) in the group is linked with every other person. This can be done by iterating through every single pair of nodes and checking for bidirectional communication. But all of our data is contained within a relational database. And relational databases hate this kind of iteration. What they like is counting and unioning and intersecting and doing it in batch across the entire dataset. Using SQL, we can easily (for the DB anyway, the query was a beast to write) count the number of nodes in every potential clique and the number of bidirectional connections between those same nodes. If the number of connections matches the number we would have in a closed network of the given size, then we have closure. So we need a function that will give us the number of nodes (n) that can be closed given a certain number of connections (c):

n = f(c)

nodes connections
1 0
2 1
3 3
4 6
5 10
10 45
100 4950

If we have one node, we need zero connections. If we have two nodes, we need one connection. If we have three nodes, we need three connections; four nodes, six connections, five nodes, ten connections; etc. This is best visualized by drawing a circle for each of the nodes and a line between circles for each connection.

The method for finding the number of connections given a number of nodes can be seen when drawing these out. Starting from node 1 (arbitrarily numbered), we draw a connection to every other node. Then, from node 2, we draw a line to every node but node 1 (it’s already connected to node 2). From node 3, connect to all nodes except 1 and 2. etc.

So this is just a sum across an iteration: Sum of all integers from 1 to one less than the number of nodes. It looks like this (for 5 nodes): 4 + 3 + 2 + 1 = 10 connections.

The equation for calculating such a sum from 1 to x is x(x+1)/2

But we only need to go from 1 to total number of nodes-1. So our x is (n-1). Substituting this in to the equation above gives us:

(n-1)( (n-1) + 1 )/2 which simplifies to n(n-1)/2

Now we have the function for determining the number of connections that are required to close a given number of nodes:

c = f(n) = n(n-1)/2

For 5 nodes, this gives us: 5(5-1)/2 = 5(4)/2 = 20/2 = 10, which matches the table above.

But what we really need is the inverse of this function, not f(n), but f-1(c) : the number of nodes that can be closed by a given number of connections. f(n) would work, but for reasons of efficiency, f(c) would be better.

To get the inverse, we just need to take our equation for c and solve for n:

c = n(n-1)/2
2c = n(n-1)
2c = n^2 – n
2c + 1/4 = n^2 – n + 1/4
2c + 1/4 = (n – 1/2)(n – 1/2)
2c + 1/4 = (n – 1/2)^2
SQRT( 2c + 1/4 ) = n – 1/2
SQRT( 2c + 1/4 ) + 1/2 = n

Whew! So there it is, but I want to get rid of those ugly fractions and the mental hassle of dealing with roots of non integers  :

n = SQRT( 2c + 1/4 ) + 1/2
n = SQRT( ( 8c + 1 ) / 4 ) + 1/2
n = SQRT( 8c + 1 ) / SQRT( 4 ) + 1/2
n = SQRT( 8c + 1 ) / 2 + 1/2
n = ( SQRT( 8c + 1 ) + 1 ) / 2

Lets try it to see how many nodes can be closed by 10 connections:

( SQRT( 8(10) + 1 ) + 1 ) / 2
( SQRT( 80 + 1 ) + 1 ) / 2
( SQRT( 81 ) + 1 ) / 2
( 9 + 1 ) / 2
10 / 2

Bam! Got it!

One response to “Method for Batch Identification of Closed Graphs”

  1. Title…

    […]Every once inside a even though we choose blogs that we read. Listed below would be the most current internet sites that we select […]…