Monday, July 29, 2013

network splitting

Last week, Paul Rubin wrote an excellent post on Extracting a Connected Graph from an existing graph. Lately I've been doing related work on data from OpenStreetMap, though without access to a solver. In my case I'm taking in arbitrary network data and splitting it into disconnected subnetworks. I thought it might make a good case study to show an algorithmic way of doing this and some of the performance issues I ran into.

A small example is shown below: a road network around the Las Vegas Strip. There is one main (weakly) connected network in black. The roads highlighted in red are disconnected from the main network. We want code that will split these into connected subnetworks.

Say we have data that looks like the following. The keys represent edges rather than nodes (think of them as streets), and each edge maps to the list of edge IDs it connects to.

{
    "0": [1, 2, 3],
    "1": [9248, 9249, 9250],
    "2": [589, 9665, 9667],
    "3": [0, 5, 6],
    "4": [0, 5, 6],
    "5": [588],
    "6": [4, 8, 9],
    ...
}
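
One thing to note before the code: JSON object keys are always strings, even when they look numeric, which is why the script below converts from_edge back to an int before using it:

>>> import json
>>> json.loads('{"0": [1, 2, 3]}')
{u'0': [1, 2, 3]}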

Our basic strategy is the following:

  1. Start with every edge alone in its own subnetwork.
  2. For each connection, merge the networks of the source and destination edges.

#!/usr/bin/env python
import json
import sys
import time

class hset(set):
    '''A hashable set. Note that it hashes by object identity (its id), not by its elements.'''
    def __hash__(self):
        return hash(id(self))

    def __cmp__(self, other):
        return cmp(id(self), id(other))

if __name__ == '__main__':
    try:
        inputfile = sys.argv[1]
    except IndexError:
        print 'usage: %s network.json' % sys.argv[0]
        sys.exit(1)

    print time.asctime(), 'parsing json input'
    connections = json.load(open(inputfile))

    edge_to_net = {} # Edge ID -> set([edges that are in the same network])
    nets = set()     # Set of known networks

    print time.asctime(), 'detecting disconnected subgraphs'
    for i, (from_edge, to_set) in enumerate(connections.iteritems()):
        from_edge = int(from_edge)

        try:
            from_net = edge_to_net[from_edge]
        except KeyError:
            from_net = edge_to_net[from_edge] = hset([from_edge])
            nets.add(from_net)

        if not (i+1) % (25 * 1000):
            print time.asctime(), '%d edges processed / %d current subnets' % (i+1, len(nets))
        
        for to in to_set:
            try:
                to_net = edge_to_net[to]

                # Both edges already have nets; merge from_net into to_net.
                if to_net is not from_net:
                    to_net.update(from_net)
                    for e in from_net:
                        edge_to_net[e] = to_net
                    nets.remove(from_net)
                    from_net = to_net

            except KeyError:
                from_net.add(to)
                edge_to_net[to] = from_net

    print time.asctime(), len(nets), 'subnets found'
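
An aside on the hset class: plain set objects are unhashable in Python, so they cannot be members of another set such as nets. Subclassing set and hashing by object identity gets around that, and identity is all we need here, since networks are only ever compared with is:

>>> s = set([1, 2])
>>> set([s])
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'set'
>>> set([hset([1, 2])])   # works, because hset hashes by id()
set([hset([1, 2])])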

We run this against the network pictured above and it works reasonably quickly, finishing in about 7 seconds:

Mon Jul 29 12:22:38 2013 parsing json input
Mon Jul 29 12:22:38 2013 detecting disconnected subgraphs
Mon Jul 29 12:22:38 2013 25000 edges processed / 1970 current subnets
Mon Jul 29 12:22:44 2013 50000 edges processed / 124 current subnets
Mon Jul 29 12:22:45 2013 60 subnets found    

However, when run against the road network of an entire city, the process grinds away for several hours. What is the issue?

The inefficiency is in the merge step: to_net.update(from_net), followed by the loop that re-points edge_to_net for every element of from_net. Here we are frequently copying and re-pointing every element of a large set. Instead, it would be better to touch as few elements as possible. Therefore, instead of always merging from_net into to_net, we will determine which network is the smaller of the two and merge that one into the larger one. An edge then only gets re-pointed when it sits in the smaller network, and the merged network is at least double that size, so each edge is re-pointed at most about log2(n) times over the whole run. That should make the code fast enough to be useful. The new version appears below.

#!/usr/bin/env python
import json
import sys
import time

class hset(set):
    '''A hashable set. Note that it hashes by object identity (its id), not by its elements.'''
    def __hash__(self):
        return hash(id(self))

    def __cmp__(self, other):
        return cmp(id(self), id(other))

if __name__ == '__main__':
    try:
        inputfile = sys.argv[1]
    except IndexError:
        print 'usage: %s network.json' % sys.argv[0]
        sys.exit(1)

    print time.asctime(), 'parsing json input'
    connections = json.load(open(inputfile))

    edge_to_net = {} # Edge ID -> set([edges that are in the same network])
    nets = set()     # Set of known networks

    print time.asctime(), 'detecting disconnected subgraphs'
    for i, (from_edge, to_set) in enumerate(connections.iteritems()):
        from_edge = int(from_edge)

        try:
            from_net = edge_to_net[from_edge]
        except KeyError:
            from_net = edge_to_net[from_edge] = hset([from_edge])
            nets.add(from_net)

        if not (i+1) % (25 * 1000):
            print time.asctime(), '%d edges processed / %d current subnets' % (i+1, len(nets))
        
        for to in to_set:
            try:
                to_net = edge_to_net[to]

                # Both edges already have nets; merge them if they differ.
                if to_net is not from_net:
                    # Update references to and remove the smaller set for speed.
                    if len(to_net) < len(from_net):
                        smaller, larger = to_net, from_net
                    else:
                        smaller, larger = from_net, to_net

                    larger.update(smaller)
                    for e in smaller:
                        edge_to_net[e] = larger
                    nets.remove(smaller)
                    edge_to_net[to] = larger
                    from_net = larger

            except KeyError:
                from_net.add(to)
                edge_to_net[to] = from_net

    print time.asctime(), len(nets), 'subnets found'

Indeed, this is significantly faster. On very large networks it runs in minutes instead of hours or days, and on the small test case used for this post it finishes in under a second. This could probably be made faster still, but it's good enough for right now.

Mon Jul 29 12:39:55 2013 parsing json input
Mon Jul 29 12:39:55 2013 detecting disconnected subgraphs
Mon Jul 29 12:39:55 2013 25000 edges processed / 1970 current subnets
Mon Jul 29 12:39:55 2013 50000 edges processed / 124 current subnets
Mon Jul 29 12:39:55 2013 60 subnets found

2 comments:

  1. That seems like a classical example for a Union-Find data structure. A big difference is that the classical union-find structure does not update an entire set when a new edge is discovered. Instead, it keeps the information of edge_to_net in a tree that has to be traversed to the top to find the actual set something currently belongs to. During an update, only the root node of the smaller set gains an up-pointer to the root node of the larger set: O(log n), no matter how large the sets are! (The tree stays somewhat "balanced" automatically because smaller sets are added to larger ones.)
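
For reference, here is a minimal sketch of that idea, using union by size plus path compression. The names parent, size, find, and union are illustrative, and the sketch has not been run against the data from this post:

#!/usr/bin/env python
# Sketch only: union-find (disjoint sets) over the same edge data.
import json
import sys

parent = {}  # edge -> parent edge; roots point to themselves
size = {}    # root edge -> number of edges in its set

def find(e):
    '''Return the root of e's set, compressing the path as we go.'''
    if e not in parent:
        parent[e] = e
        size[e] = 1
        return e
    root = e
    while parent[root] != root:
        root = parent[root]
    while parent[e] != root:  # path compression
        parent[e], e = root, parent[e]
    return root

def union(a, b):
    '''Hang the root of the smaller set under the root of the larger.'''
    ra, rb = find(a), find(b)
    if ra == rb:
        return
    if size[ra] < size[rb]:
        ra, rb = rb, ra
    parent[rb] = ra
    size[ra] += size[rb]

connections = json.load(open(sys.argv[1]))
for from_edge, to_set in connections.iteritems():
    from_edge = int(from_edge)
    find(from_edge)  # give isolated edges a set of their own
    for to in to_set:
        union(from_edge, to)

print len(set(find(e) for e in parent)), 'subnets found'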

    Replies
    1. Thanks for posting. I've actually been meaning to look into Union-Find since it was brought to my attention after this post. Perhaps that will be the subject of a follow-up.
