Skip to main content

Performs greedy agglomerative clustering on network-x graphs

Project description

Agglomerative clustering tool for network-x graphs

Clustering

Implements the algorithm described by: “Fast algorithm for detecting community structure in networks” M. E. J. Newman. 2004 http://arxiv.org/pdf/cond-mat/0309508v1.pdf This is a greedy agglomerative hierarchical clustering algorithm. The alogorithm efficiently clusters large number of nodes (this is one of the best scaling clustering algorithms) while producing a suggested number of clusters. See papers on scaling and accuracy questions regarding greedy Newman.

This implementation uses a heap to select the best pair to cluster at each iteration - A naive implementation considers all “n” edges in the graph (O(n)) - A heap reduces this search dramatically (O(log(n))

Dependencies

allset – for automatic module importing networkx – supported graphing library

Problems

  • The actual Modularity score does not exactly match the Modularity score of the example on the wikipedia page

  • http://en.wikipedia.org/wiki/Modularity_(networks)

  • Does not work for directed graphs (TODO operate on the undirected graph)

  • Does not work for negative graphs (TODO add this capability)

  • Does not handle disconnected components (unless than are components of size 1)

  • Clustering needs to move to a function call rather than an object holder (return dendrogram object)

  • Node relabeling is messy

  • Dendrogram crawling is used for two separate purposes which aren’t clearly defined/called

Attributes

NewmanGreedy objects store the following attributes * Supergraph - Duplicate of the original graph. Gets manipulated during the clustering: edges and nodes are condensed and reweighted - Cluster degree stored in node attribute - Number of connections between clusters stored in edge attribute (weighted edge) - Implicitly keeps track of the existing nodes which is required for the heap * Dendrogram - Stores the clustering history in a tree * Heap - Stores the modularity quality difference for combining each pair of existing nodes - Popping returns the node pair with the largest modulairty/quality difference, then smallest id1, then smallest id2 * Stored as a tuple (value, id1, id2) where the modularity/value is negated, and id1 is the smaller of the two * Processing the smaller IDs first means that smaller clusters will be chosen first during modularity tie breaking - Non-existing nodes are not actively removed from the heap, so pairs with non-existing nodes are ignored when popping * Quality History - Charts the Modularity score for each number of clustering

Author

Author(s): Ethan Lozano & Matthew Seal

Collaborator(s): Zubin Jelveh

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

AgglomCluster-1.0.2.zip (20.7 kB view details)

Uploaded Source

File details

Details for the file AgglomCluster-1.0.2.zip.

File metadata

  • Download URL: AgglomCluster-1.0.2.zip
  • Upload date:
  • Size: 20.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for AgglomCluster-1.0.2.zip
Algorithm Hash digest
SHA256 93d94a3fd2b7d602ae925dec8ca7f5c838732ea640d94d146738fcf253f63afb
MD5 73cf9ad55d3358b09ae64f985115ceaf
BLAKE2b-256 e07cc47754f8075537d3995ec33ffab3a41a74edb07a31ad3812b687b5aa4a3d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page