A small set of graph functions to be used from PySpark on top of NetworkX and GraphFrames
splink_graph
splink_graph is a small graph utility library meant to be used in the Apache Spark environment. It works with graph data structures such as the ones created from the outputs of data linking processes (candidate pair results).
Calculations are performed per cluster/connected component/subgraph in parallel, with the help of pyArrow.
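The per-cluster idea can be illustrated with a minimal plain-Python sketch. It is not the library's implementation: the row layout `(src, dst, cluster_id)`, the helper names and the sample data are all assumptions made for illustration. In Spark the grouping and the parallel execution would be handled by the engine; here it is a sequential loop.

```python
from collections import defaultdict

# Hypothetical edge list rows: (src, dst, cluster_id).
# The layout and the values are made up for illustration.
edges = [
    ("a", "b", 1),
    ("b", "c", 1),
    ("a", "c", 1),
    ("x", "y", 2),
]


def group_by_cluster(rows):
    """Collect the edges of each cluster into its own list."""
    clusters = defaultdict(list)
    for src, dst, cluster_id in rows:
        clusters[cluster_id].append((src, dst))
    return dict(clusters)


def summarize(cluster_edges):
    """Compute simple figures from the edges of a single cluster."""
    nodes = {node for edge in cluster_edges for node in edge}
    return {"nodes": len(nodes), "edges": len(cluster_edges)}


# Each cluster is processed independently of the others, which is what
# makes the work easy to spread across workers.
per_cluster = {
    cluster_id: summarize(cluster_edges)
    for cluster_id, cluster_edges in group_by_cluster(edges).items()
}
print(per_cluster)
```

Because no cluster needs information from any other cluster, `summarize` could be swapped for any per-cluster metric without changing the grouping step.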
TL;DR:
Graph database OLAP solutions are few and far between. If you have Spark data in a format that can be represented as a network/graph, then with this package:
- Graph-theoretic metrics can be obtained efficiently using an existing Spark infrastructure, without the need for a graph OLAP solution.
- The results can be used as is for finding the needle (interesting subgraphs) in the haystack (the whole set of subgraphs).
- Or one can augment the available graph-compatible data as part of a preprocessing step before the data-ingestion phase in an OLTP graph database (such as AWS Neptune).
- Another use is to provide support for feature engineering from the subgraphs/clusters for supervised and unsupervised ML downstream uses.
How to Install:
For dependencies and other important technical information needed to run these functions without issues, please consult INSTALL.md in this repo.
Functionality offered:
For a primer on the terminology used, please look at the TERMINOLOGY.md file in this repo.
Cluster metrics
Cluster metrics usually take as input a Spark edge list dataframe that also includes the component_id (cluster_id) of the cluster each edge belongs to. The output is a row of one or more metrics per cluster.
Cluster metrics currently offered:
- diameter (largest shortest distance between nodes in a cluster)
- transitivity (or Global Clustering Coefficient in the related literature)
- cluster triangle clustering coefficient (or Local Clustering Coefficient in the related literature)
- cluster square clustering coefficient (useful for bipartite networks)
- cluster node connectivity
- cluster edge connectivity
- cluster efficiency
- cluster modularity
- cluster avg edge betweenness
- cluster Weisfeiler-Lehman graph hash (in order to quickly test for graph isomorphisms)
Cluster metrics are really helpful for finding the needles (for example, clusters with possible linking errors) in the haystack (the whole set of clusters after the data linking process).
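To make two of the listed definitions concrete, here is a minimal plain-Python sketch of diameter and transitivity computed for each cluster. It only illustrates the definitions and is not the library's code; the function names and the two sample clusters are hypothetical, and each cluster is assumed to be connected and undirected.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def bfs_distances(adj, source):
    """Shortest hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def diameter(adj):
    """Largest shortest distance between nodes (connected cluster assumed)."""
    return max(max(bfs_distances(adj, node).values()) for node in adj)


def transitivity(adj):
    """Share of connected triples that are closed into a triangle."""
    closed = 0
    triples = 0
    for neighbours in adj.values():
        for a, b in combinations(neighbours, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


# Hypothetical clusters, keyed by a made-up cluster name.
clusters = {
    "chain": [("a", "b"), ("b", "c"), ("c", "d")],
    "triangle_with_tail": [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")],
}

metrics = {}
for name, cluster_edges in clusters.items():
    adj = build_adjacency(cluster_edges)
    metrics[name] = {"diameter": diameter(adj), "transitivity": transitivity(adj)}
print(metrics)
```

A long, thin cluster such as the chain has a large diameter and zero transitivity, which is the kind of shape that may be worth a second look after data linking.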
Node metrics
Node metrics take as input a Spark edge list dataframe that also includes the component_id (cluster_id) of the cluster each edge belongs to. The output is a row of one or more metrics per node.
Node metrics currently offered:
- Eigenvector Centrality
- Harmonic centrality
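As an illustration of a node-level metric, the sketch below computes harmonic centrality in plain Python: for each node, the sum of the reciprocals of its shortest-path distances to every other reachable node. It is a sketch of the definition without normalization, not the library's implementation, and the function name and sample edges are hypothetical.

```python
from collections import defaultdict, deque


def harmonic_centrality(edges):
    """Sum of 1/distance from each node to every other reachable node."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    scores = {}
    for source in list(adj):
        # Breadth-first search gives hop distances in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        scores[source] = sum(1.0 / d for n, d in dist.items() if n != source)
    return scores


# Hypothetical three-node path: a - b - c
scores = harmonic_centrality([("a", "b"), ("b", "c")])
print(scores)
```

The middle node of the path scores highest because it is one hop away from both of the other nodes.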
Edge metrics
Edge metrics take as input a Spark edge list dataframe that also includes the component_id (cluster_id) of the cluster each edge belongs to. The output is a row of one or more metrics per edge.
Edge metrics currently offered:
- Edge Betweenness
- Bridge Edges
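A bridge is an edge whose removal disconnects its cluster. The plain-Python sketch below finds bridges the naive way, by ignoring each edge in turn and checking whether its endpoints can still reach each other. It illustrates the definition only: the function names and sample edges are hypothetical, and faster linear-time algorithms exist.

```python
from collections import defaultdict, deque


def is_reachable(adj, start, goal, skipped_edge):
    """Breadth-first search from start to goal, ignoring one undirected edge."""
    skip = frozenset(skipped_edge)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in adj[node]:
            if neighbour in seen or frozenset((node, neighbour)) == skip:
                continue
            seen.add(neighbour)
            queue.append(neighbour)
    return False


def bridge_edges(edges):
    """Edges whose endpoints are disconnected once the edge is ignored."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return [(u, v) for u, v in edges if not is_reachable(adj, u, v, (u, v))]


# Hypothetical cluster: a triangle (a, b, c) with a tail edge c - d.
bridges = bridge_edges([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(bridges)
```

In a data linking setting, a bridge may point to a single link that is holding two otherwise separate groups of records together.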
Functionality coming soon
- cluster modularity based on partitions created by edge-betweenness
- cluster modularity based on partitions created by spectral cut
- cluster modularity based on partitions created by label propagation
- later down the line: shallow embeddings of subgraphs/clusters
For upcoming functionality further down the line, please consult the TODO.md file.
Contributing
Feel free to contribute by:
- Starting an issue.
- Forking the repository to suggest a change, and/or
- Asking for a new metric: open an issue and ask. It can probably be implemented.