A small set of graph functions to be used from PySpark on top of networkx and graphframes.
splink_graph
splink_graph is a small graph utility library for the Apache Spark environment. It works with graph data structures such as the ones created from the outputs of data linking processes (candidate pair results).
The main aim of splink_graph is to offer a small set of functions that work on top of established graph packages like graphframes and networkx, and that help with graph analysis of the output of probabilistic data linkage tools.
Functionality offered
For a primer on the terminology used, see the TERMINOLOGY.md file in this repo.
Cluster metrics
Cluster metrics usually take as input a Spark edgelist dataframe that also includes the component_id (cluster_id) each edge belongs to. The output is one row per cluster, holding one or more metrics.
Cluster metrics currently offered:
- diameter
- transitivity
- cluster triangle clustering coefficient
- cluster square clustering coefficient
- cluster node connectivity
- edge connectivity
- cluster efficiency
- cluster modularity
- cluster avg edge betweenness
- cluster Weisfeiler-Lehman graph hash
Cluster metrics are especially helpful for finding the needle (for example, clusters with possible linking errors) in the haystack (the whole set of clusters produced by the data linking process).
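As an illustration of what "one row of metrics per cluster" can look like, here is a minimal sketch that uses networkx directly on a plain Python edgelist. It does not use the splink_graph API: the `cluster_metrics` function, the sample `edges`, and the output key names are all hypothetical, and only a subset of the metrics listed above is computed.

```python
from collections import defaultdict

import networkx as nx

# Hypothetical edgelist rows: (src, dst, component_id).
edges = [
    ("a", "b", 1), ("b", "c", 1), ("a", "c", 1),  # cluster 1: a triangle
    ("d", "e", 2), ("e", "f", 2), ("f", "g", 2),  # cluster 2: a 4-node path
]


def cluster_metrics(edge_rows):
    """Return {component_id: metrics} computed per cluster with networkx.

    Illustrative helper only, not part of splink_graph.
    """
    by_cluster = defaultdict(list)
    for src, dst, cid in edge_rows:
        by_cluster[cid].append((src, dst))

    out = {}
    for cid, pairs in by_cluster.items():
        g = nx.Graph(pairs)  # each cluster is assumed to be connected
        out[cid] = {
            "diameter": nx.diameter(g),
            "transitivity": nx.transitivity(g),
            "node_connectivity": nx.node_connectivity(g),
            "edge_connectivity": nx.edge_connectivity(g),
            "efficiency": nx.global_efficiency(g),
            "graphhash": nx.weisfeiler_lehman_graph_hash(g),
        }
    return out


metrics = cluster_metrics(edges)
```

In this toy data the triangle is as tightly connected as a 3-node cluster can be, while the path has a larger diameter and no triangles at all. Long, sparse shapes like the path are the kind of cluster one might want to inspect for linking errors.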
Node metrics
Node metrics take as input a Spark edgelist dataframe that also includes the component_id (cluster_id) each edge belongs to. The output is one row per node, holding one or more metrics.
Node metrics currently offered:
- Eigenvector Centrality
- Harmonic centrality
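The two node metrics above can be sketched with networkx on a plain Python edgelist. This is an assumption-laden illustration rather than the splink_graph API: `node_metrics`, the sample `edges`, and the column names in the output rows are hypothetical.

```python
from collections import defaultdict

import networkx as nx

# Hypothetical edgelist rows: (src, dst, component_id).
# Node "a" is linked to every other node in the cluster.
edges = [
    ("a", "b", 1), ("a", "c", 1), ("a", "d", 1), ("b", "c", 1),
]


def node_metrics(edge_rows):
    """Return one dict per node with two centrality scores.

    Illustrative helper only, not part of splink_graph.
    """
    by_cluster = defaultdict(list)
    for src, dst, cid in edge_rows:
        by_cluster[cid].append((src, dst))

    rows = []
    for cid, pairs in by_cluster.items():
        g = nx.Graph(pairs)
        eigen = nx.eigenvector_centrality(g, max_iter=1000)
        harmonic = nx.harmonic_centrality(g)
        for node in g.nodes:
            rows.append({
                "node_id": node,
                "component_id": cid,
                "eigen_centrality": eigen[node],
                "harmonic_centrality": harmonic[node],
            })
    return rows


node_rows = node_metrics(edges)
by_node = {row["node_id"]: row for row in node_rows}
```

Here node "a" gets the highest score on both measures, since it is adjacent to every other node in its cluster.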
Edge metrics
Edge metrics take as input a Spark edgelist dataframe that also includes the component_id (cluster_id) each edge belongs to. The output is one row per edge, holding one or more metrics.
Edge metrics currently offered:
- Edge Betweenness
- Bridge Edges
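A minimal sketch of the two edge metrics, again using networkx directly rather than the splink_graph API. The `edge_metrics` helper, the sample `edges`, and the output column names are hypothetical. The sample cluster is two triangles joined by a single link, a shape that can arise when one spurious match merges two otherwise separate groups.

```python
from collections import defaultdict

import networkx as nx

# Hypothetical edgelist rows: (src, dst, component_id).
# Two triangles (a-b-c and d-e-f) joined by the single edge c-d.
edges = [
    ("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
    ("c", "d", 1),
    ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
]


def edge_metrics(edge_rows):
    """Return one dict per edge with betweenness and a bridge flag.

    Illustrative helper only, not part of splink_graph.
    """
    by_cluster = defaultdict(list)
    for src, dst, cid in edge_rows:
        by_cluster[cid].append((src, dst))

    rows = []
    for cid, pairs in by_cluster.items():
        g = nx.Graph(pairs)
        betweenness = nx.edge_betweenness_centrality(g)
        # frozenset makes the lookup independent of edge orientation
        bridges = {frozenset(edge) for edge in nx.bridges(g)}
        for (u, v), score in betweenness.items():
            rows.append({
                "src": u,
                "dst": v,
                "component_id": cid,
                "edge_betweenness": score,
                "is_bridge": frozenset((u, v)) in bridges,
            })
    return rows


edge_rows = edge_metrics(edges)
bridge_rows = [row for row in edge_rows if row["is_bridge"]]
```

The edge between "c" and "d" is the only bridge and also has the highest betweenness, because every shortest path between the two triangles runs through it.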
Contributing
Feel free to contribute by
- Forking the repository to suggest a change, and/or
- Starting an issue.