A small set of graph functions to be used from PySpark on top of networkx and graphframes.
splink_graph
splink_graph is a small graph utility library for the Apache Spark environment. It works with graph data structures based on the graphframes package, such as the ones created from the outputs of data linking processes (candidate pair results).
The main aim of splink_graph is to offer a small set of functions that work on top of established graph packages like graphframes and networkx and that can help with the process of data linkage.
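As a rough illustration of how candidate pair results map onto a graph, the sketch below turns a handful of scored pairs into a vertex list and a weighted edge list. The record ids, the scores, the threshold, and the helper name `pairs_to_graph` are all hypothetical and are not part of splink_graph.

```python
# Hypothetical candidate-pair records: (left id, right id, match score).
candidate_pairs = [
    ("r1", "r2", 0.97),
    ("r2", "r3", 0.91),
    ("r4", "r5", 0.88),
    ("r3", "r4", 0.12),
]


def pairs_to_graph(pairs, threshold):
    """Return (vertices, edges) for the pairs scoring at or above threshold.

    Every record id becomes a vertex, so a record whose links are all
    dropped stays in the graph as an isolated vertex.
    """
    vertices = sorted({v for src, dst, _ in pairs for v in (src, dst)})
    edges = [(src, dst, score) for src, dst, score in pairs if score >= threshold]
    return vertices, edges


vertices, edges = pairs_to_graph(candidate_pairs, threshold=0.8)
```

With graphframes, the same information would typically live in two Spark DataFrames: a vertex DataFrame with an `id` column and an edge DataFrame with `src` and `dst` columns, plus any extra columns such as a weight.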
Using Pandas UDFs in Python: prerequisites
This package uses Pandas UDFs for certain functionality. Pandas UDFs are built on top of Apache Arrow and bring the best of both worlds: the ability to define low-overhead, high-performance UDFs entirely in Python.
With Apache Arrow, it is possible to exchange data directly between the JVM and Python driver/executors with near-zero (de)serialization cost. However, there are some things to be aware of if you want to use these functions. Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be set for compatibility with previous versions of Arrow (<= 0.14.1). This is only necessary for PySpark users on versions 2.3.x and 2.4.x who have manually upgraded PyArrow to 0.15.0 or later. The following can be added to conf/spark-env.sh to use the legacy Arrow IPC format:
ARROW_PRE_0_15_IPC_FORMAT=1
Another way is to add the following to the Spark session configuration:
.config("spark.sql.execution.arrow.pyspark.enabled", "true")
.config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
This will instruct PyArrow >= 0.15.0 to use the legacy IPC format with the older Arrow Java that is in Spark 2.3.x and 2.4.x. Not setting this environment variable will lead to an error similar to the one described in SPARK-29367 when running pandas_udfs or toPandas() with Arrow enabled.
In short: either PyArrow must be at version 0.14.1 or earlier, or, if that is not possible, the settings above must be active.
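The version rule above can be sketched as a small check that decides whether the legacy IPC setting is needed. The helper names `version_tuple`, `needs_legacy_arrow_ipc`, and `arrow_conf` are hypothetical and not part of splink_graph, PySpark, or PyArrow; only the two configuration keys come from the text above.

```python
import re


def version_tuple(version):
    """Parse '0.15.1' into (0, 15, 1); missing or non-numeric parts become 0."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def needs_legacy_arrow_ipc(spark_version, pyarrow_version):
    """True for Spark 2.3.x / 2.4.x combined with PyArrow >= 0.15.0."""
    on_old_spark = version_tuple(spark_version)[:2] in {(2, 3), (2, 4)}
    on_new_arrow = version_tuple(pyarrow_version) >= (0, 15, 0)
    return on_old_spark and on_new_arrow


def arrow_conf(spark_version, pyarrow_version):
    """Build the Spark settings described above as a plain dict."""
    conf = {"spark.sql.execution.arrow.pyspark.enabled": "true"}
    if needs_legacy_arrow_ipc(spark_version, pyarrow_version):
        conf["spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT"] = "1"
    return conf
```

In practice the version strings would come from `pyspark.__version__` and `pyarrow.__version__`, and each key/value pair in the returned dict could be passed to `SparkSession.builder.config(key, value)` before the session is created.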
Terminology
Like any discipline, graphs come with their own set of nomenclature. The following descriptions are intentionally simplified—more mathematically rigorous definitions can be found in any graph theory textbook.
Graph
— A data structure G = (V, E) where V is a set of vertices (nodes) and E is a set of edges.
Vertex/Node
— Represents a single entity such as a person or an object.
Edge
— Represents a relationship between two vertices (e.g., are these two vertices friends on a social network?).
Directed Graph vs. Undirected Graph
— Denotes whether the relationship represented by the edges is symmetric (undirected) or not (directed).
Weighted vs Unweighted Graph
— In weighted graphs, edges have a weight that could represent a cost of traversal, a similarity score, or a distance score.
— In unweighted graphs, edges have no weight and simply show connections. Example: course prerequisites.
Subgraph
— A set of vertices and edges that are a subset of the full graph's vertices and edges.
Degree
— A vertex/node measurement quantifying the number of edges connected to it.
Connected Component
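— A maximal connected subgraph, meaning that every vertex can reach every other vertex in the subgraph and no vertex outside it is reachable from inside. (In a directed graph, "strongly connected" additionally requires that the paths follow edge directions.)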
Shortest Path
— The lowest number of edges required to traverse between two specific vertices/nodes.
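To make the terms above concrete, here is a minimal pure-Python sketch of an undirected, unweighted graph with degree, connected components, and shortest path length computed by breadth-first search. The vertex names and function names are illustrative assumptions; this is not how splink_graph, networkx, or graphframes implement these measures.

```python
from collections import deque

# Undirected, unweighted graph G = (V, E) made of two separate groups.
vertices = {"a", "b", "c", "d", "e"}
edges = {("a", "b"), ("b", "c"), ("d", "e")}


def build_adjacency(vertices, edges):
    """Map each vertex to the set of its neighbours."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def degree(adjacency, vertex):
    """Number of edges connected to the vertex."""
    return len(adjacency[vertex])


def connected_components(adjacency):
    """Return a list of vertex sets, one per connected component."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        component, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(component)
    return components


def shortest_path_length(adjacency, source, target):
    """Fewest edges between source and target, or None if unreachable."""
    distances, queue = {source: 0}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distances[node]
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return None


adjacency = build_adjacency(vertices, edges)
```

In this example, "b" has degree 2, the graph has two connected components, and "a" and "e" have no path between them because they sit in different components.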
Contributing
Feel free to contribute by
- Forking the repository to suggest a change, and/or
- Starting an issue.