connectby hierarchy query in spark
Project description
pyspark-connectby
Spark currently does not support hierarchy query connectBy as of version 3.5.0. And there is a PR opened to support recursive CTE query. But that is still not available yet.
This is an attempt to add connectBy method to DataFrame
Concept
Hierarchy query is one of the important feature that many relational databases, such as Oracle, DB2, My SQL, Snowflake, Redshift, etc., would support directly or alternatively by using recursive CTE.
Example in Redshift:
select id, name, manager_id, level
from employee
start with emp_id = 1
connect by prior emp_id = manager_id;
With this library, you can use connectBy() on Dateframe:
from pyspark_connectby import connectBy
from pyspark.sql import SparkSession
schema = 'emp_id string, manager_id string, name string'
data = [[1, None, 'Carlos'],
[11, 1, 'John'],
[111, 11, 'Jorge'],
[112, 11, 'Kwaku'],
[113, 11, 'Liu'],
[2, None, 'Mat']
]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data, schema)
df2 = df.connectBy(prior='emp_id', to='manager_id', start_with='1')
df2.show()
With result:
+------+----------+-----+-----------------+----------+------+
|emp_id|START_WITH|LEVEL|CONNECT_BY_ISLEAF|manager_id| name|
+------+----------+-----+-----------------+----------+------+
| 1| 1| 1| false| null|Carlos|
| 11| 1| 2| false| 1| John|
| 111| 1| 3| true| 11| Jorge|
| 112| 1| 3| true| 11| Kwaku|
| 113| 1| 3| true| 11| Liu|
+------+----------+-----+-----------------+----------+------+
Note the pseudo columns in the query result:
- START_WITH
- LEVEL
- CONNECT_BY_ISLEAF
Installation
Python
Version >= 3.9, <3.14
$ pip install --upgrade pyspark-connectby
Usage
from pyspark_connectby import connectBy
df = ...
df.connectBy(prior='emp_id', to='manager_id', start_with='1') # start_with `emp_id` as 1
df.transform(connectBy, prior='emp_id', to='manager_id', start_with='1') # or by using df.transform() method
df.connectBy(prior='emp_id', to='manager_id') # without start_with, it will go through each node
df.connectBy(prior='emp_id', to='manager_id', start_with=['1', '2']) # start_with a list of top nodes ids.
Developer
Setup
java
java 17 or later
brew install openjdk@17
sudo ln -sfn /opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-17.jdk
export JAVA_HOME=$(/usr/libexec/java_home -v 17) # e.g in ~/.zshrc
poetry
pipx install poetry
poetry env list
poetry env use 3.13 # e.g to create env for python 3.13
tox
pipx install tox
pipx install uv
uv python install 3.9 3.10 3.11 3.12 3.13 # install multiple versions for python
Test
pytest
poetry run pytest
tox
Publish
poetry version patch
poetry version minor
poetry publish --build
tox -e release
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pyspark_connectby-1.3.1.tar.gz.
File metadata
- Download URL: pyspark_connectby-1.3.1.tar.gz
- Upload date:
- Size: 4.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.2 CPython/3.13.5 Darwin/24.6.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a69bb0139638f4ac17f0b59e602fd3797dd3fb8ead9acae7f0df3a56a4d2b1b4
|
|
| MD5 |
84ad799bb1b74181f4669230eecf4a4e
|
|
| BLAKE2b-256 |
4feb91ec9917f33b2818ea8cefee508ba3deaab0d452f3b6fa0790e4107a7a87
|
File details
Details for the file pyspark_connectby-1.3.1-py3-none-any.whl.
File metadata
- Download URL: pyspark_connectby-1.3.1-py3-none-any.whl
- Upload date:
- Size: 5.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.2 CPython/3.13.5 Darwin/24.6.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a8e0f4f5663745c15aac30e663a0ef149ef7e1a4412c6d3c14555c7e88411557
|
|
| MD5 |
43067fff97380957246be7bf948a61fa
|
|
| BLAKE2b-256 |
a1c5b3726a309b758aa0b1f3b9de83596dcd743d24c755049bb86962a01c29b3
|