
connectby hierarchy query in spark


pyspark-connectby

Spark does not support the hierarchy query CONNECT BY as of version 3.5.0. There is an open PR to support recursive CTE queries, but it is not available yet.

This library is an attempt to add a connectBy method to DataFrame.

Concept

Hierarchy queries are an important feature that many relational databases, such as Oracle, DB2, MySQL, Snowflake, and Redshift, support either directly or via recursive CTEs.

Example in Redshift:

select emp_id, name, manager_id, level
from employee
start with emp_id = 1
connect by prior emp_id = manager_id;
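For databases without CONNECT BY, the same traversal can be expressed as a recursive CTE. A minimal sketch, run here against an in-memory SQLite database (the table and data mirror the example above and are illustrative only, not part of this library):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER, manager_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, None, "Carlos"), (11, 1, "John"), (111, 11, "Jorge"),
     (112, 11, "Kwaku"), (113, 11, "Liu"), (2, None, "Mat")],
)

# Recursive CTE equivalent of:
#   START WITH emp_id = 1 CONNECT BY PRIOR emp_id = manager_id
rows = conn.execute("""
    WITH RECURSIVE hierarchy(emp_id, manager_id, name, level) AS (
        SELECT emp_id, manager_id, name, 1 FROM employee WHERE emp_id = 1
        UNION ALL
        SELECT e.emp_id, e.manager_id, e.name, h.level + 1
        FROM employee e JOIN hierarchy h ON e.manager_id = h.emp_id
    )
    SELECT emp_id, name, manager_id, level FROM hierarchy ORDER BY level, emp_id
""").fetchall()

for row in rows:
    print(row)  # (1, 'Carlos', None, 1), (11, 'John', 1, 2), ...
```

The anchor member selects the start node at level 1; the recursive member joins children to already-found parents, incrementing the level each round.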

With this library, you can use connectBy() on a DataFrame:

from pyspark_connectby import connectBy
from pyspark.sql import SparkSession

schema = 'emp_id string, manager_id string, name string'
data = [['1', None, 'Carlos'],
        ['11', '1', 'John'],
        ['111', '11', 'Jorge'],
        ['112', '11', 'Kwaku'],
        ['113', '11', 'Liu'],
        ['2', None, 'Mat']
        ]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data, schema)
df2 = df.connectBy(prior='emp_id', to='manager_id', start_with='1')
df2.show()

With result:

+------+----------+-----+-----------------+----------+------+
|emp_id|START_WITH|LEVEL|CONNECT_BY_ISLEAF|manager_id|  name|
+------+----------+-----+-----------------+----------+------+
|     1|         1|    1|            false|      null|Carlos|
|    11|         1|    2|            false|         1|  John|
|   111|         1|    3|             true|        11| Jorge|
|   112|         1|    3|             true|        11| Kwaku|
|   113|         1|    3|             true|        11|   Liu|
+------+----------+-----+-----------------+----------+------+

Note the pseudo columns in the query result:

  • START_WITH
  • LEVEL
  • CONNECT_BY_ISLEAF
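To illustrate the semantics of these pseudo columns (this is not the library's internal implementation): LEVEL is the depth below the start node, START_WITH echoes the root the row was reached from, and CONNECT_BY_ISLEAF marks rows with no children. A minimal pure-Python sketch over the sample data:

```python
# Sample rows: (emp_id, manager_id, name), mirroring the example above.
rows = [("1", None, "Carlos"), ("11", "1", "John"), ("111", "11", "Jorge"),
        ("112", "11", "Kwaku"), ("113", "11", "Liu"), ("2", None, "Mat")]

# Index children by their manager for O(1) lookup during the walk.
children = {}
for emp_id, manager_id, _ in rows:
    children.setdefault(manager_id, []).append(emp_id)

def connect_by(start_with):
    """Breadth-first walk from start_with, yielding the pseudo columns."""
    names = {emp_id: name for emp_id, _, name in rows}
    queue = [(start_with, 1)]
    result = []
    while queue:
        emp_id, level = queue.pop(0)
        kids = children.get(emp_id, [])
        result.append({"emp_id": emp_id, "START_WITH": start_with,
                       "LEVEL": level, "CONNECT_BY_ISLEAF": not kids,
                       "name": names[emp_id]})
        queue.extend((kid, level + 1) for kid in kids)
    return result

for row in connect_by("1"):
    print(row)
```

Walking from emp_id 1 yields the same five rows as the table above: levels 1, 2, 3, 3, 3, with only the three bottom rows flagged as leaves. Mat (emp_id 2) is never reached because it is not a descendant of the start node.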

Installation

Python

Version >= 3.9, <3.14

$ pip install --upgrade pyspark-connectby

Usage

from pyspark_connectby import connectBy

df = ...

df.connectBy(prior='emp_id', to='manager_id', start_with='1')  # start with `emp_id` 1

df.transform(connectBy, prior='emp_id', to='manager_id', start_with='1')  # or via the df.transform() method

df.connectBy(prior='emp_id', to='manager_id')  # without start_with, it traverses from every node

df.connectBy(prior='emp_id', to='manager_id', start_with=['1', '2'])  # start with a list of top-node ids

Developer

Setup

java

Java 17 or later

brew install openjdk@17
sudo ln -sfn /opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-17.jdk
export JAVA_HOME=$(/usr/libexec/java_home -v 17)   # e.g in ~/.zshrc

poetry

pipx install poetry
poetry env list
poetry env use 3.13  # e.g. create an env for Python 3.13

tox

pipx install tox
pipx install uv
uv python install 3.9 3.10 3.11 3.12 3.13   # install multiple Python versions

Test

pytest 
poetry run pytest
tox

Publish

poetry version patch
poetry version minor 
poetry publish --build
tox -e release
