
connectBy hierarchy query in Spark

Project description

pyspark-connectby

Spark does not support the hierarchical query connectBy as of version 3.5.0. There is an open PR to add recursive CTE support, but it is not available yet.

This library is an attempt to add a connectBy method to DataFrame.

Concept

Hierarchical queries are an important feature that many relational databases, such as Oracle, DB2, MySQL, Snowflake, and Redshift, support either directly or through recursive CTEs.

Example in Redshift:

select emp_id, name, manager_id, level
from employee
start with emp_id = 1
connect by prior emp_id = manager_id;

With this library, you can use connectBy() on a DataFrame:

from pyspark_connectby import connectBy
from pyspark.sql import SparkSession

schema = 'emp_id string, manager_id string, name string'
data = [['1', None, 'Carlos'],
        ['11', '1', 'John'],
        ['111', '11', 'Jorge'],
        ['112', '11', 'Kwaku'],
        ['113', '11', 'Liu'],
        ['2', None, 'Mat']
        ]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data, schema)
df2 = df.connectBy(prior='emp_id', to='manager_id', start_with='1')
df2.show()

The result:

+------+----------+-----+-----------------+----------+------+
|emp_id|START_WITH|LEVEL|CONNECT_BY_ISLEAF|manager_id|  name|
+------+----------+-----+-----------------+----------+------+
|     1|         1|    1|            false|      null|Carlos|
|    11|         1|    2|            false|         1|  John|
|   111|         1|    3|             true|        11| Jorge|
|   112|         1|    3|             true|        11| Kwaku|
|   113|         1|    3|             true|        11|   Liu|
+------+----------+-----+-----------------+----------+------+

Note the pseudo columns in the query result:

  • START_WITH
  • LEVEL
  • CONNECT_BY_ISLEAF
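To make the pseudo columns concrete, here is a small pure-Python sketch of connect-by semantics over the same sample data. This is only an illustration of what START_WITH, LEVEL, and CONNECT_BY_ISLEAF mean, not the library's actual implementation (which operates on Spark DataFrames):

```python
# Pure-Python sketch of connect-by semantics; NOT the library's implementation.
rows = [
    ('1', None, 'Carlos'),
    ('11', '1', 'John'),
    ('111', '11', 'Jorge'),
    ('112', '11', 'Kwaku'),
    ('113', '11', 'Liu'),
    ('2', None, 'Mat'),
]

def connect_by(rows, start_with):
    """Expand the hierarchy below `start_with`, emitting the pseudo columns."""
    children = {}
    names = {}
    for emp_id, manager_id, name in rows:
        children.setdefault(manager_id, []).append(emp_id)
        names[emp_id] = name
    out = []

    def walk(emp_id, level):
        kids = children.get(emp_id, [])
        out.append({'emp_id': emp_id,
                    'START_WITH': start_with,      # the root this row was reached from
                    'LEVEL': level,                # depth in the tree, root = 1
                    'CONNECT_BY_ISLEAF': not kids, # true when the node has no children
                    'name': names[emp_id]})
        for kid in kids:
            walk(kid, level + 1)

    walk(start_with, 1)
    return out

result = connect_by(rows, '1')
for r in result:
    print(r)
```

Walking from emp_id '1' yields the same five rows as the table above: Carlos at LEVEL 1, John at LEVEL 2, and the three leaves (Jorge, Kwaku, Liu) at LEVEL 3 with CONNECT_BY_ISLEAF true.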

Installation

Python Version >= 3.7

$ pip install --upgrade pyspark-connectby

Usage

from pyspark_connectby import connectBy

df = ...

df.connectBy(prior='emp_id', to='manager_id', start_with='1')  # start with `emp_id` 1

df.transform(connectBy, prior='emp_id', to='manager_id', start_with='1')  # or via the df.transform() method

df.connectBy(prior='emp_id', to='manager_id')  # without start_with, it traverses from each node

df.connectBy(prior='emp_id', to='manager_id', start_with=['1', '2'])  # start with a list of top node ids
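The start_with variants can also be sketched in plain Python. Again this is an illustration of the documented behavior, not the library's code; in particular, the handling of an omitted start_with (treating every node as a start) is an assumption based on the comment above:

```python
# Pure-Python sketch of the start_with variants; NOT the library's implementation.
rows = [
    ('1', None, 'Carlos'),
    ('11', '1', 'John'),
    ('111', '11', 'Jorge'),
    ('112', '11', 'Kwaku'),
    ('113', '11', 'Liu'),
    ('2', None, 'Mat'),
]

def expand(rows, start_with=None):
    """Return (emp_id, START_WITH, LEVEL, CONNECT_BY_ISLEAF) tuples."""
    children = {}
    for emp_id, manager_id, _name in rows:
        children.setdefault(manager_id, []).append(emp_id)
    if isinstance(start_with, str):
        starts = [start_with]              # single start node
    elif start_with is not None:
        starts = list(start_with)          # a list of top node ids
    else:
        # Assumed semantics of omitting start_with: traverse from each node.
        starts = [emp_id for emp_id, _m, _n in rows]
    out = []

    def walk(node, root, level):
        kids = children.get(node, [])
        out.append((node, root, level, not kids))
        for kid in kids:
            walk(kid, root, level + 1)

    for s in starts:
        walk(s, s, 1)
    return out

expand(rows, start_with=['1', '2'])  # rows below both top nodes, tagged by their root
```

With start_with=['1', '2'], the subtree under '1' contributes five rows and the childless node '2' contributes one, each row carrying its own START_WITH root.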

Download files


Source Distribution

pyspark_connectby-1.1.3.tar.gz (3.2 kB)

Built Distribution

pyspark_connectby-1.1.3-py3-none-any.whl (4.0 kB)
