
connectBy hierarchy query in Spark

Project description

pyspark-connectby

Spark does not support the hierarchical query connectBy as of version 3.5.0. There is an open PR to add recursive CTE support, but it is not yet available.

This library is an attempt to add a connectBy method to DataFrame.

Concept

Hierarchical queries are an important feature that many relational databases, such as Oracle, DB2, MySQL, Snowflake, and Redshift, support either directly or through recursive CTEs.

Example in Redshift:

select emp_id, name, manager_id, level
from employee
start with emp_id = 1
connect by prior emp_id = manager_id;
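The `connect by prior emp_id = manager_id` clause walks the tree downward from the start row to its descendants. As a rough illustration in plain Python (a sketch of the semantics only, not the library's implementation), the traversal is a breadth-first expansion:

```python
# Sketch of CONNECT BY semantics in plain Python (illustration only,
# not the pyspark-connectby implementation).
employees = [
    # (emp_id, manager_id, name)
    ("1", None, "Carlos"),
    ("11", "1", "John"),
    ("111", "11", "Jorge"),
    ("112", "11", "Kwaku"),
    ("113", "11", "Liu"),
    ("2", None, "Mat"),
]

def connect_by(rows, start_with):
    """Return (emp_id, name, level) for the start row and all its descendants."""
    result = []
    frontier = [(e, n, 1) for e, m, n in rows if e == start_with]
    while frontier:
        result.extend(frontier)
        parent_ids = {e for e, _, _ in frontier}
        # 'connect by prior emp_id = manager_id': rows whose manager_id
        # matches a parent's emp_id join the result at level + 1
        level = frontier[0][2] + 1
        frontier = [(e, n, level) for e, m, n in rows if m in parent_ids]
    return result

for emp_id, name, level in connect_by(employees, "1"):
    print(emp_id, name, level)
```

Note that Mat's tree (emp_id 2) is never visited, since it is not reachable from the start row.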

With this library, you can use connectBy() on a DataFrame:

from pyspark_connectby import connectBy
from pyspark.sql import SparkSession

schema = 'emp_id string, manager_id string, name string'
data = [['1', None, 'Carlos'],
        ['11', '1', 'John'],
        ['111', '11', 'Jorge'],
        ['112', '11', 'Kwaku'],
        ['113', '11', 'Liu'],
        ['2', None, 'Mat']
        ]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data, schema)
df2 = df.connectBy(prior='emp_id', to='manager_id', start_with='1')
df2.show()

The result:

+------+----------+-----+-----------------+----------+------+
|emp_id|START_WITH|LEVEL|CONNECT_BY_ISLEAF|manager_id|  name|
+------+----------+-----+-----------------+----------+------+
|     1|         1|    1|            false|      null|Carlos|
|    11|         1|    2|            false|         1|  John|
|   111|         1|    3|             true|        11| Jorge|
|   112|         1|    3|             true|        11| Kwaku|
|   113|         1|    3|             true|        11|   Liu|
+------+----------+-----+-----------------+----------+------+

Note the pseudo columns in the query result:

  • START_WITH — the id of the root row the traversal started from
  • LEVEL — the depth of the row in the hierarchy, starting at 1 for the root
  • CONNECT_BY_ISLEAF — true when the row has no children
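For intuition about the leaf flag, here is a plain-Python sketch (not the library's code): a row is a leaf when no other row lists it as its manager.

```python
# Derive a leaf flag from the sample data: a node is a leaf when
# no other row's manager_id points to it (illustration only).
rows = [
    ("1", None), ("11", "1"), ("111", "11"),
    ("112", "11"), ("113", "11"), ("2", None),
]
parent_ids = {mgr for _, mgr in rows if mgr is not None}
is_leaf = {emp_id: emp_id not in parent_ids for emp_id, _ in rows}
# Jorge, Kwaku, and Liu (111, 112, 113) are leaves; Carlos and John are not.
```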

Installation

Requires Python >= 3.7.

$ pip install --upgrade pyspark-connectby

Usage

from pyspark_connectby import connectBy

df = ...

df.connectBy(prior='emp_id', to='manager_id', start_with='1')  # start with `emp_id` 1

df.transform(connectBy, prior='emp_id', to='manager_id', start_with='1')  # or via the df.transform() method

df.connectBy(prior='emp_id', to='manager_id')  # without start_with, the traversal runs from each node

df.connectBy(prior='emp_id', to='manager_id', start_with=['1', '2'])  # start with a list of top node ids
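Assuming that passing a list to `start_with` simply runs the traversal from each listed root (a hedged plain-Python sketch of that behavior, not the library's code), the multi-root output, tagged with a START_WITH column, would look like:

```python
# Multi-root traversal sketch: repeat a breadth-first walk for each
# root and tag each row with the root id (illustration only).
employees = [
    ("1", None, "Carlos"), ("11", "1", "John"), ("111", "11", "Jorge"),
    ("112", "11", "Kwaku"), ("113", "11", "Liu"), ("2", None, "Mat"),
]

def connect_by_multi(rows, start_with):
    """Return (emp_id, name, start_with, level) rows for every root."""
    out = []
    for root in start_with:
        frontier = [(e, n, root, 1) for e, m, n in rows if e == root]
        while frontier:
            out.extend(frontier)
            parents = {e for e, _, _, _ in frontier}
            level = frontier[0][3] + 1
            frontier = [(e, n, root, level) for e, m, n in rows if m in parents]
    return out

result = connect_by_multi(employees, ["1", "2"])
```

With both roots listed, Mat's single-node tree now appears alongside Carlos's tree.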
