Utility functions to manipulate nested structures using pyspark
Project description
Nested fields transformation for pyspark
Motivation
Applying transformations to nested structures in Spark is tricky. Assume we have JSON data with a highly nested structure:
[
  {
    "data": {
      "city": {
        "addresses": [
          {
            "id": "my-id"
          },
          {
            "id": "my-id2"
          }
        ]
      }
    }
  }
]
To hash the nested "id" field, you need to write the following Spark code:
import pyspark.sql.functions as F

hashed = df.withColumn(
    "data",
    F.col("data").withField(
        "city",
        F.col("data.city").withField(
            "addresses",
            F.transform(
                "data.city.addresses",
                lambda c: c.withField("id", F.sha2(c.getField("id"), 256)),
            ),
        ),
    ),
)
With this library, the code above can be simplified to:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
from nestedfunctions.functions.terminal_operations import apply_terminal_operation
hashed = apply_terminal_operation(df, "data.city.addresses.id", lambda c, t: F.sha2(c.cast(StringType()), 256))
Instead of dealing with nested transformation functions, you specify the terminal operation as a lambda and the field hierarchy in flat dotted format, and the library generates the Spark code for you.
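The recursion the library performs over the dotted path can be illustrated on plain Python dicts and lists. This is a hypothetical sketch, not the library's implementation: structs become dicts, arrays become lists, and array levels are mapped element-wise just like `F.transform` does above.

```python
import hashlib


def apply_terminal(record, path, fn):
    """Apply fn to the value at the dotted path, descending through
    dicts and mapping over lists on the way down. A plain-Python
    analogue of the nested withField/transform chain; NOT the
    library's actual code."""
    head, _, rest = path.partition(".")
    if isinstance(record, list):
        # Arrays are traversed element-wise, like F.transform.
        return [apply_terminal(item, path, fn) for item in record]
    if rest:
        # Intermediate struct: rebuild it with the child replaced.
        return {**record, head: apply_terminal(record[head], rest, fn)}
    # Leaf reached: apply the terminal operation.
    return {**record, head: fn(record[head])}


data = [{"data": {"city": {"addresses": [{"id": "my-id"}, {"id": "my-id2"}]}}}]
sha = lambda v: hashlib.sha256(v.encode()).hexdigest()
hashed_rows = [apply_terminal(row, "data.city.addresses.id", sha) for row in data]
```

Every level of the path except the last only rebuilds the enclosing structure; the lambda runs once per leaf, which is why a single flat path is enough to describe the whole transformation.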
Install
To install the current release
$ pip install pyspark-nested-functions
Available functions
Whitelist
Preserves all fields listed in the parameters; all other fields are dropped.
from nestedfunctions.functions.whitelist import whitelist
whitelisted_df = whitelist(df, ["addresses.postalCode", "creditCard"])
Drop
Recursively drops a field at any nesting level (including its children).
from nestedfunctions.functions.drop import drop
processed = drop(df, field="root_level.children1.children2")
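The behaviour can be pictured on plain Python dicts (a hypothetical sketch, not the library's code): walk the dotted path, descend through lists element-wise, and delete the leaf key wherever it occurs.

```python
def drop_field(record, path):
    """Remove the field at the dotted `path` from a nested structure
    in place, descending through lists element-wise. Illustrative
    only; the library operates on Spark columns, not dicts."""
    head, _, rest = path.partition(".")
    if isinstance(record, list):
        # Apply the same path to every array element.
        for item in record:
            drop_field(item, path)
    elif isinstance(record, dict) and head in record:
        if rest:
            drop_field(record[head], rest)
        else:
            # Leaf reached: remove the key (and everything under it).
            del record[head]
    return record


data = {"root_level": {"children1": {"children2": "secret", "sibling": "kept"}}}
drop_field(data, "root_level.children1.children2")
```

Note that dropping an intermediate field removes its entire subtree, which matches the "including child" behaviour described above.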
Flattener
Returns a flattened representation of the data frame's schema as a list of dotted field paths.
from nestedfunctions.spark_schema.utility import SparkSchemaUtility
flatten_schema = SparkSchemaUtility().flatten_schema(df.schema)
# flatten_schema = ["root-element",
# "root-element-array-primitive",
# "root-element-array-of-structs.d1.d2",
# "nested-structure.n1",
# "nested-structure.d1.d2"]
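The flattening itself is a straightforward depth-first walk. A minimal sketch on a dict standing in for a `StructType` (hypothetical helper, not the library's `SparkSchemaUtility`):

```python
def flatten_schema(schema, prefix=""):
    """Collect dotted paths for every leaf field. `schema` is a plain
    dict standing in for a StructType: nested dicts are structs,
    string values are leaf types. Illustrative sketch only."""
    paths = []
    for name, dtype in schema.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(dtype, dict):
            # Struct field: recurse and keep the accumulated prefix.
            paths.extend(flatten_schema(dtype, path))
        else:
            # Leaf field: record its full dotted path.
            paths.append(path)
    return paths


flat = flatten_schema({
    "root-element": "string",
    "nested-structure": {"n1": "int", "d1": {"d2": "string"}},
})
```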
Nullify
Sets a field to null at any nesting level.
from nestedfunctions.functions.nullify import nullify
processed = nullify(df, field="creditCard.id")
Download files
Source Distribution

Hashes for pyspark_nested_functions-0.0.6.tar.gz

Algorithm | Hash digest
---|---
SHA256 | 43c52980cc71854e0d3fb9e0b4a68a9ab242a22ba1715afc87e1aa641297df52
MD5 | 8c62fb20363c1f8e654b1d77931f6eae
BLAKE2b-256 | 3fcf8b788baca7e0b0a41168f7ec4246426d0e8ab477a47ba2848852f1f4206a

Built Distribution

Hashes for pyspark_nested_functions-0.0.6-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 7d3c99aacdd462e88aa017acedb2cbc3ca7a8acc9d7239d994d3616129c2d1ab
MD5 | c19eb8e060d1067c52a397f33683f963
BLAKE2b-256 | 8788b247c81ba6041170641c7d9324f058e8ed8da0e61de0e25ee88f011f30e0