
Utility functions to manipulate nested structures using pyspark



Nested fields transformation for pyspark

Motivation

Applying transformations to nested structures is tricky in Spark. Assume we have the following nested JSON data:

[
  {
    "data": {
      "city": {
        "addresses": [
          {
            "id": "my-id"
          },
          {
            "id": "my-id2"
          }
        ]
      }
    }
  }
]
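
For reference, here is one way to load this sample JSON into a DataFrame so the snippets below can be tried out (a minimal setup sketch, assuming a local SparkSession; it is not part of the library):

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Load the example records as JSON strings so Spark infers a nested struct schema.
records = [{"data": {"city": {"addresses": [{"id": "my-id"}, {"id": "my-id2"}]}}}]
df = spark.read.json(spark.sparkContext.parallelize([json.dumps(r) for r in records]))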

To hash the nested id field you need to write the following PySpark code:

import pyspark.sql.functions as F

hashed = df.withColumn("data",
                       (F.col("data")
                        .withField("city", F.col("data.city")
                                   .withField("addresses", F.transform("data.city.addresses",
                                                                       lambda c: c.withField("id",
                                                                                             F.sha2(c.getField("id"),
                                                                                                    256)))))))

With this library, the code above can be simplified to:

from nestedfunctions.functions.hash import hash_field
hashed = hash_field(df, "data.city.addresses.id", num_bits=256)

Install

To install the current release:

$ pip install pyspark-nested-functions

Available functions

Add nested field

Add a nested field called new_column_name based on a lambda function applied to the nested field column_to_process. The fields column_to_process and new_column_name need to have the same parent or be at the root!

from nestedfunctions.functions.add_nested_field import add_nested_field
from pyspark.sql.functions import when
processed = add_nested_field(
      df,
      column_to_process="payload.array.booleanField",
      new_column_name="payload.array.booleanFieldAsString",
      f=lambda column: when(column, "Y").when(~column, "N").otherwise(""),
  )

Date Format

Format a nested date field from current_date_format to target_date_format.

from nestedfunctions.functions.date_format import format_date
date_formatted_df = format_date(
      df,
      field="customDimensions.value",
      current_date_format="y-d-M",
      target_date_format="y-MM"
  )

Drop

Recursively drop multiple fields at any nested level.

from nestedfunctions.functions.drop import drop

dropped_df = drop(
      df,
      fields_to_drop=[
        "root_column.child1.grand_child2",
        "root_column.child2",
        "other_root_column",
        ]
  )

Duplicate

Duplicate the nested field column_to_duplicate as duplicated_column_name. The fields column_to_duplicate and duplicated_column_name need to have the same parent or be at the root!

from nestedfunctions.functions.duplicate import duplicate
duplicated_df = duplicate(
      df,
      column_to_duplicate="payload.lineItems.comments",
      duplicated_column_name="payload.lineItems.commentsDuplicate"
  )

Expr

Add or overwrite a nested field based on an expression.

from nestedfunctions.functions.expr import expr
field = "emails.unverified"
processed = expr(df, field=field, expr=f"transform({field}, x -> (upper(x)))")

Field Rename

Rename all the fields based on any rename function.

(If you only want to rename specific fields, filter on them in your rename function; see the sketch after the snippet below.)

from nestedfunctions.functions.field_rename import rename
def capitalize_field_name(field_name: str) -> str:
  return field_name.upper()
renamed_df = rename(df, rename_func=capitalize_field_name)
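
To rename only specific fields, the rename function can filter on the field names it receives (a sketch; the field names below are purely illustrative):

def rename_comments_field(field_name: str) -> str:
  # Only rename the "comments" field; every other field name is returned unchanged.
  return "remarks" if field_name == "comments" else field_name

renamed_df = rename(df, rename_func=rename_comments_field)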

Fillna

This function mimics the vanilla PySpark fillna functionality, with added support for filling nested fields. The input parameters value and subset are used exactly as in the vanilla PySpark implementation (see the PySpark documentation).

from nestedfunctions.functions.fillna import fillna
# Fill all null boolean fields with False
filled_df = fillna(df, value=False)
# Fill nested field with value
filled_df = fillna(df, subset="payload.lineItems.availability.stores.availableQuantity", value=0)
# To fill array which is null specify list of values
filled_df = fillna(df, value={"payload.comments" : ["Automatically triggered stock check"]})
# To fill elements of array that are null specify single value
filled_df = fillna(df, value={"payload.comments" : "Empty comment"})

Flattener

Return the flattened representation of the dataframe's schema.

from nestedfunctions.spark_schema.utility import SparkSchemaUtility

flattened_schema = SparkSchemaUtility().flatten_schema(df.schema)
# flattened_schema = ["root-element",
#                   "root-element-array-primitive",
#                   "root-element-array-of-structs.d1.d2",
#                   "nested-structure.n1",
#                   "nested-structure.d1.d2"]
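
The flattened schema can be used, for example, to check whether a nested field exists before transforming it (a sketch reusing the drop function shown above; the field name is illustrative):

from nestedfunctions.functions.drop import drop
from nestedfunctions.spark_schema.utility import SparkSchemaUtility

# Only drop the field if it is actually present in the schema.
if "root_column.child2" in SparkSchemaUtility().flatten_schema(df.schema):
  df = drop(df, fields_to_drop=["root_column.child2"])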

Hash

Replace a nested field by its SHA-2 hash value. By default, the number of bits in the output hash value is 256, but a different value can be set.

from nestedfunctions.functions.hash import hash_field
hashed_df = hash_field(df, "data.city.addresses.id", num_bits=256)

Nullify

Make a field null at any nested level.

from nestedfunctions.functions.nullify import nullify

nullified_df = nullify(df, field="creditCard.id")

Overwrite nested field

Overwrite a nested field based on a lambda function applied to this nested field.

from nestedfunctions.functions.terminal_operations import apply_terminal_operation
from pyspark.sql.functions import when
processed = apply_terminal_operation(
      df,
      field="payload.array.someBooleanField",
      f=lambda column, type: when(column, "Y").when(~column, "N").otherwise(""),
  )

Redact

Replace a field by the default value of its data type. The default value of a data type is typically its minimum or maximum value.

from nestedfunctions.functions.redact import redact
redacted_df = redact(df, field="customDimensions.metabolicsConditions")

Whitelist

Preserve all fields listed in the parameters; all other fields will be dropped.

from nestedfunctions.functions.whitelist import whitelist

whitelisted_df = whitelist(df, ["addresses.postalCode", "creditCard"]) 

Predicate variations of above functions

Some of the above functions, like hash, nullify and date_format, have predicate variations. For these variations you can specify a single predicate_key/predicate_value pair for which the function will be run. This is mainly handy when you only want to adapt a nested value when one of the root columns has a specific value.
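
For comparison, the same idea expressed in plain PySpark for a top-level field (a sketch with hypothetical country and email columns; the predicate variants apply this pattern to nested fields):

import pyspark.sql.functions as F

# Only hash "email" for rows where the root column "country" equals "US";
# all other rows keep their original value.
conditionally_hashed_df = df.withColumn(
  "email",
  F.when(F.col("country") == "US", F.sha2(F.col("email"), 256)).otherwise(F.col("email")),
)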

License

Apache License 2.0
