Skip to main content

Spalah is a set of PySpark dataframe helpers

Project description

spalah

Spalah is a set of python helpers to deal with PySpark dataframes, transformations, schemas and Delta Tables.

The word "spalah" means "spark" in Ukrainian 🇺🇦

Installation

Use the package manager pip to install spalah.

pip install spalah

Examples of use

Spalah currently has two different groups of helpers: dataframe and datalake.

spalah.dataframe

slice_dataframe

from spalah.dataframe import slice_dataframe

df = spark.sql(
    'SELECT 1 as ID, "John" AS Name, struct("line1" AS Line1, "line2" AS Line2) AS Address'
)
df.printSchema()

""" output:
root
 |-- ID: integer (nullable = false)
 |-- Name: string (nullable = false)
 |-- Address: struct (nullable = false)
 |    |-- Line1: string (nullable = false)
 |    |-- Line2: string (nullable = false)
"""

# Create a new dataframe by cutting of root and nested attributes
df_result = slice_dataframe(
    input_dataframe=df,
    columns_to_include=["Name", "Address"],
    columns_to_exclude=["Address.Line2"]
)
df_result.printSchema()

""" output:
root
 |-- Name: string (nullable = false)
 |-- Address: struct (nullable = false)
 |    |-- Line1: string (nullable = false)
"""

Beside of nested regular structs it also supported slicing of structs in arrays, including multiple levels of nesting

flatten_schema

from spalah.dataframe import flatten_schema

# Pass the sample dataframe to get the list of all attributes as single dimension list
flatten_schema(df_complex_schema.schema)

""" output:
['ID', 'Name', 'Address.Line1', 'Address.Line2']
"""


# Alternatively, the function can return data types of the attributes
flatten_schema(
    schema=df_complex_schema.schema,
    include_datatype=True
)

""" output:
[
    ('ID', 'IntegerType'),
    ('Name', 'StringType'),
    ('Address.Line1', 'StringType'),
    ('Address.Line2', 'StringType')
]
"""

script_dataframe

from spalah.dataframe import script_dataframe

script = script_dataframe(df)

print(script)

""" output:
from pyspark.sql import Row
import datetime
from decimal import Decimal
from pyspark.sql.types import *

# Scripted data and schema:
__data = [Row(ID=1, Name='John', Address=Row(Line1='line1', Line2='line2'))]

__schema = {'type': 'struct', 'fields': [{'name': 'ID', 'type': 'integer', 'nullable': False, 'metadata': {}}, {'name': 'Name', 'type': 'string', 'nullable': False, 'metadata': {}}, {'name': 'Address', 'type': {'type': 'struct', 'fields': [{'name': 'Line1', 'type': 'string', 'nullable': False, 'metadata': {}}, {'name': 'Line2', 'type': 'string', 'nullable': False, 'metadata': {}}]}, 'nullable': False, 'metadata': {}}]}

outcome_dataframe = spark.createDataFrame(__data, StructType.fromJson(__schema))
"""

SchemaComparer

from spalah.dataframe import SchemaComparer

schema_comparer = SchemaComparer(
    source_schema = df_source.schema,
    target_schema = df_target.schema
)

schema_comparer.compare()

# The comparison results are stored in the class instance properties `matched` and `not_matched`

# Contains a list of matched columns:
schema_comparer.matched

""" output:
[MatchedColumn(name='Address.Line1',  data_type='StringType')]
"""

# Contains a list of all not matched columns with a reason as description of non-match:
schema_comparer.not_matched

""" output:
[
    NotMatchedColumn(
        name='name', 
        data_type='StringType', 
        reason="The column exists in source and target schemas but it's name is case-mismatched"
    ),
    NotMatchedColumn(
        name='ID', 
        data_type='IntegerType <=> StringType', 
        reason='The column exists in source and target schemas but it is not matched by a data type'
    ),
    NotMatchedColumn(
        name='Address.Line2', 
        data_type='StringType', 
        reason='The column exists only in the source schema'
    )
]
"""

spalah.dataset

Get delta table properties

from spalah.dataset import DeltaTableConfig

dp = DeltaTableConfig(table_path="/path/dataset")

print(dp.properties) 

# output: 
# {'delta.deletedFileRetentionDuration': 'interval 15 days'}

Set delta table properties

rom spalah.dataset import DeltaTableConfig

dp = DeltaTableConfig(table_path="/path/dataset")

dp.properties = {
    "delta.logRetentionDuration": "interval 10 days",
    "delta.deletedFileRetentionDuration": "interval 15 days"
}

and the standard output is:

2023-05-20 18:27:42,070 INFO      Applying check constraints on 'delta.`/tmp/nested_schema_dataset`':
2023-05-20 18:27:42,071 INFO      Checking if constraint 'id_is_not_null' was already set on delta.`/tmp/nested_schema_dataset`
2023-05-20 18:27:42,433 INFO      The constraint id_is_not_null has been successfully added to 'delta.`/tmp/nested_schema_dataset`

Please note that check constraints can be retrieved and set using property: .check_constraints

Check for more information in examples: dataframe, examples: datalake pages and related notebook

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spalah-1.1.4.tar.gz (13.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

spalah-1.1.4-py3-none-any.whl (13.7 kB view details)

Uploaded Python 3

File details

Details for the file spalah-1.1.4.tar.gz.

File metadata

  • Download URL: spalah-1.1.4.tar.gz
  • Upload date:
  • Size: 13.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for spalah-1.1.4.tar.gz
Algorithm Hash digest
SHA256 113e0443c0ce59526e19fe6f1b6e17a0112a305d7a6608ad1a8de4d150db7197
MD5 1a607697277634bbfd086c14ebbc8373
BLAKE2b-256 69bd899ecc132da04e779bde7f704235eecdbd6103988910baf6515e122c3063

See more details on using hashes here.

Provenance

The following attestation bundles were made for spalah-1.1.4.tar.gz:

Publisher: spalah_cd.yaml on avolok/spalah

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file spalah-1.1.4-py3-none-any.whl.

File metadata

  • Download URL: spalah-1.1.4-py3-none-any.whl
  • Upload date:
  • Size: 13.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for spalah-1.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 9f029e6d462a8f8a3573f4b5afdc2f7412cfebf8a429b9d7680cf3d6e693a73f
MD5 0e582216b7f0a8e3d7ce595cd17fb4af
BLAKE2b-256 d10cfbb298cbd7900f63e62ba4dd3c407e5f8a08e5fdb49c5c1eb36e032c531b

See more details on using hashes here.

Provenance

The following attestation bundles were made for spalah-1.1.4-py3-none-any.whl:

Publisher: spalah_cd.yaml on avolok/spalah

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page