Spalah is a set of PySpark dataframe helpers

Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English

Project description

spalah

Spalah is a set of Python helpers for working with PySpark dataframes, transformations, schemas, etc.

The word "spalah" means "spark" in Ukrainian 🇺🇦

Installation

Use the package manager pip to install spalah.

pip install spalah

Examples of use

SchemaComparer

from spalah.dataframe import SchemaComparer

# df_source and df_target are existing PySpark dataframes
schema_comparer = SchemaComparer(
    source_schema=df_source.schema,
    target_schema=df_target.schema
)

schema_comparer.compare()

# The comparison results are stored in the class instance properties `matched` and `not_matched`

# Contains a list of matched columns:
schema_comparer.matched

""" output:
[MatchedColumn(name='Address.Line1',  data_type='StringType')]
"""

# Contains a list of all columns that did not match, each with a reason describing the mismatch:
schema_comparer.not_matched

""" output:
[
    NotMatchedColumn(
        name='name', 
        data_type='StringType', 
        reason="The column exists in source and target schemas but it's name is case-mismatched"
    ),
    NotMatchedColumn(
        name='ID', 
        data_type='IntegerType <=> StringType', 
        reason='The column exists in source and target schemas but it is not matched by a data type'
    ),
    NotMatchedColumn(
        name='Address.Line2', 
        data_type='StringType', 
        reason='The column exists only in the source schema'
    )
]
"""
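
The kind of comparison reported above can be illustrated without Spark. The following is a minimal sketch, not spalah's actual implementation: `compare_flat_schemas`, the tuple layout, and the shortened reason strings are hypothetical, and the two input schemas are assumed sample data chosen to reproduce the same three kinds of mismatch (data type, name case, source-only column).

```python
from typing import List, Tuple

# A flattened schema: (dotted column name, data type name) pairs.
FlatSchema = List[Tuple[str, str]]


def compare_flat_schemas(source: FlatSchema, target: FlatSchema):
    """Return (matched, not_matched) lists for two flattened schemas."""
    target_types = dict(target)
    # Lower-cased lookups are used to detect case-only name differences.
    target_lower = {name.lower() for name, _ in target}
    source_lower = {name.lower() for name, _ in source}

    matched, not_matched = [], []
    for name, data_type in source:
        if name in target_types:
            if target_types[name] == data_type:
                matched.append((name, data_type))
            else:
                both = f"{data_type} <=> {target_types[name]}"
                not_matched.append((name, both, "data type differs"))
        elif name.lower() in target_lower:
            not_matched.append((name, data_type, "name is case-mismatched"))
        else:
            not_matched.append((name, data_type, "exists only in source"))

    # Columns present in the target with no counterpart in the source.
    for name, data_type in target:
        if name.lower() not in source_lower:
            not_matched.append((name, data_type, "exists only in target"))
    return matched, not_matched


# Assumed sample schemas (hypothetical data for illustration).
source = [("ID", "IntegerType"), ("name", "StringType"),
          ("Address.Line1", "StringType"), ("Address.Line2", "StringType")]
target = [("ID", "StringType"), ("Name", "StringType"),
          ("Address.Line1", "StringType")]

matched, not_matched = compare_flat_schemas(source, target)
print(matched)
print(not_matched)
```

Working on flattened (name, type) pairs keeps the comparison a simple lookup, at the cost of ignoring nullability and metadata, which this sketch does not consider.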

flatten_schema

from spalah.dataframe import flatten_schema

# Pass the schema of the sample dataframe to get all attributes as a flat, one-dimensional list
flatten_schema(df_complex_schema.schema)

""" output:
['ID', 'Name', 'Address.Line1', 'Address.Line2']
"""


# Alternatively, the function can return data types of the attributes
flatten_schema(
    schema=df_complex_schema.schema,
    include_datatype=True
)

""" output:
[
    ('ID', 'IntegerType'),
    ('Name', 'StringType'),
    ('Address.Line1', 'StringType'),
    ('Address.Line2', 'StringType')
]
"""
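
To show what flattening means for a nested schema, here is a minimal pure-Python sketch that walks the JSON form of a schema (the same dict layout that PySpark produces with `df.schema.jsonValue()`). It is an illustration only, not spalah's implementation: `flatten_schema_json` and `sample_schema` are hypothetical names, and the JSON form uses type names such as `integer` rather than `IntegerType`.

```python
from typing import Any, Dict, List, Tuple


def flatten_schema_json(schema: Dict[str, Any], prefix: str = "") -> List[Tuple[str, str]]:
    """Walk a StructType JSON dict and return (dotted_name, type_name) pairs."""
    flat: List[Tuple[str, str]] = []
    for field in schema["fields"]:
        name = f"{prefix}{field['name']}"
        field_type = field["type"]
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            # Nested struct: recurse, using the parent name as a dotted prefix.
            flat.extend(flatten_schema_json(field_type, prefix=f"{name}."))
        else:
            # Leaf field. Other complex types (array, map) are reported by kind only.
            type_name = field_type if isinstance(field_type, str) else field_type["type"]
            flat.append((name, type_name))
    return flat


# Assumed sample schema with one nested struct column.
sample_schema = {
    "type": "struct",
    "fields": [
        {"name": "ID", "type": "integer", "nullable": False, "metadata": {}},
        {"name": "Name", "type": "string", "nullable": False, "metadata": {}},
        {
            "name": "Address",
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "Line1", "type": "string", "nullable": False, "metadata": {}},
                    {"name": "Line2", "type": "string", "nullable": False, "metadata": {}},
                ],
            },
            "nullable": False,
            "metadata": {},
        },
    ],
}

flat = flatten_schema_json(sample_schema)
print([name for name, _ in flat])
# ['ID', 'Name', 'Address.Line1', 'Address.Line2']
```

The sketch does not descend into arrays or maps of structs; how spalah treats those cases is not covered here.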

script_dataframe

from spalah.dataframe import script_dataframe

script = script_dataframe(df)

print(script)

""" output:
from pyspark.sql import Row
import datetime
from decimal import Decimal
from pyspark.sql.types import *

# Scripted data and schema:
__data = [Row(ID=1, Name='John', Address=Row(Line1='line1', Line2='line2'))]

__schema = {'type': 'struct', 'fields': [{'name': 'ID', 'type': 'integer', 'nullable': False, 'metadata': {}}, {'name': 'Name', 'type': 'string', 'nullable': False, 'metadata': {}}, {'name': 'Address', 'type': {'type': 'struct', 'fields': [{'name': 'Line1', 'type': 'string', 'nullable': False, 'metadata': {}}, {'name': 'Line2', 'type': 'string', 'nullable': False, 'metadata': {}}]}, 'nullable': False, 'metadata': {}}]}

outcome_dataframe = spark.createDataFrame(__data, StructType.fromJson(__schema))
"""

slice_dataframe

from spalah.dataframe import slice_dataframe

df = spark.sql(
    'SELECT 1 as ID, "John" AS Name, struct("line1" AS Line1, "line2" AS Line2) AS Address'
)
df.printSchema()

""" output:
root
 |-- ID: integer (nullable = false)
 |-- Name: string (nullable = false)
 |-- Address: struct (nullable = false)
 |    |-- Line1: string (nullable = false)
 |    |-- Line2: string (nullable = false)
"""

# Create a new dataframe by including and excluding root and nested attributes
df_result = slice_dataframe(
    input_dataframe=df,
    columns_to_include=["Name", "Address"],
    columns_to_exclude=["Address.Line2"]
)
df_result.printSchema()

""" output:
root
 |-- Name: string (nullable = false)
 |-- Address: struct (nullable = false)
 |    |-- Line1: string (nullable = false)
"""
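
The include/exclude selection shown above can be sketched on flattened column names alone. This is a hedged illustration of the idea, not how spalah implements `slice_dataframe`: `slice_columns` and `_covers` are hypothetical helpers, and matching is assumed to be case-sensitive.

```python
from typing import List, Optional


def _covers(selector: str, column: str) -> bool:
    """True if the selector names the column itself or one of its parent structs."""
    return column == selector or column.startswith(selector + ".")


def slice_columns(
    columns: List[str],
    include: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
) -> List[str]:
    """Keep columns covered by `include` (all if empty) and not covered by `exclude`."""
    kept = []
    for column in columns:
        included = not include or any(_covers(s, column) for s in include)
        excluded = any(_covers(s, column) for s in (exclude or []))
        if included and not excluded:
            kept.append(column)
    return kept


columns = ["ID", "Name", "Address.Line1", "Address.Line2"]
result = slice_columns(
    columns,
    include=["Name", "Address"],
    exclude=["Address.Line2"],
)
print(result)
# ['Name', 'Address.Line1']
```

Including a parent such as `Address` selects all of its nested attributes, and the exclusion list then removes individual ones, which mirrors the schema printed above.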

For more information, see the examples page and the related notebook.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT

Release history

- 1.0.6: Jan 2, 2024
- 1.0.2: May 22, 2023
- 1.0.1: May 21, 2023
- 1.0.0: May 20, 2023
- 0.5.0: Jan 22, 2023
- 0.4.1: Oct 11, 2022
- 0.4.0: Oct 2, 2022
- 0.3.1: Aug 5, 2022 (this version)
- 0.3.0: Jul 17, 2022
- 0.2.0: Jul 11, 2022
- 0.1.0: Jul 10, 2022

Download files

Source Distribution

spalah-0.3.1.tar.gz (19.7 kB)

Uploaded Aug 5, 2022 Source

Built Distribution

spalah-0.3.1-py2.py3-none-any.whl (8.0 kB)

Uploaded Aug 5, 2022 Python 2 Python 3

Hashes for spalah-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`44d056240cf2d8959c1936a34d822feb5a53226ccb410194534a8caab2b2cf53`
MD5	`1442d8d8fb7380417c850aacc53e8f6b`
BLAKE2b-256	`98895e5d9756485291b49d644eece1c98aef6fb498243074665d1e9729c50504`

Hashes for spalah-0.3.1-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`e06236687b0650f5ba668c19bc7747bf07873ce8c8198948bae1116bdcf7847c`
MD5	`141488230e4a91f1934cbdbd15702e30`
BLAKE2b-256	`5983ef8f0a8036e8d3591e827a39c981a81e1377c2fad0e197037f7a84a70d21`