A collection of utilities and tools for accelerating PySpark development and productivity.
Project description
Patek
A collection of reusable PySpark utility functions that make development easier!
Installation
Patek is available on PyPI and can be installed with pip:
pip install patek
Usage
IO Helpers
Patek provides a set of IO helpers to quickly read and write data from/to various sources in PySpark.
Dynamic Delta Table Writer
The superDeltaWriter function writes data to a Delta table using Delta Lake's merge (upsert) capability without requiring you to hand-write every update and merge condition. This is useful when a table has many columns and/or many update conditions.
from patek.io import superDeltaWriter
superDeltaWriter(sparkDataframe, ['key_column1'], 'delta/path', sparkSession, sparkContext, ['update_col1', 'update_col2'])
If update columns are not specified, the default is to update all non-key columns that exist in both the source and target tables. If the target table does not exist, it is created.
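Under the hood, this kind of upsert maps onto Delta Lake's merge API. The sketch below shows roughly what the equivalent merge looks like when written by hand; it illustrates the pattern rather than Patek's actual implementation, and the path, key, and update columns are the placeholder names from the example above.
from delta.tables import DeltaTable
# Hand-written equivalent of the example above (illustrative only).
target = DeltaTable.forPath(sparkSession, 'delta/path')
(
    target.alias('t')
    .merge(sparkDataframe.alias('s'), 't.key_column1 = s.key_column1')
    .whenMatchedUpdate(set={
        'update_col1': 's.update_col1',
        'update_col2': 's.update_col2',
    })
    .whenNotMatchedInsertAll()
    .execute()
)
Writing this out by hand for every table and key combination is exactly the boilerplate superDeltaWriter is meant to remove.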
Funnel.io Schema to Spark Schema
The funnelSparkler function converts a Funnel.io schema to a Spark schema. This removes ambiguity when reading Funnel.io export data into Spark dataframes, without having to define the schema manually.
from patek.io import funnelSparkler
dataframe = funnelSparkler('path/to/funnel_schema.json', 'path/to/funnel_export_data', sparkSession, sparkContext, data_file_type='csv')
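For a sense of the general shape of such a conversion, the sketch below builds a Spark StructType from a JSON field list. The JSON layout and the type names in TYPE_MAP are assumptions for illustration; the actual Funnel.io schema format may differ, which is exactly the ambiguity funnelSparkler handles for you.
import json
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType, DateType
# Hypothetical type mapping; real Funnel.io type names may differ.
TYPE_MAP = {
    'STRING': StringType(),
    'INTEGER': LongType(),
    'FLOAT': DoubleType(),
    'DATE': DateType(),
}
def funnel_schema_to_spark(schema_path):
    # Assumed layout: a JSON list of {'name': ..., 'type': ...} objects.
    with open(schema_path) as f:
        fields = json.load(f)
    return StructType([
        StructField(field['name'], TYPE_MAP.get(field['type'], StringType()), True)
        for field in fields
    ])
schema = funnel_schema_to_spark('path/to/funnel_schema.json')
dataframe = sparkSession.read.csv('path/to/funnel_export_data', header=True, schema=schema)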
Utility Functions
Patek provides a set of utility functions to help make development easier.
Determine Key Candidates
The determine_key_candidates function identifies candidate key columns for a given dataframe. This is useful when a dataframe has many columns and you want to quickly find which columns, alone or in combination, could serve as a key.
from patek.utils import determine_key_candidates
key_candidates = determine_key_candidates(sparkDataframe)
print(key_candidates)
# Output:
# single-column key candidates: ['column1', 'column2', 'column3']
# composite key candidates: [['column1', 'column2'], ['column1', 'column3']]
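For intuition, the sketch below shows one straightforward way to detect key candidates: a column (or combination of columns) qualifies when its distinct, non-null count matches the dataframe's row count. This is an illustrative brute-force approach, and full distinct counts are expensive on large dataframes; it is not necessarily the algorithm Patek uses.
from itertools import combinations
def single_column_key_candidates(df):
    # A column qualifies when its distinct, non-null count equals the row count.
    total = df.count()
    return [
        c for c in df.columns
        if df.select(c).na.drop().distinct().count() == total
    ]
def composite_key_candidates(df, width=2):
    # Check column combinations of the given width for uniqueness.
    total = df.count()
    return [
        list(combo) for combo in combinations(df.columns, width)
        if df.select(*combo).distinct().count() == total
    ]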
Clean Column Names
The column_cleaner function allows you to clean column names in a dataframe. It removes special characters and replaces spaces with underscores.
from patek.utils import column_cleaner
# input dataframe columns: ['column?? 1', 'column: 2', 'column-3']
cleaned_dataframe = column_cleaner(sparkDataframe)
# output dataframe columns: ['column_1', 'column_2', 'column_3']
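A minimal version of this kind of cleaning can be written with a couple of regular expressions, as sketched below. The exact rules Patek applies may differ; this sketch replaces spaces and hyphens with underscores and drops any remaining special characters, which reproduces the example output above.
import re
def clean_column_names(df):
    renamed = df
    for col in df.columns:
        cleaned = re.sub(r'[\s\-]+', '_', col.strip())    # spaces/hyphens -> underscores
        cleaned = re.sub(r'[^0-9a-zA-Z_]+', '', cleaned)  # drop special characters
        renamed = renamed.withColumnRenamed(col, cleaned)
    return renamed
# ['column?? 1', 'column: 2', 'column-3'] -> ['column_1', 'column_2', 'column_3']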
Project details
Download files
Source Distribution
File details
Details for the file patek-0.5.2.tar.gz.
File metadata
- Download URL: patek-0.5.2.tar.gz
- Upload date:
- Size: 9.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7f73d142fd304093bb0ec146be8d15e94fc9429fbcff50421d5699dcda7d121e
MD5 | cfb1b95dde6dab496ef886b4822f853d
BLAKE2b-256 | ef4299beea3f0586066e81f284c1c7c4ac1395b495eaaeb7db4ce60e0d1c6769
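If you download the sdist manually, you can check its integrity against the SHA256 digest above before installing; a minimal check in Python:
import hashlib
# Compare the downloaded file against the published SHA256 digest.
expected = '7f73d142fd304093bb0ec146be8d15e94fc9429fbcff50421d5699dcda7d121e'
with open('patek-0.5.2.tar.gz', 'rb') as f:
    actual = hashlib.sha256(f.read()).hexdigest()
assert actual == expected, 'hash mismatch - do not install this file'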